Text+ User Story

Flensburg Corpus of Low German Literature / Dialect Literature Corpus

Robert Langhanke (Europa-Universität Flensburg)

DFG subject areas: 104 Linguistics, 105 Literary Studies

Text+ data domain: Collections

Motivation

For quite some time I have been working on the rich information content of older and newer dialect literature. Beyond its literary and thematic interest, it offers an informative linguistic data basis that documents regionally distinctive grammar and lexis in a different way than standard-oriented literature does. The source-critical limitations that follow from written transmission, which can document at most reflexes of dialectal orality, do not devalue the material. Rather, it should be understood as a conscious linguistic shaping of the dialect which, in my view, stands in little or no opposition to the reality of the spoken language, to which it is committed in theory and practice. Independently of the necessary distinction from purely linguistic sources, written renderings of dialect always constitute a substantial body of data in their own right, one that gains particular weight from its high degree of language awareness.

Special attention must be paid to data from the period between the establishment of dialect literature at the end of the 18th century and the first half of the 20th century. For this period, audio material is either unavailable or available only in isolated cases, which makes it essential to compile as broad a range as possible of written source genres. In addition, the corresponding literary material is generally no longer subject to rights restrictions.

The description of historical grammatical and lexical structures therefore depends on written sources, which come in many different text types, each relevant in its own way to particular questions. Several corpora already provide digitally processed older text material. For example, the individual philologies and historical linguistics maintain databases in which text types from older language stages have been recorded using complex database systems. Their focus is on preparing texts, with the support of annotation, for grammatical questions. Dialect literature has so far hardly been recorded in this way, a gap that the Flensburg Corpus of Low German Literature / Dialect Literature Corpus addresses.

Objectives

Smaller and larger corpora of older dialect literary texts already form the research basis of individual studies. Numerous studies since the 19th century have evaluated and processed dialect literary source material, and dialect dictionaries also draw on this data. However, more widely accessible corpora have not been created, and the material has usually not been processed digitally. I would like to take this as my starting point, following the standards of larger corpora for older language stages. The result is a dynamically expandable text corpus that is open to dialectological and other linguistic questions as well as to questions from literary studies. The usability of the text data depends on their digital preparation and technical interoperability. Involving heterogeneous user groups seems to me to be of primary importance.

A particular challenge is the basic preparation of the corpus, i.e. the creation of editable digital versions of the selected texts, since even partially automated processes require constant correction. 

I regard opening up Low German literature as a first step. Subsequently and in parallel, further areas of dialect literature are to be developed as a corpus basis, since the linguistic, literary-historical and literary value of the data is to be determined and made accessible to the same extent for every dialect region. The regions differ clearly in the volume of text available, and this text is also bibliographically documented to varying degrees. Questions of linguistic and literary text quality are not decisive in the first step, however; they belong to the source criticism that is always necessary and that must be added later. Even literary renderings of a dialect that are open to criticism contain specific information.

I thus conceive of the corpus as a dynamic platform on which questions from linguistics, literary studies and cultural studies, as well as from literary, cultural and language history, can be posed. From a regional perspective, corresponding subcorpora, such as the “Schleswig-Holstein Corpus”, will be created.

Solutions 

Two conditions must be met when preparing the corpus of dialect literature. On the one hand, it must be suitable for different research questions and must support different methods of analysis. On the other hand, contributing to it must be made possible through barrier-free access in order to create a broad base of contributors, since the Flensburg Corpus of Low German Literature / Dialect Literature Corpus draws on the potential of citizen science and crowdsourcing.

In order to draw on the wider public interest in the subject and to harness the commitment to Low German and other dialectal language forms and their texts for the corpus project, I envision barrier-free input forms that allow text to be entered and at the same time give other users the option of correcting it. The path from text input and correction to publishing the text and making it available for further editing must be transparent, and partial results should already be visible along the way.

A suitable input form that continuously documents progress and prepares the texts for further processing can quickly give contributors a sense of achievement and at the same time give the research community the opportunity to intervene in the process by evaluating and correcting. The shared basis of a successively expanded and reliably documented corpus enables a variety of subsequent uses as well as progressive optimization of the data and its adaptation to changing standards of data presentation.
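The path from text input through correction to publication can be sketched as a small data model. The following Python sketch is purely illustrative and not part of the project's actual infrastructure: the class names `Status`, `Revision` and `CorpusText`, the three status values, and the sample title and editor names are all assumptions made for this example. It shows how each contribution could be stored as a separate revision, so that progress stays documented and partial results remain visible.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical workflow stages; the project does not prescribe these names.
    TRANSCRIBED = "transcribed"
    CORRECTED = "corrected"
    PUBLISHED = "published"


@dataclass(frozen=True)
class Revision:
    editor: str
    text: str
    status: Status


@dataclass
class CorpusText:
    title: str
    revisions: list[Revision] = field(default_factory=list)

    def submit(self, editor: str, text: str) -> None:
        """Record a first transcription entered through the input form."""
        self.revisions.append(Revision(editor, text, Status.TRANSCRIBED))

    def correct(self, editor: str, text: str) -> None:
        """Add a corrected version without overwriting earlier ones."""
        if not self.revisions:
            raise ValueError("nothing to correct yet")
        self.revisions.append(Revision(editor, text, Status.CORRECTED))

    def publish(self, editor: str) -> None:
        """Mark the latest version as published; the text itself is unchanged."""
        if not self.revisions:
            raise ValueError("nothing to publish yet")
        latest = self.revisions[-1]
        self.revisions.append(Revision(editor, latest.text, Status.PUBLISHED))

    def history(self) -> list[str]:
        """Return a human-readable log of who did what, in order."""
        return [f"{r.status.value} by {r.editor}" for r in self.revisions]


# Illustrative usage with invented placeholder data.
doc = CorpusText(title="Sample text")
doc.submit("volunteer_a", "first transcripton")
doc.correct("volunteer_b", "first transcription")
doc.publish("editor_c")
```

Because every step appends a revision instead of replacing the previous one, the research community could inspect and comment on intermediate states at any time.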

Challenges 

The corpus project outlined above faces a great number of challenges. The intended work process, which involves numerous contributors from heterogeneous working contexts, can only succeed if sufficient processing time and correction procedures are planned for. Identifying suitable and legally available text material, transcribing it, and processing it further by adding basic and supplementary information requires precisely documented and repeatedly checked work according to an established pattern.

The original texts render the dialect in heterogeneous orthographies. These spellings must be preserved at all costs, and must be brought together for subsequent research steps by means of additional annotated and normalized versions.
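One way to keep the original spelling while still allowing searches across spelling variants is to store a normalized form alongside each original token. The Python sketch below is only an illustration under stated assumptions: the `Token` class, the `find` function and the sample spellings are invented for this example and are not taken from the corpus or from any agreed normalization scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    original: str    # spelling exactly as printed in the source, never altered
    normalized: str  # hypothetical harmonized form, used only for retrieval


def find(tokens: list[Token], query: str) -> list[int]:
    """Return the positions of all tokens whose normalized form matches."""
    return [i for i, tok in enumerate(tokens) if tok.normalized == query]


# Illustrative data: two spelling variants mapped to one normalized form.
tokens = [
    Token(original="Hus", normalized="huus"),
    Token(original="un", normalized="un"),
    Token(original="Huus", normalized="huus"),
]

hits = find(tokens, "huus")
originals = [tokens[i].original for i in hits]
```

The search runs on the normalized layer, while the result still reports the spellings as they appear in the source, so the original orthography remains untouched.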

Review by Community 

Use of the corpus data will have to show whether the information provided in the first step is sufficient for meaningful analyses of the material. The depth of analysis recorded within the corpus structure is to be expanded step by step. It also depends on the cooperation of the research community, from student seminar initiatives to larger projects, whose work and results should be documented and marked in the corpus itself so that subsequent uses can build on them. This form of productive and corrective use and evaluation could prove particularly dynamic and beneficial for the corpus project.