Text+ User Story

Annotated corpus of Low German contemporary texts

Michael Elmentaler (University of Kiel)

DFG subject area: 104 Linguistics

Text+ data domain: Collections

Motivation

Various surveys from recent years show where and how many people still speak “Platt” (Low German). However, little is known about the linguistic characteristics of modern Low German. This is because traditional dialect grammars predominantly describe an older stage of the language (ca. 1880–1960), whereas the current “Plattdeutsche Grammatik” (2017) from the SASS series is mainly normative: it specifies what is considered right or wrong, but does not describe how Low German is actually used. To investigate actual usage, we need a balanced collection of authentic Low German texts that is representative of modern language use and covers the various regions of northern Germany.

Objectives 

The compilation of a corpus of recent Low German narrative texts is intended to close this research gap. Preparatory work has been under way since April 2019 at the Niederdeutsche Abteilung (Low German Department) at the University of Kiel. So far, more than 90 Low German prose texts have been edited, all of which were submitted in 1995 to the writing competition “Vertell doch mal” [“Tell us about”] (organized by NDR, Radio Bremen and the Ohnsorg Theater in Hamburg). The stories were written by non-professional authors and are therefore particularly well suited for our purposes, as they predominantly reflect language use that is relatively close to everyday life. Since the competition manuscripts exist only in typewritten or handwritten form, they must first be transcribed and encoded as computer files to allow further processing and analysis. The original spelling is preserved so that regional features of grammar and pronunciation can also be evaluated. The texts cover the West and East Low German dialect areas in the German states of Schleswig-Holstein, Niedersachsen, Hamburg, Bremen and Mecklenburg-Vorpommern (for the geographical distribution of the texts processed so far, see the map in the file “Korpus niederdeutscher Gegenwartstexte – Karte”).

To enable computer searches, all texts are to be annotated according to the category system of the STTS tagging conventions. In this way, special features of Low German grammar can be investigated systematically and on a broad empirical basis for the first time, e.g. verb combinations such as sitten gahn “to sit down”, the periphrastic construction with doon (“to do”) (… wat he dat kopen deit “if he does buy it”) or double negation (Ik glööv dat nie nicht, literally “I never do not believe it”). Some grammatical phenomena may also show regional differences between the Low German dialects, e.g. in the choice of auxiliary verb (Ik bün lopen vs. Ik heff lopen, literally “I am run” vs. “I have run”). In addition, regionalisms in the vocabulary of the corpus, and in some cases even phonetic differences (snieden/schnieden “to cut”, wedder/weer/weller/werrer “again”), can be captured more precisely than before. Besides such regional differences, the diachronic changes that have occurred under High German influence are also of interest (e.g. Ik gah na’n Dokter > Ik gah to’n Dokter “I am going to the doctor”, conjunction wat “if” > of, preposition achter “behind” > hinner).
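To illustrate what such a search over part-of-speech annotation could look like, the following minimal Python sketch looks for candidates of the periphrastic construction with doon in a tagged sentence. It is not part of the project's actual tool chain. The data layout (a list of word/tag pairs), the function name find_doon_periphrasis, the short list of doon forms and the hand-assigned tags on the example sentence are all assumptions made for this illustration; only the tag names themselves (VVINF, VVFIN, KOUS, PPER, PDS) are taken from the STTS tag set.

```python
from typing import List, Tuple

# A token is a (word form, STTS tag) pair. This layout is an assumption
# for illustration; it is not the project's actual file format.
Token = Tuple[str, str]

# Illustrative and incomplete list of finite present-tense forms of "doon".
# Spellings vary by region and author, so a real search would need more.
DOON_FORMS = {"do", "deist", "deit", "doot"}


def find_doon_periphrasis(tokens: List[Token]) -> List[Tuple[int, str, str]]:
    """Return (index, infinitive, doon form) for each infinitive that is
    immediately followed by a finite form of "doon"."""
    hits = []
    for i in range(len(tokens) - 1):
        word, tag = tokens[i]
        next_word, next_tag = tokens[i + 1]
        is_infinitive = tag == "VVINF"
        # Accept any finite verb tag (VVFIN, VAFIN, VMFIN), because how
        # "doon" is tagged in the corpus is not specified here.
        is_finite_doon = next_tag.endswith("FIN") and next_word.lower() in DOON_FORMS
        if is_infinitive and is_finite_doon:
            hits.append((i, word, next_word))
    return hits


# Example clause from the text, tagged by hand for this sketch.
sentence = [
    ("wat", "KOUS"),
    ("he", "PPER"),
    ("dat", "PDS"),
    ("kopen", "VVINF"),
    ("deit", "VVFIN"),
]

print(find_doon_periphrasis(sentence))  # [(3, 'kopen', 'deit')]
```

Because the original spelling of the texts is preserved, a search that relies on word forms alone would miss regional variants; combining the part-of-speech tag with a list of known forms, as sketched here, is one way to narrow the candidates before manual inspection.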

Solution 

The limited resources of the Low German Department do not allow for a further expansion of the corpus. However, an expansion would be of great interest in two respects. First, it would make sense to significantly increase the number of annotated texts in order to achieve an even better regional distribution and areal density and to improve the statistical power of the corpus. Second, it would be desirable to record the current state of the (written) dialects in addition to the 1995 data (e.g. by analyzing the texts submitted to the 2020 competition) in order to be able to observe recent language change.

Since the competition is still being held with great success, suitable text material from lay writers is available in sufficient numbers. According to the organizers, about 1000 stories were submitted in 2020; a total of about 45,000 narrative texts have been archived. 

The combined use of the annotation tool EXMARaLDA and the evaluation programs CoMa and EXAKT has proven to be a valuable instrument for the preparation of the texts.