Text+ User Story

Indexing of Old French text corpora via modern language stages

Achim Stein (Universität Stuttgart)

DFG subject area: 104 Linguistics

Text+ data domain: Lexical Resources

Motivation 

The content-related indexing of texts in older language stages is a central concern in many philologies. In Romance studies research at the University of Stuttgart, Old French text corpora are semantically indexed in order to examine their argument structures.

Historical dictionaries in digital, semantically enriched form are necessary for this research because they provide knowledge about argument structures. The lexical-semantic word network GermaNet is used both to create the digitized dictionary and to translate the dictionary into English.

Objectives 

Digitization and enrichment of a historical dictionary in order to study the argument structures of Old French text corpora.

Solutions 

  • Extraction of senses from the OCR output of a historical dictionary. The dictionary used here is an Old French dictionary, published since 1925 (available), which contains German sense descriptions of the Old French lemmas.
  • Linking of these senses to available semantic networks via GermaNet, which in turn links them to the English WordNet and other word networks.
  • Analysis of argument structures via their modern-language representation, and mapping of these structures to Old French.
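
The first two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the entry layout, the sample entry, the regular expression, and the sense identifiers are all invented for this example, and the small dictionary only stands in for the GermaNet and WordNet lookup, whose real interfaces are not described here.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical OCR line for one dictionary entry: lemma, part-of-speech
# abbreviation, then German sense glosses separated by semicolons.
# The real entry layout of the dictionary is an assumption here.
OCR_ENTRY = "amer vb. lieben; gern haben"

# Toy stand-in for the German-to-English link provided by word networks.
# These identifiers are invented and are not real GermaNet or WordNet IDs.
TOY_GERMAN_TO_ENGLISH: Dict[str, str] = {
    "lieben": "love.v.toy1",
    "gern haben": "like.v.toy1",
}


@dataclass
class Entry:
    lemma: str
    pos: str
    senses: List[str] = field(default_factory=list)


# Assumed layout: <lemma> <pos ending in a period> <glosses>
ENTRY_PATTERN = re.compile(r"^(?P<lemma>\S+)\s+(?P<pos>\S+\.)\s+(?P<glosses>.+)$")


def parse_entry(line: str) -> Entry:
    """Split one OCR line into lemma, part of speech, and German glosses."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry layout: {line!r}")
    glosses = [g.strip() for g in match.group("glosses").split(";") if g.strip()]
    return Entry(match.group("lemma"), match.group("pos"), glosses)


def link_senses(entry: Entry, index: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each German gloss to an English sense ID, or None if unlinked."""
    return {gloss: index.get(gloss) for gloss in entry.senses}


entry = parse_entry(OCR_ENTRY)
links = link_senses(entry, TOY_GERMAN_TO_ENGLISH)
```

Glosses that find no counterpart are kept with a value of None instead of being dropped, so that unlinked senses remain visible for manual normalization.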

Challenges 

  • The complete retro-digitization of dictionaries is too expensive
  • OCR alone is too imprecise and does not allow a complete and meaningful digitization of resources
  • The normalization of the definitions in the historical dictionary is complex
  • Only the mapping of historical dictionaries to modern semantic networks allows a meaningful use of the resource

Evaluation by Community 

“Preserving Semantic Information from Old Dictionaries: Linking Senses of the Altfranzösisches Wörterbuch to WordNet”: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.374.pdf