Text+ User Story

Non-European Lexicography

Felix Rau, Gabriele Schwierzt, Nikolaus Himmelmann (Universität Köln)

DFG subject areas: 104 Linguistics

Text+ data domain: Lexical Resources

Motivation

The words used by a speech community provide extremely valuable cues to its history, its cultural practices and priorities, and its interaction with the environment it lives in. Words allow the reconstruction of major stages in the history of groups who speak genetically related languages. They also typically reflect the contact history a group has gone through, sometimes revealing points of contact that go back 1000 years or more. In addition, they provide different systematizations of flora and fauna and evidence for commonalities and differences in conceptualizing the world. Do all languages have a word for ‘give’? What are alternative ways of expressing transfer events? Do all languages have roughly the same number of verbs? Which domains of everyday experience are captured by large sets of words allowing for fine-grained distinctions, and in which domains are only rough distinctions made?

To answer such questions, one needs large databases of lexical data from many different speech communities, ideally including different layers of annotation for grammatical properties, lexical fields, social significance, etc.

As a researcher in the field of typology and linguistic diversity research, I want to improve grammatical and lexicological understanding by creating rich lexical resources with word frequency information (digital frequency dictionaries). (104–01 General and Comparative Linguistics, Typology, Non-European Languages)

There are many lexical resources in the field of language documentation, typology, and linguistic diversity research. They have been, and continue to be, compiled during linguistic fieldwork. Unfortunately, they are currently rarely made accessible online.

There is currently no option to make structurally complex lexical resources (dictionaries and lexical databases) digitally available and link them to multimodal and text corpora from the same language. This problem is particularly pressing for lexical resources compiled during fieldwork on under-studied, structurally diverse languages.

Lexical resources in the field of language documentation, typology, and linguistic diversity research are mostly bilingual (sometimes trilingual) and can have complex entry structures. The field relies on software and workflows that require resources to be in LIFT or Toolbox Lexical Database formats, but it also uses other formats from digital lexicography and NLP, such as TEI or OntoLex, as well as project-specific XML formats and CSV files.
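As an illustration of what ingesting one of these formats involves, the following minimal Python sketch reads records in a Toolbox-style standard format, in which each field is a backslash marker followed by a value. The sample entries, the function name parse_toolbox, and the simplification that every field fits on a single line are assumptions made for this example. The markers \lx, \ps and \ge follow the common MDF convention for lexeme, part of speech and English gloss.

```python
# Hypothetical sample data in Toolbox-style standard format.
# A \lx field starts a new entry; other markers may repeat within an entry.
SAMPLE = r"""
\lx kala
\ps n
\ge fish
\ge sea creature

\lx tuku
\ps v
\ge give
"""


def parse_toolbox(text):
    """Return a list of entries, each a dict mapping marker -> list of values.

    Simplified sketch: lines that do not start with a backslash (blank lines,
    continuation lines) are skipped, which real data may not tolerate.
    """
    entries = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":
            # A lexeme field opens a new entry.
            current = {"lx": [value.strip()]}
            entries.append(current)
        elif current is not None:
            # Keep repeated markers (e.g. several glosses) as a list.
            current.setdefault(marker, []).append(value.strip())
    return entries


entries = parse_toolbox(SAMPLE)
for entry in entries:
    print(entry["lx"][0], entry.get("ps", []), entry.get("ge", []))
```

Keeping every field as a list of values preserves repeated markers, which is one small instance of the complex entry structures mentioned above.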

Objectives

To fulfil this user story, the infrastructure should make it possible to publish structurally complex, diverse lexical resources sustainably. The publishing service should provide the resources via APIs.

Solution

The resources should be accessible through a REST API, a GraphQL API and, where applicable, a SPARQL API. It should be possible to reference dictionary entries by stable addresses: URLs and, ideally, PIDs. This setup facilitates linking in both directions, from dictionaries to instances of words in corpora and from corpora to dictionary entries.
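A minimal Python sketch of what stable entry addresses could look like. Everything concrete here is an assumption for illustration and not part of any Text+ specification: the host lexicon.example.org, the URL layout /resources/<resource>/entries/<id>, the corpus URL, and the in-memory STORE that stands in for a backend.

```python
from dataclasses import dataclass, field
from urllib.parse import quote, unquote, urlsplit

BASE_URL = "https://lexicon.example.org"  # hypothetical host


@dataclass
class Entry:
    entry_id: str
    headword: str
    gloss: str
    # Links from the dictionary entry to word instances in a corpus.
    corpus_refs: list = field(default_factory=list)


# Hypothetical in-memory store: resource name -> entry id -> Entry.
STORE = {
    "demo-lexicon": {
        "e0001": Entry("e0001", "kala", "fish"),
        "e0002": Entry(
            "e0002",
            "tuku",
            "give",
            ["https://corpus.example.org/texts/t01#w17"],  # hypothetical token address
        ),
    }
}


def entry_url(resource, entry_id):
    """Build the stable address of an entry from its resource and id."""
    return (
        f"{BASE_URL}/resources/{quote(resource, safe='')}"
        f"/entries/{quote(entry_id, safe='')}"
    )


def resolve(url):
    """Map a stable address back to the entry it identifies."""
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "resources" or parts[2] != "entries":
        raise ValueError(f"not an entry address: {url}")
    resource, entry_id = unquote(parts[1]), unquote(parts[3])
    return STORE[resource][entry_id]  # raises KeyError for unknown entries


url = entry_url("demo-lexicon", "e0002")
entry = resolve(url)
print(url, entry.headword, entry.corpus_refs)
```

A corpus annotation would store the entry address, and the entry stores corpus addresses, so links can be followed in either direction. A PID service could sit in front of such URLs so that references survive a change of host.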

Challenges

The structural diversity of lexical resources and the diversity of their formats need to be managed, as they pose the biggest challenges for the design of the APIs and of the data structures in the backend.
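One possible way to approach this, sketched in Python under stated assumptions: a small shared core model plus a catch-all for source-specific fields, with one field mapping per source format. The class LexicalEntry, the mappings CSV_MAP and MARKER_MAP, the column names and the sample data are all hypothetical and only illustrate the idea.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    headword: str
    pos: str = ""
    glosses: list = field(default_factory=list)
    # Fields without a place in the core model are kept verbatim.
    extra: dict = field(default_factory=dict)


# Hypothetical per-source mappings: source field name -> core field name.
CSV_MAP = {"word": "headword", "category": "pos", "english": "glosses"}
MARKER_MAP = {"lx": "headword", "ps": "pos", "ge": "glosses"}


def normalize(record, mapping):
    """Convert a record (field name -> list of values) into a LexicalEntry."""
    core = {"headword": "", "pos": "", "glosses": []}
    extra = {}
    for name, values in record.items():
        target = mapping.get(name)
        if target == "glosses":
            core["glosses"].extend(values)
        elif target is not None:
            core[target] = values[0] if values else ""
        else:
            extra[name] = values
    return LexicalEntry(extra=extra, **core)


def csv_records(text):
    """Yield CSV rows in the same record shape used by normalize()."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {name: [value] for name, value in row.items() if value}


CSV_SAMPLE = "word,category,english,village\nkala,n,fish,North\n"
marker_record = {
    "lx": ["kala"],
    "ps": ["n"],
    "ge": ["fish", "sea creature"],
    "sd": ["fauna"],
}

from_csv = [normalize(r, CSV_MAP) for r in csv_records(CSV_SAMPLE)]
from_markers = normalize(marker_record, MARKER_MAP)
print(from_csv[0])
print(from_markers)
```

The trade-off in this design is that the API can offer uniform queries over the core fields, while anything a project recorded beyond them stays available in extra without forcing a schema change.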