Text+ User Story

Integration of lexical data

Sonja Bosch (University of South Africa), Dirk Goldhahn, Thomas Eckart (University of Leipzig)

DFG subject area: 104 Linguistics

Text+ data domain: Lexical Resources

Motivation

We are part of a lexicographic unit of a university institute and are taking our first steps towards e‑lexicography. We hold various lexical data sets in different, mostly indigenous languages. In our case, multilingualism means that our research covers languages from several language families, some with very different word structures, so the structure of the lexicon entries can differ significantly from one data set to the next. The starting points of the individual data sets range from analogue media (index cards, printed encyclopedias), which are being digitized now or will be soon, to encyclopedia data that is already digital in various formats. We often deal with Excel files or other legacy data in tabular form. Our specialist researchers lack IT training, which has so far been a stumbling block to adopting other approaches to data storage and processing.

Objectives 

Our first goal is to process the various data sets. For this purpose, our non-IT specialists need technical support. Quality assurance is another important aspect and will ideally be supported by technical means.

It is also very important for us to convert the data into an open format that is widely used in the community in order to create a basis for future data exchange and collaborative work. The requirements resulting from the different language characteristics must also be taken into account. It should be possible to connect and integrate the resulting resources with relevant external data sets and services. 

In the medium term, we would like to see our data integrated into an open and future-proof platform that enables data provision in accordance with FAIR principles. 

Our data sets are intended to help speakers of minority languages exercise the freedom to use their mother tongue, in accordance with the forthcoming UNESCO “International Decade of Indigenous Languages (2022–2032)”.

Solution 

Text+ provides the expertise of experienced modelers of lexical resources to assist in the analysis of existing data and the technical conversion to a new format. Modelling is done using established standard formats, such as TEI, LMF or models of the Linked Data community (such as OntoLex-Lemon). Text+ supports the necessary adaptation of these formats to preserve language-specific features.
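To illustrate what such a conversion might look like, the following minimal Python sketch turns tabular rows into TEI dictionary entries. The column names (lemma, pos, noun_class, translation_en), the three Xhosa sample rows and the function names are assumptions made purely for illustration; they do not describe an actual Text+ workflow or any of our data sets. Encoding the noun class as a typed gram element is one possible way to retain a language-specific feature, not a prescribed one.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize TEI elements without a namespace prefix.
ET.register_namespace("", TEI_NS)

# Hypothetical export of a legacy spreadsheet (illustrative rows only).
SAMPLE_CSV = """lemma,pos,noun_class,translation_en
umntu,noun,1,person
incwadi,noun,9,book
hamba,verb,,go
"""


def tei(tag):
    """Return the tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def row_to_entry(row, lang):
    """Build one TEI <entry> element from one tabular row (assumed columns)."""
    entry = ET.Element(tei("entry"))
    entry.set(f"{{{XML_NS}}}lang", lang)

    # Headword
    form = ET.SubElement(entry, tei("form"), type="lemma")
    ET.SubElement(form, tei("orth")).text = row["lemma"].strip()

    # Grammatical information
    gram_grp = ET.SubElement(entry, tei("gramGrp"))
    ET.SubElement(gram_grp, tei("pos")).text = row["pos"].strip()

    # Language-specific feature: only written when the cell is filled.
    noun_class = (row.get("noun_class") or "").strip()
    if noun_class:
        gram = ET.SubElement(gram_grp, tei("gram"), type="nounClass")
        gram.text = noun_class

    # English translation equivalent
    sense = ET.SubElement(entry, tei("sense"))
    cit = ET.SubElement(sense, tei("cit"), type="translationEquivalent")
    cit.set(f"{{{XML_NS}}}lang", "en")
    ET.SubElement(cit, tei("quote")).text = row["translation_en"].strip()
    return entry


def convert(csv_text, lang):
    """Convert CSV text into a list of TEI <entry> elements."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_entry(row, lang) for row in reader]


if __name__ == "__main__":
    for entry in convert(SAMPLE_CSV, "xh"):
        print(ET.tostring(entry, encoding="unicode"))
```

In practice the mapping from columns to elements would have to be worked out per data set together with the modelers, since the entry structure differs between languages; a sketch like this only shows that the step from a table to a standard format can be small once that mapping is agreed.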

Text+ also provides hosting solutions in which the new resources are kept long-term and made available via community-relevant interfaces and portals. Another important form of support is the integration and aggregation of these resources with those of other researchers or departments through user-friendly portals or interfaces.

Challenges 

For our part, we have already begun preparatory work in this field. Of our collaborations so far, the one with the German Vocabulary Project / the CLARIN Centre Leipzig has proved particularly successful. This work has shown that such projects can only succeed through long-term cooperation and constant support in iterative processes. For reference: Sonja Bosch, Thomas Eckart, Bettina Klimek, Dirk Goldhahn and Uwe Quasthoff: Preparation and Usage of Xhosa Lexicographical Data for a Multilingual, Federated Environment. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki (Japan), 2018.

Review by Community 

Previous work with Leipzig has shown that the feedback loop must be a central aspect of such processes in order to ensure both high data quality and appropriate preparation and accessibility. 

Accordingly, we would be very interested in using and testing the services offered by Text+ in the context of the further digital processing of our resources.