Text+ User Story

Routines and best practise for the curation/integration of research data coming from non-european affiliations

Timm Lehmberg (Universität Hamburg)

DFG subject area: 104 Linguistics

Text+ data domain: Collections

Motivation

Our approach aims at the development of routines and methods for the digitization and curation of the Tomsk Toponyme Archive and for its integration into the Text+ infrastructure (and thus correlation with object language resources provided by Text+).   

Exemplary Index Card

The Archive holds a structured large-scale collection of approximately 90.000 handwritten index cards stored in index boxes at Tomsk State Pedagogical University (TSPU). The cards hold Information on spatial entities in the Northern Eurasian Area, defined by at least a name (written in cyrilic characters), geographic coordinates (longitude and latitude) and an acronyme to define the entitiy type (i. e. р. for река → river, оз. for озеро → lake etc.), for an example see fig 1. 

Fig 1: exemplary index card 

Toponyme data of this extent is considered to be a valuable source of information on the cultural history of the region in focus. The extensive structured documentation of the naming of rivers and lakes (hydronyms), mountains and valleys (oronyms), settlements, etc. enables researchers of various disciplines to prove hypotheses concerning  the presence of language and dialect communities on a particular territory in a certain period of time and thus migration processes and structures, contact phenomena, distribution of particular cultures and many more.  

Both as a single resource and in correlation with object language data (in this exemplary case i. e. language documentation data coming from the Northern Eurasian Area) the resource provides multiple application scenarios for linguistic, typological, and anthropological studies of the region in focus.  

From a technical-methodical point of view the resource has great potential in the sense that it combines all three areas (manuscript) editions, lexical resources and collections and beyond that contains spatial data in the form of unified geocoordinates.  

After duplex scanning a selection of index cards following the DFG Practical Guidelines on Digitization at the partner site in Tomsk, OCR, data modelling and publication are intended to be performed collaboratively by both project sites (Hamburg and Tomsk).  

Objectives 

- Providing legal advice concerning intellectual property right and data protection issues for usage agreements and data transfer between European and non-European research affiliations.  

- Access to language resources provided by Text+ data centres- Providing expertise and support from data centres involved in the OCR-D1 initiative.

Solution

Contributing to open existing authentication infrastructures for non-European research affiliations.  

Review by community

- The approach will allow for an evaluation of existing and evolving ORC‑D methods.