Data and Competence Centres

The Text+ data domains are organised in thematic Clusters, which will provide comprehensive coverage of research data. The Clusters will bundle all activities related to specific subtypes of data and research methods in a data domain in accordance with the needs and research priorities of their specific communities of interest. They will engage in a continuous dialogue with Humanities scholars and offer data, software, and community services for a broad range of scientific disciplines in the Humanities whose research data focus on language and text.

Initially, the following eight Clusters will focus on Ancient Cultures, Anthropology, Classics, Comparative Literary Studies, Computational Linguistics, Language and Literary Studies for European and non-European Philologies, Medieval Studies, Philosophy, and Religious Studies.

A Cluster typically comprises at least one data centre, often several, together with additional competence centres. These are presented in the following overview.

Academy of Sciences and Humanities in Hamburg (Akademie der Wissenschaften in Hamburg, AdWHH), (Hamburg Centre for) Interdisciplinarity and Linguistic Diversity in Language Data (Exploration)

Data centre in the following Clusters of the data domain Collections: Contemporary Language; Historical Texts

Since its foundation in 2004, the AdWHH has promoted interdisciplinary research into societally significant issues relating to the future and into fundamental scientific problems. Beyond that, the AdWHH currently coordinates five long-term research projects in the context of the Academies’ Programme (which is coordinated by the Union of the German Academies of Sciences and Humanities), each with a strong focus on the digital curation and analysis of unique and diverse language material. One prominent example is the DGS-Korpus project, which aims at the comprehensive collection of sign language data and its compilation in the form of the Public DGS Corpus.

In order to provide a solid base for the long-term availability of diverse linguistic resources to worldwide research communities and the interested public, the AdWHH is currently preparing a joint initiative with the Center for Sustainable Research Data Management (Zentrum für nachhaltiges Forschungsdatenmanagement, FDM).

As a central operational unit at the University of Hamburg, the FDM provides, among other things, a local technical infrastructure (including a data repository) for sustainable research data management.

Expertise and resources to be provided with the Text+ infrastructure (to be specified and discussed with the heads of the Hamburg long-term projects) will come from:

  • Beta maṣāḥǝft
    A systematic study of the Christian manuscript tradition of Ethiopia and Eritrea.
  • DGS-Korpus
    Systematically captures and documents German Sign Language (Deutsche Gebärdensprache, DGS) in all its diversity and creates an electronic dictionary based on the corpus data.
  • Etymologika
    Critical edition, translation and commentary of the Greek encyclopedia “Etymologicum Gudianum”. Research on the rich manuscript production of Greek-Byzantine etymological encyclopedias and presentation of the results in a printed and extensive digital version.
  • INEL Corpus
    Indigenous Northern Eurasian Languages (INEL): Providing language resources for indigenous languages and creating a digital research infrastructure for the use of these resources. Deeply annotated, glossed and for the most part audio-aligned corpora of the Dolgan, Kamas and Selkup languages. During the designated funding period, corpora of further languages (e.g. Evenki, Nenets) will follow.
  • Formulae – Litterae – Chartae
    Exploration and critical edition of the early medieval formulae, together with facilitated access to them via a digital research infrastructure that enables an exploration of formulaic writing in Western Europe, on the basis of letters and charters, prior to the development of the ars dictaminis.

Berlin-Brandenburg Academy of Sciences and Humanities (Berlin-Brandenburgische Akademie der Wissenschaften, BBAW)

Data centre in the following Clusters of the data domain Collections: Historical Texts (Coord.); Contemporary Language

The Research Centre Language at the BBAW hosts various text collections and special corpora, primarily documenting the (historical) German language. Among these, the German Text Archive (Deutsches Textarchiv, DTA) is the largest single corpus of New High German covering the 16th through the early 20th century, comprising more than 350 million tokens on 1.34 million digitized pages. It is well-established and widely used within the community by corpus and computational linguists, literature scholars, historians, cultural scientists, and researchers from other domains. The DTA includes thoroughly annotated full-text transcriptions of prints, newspapers and journals, as well as handwritten documents of multiple genres and text types. The transcriptions are compliant with the Text Encoding Initiative’s Extensible Markup Language (TEI-XML). External contributors can integrate additional text resources into the DTA infrastructure as DTA-Extensions (DTAE), a workflow that covers all data curation steps from capturing to annotating, rendering and publishing transcriptions and metadata. Within the CLARIAH-DE project, the whole ‘Digital Library’ from the TextGrid Repository will be integrated into the BBAW infrastructure after careful curation, amendment, necessary corrections and enhancement of the legacy data, thereby joining the two largest scientifically annotated literary corpora.

All texts are encoded following the DTA Base Format (DTABf), a strict subset of the TEI Guidelines, resulting in fully standardized and truly interoperable documents. The DTABf is recommended by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) and by CLARIN-D, and has already been re-used by more than 30 projects in Germany and abroad. A set of tools and services assists with the preparation, processing and analysis of the data, while the web-based platform Deutsches Textarchiv – Qualitätssicherung (DTAQ) supports collaborative quality assurance. DTAQ provides various search and retrieval facilities as well as data analysis and visualisation tools. Various output formats are generated for download and re-use in other contexts.
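
To illustrate how such TEI-XML transcriptions can be processed programmatically, the following minimal Python sketch extracts whitespace-normalised plain text from a TEI document using lxml; the sample markup is purely illustrative and is not an actual DTABf document.

    from lxml import etree

    TEI_NS = "http://www.tei-c.org/ns/1.0"

    # Purely illustrative fragment; real DTABf documents carry a full teiHeader.
    sample = f"""<TEI xmlns="{TEI_NS}">
      <teiHeader><fileDesc><titleStmt><title>Beispiel</title></titleStmt></fileDesc></teiHeader>
      <text><body><p>Ein kurzer Beispielsatz.</p></body></text>
    </TEI>"""

    def extract_text(tei_xml: str) -> str:
        """Return whitespace-normalised plain text of all TEI <text> elements."""
        root = etree.fromstring(tei_xml.encode("utf-8"))
        chunks = []
        for element in root.findall(f".//{{{TEI_NS}}}text"):
            chunks.append(" ".join("".join(element.itertext()).split()))
        return "\n".join(chunks)

    print(extract_text(sample))  # -> Ein kurzer Beispielsatz.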

The DTA is closely connected to and fully accessible within a shared infrastructure with the Digital Dictionary of the German Language (Digitales Wörterbuch der deutschen Sprache, DWDS), resulting in a corpus base spanning more than 500 years, from the 16th century until recent times. Additionally, integrated special corpora also hosted by the Research Centre Language cover even earlier periods, e.g. the Reference Corpus of Middle High German. As one of the coordinators of the DFG-funded Initiative for Optical Character Recognition Development (OCR-D), and as a prolific provider of Ground Truth data as well as format recommendations for the OCR process, the Research Centre Language at the BBAW has helped to build the expertise and shape the infrastructure for the pending full-text digitisation of the comprehensive VD 16, 17, 18 (Verzeichnisse der im deutschen Sprachbereich erschienenen Drucke; Indexes of prints published in the German language area) as well as 19th-century collections.

The corpus infrastructure set up at the BBAW ensures the long-term availability, persistent addressability, and versioning of the data via the CoreTrustSeal-certified CLARIN Repository. In the context of these activities, the Research Centre Language has established itself as a competence centre for historical texts and data as well as for related format specifications, standardisation activities, tools, and services. Furthermore, the BBAW has offered consultation and instruction in the context of CLARIN-D with respect to the associated tools, workflows, and procedures to more than 50 cooperation projects.

German National Library (Deutsche Nationalbibliothek, DNB)

Data centre in the following Cluster of the data domain Collections: Unstructured Text (Coord.)

The DNB is Germany’s central archival library. It collects, documents and archives all publications and sound recordings issued in Germany since 1913, together with works in the German language or relating to Germany. In accordance with its legal mandate, the DNB is building up a large, constantly growing digital collection and will integrate this into Text+ in compliance with the legal framework. This collection is in itself highly heterogeneous and ranges from contemporary German-language literature to all daily newspapers, scientific articles from German publishers, and kiosk and consumer literature. It also includes a number of special collections, such as the archive and library of the Börsenverein des Deutschen Buchhandels e.V. or the collection of the German Exile Archive 1933–1945 with the Digital Exile Press. The DNB facilitates research projects in a wide range of disciplines by providing the digital collection of 21st-century texts as flexibly as possible and by supporting projects in corpus formation.

Access to most of the objects in the DNB’s holdings is restricted due to copyright. Beyond the use of full texts, more flexible access options must therefore be developed on the basis of the legal framework. Together with the Scientific Coordination Committee, the DNB will participate in the development of a set of derived text formats, such as n-grams.
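
As a hedged illustration of such a derived format, the following Python sketch computes word n-gram frequency counts from a text; counts of this kind can be shared without redistributing a copyright-restricted full text. The tokenisation shown is deliberately simplistic.

    from collections import Counter
    import re

    def ngram_counts(text: str, n: int = 2) -> Counter:
        """Count word n-grams using a simple tokenisation (letters/digits only)."""
        tokens = re.findall(r"\w+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    counts = ngram_counts("Der Hund jagt die Katze. Die Katze jagt den Hund.")
    print(counts.most_common(3))  # most frequent bigrams with their counts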

The DNB will play an active role in the further development of techniques to link collections with other locally and thematically separated data sets from Text+ via Linked Open Data (LOD) and especially via authority files such as the Integrated Authority File (Gemeinsame Normdatei, GND) or via lexical resources. It will also further develop the GND in view of the needs of the scientific communities. Together with the Leibniz Institute for the German Language, Mannheim (IDS), the DNB will be a central hub for the vast number of legal issues arising from the use and publication of text-based data.

Leibniz Institute for the German Language, Mannheim (Leibniz-Institut für Deutsche Sprache, IDS)

Data centre in the following Clusters of the data domain Collections: Contemporary Language (Coord.); Historical Texts

The IDS is Germany’s central scientific institution for the documentation and research of the German language in its contemporary usage and recent history. For its mission of documenting, archiving, and researching the linguistic variety, structure, and use of the German language, the IDS has been developing the most important collections of contemporary German. In the area of written language, the German Reference Corpus (Das Deutsche Referenzkorpus, DeReKo) contains 46.9 billion words from many different genres, including newspapers, scientific texts and works of fiction, but also from computer-mediated communication such as chats, the Usenet and Wikipedia. In the area of spoken language, the Archive for Spoken German (Archiv für Gesprochenes Deutsch, AGD) offers 34 corpora with more than 4,000 hours of audio and audiovisual recordings, containing e.g. resources on dialects or ‘colloquial’ variation and the language of emigrants to Israel and of German-speaking minorities in Namibia or Russia. The FOLK corpus (Forschungs- und Lehrkorpus Gesprochenes Deutsch, Research and Teaching Corpus of Spoken German) provides a stratified selection of a large variety of spoken German, bringing together specific subcorpora (such as the “Wendekorpus” on German reunification, or the “GeWiss corpus” of academic speech).

The IDS constantly develops tools and interfaces that enable users to query and analyse the corpora. For spoken corpora, the Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD) is the central interface, with about 12,000 registered users; for written corpora, COSMAS II (Corpus Search, Management, and Analysis System, developed since the 1990s) is being phased out in favour of the corpus analysis platform KorAP. KorAP is optimized for large, multiply annotated corpora and complex search mechanisms and supports several query languages. The two written-corpus platforms share the same user base of over 54,000 registered users.

The IDS has been involved in CLARIN/CLARIN-D since the beginning of the project and has contributed significantly to CLARIN’s Federated Content Search and the development of the Virtual Collection Registry. It has also been active in the development of standards for collections, with involvement in the International Organization for Standardization’s working group on linguistic annotation (ISO/TC 37/SC 4/WG 6) and the Text Encoding Initiative’s (TEI) Special Interest Group on TEI for Linguists. The IDS has also been hosting the CLARIN legal help desk, developing legal and ethical standards for text collections. Moreover, it has contributed to the CLARIN Working Group on German Philology.

Ludwig-Maximilians-University (LMU) Munich, Bavarian Archive for Speech Signals (Bayerisches Archiv für Sprachsignale, BAS)

Data centre in the following Cluster of the data domain Collections: Contemporary Language

The BAS is hosted by the Institute of Phonetics and Speech Processing at LMU Munich. It was founded in 1995 with the aim of providing access to speech data and speech processing services both for speech technology development and research. Since then, it has developed into a hub of research regarding speech collections and the corresponding research infrastructure.

The BAS has its own technical infrastructure within the institute. It has close ties with the Linguistic Data Consortium (LDC), hosted by the University of Pennsylvania, and the European Language Resources Association (ELRA). It has been a member of CLARIN-D since 2010, where it covers the knowledge area of contemporary speech data. Furthermore, the BAS is a CLARIN-B centre, certified by the CoreTrustSeal, and has been actively providing speech-focused services in research infrastructures.

The resources provided by the BAS fall into three main categories:

  • a repository of speech databases
  • a suite of web-based services for speech processing
  • various stand-alone tools for data collection and analysis.

The BAS repository currently contains more than 40 collections of speech data in several languages (German, English, Japanese, Italian, etc.). These collections were either created in-house or by industrial or academic projects, e.g. Verbmobil and SmartKom. In recent years, a number of resources created by third parties have been added to the repository, e.g. the Spoken Word Corpus for Studies on Auditory Processing of Speech and Emotional Prosody (Gesprochenes Wortkorpus für Untersuchungen zur auditiven Verarbeitung von Sprache und emotionaler Prosodie, WaSeP) and the Karl-Eberhard Corpus from Tübingen. The resources provided by the BAS are unique and important for any research on spoken language in Germany and abroad.

The BAS’s best-known web service is, without doubt, WebMAUS, a multilingual aligner of text and speech. Other services include grapheme-to-phoneme conversion, pronunciation dictionaries, audio enhancement and pipeline services that provide pre-defined processing chains for speech data. The tools developed by the BAS include SpeechRecorder for scripted audio recordings, and the EMU Speech Database Management System.
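
As a hedged sketch of how such a web service is typically called over HTTP, the following Python snippet submits an audio file and its transcript for forced alignment; the endpoint URL and the parameter names are assumptions for illustration and should be verified against the official BAS web services documentation.

    import requests

    # Assumed endpoint of the BAS WebMAUS basic service; verify before use.
    MAUS_URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSBasic"

    def align(wav_path: str, txt_path: str, language: str = "deu-DE") -> str:
        """Submit audio plus transcript for forced alignment; return the raw XML response."""
        with open(wav_path, "rb") as wav, open(txt_path, "rb") as txt:
            response = requests.post(
                MAUS_URL,
                files={"SIGNAL": wav, "TEXT": txt},            # assumed field names
                data={"LANGUAGE": language, "OUTFORMAT": "TextGrid"},
            )
        response.raise_for_status()
        return response.text  # XML that points to a download link for the alignment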

Saarland University, SLUni, Department of Language Science and Technology

Data centre in the following Clusters of the data domain Collections: Contemporary Language; Historical Texts

As a competence centre, SLUni specializes in register corpora, multilingual corpora and translation corpora. In addition, SLUni maintains a CoreTrustSeal-certified CLARIN-D data centre.

The focus of the data centre is on multilingual corpora and corpus tools. So far, more than 100 data resources have been archived in the repository of SLUni. The resources can be found through the Virtual Language Observatory. In addition, a selection of the archived corpora is searchable through the Federated Content Search.

From a Text+ perspective, two diachronic English corpora are particularly noteworthy:

  • Royal Society Corpus (RSC)
    The open part of the RSC contains scientific articles from the years 1665 to 1920, which were published in the Transactions and Proceedings of the Royal Society of London. The corpus comprises 78.6 million tokens and is richly annotated at the text, sentence and token levels.
  • Old Bailey Corpus (OBC)
    The corpus is based on the proceedings of London’s central criminal court and documents spoken English from two centuries (1720 to 1913). The OBC comprises 24.4 million tokens, and its texts have been enhanced with sociobiographical and pragmatic annotations.

Due to their free license, size and wide usage in research, these data resources are particularly relevant for inclusion in Text+. Furthermore, the repository of SLUni hosts translation corpora, e.g. EuroParl-UdS and EPIC-UdS, as well as a number of Slavic resources.

Göttingen State and University Library (Niedersächsische Staats- und Universitätsbibliothek Göttingen, SUB)

Data centre in the following Cluster of the data domain Collections: Unstructured Text (Coord.)

With its current holdings of about 9 million media units, the SUB ranks among the largest libraries in Germany. Several digital text collections of the Göttingen Digitisation Centre are of particular interest for Humanities research. The SUB coordinates the digitisation project VD18 (Verzeichnis der im deutschen Sprachraum erschienenen Drucke des 18. Jahrhunderts, that is, Index of the 18th-century prints published in the German language area) and is a partner in VD17; both contain digitized prints not only in German but also in many other European languages and beyond. The VD17 and VD18 collections contain rare printed works such as literary anthologies, travel journals, chronicles, and religious or scientific documents. The focus on travel journals, scientific documents, and the press in the text collections of the SUB acts as a bridge to other discipline-relevant collections such as Americana (literature about North America), Itineraria (travel journals from the 16th to the 20th century), Antiquitates & Archaeologica, Scientific History and Scientific Journals (18th-20th century). These collections are mostly relevant for Philology, Culture and Arts, Philosophy and History, Anthropology, Religious Studies, Political Science, and Media Studies.

These text collections (over 13 million digitised pages) are available as images (TIFF, JPG, PDF) and provide valuable material for machine-learning procedures for Optical Character Recognition (OCR) and other image processing. They are partly used in the Optical Character Recognition Development (OCR-D) project, in which the SUB is a partner in cooperation with the GWDG. The plan for the coming years is to achieve machine-readable full text for the older prints as well (VD17, VD18). The collection of scientific journals (17th-21st century, available in DigiZeitschriften) consists mostly of full-text digitised material. This collection in particular contains interesting and challenging multimodal (text-image, performative) and multilingual material. For all digitised text collections, the SUB provides standardised metadata (bibliographical and structural metadata, e.g. IIIF manifests, METS).
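
The sketch below illustrates, in a hedged way, how the structural metadata of such a digitised collection can be read from a IIIF Presentation (version 2) manifest in Python; the manifest URL is a placeholder, as the actual URL patterns must be taken from the respective portal.

    import requests

    # Placeholder URL; actual manifest URLs come from the respective digital collection.
    MANIFEST_URL = "https://example.org/iiif/manifest.json"

    manifest = requests.get(MANIFEST_URL, timeout=30).json()
    print("Title:", manifest.get("label"))
    for sequence in manifest.get("sequences", []):
        canvases = sequence.get("canvases", [])
        print(len(canvases), "page images")
        for canvas in canvases[:5]:
            print(" -", canvas.get("label"))  # e.g. page or structural labels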

Apart from the Göttingen Digitisation Centre, the DARIAH-DE Coordination Office at the SUB maintains, further develops and continuously expands the TextGrid Repository, a recognized and valuable resource for Literary Studies (Philology, Comparative Literature). It is a CoreTrustSeal-certified, community-curated repository open for the ingest of new data (in different languages). The TextGrid Repository contains the Digital Library with titles of world literature by more than 600 authors, as well as other text collections in the standardised Text Encoding Initiative’s Extensible Markup Language (TEI-XML).

With regard to the TextGrid Repository, the COST Action ‘Distant Reading for European Literary History’ is considering contributing several collections of novels first published between 1840 and 1920 in at least six different European languages that have been encoded in TEI-XML.

For Indology, the Göttingen Register of Electronic Texts in Indian Languages (GRETIL) offers one of the most valuable resources for the discipline in Germany, providing full texts in Sanskrit, Pali, Prakrit, New Indo-Aryan and Dravidian languages, Old Javanese and Tibetan, freely available for download.

All these examples of digitisation, encoding and archiving of digital text collections rely on deep expertise in Information Science and Digital Humanities to build a sustainable digital infrastructure for the Arts and Humanities on a national and European level (DARIAH ERIC, CLARIAH-DE, SSHOC).

University of Duisburg-Essen (UniDUE)

Data centre in the following Cluster of the data domain Collections: Contemporary Language

The data resources of the UniDUE within Text+ include collections of spoken language as documented in manuscripts and minutes of political discourse. The hallmark corpus of the PolMine Project is a digital collection of parliamentary debates in the German Bundestag (Corpus GermaParl). It is a driver for text-based research in Political Science on policy and politics. As the language resources prepared at Duisburg are linguistically annotated and adhere to the guidelines of the Text Encoding Initiative (TEI), they are also highly relevant for research in Linguistics and Contemporary History. Currently, the data is disseminated via various long-term repositories as well as via the project’s web environment.

Complementing the collections, the UniDUE offers associated software tools. The polmineR package, implemented in the statistical programming language R and available via the Comprehensive R Archive Network (CRAN), ensures a functional and fully interoperable environment for analysing parliamentary debates. Tools to integrate the analysis of parliamentary speech, including interactive visualisations, are available from the outset and are easily adjusted to meet the requirements of individual research projects. The PolMine Project is highly active in an evolving multilingual research community on parliamentary spoken language. Members of the team are involved in European collaborations to provide parliamentary data for research in Political Science and Linguistics (Parla-CLARIN).

University of Hamburg (UniHH), Hamburg Centre for Language Corpora (HZSK)

Data centre in the following Cluster of the data domain Collections: Contemporary Language

The HZSK is located at the Institute for German Studies within the Department of Language, Literature and Media at the University of Hamburg. It provides an institutional basis for ensuring the sustainable usability of linguistic primary research data beyond the duration of individual research projects. As an association of members drawn mainly from different faculties and institutions of the University of Hamburg, the HZSK supports the consistency and coordination of computer-assisted empirical research and teaching in linguistics and adjacent disciplines at the University of Hamburg.

Over the past decade, the HZSK has served to consolidate and coordinate numerous projects based in various departments of the University of Hamburg. The research data from these projects has been curated and made available to a broad user community in the HZSK Repository. This digital research infrastructure has been developed with respect to standards and best practices of digital research and has been certified with a CoreTrustSeal.

Furthermore, the HZSK cooperates closely with the newly established Centre for Sustainable Research Data Management (Zentrum für nachhaltiges Forschungsdatenmanagement, FDM) at the University of Hamburg. The FDM has operated its repository since 2019 and will in the future also be able to take over data from the HZSK Repository. In cooperation with the FDM, the HZSK will continue to serve as a competence centre that provides advice on the coordination and curation of research data and offers training, e.g. on digital user tools. The HZSK Repository hosts more than 50 corpora, the majority belonging to the centre’s thematic scope of multilingual oral and written data and data from lesser-resourced or endangered languages. Apart from a vast number of (child) language acquisition corpora and other corpora focussing on individual aspects of multilingualism, highly relevant topics regarding the societal dimension of multilingualism are also covered, e.g. by the corpora Interpreting in Hospitals (Dolmetschen im Krankenhaus, DiK) and the Community Interpreting Database (ComInDat). Through data deposits from completed external projects in collaboration with the HZSK, the collection is steadily growing, with upcoming deposits such as corpora of multilingual communication in institutions (e.g. schools, businesses, NGOs).

University of Cologne (UniK), Data Center for the Humanities (DCH)

Data centre in the following Cluster of the data domain Collections: Contemporary Language

The DCH is a competence centre for sustainable research data management (RDM) within the Humanities. The UniK is internationally visible in the field of digital humanities/eHumanities research. As an institution of the faculty, the DCH is characterized by its close proximity to research and is actively involved in teaching in the relevant degree programmes in digital humanities and (linguistic) information processing at the University of Cologne. The DCH is a CLARIN Centre and part of the accredited distributed CLARIN Knowledge-Centre for linguistic diversity and language documentation. As the research data centre of the Faculty of Arts and Humanities, the DCH assumes responsibility for the institutional safeguarding, provision and long-term archiving of all digital resources entrusted to it. The centre provides data archiving and publication services, in particular for audio-visual data and lexical resources. The DCH cooperates closely with the computing centre of the University of Cologne and makes use of the computing centre’s infrastructure. As a research data management competence centre, the DCH has particular expertise in RDM consultation and metadata. Furthermore, the DCH puts special emphasis on language data from the Global South and on collaboration with institutions and scholars from the Global South.

The Language Archive Cologne (LAC) is a repository for audio-visual data with a focus on language recordings. The LAC is a member of the Digital Endangered Languages and Musics Archives Network (DELAMAN) and has particular expertise in data from endangered and non-European languages as well as in recordings of non-European oral literature. The LAC is integrated into the CLARIN infrastructure and conforms with the technical standards of this European research data infrastructure. The DCH has extensive experience and expertise in lexical resources for non-European languages. With Kosh, the DCH provides a generic infrastructure to publish any XML-based lexical resources via standardized Application Programming Interfaces (APIs). The Cologne South Asian Languages and Texts (C-SALT) Sanskrit dictionaries are the largest resource for the classical South Asian language Sanskrit, and the Critical Pāli Dictionary is one of the largest lexical resources for this Buddhist liturgical language. The Representational State Transfer (REST) and GraphQL APIs provided by the DCH allow the connection of the resources with text editions or linguistic corpora.
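
As a hedged illustration of how such a REST API is typically consumed, the following Python sketch looks up a dictionary entry by lemma; the URL and the query parameter are placeholders rather than the actual C-SALT or Kosh routes.

    import requests

    # Placeholder route; the actual C-SALT/Kosh endpoints differ.
    API_URL = "https://example.org/api/sanskrit-dictionary/entries"

    def lookup(lemma: str) -> list:
        """Return dictionary entries matching a lemma as parsed JSON."""
        response = requests.get(API_URL, params={"lemma": lemma}, timeout=30)
        response.raise_for_status()
        return response.json()

    # entries = lookup("dharma")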

University of Tübingen (UniTÜ)

Data centre in the following Cluster of the data domain Collections: Contemporary Language (Coord.); Historical Texts

The data resources of the UniTÜ include corpora of spoken language and written texts that are annotated at different linguistic levels, including morphology, syntax, and semantics. Such corpora are essential for data-driven research in both theoretical and computational linguistics. The annotations cover different grammatical frameworks and adhere to encoding standards widely used in the community as well as to the encoding standards of the International Organization for Standardization (ISO). These resources are housed in the TALAR data repository, which is CoreTrustSeal-certified and has developed standardized protocols for the ingest of external data resources.

The Tübingen Data and Competence Centre is home to a collection of widely-used syntactically annotated corpora, the so-called TüBa treebanks for German, English and Japanese. In addition, the Tübingen Archive of Language Resources (TALAR) includes a large number of externally developed treebanks in the Universal Dependencies framework. All linguistically annotated corpora housed by the UniTÜ can be searched and visualized with the web application Tübingen Annotated Data Retrieval Application (TüNDRA) and are also accessible via CLARIN’s Federated Content Search. In addition to linguistically annotated corpora, the UniTÜ offers data services in the form of vector-space word representations and associated software tools. Furthermore, it offers software services for the incremental annotation of external text corpora via the virtual research environment WebLicht. This tool allows, inter alia, the automatic enrichment of text corpora with named-entity recognition on the basis of deep-learning tools. WebLicht can therefore be utilized as a tool for the automatic enrichment of unstructured data and the subsequent linkage with authority data and linked open data.
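
The following Python sketch is a generic illustration, not the UniTÜ data service itself, of how vector-space word representations support similarity judgements via cosine similarity; the toy vectors stand in for embeddings that would normally be loaded from a pre-trained model.

    import numpy as np

    # Toy vectors; real word representations would be loaded from a pre-trained model.
    vectors = {
        "Hund": np.array([0.9, 0.1, 0.3]),
        "Katze": np.array([0.8, 0.2, 0.4]),
        "Haus": np.array([0.1, 0.9, 0.2]),
    }

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity of two vectors (1.0 = identical direction)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(vectors["Hund"], vectors["Katze"]))  # high: semantically close
    print(cosine(vectors["Hund"], vectors["Haus"]))   # lower: semantically distant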

Berlin-Brandenburg Academy of Sciences and Humanities (Berlin-Brandenburgische Akademie der Wissenschaften, BBAW), Centre for Digital Lexicography of the German Language (Zentrum für digitale Lexikographie der deutschen Sprache, ZDL) and Research Centre for Primary Sources of the Ancient World (RCAW, Zentrum Grundlagenforschung Alte Welt)

Data centre in the following Clusters of the data domain Lexical Resources: German Dictionaries in a European Context; Non-Latin Scripts

The ZDL at the BBAW will provide comprehensive lexical resources for German, both contemporary and historical. The resources have a uniform structure, conformant with the Text Encoding Initiative’s (TEI) standards, and are connected via a common lemma list. This list is freely available and will serve as a hub for the integration of other resources into the Text+ data domain Lexical Resources. The portfolio of the ZDL further comprises large contemporary text corpora that are linked to the lexical resources of the Digital Dictionary of the German Language (Digitales Wörterbuch der deutschen Sprache, DWDS), including large reference corpora and large web corpora. Based on these corpora, services for lexicometric statistics are provided, including timelines for lexical items and word occurrence statistics from both a synchronic and a diachronic perspective. Through its participation in the project eHumanities – Centre for Historical Lexicography (eHumanities – Zentrum für Historische Lexikographie, ZHistLex), the BBAW has developed a prototype for the integration of various language-stage dictionaries produced at other academies (Old High German, Middle High German and Early New High German). The portal and search facilities via Application Programming Interfaces (APIs) will help to make these resources available for investigations of long-term language change. The software service Cascaded Analysis Broker (CAB) for spelling normalization will enhance the interlinking of historical dictionaries as well as the interlinking of dictionaries with historical text resources. As far as copyright restrictions allow, the resources hosted in the BBAW CLARIN Centre are accessible to a larger public, including scientific users, via dwds.de and zdl.org. With more than 1 million visits per month, they are the most visited academic websites for lexical resources in Germany.
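
The idea behind such lexicometric timelines can be sketched in a few lines of Python; the example below computes the relative frequency of a lexical item per decade from (year, text) pairs and merely illustrates the principle, not the DWDS/ZDL implementation.

    from collections import defaultdict
    import re

    documents = [
        (1905, "Das Automobil war eine Seltenheit."),
        (1925, "Das Automobil und das Telephon verbreiteten sich."),
        (1995, "Das Auto ist heute allgegenwärtig."),
    ]

    def timeline(term: str, docs) -> dict:
        """Relative frequency of a term per decade, computed from (year, text) pairs."""
        hits, totals = defaultdict(int), defaultdict(int)
        for year, text in docs:
            decade = (year // 10) * 10
            tokens = re.findall(r"\w+", text.lower())
            totals[decade] += len(tokens)
            hits[decade] += tokens.count(term.lower())
        return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}

    print(timeline("Automobil", documents))  # e.g. {1900: 0.2, 1920: 0.14..., 1990: 0.0}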

The centre RCAW includes nine eminent long-term projects producing digital text data in different ancient languages and scripts. Some of them deal with Greek manuscripts, some with documentary sources from the classical world, and others with the European reception of ancient objects since early modern times.

The database Thesaurus Linguae Aegyptiae is the world’s leading data resource on (Pre-Coptic) Ancient Egyptian lexemes and transliterated texts. As a publication platform made available on the Internet by the project “Structure and Transformation in the Vocabulary of the Egyptian Language”, it provides the world’s largest electronic corpus (1.4 million tokens) of Egyptian texts annotated with translation, commentary, and metadata. It is consistently lemmatised with a comprehensive lexicon of the Egyptian language throughout its diachronic phases. The data are available under an Open Access licence and are used around the world by more than 7,500 registered users. The website, as well as an API that is yet to be implemented, will provide easier access to these data for academic users and the wider public.

Leibniz Institute for the German Language, Mannheim (Leibniz-Institut für Deutsche Sprache, IDS), Department of Lexical Studies (IDS-Lexik)

Data centre in the following Clusters of the data domain Lexical Resources: German Dictionaries in a European Context; Non-Latin Scripts

The dictionaries of the IDS are a unique resource for the academic lexicography of German, with users from all over the world. With their different thematic and content-related focuses (neologisms, discourse lexicology, foreign words, loanwords, collocations, verb valency, grammatical words, etc.), they are a necessary scientific complement to more general, less specialised offerings (Duden dictionaries, Centre for Digital Lexicography of the German Language). Furthermore, the lexicographic portals of the IDS exemplify new approaches to the visualisation and processing of lexicographic data and more experimental formats, such as lexical data combined with statistical corpus analyses on specifically delimited subject areas (e.g., neologisms). They also address current topics such as lexical change during the corona crisis. The lexical resources of the IDS are all accessible online via the portals Online Vocabulary Information System German (Online-Wortschatz-Informationssystem Deutsch, OWID, OWIDplus), Loanword Portal German (Lehnwortportal Deutsch), and Grammatical Information System (Grammatisches Informationssystem, grammis). In addition to these lexical resources, the IDS contributes to Text+ with internationally renowned expertise in research into dictionary use and in the novel (e.g., graph-based) storage and visualisation of lexical data. As of 2019, the IDS dictionary platforms were used by more than 25,000 different users.

Saxon Academy of Sciences and Humanities (Sächsische Akademie der Wissenschaften, SAW)

Data centre in the following Clusters of the data domain Lexical Resources: German Dictionaries in a European Context; Born-Digital Lexical Resources; Non-Latin Scripts

The SAW operates a variety of dictionary projects dealing with historical and contemporary lexical data. In the area of born-digital lexical data, its Leipzig Corpora Collection (LCC) is an important provider of monolingual dictionaries for hundreds of languages, focusing on statistics-based text analysis and the promotion of lesser-resourced languages. The project, which was originally set up at the University of Leipzig, is now carried on by the SAW. The resources of the SAW include historical and contemporary lexical data for different stages of the German language and a large collection of monolingual dictionaries based on publicly available text material that has been collected since the 1990s. Currently, the LCC contains more than 400 corpora and dictionaries in more than 250 languages. The data are made available via a web portal and RESTful web services (REST stands for Representational State Transfer), many of which are integrated in the CLARIN infrastructure. The LCC, together with its subproject German Vocabulary (Deutscher Wortschatz), is one of the most important online resources in the field of the lexicography of modern languages and has an impact beyond the academic field. It provides reliable text and lexicographical data for hundreds of languages, which serve as training material for established Natural Language Processing (NLP) tools such as Apache OpenNLP or as an online reference resource (e.g. in projects such as Wiktionary). The LCC sets a strong focus on improving the availability of digital resources for under-resourced languages. In collaboration with external language experts, it supports the preparation and hosting of lexical data sets in a modern research environment. The LCC is also active in the use, standardization, and adaptation of Linked Data formats for lexical resources.

University of Cologne (UniK), Data Center for the Humanities (DCH)

Data centre in the following Clusters of the data domain Lexical Resources: Born-Digital Lexical Resources; Non-Latin Scripts

The DCH in Cologne is a competence centre for sustainable research data management (RDM) within the Humanities. The DCH is a certified CLARIN centre and part of the accredited distributed CLARIN Knowledge-Centre for linguistic diversity and language documentation. The DCH has extensive experience and expertise in lexical resources for non-European and ancient languages. The Cologne South Asian Languages and Texts (C-SALT) Sanskrit dictionaries, for instance, are the largest resource for the classical South Asian language Sanskrit, and the Critical Pāli Dictionary is one of the largest lexical resources for this Buddhist liturgical language. The Representational State Transfer (REST) and GraphQL APIs provided by the DCH allow the connection of the resources with text editions or linguistic corpora. The Kosh dictionary server provides a generic infrastructure to publish any XML-based lexical resources via standardized APIs. In the course of Text+, the DCH will make these resources available and will further develop the aforementioned APIs. The main focus of the DCH in Text+ is on non-Latin scripts. However, the DCH also contributes to the Cluster Born-Digital Lexical Resources with a (multimodal) lexical database for data gathered from fieldwork studies (Language Archive Cologne).

University of Trier (UniTR), Trier Center for Digital Humanities (TCDH)

Data centre in the following Clusters of the data domain Lexical Resources: German Dictionaries in a European Context

The TCDH has more than twenty years of experience in planning, coordinating, and implementing projects in the field of full-text digitization, standardized data encoding and digital publication of dictionaries and reference works. One particular focus lies on the modelling, indexing and provision of important historical dictionaries. Numerous projects at the TCDH have produced digital data resources for the first edition and revision of the German Dictionary by Jacob and Wilhelm Grimm, the central Middle High German dictionaries, the Old High German dictionary, the dictionaries of the West Middle German regional languages and the Goethe dictionary, among others. The meta-linguistic, Text Encoding Initiative (TEI) compliant encoding of the data enables highly specific, even keyword-independent, research. The resources are interlinked with each other within the framework of the Trier Dictionary Network platform (www.woerterbuchnetz.de) by means of open application programming interfaces. The TCDH has built a dense national and international network in the field of digital lexicography and cooperates with all German academies of science. In particular, the TCDH is the only German partner in the European joint project ELEXIS (European Lexicographic Infrastructure), in the context of which an open, standards-based framework for the online publication of dictionaries and reference works is being developed.

University of Tübingen (UniTÜ)

Data centre in the following Clusters of the data domain Lexical Resources: Born-Digital Lexical Resources

The lexical resources offered by the Tübingen Data and Competence Centre are strongly connected to and interoperable with other types of lexical and textual resources represented in Text+. The valence dictionary of German verbs has been derived from large text corpora and is therefore linked with corpus data. GermaNet is a lexical database of word senses for nouns, verbs, and adjectives of contemporary German that is directly connected via an interlingual index to wordnets of more than fifty languages of the world. Besides other wordnets, GermaNet has been linked to born-digital resources such as Wikipedia and Wiktionary. Taken together, these resources provide a principled basis for assessing lexical similarity and dissimilarity. These two notions are essential for psycho- and neurolinguistic research, for topic modelling, and for semantic search in a wide range of disciplines, from Computer Science applications to research in Literary Studies such as author identification and genre classification, as well as for semantic search over dictionary data or large metadata collections. Beyond academia, GermaNet is in high demand for industrial applications. Moreover, the linking of word senses via semantic relations provides an ideal starting point for converting wordnet data to linked open data formats and for easy integration into knowledge graphs. These data formats will play a central role not only in Text+, but also in the National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI) as a whole. Accordingly, the mapping of GermaNet to linked open data and knowledge graphs will provide considerable added value for Text+ and will offer a direct data bridge to other NFDI consortia.
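
A minimal Python sketch of what such a conversion to linked open data might look like is given below, using rdflib; the namespace and property names are illustrative placeholders and not the official GermaNet vocabulary.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("https://example.org/germanet/")  # placeholder namespace
    g = Graph()
    g.bind("ex", EX)

    hund, tier = EX["synset/Hund"], EX["synset/Tier"]
    g.add((hund, EX.lexicalForm, Literal("Hund", lang="de")))
    g.add((tier, EX.lexicalForm, Literal("Tier", lang="de")))
    g.add((hund, EX.hyponymOf, tier))  # 'Hund' (dog) is a hyponym of 'Tier' (animal)

    print(g.serialize(format="turtle"))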

Academy of Sciences and Literature, Mainz (Akademie der Wissenschaften und der Literatur, Mainz, AdWMZ)

Data centre in the following Clusters of the data domain Editions: Ancient and Medieval Texts

With a strong focus on digital methods and infrastructures in the Humanities and Cultural Studies, the AdWMZ connects long-term foundational research in language and literature with scholarship in Musicology, Art History and Archaeology.

The Digital Academy (Digitale Akademie, DA), the Digital Humanities (DH) research department of the AdWMZ, is involved in a broad range of digital editions dealing with textual sources and materials from antiquity to the Avant-garde. The DA’s research activities focus on data modelling and the creation of web portals, sustainable research software engineering, current web technologies, and the application of Linked Open Data (LOD) and graph technologies to open up new analysis and re-use scenarios in textual and cultural scholarship. The AdWMZ is the hosting institution of NFDI4Culture and acts as a bridge between research communities dealing with textual and object-related editions and data publications. Moreover, the AdWMZ is one of the co-founding institutions of the DH master’s programme in Mainz and contributes to the education and training of young academic professionals with a DH professorship, regular DH lectures and an international DH summer school.

The AdWMZ additionally contributes to Text+ with numerous digital editions, text collections and software applications. Thematically, these range from large medieval and early modern text corpora such as the Regesta Imperii, the Augsburger Baumeisterbücher and Die Deutschen Inschriften to editions focusing on sources of the 19th and 20th century such as DER STURM and the Hans Kelsen Werke. The AdWMZ also hosts and curates overarching research information systems such as AGATE (a European gateway for the Academies of Sciences and Humanities), or the portal for small disciplines (Portal Kleine Fächer, together with the Johannes Gutenberg-University). Technically, the AdWMZ will provide Text+ with solutions for creating integrated edition and web portals, annotation software for graph-based digital editions and LOD applications allowing for semantic enrichment and linking of digital scholarly editions.

Berlin-Brandenburg Academy of Sciences and Humanities (Berlin-Brandenburgische Akademie der Wissenschaften, BBAW)

Data centre in the following Cluster of the data domain Editions: Ancient and Medieval Texts

For about 20 years, the Digital Humanities unit TELOTA – IT/DH (TELOTA: “The Electronic Life Of The Academy”) of the BBAW has been planning, implementing and hosting numerous digital scholarly editions. One particular focus of TELOTA lies on research software development for digital editions. In this context, the department is primarily dedicated to the contribution to standards for text encoding, the development of user-friendly tools for the creation of digital editions, Application Programming Interface (API) design, and sustainable publishing solutions.

The digital scholarly editions of the BBAW represent various disciplines such as Philology in general, Philosophy, Theology and History (especially the History of Science). Several of these editions enjoy a high national and international reputation.

Furthermore, a special focus of TELOTA is digital editions of correspondence. TELOTA is actively involved in the development and adaptation of text encoding standards based on the recommendations of the Text Encoding Initiative (TEI) and the DTA Base Format (DTABf). A core tool used by Humanities researchers at the BBAW to create digital editions is the user-friendly editing software ediarum, which has been developed by TELOTA since 2012. Another key application offered by TELOTA is correspSearch, a web service that connects scholarly editions of correspondence.
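
In a hedged sketch, querying such a correspondence metadata service from Python might look as follows; the endpoint and parameter name are assumptions for illustration and should be checked against the correspSearch API documentation.

    import requests

    # Assumed endpoint and parameter; check the correspSearch API documentation.
    API_URL = "https://correspsearch.net/api/v1.1/tei-xml.xql"

    def letters_by_correspondent(authority_uri: str) -> str:
        """Fetch correspondence metadata (TEI-XML) for a person given an authority URI."""
        response = requests.get(API_URL, params={"correspondent": authority_uri}, timeout=30)
        response.raise_for_status()
        return response.text

    # Example with a GND URI (here: Goethe):
    # tei = letters_by_correspondent("https://d-nb.info/gnd/118540238")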

Through its staff, the BBAW is actively involved in various national and international associations that concern the Text+ data domain Editions, among them the working group eHumanities of the Union of the German Academies of Sciences and Humanities, the working group Research Software Engineering of the professional association Digital Humanities in the German-speaking world (Digital Humanities im deutschsprachigen Raum, DHd), the TEI special interest group “correspondence”, and the Institute for Documentology and Scholarly Editing (IDE).

Darmstadt Cooperation (DACo): Technical University of Darmstadt, University and State Library Darmstadt, University of Applied Sciences Darmstadt

Data centre in the following Clusters of the data domain Editions: Early Modern, Modern, and Contemporary Texts

The DACo consists of three partners with a long tradition of institutional and personal cooperation in research, infrastructure development, teaching and training in the field of textual scholarship, digital editions and beyond: the Institute for Linguistics and Literary Studies and the University and State Library Darmstadt (USLDA), both at the Technical University of Darmstadt (TUDa), and the chair for Information Science/Digital Library at the University of Applied Sciences Darmstadt. They are among the founders of TextGrid and part of the DARIAH-DE and CLARIAH-DE consortia. The Darmstadt Cooperation universities have signed an agreement on close future cooperation, including cooperative dissertations, which allows the joint supervision of young researchers. The University and State Library operates the institutional repository for all research data generated at or worked with at the Technical University and founded the Centre for Digital Editions (Zentrum für digitale Editionen in Darmstadt, ZEiD) in 2019. The ZEiD runs and supports a number of digital editions, ranging from long-term projects funded by the Academies’ Programme and projects funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) to small projects by individual researchers without funding, with a special emphasis on letters. Members are involved in the Text Encoding Initiative Consortium and the project Hessian Research Data Infrastructures (Hessische Forschungsdateninfrastrukturen, HeFDI), and are engaged in the working group on eHumanities of the Union of the German Academies of Sciences and Humanities. For almost 20 years, they have been developing Master’s and Bachelor’s level programmes in which textual scholarship, research data management, data science and data literacy play a central role, and have organized numerous Digital Humanities workshops.

Herzog August Library Wolfenbüttel (Herzog August Bibliothek Wolfenbüttel, HAB)

Data centre in the following Cluster of the data domain Editions: Ancient and Medieval Texts

The HAB is a non-university study and research institution for European cultural history of the medieval and early modern periods. As a library, it holds the collections brought together by the dukes from the Wolfenbüttel line of the house of Braunschweig-Lüneburg since the 16th century. They consist of about 11,800 manuscripts, 2,700 of which are from the Middle Ages, about 400,000 books printed before 1830, more than 20,000 printed graphics and other special collections. The HAB is a centre for manuscript cataloguing funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) and is part of the project for a retrospective German national library (AG Sammlung Deutscher Drucke), in which it covers the 17th century. The HAB has been digitizing its collections on a large scale for many years. In particular, it is committed to the digitization of medieval manuscripts, to the DFG-funded mass digitization of printed books and to the digitization of graphics (Virtuelles Kupferstichkabinett). The HAB is part of important information infrastructure projects in Germany. Along with Leipzig University Library, it runs the Specialised Information Service for Book Studies, Library and Information Sciences (Fachinformationsdienst Buch‑, Bibliotheks- und Informationswissenschaft, FID BBI). It is also a partner in the DFG-funded Initiative for Optical Character Recognition Development (OCR-D).

The library collections enable all kinds of research on European cultural history of the Middle Ages and the early modern period. The HAB supports and organizes such research in many ways: by actively facilitating exchange and networking among international scholars, by providing the best possible conditions and support for their research, by offering fellowships for doctoral and post-doctoral students, by teaching at universities and by conducting its own research projects in innovative fields of medieval and early modern studies, many of which rely on third-party funding. The main areas of research are cultural translation, cultures of knowledge, the history of religion and piety, and image politics.

Research at the HAB is tightly connected to developments in Digital Humanities. The HAB has long-term experience in digital editing, hosting and developing its own editing infrastructure, the Wolfenbüttel Digital Library (Wolfenbütteler Digitale Bibliothek, WDB), which is being used for numerous projects. Outstanding examples are the DFG-funded long-term projects for the edition of the works and letters by the reformer Andreas Bodenstein von Karlstadt, the diaries of Duke Christian II of Anhalt-Bernburg and the travel and collection accounts by the art agent Philipp Hainhofer.

The HAB is also part of the Marbach Weimar Wolfenbüttel Research Association, which explores the German literary tradition through joint research projects and by developing a virtual research space for finding and analysing the digital collections of the HAB, the German Literature Archive (Deutsches Literaturarchiv Marbach), and the Klassik Stiftung Weimar. By participating in the National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI), the HAB aims to make its own expertise in the Digital Humanities, especially in digital editing, more visible and to become part of a persistent digital infrastructure that offers essential services to its researchers.

German National Academy of Sciences Leopoldina (Leopoldina)

Data centre in the following Cluster of the data domain Editions: Early Modern, Modern, and Contemporary Texts

The Centre for Science Studies (Zentrum für Wissenschaftsforschung, ZfW) at the German National Academy of Sciences Leopoldina is responsible for independent scientific research at the Academy, especially on the history and reflection of science, and maintains the necessary infrastructure for this. At the ZfW, various activities are combined to form a competence focus in the field of digital editing.

The ZfW manages several hybrid and digital edition projects in its Research Area IV: Scientific Editions and Digital Research. Typical sources are scientific correspondence and publications as well as collections and their objects, which need to be analysed and presented.

A digital sustainability concept for the ZfW and the projects it supervises was adopted in spring 2020. The Centre also organizes events in the field of digital editions. The “Winter School Digital Editions” has been held annually since 2019, in cooperation with the Institute for Documentology and Scholarly Editing. In addition, the competences of the staff involved in ZfW projects are strengthened by internal training courses. Furthermore, the ZfW is represented in various working groups related to digital editions, including the Working Group for Newspapers and Journals of the professional association Digital Humanities in the German-speaking world (Digital Humanities im deutschsprachigen Raum, DHd) and the working group Digital Data Collections and Text Corpora of the “Digital Information” Initiative of the Alliance of Science Organizations in Germany.

North Rhine-Westphalian Academy of Sciences, Humanities and the Arts (Nordrhein-Westfälische Akademie der Wissenschaften und der Künste, NRWAW)

Data centre in the following Clusters of the data domain Editions: Ancient and Medieval Texts (Coord.); Early Modern, Modern, and Contemporary Texts

The Academy, founded by the Federal State of North Rhine-Westphalia in 1970, is an association of the state‘s leading researchers and brings together all forms of creative discovery, be they scientific, scholarly or artistic. It is a member of the Union of the German Academies of Sciences and Humanities and cooperates with the Union Académique Internationale (UAI) in international research projects.

A central task of the Academy is the promotion and supervision of long-term basic research, which usually cannot be carried out in this way at universities or other research institutions. The Academy currently supervises 13 long-term research projects, many of them dealing with the textual heritage and with editions in all their aspects, ranging from ancient to modern material and from German language to non-Latin scripts. The projects include, among others, the Averroes Edition, the edition of Frankish Capitularies, the reconstruction of the Greek New Testament, the edition of Minor and Fragmentary Historians of Late Antiquity, the genetic edition of the literary works of Arthur Schnitzler and the digitisation and edition of Niklas Luhmann’s Card Index.

The Academy’s coordinating office for Digital Humanities acts as a centralised competence centre for all the Academy’s long-term projects to ensure state-of-the-art digital methods and to cover the complete project life cycle, with a special emphasis on editorial methodologies and technologies. The coordinating office is located at the Cologne Center for eHumanities and collaborates closely with the Data Center for the Humanities, both based at the University of Cologne, and is actively involved in research and teaching in the field of Digital Humanities and Information Processing at the University of Cologne. The Academy currently holds the position of spokesperson of the working group on eHumanities of the Union of the German Academies of Sciences and Humanities. The participation in the National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI) reflects the long-term strategic alignment of the Academy’s research. As co-applicant institution, the Academy is responsible for the task area Editions in Text+ and will contribute its long-standing experience in digitisation and in the creation and maintenance of a broad range of different types of editions. The coordinating office of the Academy contributes ample experience in consulting on, planning, carrying out and hosting digital research projects as well as in data management, archiving and the implementation of sustainability measures. Special emphasis is placed on the science-driven development of research infrastructures in close interaction with research and innovation in Digital Humanities and Information Science.

Salomon Ludwig Steinheim Institute for German-Jewish History (Salomon Ludwig Steinheim-Institut für deutsch-jüdische Geschichte, STI)

Data centre in the following Cluster of the data domain Editions: Ancient and Medieval Texts

The STI is an affiliated institute of the University of Duisburg-Essen and a member of the Johannes Rau Research Association (Johannes-Rau-Forschungsgemeinschaft) in North Rhine-Westphalia.

The STI researches the history and culture of Jews in the German-speaking world from the Middle Ages to the present. The dense network of relationships between Jewish and general society is examined from the perspectives of Religious and Social History, Literature, and Cultural and Linguistic Studies, especially with reference to inner-Jewish and Hebrew sources. In addition to German-Jewish History and Jewish Studies, there is an application-oriented focus on methods of Digital Humanities (DH).

The digital portfolio includes numerous editions, prosopographic and bibliographic works, collections of photographs, image archives, letters and diaries. A long-term project of the STI is the edition of Jewish epitaphs: some 37,000 Hebrew and German epitaphs from 218 Jewish cemeteries, each with transcriptions, translations, object descriptions, annotations and commentaries, have been published as digital editions (Text Encoding Initiative’s Extensible Markup Language TEI-XML, Creative Commons licensing).

The digital editions and collections of the STI are closely related to its research activities. The Institute therefore participates in the development of research platforms and infrastructure components that allow web publishing, digital annotation, retrieval, visualization, analysis and interlinking of the data. Consequently, the Institute has fundamental expertise in the processing of Hebrew (right-to-left, RTL) texts and of EpiDoc and TEI files, including with the Tübingen System of Text Processing tools (TUSTEP). It also has long-term practice in the XML transformation and query languages XSLT and XQuery (e.g. with Saxon), in retrieval platforms such as Solr, and in tools and technologies such as the eXist-db and BaseX XML databases, Apache Cocoon (an XML web development framework), and Mediawiki/Wikibase (Resource Description Framework RDF, SPARQL Protocol and RDF Query Language). In-house developments such as Epidat, the domain-specific STI Linked Data Service, the Judaica Search Engine and a bibliographical system supporting geo-references and authority control are based on these competencies, as is the cooperation with Europeana and the PEACE Portal. Against this background, the STI has experience in the requirements, implementation and use of research infrastructures in the Humanities. The STI is a long-standing member of TextGrid and the International TUSTEP User Group (ITUG). Members of the Institute are consulted on various digital edition projects (e.g. as advisory board members). Their activities include scientific contributions, blog posts, lectures on applications and methods of DH, and the organization of training courses and workshops on digital editions. The STI takes part in DH working groups and, in this context, is particularly committed to the interoperability and networking of resources based on standards and authority control; it actively participates in the opening of the Integrated Authority File GND (GND for Cultural Data) as well as in the scientific use of Wikibase in the Humanities.
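
To illustrate the kind of material the STI curates, the following minimal sketch shows how the transcription and translation of a single TEI/EpiDoc-style epitaph record could be read out in Python with lxml. It is not taken from Epidat or TUSTEP; the element structure and the sample content are purely illustrative assumptions.

```python
# Illustrative sketch only: a minimal TEI/EpiDoc-like epitaph record is parsed
# with lxml. The element structure and the sample texts are simplified
# assumptions, not the actual Epidat schema or data.
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"  # official TEI namespace

SAMPLE = f"""
<TEI xmlns="{TEI_NS}">
  <text>
    <body>
      <div type="edition" xml:lang="he">
        <ab>פה נטמן איש תם וישר</ab>
      </div>
      <div type="translation" xml:lang="de">
        <ab>Hier ist geborgen ein redlicher und aufrechter Mann</ab>
      </div>
    </body>
  </text>
</TEI>
"""

def extract_parts(xml_string: str) -> dict:
    """Return the text of each div, keyed by its type (edition, translation, ...)."""
    root = etree.fromstring(xml_string.encode("utf-8"))
    parts = {}
    for div in root.findall(".//tei:div", {"tei": TEI_NS}):
        # Collect all text content of the div and normalise whitespace.
        parts[div.get("type")] = " ".join("".join(div.itertext()).split())
    return parts

if __name__ == "__main__":
    for part, text in extract_parts(SAMPLE).items():
        print(f"{part}: {text}")
```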

Göttingen State and University Library (Niedersächsische Staats- und Universitätsbibliothek Göttingen, SUB)

Data centre in the following Clusters of the data domain Editions: Early Modern, Modern and Contemporary Texts (Coord.); Ancient and Medieval Texts

The SUB has been involved in the creation of digital editions as an information technology and information science provider for over 15 years. This includes both participation in numerous third-party-funded editorial projects and the development and provision of generic tools for the creation and publication of digital editions (TextGrid, SADE, TextAPI). Furthermore, the SUB has extensive experience in teaching digital editing skills and tool usage through training courses, workshops, and summer schools.

Digital editions developed at the SUB cover all major edition types (diplomatic, historical-critical, and genetic editions) and span a wide array of disciplines.

Some of these projects are hybrid editions that are published both as web portals and as print publications. To meet this requirement, the SUB has developed an adaptable and re-usable toolchain (bdnPrint) for the creation of prepress files from TEI-XML data.
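
The following minimal sketch illustrates the general pattern of such a print-oriented toolchain (TEI-XML in, print-ready text out); it is not bdnPrint itself, and both the toy XSLT stylesheet and the sample TEI document are assumptions made for illustration only.

```python
# Not the actual bdnPrint toolchain: just a minimal illustration of applying an
# XSLT 1.0 stylesheet to TEI-XML with lxml to produce print-oriented output.
from lxml import etree

XSLT_SHEET = b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:output method="text"/>
  <!-- Render every TEI paragraph as a normalised plain-text paragraph. -->
  <xsl:template match="tei:p">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>
  <!-- Suppress the TEI header in the print output. -->
  <xsl:template match="tei:teiHeader"/>
</xsl:stylesheet>
"""

TEI_DOC = b"""
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>First paragraph of the edited text.</p>
    <p>Second   paragraph, with  irregular whitespace.</p>
  </body></text>
</TEI>
"""

transform = etree.XSLT(etree.fromstring(XSLT_SHEET))
print(str(transform(etree.fromstring(TEI_DOC))))
```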

In addition, the SUB has consolidated its expertise in the area of digital editions (data modeling, software development, project acquisition and management) by establishing the Service Digitale Editionen, financed from in-house funds. This unit provides consulting services on a local, national, and international level. The SUB is also involved in numerous standardisation committees relevant to the creation of digital editions, such as the TEI Consortium, the IIIF Consortium, the Dublin Core Governing Board, the MODS Editorial Committee, and the CIDOC LIDO Working Group.

German National Library (Deutsche Nationalbibliothek, DNB)

The DNB is Germany’s central archival library. It collects, documents and archives all publications and sound recordings issued in Germany since 1913, together with works in the German language or relating to Germany. In accordance with its legal mandate, the DNB is building up a large, constantly growing digital collection and will integrate this into Text+ in compliance with the legal framework. This collection is itself highly heterogeneous and ranges from contemporary German-language literature and daily newspapers to scientific articles from German publishers as well as kiosk and consumer literature. It also includes a number of special collections, such as the archive and library of the Börsenverein des Deutschen Buchhandels e.V. or the collection of the German Exile Archive 1933–1945 with the Digital Exile Press. The DNB facilitates research projects in a wide range of disciplines by providing its digital collection of 21st-century texts as flexibly as possible and by supporting projects in corpus building.

The DNB will play an active role in the further development of techniques for linking collections with other locally and thematically separate data sets in Text+ via Linked Open Data (LOD), especially via authority files such as the Integrated Authority File (Gemeinsame Normdatei, GND) or via lexical resources. It will also develop the GND further in line with the needs of the research communities.
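
As a simplified illustration of GND-based linking (not the DNB’s actual tooling), the sketch below attaches a GND URI to a person mention in TEI; the GND identifier used is Goethe’s widely published one, and the URI pattern https://d-nb.info/gnd/<ID> can be dereferenced as Linked Open Data.

```python
# Minimal sketch of GND-based linking, not DNB tooling: a person mention in a
# TEI document is pointed at a GND URI, which can then be dereferenced as
# Linked Open Data. Treat the overall workflow as a simplified assumption.
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

def link_person(gnd_id: str, display_name: str) -> etree._Element:
    """Build a TEI <persName> whose @ref carries the GND URI."""
    el = etree.Element(f"{{{TEI_NS}}}persName", nsmap={None: TEI_NS})
    el.set("ref", f"https://d-nb.info/gnd/{gnd_id}")
    el.text = display_name
    return el

# 118540238 is Goethe's widely published GND identifier.
person = link_person("118540238", "Johann Wolfgang von Goethe")
print(etree.tostring(person, pretty_print=True).decode())

# The same URI can be dereferenced as RDF (e.g. with an Accept header such as
# text/turtle) to pull in further authority data; that request is omitted here.
```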

Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)

The GWDG is the data and IT service centre for the University of Göttingen and the Max Planck Society and provides a wide range of high-availability services for teaching and research. Beyond this, the GWDG is heavily involved in Humanities-related research infrastructure projects such as DARIAH-DE, CLARIAH-DE and the European Open Science Cloud (EOSC). In DARIAH-DE, the GWDG acts as technical coordinator and is responsible not only for the German user community but also for European users. This includes the provision of services such as an Authentication and Authorization Infrastructure (AAI) and Persistent Identifiers (PIDs). In addition, the GWDG offers consulting to researchers at the Göttingen Campus, for instance with regard to information security or research data management.
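
The following sketch illustrates PID resolution in general terms; it is not a GWDG-specific service but queries the public Handle.Net proxy’s REST interface, and both the example handle and the assumed response layout are given for illustration only.

```python
# Simplified sketch of PID resolution (not a GWDG-specific API): it queries the
# public Handle.Net proxy's REST interface for the records behind a handle.
# The example handle is the DOI of the DOI Handbook (10.1000/182); the exact
# response layout ("values", "type", "data") follows the proxy's documented
# JSON format and should be treated as an assumption here.
import requests

def resolve_pid(handle: str) -> list[tuple[str, object]]:
    """Return (type, value) pairs stored for a handle, e.g. its URL record."""
    resp = requests.get(f"https://hdl.handle.net/api/handles/{handle}", timeout=10)
    resp.raise_for_status()
    record = resp.json()
    return [(v["type"], v["data"]["value"]) for v in record.get("values", [])]

if __name__ == "__main__":
    for record_type, value in resolve_pid("10.1000/182"):
        print(record_type, "->", value)
```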

Leibniz Institute for the German Language, Mannheim (Leibniz-Institut für Deutsche Sprache, IDS)

The IDS in Mannheim, founded in 1964, is the leading national centre for the research and documentation of the German language in its contemporary usage and recent history. The mission of the IDS is to document, archive, and research the linguistic variety, structure, and use of the German language. Recently, the IDS and its partners initiated the Forum for the German Language (Forum deutsche Sprache). The IDS is also widely regarded as a hub of international German Linguistics and recognised as a leading centre for fundamental research. In 2019, the department of Digital Linguistics was formed, both of whose constituent programme areas were evaluated as excellent by the Leibniz Association. This new department will host Text+.

Moreover, the IDS develops practical tools, operates computational infrastructure supporting empirical research, and produces reference works (e.g. grammars and dictionaries) and digital language resources (especially large corpora and analysis software) in close contact with its designated community of linguists of German. The IDS predominantly pursues long-term projects and develops new research foci through competitively acquired third-party funds. As a link between universities and other academic partners, the IDS serves as a coordinator and supporter of long-term joint research projects such as CLARIN-D, is represented on the board of directors of the European Research Infrastructure Consortium CLARIN, and participates in international committees concerned with technology and organisation, such as the Text Encoding Initiative and the International Organization for Standardization (ISO).

The IDS contributes its experience with both fundamental research and resource and tool development, its tradition of linking these two areas from the perspective of specific research projects and research questions, and its contributions to distributed research infrastructures. As the applicant institution, the IDS will handle the Text+ budget and integrate its consortium of stakeholders. Taking the lead for the task area Administration, the IDS will be responsible for the disbursement of project funds to the co-applicant and participating institutions and will operate the Scientific Office of Text+. The IDS is also one of the central hubs in the Text+ Clusters, with two areas of specialisation.

Jülich Supercomputing Centre (JSC)

The JSC at Forschungszentrum Jülich has operated the first German supercomputing centre since 1987 and, together with the Jülich Institute for Advanced Simulation, continues the long tradition of scientific computing at Jülich. Computing time at the highest performance level is made available to researchers in Germany and Europe by means of an independent peer-review process. About 200 experts and contact persons for all aspects of supercomputing and simulation sciences work at the JSC. One focus of the JSC is the field of federated systems and data: in addition to the European open-source software UNICORE, application environments and community-specific services for distributed data and computing infrastructures are developed here together with users. This federated development approach respects the autonomy of the user groups and centres.

Saxon Academy of Sciences and Humanities (Sächsische Akademie der Wissenschaften, SAW)

The SAW is responsible for more than 20 ongoing long-term research projects in the Humanities and is strongly engaged in providing services and support for the Humanities in the use of digital resources and tools. From March 2021, the services presently provided by the CLARIN-D and CLARIAH-DE team at the Computer Science Department of the University of Leipzig will be continued sustainably at the SAW. The SAW will thereby take over long-term experience in developing technical infrastructures for the Humanities and will continue the work package for the coordination of technical development in CLARIAH-DE. It will especially contribute its expertise in the areas of search and retrieval in distributed environments, metadata infrastructure and semantic web technology, as well as quality assurance of services and data, to the Text+ infrastructure.

Göttingen State and University Library (Niedersächsische Staats- und Universitätsbibliothek Göttingen, SUB)

The SUB is one of the largest libraries in Germany and a leader in the development of digital libraries. It hosts several digital collections of substantial importance as resources for research in Text+, which are provided by the Göttingen Digitisation Centre. Together with the German National Library (DNB), the SUB manages the specialist department Library Data of the German Digital Library and coordinates the activities of the DINI-AG KIM. It is the coordinator of DARIAH-DE, a member of the National Coordinators Committee of DARIAH-ERIC, and coordinates CLARIAH-DE together with UniTÜ. In cooperation with DataCite, the SUB provides a DOI service for the Humanities that has already registered over 40,000 data sets, as well as local, national, and international support for the creation of digital editions through an in-house unit (Service Digitale Editionen). On the international level, the SUB is the scientific coordinator of OpenAIRE, a partner in the European plug-in to the Research Data Alliance, and a partner in the EOSC project SSHOC (Social Sciences and Humanities Open Cloud).

In Infrastructure/Operations, the SUB will focus on community services and cross-cutting topics. In particular, it will contribute to the metadata infrastructure in order to increase the interoperability and re-usability of the data in Text+. The SUB is part of numerous standardisation committees, such as the Text Encoding Initiative (TEI) Consortium, the Dublin Core Governing Board, the Metadata Object Description Schema (MODS/MADS) Editorial Committee, the International Image Interoperability Framework (IIIF) Consortium, the CIDOC Conceptual Reference Model SIG, and the LIDO (Lightweight Information Describing Objects) Working Group. The SUB is significantly involved in the development and advancement of various metadata standards, for instance through its involvement in the specification of the METS/MODS Application Profile for Digitised Prints, which is the de facto description standard for material digitised in German libraries.
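
As a minimal illustration (not the full Application Profile), the sketch below reads the descriptive MODS metadata embedded in a simplified METS record; the sample record and the extracted fields are assumptions for demonstration purposes.

```python
# Minimal sketch, not the full METS/MODS Application Profile: it shows how the
# descriptive MODS section embedded in a METS file can be read out with lxml.
# The sample record and the extracted fields are simplified assumptions.
from lxml import etree

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

SAMPLE_METS = b"""
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="dmd001">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Example digitised print</mods:title></mods:titleInfo>
          <mods:originInfo><mods:dateIssued>1788</mods:dateIssued></mods:originInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>
"""

root = etree.fromstring(SAMPLE_METS)
title = root.findtext(".//mods:titleInfo/mods:title", namespaces=NS)
date = root.findtext(".//mods:originInfo/mods:dateIssued", namespaces=NS)
print(f"title: {title}, dateIssued: {date}")
```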

Technical University of Dresden, Centre for Information Services and High Performance Computing (TUDD)

The Centre for Information Services and High Performance Computing provides expertise and resources in the Infrastructure/Operations task area of Text+. As part of this, it offers access to the Data Analytics infrastructure of HRSK-II/HPC-DA.

University of Bamberg (Otto-Friedrich-Universität Bamberg, UniBA)

The focus areas of the Media Informatics Group at the University of Bamberg are information retrieval, data management and Digital Humanities research infrastructures. The group has been participating in DARIAH-DE since 2011 and is a partner in CLARIAH-DE. Within DARIAH-DE and CLARIAH-DE, and through the implementation of funded and unfunded application scenarios (e.g. with the research association Marbach Weimar Wolfenbüttel and the Germanisches Nationalmuseum), the group has implemented the DARIAH-DE Data Federation Architecture (DFA), which serves as a key enabler for the interoperability and findability of research data. As the primary DFA components for establishing interoperability between heterogeneous data sources, the Data Modeling Environment (DME) and the Generic Search built on it will be the starting point for corresponding applications, adaptations and further developments in the context of Text+.

A list of the institutions and their abbreviations is available as a PDF.