Data from the Community

Offers for research data for integration within the context of Text+

Open call: Direct link to submit offers of research data (online form in German)

Together with you, we would like to explore the wealth of research data and possible applications that the humanities and other disciplines have to offer, and to feed them productively into the NFDI initiative Text+. The diversity and volume of research data, together with a wide range of research questions, are precisely what will give the research data infrastructure a stable and multi-faceted basis.

Within the Text+ initiative, three data domains are distinguished: Collections, Lexical Resources and Editions. In these data domains, you can enrich the initiative with your offers of research data, research questions and research tasks.

The humanities community in all its diversity is the most important reference point for Text+ and we would like to represent it with your help. We would therefore like to invite you to participate and to help shape the NFDI.

Your research data can complement the Text+ offer and can be included in the pool of representative data for Text+. The governance of Text+ provides that the scientific coordination and further development of the three data domains lie exclusively in the hands of expert researchers. This further development will be based on the data pool. To obtain a structured overview of the data holdings relevant to the community, we have created a questionnaire.

Feedback submitted by August 16, 2020 could be taken into account in preparing the application. Later submissions are still possible via the online form. If you have further questions, please contact office[at]text-plus.org.


Results of the Data Call

Below are some of the data offers that the community proposed for integration into the Text+ infrastructure.

Tailored Corpora and Topic Models for Japanese Parliamentary Minutes (provisional title)

Institution(s): German Institute for Japanese Studies
Data Domain(s): Collections, Lexical Resources, Software Services
Language(s): Japanese
Modality/ies: written
Disciplines (DFG): 102, 104, 106, 111
Disciplines (descriptive): Japanese Studies, East Asian Studies, History (especially Conceptual History), Political Science, Linguistics

The Japanese Diet has made all speeches since 1947 public through an XML interface. The enormous number of texts allows for tracing the evolution of concepts (if understood as corresponding to Japanese words) in discourse if methods of distant reading are employed. At the German Institute for Japanese Studies (DIJ), such analyses are carried out using LDA topic models. In addition to the models, systematic analysis also uses metadata that is provided through the public interface (like date, committee, name of the speaker) and custom data structures based on this (like networks).

Soldatenbriefe des 18. und 19. Jahrhunderts

Institution(s): Justus Liebig University Giessen, Institute for German Studies
Data Domain(s): Editions, Collections
Language(s): German (differentiated into: North German, North Upper German, East Central German, East Upper German, West Central German, West Upper German)
Modality/ies: written
Disciplines (DFG): 101–02, 102, 104, 105, 106, 108
Disciplines (descriptive): Linguistics, Literary Studies, Cultural Studies, Edition Philology, History, Military History, (Contemporary) Political History; (potentially needs processing) Corpus Linguistics, Computational Linguistics

170 “soldiers’ letters” from the years 1745 to 1872. A small number come from older but reliable editions (no longer protected by copyright); most stem from our own archival work and were transcribed from the manuscripts and published for the first time by the editor, Marko Neumann. They are available on the website of Universitätsverlag Winter, Heidelberg, and can be downloaded there free of charge. The corpus is extremely valuable from both a linguistic and a historical perspective, in particular for cultural, literary and military history.

For the hurdles (from a legal point of view) and difficulties (with a view to the form of publication and especially the data format) for the re-use of this valuable data, cf. the user story belonging to this data offer “Soldiers letters of the 18th and 19th centuries: From the PDF edition to reusable, interoperable research data”.

Erschließung der Korrespondenz der Constance de Salm (1767–1845)

Institution(s): German Historical Institute Paris
Data Domain(s): Editions
Language(s): German
Modality/ies: written
Disciplines (DFG): 102, 104, 105
Disciplines (descriptive): History, Linguistics, Literary Studies

The resource contains the metadata with which the correspondence of Constance de Salm (around 11,000 letters) was indexed in terms of content and form in a long-term project of the German Historical Institute Paris. In addition to information on recipients and senders (such as person, place and date), the content is indexed using keywords.

JudaicaLink

Institution(s): Hochschule der Medien, Stuttgart
Data Domain(s): Lexical Resources, Software Services
Language(s): German, English, Hebrew, Russian
Modality/ies: written
Disciplines (DFG): 102, 106, 107
Disciplines (descriptive): Jewish Studies, Theology, History

JudaicaLink is an RDF-based knowledge graph that integrates data from different sources in the field of Jewish studies. It mainly includes lexicons and biographies, with links to the GND, DBpedia/Wikipedia, Wikidata and many other data sources. The data is used in the Specialized Information Service Jewish Studies to create cross-links in metadata and, currently, also in full texts.

LTA – Latin Text Archive (with FLL Frankfurt Latin Lexicon)

Institution(s): Goethe University Frankfurt am Main
Data Domain(s): Editions, Collections, Lexical Resources
Language(s): Latin
Modality/ies: written
Disciplines (DFG): 101, 102, 104, 105, 107, 113
Disciplines (descriptive): Theology, History, Historical Cultural Studies, Legal Studies, Romance Studies, Medieval Latin Philology, Linguistics

LTA is a freely accessible, web-based analytical archive of (mostly) high-quality critical editions of Latin texts, annotated with complex metadata, fully lemmatized, and linked to a full-form lexicon (FLL). It provides diachronically organized reference corpora arranged by text genre, as well as text mining tools for building and analysing individual corpora. The size and diversity of the content allow researchers to build reliable diachronic corpora (e.g. from only one text type) for analysis. It covers text production in Latin-speaking Europe from 400 to 1500 and will be extended continuously. Technically, LTA is based on DTA (Deutsches Textarchiv).

Medizinische Gutachten des 17. und 18. Jahrhunderts

Institution(s): Katholische Universität Eichstätt-Ingolstadt
Data Domain(s): Collections
Language(s): German with Latin and Greek insertions
Modality/ies: written
Disciplines (DFG): 102, 103
Disciplines (descriptive): Linguistics (especially Historical Linguistics, Text Linguistics, Special Language Research), History of Medicine, History of Science, Legal History, Cultural History

The resource is a text corpus containing 150 transcribed medical reports from the 17th and 18th centuries from printed medical casebooks. The texts are available as plain text files, have rudimentary annotations (line break and page break), but only brief bibliographical information (no TEI-compliant header!).

Dokumente und Materialien zur ostmitteleuropäischen Geschichte

Institution(s): Herder Institute for Historical Research on East Central Europe – Institute of the Leibniz Association
Data Domain(s): Editions
Language(s): multilingual
Modality/ies: written
Disciplines (DFG): 102
Disciplines (descriptive): Primarily: University Teaching (lecturers, students of historical sciences); secondarily: interested public

The digital edition offers thematic modules for university teaching on East Central European history in its full temporal depth and spatial breadth, with modules ranging from medieval to contemporary history. All text sources, as well as other materials such as statistics, are offered in the respective original language, in German translation and, where possible, as a scanned image of the original source to ensure citability. Further materials such as maps, images, a selective bibliography with literature in Western languages and a chronology are also provided for orientation. All modules are subject to a double-blind peer review process. The offer is constantly being expanded and revised, and an English-language version is currently being developed.

New Testament Virtual Manuscript Room (NTVMR) – ECM digital

Institution(s): Westfälische Wilhelms-Universität Münster, Institute for New Testament Textual Research
Data Domain(s): Editions, Collections
Language(s): Greek, German, English
Modality/ies: written
Disciplines (DFG): 102, 104, 105, 107
Disciplines (descriptive): Philologies, Theology, Edition Philology, Papyrology

In the Virtual Manuscript Room (VMR), the conventional short list of Greek manuscripts is supplemented by all the information available on the individual manuscripts. Above all, as far as the owning institutions agree, photos of the manuscripts are made available via an appropriate website. For this purpose, the microfilm holdings of the INTF are scanned (in some cases new photos are also obtained) and the content is indexed in such a way that they are linked to the transcripts of the manuscripts which were created in the INTF. There are also links pointing to photos and information elsewhere on the internet. The VMR has meanwhile been expanded into an interactive edition platform and serves as a working basis for the Editio Critica Maior (ECM) of the Greek New Testament created in the INTF, but can also be used for other text-critical editions of handwritten works. At the moment the ECM and VMR are growing together into an interactive critical edition of the New Testament.

Niedersorbische Textkorpora

Institution(s): Sorbian Institute, Bautzen
Data Domain(s): Collections
Language(s): Lower Sorbian
Modality/ies: written
Disciplines (DFG): 101–02, 102, 104, 105, 106, 108
Disciplines (descriptive): Linguistics, History, Cultural Studies, Computational Linguistics, Digital Humanities

We distinguish an “old” text corpus and a “new” one that is still under construction. The two corpora are accessed in different ways. The data basis for the new corpus currently (2020) comprises around 43 million tokens. Its texts are being annotated step by step (including normalization and lemmatization). Searching it does not require in-depth knowledge of historical spelling and the variety of forms, but the search currently covers only a small number of texts. The old text corpus comprises more than 23 million tokens, of which around 15 million are available online. Its texts are not annotated and barely processed; the old corpus provides only the original spelling. In addition, the texts have not been corrected, so (copying) errors are to be expected.

APWCF, APWCD: Linguistisches Korpus der Acta Pacis Westphalicae, französische und deutsche Korrespondenzen

Institution(s): University of Potsdam, Chair for Romance Linguistics (French and Italian)
Data Domain(s): Collections
Language(s): Predominantly German and French, Italian, partly Latin, numerous insertions in other languages
Modality/ies: written
Disciplines (DFG): 102, 103, 104
Disciplines (descriptive): Linguistics, History, Legal History, History of International Relations, Cultural History

The resource is a linguistic corpus based on the digital edition of Acta Pacis Westphalicae (APW). For this re-use and the non-commercial publication of the data, the institutions holding the respective rights have given written permission. From a linguistic point of view, this resource, originally prepared for historiography, is extremely valuable: the German, French, often multilingual technical or informal text types represent different registers of an important phase of language change. The edition criteria show minimal interference with the original text. The following resources are available in the corpus so far (marking of metadata, separation of text data):

  • French correspondence from the Acta Pacis Westphalicae (1644–1647), annotated with TreeTagger parameters for classical French from the PRESTO project (DFG / ANR), 2,640,000 tokens
  • German correspondence from the Acta Pacis Westphalicae (1643–1648), approx. 835,000 tokens, not yet annotated

Berliner Papyrusdatenbank

Institution(s): Ägyptisches Museum und Papyrussammlung – Staatliche Museen zu Berlin
Data Domain(s): Collections
Language(s): German, Ancient Greek, Latin
Modality/ies: written
Disciplines (DFG): 101, 104, 106, 113
Disciplines (descriptive): As part of the worldwide papyrological database network, the Berlin papyrus database is of central importance for all subjects in section 101 “Ancient Cultures” of the DFG subject structure (especially ancient history, classical philology, Egyptology) and far beyond (e.g. comparative linguistics, religious studies, law and the like).

The resource is a constantly expanded and updated database of the Greek- and Latin-language holdings of the Berlin papyrus collection, which is the largest of its kind in Germany and one of the five largest in the world. In addition to the metadata (e.g. content, dating, origin, publications and acquisition history) and high-resolution images, links to further information from other databases and projects are available.

Bibliothek für Bildungsgeschichtliche Forschung (BBF)

Institution(s): DIPF | Leibniz Institute for Research and Information in Education
Data Domain(s): Editions
Language(s): German
Modality/ies: written
Disciplines (DFG): 102, 109
Disciplines (descriptive): History, Education, …

The Library for Educational History Research provides Friedrich Fröbel’s letters and the correspondence between Eduard Spranger and Käthe Hadlich as an online edition. It contains 6,251 documents from the years 1799–1852 (Fröbel edition) and 1903–1960 (Spranger-Hadlich). The letters can be accessed via an index of persons and years. The texts are marked up according to the guidelines of the Text Encoding Initiative (TEI). The migration of the edition to TEI Publisher is currently being prepared.

CrossAsia ITR (Integriertes Textrepositorium)

Institution(s): CrossAsia and Specialized Information Service Asia, Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
Data Domain(s): Collections, Software Services
Language(s): Chinese, English, Japanese, German, Dutch, French, Spanish, Korean, Thai, Lao
Modality/ies: written
Disciplines (DFG): 103, 106
Disciplines (descriptive): All humanities and social sciences with a reference to Asia, in particular Asian studies (Sinology, Japanese Studies, Korean Studies, Southeast Asian Studies, Central Asian Studies (Tibetology, Mongolian Studies, Uyghur Studies), South Asian Studies, Indology), or regional studies with a reference to Asia, religious studies (Buddhology), East Asian art history, etc.

The “Integrated Text Repository” CrossAsia ITR securely and sustainably archives image and text data from the databases licensed for Specialized Information Service Asia and CrossAsia for which hosting, indexing and text mining rights could be obtained. It also archives texts and image data in the public domain, such as photographs, together with their indexing data. The aim is to offer all of these on an equal footing in accordance with the FAIR principles. The current version (August 2020) contains full texts of approx. 335,000 titles with 53 million pages from 26 different licensed databases, predominantly in Chinese and English, as well as public domain texts in Western and Asian languages from the Asia collection of the digitized collections of the SBB-PK. A list of resources can be found here.

Database of Cross-Linguistic Colexifications (CLICS)

Institution(s): Max Planck Institute for the Science of Human History
Data Domain(s): Lexical Resources, Software Services
Language(s): multilingual
Modality/ies: written, transcribed
Disciplines (DFG): 104, 110, 206
Disciplines (descriptive): Historical Linguistics, Linguistic Typology, Psychology, Neuroscience

The original Database of Cross-Linguistic Colexifications (CLICS) established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.
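
To illustrate what a colexification pattern is, the following minimal Python sketch counts, for each pair of concepts, the number of languages in which one word form expresses both. The word list, the language identifiers and the function name are invented for illustration; this is not the CLICS code base or its data format, only a sketch of the underlying idea.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy word list: (language, concept, word form).
# Real CLICS data is aggregated from many standardized datasets.
WORDLIST = [
    ("lang_a", "HAND", "ruka"),
    ("lang_a", "ARM", "ruka"),   # one form for HAND and ARM
    ("lang_b", "HAND", "han"),
    ("lang_b", "ARM", "arm"),    # two distinct forms
    ("lang_c", "TREE", "tra"),
    ("lang_c", "WOOD", "tra"),   # one form for TREE and WOOD
    ("lang_d", "HAND", "man"),
    ("lang_d", "ARM", "man"),    # one form for HAND and ARM
]


def count_colexifications(wordlist):
    """Return {(concept_1, concept_2): number of languages colexifying them}."""
    # Collect the set of concepts expressed by each form in each language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    # Every pair of concepts sharing a form in a language is one colexification.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


if __name__ == "__main__":
    for pair, n in sorted(count_colexifications(WORDLIST).items()):
        print(pair, n)
```

Concept pairs that are colexified in many unrelated languages can then be read as candidates for recurrent semantic associations.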

Datenbank mit Nachweisen romanistischer Forschungsdaten

Institution(s): Specialized Information Service for Romance Studies
Data Domain(s): Software Services
Language(s): German, subject indexing partly also in French
Modality/ies: spoken, written
Disciplines (DFG): 101–02, 104, 105, 106, 108
Disciplines (descriptive): Romance Studies (Literary Studies, Linguistics, Cultural and Media Studies, Didactics)

Database based on Academic LinkShare in which, among other things, research data are described formally (Dublin Core) and by subject. Subject indexing includes the allocation of GND keywords, DDC main classes, classifications according to region and resource type, as well as abstracts. Excerpts can be generated as required and presented separately on individual websites. The data has now also been integrated into the index of the search portal of the Specialized Information Service (currently still on a test basis).

Datenbank und Meldeformular für romanistische Forschungsdaten

Institution(s): Specialized Information Service for Romance Studies, romanistik.de e.V.
Data Domain(s): Software Services
Language(s): German
Disciplines (DFG): 104, 105
Disciplines (descriptive): Primarily Romance studies or disciplines in which data with Romance relevance arise

The registration form on the communication platform romanistik.de was developed by romanistik.de, the AG Digitale Romanistik (working group on digital Romance studies) and the Specialized Information Service for Romance Studies. It allows researchers to draw attention to their own research data as well as to traditional publications. Reported research data can then be found on the platform and are also advertised in the romanistik.de newsletter.

Digitale und Retrodigitalisierte niedersorbische Wörterbücher

Institution(s): Sorbian Institute, Bautzen
Data Domain(s): Lexical Resources
Language(s): Lower Sorbian, German
Modality/ies: written
Disciplines (DFG): 101–02, 104, 105, 106, 108
Disciplines (descriptive): Sorbian Studies, Slavic Studies, Lexicography, Linguistics, Cultural Studies

Uniform web version of four retro-digitized Lower Sorbian-German dictionaries based on fine-grained, semantically and structurally modeled XML files. Also: a digital active German-Lower Sorbian dictionary. These lexical resources are increasingly linked to one another via common search interfaces.

Europäische Religionsfrieden Digital (EuReD) – Digitale Quellenedition frühneuzeitlicher Religionsfrieden

Institution(s): Leibniz-Institut für Europäische Geschichte (IEG); Universitäts- und Landesbibliothek Darmstadt
Data Domain(s): Editions
Language(s): German, Latin, French, English, Czech, Hungarian, Polish, Italian, Dutch, Danish, Swedish, possibly Russian
Modality/ies: written
Disciplines (DFG): 102, 103, 104, 105, 107
Disciplines (descriptive): Historical Peace Research; Cultural History; Legal History; History of Church and Theology; Edition Studies

The open-access and open-source edition provides, for the first time, the textual basis for comparative research on Early Modern religious peace-making in Europe. It includes detailed introductions as well as text-critical and explanatory commentaries and covers the period from 1485 (Peace of Kuttenberg) to 1788 (so-called Woellner Edict of Religion). The edition is based on the texts of the religious peace treaties as they were first published and read (editio princeps). The edition is born-digital and uses the XML/TEI P5 standard.

Fallada-Archiv

Institution(s): Karlsruhe Institute of Technology, Institute for German Studies
Data Domain(s): Editions, Collections
Language(s): German, partly English
Modality/ies: written
Disciplines (descriptive): Literary Studies, Edition Philology, Reception Research, Magazine Research, Cultural Studies, Text Linguistics, History, Sociology

The corpus consists of:

  • Bibliography on the German author Hans Fallada (1893–1947), which lists all primary texts, adaptations, reviews and the current state of research.
  • Digital copies of hard-to-access journalistic and literary contributions by Fallada as well as the first printings of his novels, which appeared in serial form in various newspapers and magazines.
  • Digitized contemporary reviews of Fallada’s work.

The corpus of these texts is currently distributed across many different archives and libraries and most of them are not digitally accessible.

GEI-Digital – Die digitale Schulbuch-Bibliothek

Institution(s): Georg-Eckert-Institut – Leibniz-Institut für internationale Schulbuchforschung
Data Domain(s): Editions, Collections
Language(s): German
Modality/ies: written
Disciplines (DFG): 102, 103, 104, 105, 109, 111
Disciplines (descriptive): Educational Media Research, (Modern) History, Cultural History, Literary Studies, Sociology, German Philology, (Historical) Linguistics

GEI-Digital provides free access to digitised historical German textbooks from before 1918 through digital images, OCRed full text and extensive metadata. It enables targeted full-text searches within existing digitised collections. It currently contains more than 6,300 volumes of German-language textbooks, mainly reading primers and textbooks for the subjects “Realienkunde” (basic social/natural science), geography and history.

Indices zur sprachlichen und literarischen Bildung in Deutschland

Institution(s): Dr. Uwe Grund, Hanover
Data Domain(s): Collections
Language(s): German
Modality/ies: written
Disciplines (DFG): 104, 105, 109
Disciplines (descriptive): German Studies, Educational Science

The five-volume print version of the INDICES (Munich, etc.: Saur, 1991ff) lists and makes accessible around 10,000 documents by around 3,000 authors in five leading specialist journals and two paradigmatic official journals. The data is based on the autopsy of around 100,000 printed pages. The serial sources (from approx. 1910 to approx. 1970) are described in several dimensions (text genre / content focus / author’s method). Both extraction and annotation processes (using a thesaurus) are used. Control and quality assurance (uniform indexing depth, linkability and sortability of the data sets according to chronological, alphabetical and taxonomic criteria) took place via specially created sets of rules for data acquisition, evaluation and further processing (unpublished). Individual files for monographic sources (e.g. readers, language books) are available in raw versions.

Niklas-Luhmann-Archiv

Institution(s): Bielefeld University, Faculty of Sociology
Data Domain(s): Editions, Collections
Language(s): German, English, Italian, Spanish
Modality/ies: written, spoken
Disciplines (DFG): 102, 105, 106, 108, 109, 110, 111, 113
Disciplines (descriptive): Sociology, Philosophy, Law, Education, Literature, Religious Studies, Political Science, Organizational Science, History of Science

Academic estate of Niklas Luhmann (1927–1998), one of the most important sociologists of the 20th century. The project covers the indexing, transcription, edition and digitization of the card index with around 90,000 notes, the manuscripts left in the estate and other materials (including audio and video recordings of lectures and interviews).

Presseausschnitte online

Institution(s): Herder-Institut für historische Ostmitteleuropaforschung – Institut der Leibniz-Gemeinschaft
Data Domain(s): Collections, Software Services
Language(s): mainly German
Modality/ies: written
Disciplines (DFG): 102, 103, 111
Disciplines (descriptive): History, Contemporary History, Politics, Media Studies

Over 5 million clippings document the history, politics, culture and economy of East Central Europe from 1916 up to the present day. We have focused in particular on the systematic analysis of regional and national daily and weekly newspapers from East Central Europe and German-speaking areas covering the period from 1952 until March 1999. We offer comprehensive archives organized by person, place and topic. They can be used as a unique documentation of the socialist experiment in Eastern Europe. Around 10,000 clippings about persons have already been digitized and combined with metadata; 6,500 of them have additionally been OCRed.

Zusammenstellung von Sprachkorpora aus der Romania

Institution(s): Various providers, compiled by the Specialized Information Service for Romance Studies
Data Domain(s): Collections, Lexical Resources
Language(s): Romance languages (especially French, Italian, Portuguese, Romanian, Spanish); other languages, e.g. in translations, for instance English, sign languages
Modality/ies: mainly written, partly spoken as audio or signed as video corpora with or without transcription
Disciplines (DFG): 101–02, 104, 105, 106, 108
Disciplines (descriptive): Romance Studies, Linguistics, other disciplines working with texts

The description of the individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), is usually comprehensively indexed with GND keywords, DDC main classes and abstracts. The language of each resource is also recorded, which allows filtering by individual languages.

Zusammenstellung von Volltextsammlungen aus der Romania (Editionen)

Institution(s): Various providers, compiled by the Specialized Information Service for Romance Studies
Data Domain(s): Editions
Language(s): Mainly Romance languages (especially French, Italian); occasionally other languages, e.g. in translations
Modality/ies: mainly written
Disciplines (DFG): 101–02, 104, 105, 106, 108, 111
Disciplines (descriptive): Romance Studies (Literary Studies, Linguistics, Cultural and Media Studies, Didactics), interdisciplinary philologies, Cultural and Media Studies, Social Sciences, Digital Humanities

The description of the individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), usually contains comprehensive subject indexing with GND keywords, DDC main classes and abstracts. The language of each resource is also recorded, which allows filtering by individual languages.

Zusammenstellung von Volltextsammlungen aus der Romania (Sammlungen und Editionen)

Institution(s): Various providers, compiled by the Specialized Information Service for Romance Studies
Data Domain(s): Editions, Collections
Language(s): Mainly Romance languages (especially French, Italian, Portuguese, Spanish); occasionally other languages, e.g. translations
Modality/ies: mainly written
Disciplines (DFG): 101–02, 104, 105, 106, 108, 111
Disciplines (descriptive): Romance Studies (Literary Studies, Linguistics, Cultural and Media Studies, Didactics), interdisciplinary philologies, Cultural and Media Studies, Social Sciences, Digital Humanities

The description of individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), usually contains a comprehensive subject indexing with GND keywords, DDC main classes and abstracts. The language concerned is also recorded, which allows filtering by individual languages.

Zusammenstellung von lexikographischen Projekten aus der Romania

Institution(s): Various providers, compiled by the Specialized Information Service for Romance Studies
Data Domain(s): Lexical Resources
Language(s): Romance languages (especially French, Italian)
Modality/ies: usually written
Disciplines (DFG): 101–02, 104, 105, 106, 108, 111
Disciplines (descriptive): Romance Studies (Literary Studies, Linguistics, Cultural and Media Studies, Didactics), interdisciplinary philologies, Cultural and Media Studies, Social Sciences, Digital Humanities

The description of the individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), usually contains comprehensive subject indexing with GND keywords, DDC main classes and abstracts. The language of each resource is also recorded, which allows filtering by individual languages.