Offers for research data for integration within the context of Text+
Running Call: Direct link to submit offers for research data (German online form)
Together with you, we would like to sift through the rich treasure of research data and possible applications offered by the humanities and other disciplines and contribute productively to the NFDI Text+ initiative. It is precisely through the diversity and quantity of research data and a wide range of research questions that a stable and multi-faceted basis for a research data infrastructure is to be created.
Within the Text+ initiative, three data domains are distinguished: Collections, Lexical Resources and Editions. In these data domains, you can enrich the initiative with your offers of research data, research questions and research tasks.
The humanities community in all its diversity is the most important reference point for Text+ and we would like to represent it with your help. We would therefore like to invite you to participate and to help shape the NFDI.
Your research data can complement the Text+ offer and can be included in the pool of representative data for Text+. The governance of Text+ intends that the Scientific Coordination and further development of the three data domains of Text+ shall be exclusively in the hands of expert scientists. This further development will be based on the data pool. In order to obtain a structured overview of the data stocks relevant to the community, we have created a questionnaire.
Feedback submitted by August 16, 2020 could be included in the preparation of the application. Later submissions are possible using the online form. For further questions, please contact office[at]text-plus.org.
Results of the Data Call
Below, you see some of the data that were proposed by the community for integration in the Text+ infrastructure.
▸ Tailored Corpora and Topic Models for Japanese Parliamentary Minutes (provisional title)
The Japanese Diet has made all speeches since 1947 public through an XML interface. The enormous number of texts allows for tracing the evolution of concepts (if understood as corresponding to Japanese words) in discourse if methods of distant reading are employed. At the German Institute for Japanese Studies (DIJ), such analyses are carried out using LDA topic models. In addition to the models, systematic analysis also uses metadata that is provided through the public interface (like date, committee, name of the speaker) and custom data structures based on this (like networks).
▸ Soldatenbriefe des 18. und 19. Jahrhunderts
170 “soldiers’ letters” from the years 1745 to 1872. The letters come to a small extent from older but reliable editions (no longer protected by copyright), but mainly from our own archive work, and were therefore transcribed and published for the first time by the editor, Marko Neumann, based on the manuscript. These are available on the website of the Universitätsverlag Winter based in Heidelberg and can be downloaded from there free of charge. This corpus is extremely valuable from both a linguistic and a historical perspective, in particular from a cultural, literary and military-historical perspective.
For the hurdles (from a legal point of view) and difficulties (with a view to the form of publication and especially the data format) for the re-use of this valuable data, cf. the user story belonging to this data offer “Soldiers letters of the 18th and 19th centuries: From the PDF edition to reusable, interoperable research data”.
▸ Erschließung der Korrespondenz der Constance de Salm (1767–1845)
The resource contains the metadata with which the correspondence of Constance de Salm (around 11,000 letters) was indexed in terms of content and form in a long-term project of the German Historical Institute Paris. In addition to information on recipients and senders (such as person, place and date), the content is indexed using keywords.
JudaicaLink is an RDF-based knowledge graph that integrates data from different sources in the field of Jewish studies. These mainly include lexicons and biographies; there are links to GND, DBpedia/Wikipedia, Wikidata and many other data sources. The data is used in the Specialized Information Service Jewish Studies to create cross-links in metadata and, currently, also in full texts.
▸ LTA – Latin Text Archive (with FLL, Frankfurt Latin Lexicon)
LTA is a freely accessible, web-based analytical archive of (mostly) high-quality critical editions of Latin texts, annotated with complex metadata, fully lemmatized, and linked to a full-form lexicon (FLL). It provides diachronically organized reference corpora arranged by text genre, as well as text mining tools for individual corpus building and analysis. The size and diversity of the content allow researchers to build reliable diachronic corpora (e.g. from only one text type) for analysis. It covers text production in Latin-speaking Europe from 400 to 1500 and will be extended continuously. Technically, LTA is based on DTA (Deutsches Textarchiv).
▸ Medizinische Gutachten des 17. und 18. Jahrhunderts
The resource is a text corpus containing 150 transcribed medical reports from the 17th and 18th centuries from printed medical casebooks. The texts are available as plain text files, have rudimentary annotations (line break and page break), but only brief bibliographical information (no TEI-compliant header!).
▸ Dokumente und Materialien zur ostmitteleuropäischen Geschichte
The digital edition offers thematic modules for university teaching on East Central European history in its temporal depth and spatial breadth, so that modules on medieval history are offered alongside modules on contemporary history. All text sources, but also other materials such as statistics, are offered in the respective original language, in German translation and, if possible, as a scanned image from the original source to ensure citability; further materials such as maps, images, a selective bibliography with literature in Western languages and a chronology are also provided for orientation. All modules are subject to a double-blind peer review process. The offer is constantly being expanded and revised, and an English-language version is currently being developed.
▸ New Testament Virtual Manuscript Room (NTVMR) – ECM digital
In the Virtual Manuscript Room (VMR), the conventional short list of Greek manuscripts is supplemented by all the information available on the individual manuscripts. Above all, as far as the owning institutions agree, photos of the manuscripts are made available via an appropriate website. For this purpose, the microfilm holdings of the INTF are scanned (in some cases new photos are also obtained) and the content is indexed in such a way that they are linked to the transcripts of the manuscripts which were created in the INTF. There are also links pointing to photos and information elsewhere on the internet. The VMR has meanwhile been expanded into an interactive edition platform and serves as a working basis for the Editio Critica Maior (ECM) of the Greek New Testament created in the INTF, but can also be used for other text-critical editions of handwritten works. At the moment the ECM and VMR are growing together into an interactive critical edition of the New Testament.
▸ Niedersorbische Textkorpora
We distinguish an “old” and a “new” text corpus (under construction). Both corpora are connected by different access methods. The data basis for the latter currently (2020) comprises around 43 million tokens. The texts are annotated step by step (including normalization / lemmatization). The search does not require in-depth knowledge of historical writing and the variety of forms, but it currently covers only a limited number of texts. The old text corpus comprises more than 23 million tokens, of which around 15 million are available online. The texts are not annotated and hardly processed; the old corpus provides only the original spelling. In addition, the texts have not been corrected, so (copying) errors are to be expected.
▸ APWCF, APWCD: Linguistisches Korpus der Acta Pacis Westphalicae, französische und deutsche Korrespondenzen
The resource is a linguistic corpus based on the digital edition of Acta Pacis Westphalicae (APW). For this re-use and the non-commercial publication of the data, the institutions holding the respective rights have given written permission. From a linguistic point of view, this resource, originally prepared for historiography, is extremely valuable: the German, French, often multilingual technical or informal text types represent different registers of an important phase of language change. The edition criteria show minimal interference with the original text. The following resources are available in the corpus so far (marking of metadata, separation of text data):
- French correspondence from the Acta Pacis Westphalicae (1644–1647), annotated with TreeTagger parameters for classical French from the PRESTO project (DFG / ANR), 2,640,000 tokens
- German correspondence from the Acta Pacis Westphalicae (1643–1648), approx. 835,000 tokens, not yet annotated
▸ Berliner Papyrusdatenbank
The resource is a constantly expanding and updated database of the Greek and Latin-speaking holdings of the Berlin papyrus collection, which is the largest of its kind in Germany and one of the five largest in the world. In addition to the metadata (e.g. content, dating, origin, publications and acquisition history) and high-resolution images, links to further information from other databases and projects are available.
▸ Bibliothek für Bildungsgeschichtliche Forschung (BBF)
The Library for Educational History Research provides Friedrich Fröbel’s letters and the correspondence between Eduard Spranger and Käthe Hadlich as an online edition. It contains 6,251 documents from the years 1799–1852 (Fröbel edition) and 1903–1960 (Spranger-Hadlich). The letters can be accessed via an index of persons and years. The texts are marked up according to the guidelines of the Text Encoding Initiative (TEI). The migration of the edition to the TEI-Publisher is currently being prepared.
▸ CrossAsia ITR (Integriertes Textrepositorium)
The “Integrated Text Repository” CrossAsia ITR securely and sustainably archives image and text data from the databases licensed for Specialized Information Service Asia and CrossAsia, for which hosting, indexing and text mining rights could be obtained, as well as texts and image data such as photographs in the public domain together with their indexing data, with the aim of being able to offer them on an equal footing in accordance with the FAIR principles. The current version (August 2020) contains full texts of approx. 335,000 titles with 53 million pages from 26 different, predominantly Chinese-language and English-language databases subject to license, as well as public domain texts from the Asia collection of the digitized collections of the SBB-PK in Western and Asian languages. A list of resources can be found here.
▸ Database of Cross-Linguistic Colexifications (CLICS)
The original Database of Cross-Linguistic Colexifications (CLICS) has established a computer-assisted framework for the interactive representation of cross-linguistic colexification patterns. It has proven to be a useful tool for various kinds of investigation into cross-linguistic semantic associations, ranging from studies on semantic change and patterns of conceptualization to linguistic paleontology. But CLICS has also been criticized for obvious shortcomings. Building on standardization efforts reflected in the CLDF initiative and novel approaches for fast, efficient, and reliable data aggregation, CLICS² expanded the original CLICS database. CLICS³, the third installment of CLICS, exploits the framework pioneered in CLICS² to more than double the amount of data aggregated in the database.
▸ Datenbank mit Nachweisen romanistischer Forschungsdaten
Database based on Academic LinkShare, in which i.a. Dublin Core research data are formally and factually described. Subject indexing includes the allocation of GND keywords, DDC main classes, classifications according to region and resource type, as well as abstracts. Excerpts can be generated as required and presented separately on individual websites. The data has now also been integrated into the index of the search portal of the Specialized Information Service (currently still on a test basis).
▸ Datenbank und Meldeformular für romanistische Forschungsdaten
The registration form was developed by romanistik.de, the AG Digitale Romanistik (Digital Romance Studies) and the Specialized Information Service Romanistik (Romance Studies) on the communication platform romanistik.de. It allows researchers to draw attention to their own research data as well as to traditional publications. Reported research data can then be found on the platform and are also advertised in the
▸ Digitale und Retrodigitalisierte niedersorbische Wörterbücher
Uniform web version of four retro-digitized Lower Sorbian-German dictionaries based on finely granular, semantic-structurally modeled XML files. Also: Digital active German-Lower Sorbian dictionary. These lexical resources are increasingly linked to one another via common search interfaces.
▸ Europäische Religionsfrieden Digital (EuReD) – Digitale Quellenedition frühneuzeitlicher Religionsfrieden
The open-access and open-source edition provides for the first time the textual basis for comparative research on Early Modern religious peace-making in Europe. It includes detailed introductions and text-critical as well as explanatory commentaries and covers the period from 1485 (Peace of Kuttenberg) to 1788 (so-called Woellner Edict of Religion). The basis of the edition is formed by the texts of the religious peace treaties as they were first published and read (editio princeps). The edition is born-digital and uses the XML/TEI P5 standard.
The corpus consists of:
- Bibliography regarding the German author Hans Fallada (1893–1947), which lists all primary texts, adaptations, reviews and the current state of research.
- Digital copies of hardly accessible journalistic and literary contributions by Fallada as well as the first prints of his novels, which have appeared in serial form in various newspapers and magazines.
- Digitized contemporary reviews of Fallada’s work.
The corpus of these texts is currently distributed across many different archives and libraries and most of them are not digitally accessible.
▸ GEI-Digital – Die digitale Schulbuch-Bibliothek
GEI-Digital provides free access to digitised historical German textbooks from before 1918 through digital images, OCRed full text and extensive metadata. It enables targeted full-text searches within existing digitised collections. It currently contains more than 6,300 volumes of German-language textbooks, mainly reading primers and textbooks for the subjects “Realienkunde” (basic social/natural science), geography and history.
▸ Indices zur sprachlichen und literarischen Bildung in Deutschland
The five-volume print version of the INDICES (Munich, etc.: Saur, 1991ff) lists and makes accessible around 10,000 documents by around 3,000 authors in five leading specialist journals and two paradigmatic official journals. The data is based on the autopsy of around 100,000 printed pages. The serial sources (from approx. 1910 to approx. 1970) are described in several dimensions (text genre / content focus / author’s method). Both extraction and annotation processes (using a thesaurus) are used. Control and quality assurance (uniform indexing depth, linkability and sortability of the data sets according to chronological, alphabetical and taxonomic criteria) took place via specially created sets of rules for data acquisition, evaluation and further processing (unpublished). Individual files for monographic sources (e.g. readers, language books) are available in raw versions.
Scientific estate of Niklas Luhmann (1927–1998), one of the most important sociologists of the 20th century. Indexing, transcription, edition and digitization of the card box with around 90,000 notes, the manuscripts he left behind and other materials (including audio and video recordings of lectures and interviews).
▸ Presseausschnitte online
Over 5 million clippings document the history, politics, culture and economy of East Central Europe from 1916 up to the present day. We have focused in particular on undertaking a systematic analysis of regional and national daily and weekly newspapers from East Central Europe and German-speaking areas covering the period from 1952 until March 1999. We offer comprehensive personal, local and thematic archives. They can be used as a unique documentation of the socialist experiment in Eastern Europe. Around 10,000 clippings about persons have already been digitized and combined with metadata; 6,500 of them have additionally been OCRed.
▸ Zusammenstellung von Sprachkorpora aus der Romania
The description of the individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), is usually comprehensively indexed with GND keywords, DDC main classes and abstracts. The language of each resource is also recorded, which allows filtering by individual languages.
▸ Zusammenstellung von Volltextsammlungen aus der Romania (Editionen)
The description of the individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), usually contains a comprehensive subject indexing with GND keywords, DDC main classes and abstracts. The language concerned is also recorded, which allows filtering by individual languages.
▸ Zusammenstellung von Volltextsammlungen aus der Romania (Sammlungen und Editionen)
The description of individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), usually contains a comprehensive subject indexing with GND keywords, DDC main classes and abstracts. The language concerned is also recorded, which allows filtering by individual languages.
▸ Zusammenstellung von lexikographischen Projekten aus der Romania
The description of the individual data records can be found in the respective catalogue entry, which, in addition to a formal title listing (Dublin Core), usually contains a comprehensive subject indexing with GND keywords, DDC main classes and abstracts. The language concerned is also recorded, which allows filtering by individual languages.