Text+ User Story

Building Collections for Literary Scholars

Raisa Barthauer (SUB Göttingen) 

DFG subject area: 105 Literary Studies

Text+ data domain: Collections

Motivation

I want to build an individual text collection for my thesis on German literature of the Enlightenment. Using the TextGrid Repository, I have already built a first collection and started to annotate the texts with CATMA. In addition to a close-reading approach, I am going to test my hypothesis against a reference corpus using low-threshold digital tools such as Voyant Tools and the Topics Explorer. To build the reference corpus, I have made a list of relevant works and authors for the different cultural areas. For an initial search for resources, I often use general-purpose internet search engines, but the results are rarely useful. Higher visibility of the relevant resources and easier access via a central community or a central service would be of great help. Admittedly, most of the German literature is findable and accessible via the TextGrid Repository; other resources are linked via the websites of FID Germanistik and the portal GiN-Guide. Even so, it would save a lot of time if different repositories and collections were searchable and accessible via one central community-oriented access point (VCR, FCS or Generic Search and Collection Registry (https://www.clariah.de/dienste.html)) and if the tools were embedded, so that they can be used instantly.

Several texts on my list are not yet digitized: what should I do? 

I asked my university library for a digitization service. Such a service is offered, but as I need a large number of texts in digitized form, it becomes too expensive for me in the long run. I have to digitize many texts on my own and spend a lot of time on that. The book scanner probably produces a first raw plain text version. But as many of the texts I need are printed in Blackletter, I do not trust the OCRed texts. To evaluate the quality of the results and the further usability of the texts, I need advice from OCR specialists.
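A rough first check of OCR quality, before consulting a specialist, could be the character error rate (CER): the edit distance between a hand-corrected sample and the OCR output, divided by the length of the corrected sample. The following is a minimal sketch in plain Python; the function names and the sample strings are assumptions made for illustration and do not come from any Text+ service.

```python
def edit_distance(reference: str, candidate: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            substitution = previous[j - 1] + (0 if ref_char == cand_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the hand-corrected reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: the long s of Blackletter prints misread as "f".
reference = "Wiſſenſchaft"
ocr_output = "Wiffenfchaft"
print(f"CER: {character_error_rate(reference, ocr_output):.2f}")
```

A hand-corrected page or two per print is usually enough to see whether a scan is usable as it stands or whether retraining the OCR model is worth discussing with experts.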

How do I create the OCRed text and how do I improve the OCR quality of the given text? 

As I do not get OCR via the library, I need a simple tool (and a tutorial) for running OCR and producing machine-readable text. Even if I got the OCR text out of the mass digitization process, I would want to improve the quality by training the OCR model, and I need people who know how to do that.

How do I get a more deeply structured and annotated text collection? 

To support the analysis in my thesis, I want to use annotation tools and low-threshold digital tools such as Voyant Tools. An easy-to-use tool (and a tutorial such as https://de.dariah.eu/tei-tutorial) for creating a simple XML-TEI file for deeper annotation would thus be very helpful.
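To illustrate what such a simple XML-TEI file involves, the following sketch wraps a plain text in a minimal TEI document using only the Python standard library. It is an assumption-laden example, not a Text+ or TextGrid tool: the function name `plain_text_to_tei`, the placeholder statements in the header, and the rule that blank lines separate paragraphs are all chosen for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def plain_text_to_tei(title: str, author: str, text: str) -> str:
    """Wrap plain text in a minimal TEI document (header plus paragraphs)."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def q(tag: str) -> str:
        return f"{{{TEI_NS}}}{tag}"

    tei = ET.Element(q("TEI"))

    # Minimal header: title statement, publication statement, source description.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    ET.SubElement(title_stmt, q("author")).text = author
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Unpublished working copy."  # placeholder
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Derived from OCR output."  # placeholder

    # Body: one <p> per block of text separated by a blank line (assumption).
    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    for block in text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse line breaks and spaces
        if paragraph:
            ET.SubElement(body, q("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")


sample = "First paragraph,\ncontinued here.\n\nSecond paragraph."
print(plain_text_to_tei("Nathan der Weise", "Gotthold Ephraim Lessing", sample))
```

A file of this kind is only a starting point; deeper annotation (speakers, verse lines, named entities) would be added inside the body afterwards.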

Objectives 

To significantly facilitate the work on my thesis and to support my research, it would, first of all, be very helpful if a virtual collection registry and a federated content search were made available for different corpora through a single access system (e.g. a portal). In my opinion, researchers should be better equipped to use digital resources and tools on their own. For example, the Text+ service portfolio could offer a workflow for creating a text collection around a specific research question, supporting non-DH researchers in particular.

Concrete assistance in the case of non-digitized texts (for example in the form of experts or a community that can be consulted for advice, support, tutorials, and low-threshold tools) would not only make my work much more efficient, but would also enable a more profound analysis of my texts. The option to integrate various text analysis tools directly into my collection could make my work more efficient and better organized. In general, increased visibility of the resources on the internet would also be welcome.

Solution 

Although many resources are already available through DARIAH and especially TextGrid, my demand for digitized texts and easily accessible tools to work on them is, unfortunately, not yet covered. What I need most is for these resources to take less effort to use. A central search interface and increased interoperability would be very helpful. More texts and a wider range of tools should be made available. Training and consulting on them would significantly lower the barrier to use, as would more dissemination, which would also increase visibility.

Challenges

One risk is that preparing the texts might take so much of my time that the effort becomes disproportionate to the result. In that case, I could not realize the analytical possibilities that annotation and DH tools offer, because the time expenditure would be too great.