Text+ User Story

Full-text digitization with OCR

Jan Horstmann (Forschungsverbund Marbach Weimar Wolfenbüttel)

DFG subject areas: 104 Linguistics, 105 Literary Studies

Text+ data domain: Collections

Motivation

Collection-holding institutions such as libraries and archives preserve text documents that are of great importance for the study of literary works and constellations (DFG subject classification: especially literary studies and linguistics). Diaries, correspondence, calendars, and also historical prints of novels or plays regularly play an important role in arguments and interpretations in literary studies as well as in historically oriented linguistic projects. Fully digitized texts are the prerequisite for any literary or linguistic research question that focuses on the digital analysis or interpretation of single texts or text corpora. The digitization of these text objects (a task that falls mainly to information science) is progressing at different speeds in the individual collection-holding institutions, both nationally and internationally. In many cases, work has reached only the stage of image digitization and the generation of descriptive or technical metadata. As long as a collection is not deemed relevant enough for a digital edition, the actual primary text data cannot be read by machines. Full-text digitization addresses this problem so that even large quantities of text can be made machine-readable, not, of course, with the claim of a scholarly edition, but with the aim of mass data generation. The approaches of choice here are OCR (optical character recognition) for (historical) prints and HTR (handwritten text recognition) for manuscripts, the latter with the option of training models for specific hands or scripts.

Objectives

The main problem in this area is the large number of different approaches and standards used in digitization (data domain: collections). In metadata generation and image digitization as well as in full-text recognition, there are hardly any binding cross-regional specifications for data structures, workflows, software interfaces and storage processes. Two institutions of the Research Association Marbach Weimar Wolfenbüttel (MWW), the Klassik Stiftung and the Herzog August Bibliothek, have participated and continue to participate in the DFG’s VD projects, in which standards of image digitization are applied. This also holds true for the OCR‑D project, which is developing an application for the academically reliable full-text digitization of historical prints. MWW has declared its intention to apply for an implementation project that would offer OCR‑D as a low-threshold web service. Apart from general feasibility and the design of possible applications, the question that becomes relevant in this context is how the automatically generated full-text data should be handled afterwards. Open questions in this field that Text+ could help to address include: According to which schemata should research data be stored and made accessible in a sustainable way? How should one deal with situations in which the fully digitized texts are still subject to copyright? What about ethical dimensions, for example in cases where personal rights must be taken into account?

Review by community

If MWW's planned OCR‑D implementation project is approved, Text+ services and the standards developed for long-term storage, research data management and data transfer structures can be applied and evaluated within it.