Text+ User Story

Facilitating research in Early Modern German texts by OCR

Julie Lisa Davies (University of Münster), Daniela Schulz, Elisabeth Engl, Hartmut Beyer (Herzog August Bibliothek Wolfenbüttel)

DFG subject area: 102 History

Text+ data domain: Collections

Motivation

My research at the HAB focuses on 16th-18th century texts, primarily in German, Latin and English. I look at material related to philosophy, theology, early modern science (especially botany) and witchcraft. I am therefore able to compare working with English and Latin texts to working with German blackletter texts. Of the three groups, the items printed in blackletter are the most challenging and time-consuming, and this is a direct result of the typeface. Though OCR of any early print typeface is still far from perfect, and better for English than for Latin, the quality obtained when running blackletter prints through the available programs is so low that it is barely worth the time it takes to run the process. The results are often not usable.

Objectives 

Having access to reasonable quality OCR of early prints would be beneficial to my research in three main ways. 

1. Being able to complete full-text searches opens up avenues for research that would otherwise take much longer, be less comprehensive, or not be possible at all. For example, part of my research has involved identifying references to particular authors or topics in non-traditional texts. These references are often only minor to the main theme of the referencing work, and therefore the name or term I am looking for is often not included in the contents or index of the work – but is there nonetheless.

It is currently possible to quickly identify many such mentions in a large English-language sample with good reliability, and in Latin samples with reasonable reliability. However, as the texts I work on are not often among those being prioritized for manual transcription/OCR correction, I still need to scan through blackletter texts manually, which, in addition to being very time-consuming, is much less reliable. Being able to OCR texts oneself, in addition to having the more popular texts already available in the repository, would greatly assist scholars who, like me, often work on less common texts. This, in turn, would help further broaden and deepen our understanding of the past and allow research to be driven by research questions, rather than driven (or limited) by the texts that are already accessible because of their popularity.

2. Being able to OCR texts with reasonable results is also a practical time-saver in many ways. A modern computer typeface is often easier to read and, though access to the original print remains essential for accuracy, having the OCRed version can make it much easier to quickly read over something to check for relevance, or to search for references (if not enough detail can be remembered for them to show up in a full-text search). Furthermore, being able to do practical things like copying and pasting important quotes into notes, rather than typing them manually, also saves a lot of time.

3. Ultimately, though it is still rare for historical texts in all categories, it is also fantastic when texts can be linked to appropriate dictionaries, lexicons or parsing tools. To give a modern example, both my e-book reader and my smartphone have a function which allows me to select and install foreign language dictionaries. This means I am able to click on unfamiliar German words and have a translation/definition within a couple of seconds. For this to be effective in a research context, the range of available lexicons/dictionaries would be very important, as would the ability to easily switch between references. And, when looking at texts which predate standardized spelling, a feature of this kind that automatically links to modern spellings would also be sensational.

Solution 

For all these reasons, improving the quality of OCR for blackletter texts is essential to expanding, and possibly even to maintaining, interest and momentum in the study of German language works – particularly because the blackletter typeface was used for German language texts for much longer than for other languages. Currently, young researchers in particular are finding themselves drawn to projects for which they have remote access to sources and to which they can readily apply a host of digital tools.

Improved OCR for blackletter texts will also further support key emerging research fields, such as the knowledge exchanges between German- and English-speaking lands, which are gaining increasing interest and influence as more archives become available online.