Digital Humanities

New Approaches to OCR for Early Printed Books
2020, DigItalia, Rivista Del Digitale Nei Beni Culturali (with Nikolaus Weichselbaumer et al.)

Summary

Books printed before 1800 present major problems for OCR. One of the main obstacles is the lack of diversity of historical fonts in training data. This project, which brought together book historians and computer scientists, aimed to address this deficiency by focussing on three major issues. Our first target was to create a tool that automatically identifies font groups in images of historical documents. We concentrated on Gothic font groups that were commonly used in German texts printed in the 15th and 16th centuries: the well-known Fraktur and the lesser-known Bastarda, Rotunda, Textura and Schwabacher. The tool was trained on 35,000 images and reaches an accuracy of 98%. It can not only differentiate between the above-mentioned font groups but also recognise Hebrew, Greek, Antiqua and Italic. Furthermore, it can identify woodcut images and irrelevant data (book covers, empty pages, etc.). In a second step, we created an online training infrastructure (okralact). It facilitates the use of various open-source OCR engines such as Tesseract, OCRopus, Kraken and Calamari. At the same time, it facilitates the training of font-group-specific models. The high accuracy of the recognition tool opens up the unprecedented opportunity to differentiate between the fonts used by individual printers. With more training data and further adjustments, the tool could help to fill a major gap in historical research.

https://doi.org/10.36181/digitalia-00015 (OPEN ACCESS)

Dataset of Pages from Early Printed Books with Multiple Font Groups
2019, HIP ’19. Proceedings of the 5th International Workshop on Historical Document Imaging and Processing (with Nikolaus Weichselbaumer et al.)

Summary

Based on contemporary scripts, early printers developed a large variety of different fonts. While fonts may slightly differ from one printer to another, they can be divided into font groups, such as Textura, Antiqua, or Fraktur. The recognition of font groups is important for computer scientists to select adequate OCR models, and of high interest to humanities scholars studying early printed books and the history of fonts. In this paper, we introduce a new, public dataset for the recognition of font groups in early printed books, and evaluate several state-of-the-art CNNs for the font group recognition task. The dataset consists of more than 35 600 page images, each page showing up to five different font groups, of which ten are considered in this dataset.

https://doi.org/10.1145/3352631.3352640 (OPEN ACCESS)

The rapid rise of Fraktur
2020, DHd 2020 Spielräume: Digital Humanities zwischen Modellierung und Interpretation

Summary

From the first experiments in 1513, Fraktur quickly became the most successful gothic font in print history. Whereas gothic fonts in most other countries went out of use in the 16th and 17th centuries, Fraktur became by far the most used font for German texts in the early modern period. The font also survived into modernity and was used frequently, almost unchanged, until the middle of the 20th century. Even today, the font is often used, especially when a design should appear ‘historical’. Despite its importance, fairly little is known about the famous font. The origins of Fraktur at the beginning of the 16th century and the possible creators Vincenz Rockner and Johann Neudörffer have been the subjects of several studies (Kautzsch 1922, Kapr 1993: 24, Hessel 1937). Apart from this, however, we know remarkably little about its development over the following centuries. Only the Antiqua-Fraktur dispute around 1800, when German intellectuals discussed which of the two fonts was more appropriate for German texts, has attracted the interest of book historians again (Lühmann 1981, Killius 1999). Yet the emergence of Fraktur and its leading role in font history remain understudied.

Tracing the emergence of Fraktur is complicated by two facts. On the one hand, contemporary evidence, such as invoices, letters and type specimens, is at best fragmentary and nearly impossible to contextualise without an analysis of the books themselves. On the other hand, researchers are simply overwhelmed by the amount of material available. For the 16th century alone, the German national bibliography VD16 (www.vd16.de) lists over 100,000 titles. This makes it impractical to examine every book individually and determine its fonts, or even just its main text font.

Recent research presents a solution to this problem. With the help of a newly developed pattern recognition tool, large amounts of digitised book pages can be categorised into font groups. This tool was developed in the context of a project on font-specific OCR (Weichselbaumer et al. 2019, Seuret et al. 2019) and was then used for a large dataset of digitised books from BSB Munich. This paper will present the results and provide new insights into the rapid rise of Fraktur.

https://doi.org/10.5281/zenodo.3666690 (OPEN ACCESS)