Digital Humanities & Digital Cultural Heritage

Reading List: Text Recognition for Digital Collections

Posted: 08-07-2020 Topics: 2018-2022 Strategy

Optical character recognition (OCR) and handwritten text recognition (HTR) are processes most libraries are familiar with from digitising (large volumes of) text. The software automatically recognises characters, which then become available for, e.g., keyword search and computational analysis. The rise of machine learning applications has brought a corresponding rise in the use of HTR and improvements in OCR quality.

This reading list highlights some publications worth reading for librarians who wish to learn more about text recognition processes, new developments in the field and the impact of lower-quality text recognition on digital collections.

The LIBER Digital Humanities and Digital Cultural Heritage Working Group asked their members to suggest articles and publications that they found useful in their own work. All submissions have been collected in a Zotero folder. This list is a selection of those items.

  1. A Research Agenda for Historical and Multilingual Optical Character Recognition by David A. Smith and Ryan Cordell
    This report by Northeastern University’s NULab for Texts, Maps, and Networks is the result of an in-depth field study to help put together a research agenda for next steps to improve OCR. The final report includes nine recommendations addressed to various stakeholders with a vested interest in improving historical and multilingual OCR for print texts and manuscripts.
  2. Transforming scholarship in the archives through handwritten text recognition – Transkribus as a case study by Guenter Muehlberger et al.
    An overview of the current use of handwritten text recognition (HTR) on archival manuscript materials, as provided by the EU H2020-funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content.
  3. Awesome OCR by Konstantin Baierer
    A GitHub repository with links to OCR engines, file formats, projects, tutorials, datasets and many other resources pertaining to OCR.
  4. Europeana Pro Issue 13 on OCR edited by Gregory Markus
    This issue highlights OCR projects run at various libraries. It includes a section on the OCR-D framework by Neudecker et al., a description of the British Library’s initiatives with OCR for Bangla and HTR for Arabic, including evaluation of results from ICDAR competitions, by Tom Derrick and Adi Keinan-Schoonbaert, and a section on automated layout segmentation by Kettunen et al.
  5. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study by Mark J. Hill and Simon Hengchen
    This article assesses the impact of OCR quality on a set of analytical tasks (topic modelling, authorship attribution, collocation analysis, vector space modelling). The research used ECCO data, and the article's bibliography points to relevant literature on the topic.
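
To make the effect of "dirty OCR" on keyword search concrete, here is a minimal Python sketch. It is not taken from any of the publications above: the two sample sentences, the helper names `exact_hits` and `fuzzy_hits`, and the 0.8 similarity threshold are all assumptions made for illustration. It compares exact keyword matching on a clean transcription with the same search on a simulated noisy OCR output, and then shows how approximate matching can recover some of the missed terms.

```python
from difflib import SequenceMatcher

# Hypothetical sample: the same sentence as a clean transcription
# and as simulated OCR output with typical character confusions.
clean = "the library digitised a large collection of historical newspapers"
noisy = "the 1ibrary digitifed a large colleclion of hiftorical newspapers"

keywords = ["library", "collection", "historical", "newspapers"]


def exact_hits(text, terms):
    """Return the terms that occur as exact tokens in the text."""
    tokens = set(text.split())
    return [term for term in terms if term in tokens]


def fuzzy_hits(text, terms, threshold=0.8):
    """Return the terms that approximately match at least one token.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    tokens = text.split()
    return [
        term
        for term in terms
        if any(SequenceMatcher(None, term, tok).ratio() >= threshold for tok in tokens)
    ]


print("exact, clean:", exact_hits(clean, keywords))
print("exact, noisy:", exact_hits(noisy, keywords))
print("fuzzy, noisy:", fuzzy_hits(noisy, keywords))
```

On this toy input, exact search on the noisy text finds only "newspapers", while the approximate search finds all four keywords. Real collections are far messier, and approximate matching also introduces false positives, so this should be read as an illustration of the problem rather than a solution.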

This list is by no means exhaustive, but it should give you an insight into current practices around OCR and HTR, how they can be applied in a digital library and what limitations they have. It may also spark an interest in how you can improve your existing recognised texts, something which we as a Working Group would be happy to talk about.

If you have suggestions for more literature, please share them with us (lotte.wilms@kb.nl and/or marian.lefferts@cerl.org) or add them to our Zotero library. You may also find it interesting to refer to previous Reading Lists assembled by this Working Group.