Assessing the impact of OCR quality on downstream NLP tasks

D Van Strien, K Beelen, MC Ardanuy, K Hosseini… - 2020 - repository.cam.ac.uk
A growing volume of heritage data is being digitized and made available as text via optical
character recognition (OCR). Scholars and libraries are increasingly using OCR-generated …

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

MJ Hill, S Hengchen - Digital Scholarship in the Humanities, 2019 - academic.oup.com
This article aims to quantify the impact optical character recognition (OCR) has on the
quantitative analysis of historical documents. Using Eighteenth Century Collections Online …

Dialect corpora from YouTube

S Coats - Language and linguistics in a complex world, 2023 - degruyter.com
This paper introduces two new large corpora comprised of YouTube Automatic Speech
Recognition (ASR) transcripts of the speech of videos from geographically localized …

From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction

M Hämäläinen, S Hengchen - arXiv preprint arXiv:1910.05535, 2019 - arxiv.org
A great deal of historical corpora suffer from errors introduced by the OCR (optical character
recognition) methods used in the digitization process. Correcting these errors manually is a …

An unsupervised method for OCR post-correction and spelling normalisation for Finnish

Q Duong, M Hämäläinen, S Hengchen - arXiv preprint arXiv:2011.03502, 2020 - arxiv.org
Historical corpora are known to contain errors introduced by OCR (optical character
recognition) methods used in the digitization process, often said to be degrading the …

PNRank: Unsupervised ranking of person name entities from noisy OCR text

H Dutta, A Gupta - Decision Support Systems, 2022 - Elsevier
Text databases have grown tremendously in number, size, and volume over the last few
decades. Optical Character Recognition (OCR) software is used to scan the text and make …

Challenging stylometry: The authorship of the baroque play La Segunda Celestina

L Hernández-Lorenzo, J Byszuk - Digital Scholarship in the …, 2023 - academic.oup.com
The aim of this study was to verify the possibility of Sor Juana Inés de la Cruz authoring the
anonymous part of the baroque play La Segunda Celestina, commissioned to Agustín de …

Noisy medieval data, from digitized manuscript to stylometric analysis: Evaluating Paul Meyer's hagiographic hypothesis

JB Camps, T Clérice, A Pinche - Digital Scholarship in the …, 2021 - academic.oup.com
Stylometric analysis of medieval vernacular texts is still a significant challenge: the
importance of scribal variation, be it spelling or more substantial, as well as the variants and …

Pruning decision rules by reduct-based weighting and ranking of features

U Stańczyk - Entropy, 2022 - mdpi.com
Methods and techniques of feature selection support expert domain knowledge in the
search for attributes, which are the most important for a task. These approaches can also be …

Data irregularities in discretisation of test sets used for evaluation of classification systems: A case study on authorship attribution

U Stańczyk, B Zielosko - Bulletin of the Polish Academy of …, 2021 - yadda.icm.edu.pl
When patterns to be recognised are described by features of continuous type, discretisation
becomes either an optional or necessary step in the initial data pre-processing stage …