Attention in natural language processing

A Galassi, M Lippi, P Torroni - IEEE transactions on neural …, 2020 - ieeexplore.ieee.org
Attention is an increasingly popular mechanism used in a wide range of neural
architectures. The mechanism itself has been realized in a variety of formats. However …

Survey of post-OCR processing approaches

TTH Nguyen, A Jatowt, M Coustaty… - ACM Computing Surveys …, 2021 - dl.acm.org
Optical character recognition (OCR) is one of the most popular techniques used for
converting printed documents into machine-readable ones. While OCR engines can do well …

Optical character recognition with neural networks and post-correction with finite state methods

S Drobac, K Lindén - International Journal on Document Analysis and …, 2020 - Springer
The optical character recognition (OCR) quality of the historical part of the Finnish
newspaper and journal corpus is rather low for reliable search and scientific research on the …

CRASpell: A contextual typo robust approach to improve Chinese spelling correction

S Liu, S Song, T Yue, T Yang, H Cai… - Findings of the …, 2022 - aclanthology.org
Recently, Bert-based models have dominated the research of Chinese spelling correction
(CSC). These methods have two limitations: (1) they have poor performance on multi-typo …

OCR post correction for endangered language texts

S Rijhwani, A Anastasopoulos, G Neubig - arXiv preprint arXiv …, 2020 - arxiv.org
There is little to no data available to build natural language processing models for most
endangered languages. However, textual data in these languages often exists in formats …

From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction

M Hämäläinen, S Hengchen - arXiv preprint arXiv:1910.05535, 2019 - arxiv.org
A great deal of historical corpora suffer from errors introduced by the OCR (optical character
recognition) methods used in the digitization process. Correcting these errors manually is a …

Neural OCR post-hoc correction of historical corpora

L Lyu, M Koutraki, M Krickl, B Fetahu - Transactions of the Association …, 2021 - direct.mit.edu
Optical character recognition (OCR) is crucial for a deeper access to historical collections.
OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new …

Lexically aware semi-supervised learning for OCR post-correction

S Rijhwani, D Rosenblum… - Transactions of the …, 2021 - direct.mit.edu
Much of the existing linguistic data in many languages of the world is locked away in
non-digitized books and documents. Optical character recognition (OCR) can be used to produce …

MCRSpell: A metric learning of correct representation for Chinese spelling correction

C Li, M Zhang, X Zhang, Y Yan - Expert Systems with Applications, 2024 - Elsevier
Chinese spelling correction (CSC) is a difficult but gratifying work that not only helps
individuals read and understand the material in their daily lives but also serves as pre …

An unsupervised method for OCR post-correction and spelling normalisation for Finnish

Q Duong, M Hämäläinen, S Hengchen - arXiv preprint arXiv:2011.03502, 2020 - arxiv.org
Historical corpora are known to contain errors introduced by OCR (optical character
recognition) methods used in the digitization process, often said to be degrading the …