An OCR for classical Indic documents containing arbitrarily long words

A Dwivedi, R Saluja… - Proceedings of the IEEE/CVF Conference on Computer Vision and …, 2020 - openaccess.thecvf.com
Abstract
OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. The difficulties include image degradation, a lack of datasets, and very long words. Because of these challenges, available OCR systems, both academic and industrial, achieve low word accuracy on such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at the line level. To augment the real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. As a result, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
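The abstract reports a word error rate and a character error rate but does not spell out how they are computed. The sketch below assumes the common Levenshtein-based definitions (edit distance divided by reference length); the function names, the whitespace tokenization, and the transliterated sample strings are illustrative assumptions, not taken from the paper. It may help show why long words widen the gap between the two metrics: a single wrong character counts as one character error but invalidates the whole word.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(ref_line, hyp_line):
    # Assumption: one unit per Unicode code point. For Devanagari, an
    # evaluation might instead count grapheme clusters.
    return error_rate(list(ref_line), list(hyp_line))


def wer(ref_line, hyp_line):
    # Assumption: words are separated by whitespace.
    return error_rate(ref_line.split(), hyp_line.split())


# Hypothetical transliterated line with one dropped character.
reference = "dharma artha kama"
hypothesis = "dharma arta kama"
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.4f}")
```

Here one missing character out of 17 yields a much higher word-level error (one word out of three), and the effect grows as words get longer.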