An OCR for classical Indic documents containing arbitrarily long words

A Dwivedi, R Saluja… - Proceedings of the …, 2020 - openaccess.thecvf.com

Proceedings of the IEEE/CVF Conference on Computer Vision and …, 2020 • openaccess.thecvf.com

Abstract

OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. The difficulties include image degradation, a lack of datasets, and very long words. As a result, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at the line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
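The two metrics quoted in the abstract are commonly computed from edit distance, and a small sketch shows why long words make the word error rate so much harsher than the character error rate. The version below is an illustration under assumptions, not the authors' evaluation script: the function names are hypothetical, words are split on whitespace, and characters are counted as Unicode code points rather than Devanagari grapheme clusters. The paper may define or normalize these metrics differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance divided by total reference length, over all lines."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return errors / total if total else 0.0


def cer(refs, hyps):
    """Character error rate (assumption: one token per Unicode code point)."""
    return error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word error rate (assumption: words are whitespace-separated)."""
    return error_rate(refs, hyps, str.split)


if __name__ == "__main__":
    # Romanized toy line; the hypothesis drops one character in the second word.
    refs = ["dharmakshetre kurukshetre"]
    hyps = ["dharmakshetre kuruksetre"]
    print(f"CER = {cer(refs, hyps):.2%}, WER = {wer(refs, hyps):.2%}")
```

In the toy line, one dropped character out of 25 gives a CER of 4%, yet it invalidates one of only two words, so the WER is 50%. The longer and more conjoined the words, the wider this gap becomes, which is consistent with the abstract reporting a much higher word error rate than character error rate.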

Cited by 21 · Related articles · All 11 versions
