Highly effective Arabic diacritization using sequence to sequence modeling

H Mubarak, A Abdelali, H Sajjad, Y Samih… - Proceedings of the …, 2019 - aclanthology.org
Proceedings of the 2019 Conference of the North American Chapter of …, 2019 - aclanthology.org
Abstract
Arabic text is typically written without short vowels (or diacritics). However, their presence is required for properly verbalizing Arabic and is hence essential for applications such as text to speech. There are two types of diacritics, namely core-word diacritics and case-endings. Most previous work on automatic Arabic diacritic recovery relies on a large number of manually engineered features, particularly for case-endings. In this work, we present a unified character-level sequence-to-sequence deep learning model that recovers both types of diacritics without the use of explicit feature engineering. Specifically, we employ a standard neural machine translation setup on overlapping windows of words (broken down into characters), and then we use voting to select the most likely diacritized form of each word. The proposed model outperforms all previous state-of-the-art systems. Our best settings achieve a word error rate (WER) of 4.49%, compared to the previous state-of-the-art WER of 12.25% on a standard dataset.
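The windowing and voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window size, the stride, the `_` word-boundary token, and the `model` callable (a stand-in for a trained sequence-to-sequence system that returns one diacritized form per input word) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def sliding_windows(words: Sequence[str], size: int, stride: int = 1) -> List[Tuple[int, List[str]]]:
    """Return (start_index, window) pairs of overlapping windows covering every word."""
    if len(words) <= size:
        return [(0, list(words))]
    starts = list(range(0, len(words) - size + 1, stride))
    # Make sure the final words are covered when the stride skips past them.
    if starts[-1] != len(words) - size:
        starts.append(len(words) - size)
    return [(s, list(words[s:s + size])) for s in starts]


def to_char_sequence(window: Sequence[str], boundary: str = "_") -> List[str]:
    """Break a window of words into characters, with an assumed boundary token between words."""
    chars: List[str] = []
    for i, word in enumerate(window):
        if i:
            chars.append(boundary)
        chars.extend(word)
    return chars


def diacritize_with_voting(
    words: Sequence[str],
    model: Callable[[List[str]], List[str]],  # hypothetical stand-in for the trained model
    size: int = 7,    # assumed window size, not taken from the source
    stride: int = 1,  # assumed stride, not taken from the source
) -> List[str]:
    """Run the model on each window and keep the most frequent prediction per word."""
    votes = [Counter() for _ in words]
    for start, window in sliding_windows(words, size, stride):
        predicted = model(window)
        if len(predicted) != len(window):
            continue  # skip outputs that cannot be aligned word by word
        for offset, form in enumerate(predicted):
            votes[start + offset][form] += 1
    # Fall back to the undiacritized word if no window produced a usable vote.
    return [v.most_common(1)[0][0] if v else w for v, w in zip(votes, words)]
```

Because each word appears in several windows, a single bad prediction in one context can be outvoted by the predictions from the other windows, which is the intuition behind the voting step. In a real system, `model` would consume the output of `to_char_sequence` and map the generated characters back to words.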