Word Fragments Based Arabic Language Identification

H El-Shishiny, A Troussov, DJ McCloskey, M Takeuchi, A Nevidomsky, P Volkov
Proceedings from NEMLAR International Conference on Arabic Language …, 2004
Abstract
We efficiently discriminate between Arabic and other languages written in Arabic script using a method based on word fragments. The method combines features characteristic of Arabic, namely function words, prefixes, suffixes, and unigrams representing the character set of Arabic script. On 180 samples selected randomly from the Internet and representing six Arabic-script languages (Arabic, Persian, Urdu, Pashto, Kurdish, and Uighur), the method achieved 94% recall and 94% precision for Arabic language identification. A key advantage of this approach is that the language model used for identification is transparent and can be tuned and enhanced using linguistic expertise.
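
To make the idea of a transparent, feature-based model concrete, here is a minimal sketch of how the four feature types named in the abstract (function words, prefixes, suffixes, character unigrams) might be combined into a score. Everything specific in it is an assumption for illustration: the word and affix lists, the weights, the threshold, and the function names `arabic_score` and `is_arabic` are hypothetical and are not taken from the paper.

```python
# Illustrative feature lists (hypothetical, not the paper's actual model).
FUNCTION_WORDS = {"في", "من", "على", "إلى", "هذا", "التي", "الذي", "أن"}
PREFIXES = ("ال", "وال", "بال")   # definite article, alone or after a clitic
SUFFIXES = ("ات", "ون", "ين", "ة")  # common plural and feminine endings

# Unigram evidence against Arabic: letters used by the other Arabic-script
# languages in the abstract but not by standard Arabic.
NON_ARABIC_LETTERS = set(
    "\u067e\u0686\u0698\u06af"  # Persian: pe, che, zhe, gaf
    "\u0679\u0688\u0691\u06d2"  # Urdu: tte, ddal, rre, yeh barree
    "\u069a\u0689"              # Pashto: sheen with dots, dal with ring
    "\u06c6\u06ce"              # Kurdish: oe, yeh with small v
    "\u06c7\u06c8"              # Uighur: u, yu
)


def arabic_score(text: str) -> float:
    """Return a per-word score; higher means more Arabic-like."""
    words = text.split()
    if not words:
        return 0.0
    function_hits = sum(w in FUNCTION_WORDS for w in words)
    prefix_hits = sum(w.startswith(PREFIXES) for w in words)
    suffix_hits = sum(w.endswith(SUFFIXES) for w in words)
    foreign_hits = sum(ch in NON_ARABIC_LETTERS for ch in text)
    # Assumed weights: function words count double, and each
    # non-Arabic letter is penalised heavily.
    raw = 2.0 * function_hits + prefix_hits + suffix_hits - 3.0 * foreign_hits
    return raw / len(words)


def is_arabic(text: str, threshold: float = 0.5) -> bool:
    """Classify text as Arabic if its score reaches the assumed threshold."""
    return arabic_score(text) >= threshold


if __name__ == "__main__":
    print(is_arabic("ذهب الولد إلى المدرسة في الصباح"))  # Arabic sentence
    print(is_arabic("پدر من چای گرم دوست دارد"))          # Persian sentence
```

Because each feature list and weight is explicit, a linguist could inspect and adjust them directly, which is the kind of tuning the abstract refers to. A real system would need far larger lists and weights fitted to data.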