Assessment of a modern Farsi corpus

E Darrudi, MR Hejazi, F Oroumchian - … of the 2nd Workshop on Information …, 2004 - Citeseer
Proceedings of the 2nd Workshop on Information Technology & its Disciplines …, 2004 - Citeseer
Abstract
The development of Language Engineering (LE) and Information Retrieval (IR) applications requires the availability of sizeable, reliable, and representative corpora. This paper describes how we constructed a well-structured 345 MB tagged corpus of news and presents useful statistics of this corpus based on the characteristics of the Farsi language. It also examines in detail how well the frequency and rank of Farsi words fit the Zipf-Mandelbrot law. We then present our measurement of the entropy of Farsi for this corpus.
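For readers unfamiliar with the two measurements named in the abstract, the following is a minimal illustrative sketch, not the authors' method. It assumes the common form of the Zipf-Mandelbrot law, f(r) = c / (r + b)^a, and a simple unigram Shannon entropy. The abstract does not say which entropy estimator or fitting procedure the paper uses, so the function names, the fixed offset `b`, and the log-log least-squares fit are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent."""
    return [count for _, count in Counter(tokens).most_common()]


def unigram_entropy(tokens):
    """Shannon entropy, in bits per word, of the unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_mandelbrot(rank, c, a, b):
    """Predicted frequency f(r) = c / (r + b) ** a (assumed form of the law)."""
    return c / (rank + b) ** a


def fit_exponent(freqs, b=0.0):
    """Fit log f = log c - a * log(r + b) by ordinary least squares.

    The offset b is held fixed (an assumption of this sketch); a fuller fit
    would also search over b. Returns the pair (a, c).
    """
    xs = [math.log(r + b) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)
```

Given a tokenized corpus, `fit_exponent(rank_frequency(tokens))` gives a rough exponent and scale that can be compared against the observed counts through `zipf_mandelbrot`, and `unigram_entropy(tokens)` gives a first-order entropy estimate.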