Authors
Ehsan Darrudi, Mohamad R Hejazi, Farhad Oroumchian
Publication date
2004/2
Journal
Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID)
Pages
73-7
Description
The development of Language Engineering (LE) and Information Retrieval (IR) applications requires the availability of sizeable, reliable and representative corpora. This paper describes how we constructed a well-structured 345 MB tagged corpus of news, and presents some useful statistics of this corpus based on the characteristics of the Farsi language. It also examines in detail how well the frequency and rank of Farsi words fit Zipf-Mandelbrot’s law. We then present our measurement of the entropy of Farsi for this corpus.
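As a rough illustration of the two measurements the abstract mentions, the sketch below computes a rank-frequency list, evaluates the Zipf-Mandelbrot form f(r) = C / (r + b)^alpha, and estimates unigram entropy in bits per word. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the parameters C, b and alpha, the toy token list, and the choice of a unigram (single-word) entropy estimate are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return word frequencies sorted from most to least frequent (rank 1 first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)

def zipf_mandelbrot(rank, C, b, alpha):
    """Zipf-Mandelbrot prediction f(r) = C / (r + b)^alpha.

    C, b and alpha are free parameters that would normally be fitted
    to a corpus; no fitted values from the paper are assumed here.
    """
    return C / (rank + b) ** alpha

def unigram_entropy(tokens):
    """Shannon entropy (bits per word) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy placeholder tokens (not Farsi data, purely illustrative).
tokens = "a a a a b b c d".split()
freqs = rank_frequency(tokens)
entropy = unigram_entropy(tokens)
```

On a real corpus, one would compare `freqs` against `zipf_mandelbrot(rank, C, b, alpha)` for ranks 1, 2, 3, and so on, typically on log-log axes, and choose the parameters that minimize the discrepancy.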
Total citations
[Chart: citations per year, 2003 to 2023]
Scholar articles
E Darrudi, MR Hejazi, F Oroumchian - Proceedings of the 2nd Workshop on Information …, 2004