Corpus and lexicon-mutual incompletness

C Krstev, D Vitas - Proceedings of the Corpus Linguistics Conference, 2005
The natural language processing group (NLP group) at the Faculty of Mathematics, University of Belgrade has for many years been engaged in producing various language resources, both corpora and lexicons (Vitas et al. 2003). In the past, our main goal was to produce as many resources as possible in order to keep pace with the so-called "big" languages. Having produced resources of considerable size, we turned our attention to evaluating their quality. To support this process, we performed an experiment in which we applied the Serbian morphological dictionary to the corpus in order to establish:
a) The extent and content of the corpus lexis that is not covered by the e-dictionary. Here we try to determine what kinds of tools have to be developed for the recognition and tagging of unrecognized words such as derivatives, proper names, acronyms, foreign words, etc.
b) The part of the e-dictionary that is not covered by the lexis found in the corpus. We look for uncovered lemmas (for instance, to what extent the corpus covers the names of zoological species) and uncovered forms (for instance, whether the imperfect tense is really vanishing from contemporary Serbian), etc.
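The two-way comparison in (a) and (b) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name mutual_coverage, the lemma-to-forms dictionary layout, the naive tokenizer, and the three-lemma toy dictionary are all assumptions made for illustration. A real morphological e-dictionary also carries grammatical codes and must deal with homography, which this sketch ignores.

```python
import re

def mutual_coverage(corpus_text, dictionary):
    """Compare a corpus against a dictionary in both directions.

    `dictionary` maps each lemma to the set of its inflected forms.
    This layout is a hypothetical simplification, not the e-dictionary format.
    """
    # Naive tokenization: lowercase the text and take runs of word characters.
    tokens = set(re.findall(r"\w+", corpus_text.lower()))
    known_forms = {form for forms in dictionary.values() for form in forms}

    # (a) corpus words that the dictionary does not recognize
    unrecognized = tokens - known_forms
    # (b) dictionary forms that never occur in the corpus
    unattested_forms = known_forms - tokens
    # (b) lemmas with no form at all attested in the corpus
    unattested_lemmas = {
        lemma for lemma, forms in dictionary.items() if not (forms & tokens)
    }
    return unrecognized, unattested_forms, unattested_lemmas

# Toy data (illustrative only): three nouns with a few of their case forms.
toy_dictionary = {
    "kuća": {"kuća", "kuće", "kući", "kuću", "kućom"},
    "vuk": {"vuk", "vuka", "vuku"},
    "ris": {"ris", "risa"},
}
toy_corpus = "Vuk vidi kuću"

unrecognized, unattested_forms, unattested_lemmas = mutual_coverage(
    toy_corpus, toy_dictionary
)
```

On this toy input, the verb form "vidi" is the only unrecognized corpus word, and "ris" is the only lemma with no attested form, which mirrors the question of how well a corpus covers names of zoological species.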
In section 2 we discuss the structure of the Serbian monolingual corpus, its size, and the accessibility of the part that is available on the web. In section 3 we present our Serbian morphological e-dictionary. In section 4 we present the results of the analysis of the coverage of the corpus by the e-dictionary, while in section 5 we analyse the coverage of the e-dictionary by the corpus. Finally, in section 6 we give some concluding remarks, mainly concerning our future work on the further development of both the corpus and the e-dictionary on the basis of the results presented in this paper.