Finding multiword term candidates in Croatian

M Tadić, K Šojat - Information Extraction for Slavic Languages 2003 …, 2003 - croris.hr
Information Extraction for Slavic Languages 2003 Workshop, 2003
Abstract
The paper presents research on the statistical processing of a corpus of Croatian texts, with the primary aim of finding statistically significant co-occurrences of n-grams of tokens (digrams, trigrams and tetragrams). The collocations found with this method form a list of candidates for multiword terminological units, which is submitted to terminologists for further processing, i.e. manual selection of the “real terms”. The statistical measure of co-occurrence used is mutual information (MI3), accompanied by linguistic filters: stop-words and POS. The results on non-lemmatized material of a highly inflected language such as Croatian show that the MI measure alone is not sufficient to find a satisfactory number of multiword term candidates. In this case, the use of absolute frequency combined with linguistic filtering techniques gives a broader list of candidates for real terms.
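To make the scoring step concrete, here is a minimal Python sketch of ranking digram candidates with cubic mutual information and a stop-word filter. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so the commonly cited form MI3 = log2(P(x,y)^3 / (P(x) P(y))) is assumed, with probabilities estimated as raw frequency over corpus size. The function name `mi3_digrams`, the `min_freq` threshold, and the small stop-word list are hypothetical, and the POS filter mentioned in the abstract is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration only; not taken from the paper.
STOP_WORDS = {"i", "u", "na", "je", "se", "za", "da", "od", "o"}


def mi3_digrams(tokens, stop_words=STOP_WORDS, min_freq=2):
    """Rank adjacent token pairs by cubic mutual information (MI3).

    Assumed formula: MI3 = log2(P(x,y)^3 / (P(x) * P(y))),
    with each probability estimated as frequency / number of tokens.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    digrams = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (x, y), f_xy in digrams.items():
        # Frequency threshold: drop pairs seen fewer than min_freq times.
        if f_xy < min_freq:
            continue
        # Stop-word filter (assumed here to reject a pair if either token is a stop-word).
        if x in stop_words or y in stop_words:
            continue
        p_xy = f_xy / n
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy ** 3 / (p_x * p_y))

    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = "narodne novine i narodne novine u zakon o radu".split()
    for pair, score in mi3_digrams(sample):
        print(pair, round(score, 3))
```

Because the tokens are not lemmatized, each inflected form of a Croatian term is counted as a separate n-gram, which spreads the counts thin; this is consistent with the abstract's observation that absolute frequency plus linguistic filtering yields a broader candidate list than the MI measure alone.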