Exploiting text structure for topic identification

T Nomoto, Y Matsumoto - Fourth Workshop on Very Large …, 1996 - aclanthology.org
Fourth Workshop on Very Large Corpora, 1996
Summary
The paper demonstrates how information on text structure can be used to improve the identification of topical words in texts, where identification is based on a probabilistic model of text categorization. We use texts that are not explicitly structured. A text structure is identified by measuring the similarity between the segments that make up the text and its title. A text structure identified in this way is shown to be a good indicator of which parts of the text are most relevant to its content. The value of exploiting structural information for topic identification is demonstrated by a set of experiments conducted on 19 MB of Japanese newspaper articles. The paper also brings concepts from Rhetorical Structure Theory (RST) to the statistical analysis of text structure. Finally, it is shown that information on text structure is more effective for large documents than for small ones.
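The summary does not say which similarity measure the authors use, so the following is only a minimal sketch of the general idea of scoring segments against a title. It assumes cosine similarity over raw word counts and a simple regex tokenizer; the function names (`bag_of_words`, `cosine`, `rank_segments_by_title`) are hypothetical and not taken from the paper. Japanese text would need a morphological analyzer in place of the regex tokenizer.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (assumed tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments_by_title(title, segments):
    """Return (score, segment_index) pairs, most title-like segment first."""
    title_vec = bag_of_words(title)
    scored = [
        (cosine(title_vec, bag_of_words(seg)), i)
        for i, seg in enumerate(segments)
    ]
    # Sort by descending score; ties keep the original segment order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    title = "bank raises interest rates"
    segments = [
        "The weather was sunny all week.",
        "The central bank raises interest rates again.",
        "Rates may rise further, analysts said.",
    ]
    for score, index in rank_segments_by_title(title, segments):
        print(index, round(score, 3))
```

Under these assumptions, segments that share more vocabulary with the title rank higher, which is one simple way to approximate "parts of the text most relevant to its content".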