Authors
Ho-pong Leung, Fu-lai Chung, Stephen CF Chan, Robert Luk
Publication date
2005/4/8
Workshop paper
International Workshop on Challenges in Web Information Retrieval and Integration
Pages
91-96
Publisher
IEEE
Description
XML is becoming a common way of storing data. The elements and their arrangement in the document’s hierarchy not only describe the document structure but also imply the data’s semantic meaning, and hence provide valuable information to develop tools for manipulating XML documents. In this paper, we pursue a data mining approach to the problem of XML document clustering. We introduce a novel XML structural representation called common XPath (CXP), which encodes the frequently occurring elements with the hierarchical information, and propose to take the CXPs mined to form the feature vectors for XML document clustering. In other words, data mining acts as a feature extractor in the clustering process. Based on this idea, we devise a path-based XML document clustering algorithm called PBClustering which groups the documents according to their CXPs, i.e. their frequent structures. Encouraging …
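The pipeline the abstract describes (mine frequent paths, use them as features, then cluster) can be illustrated with a minimal sketch. This is not the paper's CXP mining or its PBClustering algorithm, whose details the abstract does not give. It assumes that a "frequent path" is any root-to-element tag path appearing in at least a chosen fraction of documents, and it swaps in a simple greedy Jaccard grouping for the clustering step. All function names, the sample documents, and the thresholds are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def frequent_paths(docs_paths, min_support):
    """Keep paths present in at least min_support (a fraction) of documents.

    Assumption: a stand-in for frequent-structure mining, not the paper's CXP definition.
    """
    counts = Counter(p for paths in docs_paths for p in paths)
    threshold = min_support * len(docs_paths)
    return sorted(p for p, c in counts.items() if c >= threshold)


def feature_vectors(docs_paths, features):
    """Encode each document as a binary vector over the frequent paths."""
    return [[1 if f in paths else 0 for f in features] for paths in docs_paths]


def jaccard(u, v):
    """Jaccard similarity of two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 1.0


def greedy_cluster(vectors, min_similarity):
    """Put each vector in the first cluster whose first member is similar enough.

    Assumption: a simple substitute for the paper's clustering step.
    """
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if jaccard(vectors[cluster[0]], vec) >= min_similarity:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical sample documents, for illustration only.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<article><title/><journal/></article>",
    "<article><title/><journal/><volume/></article>",
]
docs_paths = [element_paths(d) for d in docs]
features = frequent_paths(docs_paths, min_support=0.5)
vectors = feature_vectors(docs_paths, features)
clusters = greedy_cluster(vectors, min_similarity=0.5)
print(features)
print(clusters)
```

In this toy input, paths that occur in only one document (such as `/book/year`) fall below the support threshold and are dropped, so the two book documents and the two article documents end up with identical feature vectors and land in separate groups.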
Total citations
[Chart: citations per year, 2006 to 2021; individual yearly counts not recoverable]
Scholar articles
H Leung, F Chung, SCF Chan, R Luk - International Workshop on Challenges in Web …, 2005