Semantically-grounded construction of centroids for datasets with textual attributes

S Martínez, A Valls, D Sánchez - Knowledge-Based Systems, 2012 - Elsevier
Centroids are key components in many data analysis algorithms, such as clustering and microaggregation. A centroid is the central value that minimises the distance to all the objects in a dataset or cluster. Methods for centroid construction mainly target datasets with numerical and categorical attributes, focusing on the numerical and distributional properties of the data. Textual attributes, by contrast, consist of term lists referring to concepts with specific semantic content (i.e., meaning), which cannot be evaluated by means of classical numerical operators. Hence, the centroid of a dataset with textual attributes should be the term that minimises the semantic distance to the members of the set. Semantically-grounded methods for constructing centroids of datasets with textual attributes are scarce and, as discussed in this paper, they are hampered by their limited semantic analysis of the data. In this paper, we propose a method that exploits the knowledge provided by background ontologies (such as WordNet) to construct the centroid of multivariate datasets described by textual attributes. Special effort has been put into minimising the semantic distance between the centroid and the input data. As a result, our method provides optimal centroids (i.e., those that minimise the distance to all the objects in the dataset) with respect to the exploited background ontology and a semantic similarity measure. Our proposal has been evaluated on a real dataset consisting of short textual answers provided by visitors to a natural park. Results show that our centroids retain the semantic content of the input data better than those of related works.
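
The idea of a centroid as the term that minimises the summed semantic distance can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy taxonomy `PARENT` stands in for a background ontology such as WordNet, the edge-counting `path_distance` stands in for whichever semantic similarity measure is used, the candidate set (input terms plus their ancestors) is one plausible choice, and only a single textual attribute is handled rather than the multivariate case.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a stand-in for a real ontology.
PARENT = {
    "hiking": "sport",
    "cycling": "sport",
    "sport": "activity",
    "birdwatching": "observation",
    "observation": "activity",
    "activity": None,  # root
}


def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_distance(a, b):
    """Count taxonomy edges between a and b through their lowest common ancestor."""
    steps_from_b = {t: i for i, t in enumerate(ancestors(b))}
    for steps_from_a, t in enumerate(ancestors(a)):
        if t in steps_from_b:
            return steps_from_a + steps_from_b[t]
    raise ValueError("terms share no ancestor")


def semantic_centroid(terms):
    """Return (candidate, total distance) minimising the summed distance to terms.

    Candidates are the input terms and all of their ancestors (an assumption).
    Ties are broken alphabetically because candidates are sorted.
    """
    counts = Counter(terms)
    candidates = sorted({c for t in counts for c in ancestors(t)})

    def total(candidate):
        return sum(n * path_distance(candidate, t) for t, n in counts.items())

    best = min(candidates, key=total)
    return best, total(best)


print(semantic_centroid(["hiking", "cycling", "birdwatching"]))
```

In this toy setting the selected centroid is `sport`, a concept that does not appear among the input terms, which is why the candidate set includes ancestors rather than only the observed values.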