Using ghost edges for classification in sparsely labeled networks

B Gallagher, H Tong, T Eliassi-Rad… - Proceedings of the 14th …, 2008 - dl.acm.org
Proceedings of the 14th ACM SIGKDD international conference on Knowledge …, 2008 - dl.acm.org
We address the problem of classification in partially labeled networks (a.k.a. within-network classification) where observed class labels are sparse. Techniques for statistical relational learning have been shown to perform well on network classification tasks by exploiting dependencies between class labels of neighboring nodes. However, relational classifiers can fail when unlabeled nodes have too few labeled neighbors to support learning (during training phase) and/or inference (during testing phase). This situation arises in real-world problems when observed labels are sparse.
In this paper, we propose a novel approach to within-network classification that combines aspects of statistical relational learning and semi-supervised learning to improve classification performance in sparse networks. Our approach works by adding "ghost edges" to a network, which enable the flow of information from labeled to unlabeled nodes. Through experiments on real-world data sets, we demonstrate that our approach performs well across a range of conditions where existing approaches, such as collective classification and semi-supervised learning, fail. On all tasks, our approach improves area under the ROC curve (AUC) by up to 15 points over existing approaches. Furthermore, we demonstrate that our approach runs in time proportional to LE, where L is the number of labeled nodes and E is the number of edges.
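To make the ghost-edge idea concrete, here is a minimal sketch under stated assumptions. The abstract does not say how ghost edges are scored or how a classifier uses them, so this example assumes a random walk with restart from each labeled node as the proximity measure and a simple proximity-weighted vote as the classifier. The function names `rwr_scores` and `ghost_edge_predict`, and the `restart` and `iters` parameters, are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def rwr_scores(adj, source, restart=0.15, iters=50):
    """Proximity of every node to `source` via random walk with restart.

    `adj` maps each node to a list of its neighbors (undirected graph).
    Each iteration visits every edge once, so one call costs O(iters * E).
    """
    score = {v: 0.0 for v in adj}
    score[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            if not nbrs:
                # Isolated node: return its mass to the source.
                nxt[source] += (1 - restart) * score[u]
                continue
            share = (1 - restart) * score[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[source] += restart  # the walker restarts at the labeled node
        score = nxt
    return score


def ghost_edge_predict(adj, labels, restart=0.15, iters=50):
    """Assumed classifier: each labeled node casts a proximity-weighted vote.

    The pair (labeled node, unlabeled node) plays the role of a weighted
    ghost edge. One walk is run per labeled node, i.e. L walks in total.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for src, label in labels.items():
        proximity = rwr_scores(adj, src, restart, iters)
        for node, weight in proximity.items():
            if node not in labels:
                votes[node][label] += weight
    return {node: max(tally, key=tally.get) for node, tally in votes.items()}


# Usage: a six-node path with one labeled node at each end.
path = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}
predictions = ghost_edge_predict(path, {"a": "pos", "f": "neg"})
```

In this sketch, nodes `c` and `d` have no labeled neighbors at all, yet they still receive label information through the proximity scores. Running one walk per labeled node, with each iteration touching every edge, is at least consistent with the stated running time proportional to LE, though the paper's actual procedure may differ.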