Characterizing and selecting fresh data sources

T Rekatsinas, XL Dong, D Srivastava - Proceedings of the 2014 ACM …, 2014 - dl.acm.org
Proceedings of the 2014 ACM SIGMOD international conference on Management of …, 2014 - dl.acm.org
Data integration is a challenging task due to the large numbers of autonomous data sources. This necessitates the development of techniques to reason about the benefits and costs of acquiring and integrating data. Recently the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. The problem was studied for static sources and used the accuracy of data fusion to quantify the integration profit.
In this paper, we study the problem of source selection considering dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness and accuracy, to characterize the quality of integrated data. We show how statistical models for the evolution of sources can be used to estimate these metrics. While source selection is NP-complete, we show that near-optimal solutions can be found for a large class of practical cases. We propose an algorithmic framework with theoretical guarantees for our problem and show its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.
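To make the selection problem concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes a single snapshot of the world at some time t, measures quality by coverage only (freshness and accuracy are left out), and uses a plain greedy heuristic that adds a source while profit (coverage minus total cost) keeps increasing. The names `coverage`, `profit` and `greedy_select`, and the idea of per-source numeric costs on the same scale as coverage, are assumptions made for illustration.

```python
from typing import Dict, Set


def coverage(selected: Set[str], sources: Dict[str, Set[int]], world: Set[int]) -> float:
    """Fraction of the world's items (at one snapshot) found in the selected sources."""
    if not world:
        return 0.0
    covered: Set[int] = set()
    for name in selected:
        covered |= sources[name]
    return len(covered & world) / len(world)


def profit(selected: Set[str], sources: Dict[str, Set[int]],
           costs: Dict[str, float], world: Set[int]) -> float:
    """Hypothetical profit: coverage gain minus the summed cost of the selected sources."""
    return coverage(selected, sources, world) - sum(costs[s] for s in selected)


def greedy_select(sources: Dict[str, Set[int]], costs: Dict[str, float],
                  world: Set[int]) -> Set[str]:
    """Greedily add the source that raises profit the most; stop when none does."""
    selected: Set[str] = set()
    best_profit = profit(selected, sources, costs, world)
    while True:
        best_source = None
        # sorted() keeps the scan order deterministic across runs
        for candidate in sorted(set(sources) - selected):
            p = profit(selected | {candidate}, sources, costs, world)
            if p > best_profit:
                best_profit, best_source = p, candidate
        if best_source is None:
            return selected
        selected.add(best_source)
```

With a world of ten items and three hypothetical sources, the heuristic keeps the two sources whose added coverage outweighs their cost and skips a redundant, expensive one. A time-dependent variant would re-evaluate `coverage` against the world snapshot predicted for each future time point.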