An effective and cost-based framework for a qualitative hybrid data deduplication

CR Haruna, MS Hou, MJ Eghan… - Advances in Computer …, 2019 - Springer
Advances in Computer Communication and Computational Sciences: Proceedings of …, 2019 - Springer
Abstract
In the real world, an entity may occur several times in a database. These duplicates may have varying keys and/or contain errors, which makes deduplication a difficult task. Deduplication cannot be solved accurately using machine-based or crowdsourcing techniques alone. Crowdsourcing was used to address the shortcomings of machine-based approaches. Compared with machines, the crowd provided relatively accurate results, but was slow and very expensive. A hybrid technique for data deduplication using a Euclidean distance and a chromatic correlation clustering algorithm was presented. The technique aimed at reducing the crowdsourcing cost, reducing the time the crowd spends on deduplication, and providing higher deduplication accuracy. In the experiments, the proposed algorithm was compared with some existing techniques and outperformed some of them, offering high deduplication accuracy and efficiency while incurring low crowdsourcing cost.
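The abstract does not spell out how the machine and crowd steps are combined, so the following is only a rough sketch of the general hybrid idea, not the authors' algorithm. It assumes records are already encoded as numeric feature vectors and uses Euclidean distance to decide clear cases by machine, leaving only ambiguous pairs for the crowd. The function name `triage_pairs` and the thresholds `low` and `high` are hypothetical, and the chromatic correlation clustering step is not shown.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triage_pairs(records, low=1.0, high=3.0):
    """Split record pairs into machine-decided duplicates and a crowd queue.

    `low` and `high` are illustrative thresholds (assumptions, not from
    the paper): pairs at distance <= low are accepted as duplicates,
    pairs at distance >= high are treated as distinct, and everything in
    between is queued for crowd workers.
    """
    duplicates, crowd_queue = [], []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        d = euclidean(a, b)
        if d <= low:
            duplicates.append((i, j))
        elif d < high:
            crowd_queue.append((i, j))
        # d >= high: far apart, so no crowd question is spent on the pair
    return duplicates, crowd_queue


# Toy feature vectors; real records would need an encoding step first.
records = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (10.0, 10.0)]
duplicates, crowd_queue = triage_pairs(records)
```

Under these assumptions, only two of the six candidate pairs reach the crowd, which illustrates how a machine pass can lower crowdsourcing cost and time.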