On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records: Experience from a R&D project

W Andrzejewski, B Bębel, P Boiński, R Wrembel - Information Systems, 2024 - Elsevier
Data stored in information systems are often erroneous. Duplicate data are one of the typical
error types. To discover and handle duplicates, the so-called deduplication methods are …

Unsupervised matching of data and text

N Ahmadi, H Sand, P Papotti - 2022 IEEE 38th International …, 2022 - ieeexplore.ieee.org
Entity resolution is a widely studied problem with several proposals to match records across
relations. Matching textual content is a widespread task in many applications, such as …

Data integration, cleaning, and deduplication: Research versus industrial projects

R Wrembel - … Conference on Information Integration and Web, 2022 - Springer
In business applications, data integration is typically implemented as a data warehouse
architecture. In this architecture, heterogeneous and distributed data sources are accessed …

Deep clustering for data cleaning and integration

HT Rauf, A Freitas, NW Paton - arXiv preprint arXiv:2305.13494, 2023 - arxiv.org
Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in
areas such as text and image processing, and there have been impactful results that deploy …

pyJedAI: A Library with Resolution-Related Structures and Procedures for Products

E Ioannou, K Nikoletos… - INFORMS Journal on …, 2024 - pubsonline.informs.org
This work presents an open-source Python library, named pyJedAI, which provides
functionalities supporting the creation of algorithms related to product entity resolution …

TableDC: Deep Clustering for Tabular Data

HT Rauf, A Freitas, NW Paton - arXiv preprint arXiv:2405.17723, 2024 - arxiv.org
Deep clustering (DC), a fusion of deep representation learning and clustering, has recently
demonstrated positive results in data science, particularly text processing and computer …
