Output-optimal massively parallel algorithms for similarity joins

X Hu, K Yi, Y Tao - ACM Transactions on Database Systems (TODS), 2019 - dl.acm.org
Parallel join algorithms have received much attention in recent years due to the rapid
development of massively parallel systems such as MapReduce and Spark. In the database …

Adaptive distributed streaming similarity joins

G Siachamis, K Psarakis, M Fragkoulis… - Proceedings of the 17th …, 2023 - dl.acm.org
How can we perform similarity joins of multi-dimensional streams in a distributed fashion,
achieving low latency? Can we adaptively repartition those streams in order to retain high …

Distance-sensitive hashing

M Aumüller, T Christiani, R Pagh… - Proceedings of the 37th …, 2018 - dl.acm.org
Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy
or uncertain data, for example in connection with data cleaning (similarity join) and noise …

Efficient set containment join

J Yang, W Zhang, S Yang, Y Zhang, X Lin, L Yuan - The VLDB Journal, 2018 - Springer
In this paper, we study the problem of set containment join. Given two collections R and S
of records, the set containment join R ⋈_⊆ S retrieves all record pairs {(r, s)} ∈ R …

Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality

X Tang, F Zhang, S Zhang, Y Liu, B He, B He… - Proceedings of the …, 2024 - dl.acm.org
Sampling is one of the most widely employed approximations in big data processing.
Among various challenges in sampling design, sampling for join is particularly intriguing yet …

An industrial dynamic skyline based similarity joins for multidimensional big data applications

B Yin, X Wei, J Wang, N Xiong… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
In the era of data deluge, data analysis has become a key task for many industrial
applications, e.g., master data management and data integration. In particular, similarity join …

Instance and output optimal parallel algorithms for acyclic joins

X Hu, K Yi - Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI …, 2019 - dl.acm.org
Massively parallel join algorithms have received much attention in recent years, while most
prior work has focused on worst-optimal algorithms. However, the worst-case optimality of …

A scalable similarity join algorithm based on MapReduce and LSH

S Rivault, M Bamha, S Limet, S Robert - International Journal of Parallel …, 2022 - Springer
Similarity joins are recognized to be among the most useful data processing and analysis
operations. A similarity join is used to retrieve all data pairs whose distances are smaller …

Jodes: Efficient Oblivious Join in the Distributed Setting

Y Wang, X Zeng, S Wang, F Li - arXiv preprint arXiv:2501.09334, 2025 - arxiv.org
Trusted execution environment (TEE) has provided an isolated and secure environment for
building cloud-based analytic systems, but it still suffers from access pattern leakages …

Massively parallel entity matching with linear classification in low dimensional space

Y Tao - 21st International Conference on Database Theory …, 2018 - drops.dagstuhl.de
In entity matching classification, we are given two sets R and S of objects where whether r
and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D …