A survey on locality sensitive hashing algorithms and their applications

O Jafari, P Maurya, P Nagarkar, KM Islam… - arXiv preprint arXiv …, 2021 - arxiv.org
Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many
diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular …

Improving malicious URLs detection via feature engineering: Linear and nonlinear space transformation methods

T Li, G Kou, Y Peng - Information Systems, 2020 - Elsevier
In malicious URLs detection, traditional classifiers are challenged because the data volume
is huge, patterns are changing over time, and the correlations among features are …

A review for weighted minhash algorithms

W Wu, B Li, L Chen, J Gao… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Data similarity (or distance) computation is a fundamental research topic which underpins
many high-level applications based on similarity measures in machine learning and data …

Refining codes for locality sensitive hashing

H Liu, W Zhou, Z Wu, S Zhang, G Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Learning to hash is of particular interest in information retrieval for large-scale data due to its
high efficiency and effectiveness. Most studies in hashing concentrate on constructing new …

Serving deep learning models with deduplication from relational databases

L Zhou, J Chen, A Das, H Min, L Yu, M Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
There are significant benefits to serve deep learning models from relational databases. First,
features extracted from databases do not need to be transferred to any decoupled deep …

A fast LSH-based similarity search method for multivariate time series

C Yu, L Luo, LLH Chan, T Rakthanmanon… - Information Sciences, 2019 - Elsevier
Due to advances in mobile devices and sensors, there has been an increasing interest in
the analysis of multivariate time series. Identifying similar time series is a core subroutine in …

PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search

B Zheng, X Zhao, L Weng, QVH Nguyen, H Liu… - The VLDB Journal, 2022 - Springer
Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional
spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive …

An effective and scalable framework for authorship attribution query processing

R Sarwar, C Yu, N Tungare, K Chitavisutthivong… - IEEE …, 2018 - ieeexplore.ieee.org
Authorship attribution aims at identifying the original author of an anonymous text from a
given set of candidate authors and has a wide range of applications. The main challenge in …

Improved consistent weighted sampling revisited

W Wu, B Li, L Chen, C Zhang… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary
sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch …

A Survey on Efficient Processing of Similarity Queries over Neural Embeddings

Y Wang - arXiv preprint arXiv:2204.07922, 2022 - arxiv.org
Similarity query is the family of queries based on some similarity metrics. Unlike the
traditional database queries which are mostly based on value equality, similarity queries aim …