Convolutional embedding for edit distance

Authors

Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng

Publication date

2020/7/25

Book

International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages

599-608

Description

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and …
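The description above mentions two ingredients: the edit distance being approximated, and a loss that combines a triplet term with an approximation error. The following is a minimal sketch of both, not the authors' implementation. The function names, the margin choice (the gap between the two edit distances), and the weighting parameter `alpha` are assumptions made for illustration; the truncated description does not specify them.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via two-row dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def combined_loss(emb_anchor, emb_pos, emb_neg, ed_pos, ed_neg, alpha=0.1):
    """Hypothetical combined loss: triplet term plus weighted approximation error.

    emb_* are embedding vectors; ed_pos and ed_neg are the true edit distances
    from the anchor string to the positive and negative strings.
    The margin and alpha are assumptions, not values from the paper.
    """
    d_pos = euclidean(emb_anchor, emb_pos)
    d_neg = euclidean(emb_anchor, emb_neg)
    margin = ed_neg - ed_pos
    # Triplet term: the negative should be farther than the positive by the margin.
    triplet = max(0.0, d_pos - d_neg + margin)
    # Approximation term: embedding distances should match the edit distances.
    approx = abs(d_pos - ed_pos) + abs(d_neg - ed_neg)
    return triplet + alpha * approx
```

In this sketch the loss is zero when the embedding distances reproduce the edit distances exactly, which is the sense in which edit distance would be "embedded" into Euclidean distance.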

Total citations

Cited by 27

Citations per year: 2020: 2, 2021: 5, 2022: 7, 2023: 7, 2024: 6

Scholar articles

Convolutional embedding for edit distance

X Dai, X Yan, K Zhou, Y Wang, H Yang, J Cheng - proceedings of the 43rd international ACM SIGIR …, 2020

Cited by 23, all 4 versions

Edit distance embedding using convolutional neural networks*

X Dai, X Yan, K Zhou, Y Wang, H Yang, J Cheng - arXiv preprint arXiv:2001.11692, 2020

Cited by 4