Optimized tokenization process for open-vocabulary code completion: An empirical study

Authors

Yasir Hussain, Zhiqiu Huang, Yu Zhou, Izhar Ahmed Khan, Nasrullah Khan, Muhammad Zahid Abbas

Publication date

2023/6/14

Book

Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering

Pages

398-405

Description

Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, utilizing either an open or closed vocabulary system. The selection of a tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance the tokenization performance. The proposed tokenizer employs a hybrid approach that initializes with a global vocabulary based on the most frequent unigrams and incrementally builds an open-vocabulary system. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE …
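The hybrid idea in the description (start from a global vocabulary of the most frequent units, then handle unseen units in an open-vocabulary way) can be illustrated with a minimal sketch. This is not the authors' tokenizer: the function names (`lex`, `build_global_vocab`, `tokenize`), the regular expression, the vocabulary size, and the greedy longest-match fallback to single characters are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed lexer: identifiers, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")


def lex(source):
    """Split source code into coarse lexical units."""
    return TOKEN_RE.findall(source)


def build_global_vocab(corpus, size):
    """Keep the `size` most frequent whole units as the initial vocabulary."""
    counts = Counter(unit for src in corpus for unit in lex(src))
    return {unit for unit, _ in counts.most_common(size)}


def tokenize(source, vocab):
    """Emit known units whole; split unknown units into known pieces or characters."""
    tokens = []
    for unit in lex(source):
        if unit in vocab:
            tokens.append(unit)
            continue
        i = 0
        while i < len(unit):
            # Try the longest piece first; a single character always succeeds,
            # so no unit is ever mapped to an "unknown" placeholder.
            for j in range(len(unit), i, -1):
                piece = unit[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens


# Hypothetical toy corpus; a real vocabulary would be built from a large code dataset.
corpus = ["count = count + 1", "total = total + count"]
vocab = build_global_vocab(corpus, size=4)  # {"count", "=", "+", "total"}

print(tokenize("count_total = 1", vocab))
# ['count', '_', 'total', '=', '1']
```

In this sketch the unseen identifier `count_total` is decomposed into pieces already in the vocabulary plus a single-character fallback, so the token sequence stays lossless while frequent units remain single tokens. How the published tokenizer grows its vocabulary incrementally is not specified in the truncated description above.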

Total citations

Cited by 3

Citations per year: 2024: 3
