Authors
Yasir Hussain, Zhiqiu Huang, Yu Zhou, Izhar Ahmed Khan, Nasrullah Khan, Muhammad Zahid Abbas
Publication date
2023/6/14
Book
Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering
Pages
398-405
Description
Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, utilizing either an open or closed vocabulary system. The selection of a tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance the tokenization performance. The proposed tokenizer employs a hybrid approach that initializes with a global vocabulary based on the most frequent unigrams and incrementally builds an open-vocabulary system. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE …
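The hybrid idea in the description (a global vocabulary of frequent unigrams, with an open-vocabulary fallback) can be illustrated with a minimal sketch. This is not the authors' tokenizer: the regex lexer, the function names `lex`, `build_global_vocab`, and `tokenize`, and the character-level fallback are all assumptions chosen for illustration only.

```python
from collections import Counter
import re

# Hypothetical lexer: identifiers, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def lex(code):
    """Split a source snippet into coarse lexical units."""
    return TOKEN_RE.findall(code)


def build_global_vocab(corpus, size):
    """Collect the `size` most frequent unigrams across the corpus."""
    counts = Counter(tok for snippet in corpus for tok in lex(snippet))
    return {tok for tok, _ in counts.most_common(size)}


def tokenize(code, vocab):
    """Keep in-vocabulary units whole; split the rest into characters.

    The character fallback is an assumed stand-in for an open vocabulary:
    it guarantees that no unit ever maps to an unknown token.
    """
    tokens = []
    for tok in lex(code):
        if tok in vocab:
            tokens.append(tok)
        else:
            tokens.extend(tok)  # iterating a str yields its characters
    return tokens


if __name__ == "__main__":
    corpus = ["int count = 0;", "int total = count + 1;"]
    vocab = build_global_vocab(corpus, 4)
    print(tokenize("int total = 0;", vocab))
```

In this toy setup the frequent units stay whole while rarer ones such as `total` are broken into characters, which shows how the choice of vocabulary changes the number of tokens produced for the same snippet.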
Scholar articles
Y Hussain, Z Huang, Y Zhou, IA Khan, N Khan… - Proceedings of the 27th International Conference on …, 2023