BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

P Chizhov, C Arnett, E Korotkova… - Proceedings of the …, 2024 - aclanthology.org
Abstract: Language models can greatly benefit from efficient tokenization. However, they still
mostly utilize the classical Byte-Pair Encoding (BPE) algorithm, a simple and reliable …