Getting the most out of your tokenizer for pre-training and domain adaptation

G Dagan, G Synnaeve, B Roziere - arXiv preprint arXiv:2402.01035, 2024 - arxiv.org
Tokenization is an understudied and often neglected component of modern LLMs. Most
published works use a single tokenizer for all experiments, often borrowed from another …
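
As a concrete, deliberately generic illustration of fitting a tokenizer to in-domain data (not the paper's exact recipe), the sketch below trains a byte-level BPE tokenizer with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and special token are assumptions made for the example.

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Minimal sketch: fit a byte-level BPE tokenizer on raw in-domain text.
    # "domain_corpus.txt" is a hypothetical file of in-domain documents.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

    trainer = trainers.BpeTrainer(
        vocab_size=32_000,                      # assumed budget, not a setting from the paper
        special_tokens=["<|endoftext|>"],
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )

    def lines(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line

    tokenizer.train_from_iterator(lines("domain_corpus.txt"), trainer=trainer)
    tokenizer.save("domain_bpe.json")

    # Fewer tokens per document on held-out in-domain text is the usual sign of a better fit.
    print(len(tokenizer.encode("def fibonacci(n): ...").ids))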

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

S Liu, N Deng, S Sabour, Y Jia, M Huang… - Proceedings of the …, 2023 - aclanthology.org
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the
specifics of a downstream task and enhance long-form generation in mental health. Inspired …

Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models

F Kunstner, R Yadav, A Milligan, M Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Adam has been shown to outperform gradient descent in optimizing large language
transformers empirically, and by a larger margin than on other tasks, but it is unclear why this …
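
The argument hinges on next-token prediction being a classification problem with heavy-tailed class imbalance over the vocabulary; the short sketch below only makes that imbalance visible by counting token-class frequencies in an arbitrary tokenized corpus (the input file is a hypothetical dump of token ids, not data from the paper).

    from collections import Counter

    # Illustration only: token classes in language data are heavily imbalanced.
    # "corpus_ids.txt" stands in for any whitespace-separated dump of token ids.
    token_ids = open("corpus_ids.txt").read().split()

    counts = Counter(token_ids)
    freqs = sorted(counts.values(), reverse=True)
    total = sum(freqs)

    head = sum(freqs[:100]) / total            # mass captured by the 100 most frequent classes
    tail = sum(1 for c in freqs if c < 10)     # number of classes seen fewer than 10 times

    print(f"{len(freqs)} distinct token classes")
    print(f"top-100 classes cover {head:.1%} of all occurrences")
    print(f"{tail} classes occur fewer than 10 times")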

A formal perspective on byte-pair encoding

V Zouhar, C Meister, JL Gastaldi, L Du, T Vieira… - arXiv preprint arXiv …, 2023 - arxiv.org
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite
being devised initially as a compression method. BPE appears to be a greedy algorithm at …
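
For reference, the greedy training loop the snippet alludes to fits in a few lines; this is a toy Python sketch of textbook BPE training (repeatedly merge the most frequent adjacent pair), not the formal treatment developed in the paper.

    from collections import Counter

    def train_bpe(words, num_merges):
        """Toy BPE trainer: `words` maps a word to its corpus count."""
        # Represent each word as a tuple of symbols (single characters to start with).
        vocab = {tuple(w): c for w, c in words.items()}
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, count in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += count
            if not pairs:
                break
            best = max(pairs, key=pairs.get)          # greedy choice: most frequent pair
            merges.append(best)
            merged = {}
            for symbols, count in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                merged[tuple(out)] = merged.get(tuple(out), 0) + count
            vocab = merged
        return merges

    print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))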

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

N Godey, É de la Clergerie, B Sagot - arXiv preprint arXiv:2309.08351, 2023 - arxiv.org
Self-supervised pre-training of language models usually consists in predicting probability
distributions over extensive token vocabularies. In this study, we propose an innovative …
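
One plausible reading of the title's contrastive weight tying, sketched below with heavy hedging: rather than producing vocabulary-sized logits, score each output representation against the input embeddings of the target tokens present in the batch, treating the other targets as negatives. The function name, temperature, and in-batch-negative setup are illustrative assumptions, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    # Sketch only (not the paper's exact formulation): contrast each hidden state
    # with the *input embedding* of its target token, using in-batch negatives,
    # so no vocabulary-sized output projection is needed.
    def contrastive_weight_tying_loss(hidden, targets, embedding, temperature=0.1):
        # hidden:    (N, d) output representations at positions with a target token
        # targets:   (N,)   target token ids
        # embedding: the model's input embedding table (nn.Embedding)
        anchors = F.normalize(hidden, dim=-1)                    # (N, d)
        positives = F.normalize(embedding(targets), dim=-1)      # (N, d)
        logits = anchors @ positives.T / temperature             # (N, N)
        labels = torch.arange(hidden.size(0), device=hidden.device)
        return F.cross_entropy(logits, labels)

    # Tiny smoke test with random tensors.
    emb = torch.nn.Embedding(1000, 64)
    h = torch.randn(8, 64)
    t = torch.randint(0, 1000, (8,))
    print(contrastive_weight_tying_loss(h, t, emb).item())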

The foundations of tokenization: Statistical and computational concerns

JL Gastaldi, J Terilla, L Malagutti, B DuSell… - arXiv preprint arXiv …, 2024 - arxiv.org
Tokenization, the practice of converting strings of characters from an alphabet into
sequences of tokens over a vocabulary, is a critical step in the NLP pipeline. The use of …
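
Stated compactly (notation mine, not necessarily the paper's), the prose definition amounts to a pair of maps over an alphabet Sigma and a vocabulary V:

    \[
    \tau : \Sigma^{*} \to V^{*}, \qquad \kappa : V^{*} \to \Sigma^{*},
    \]
    \[
    \text{with lossless tokenization requiring } \kappa\bigl(\tau(s)\bigr) = s \quad \text{for all } s \in \Sigma^{*},
    \]

where tau encodes a character string into a token sequence and kappa decodes it back.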

Scaling Experiments in Self-Supervised Cross-Table Representation Learning

M Schambach, D Paul, JS Otterbach - arXiv preprint arXiv:2309.17339, 2023 - arxiv.org
To analyze the scaling potential of deep tabular representation learning models, we
introduce a novel Transformer-based architecture specifically tailored to tabular data and …

Tokenization Is More Than Compression

CW Schmidt, V Reddy, H Zhang, A Alameddine… - arXiv preprint arXiv …, 2024 - arxiv.org
Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging
raw text and language models. Existing tokenization approaches like Byte-Pair Encoding …
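
The compression the title pushes back on is commonly measured as tokens per UTF-8 byte; a minimal sketch of that yardstick follows, reusing the hypothetical tokenizer file from the earlier example (any tokenizer whose encode call returns token ids would do, and the held-out file is likewise assumed).

    from tokenizers import Tokenizer

    tok = Tokenizer.from_file("domain_bpe.json")   # hypothetical tokenizer from the earlier sketch

    def tokens_per_byte(text: str) -> float:
        # Lower values mean the tokenizer "compresses" the text into fewer tokens.
        n_tokens = len(tok.encode(text).ids)
        n_bytes = len(text.encode("utf-8"))
        return n_tokens / n_bytes

    sample = open("heldout.txt", encoding="utf-8").read()   # hypothetical held-out text
    print(f"{tokens_per_byte(sample):.3f} tokens per byte")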

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

P Chizhov, C Arnett, E Korotkova… - Proceedings of the …, 2024 - aclanthology.org
Language models can greatly benefit from efficient tokenization. However, they still
mostly utilize the classical Byte-Pair Encoding (BPE) algorithm, a simple and reliable …

TMMLU+: An improved Traditional Chinese evaluation suite for foundation models

ZR Tam, YT Pai, YW Lee, HH Shuai… - First Conference on …, 2024 - openreview.net
We present TMMLU+, a new benchmark designed for Traditional Chinese language
understanding. TMMLU+ is a multi-choice question-answering dataset with 66 subjects from …
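
For orientation only, a generic way such multiple-choice suites are often scored (pick the option the model rates highest, report per-subject accuracy) is sketched below; score_option is a hypothetical hook, and this is not claimed to be TMMLU+'s official evaluation protocol.

    from collections import defaultdict

    # Generic multiple-choice scoring sketch, not necessarily TMMLU+'s protocol.
    # `score_option` is a hypothetical callable, e.g. a model log-likelihood of the option.
    def evaluate(items, score_option):
        correct, total = defaultdict(int), defaultdict(int)
        for item in items:
            # item: {"subject": str, "question": str, "options": [str, ...], "answer": int}
            scores = [score_option(item["question"], opt) for opt in item["options"]]
            pred = max(range(len(scores)), key=scores.__getitem__)
            total[item["subject"]] += 1
            correct[item["subject"]] += int(pred == item["answer"])
        return {s: correct[s] / total[s] for s in total}

    # Toy usage with a dummy scorer.
    dummy = [{"subject": "physics", "question": "…",
              "options": ["A", "B", "C", "D"], "answer": 2}]
    print(evaluate(dummy, score_option=lambda q, o: len(o)))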