Getting the most out of your tokenizer for pre-training and domain adaptation

G Dagan, G Synnaeve, B Roziere - arXiv preprint arXiv:2402.01035, 2024 - arxiv.org
Tokenization is an understudied and often neglected component of modern LLMs. Most
published works use a single tokenizer for all experiments, often borrowed from another …
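
As a concrete, deliberately generic illustration of fitting a tokenizer to in-domain data (not the paper's exact recipe), the sketch below trains a byte-level BPE tokenizer with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and special token are assumptions made for the example.

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Minimal sketch: fit a byte-level BPE tokenizer on raw in-domain text.
    # "domain_corpus.txt" is a hypothetical file of in-domain documents.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

    trainer = trainers.BpeTrainer(
        vocab_size=32_000,                      # assumed budget, not a setting from the paper
        special_tokens=["<|endoftext|>"],
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )

    def lines(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line

    tokenizer.train_from_iterator(lines("domain_corpus.txt"), trainer=trainer)
    tokenizer.save("domain_bpe.json")

    # Fewer tokens per document on held-out in-domain text is the usual sign of a better fit.
    print(len(tokenizer.encode("def fibonacci(n): ...").ids))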

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

S Liu, N Deng, S Sabour, Y Jia, M Huang… - Proceedings of the …, 2023 - aclanthology.org
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the
specifics of a downstream task and enhance long-form generation in mental health. Inspired …

Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models

F Kunstner, R Yadav, A Milligan, M Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
Adam has been shown to outperform gradient descent in optimizing large language
transformers empirically, and by a larger margin than on other tasks, but it is unclear why this …
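
The argument hinges on next-token prediction being a classification problem with heavy-tailed class imbalance over the vocabulary; the short sketch below only makes that imbalance visible by counting token-class frequencies in an arbitrary tokenized corpus (the input file is a hypothetical dump of token ids, not data from the paper).

    from collections import Counter

    # Illustration only: token classes in language data are heavily imbalanced.
    # "corpus_ids.txt" stands in for any whitespace-separated dump of token ids.
    token_ids = open("corpus_ids.txt").read().split()

    counts = Counter(token_ids)
    freqs = sorted(counts.values(), reverse=True)
    total = sum(freqs)

    head = sum(freqs[:100]) / total            # mass captured by the 100 most frequent classes
    tail = sum(1 for c in freqs if c < 10)     # number of classes seen fewer than 10 times

    print(f"{len(freqs)} distinct token classes")
    print(f"top-100 classes cover {head:.1%} of all occurrences")
    print(f"{tail} classes occur fewer than 10 times")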

A formal perspective on byte-pair encoding

V Zouhar, C Meister, JL Gastaldi, L Du, T Vieira… - arXiv preprint arXiv …, 2023 - arxiv.org
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite
being devised initially as a compression method. BPE appears to be a greedy algorithm at …
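
For reference, the greedy training loop the snippet alludes to fits in a few lines; this is a toy Python sketch of textbook BPE training (repeatedly merge the most frequent adjacent pair), not the formal treatment developed in the paper.

    from collections import Counter

    def train_bpe(words, num_merges):
        """Toy BPE trainer: `words` maps a word to its corpus count."""
        # Represent each word as a tuple of symbols (single characters to start with).
        vocab = {tuple(w): c for w, c in words.items()}
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, count in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += count
            if not pairs:
                break
            best = max(pairs, key=pairs.get)          # greedy choice: most frequent pair
            merges.append(best)
            merged = {}
            for symbols, count in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                merged[tuple(out)] = merged.get(tuple(out), 0) + count
            vocab = merged
        return merges

    print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=5))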

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

N Godey, É de la Clergerie, B Sagot - arXiv preprint arXiv:2309.08351, 2023 - arxiv.org
Self-supervised pre-training of language models usually consists in predicting probability
distributions over extensive token vocabularies. In this study, we propose an innovative …
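
One plausible reading of the title's contrastive weight tying, sketched below with heavy hedging: rather than producing vocabulary-sized logits, score each output representation against the input embeddings of the target tokens present in the batch, treating the other targets as negatives. The function name, temperature, and in-batch-negative setup are illustrative assumptions, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    # Sketch only (not the paper's exact formulation): contrast each hidden state
    # with the *input embedding* of its target token, using in-batch negatives,
    # so no vocabulary-sized output projection is needed.
    def contrastive_weight_tying_loss(hidden, targets, embedding, temperature=0.1):
        # hidden:    (N, d) output representations at positions with a target token
        # targets:   (N,)   target token ids
        # embedding: the model's input embedding table (nn.Embedding)
        anchors = F.normalize(hidden, dim=-1)                    # (N, d)
        positives = F.normalize(embedding(targets), dim=-1)      # (N, d)
        logits = anchors @ positives.T / temperature             # (N, N)
        labels = torch.arange(hidden.size(0), device=hidden.device)
        return F.cross_entropy(logits, labels)

    # Tiny smoke test with random tensors.
    emb = torch.nn.Embedding(1000, 64)
    h = torch.randn(8, 64)
    t = torch.randint(0, 1000, (8,))
    print(contrastive_weight_tying_loss(h, t, emb).item())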

The foundations of tokenization: Statistical and computational concerns

JL Gastaldi, J Terilla, L Malagutti, B DuSell… - arXiv preprint arXiv …, 2024 - arxiv.org
Tokenization, the practice of converting strings of characters from an alphabet into
sequences of tokens over a vocabulary, is a critical step in the NLP pipeline. The use of …
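
Stated compactly (notation mine, not necessarily the paper's), the prose definition amounts to a pair of maps over an alphabet Sigma and a vocabulary V:

    \[
    \tau : \Sigma^{*} \to V^{*}, \qquad \kappa : V^{*} \to \Sigma^{*},
    \]
    \[
    \text{with lossless tokenization requiring } \kappa\bigl(\tau(s)\bigr) = s \quad \text{for all } s \in \Sigma^{*},
    \]

where tau encodes a character string into a token sequence and kappa decodes it back.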

Scaling Experiments in Self-Supervised Cross-Table Representation Learning

M Schambach, D Paul, JS Otterbach - arXiv preprint arXiv:2309.17339, 2023 - arxiv.org
To analyze the scaling potential of deep tabular representation learning models, we
introduce a novel Transformer-based architecture specifically tailored to tabular data and …

Tokenization Is More Than Compression

CW Schmidt, V Reddy, H Zhang, A Alameddine… - arXiv preprint arXiv …, 2024 - arxiv.org
Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging
raw text and language models. Existing tokenization approaches like Byte-Pair Encoding …
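
The compression the title pushes back on is commonly measured as tokens per UTF-8 byte; a minimal sketch of that yardstick follows, reusing the hypothetical tokenizer file from the earlier example (any tokenizer whose encode call returns token ids would do, and the held-out file is likewise assumed).

    from tokenizers import Tokenizer

    tok = Tokenizer.from_file("domain_bpe.json")   # hypothetical tokenizer from the earlier sketch

    def tokens_per_byte(text: str) -> float:
        # Lower values mean the tokenizer "compresses" the text into fewer tokens.
        n_tokens = len(tok.encode(text).ids)
        n_bytes = len(text.encode("utf-8"))
        return n_tokens / n_bytes

    sample = open("heldout.txt", encoding="utf-8").read()   # hypothetical held-out text
    print(f"{tokens_per_byte(sample):.3f} tokens per byte")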

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

P Chizhov, C Arnett, E Korotkova… - Proceedings of the …, 2024 - aclanthology.org
Language models can greatly benefit from efficient tokenization. However, they still
mostly utilize the classical Byte-Pair Encoding (BPE) algorithm, a simple and reliable …

TMMLU+: An improved Traditional Chinese evaluation suite for foundation models

ZR Tam, YT Pai, YW Lee, HH Shuai… - First Conference on …, 2024 - openreview.net
We present TMMLU+, a new benchmark designed for Traditional Chinese language
understanding. TMMLU+ is a multi-choice question-answering dataset with 66 subjects from …
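
For orientation only, a generic way such multiple-choice suites are often scored (pick the option the model rates highest, report per-subject accuracy) is sketched below; score_option is a hypothetical hook, and this is not claimed to be TMMLU+'s official evaluation protocol.

    from collections import defaultdict

    # Generic multiple-choice scoring sketch, not necessarily TMMLU+'s protocol.
    # `score_option` is a hypothetical callable, e.g. a model log-likelihood of the option.
    def evaluate(items, score_option):
        correct, total = defaultdict(int), defaultdict(int)
        for item in items:
            # item: {"subject": str, "question": str, "options": [str, ...], "answer": int}
            scores = [score_option(item["question"], opt) for opt in item["options"]]
            pred = max(range(len(scores)), key=scores.__getitem__)
            total[item["subject"]] += 1
            correct[item["subject"]] += int(pred == item["answer"])
        return {s: correct[s] / total[s] for s in total}

    # Toy usage with a dummy scorer.
    dummy = [{"subject": "physics", "question": "…",
              "options": ["A", "B", "C", "D"], "answer": 2}]
    print(evaluate(dummy, score_option=lambda q, o: len(o)))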