Understanding and Mitigating Tokenization Bias in Language Models

B Phan, M Havasi, M Muckley, K Ullrich - arXiv preprint arXiv:2406.16829, 2024 - arxiv.org
State-of-the-art language models are autoregressive and operate on subword units known
as tokens. Specifically, one must encode the conditioning string into a list of tokens before …
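
The snippet's point, that a conditioning string must first be encoded into a token sequence before next-token prediction, is where the bias named in the title enters. A minimal sketch of the kind of boundary effect this can cause, assuming the Hugging Face transformers library and the GPT-2 BPE tokenizer purely for illustration (neither is taken from the paper itself):

# Encoding a prompt on its own can differ from how the same characters
# are tokenized inside a longer string, because BPE merges greedily.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

prefix_ids = tok.encode("hello wor")   # e.g. ['hello', ' wor']
full_ids = tok.encode("hello world")   # ['hello', ' world']

# The canonical tokenization of the full string is unreachable from
# prefix_ids: the last prompt token already consumes the characters
# " wor", so completions that would re-merge across the token boundary
# (here, the single token ' world') are systematically underweighted.
print(tok.convert_ids_to_tokens(prefix_ids))
print(tok.convert_ids_to_tokens(full_ids))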
