An overview on language models: Recent developments and outlook

C Wei, YC Wang, B Wang, CCJ Kuo - arXiv preprint arXiv:2303.05759, 2023 - arxiv.org
Language modeling studies the probability distributions over strings of texts. It is one of the
most fundamental tasks in natural language processing (NLP). It has been widely used in …

Perceiver IO: A general architecture for structured inputs & outputs

A Jaegle, S Borgeaud, JB Alayrac, C Doersch… - arXiv preprint arXiv …, 2021 - arxiv.org
A central goal of machine learning is the development of systems that can solve many
problems in as many data domains as possible. Current architectures, however, cannot be …

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

JH Clark, D Garrette, I Turc, J Wieting - Transactions of the Association …, 2022 - direct.mit.edu
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet
nearly all commonly used models still require an explicit tokenization step. While recent …

Charformer: Fast character transformers via gradient-based subword tokenization

Y Tay, VQ Tran, S Ruder, J Gupta, HW Chung… - arXiv preprint arXiv …, 2021 - arxiv.org
State-of-the-art models in natural language processing rely on separate rigid subword
tokenization algorithms, which limit their generalization ability and adaptation to new …

Multimodal large language models: A survey

J Wu, W Gan, Z Chen, S Wan… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
The exploration of multimodal language models integrates multiple data types, such as
images, text, audio, and other heterogeneous data. While the latest large language …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Impact of tokenization on language models: An analysis for Turkish

C Toraman, EH Yilmaz, F Şahinuç… - ACM Transactions on …, 2023 - dl.acm.org
Tokenization is an important text preprocessing step to prepare input tokens for deep
language models. WordPiece and BPE are de facto methods employed by important …

Strong Prediction: Language model surprisal explains multiple N400 effects

JA Michaelov, MD Bardolph, CK Van Petten… - Neurobiology of …, 2024 - direct.mit.edu
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400
response to a stimulus reflects the extent to which the stimulus was predicted, the extent to …

Towards improved classification accuracy on highly imbalanced text dataset using deep neural language models

S Shaikh, SM Daudpota, AS Imran, Z Kastrati - Applied Sciences, 2021 - mdpi.com
Data imbalance is a frequently occurring problem in classification tasks, where the number of
samples in one category exceeds the number in others. Quite often, the minority class data is …

Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold

S Ruder, I Vulić, A Søgaard - arXiv preprint arXiv:2206.09755, 2022 - arxiv.org
The prototypical NLP experiment trains a standard architecture on labeled English data and
optimizes for accuracy, without accounting for other dimensions such as fairness …