Efficient deep learning: A survey on making deep learning models smaller, faster, and better

G Menghani - ACM Computing Surveys, 2023 - dl.acm.org
Deep learning has revolutionized the fields of computer vision, natural language
understanding, speech recognition, information retrieval, and more. However, with the …

Cramming: Training a language model on a single GPU in one day

J Geiping, T Goldstein - International Conference on …, 2023 - proceedings.mlr.press
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …

Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation

Y Wu, M Rezagholizadeh, A Ghaddar… - Proceedings of the …, 2021 - aclanthology.org
Intermediate layer matching has been shown to be an effective approach for improving
knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two …

Translate & Fill: Improving zero-shot multilingual semantic parsing with synthetic data

M Nicosia, Z Qu, Y Altun - arXiv preprint arXiv:2109.04319, 2021 - arxiv.org
While multilingual pretrained language models (LMs) fine-tuned on a single language have
shown substantial cross-lingual task transfer capabilities, there is still a wide performance …

MergeDistill: Merging pre-trained language models using distillation

S Khanuja, M Johnson, P Talukdar - arXiv preprint arXiv:2106.02834, 2021 - arxiv.org
Pre-trained multilingual language models (LMs) have achieved state-of-the-art results in
cross-lingual transfer, but they often lead to an inequitable representation of languages due …

pNLP-Mixer: An efficient all-MLP architecture for language

F Fusco, D Pascual, P Staar, D Antognini - arXiv preprint arXiv:2202.04350, 2022 - arxiv.org
Large pre-trained language models based on transformer architecture have drastically
changed the natural language processing (NLP) landscape. However, deploying those …

Too brittle to touch: comparing the stability of quantization and distillation towards developing low-resource MT models

H Diddee, S Dandapat, M Choudhury… - Proceedings of the …, 2022 - aclanthology.org
Leveraging shared learning through massively multilingual models, state-of-the-art machine
translation (MT) models are often able to adapt to the paucity of data for low-resource …

Unsupervised term extraction for highly technical domains

F Fusco, P Staar, D Antognini - arXiv preprint arXiv:2210.13118, 2022 - arxiv.org
Term extraction is an information extraction task at the root of knowledge discovery
platforms. Developing term extractors that are able to generalize across very diverse and …

Data augmentation and learned layer aggregation for improved multilingual language understanding in dialogue

E Razumovskaia, I Vulić… - Findings of the Association …, 2022 - aclanthology.org
Scaling dialogue systems to a multitude of domains, tasks and languages relies on costly
and time-consuming data annotation for different domain-task-language configurations. The …

Using large text-to-image models with structured prompts for skin disease identification: A case study

S Rajapaksa, JMU Vianney, R Castro… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper investigates the potential use of large text-to-image (LTI) models for the
automated diagnosis of a few skin conditions that are rare or have a serious lack of annotated …