Evaluating large language models: A comprehensive survey

Z Guo, R Jin, C Liu, Y Huang, D Shi, L Yu, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities across a broad
spectrum of tasks. They have attracted significant attention and been deployed in numerous …

Prompting GPT-3 to be reliable

C Si, Z Gan, Z Yang, S Wang, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) show impressive abilities via few-shot prompting.
Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world …

Multi-hop question answering

V Mavi, A Jangra, A Jatowt - Foundations and Trends® in …, 2024 - nowpublishers.com
The task of Question Answering (QA) has attracted significant research interest for a
long time. Its relevance to language understanding and knowledge retrieval tasks, along …

Generate-then-ground in retrieval-augmented generation for multi-hop question answering

Z Shi, W Sun, S Gao, P Ren, Z Chen, Z Ren - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large
language models (LLMs) due to the intensive knowledge required. Current solutions, like …

A survey on multi-hop question answering and generation

V Mavi, A Jangra, A Jatowt - arXiv preprint arXiv:2204.09140, 2022 - arxiv.org
The problem of Question Answering (QA) has attracted significant research interest for a long
time. Its relevance to language understanding and knowledge retrieval tasks, along with the …

A survey on measuring and mitigating reasoning shortcuts in machine reading comprehension

X Ho, JM Meissner, S Sugawara, A Aizawa - arXiv preprint arXiv …, 2022 - arxiv.org
The issue of shortcut learning is widely known in NLP and has been an important research
focus in recent years. Unintended correlations in the data enable models to easily solve …

Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers

T Chen, Z Zhang, A Jaiswal, S Liu, Z Wang - arXiv preprint arXiv …, 2023 - arxiv.org
Despite their remarkable achievement, gigantic transformers encounter significant
drawbacks, including exorbitant computational and memory footprints during training, as …

Select, substitute, search: A new benchmark for knowledge-augmented visual question answering

A Jain, M Kothyari, V Kumar, P Jyothi… - Proceedings of the 44th …, 2021 - dl.acm.org
Multimodal IR spanning text corpora, knowledge graphs, and images, called outside
knowledge visual question answering (OKVQA), has attracted much recent interest. However, the …

STREET: A multi-task structured reasoning and explanation benchmark

D Ribeiro, S Wang, X Ma, H Zhu, R Dong… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce STREET, a unified multi-task and multi-domain natural language reasoning
and explanation benchmark. Unlike most existing question-answering (QA) datasets, we …

Understanding and improving zero-shot multi-hop reasoning in generative question answering

Z Jiang, J Araki, H Ding, G Neubig - arXiv preprint arXiv:2210.04234, 2022 - arxiv.org
Generative question answering (QA) models generate answers to questions either solely
based on the parameters of the model (the closed-book setting) or additionally retrieving …