Large language models and causal inference in collaboration: A comprehensive survey

X Liu, P Xu, J Wu, J Yuan, Y Yang, Y Zhou, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …

Honeybee: Locality-enhanced projector for multimodal LLM

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …

OpenEQA: Embodied question answering in the era of foundation models

A Majumdar, A Ajay, X Zhang, P Putta… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present a modern formulation of Embodied Question Answering (EQA) as the task of
understanding an environment well enough to answer questions about it in natural …

Can I trust your answer? Visually grounded video question answering

J Xiao, A Yao, Y Li, TS Chua - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study visually grounded VideoQA in response to the emerging trends of utilizing
pretraining techniques for video-language understanding. Specifically, by forcing vision …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

A causal explainable guardrails for large language models

Z Chu, Y Wang, L Li, Z Wang, Z Qin, K Ren - Proceedings of the 2024 on …, 2024 - dl.acm.org
Large Language Models (LLMs) have shown impressive performance in natural language
tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for …

Vamos: Versatile action models for video understanding

S Wang, Q Zhao, MQ Do, N Agarwal, K Lee… - European Conference on …, 2024 - Springer
What makes good representations for video understanding, such as anticipating future
activities, or answering video-conditioned questions? While earlier approaches focus on …

Video-of-thought: Step-by-step video reasoning from perception to cognition

H Fei, S Wu, W Ji, H Zhang, M Zhang… - Forty-first International …, 2024 - openreview.net
Existing research of video understanding still struggles to achieve in-depth comprehension
and reasoning in complex videos, primarily due to the under-exploration of two key …

CREMA: Multimodal compositional video reasoning via efficient modular adaptation and fusion

S Yu, J Yoon, M Bansal - arXiv preprint arXiv:2402.05889, 2024 - southnlp.github.io
Despite impressive advancements in multimodal compositional reasoning approaches, they
are still limited in their flexibility and efficiency by processing fixed modality inputs while …