MM-LLMs: Recent Advances in Multimodal Large Language Models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

X Lei, Z Yang, X Chen, P Li, Y Liu - arXiv preprint arXiv:2402.12058, 2024 - arxiv.org
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …

Dual-View Visual Contextualization for Web Navigation

J Kil, CH Song, B Zheng, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Automatic web navigation aims to build a web agent that can follow language instructions to
execute complex and diverse tasks on real-world websites. Existing work primarily takes …

Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

W Tan, Z Ding, W Zhang, B Li, B Zhou, J Yue… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the success in specific tasks and scenarios, existing foundation agents, empowered
by large models (LMs) and advanced tools, still cannot generalize to different scenarios …

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

J Liu, Y Song, BY Lin, W Lam, G Neubig, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but
evaluating their performance in the web domain remains a challenge due to the lack of …

Large Language Model-based Human-Agent Collaboration for Complex Task Solving

X Feng, ZY Chen, Y Qin, Y Lin, X Chen, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent developments within the research community, the integration of Large Language
Models (LLMs) in creating fully autonomous agents has garnered significant interest …

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

C Rawles, S Clinckemaillie, Y Chang, J Waltz… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents that execute human tasks by controlling computers can enhance
human productivity and application accessibility. Yet, progress in this field will be driven by …

Automating the Enterprise with Foundation Models

M Wornow, A Narayan, K Opsahl-Ong… - arXiv preprint arXiv …, 2024 - arxiv.org
Automating enterprise workflows could unlock $4 trillion/year in productivity gains. Despite
being of interest to the data management community for decades, the ultimate vision of end …

MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Y Jiang, J Zhang, K Sun, Z Sourati, K Ahrabian… - arXiv preprint arXiv …, 2024 - arxiv.org
While multi-modal large language models (MLLMs) have shown significant progress on
many popular visual reasoning benchmarks, whether they possess abstract visual …

Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks

M Wornow, A Narayan, B Viggiano, IS Khare… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating
models on business process management (BPM) tasks. BPM is the practice of documenting …