MM-LLMs: Recent Advances in Multimodal Large Language Models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models

X Lei, Z Yang, X Chen, P Li, Y Liu - arXiv preprint arXiv:2402.12058, 2024 - arxiv.org
State-of-the-art Large Multi-Modal Models (LMMs) have demonstrated exceptional
capabilities in vision-language tasks. Despite their advanced functionalities, the …

Dual-View Visual Contextualization for Web Navigation

J Kil, CH Song, B Zheng, X Deng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Automatic web navigation aims to build a web agent that can follow language instructions to
execute complex and diverse tasks on real-world websites. Existing work primarily takes …

Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

W Tan, Z Ding, W Zhang, B Li, B Zhou, J Yue… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the success in specific tasks and scenarios, existing foundation agents, empowered
by large models (LMs) and advanced tools, still cannot generalize to different scenarios …

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

J Liu, Y Song, BY Lin, W Lam, G Neubig, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but
evaluating their performance in the web domain remains a challenge due to the lack of …

Large Language Model-based Human-Agent Collaboration for Complex Task Solving

X Feng, ZY Chen, Y Qin, Y Lin, X Chen, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent developments within the research community, the integration of Large Language
Models (LLMs) in creating fully autonomous agents has garnered significant interest …

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

C Rawles, S Clinckemaillie, Y Chang, J Waltz… - arXiv preprint arXiv …, 2024 - arxiv.org
Autonomous agents that execute human tasks by controlling computers can enhance
human productivity and application accessibility. Yet, progress in this field will be driven by …

Automating the Enterprise with Foundation Models

M Wornow, A Narayan, K Opsahl-Ong… - arXiv preprint arXiv …, 2024 - arxiv.org
Automating enterprise workflows could unlock $4 trillion/year in productivity gains. Despite
being of interest to the data management community for decades, the ultimate vision of end …

MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Y Jiang, J Zhang, K Sun, Z Sourati, K Ahrabian… - arXiv preprint arXiv …, 2024 - arxiv.org
While multi-modal large language models (MLLMs) have shown significant progress on
many popular visual reasoning benchmarks, whether they possess abstract visual …

Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks

M Wornow, A Narayan, B Viggiano, IS Khare… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating
models on business process management (BPM) tasks. BPM is the practice of documenting …