Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

UniMERNet: A universal network for real-world mathematical expression recognition

B Wang, Z Gu, G Liang, C Xu, B Zhang, B Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
The paper introduces the UniMER dataset, marking the first study on Mathematical
Expression Recognition (MER) targeting complex real-world scenarios. The UniMER …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Y Ma, X Liu, X Chen, W Liu, C Wu, Z Wu, Z Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …

MatViX: Multimodal Information Extraction from Visually Rich Articles

G Khalighinejad, S Scott, O Liu, KL Anderson… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal information extraction (MIE) is crucial for scientific literature, where valuable data
is often spread across text, figures, and tables. In materials science, extracting structured …

Survey of large multimodal model datasets, application categories and taxonomy

P Pattnayak, HL Patel, B Kumar, A Agarwal… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more
versatile and robust systems by integrating and analyzing diverse types of data, including …

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

X Tang, T Hu, M Ye, Y Shao, X Yin, S Ouyang… - arXiv preprint arXiv …, 2025 - arxiv.org
Chemical reasoning usually involves complex, multi-step processes that demand precise
calculations, where even minor errors can lead to cascading failures. Furthermore, large …

Autonomous Microscopy Experiments through Large Language Model Agents

I Mandal, J Soni, M Zaki, MM Smedskjaer… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence of large language models (LLMs) has accelerated the development of self-
driving laboratories (SDLs) for materials research. Despite their transformative potential …

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Y Hao, J Gu, HW Wang, L Li, Z Yang, L Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
The ability to organically reason over and with both text and images is a pillar of human
intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such …

Autonomous Microscopy Experiments through Large Language Model Agents

NMA Krishnan, I Mandal, J Soni, M Zaki, M Smedskjaer… - 2024 - researchsquare.com
The emergence of large language models (LLMs) has accelerated the development of self-
driving laboratories (SDLs) for materials research. Despite their transformative potential …