Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

RoboLLM: Robotic vision tasks grounded on multimodal large language models

Z Long, G Killick, R McCreadie… - … on Robotics and …, 2024 - ieeexplore.ieee.org
Robotic vision applications often necessitate a wide range of visual perception tasks, such
as object detection, segmentation, and identification. While there have been substantial …

CLCE: An approach to refining cross-entropy and contrastive learning for optimized learning fusion

Z Long, L Zhuang, G Killick, Z Meng, R McCreadie… - ECAI 2024, 2024 - ebooks.iospress.nl
State-of-the-art pre-trained image models predominantly adopt a two-stage approach: initial
unsupervised pre-training on large-scale datasets followed by task-specific fine-tuning using …
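The CLCE entry above describes fusing cross-entropy with contrastive learning for fine-tuning. A minimal sketch of one such weighted combination, not the paper's exact formulation; the weighting `alpha`, temperature `tau`, and the supervised-contrastive form of the second term are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_plus_contrastive_loss(logits, embeddings, labels, alpha=0.5, tau=0.1):
    """Weighted sum of cross-entropy and a supervised contrastive term.

    logits:     (n, n_classes) classifier outputs
    embeddings: (n, d) representations used for the contrastive term
    alpha, tau: illustrative hyperparameters, not values from the paper
    """
    n = len(labels)
    # Standard cross-entropy on the classifier head.
    ce = -np.log(softmax(logits)[np.arange(n), labels]).mean()

    # Supervised contrastive term on L2-normalized embeddings:
    # pull together samples that share a label, push apart the rest.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau
    con, counted = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # no positives for this anchor; skip it
        denom = np.exp(sim[i][np.arange(n) != i]).sum()
        con += -np.log(np.exp(sim[i][pos]) / denom).mean()
        counted += 1
    con = con / counted if counted else 0.0
    return alpha * ce + (1 - alpha) * con
```

The two terms operate on different outputs (class logits vs. embeddings), which is why loss-fusion approaches of this kind typically keep both a classifier head and a projection head during fine-tuning.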

Routing experts: Learning to route dynamic experts in multi-modal large language models

Q Wu, Z Ke, Y Zhou, G Luo, X Sun, R Ji - arXiv preprint arXiv:2407.14093, 2024 - arxiv.org
Recently, mixture of experts (MoE) has become a popular paradigm for achieving the trade-
off between model capacity and efficiency of multi-modal large language models (MLLMs) …

Solving token gradient conflict in mixture-of-experts for large vision-language model

L Yang, D Shen, C Cai, F Yang, S Li, D Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-
Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving …
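The two MoE entries above both build on sparse expert routing, where a learned gate sends each token to a small subset of experts instead of running the full dense model. A minimal NumPy sketch of top-k routing; `gate_w`, the `experts` callables, and `k` are illustrative assumptions, not any one paper's architecture:

```python
import numpy as np

def topk_moe(x, gate_w, experts, k=2):
    """Route each token to its top-k experts, mixing outputs by gate weight.

    x:       (tokens, d) token representations
    gate_w:  (d, n_experts) router weights
    experts: list of callables, each mapping a (d,) vector to a (d,) vector
    """
    logits = x @ gate_w                        # router scores, (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        # Softmax only over the selected experts' scores.
        probs = np.exp(sel - sel.max())
        probs /= probs.sum()
        # Weighted mix of the chosen experts' outputs; unselected experts run nothing.
        for w, e in zip(probs, topk[t]):
            out[t] += w * experts[e](x[t])
    return out
```

Because only k of the experts execute per token, compute grows with k rather than with the total expert count, which is the capacity-vs-efficiency trade-off these papers target.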

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

W Liang, L Yu, L Luo, S Iyer, N Dong, C Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of large language models (LLMs) has expanded to multi-modal systems
capable of processing text, images, and speech within a unified framework. Training these …

LaCViT: A Label-Aware Contrastive Fine-Tuning Framework for Vision Transformers

Z Long, R McCreadie, GA Camarasa… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Vision Transformers (ViTs) have emerged as popular models in computer vision,
demonstrating state-of-the-art performance across various tasks. This success typically …

Transferrable DP-Adapter Tuning: A Privacy-Preserving Multimodal Parameter-Efficient Fine-Tuning Framework

L Ji, S Xiao, B Xu, H Zhang - 2024 IEEE 24th International …, 2024 - ieeexplore.ieee.org
In recent years, multimodal large-scale pre-trained models have achieved tremendous
success and become a milestone in the field of artificial intelligence, demonstrating the …

Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation

X Liu, M Wang, S Deng, X Peng, Y Liu, R Nong… - Authorea …, 2024 - techrxiv.org
Large language models (LLMs) have seen rapid advancements in their ability to integrate
and generate knowledge from various sources. However, the challenge of efficiently …

Knowledge Graphs for Multi-Modal Learning: Survey and Perspective

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - Available at SSRN … - papers.ssrn.com
Integrated with multi-modal learning, knowledge graphs (KGs), as structured knowledge
repositories, can enhance AI for processing and understanding complex, real-world data …