Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Flatten transformer: Vision transformer using focused linear attention

D Han, X Pan, Y Han, S Song… - Proceedings of the …, 2023 - openaccess.thecvf.com
The quadratic computation complexity of self-attention has been a persistent challenge
when applying Transformer models to vision tasks. Linear attention, on the other hand, offers …
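The O(n) reordering that linear attention offers can be sketched in a few lines (a minimal NumPy illustration, assuming the generic kernel feature map phi(x) = elu(x) + 1 as the similarity kernel rather than the focused map proposed in the paper; all names are illustrative):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # O(n^2) in sequence length: materializes the full n x n score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # O(n) in sequence length: reorders the computation as
    # phi(Q) @ (phi(K).T @ V), never forming the n x n matrix.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, strictly positive
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                # (d, d_v) summary, independent of n
    z = Qp @ Kp.sum(axis=0)      # per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

The key design point is associativity: computing `phi(K).T @ V` first yields a fixed-size summary, so cost grows linearly with sequence length.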

Artificial general intelligence for radiation oncology

C Liu, Z Liu, J Holmes, L Zhang, L Zhang, Y Ding… - Meta-radiology, 2023 - Elsevier
The emergence of artificial general intelligence (AGI) is transforming radiation oncology. As
prominent vanguards of AGI, large language models (LLMs) such as GPT-4 and PaLM 2 can …

Agent attention: On the integration of softmax and linear attention

D Han, T Ye, Y Han, Z Xia, S Pan, P Wan… - … on Computer Vision, 2025 - Springer
The attention module is the key component in Transformers. While the global attention
mechanism offers high expressiveness, its excessive computational cost restricts its …
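The softmax/linear integration described above can be sketched as a two-stage attention through a small set of agent tokens (a hedged NumPy sketch of the commonly described form, not the paper's exact implementation; `A` and the variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, K, V, A):
    # m agent tokens first summarize all n keys/values with softmax
    # attention, then the n queries attend only to the m agents,
    # reducing cost from O(n^2) to O(n*m).
    d = Q.shape[-1]
    agent_v = softmax(A @ K.T / np.sqrt(d)) @ V   # (m, d_v) global summary
    return softmax(Q @ A.T / np.sqrt(d)) @ agent_v  # (n, d_v) broadcast back

rng = np.random.default_rng(0)
n, m, d = 8, 2, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
A = rng.normal(size=(m, d))                       # agent tokens, m << n
out = agent_attention(Q, K, V, A)
```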

A survey of knowledge graph reasoning on graph types: Static, dynamic, and multi-modal

K Liang, L Meng, M Liu, Y Liu, W Tu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Knowledge graph reasoning (KGR), aiming to deduce new facts from existing facts based on
mined logic rules underlying knowledge graphs (KGs), has become a fast-growing research …

Slide-transformer: Hierarchical vision transformer with local self-attention

X Pan, T Ye, Z Xia, S Song… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Self-attention mechanism has been a key factor in the recent progress of Vision Transformer
(ViT), which enables adaptive feature extraction from global contexts. However, existing self …

Grounding language models for visual entity recognition

Z Xiao, M Gong, P Cascante-Bonilla, X Zhang… - … on Computer Vision, 2025 - Springer
We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our
model extends an autoregressive Multimodal Large Language Model by employing retrieval …

Heterogeneous contrastive learning for foundation models and beyond

L Zheng, B Jing, Z Li, H Tong, J He - Proceedings of the 30th ACM …, 2024 - dl.acm.org
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive
self-supervised learning to model large-scale heterogeneous data. Many existing foundation …

Transformer technology in molecular science

J Jiang, L Ke, L Chen, B Dou, Y Zhu… - Wiley …, 2024 - Wiley Online Library
A transformer is the foundational architecture behind large language models designed to
handle sequential data by using mechanisms of self-attention to weigh the importance of …

Efficient token-guided image-text retrieval with consistent multimodal contrastive training

C Liu, Y Zhang, H Wang, W Chen… - … on Image Processing, 2023 - ieeexplore.ieee.org
Image-text retrieval is a central problem for understanding the semantic relationship
between vision and language, and serves as the basis for various visual and language …
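The multimodal contrastive training such retrieval systems rest on can be sketched with the generic symmetric (CLIP-style) objective (a minimal NumPy sketch; the paper's token-guided, consistency-regularized variant adds machinery on top of this, and all names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B); matched pairs on diagonal
    idx = np.arange(len(logits))
    # Cross-entropy in both directions: image->text and text->image.
    loss_i = -np.log(softmax(logits, axis=1)[idx, idx]).mean()
    loss_t = -np.log(softmax(logits, axis=0)[idx, idx]).mean()
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
B, d = 4, 8
loss = contrastive_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d)))
```

Minimizing this loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch, which is what makes nearest-neighbor retrieval work at test time.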