Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

XVO: Generalized visual odometry via cross-modal self-training

L Lai, Z Shangguan, J Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose XVO, a semi-supervised learning method for training generalized monocular
Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and …

SelfD: Self-learning large-scale driving policies from the web

J Zhang, R Zhu, E Ohn-Bar - Proceedings of the IEEE/CVF …, 2022 - openaccess.thecvf.com
Effectively utilizing the vast amounts of ego-centric navigation data that is freely available on
the internet can advance generalized intelligent systems, i.e., to robustly scale across …

Assister: Assistive navigation via conditional instruction generation

Z Huang, Z Shangguan, J Zhang, G Bar, M Boyd… - … on Computer Vision, 2022 - Springer
We introduce a novel vision-and-language navigation (VLN) task of learning to provide
real-time guidance to a blind follower situated in complex dynamic navigation scenarios …

Motion Diversification Networks

HJ Kim, E Ohn-Bar - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
We introduce Motion Diversification Networks, a novel framework for learning to
generate realistic and diverse 3D human motion. Despite recent advances in deep …

Feedback-Guided Autonomous Driving

J Zhang, Z Huang, A Ray… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
While behavior cloning has recently emerged as a highly successful paradigm for
autonomous driving, humans rarely learn to perform complex tasks such as driving via …

Unified Local-Cloud Decision-Making via Reinforcement Learning

K Sengupta, Z Shangguan, S Bharadwaj… - … on Computer Vision, 2024 - Springer
Embodied vision-based real-world systems, such as mobile robots, require a careful
balance between energy consumption, compute latency, and safety constraints to optimize …

Text to Blind Motion

HJ Kim, K Sengupta, M Kuribayashi, H Kacorri… - arXiv preprint arXiv …, 2024 - arxiv.org
People who are blind perceive the world differently than those who are sighted, which can
result in distinct motion characteristics. For instance, when crossing at an intersection, blind …

Scalable Early Childhood Reading Performance Prediction

Z Shangguan, Z Huang, E Ohn-Bar… - arXiv preprint arXiv …, 2024 - arxiv.org
Models for student reading performance can empower educators and institutions to
proactively identify at-risk students, thereby enabling early and tailored instructional …