STEMM: Self-learning with speech-text manifold mixup for speech translation

Q Fang, R Ye, L Li, Y Feng, M Wang - arXiv preprint arXiv:2203.10426, 2022 - arxiv.org
How to learn a better speech representation for end-to-end speech-to-text translation (ST)
with limited labeled data? Existing techniques often attempt to transfer powerful machine …

Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination

H Fei, Q Liu, M Zhang, M Zhang, TS Chua - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we investigate a more realistic unsupervised multimodal machine translation
(UMMT) setup, inference-time image-free UMMT, where the model is trained with source-text …

Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment

S Wu, H Fei, W Ji, TS Chua - arXiv preprint arXiv:2305.12260, 2023 - arxiv.org
Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency
issues, due to the inconsistencies of the semantic scene and syntax attributes during …

Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Retrieving multimodal information for augmented generation: A survey

R Zhao, H Chen, W Wang, F Jiao, XL Do, C Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
As Large Language Models (LLMs) become popular, there emerged an important trend of
using multimodality to augment the LLMs' generation ability, which enables LLMs to better …

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-media communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …

Learning to imagine: Visually-augmented natural language generation

T Tang, Y Chen, Y Du, J Li, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
People often imagine relevant scenes to aid in the writing process. In this work, we aim to
utilize visual information for composition in the same manner as humans. We propose a …

Tackling ambiguity with images: Improved multimodal machine translation and contrastive evaluation

M Futeral, C Schmid, I Laptev, B Sagot… - arXiv preprint arXiv …, 2022 - arxiv.org
One of the major challenges of machine translation (MT) is ambiguity, which can in some
cases be resolved by accompanying context such as images. However, recent work in …

On the evaluation of machine-generated reports

J Mayfield, E Yang, D Lawrie, S MacAvaney… - Proceedings of the 47th …, 2024 - dl.acm.org
Large Language Models (LLMs) have enabled new ways to satisfy information needs.
Although great strides have been made in applying them to settings like document ranking …

Exploring better text image translation with multimodal codebook

Z Lan, J Yu, X Li, W Zhang, J Luan, B Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Text image translation (TIT) aims to translate the source texts embedded in the image to
target translations, which has a wide range of applications and thus has important research …