Multimodal transformer for unaligned multimodal language sequences

YHH Tsai, S Bai, PP Liang, JZ Kolter… - Proceedings of the …, 2019 - ncbi.nlm.nih.gov
Abstract
Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, modeling such multimodal human language time-series data poses two major challenges: 1) inherent data non-alignment due to variable sampling rates of the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the crossmodal attention mechanism proposed in MulT is able to capture correlated crossmodal signals.
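The directional pairwise crossmodal attention described in the abstract can be illustrated with a minimal sketch, assuming PyTorch; the module name, feature dimensions, and sequence lengths below are hypothetical illustrations, not the authors' released implementation. In each direction (e.g. audio to language), the target modality provides the queries while the source modality provides keys and values, so the two streams may have different lengths and sampling rates and never need to be explicitly aligned.

# Minimal sketch of one directional crossmodal attention block (assumption:
# PyTorch >= 1.9 for batch_first multi-head attention); illustrative only.
import torch
import torch.nn as nn

class CrossmodalAttention(nn.Module):
    """Attends from a target modality (queries) to a source modality
    (keys/values), letting the source stream latently adapt the target."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_target, d_model), e.g. language features
        # source: (batch, T_source, d_model), e.g. audio or vision features
        # The sequence lengths T_target and T_source may differ freely.
        adapted, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + adapted)  # residual connection

# Usage: adapt a language stream with information from an audio stream
# sampled at a different rate (lengths chosen arbitrarily for illustration).
lang = torch.randn(2, 50, 64)    # 50 language time steps
audio = torch.randn(2, 375, 64)  # 375 audio frames
audio_to_lang = CrossmodalAttention(d_model=64)
out = audio_to_lang(target=lang, source=audio)  # shape: (2, 50, 64)

In the full model, such blocks would be instantiated for every ordered pair of modalities and stacked, but the single block above captures the core idea of latently adapting one stream with another without alignment.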