challenging Scene Understanding task approached in the past via linguistic priors or spatial
information in a single feature branch. We introduce a new deeply supervised two-branch
architecture, the Multimodal Attentional Translation Embeddings, where the visual features
of each branch are driven by a multimodal attentional mechanism that exploits spatio-
linguistic similarities in a low-dimensional space. We present a variety of experiments …