A comprehensive survey of 3D dense captioning: Localizing and describing objects in 3D scenes

T Yu, X Lin, S Wang, W Sheng… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Three-Dimensional (3D) dense captioning is an emerging vision-language bridging task that
aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents …

A comprehensive survey on deep-learning-based visual captioning

B Xin, N Xu, Y Zhai, T Zhang, Z Lu, J Liu, W Nie, X Li… - Multimedia …, 2023 - Springer
Generating a description for an image/video is termed as the visual captioning task. It
requires the model to capture the semantic information of visual content and translate them …

Generalized zero-shot learning with multi-source semantic embeddings for scene recognition

X Song, H Zeng, S Zhang, L Herranz… - Proceedings of the 28th …, 2020 - dl.acm.org
Recognizing visual categories from semantic descriptions is a promising way to extend the
capability of a visual classifier beyond the concepts represented in the training data (i.e. seen …

Be specific, be clear: Bridging machine and human captions by scene-guided transformer

Y Huang, Z Zeng, Y Lu - Proceedings of the 2021 Workshop on Multi …, 2021 - dl.acm.org
Automatically generating natural language descriptions for images, i.e., image captioning, is
one of the primary goals for multimedia understanding. The recent success of deep neural …