J Jiao, YM Tang, KY Lin, Y Gao, AJ Ma… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive …
Capturing images has been increasingly popular in recent years, owing to the widespread availability of cameras. Images are essential in our daily lives because they contain a wealth …
C Wang, Y Shen, L Ji - Expert systems with applications, 2022 - Elsevier
In recent years, Transformer structures have been widely applied in image captioning with impressive performance. However, previous works often neglect the geometry and position …
X Fang, D Liu, P Zhou, Z Xu, R Li - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
This article studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately determine the specific video segment in an untrimmed video according to …
K Zhang, H Jiang, J Zhang, Q Huang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Medical report generation generates the corresponding report according to the given radiology image, which has been attracting increasing research interest. However, existing …
W Jiang, M Zhu, Y Fang, G Shi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Attention mechanisms have been extensively adopted in vision and language tasks such as image captioning. It encourages a captioning model to dynamically ground appropriate …
R Sasibhooshan, S Kumaraswamy, S Sasidharan - Journal of Big Data, 2023 - Springer
Automatic caption generation with attention mechanisms aims at generating more descriptive captions containing coarser to finer semantic contents in the image. In this work …
Image captioning is a difficult problem for machine learning algorithms to compress huge amounts of images into descriptive languages. The recurrent models are popularly used as …
J Guo, M Wang, Y Zhou, B Song, Y Chi… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Image-text retrieval (ITR) is a challenging task in the field of multimodal information processing due to the semantic gap between different modalities. In recent years …