D Zhang, R Zhang, F Yang, Y Li, H Jia… - … on Multimedia and …, 2024 - ieeexplore.ieee.org
The superior performance of pre-trained vision-language models on various downstream
tasks demonstrates the effectiveness of integrating cross-modal vision-language knowledge …