H Sakaino, TN Phuong, VN Duy - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Abstract: Recently, large Vision Language (VL) models, i.e., CLIP, have demonstrated
impressive capabilities in training solely on internet-scale image-language pairs. Moreover …