K Wen, J Xia, Y Huang, L Li, J Xu… - 2021 IEEE/CVF …, 2021 - ieeexplore.ieee.org
There has been a recent surge of interest in cross-modal pre-training. However, existing
approaches pre-train a one-stream model to learn joint vision-language representation …