Unified vision-language pre-training for image captioning and VQA L Zhou, H Palangi, L Zhang, H Hu, J Corso, J Gao Proceedings of the AAAI conference on artificial intelligence 34 (07), 13041 …, 2020 | 888 | 2020 |
Gemini: a family of highly capable multimodal models G Team, R Anil, S Borgeaud, Y Wu, JB Alayrac, J Yu, R Soricut, ... arXiv preprint arXiv:2312.11805, 2023 | 852 | 2023 |
Florence: A new foundation model for computer vision L Yuan, D Chen, YL Chen, N Codella, X Dai, J Gao, H Hu, X Huang, B Li, ... arXiv preprint arXiv:2111.11432, 2021 | 739 | 2021 |
Towards automatic learning of procedures from web instructional videos L Zhou, C Xu, J Corso Proceedings of the AAAI Conference on Artificial Intelligence 32 (1), 2018 | 735 | 2018 |
End-to-end dense video captioning with masked transformer L Zhou, Y Zhou, JJ Corso, R Socher, C Xiong Proceedings of the IEEE conference on computer vision and pattern …, 2018 | 631 | 2018 |
Less is more: ClipBERT for video-and-language learning via sparse sampling J Lei, L Li, L Zhou, Z Gan, TL Berg, M Bansal, J Liu Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2021 | 628 | 2021 |
RegionCLIP: Region-based language-image pretraining Y Zhong, J Yang, P Zhang, C Li, N Codella, LH Li, L Zhou, X Dai, L Yuan, ... Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022 | 393 | 2022 |
Grounded video description L Zhou, Y Kalantidis, X Chen, JJ Corso, M Rohrbach Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2019 | 210 | 2019 |
BEVT: BERT pretraining of video transformers R Wang, D Chen, Z Wu, Y Chen, X Dai, M Liu, YG Jiang, L Zhou, L Yuan Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022 | 205 | 2022 |
CLIP-Event: Connecting text and images with event structures M Li, R Xu, S Wang, L Zhou, X Lin, C Zhu, M Zeng, H Ji, SF Chang Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022 | 116 | 2022 |
OmniVL: One foundation model for image-language and video-language tasks J Wang, D Chen, Z Wu, C Luo, L Zhou, Y Zhao, Y Xie, C Liu, YG Jiang, ... Advances in neural information processing systems 35, 5696-5710, 2022 | 115 | 2022 |
VALUE: A multi-task benchmark for video-and-language understanding evaluation L Li, J Lei, Z Gan, L Yu, YC Chen, R Pillai, Y Cheng, L Zhou, XE Wang, ... arXiv preprint arXiv:2106.04632, 2021 | 101 | 2021 |
Dense video captioning Y Zhou, L Zhou, C Xiong, R Socher US Patent 10,542,270, 2020 | 99 | 2020 |
Language models with image descriptors are strong few-shot video-language learners Z Wang, M Li, R Xu, L Zhou, J Lei, X Lin, S Wang, Z Yang, C Zhu, ... Advances in Neural Information Processing Systems 35, 8483-8497, 2022 | 97 | 2022 |
Watch what you just said: Image captioning with text-conditional attention L Zhou, C Xu, P Koch, JJ Corso Proceedings of the Thematic Workshops of ACM Multimedia 2017, 305-313, 2017 | 94 | 2017 |
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction L Zhou, N Louis, JJ Corso British Machine Vision Conference, 2018 | 85 | 2018 |
UC2: Universal cross-lingual cross-modal vision-and-language pre-training M Zhou, L Zhou, S Wang, Y Cheng, L Li, Z Yu, J Liu Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2021 | 77 | 2021 |
Multiagent reinforcement learning with sparse interactions by negotiation and knowledge transfer L Zhou, P Yang, C Chen, Y Gao IEEE Transactions on Cybernetics 47 (5), 1238-1250, 2016 | 61 | 2016 |
MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering D Gao, L Zhou, L Ji, L Zhu, Y Yang, MZ Shou Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2023 | 53 | 2023 |
Image caption generation with text-conditional semantic attention L Zhou, C Xu, P Koch, JJ Corso arXiv preprint arXiv:1606.04621, 2016 | 47 | 2016 |