D Saravanan, D Singh, V Gupta, Z Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Compositionality is a fundamental aspect of vision-language understanding and is
especially required for videos since they contain multiple entities (e.g., persons, actions, and …