Y He, Y Lin, J Wu, H Zhang, Y Zhang, R Le - arXiv preprint arXiv …, 2024 - arxiv.org
Existing large vision-language models (LVLMs) are largely limited to processing short,
seconds-long videos and struggle with generating coherent descriptions for extended video …