Sora: A review on background, technology, limitations, and opportunities of large vision models

Y Liu, K Zhang, Y Li, Z Yan, C Gao, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Scaling open-vocabulary object detection

M Minderer, A Gritsenko… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary object detection has benefited greatly from pretrained vision-language
models, but is still limited by the amount of available detection training data. While detection …

On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

M Dehghani, B Mustafa, J Djolonga… - Advances in …, 2024 - proceedings.neurips.cc
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …

FastComposer: Tuning-free multi-subject image generation with localized attention

G Xiao, T Yin, WT Freeman, F Durand… - arXiv preprint arXiv …, 2023 - arxiv.org
Diffusion models excel at text-to-image generation, especially in subject-driven generation
for personalized images. However, existing methods are inefficient due to the subject …

Zero-shot video editing using off-the-shelf image diffusion models

W Wang, Y Jiang, K Xie, Z Liu, H Chen, Y Cao… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale text-to-image diffusion models achieve unprecedented success in image
generation and editing. However, how to extend such success to video editing is unclear …

Token merging for fast Stable Diffusion

D Bolya, J Hoffman - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
The landscape of image generation has been forever changed by open vocabulary diffusion
models. However, at their core these models use transformers, which makes generation …

LLMLingua: Compressing prompts for accelerated inference of large language models

H Jiang, Q Wu, CY Lin, Y Yang, L Qiu - arXiv preprint arXiv:2310.05736, 2023 - arxiv.org
Large language models (LLMs) have been applied in various applications due to their
astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) …