Token merging: Your ViT but faster

Y Liu, K Zhang, Y Li, Z Yan, C Gao, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …

Cited by 69 - Related articles - All 2 versions

[PDF] thecvf.com

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Cited by 68 - Related articles - All 3 versions

[PDF] thecvf.com

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Cited by 39 - Related articles - All 4 versions

[PDF] neurips.cc

Scaling open-vocabulary object detection

M Minderer, A Gritsenko… - Advances in Neural …, 2024 - proceedings.neurips.cc

Open-vocabulary object detection has benefited greatly from pretrained vision-language
models, but is still limited by the amount of available detection training data. While detection …

Cited by 69 - Related articles - All 6 versions

[PDF] arxiv.org

On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org

The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

Cited by 20 - Related articles - All 2 versions

[PDF] neurips.cc

Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

M Dehghani, B Mustafa, J Djolonga… - Advances in …, 2024 - proceedings.neurips.cc

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …

Cited by 33 - Related articles - All 5 versions

[PDF] arxiv.org

FastComposer: Tuning-free multi-subject image generation with localized attention

G Xiao, T Yin, WT Freeman, F Durand… - arXiv preprint arXiv …, 2023 - arxiv.org

Diffusion models excel at text-to-image generation, especially in subject-driven generation
for personalized images. However, existing methods are inefficient due to the subject …

Cited by 90 - Related articles - All 2 versions

[PDF] arxiv.org

Zero-shot video editing using off-the-shelf image diffusion models

W Wang, Y Jiang, K Xie, Z Liu, H Chen, Y Cao… - arXiv preprint arXiv …, 2023 - arxiv.org

Large-scale text-to-image diffusion models achieve unprecedented success in image
generation and editing. However, how to extend such success to video editing is unclear …

Cited by 74 - Related articles - All 2 versions

[PDF] thecvf.com

Token merging for fast Stable Diffusion

D Bolya, J Hoffman - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com

The landscape of image generation has been forever changed by open vocabulary diffusion
models. However, at their core these models use transformers, which makes generation …

Cited by 40 - Related articles - All 5 versions

[PDF] arxiv.org

LLMLingua: Compressing prompts for accelerated inference of large language models

H Jiang, Q Wu, CY Lin, Y Yang, L Qiu - arXiv preprint arXiv:2310.05736, 2023 - arxiv.org

Large language models (LLMs) have been applied in various applications due to their
astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) …

Cited by 74 - Related articles - All 6 versions
