Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Z Li, B Yang, Q Liu, Z Ma, S Zhang, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Multimodal Models have demonstrated impressive capabilities in understanding
general vision-language tasks. However, due to the limitation of supported input resolution …

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Z Li, B Yang, Q Liu, Z Ma, S Zhang, J Yang… - arXiv e …, 2023 - ui.adsabs.harvard.edu
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …