H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - The Twelfth International … - openreview.net
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …