Achieving visual reasoning is a long-term goal of artificial intelligence. In the last decade, several studies have applied deep neural networks (DNNs) to the task of learning visual …
X Pan, A Philip, Z Xie, O Schwartz - arXiv preprint arXiv:2405.14880, 2024 - arxiv.org
Self-attention in vision transformers has been thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to …
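The grouping view in the snippet above (tokens attend to other tokens with similar embeddings) can be illustrated with a minimal sketch. It assumes identity query and key projections, so the attention score is just a scaled dot product between raw embeddings. The function names and toy embeddings are hypothetical and are not taken from the cited paper. In a trained vision transformer the queries and keys come from learned projections, so attention need not follow raw similarity this closely.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Return a matrix whose row i says how strongly token i attends to each token.

    Assumption: queries and keys are the raw embeddings (identity projections),
    so each score is a scaled dot product between two embeddings.
    """
    scale = math.sqrt(len(tokens[0]))
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    return weights


# Hypothetical toy embeddings: tokens 0 and 1 are similar, token 2 differs.
tokens = [[4.0, 0.0], [3.5, 0.5], [0.0, 4.0]]
weights = attention_weights(tokens)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Under these assumptions, token 0 places more weight on the similar token 1 than on the dissimilar token 2, which is the grouping behaviour the snippet describes.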