MAN and CAT: mix attention to nn and concatenate attention to YOLO

R Guan, KL Man, H Zhao, R Zhang, S Yao, J Smith, EG Lim, Y Yue
The Journal of Supercomputing, 2023, Springer
Abstract
CNNs have achieved remarkable image classification and object detection results over the past few years. Due to the locality of the convolution operation, CNNs can extract rich features of the object itself but can hardly obtain global context in images. This means that CNN-based networks are poor candidates for detecting objects by exploiting information from nearby objects, especially when a partially obscured object is hard to detect. ViTs can capture rich context through multi-head self-attention and dramatically improve prediction in complex scenes. However, they suffer from long inference times and a huge number of parameters, which makes ViT-based detection networks hard to deploy in real-time detection systems. In this paper, we first design a novel plug-and-play attention module called mix attention (MA). MA combines channel, spatial, and global contextual attention; it enhances the feature representation of individual objects and the correlation between multiple objects. Second, we propose a backbone network based on mix attention called MANet. MANet-Base achieves state-of-the-art performance on ImageNet and CIFAR. Last but not least, we propose a lightweight object detection network called CAT-YOLO, which trades off precision against speed. It achieves an AP of 25.7% on COCO 2017 test-dev with only 9.17 million parameters, making it possible to deploy ViT-containing models on hardware while ensuring real-time detection. CAT-YOLO detects obscured objects better than other state-of-the-art lightweight models.
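The abstract does not spell out how the mix attention (MA) module composes its three kinds of attention. One plausible reading is a channel gate and a spatial gate (in the spirit of SE and CBAM) followed by multi-head self-attention over spatial positions for global context. Below is a minimal PyTorch sketch under that assumption; the class names, reduction ratio, and head count are illustrative choices, not the paper's actual implementation.

```python
# Hedged sketch of a plug-and-play "mix attention" block: channel and
# spatial gating, then multi-head self-attention for global context.
# This is an assumed composition, not the authors' published code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel reweighting (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)


class SpatialAttention(nn.Module):
    """CBAM-style spatial gate from pooled channel statistics (assumed)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate


class MixAttention(nn.Module):
    """Channel + spatial attention, then multi-head self-attention
    over flattened spatial positions to inject global context."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()
        self.norm = nn.LayerNorm(channels)
        self.mhsa = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        x = self.spatial(self.channel(x))
        b, c, h, w = x.shape
        seq = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        ctx, _ = self.mhsa(seq, seq, seq)
        # Residual add keeps the block shape-preserving and plug-and-play.
        return x + ctx.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = MixAttention(channels=64)
    out = block(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The residual connection around the self-attention branch keeps input and output feature maps the same shape, which is what lets such a module be dropped into an existing CNN backbone (as MANet reportedly does) without altering the surrounding layers.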