SAM 2: Segment anything in images and videos

N Ravi, V Gabeur, YT Hu, R Hu, C Ryali, T Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving
promptable visual segmentation in images and videos. We build a data engine, which …

When do we not need larger vision models?

B Shi, Z Wu, M Mao, X Wang, T Darrell - European Conference on …, 2024 - Springer
Scaling up the size of vision models has been the de facto standard to obtain more powerful
visual representations. In this work, we discuss the point beyond which larger vision models …

On the benefits of 3d pose and tracking for human action recognition

J Rajasegaran, G Pavlakos… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work we study the benefits of using tracking and 3D poses for action recognition. To
achieve this, we take the Lagrangian view on analysing actions over a trajectory of human …

Segment anything in medical images and videos: Benchmark and deployment

J Ma, S Kim, F Li, M Baharoon, R Asakereh… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in segmentation foundation models have enabled accurate and efficient
segmentation across a wide range of natural images and videos, but their utility to medical …

Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

Streaming dense video captioning

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos, predict rich, detailed textual descriptions …

SAM2-UNet: Segment anything 2 makes strong encoder for natural and medical image segmentation

X Xiong, Z Wu, S Tan, W Li, F Tang, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Image segmentation plays an important role in vision understanding. Recently, the emerging
vision foundation models continuously achieved superior performance on various tasks …

CHAMMI: A benchmark for channel-adaptive models in microscopy imaging

ZS Chen, C Pham, S Wang, M Doron… - Advances in …, 2024 - proceedings.neurips.cc
Most neural networks assume that input images have a fixed number of channels (three for
RGB images). However, there are many settings where the number of channels may vary …

Multiscale vision transformers meet bipartite matching for efficient single-stage action localization

I Ntinou, E Sanchez… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Action Localization is a challenging problem that combines detection and recognition tasks
which are often addressed separately. State-of-the-art methods rely on off-the-shelf …

Benchmarks and challenges in pose estimation for egocentric hand interactions with objects

Z Fan, T Ohkawa, L Yang, N Lin, Z Zhou, S Zhou… - … on Computer Vision, 2024 - Springer
We interact with the world with our hands and see it through our own (egocentric)
perspective. A holistic 3D understanding of such interactions from egocentric views is …