Authors
Artur Jesslen, Guofeng Zhang, Angtian Wang, Alan Yuille, Adam Kortylewski
Abstract
In this work, we pioneer a framework for 3D object representation learning that achieves exceptionally robust classification and pose estimation results. In particular, we introduce a 3D representation of object categories using a 3D template mesh composed of feature vectors at each mesh vertex. Our model predicts, for each pixel in a 2D image, a feature vector of the corresponding vertex in each category template mesh, hence establishing dense correspondences between image pixels and the 3D template geometry of all target object categories. The feature vectors on the mesh vertices are trained to be viewpoint invariant by leveraging associated camera poses. During inference, we efficiently estimate the object class and pose by matching the class-specific templates to a target feature map in a two-step process: First, we classify the image by matching the vertex features of each template to an input feature map. Interestingly, we found that image classification can be performed using the vertex features only and without requiring the 3D mesh geometry, hence making the class label inference very efficient. In a second step, the object pose can be inferred using a render-and-compare matching process that ensures spatial consistency between the detected vertices. Our experiments on image classification demonstrate that our proposed 3D object representation has a number of profound advantages over classical image-based representations. First, it is exceptionally robust on a range of real-world and synthetic out-of-distribution shifts while performing on par with state-of-the-art architectures on in-distribution data in terms of accuracy and …
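The geometry-free classification step described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' code): per-pixel image features are matched against each class's template vertex features via cosine similarity, and the class with the best aggregate match wins. All names, shapes, and the max-then-mean aggregation are assumptions for illustration only.

```python
import numpy as np

def classify_by_vertex_features(feature_map, class_templates):
    """Hypothetical sketch: score each class by matching its template
    vertex features against an image feature map, without using the
    3D mesh geometry.

    feature_map: (H, W, D) array of per-pixel features, L2-normalized.
    class_templates: dict class_name -> (V, D) array of L2-normalized
                     vertex features for that category's template mesh.
    Returns the class name with the highest matching score.
    """
    H, W, D = feature_map.shape
    pixels = feature_map.reshape(-1, D)      # flatten to (H*W, D)
    scores = {}
    for cls, verts in class_templates.items():
        # Cosine similarity between every pixel and every vertex feature;
        # take the best-matching pixel per vertex, then average over vertices.
        sim = pixels @ verts.T               # (H*W, V)
        scores[cls] = sim.max(axis=0).mean()
    return max(scores, key=scores.get)

# Toy usage with random unit-norm features (purely illustrative)
rng = np.random.default_rng(0)
def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

fmap = unit(rng.normal(size=(8, 8, 16)))
templates = {"car": unit(rng.normal(size=(32, 16))),
             "bus": unit(rng.normal(size=(32, 16)))}
print(classify_by_vertex_features(fmap, templates))
```

Because this step never touches vertex positions, only vertex features, the class label can be inferred without rendering, which is what makes the classification stage efficient; the render-and-compare matching is reserved for the subsequent pose-estimation step.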