research-article

A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose

Authors:

Matias Mendieta,

Chen Chen

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

Pages 5496 - 5507

https://doi.org/10.1145/3503161.3547844

Published: 10 October 2022

Abstract

Existing deep learning-based human mesh reconstruction approaches tend to build larger networks to achieve higher accuracy. Computational complexity and model size are often neglected, despite being key characteristics for practical use of human mesh reconstruction models (e.g., virtual try-on systems). In this paper, we present GTRS, a lightweight pose-based method that can reconstruct a human mesh from a 2D human pose. We propose a pose analysis module that uses graph transformers to exploit structured and implicit joint correlations, and a mesh regression module that combines the extracted pose feature with the mesh template to reconstruct the final human mesh. We demonstrate the efficiency and generalization of GTRS through extensive evaluations on the Human3.6M and 3DPW datasets. In particular, GTRS achieves better accuracy than the SOTA pose-based method Pose2Mesh while using only 10.2% of the parameters (Params) and 2.5% of the FLOPs on the challenging in-the-wild 3DPW dataset. Code is available at https://github.com/zczcwh/GTRS
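
To make the idea of combining structured and implicit joint correlations concrete, the following is a minimal NumPy sketch of one possible graph-transformer-style layer over 2D joints. It is not the GTRS implementation: the layer layout (a skeleton-adjacency graph convolution branch added to a single-head self-attention branch), the toy five-joint chain skeleton, the feature width, and every name in the code are assumptions made for illustration only.

```python
import numpy as np


def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_transformer_layer(h, a_norm, params):
    """Hypothetical layer: graph convolution plus self-attention, with a residual.

    h:      (num_joints, dim) joint features
    a_norm: (num_joints, num_joints) normalized skeleton adjacency
    """
    # Structured correlations: features mix only along skeleton edges.
    graph_branch = a_norm @ h @ params["w_g"]
    # Implicit correlations: every joint attends to every other joint.
    q, k, v = h @ params["w_q"], h @ params["w_k"], h @ params["w_v"]
    attn = softmax(q @ k.T / np.sqrt(h.shape[1]))
    attn_branch = attn @ v
    return h + graph_branch + attn_branch, attn


rng = np.random.default_rng(0)
num_joints, dim = 5, 8
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # toy chain skeleton (assumption)

pose_2d = rng.normal(size=(num_joints, 2))   # stand-in for detected 2D joints
w_embed = rng.normal(size=(2, dim)) * 0.1    # lifts (x, y) to `dim` features
params = {
    name: rng.normal(size=(dim, dim)) * 0.1
    for name in ("w_g", "w_q", "w_k", "w_v")
}

a_norm = normalized_adjacency(edges, num_joints)
features, attn = graph_transformer_layer(pose_2d @ w_embed, a_norm, params)
```

In this sketch the adjacency term restricts mixing to neighbouring joints, while the attention matrix is dense, so distant joints such as the two ends of the chain can still influence each other. Random weights stand in for learned ones.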

Supplementary Material

MP4 File (MM22-fp0420.mp4)

Video Presentation for the paper: A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose


Index Terms

A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

7537 pages

ISBN:9781450392037

DOI:10.1145/3503161

General Chairs: João Magalhães (NOVA University of Lisbon, Portugal), Alberto del Bimbo (University of Florence, Italy), Shin'ichi Satoh (National Institute of Informatics, Japan), Nicu Sebe (University of Trento, Italy)

Program Chairs: Xavier Alameda-Pineda (Inria, Grenoble, France), Qin Jin (Renmin University of China, China), Vincent Oria (New Jersey Institute of Technology, USA), Laura Toni (University College London, UK)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 10 - 14, 2022

Lisboa, Portugal

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
