Human Pose and Shape Estimation from Images and Video

NOTE: For more information and a full list of publications, visit the Real Virtual Humans site:

Direct prediction of 3D body pose and shape remains a challenge even for highly parameterized deep learning models. Mapping from the 2D image space to the 3D space is difficult due to perspective ambiguities and the lack of training data with 3D annotations.
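
The perspective ambiguity can be illustrated with a minimal sketch that is not taken from any of the cited papers. It assumes an ideal pinhole camera; the focal length, image center, and the two example points are hypothetical values chosen for illustration. Two 3D points on the same viewing ray, at different depths, land on the same pixel, so the image alone cannot tell them apart.

```python
def project(point, focal=1000.0, center=(500.0, 500.0)):
    """Pinhole projection of a 3D point (X, Y, Z) with Z > 0 to pixel coordinates.

    focal and center are assumed, illustrative camera parameters.
    """
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


# A point two meters from the camera, and the same point pushed
# twice as far away along the same viewing ray.
near = (0.25, -0.125, 2.0)
far = tuple(2.0 * c for c in near)

pixel_near = project(near)
pixel_far = project(far)

# Both points project to the same pixel although their depths differ.
print(pixel_near, pixel_far)
```

A body model with fixed proportions is one way to constrain this ambiguity, since it restricts which 3D configurations are plausible for a given image.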

To address this, we introduced Neural Body Fitting (NBF) [1], which integrates a statistical body model (SMPL) within a CNN, leveraging reliable bottom-up semantic body part segmentation and robust top-down body model constraints (see Figure-Top). NBF is fully differentiable and can be trained using 2D and 3D annotations. In detailed experiments, we analyze how the components of our model affect performance, especially the use of part segmentations as an explicit intermediate representation, and present a robust, efficiently trainable framework for 3D human pose estimation from 2D images with competitive results on standard benchmarks. In [3], we use a similar bottom-up, top-down design and propose a network architecture with a new disentangled hidden space that encodes explicit 2D and 3D features; it achieves state-of-the-art accuracy on challenging in-the-wild data.
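
One way training with mixed 2D and 3D annotations can work is sketched below. This is a hedged illustration, not the loss used in [1]: the function names, the pinhole camera parameters, and the weights `w_2d` and `w_3d` are all assumptions. The idea is that predicted 3D joints (for example, joints produced by a body model) are compared directly against 3D annotations when they exist, and are projected into the image and compared against 2D annotations otherwise.

```python
def project(joint, focal=1000.0, center=(500.0, 500.0)):
    """Pinhole projection of one 3D joint (assumed camera parameters)."""
    x, y, z = joint
    return (focal * x / z + center[0], focal * y / z + center[1])


def mean_squared_distance(points_a, points_b):
    """Mean squared Euclidean distance between two equal-length point lists."""
    total = 0.0
    for p, q in zip(points_a, points_b):
        total += sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return total / len(points_a)


def mixed_loss(pred_joints_3d, gt_joints_2d=None, gt_joints_3d=None,
               w_2d=1.0, w_3d=1.0):
    """Sum whichever supervision terms are available for this sample."""
    loss = 0.0
    if gt_joints_3d is not None:
        # 3D term: compare joints directly in 3D space.
        loss += w_3d * mean_squared_distance(pred_joints_3d, gt_joints_3d)
    if gt_joints_2d is not None:
        # 2D term: project the prediction and compare in the image plane.
        projected = [project(j) for j in pred_joints_3d]
        loss += w_2d * mean_squared_distance(projected, gt_joints_2d)
    return loss


pred = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]

# Sample with only 2D annotations that match the projection exactly.
loss_2d_only = mixed_loss(pred, gt_joints_2d=[project(j) for j in pred])

# Sample with 3D annotations where the second joint is off by 1 unit in depth.
loss_3d_only = mixed_loss(pred, gt_joints_3d=[(0.0, 0.0, 2.0), (0.5, 0.0, 3.0)])

print(loss_2d_only, loss_3d_only)
```

In a differentiable framework, both terms would be computed on tensors so that gradients flow back through the projection and the body model to the network weights; the plain-Python version above only shows how the two kinds of annotation enter the objective.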

In [2], we introduced one of the first methods for 3D human pose estimation of multiple people. Estimating the poses of multiple people in 3D requires novel architectures and output representations in order to deal with the varying number of people and with occlusions. We introduce occlusion-robust pose-maps (ORPM), which enable full body pose inference even under strong partial occlusions by other people and objects in the scene. The key idea is to output a fixed number of maps that encode the 3D joint locations of all people in the scene; these are associated with person identities in a second stage (see Figure-Bottom). This makes it possible to estimate the poses of multiple people at once without explicit bounding box detection.
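
The fixed-size output can be pictured with the simplified sketch below. It is an assumption-laden illustration rather than the ORPM formulation of [2]: the data layout, the function `read_poses`, and the toy values are hypothetical, and the redundancy that makes ORPM robust to occlusion is left out. Each joint type has one triple of maps storing x, y, and z coordinates, and each person's pose is read out at that person's 2D joint pixels, so the number of maps does not grow with the number of people.

```python
def read_poses(location_maps, people_pixels):
    """Read one 3D pose per person from shared per-joint location maps.

    location_maps[j] is a (map_x, map_y, map_z) triple of 2D grids for joint j.
    people_pixels[p][j] is the (row, col) pixel of joint j for person p.
    Returns, for each person, a list of (x, y, z) joint positions.
    """
    poses = []
    for pixels in people_pixels:
        pose = []
        for j, (row, col) in enumerate(pixels):
            map_x, map_y, map_z = location_maps[j]
            pose.append((map_x[row][col], map_y[row][col], map_z[row][col]))
        poses.append(pose)
    return poses


HEIGHT, WIDTH = 2, 3


def blank_map():
    return [[0.0] * WIDTH for _ in range(HEIGHT)]


# Toy example: a single joint type and two people in one image.
map_x, map_y, map_z = blank_map(), blank_map(), blank_map()
map_x[0][0], map_y[0][0], map_z[0][0] = 0.1, 0.2, 3.0    # person 0
map_x[1][2], map_y[1][2], map_z[1][2] = -0.4, 0.3, 5.0   # person 1

location_maps = [(map_x, map_y, map_z)]
people_pixels = [[(0, 0)], [(1, 2)]]  # assumed output of the association stage

poses = read_poses(location_maps, people_pixels)
print(poses)
```

Adding a third person to this toy scene would add an entry to `people_pixels` but leave `location_maps` the same size, which is the property that removes the need for per-person bounding boxes.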


[1] Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation
M. Omran, C. Lassner, G. Pons-Moll, P. Gehler and B. Schiele
3DV 2018, International Conference on 3D Vision, 2018
Oral, 3DV Best Student Paper Award

[2] Single-Shot Multi-person 3D Pose Estimation from Monocular RGB
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll and C. Theobalt 
3DV 2018, International Conference on 3D Vision, 2018

[3] In the Wild Human Pose Estimation using Explicit 2D Features and Intermediate 3D Representations
I. Habibie, W. Xu, D. Mehta, G. Pons-Moll and C. Theobalt
CVPR 2019, IEEE Conference on Computer Vision and Pattern Recognition, 2019
Oral