3D Human Pose and Shape Estimation

3D Reconstruction and Perception of People

Humans are incredibly good at perceiving people from visual data. Without even thinking about it, we quickly perceive the body shape, posture, facial expressions and clothing of other people. Our research trains machines to perceive people at the same level of detail as humans do.

Current computer vision methods can predict 2D pose and image segmentation because annotated data is available. Predicting 3D human geometry, motion and clothing remains an open problem because the required training data (images paired with their corresponding 3D geometry) is not available.

Our approach to this problem is to infer and learn a powerful representation of people in 3D space. Intuitively, such a representation encodes the machine's mental model of people. Given an image, the inference algorithm should predict the full 3D detail such that the prediction is consistent with learned 3D human shape priors and its projection overlaps with the image observations, see Fig. 1. This opens the door to semi-supervised learning because unlabeled images alone can be used to infer properties of the 3D world.

Figure 1: Self-supervised learning framework with explicit 3D world representations.
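
One way to make the two consistency requirements concrete is as an energy with a data term and a prior term. The following is an illustrative sketch, not the formulation used in the work described here. All symbols are assumptions introduced for this example: I is the input image, M(θ, β) is a parametric 3D body model with pose θ and shape β, π is a camera projection, E_data measures the mismatch between the projected model and the image observations, E_prior penalizes implausible bodies, and λ balances the two terms.

```latex
\begin{equation}
  (\hat{\theta}, \hat{\beta}) = \arg\min_{\theta, \beta}\;
    E_{\mathrm{data}}\bigl(\pi(M(\theta, \beta)),\, I\bigr)
    + \lambda\, E_{\mathrm{prior}}(\theta, \beta)
\end{equation}
```

Because the data term only compares the projection against the image, it can be evaluated on images that have no 3D annotations.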

Following this paradigm, we introduced methods to reconstruct 3D human shape and pose from images, human shape and clothing from videos, and non-rigid deformations from video.

Human Pose and Shape Estimation from Images and Video: We introduced Neural Body Fitting (NBF), which integrates a statistical 3D body model (SMPL) within a CNN, leveraging reliable bottom-up semantic body part segmentation and robust top-down body model constraints (see figure, top). NBF is fully differentiable and can be trained using self-consistency: the predicted 3D world must match the 2D images. This makes it possible to learn about 3D humans from images alone, see Fig. 2.

Figure 2: Algorithms to infer pose, shape, 3D geometry, appearance and clothing from images.
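
The self-consistency idea can be illustrated with a reprojection loss: 3D joints predicted by a model are projected into the image and compared with 2D observations. The following is a minimal NumPy sketch under stated assumptions, not the loss used in NBF. The pinhole camera, the function names `project_points` and `reprojection_loss`, and the per-joint confidence weights are all hypothetical choices made for this example.

```python
import numpy as np


def project_points(points_3d, focal_length, principal_point):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    points_3d = np.asarray(points_3d, dtype=float)
    depth = points_3d[:, 2:3]
    if np.any(depth <= 0):
        raise ValueError("all points must lie in front of the camera (z > 0)")
    # Perspective divide, then scale by the focal length and shift to the image center.
    return focal_length * points_3d[:, :2] / depth + np.asarray(principal_point, dtype=float)


def reprojection_loss(points_3d, keypoints_2d, confidence, focal_length, principal_point):
    """Confidence-weighted mean squared pixel distance between projected and observed joints."""
    projected = project_points(points_3d, focal_length, principal_point)
    squared_distance = np.sum((projected - np.asarray(keypoints_2d, dtype=float)) ** 2, axis=1)
    confidence = np.asarray(confidence, dtype=float)
    return float(np.sum(confidence * squared_distance) / np.sum(confidence))


# Hypothetical example: three 3D joints (in meters) and an assumed camera.
joints_3d = np.array([[0.0, 0.0, 2.0], [0.5, -0.5, 2.0], [-1.0, 1.0, 4.0]])
focal_length = 1000.0
principal_point = (320.0, 240.0)

# Simulated 2D detections: the exact projections shifted by (3, 4) pixels.
detections_2d = project_points(joints_3d, focal_length, principal_point) + np.array([3.0, 4.0])
loss = reprojection_loss(joints_3d, detections_2d, np.ones(3), focal_length, principal_point)
print(loss)
```

In a learning setting, the 3D joints would come from a differentiable body model driven by network outputs, and the same loss would be written in an automatic differentiation framework so that its gradient can flow back to the network. That step is omitted here.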

Clothing: Understanding human behavior is not only about motion and body shape. The type of clothing people wear is another form of expression. People use clothing to express their political views, age, gender or social status. Instead of inferring body pose and shape while being invariant to clothing, we aim to perceive and capture human body shape along with clothing (category, appearance and shape) from images. We have introduced the first algorithms to reconstruct humans including their 3D clothing from video. Our recent work makes it possible to predict body shape and clothing separately from a few images, giving full control over the predictions, see Fig. 3.

Gerard Pons-Moll

DEPT. Computer Vision and Machine Learning
Phone +49.681.9325-2135
Email: gpons@mpi-inf.mpg.de