Human Pose Estimation from Video and IMU

NOTE: For more information and a full list of publications, visit the Real Virtual Humans site:

The recording of human motion is necessary for modelling, understanding, and automatically animating full-body human movement. Traditional marker-based optical Motion Capture (MoCap) systems are intrusive and restrict motion to controlled laboratory spaces; simple daily activities like biking or having coffee with friends cannot be recorded with such systems. Image-based motion capture methods offer an alternative, but they are still not accurate enough and require a direct line of sight to the camera.

To address these issues, and to be able to record human motion in everyday natural situations, we leverage Inertial Measurement Units (IMUs), which measure local orientation and acceleration. IMUs provide cues about human motion without requiring external cameras, which is desirable for outdoor recordings, where occlusions occur often.
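To make the sensor model concrete, the sketch below simulates what an ideal accelerometer reports: the world-frame linear acceleration, with gravity included, rotated into the sensor's local frame. The function name and all numeric values are illustrative, not from the papers above.

```python
import numpy as np

g = np.array([0.0, 0.0, -9.81])  # gravity in the world frame (m/s^2)

def simulate_accelerometer(R_world_sensor, a_world):
    """Hypothetical ideal accelerometer: rotate the world-frame linear
    acceleration (minus gravity) into the sensor's local frame."""
    return R_world_sensor.T @ (a_world - g)

# Sensor aligned with the world frame, accelerating 1 m/s^2 along x:
R = np.eye(3)
a = np.array([1.0, 0.0, 0.0])
print(simulate_accelerometer(R, a))  # [1.   0.   9.81]
```

Even at rest (`a_world = 0`), the sensor reports the gravity vector in its local frame, which is exactly why IMU readings carry orientation cues.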

However, existing IMU systems are intrusive because they require a large number of sensors (17 or more) worn on the body. In previous work [1] (SIP), we demonstrated an optimization-based approach that can recover full-body motion from only 6 IMUs attached to the wrists, lower legs, waist, and head.

While less intrusive, SIP is inherently offline, which rules out many applications. In recent work [2], we present a deep-learning-based real-time algorithm for full-body reconstruction from 6 IMUs alone. We found that propagating information both forward and backward in time is crucial for reconstructing natural human motion, for which we use a bi-directional Recurrent Neural Network, see Figure-Top. To achieve good generalization, we synthesize IMU readings with their corresponding poses, obtained by fitting the SMPL body model to marker-based MoCap datasets.
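The bi-directional idea can be sketched as two recurrent passes over a window of IMU frames, one forward and one backward in time, whose hidden states are concatenated before predicting a pose per frame. This is a minimal NumPy illustration with randomly initialised weights standing in for trained parameters; the layer sizes and a plain tanh RNN cell are assumptions for clarity (the actual DIP model differs).

```python
import numpy as np

# Hypothetical dimensions: 6 IMUs x (3x3 orientation + 3-D acceleration) = 72 inputs;
# an output pose vector of 135 values per frame (illustrative, not DIP's exact sizes).
T, D_IN, D_H, D_OUT = 30, 72, 64, 135
rng = np.random.default_rng(0)

def rnn_pass(x, Wx, Wh, reverse=False):
    """Simple tanh RNN over time; reverse=True scans backward in time."""
    h = np.zeros(D_H)
    out = np.zeros((len(x), D_H))
    steps = range(len(x) - 1, -1, -1) if reverse else range(len(x))
    for t in steps:
        h = np.tanh(x[t] @ Wx + h @ Wh)
        out[t] = h
    return out

# Random weights stand in for trained parameters.
Wx_f, Wh_f = rng.normal(0, 0.1, (D_IN, D_H)), rng.normal(0, 0.1, (D_H, D_H))
Wx_b, Wh_b = rng.normal(0, 0.1, (D_IN, D_H)), rng.normal(0, 0.1, (D_H, D_H))
W_out = rng.normal(0, 0.1, (2 * D_H, D_OUT))

imu_seq = rng.normal(size=(T, D_IN))                  # a window of IMU frames
h_fwd = rnn_pass(imu_seq, Wx_f, Wh_f)                 # information flowing forward in time
h_bwd = rnn_pass(imu_seq, Wx_b, Wh_b, reverse=True)   # ...and backward
poses = np.concatenate([h_fwd, h_bwd], axis=1) @ W_out
print(poses.shape)  # (30, 135): one pose vector per frame
```

Because the backward pass conditions each frame's prediction on future measurements, a real-time system must run it over a short sliding window, trading a little latency for the temporal context.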

In contrast to visual measurements, IMUs cannot provide absolute joint position information, which makes purely IMU-based methods inaccurate for certain types of motion. Hence, in recent work we introduce VIP [3], which combines IMUs with a single moving camera to robustly recover human pose in challenging outdoor scenes. The moving camera, sensor heading drift, cluttered backgrounds, occlusions, and the many people visible in the video make the problem very hard. We associate the 2D pose detections in each image with the corresponding IMU-equipped persons by solving a novel graph-based optimization problem that enforces 3D-to-2D coherency within a frame and across long-range frames. Given these associations, we jointly optimize the pose of the SMPL body model, the camera pose, and the heading drift using continuous optimization.
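The association step within a single frame can be illustrated as an assignment problem: match each IMU-tracked person's predicted 2D projection to the detection that minimises the total cost. The toy sketch below brute-forces the assignment over permutations; all coordinates are made up, and this simple per-frame matching is only a stand-in for the paper's graph-based optimization, which also enforces coherency across frames.

```python
from itertools import permutations
import numpy as np

# Hypothetical data: 2D projections predicted from each IMU-tracked person,
# and 2D pose detections found in the image (order unknown).
imu_proj   = np.array([[100., 200.], [400., 220.], [250., 300.]])
detections = np.array([[405., 218.], [ 98., 203.], [252., 297.]])

def associate(proj, det):
    """Exhaustively search the assignment minimising total 2D distance
    (a toy stand-in for VIP's graph-based association)."""
    best, best_cost = None, np.inf
    for perm in permutations(range(len(det))):
        cost = sum(np.linalg.norm(proj[i] - det[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

print(associate(imu_proj, detections))  # (1, 0, 2): person i -> detection perm[i]
```

Brute force is exponential in the number of people; for larger scenes one would use the Hungarian algorithm, and in VIP the association is additionally constrained by long-range temporal consistency rather than solved frame by frame.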

Using VIP, we collected the 3DPW dataset, which includes videos of humans in challenging scenes together with accurate 3D pose parameters. For the first time, this provides the means to quantitatively evaluate monocular methods in difficult scenes and to stimulate new research in this area, see Figure-Bottom.


[1] Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs
Timo von Marcard, Bodo Rosenhahn, Michael Black, Gerard Pons-Moll
in Computer Graphics Forum 36(2), Proceedings of the 38th Annual Conference of the European Association for Computer Graphics (Eurographics), 349-360, 2017.

[2] Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J. Black, Otmar Hilliges, Gerard Pons-Moll
in ACM Transactions on Graphics (Proc. SIGGRAPH Asia), vol. 37, no. 6, 185:1-185:15, 2018.

[3] Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera
Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, Gerard Pons-Moll
in European Conference on Computer Vision (ECCV), 2018.