Mykhaylo Andriluka & Leonid Pishchulin & Siyu Tang

People Detection and Pose Estimation in Challenging Real-World Scenes

Detection, tracking, and pose estimation of people are key technologies for many applications such as automotive safety, human-computer interaction, robot navigation, and indexing images and videos from the web. At the same time, they are among the most challenging problems in computer vision and remain a scientific challenge in realistic scenes.

Although state-of-the-art methods perform well in simple scenes with walking people, they often fail in scenes with people performing complex activities or in crowded scenes where multiple people frequently occlude each other, partially or fully. In this project, we address these limitations by building on recent advances in generic object detection and computer graphics.

Training computer vision models from synthetically generated images

One of the key ingredients for the success of state-of-the-art methods is the ability to automatically learn the appearance of people from a collection of training images. We investigate how 3D human shape models from computer graphics can be leveraged to obtain synthetic training data suitable for training models in computer vision. We rely on a recent statistical model of 3D human shape and pose learned from a large collection of human body scans. Our approach allows us to directly control data variability while covering the major shape and pose variations of humans, which are often difficult to capture when manually collecting real-world training images. Our method can generate synthetic images of people either by sampling directly from the 3D shape model or by automatically reshaping real images of people. We validate the effectiveness of our approach on the tasks of articulated human detection and articulated pose estimation. The results indicate that our automatically generated synthetic data noticeably increases performance on both tasks.

Figures 1 and 2:
Example results obtained with our pose estimation model trained on a combination of real and synthetically generated images
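The sampling step described above can be illustrated with a minimal sketch. A statistical body shape model of this kind is typically a mean mesh plus a PCA basis; drawing coefficients from a Gaussian matched to the training scans yields plausible new body shapes whose variability we control directly. All names and dimensions below are illustrative assumptions, not the actual model from this project.

```python
import numpy as np

# Illustrative model dimensions (hypothetical, not the real model's).
N_VERTS = 500          # number of mesh vertices
N_SHAPE_COEFFS = 10    # number of PCA shape coefficients

# Stand-in model parameters; in practice these come from fitting PCA
# to a large collection of registered body scans.
rng = np.random.default_rng(0)
mean_shape = rng.standard_normal((N_VERTS, 3))           # mean body mesh
shape_basis = rng.standard_normal((N_VERTS, 3, N_SHAPE_COEFFS))
shape_stddev = np.linspace(1.0, 0.1, N_SHAPE_COEFFS)     # per-component std

def sample_body_shape(rng):
    """Draw PCA coefficients and reconstruct a body mesh.

    Sampling coefficients from a Gaussian whose per-component variance
    matches the training scans covers the major shape variations while
    keeping each sample plausible.
    """
    beta = rng.standard_normal(N_SHAPE_COEFFS) * shape_stddev
    # Linear shape model: mean + basis contracted with coefficients.
    return mean_shape + shape_basis @ beta   # -> (N_VERTS, 3) mesh

mesh = sample_body_shape(np.random.default_rng(1))
print(mesh.shape)  # (500, 3)
```

Each sampled mesh would then be rendered (or used to reshape a real photograph) to produce one synthetic training image; scaling `shape_stddev` up or down is one knob for controlling data variability.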


Joint detection and tracking of people in crowded scenes

We apply the idea of synthetic training data generation to the problem of people detection and tracking in crowded scenes. We combine it with the observation that, under partial occlusion, jointly detecting pairs of people is often easier than detecting each person individually. We propose a joint detector that relies on common patterns of person-person occlusion and incorporates them as features for detection. Once a joint configuration of people is detected, it can then be decoded into detections of the individual persons. However, configurations of two people also exhibit increased appearance variation, which is known to make detection more difficult. We compensate for this by synthetically generating examples with varying degrees of partial occlusion and mutual arrangements of people. We show that our joint detector significantly increases detection performance in crowded scenes and leads to improved people tracking.

Figures 3 and 4:
Example results obtained with our method for joint detection and tracking of people in crowded scenes
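The decoding step described above can be sketched as follows. One simple way to realize it is to associate each learned person-person occlusion pattern with the relative geometry of the two individual bounding boxes, normalized to the joint box; decoding then maps a joint detection back to two person detections. The pattern names and coordinates below are hypothetical illustrations, not the patterns learned in this project.

```python
# Each occlusion pattern stores two person boxes as (x, y, w, h),
# expressed in coordinates normalized to the joint bounding box.
# These two patterns are made-up examples for illustration only.
OCCLUSION_PATTERNS = {
    "side_by_side": [(0.0, 0.0, 0.55, 1.0), (0.45, 0.0, 0.55, 1.0)],
    "front_back":   [(0.1, 0.0, 0.6, 1.0),  (0.35, 0.1, 0.6, 0.9)],
}

def decode_joint_detection(joint_box, pattern, score):
    """Map one joint two-person detection to two individual detections.

    joint_box: (x, y, w, h) of the detected pair in image coordinates.
    pattern:   which occlusion pattern the joint detector fired on.
    score:     the joint detection score, carried over to both persons.
    """
    jx, jy, jw, jh = joint_box
    people = []
    for rx, ry, rw, rh in OCCLUSION_PATTERNS[pattern]:
        people.append({
            # De-normalize the stored relative box into image coordinates.
            "box": (jx + rx * jw, jy + ry * jh, rw * jw, rh * jh),
            "score": score,  # per-person scores could also be recalibrated
        })
    return people

for person in decode_joint_detection((100, 50, 80, 160), "side_by_side", 0.9):
    print(person["box"])
```

In a full system the decoded individual detections would then be merged with single-person detections (e.g. via non-maximum suppression) before being linked into tracks.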

Mykhaylo Andriluka

DEPT. 2 Computer Vision and Multimodal Computing
Phone +49 681 9325-2119

Leonid Pishchulin

DEPT. 2 Computer Vision and Multimodal Computing
Phone +49 681 9325-1208

Siyu Tang

DEPT. 2 Computer Vision and Multimodal Computing
Phone +49 681 9325-2009