Graphics, Vision & Video

Leader of the group: Prof. Dr. Christian Theobalt

Vision and Research Strategy

The research group Graphics, Vision & Video investigates fundamental algorithmic questions that lie on the boundary between computer vision and computer graphics. Research problems in graphics and vision are increasingly converging, and we investigate how to systematically combine the insights from computer vision on reconstruction and scene interpretation with the expertise in graphics on the forward process of efficiently simulating and representing complex virtual scene models. In conjunction, this will lead to scene reconstruction, scene interpretation and scene synthesis algorithms of previously unseen robustness, efficiency, accuracy, and generality. The new algorithms will benefit research in computer vision and graphics, but also in other domains such as human-computer interaction.

Research Areas and Achievements

Performance Capture

Performance capture, or 4D reconstruction, is the process of capturing detailed, space-time coherent dynamic 3D geometry, motion and appearance models of real-world scenes from video recordings or from the data of alternative sensors. It has a rapidly growing number of applications in vision and graphics, but also in robotics, telepresence, 3D TV production, as well as the re-emerging fields of virtual and augmented reality. In the past, the GVV group made several fundamental contributions to advance the field, and co-developed some of the first approaches to capture dynamic shape models of humans in loose, general everyday apparel. Unfortunately, most state-of-the-art performance capture approaches are still very limited: they only succeed in indoor environments with controlled and constant background and lighting, and they require dense camera systems. They are also limited to capturing simple scenes with single subjects and few complex topology changes, and they deliver scene models of limited geometric detail. Current approaches are thus far from being able to capture highly detailed models of real-world scenes in general outdoor environments where lighting and appearance are complex and time-varying, where scene content is highly dynamic and complex, and where only very few cameras can be used.

We therefore started to rethink the algorithmic foundations of 3D and 4D reconstruction to come closer to the latter goal. In particular, we design new scene representations and optimization methods for 4D reconstruction that enable, for the first time, marker-less skeleton capture at high quality in general environments, of humans and animals, in real time, and with only a few, possibly hand-held, cameras. This technology is also the foundation of our spin-off, The Captury GmbH. We also develop new methods and representations for real-time articulated hand motion capture with RGB and depth cameras at previously unseen accuracy. We also research new paradigms for hand-gesture-based human-computer interaction.
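As a toy illustration of the non-linear least-squares fitting at the core of such skeletal capture methods, the following sketch fits the two joint angles of a planar two-bone chain to detected joint positions. The bone lengths, the synthetic detections and the use of scipy's generic solver are illustrative assumptions, not the actual pipeline, which solves far larger problems with full 3D skeletons, image-based data terms and many cameras.

    import numpy as np
    from scipy.optimize import least_squares

    L1, L2 = 0.30, 0.25  # assumed bone lengths in metres (illustrative)

    def forward_kinematics(theta):
        # Positions of the "elbow" and "wrist" of a planar two-bone chain rooted at the origin.
        a1, a2 = theta
        elbow = np.array([L1 * np.cos(a1), L1 * np.sin(a1)])
        wrist = elbow + np.array([L2 * np.cos(a1 + a2), L2 * np.sin(a1 + a2)])
        return np.concatenate([elbow, wrist])

    def residuals(theta, observed):
        # Data term: predicted minus detected joint positions.
        return forward_kinematics(theta) - observed

    observed = np.array([0.21, 0.21, 0.25, 0.44])  # synthetic joint detections
    fit = least_squares(residuals, x0=np.zeros(2), args=(observed,))
    print("estimated joint angles (rad):", fit.x)

The real capture systems minimize the same kind of geometric data term, but over dozens of joint parameters per frame and with additional pose priors and temporal smoothness terms.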

Additionally, we further push the reconstruction accuracy and capability of dynamic surface capture methods. We developed some of the first methods for performance capture of multiple people in close interaction. We also investigate new ways of combining reconstruction methods from vision with the efficient modeling of light transport in computer graphics. This enabled us to formulate some of the first methods for inverse rendering in general outdoor scenes that estimate detailed models of incident lighting, appearance and detailed geometry from video. These new inverse rendering concepts benefit solutions to a variety of fundamental computer vision problems. They enable shape-from-shading-based surface detail reconstruction in uncontrolled scenes at extremely high detail. They also enable us to capture, for the first time, relightable dynamic scene models in uncontrolled environments, and they allow us to greatly increase the robustness of correspondence finding and pose optimization approaches. By this means we were able to develop some of the first approaches for high-detail full-body and facial performance capture from a single stereo camera view. Enhanced versions of these methods even allow monocular performance capture at extremely high accuracy, stability and geometric detail. We further investigate new ways to learn semantically meaningful low-dimensional representations of captured 4D animations that enable convenient modification of reconstructed scenes, and that open up new ways of user-designed motion mapping between real-time motion sensors and arbitrarily shaped virtual characters.
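To make the inverse rendering idea more concrete, a typical shading-based energy of this kind (a general sketch, not necessarily the exact formulation used in the works above) explains the observed image I as per-pixel albedo \rho times low-frequency incident lighting expressed in spherical harmonics, and refines the geometry G so that the predicted shading matches the image:

    E(G, \rho, l) = \sum_x \Big( I(x) - \rho(x) \sum_{k=1}^{9} l_k H_k\big(n_G(x)\big) \Big)^2 + \lambda_{geo}\, E_{smooth}(G) + \lambda_{alb}\, E_{reg}(\rho)

Here n_G(x) is the surface normal at pixel x, the H_k are the first nine spherical harmonics basis functions, the l_k are lighting coefficients, and the regularization terms keep the geometry smooth and the albedo piecewise constant. Jointly minimizing such an energy over lighting, albedo and geometry is what lets shading cues add fine surface detail to a coarse reconstruction, even when the illumination is unknown.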

Advanced User-centric Video Processing

The abundance of mobile devices with video cameras has made video footage a widely available commodity. In contrast to photography, however, truly powerful tools to support users in capturing and editing videos, and in exploring large video corpora, are not yet available. Video capture is also far more complex than photography, as it requires diligent shot planning. Since everyday users lack these skills, many videos are badly made and require sophisticated edits to make the already captured footage pleasing to watch. Such video editing tasks, however, often have a higher-level semantic component, which leads to difficult algorithmic problems not supported by existing video editing tools that follow an image-filter-on-a-timeline paradigm. We therefore develop new algorithms to enable automatic video edits of previously unseen complexity and to enable new ways of identifying and visualizing the spatio-temporal relations between videos in community video collections. The algorithmic basis is formed by new methods to identify and use inter- and intra-video relationships that are robust under a variety of scene and recording conditions. This also leads to new machine learning approaches that facilitate structuring and exploring large databases of visual content, and that allow for new ways of data-driven image and video enhancement.
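As a minimal, deliberately simplified example of what an inter- or intra-video relationship can look like computationally (a plain colour-histogram similarity, not the relationship measures developed in this line of work), the sketch below scores how alike two frames are; robust variants of such per-frame measures, aggregated over time, are what make it possible to link shots within and across videos.

    import numpy as np

    def colour_histogram(frame, bins=8):
        # Joint RGB histogram of a frame, normalized to sum to one.
        hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        return hist / hist.sum()

    def frame_similarity(frame_a, frame_b):
        # Histogram intersection: 1.0 for identical colour statistics, 0.0 for disjoint ones.
        return np.minimum(colour_histogram(frame_a), colour_histogram(frame_b)).sum()

    # Synthetic frames for illustration: the second is a slightly perturbed copy of the first.
    rng = np.random.default_rng(1)
    a = rng.integers(0, 256, size=(120, 160, 3))
    b = np.clip(a + rng.integers(-10, 10, size=a.shape), 0, 255)
    print("similarity:", frame_similarity(a, b))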

Algorithms for 3D and 4D Reconstruction with New Sensors

We also develop new algorithms to work with the latest generation of depth cameras, such as the Kinect or time-of-flight cameras, which have seen increasing popularity as sensors for 3D and 4D reconstruction, as well as for scene interpretation and human-computer interaction (Sect. 41.8). Due to their ability to jointly capture color and depth, these sensors are also often referred to as RGBD cameras. State-of-the-art RGBD cameras have unique characteristics that need to be fundamentally addressed in any algorithm using their data. Even though they capture dense depth at video rate without interfering with the scene in the visual spectrum, they have low resolution, considerable noise, and non-trivial systematic distortions. We contribute new noise characterization and calibration, noise reduction, super-resolution and sensor fusion techniques for data enhancement. We also work on methods for robust monocular skeleton reconstruction with single depth sensors, as well as with a combination of a single depth camera and a sparse set of inertial sensors. Recently, we developed one of the first approaches for real-time dense deformable mesh tracking from a single RGBD camera. We also presented one of the first approaches in the literature for real-time inverse rendering and shape-from-shading-based refinement of RGBD data of general uncontrolled scenes. These works extend concepts developed in our performance capture research by adapting them to the depth sensor characteristics, and by making use of extremely fast GPU-based non-linear optimization methods to cope with the very high-dimensional optimization problems.
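As a small illustration of noise-aware depth data enhancement, the sketch below fuses several noisy depth frames with a per-pixel weighted average whose weights come from an assumed quadratic depth noise model; the constants and the noise model are illustrative placeholders, not a calibrated characterization of any particular sensor or of the methods described above.

    import numpy as np

    def noise_sigma(depth):
        # Assumed noise magnitude in metres, growing quadratically with distance.
        return 1.5e-3 * depth**2 + 1e-3

    def fuse(frames):
        # Per-pixel weighted average of depth frames; weight = inverse noise variance.
        acc = np.zeros_like(frames[0])
        weight = np.zeros_like(frames[0])
        for d in frames:
            valid = d > 0                                   # 0 marks missing measurements
            w = np.where(valid, 1.0 / noise_sigma(d)**2, 0.0)
            acc += w * d
            weight += w
        return np.where(weight > 0, acc / np.maximum(weight, 1e-9), 0.0)

    # Example: three synthetic 240x320 depth frames of a surface at roughly 1.5 m.
    rng = np.random.default_rng(0)
    frames = [1.5 + 0.01 * rng.standard_normal((240, 320)) for _ in range(3)]
    print("mean fused depth:", fuse(frames).mean())

Production systems combine this kind of per-pixel weighting with spatial filtering, super-resolution and cross-sensor fusion, and run the much larger resulting optimization problems on the GPU.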