The research group Graphics, Vision & Video investigates fundamental algorithmic questions at the intersection of computer graphics, computer vision, and machine learning. We investigate new ways to systematically combine the insights of computer vision on reconstruction and scene interpretation with the expertise of graphics on the forward process of efficiently simulating and representing complex visual scene models. We also explore architectural synergies between these visual computing paradigms and machine learning concepts, including new ways to deeply integrate both in new types of end-to-end trainable architectures. We strive for new algorithms that can learn and jointly refine their algorithmic structure and employed model representations on large corpora of sparsely labeled or unlabeled real-world data. Driving these concepts forward will lead to 3D and 4D scene reconstruction, scene interpretation, and scene synthesis algorithms of previously unseen robustness, efficiency, accuracy, generality, scalability, semantically meaningful controllability, and explainability. The new methods will advance computer graphics and computer vision in general, and provide new insights relevant to machine learning and human-computer interaction. Their relevance, however, extends into a rapidly growing number of more general research and application areas that will profoundly transform our future lives. For example, they will revolutionize creative technology for creating and editing visual content, as well as future virtual and augmented realities. They will also lay important foundations for essential, currently unavailable visual perception abilities needed by future intelligent computational and autonomous systems that must understand and interact with the general human world.
The GVV group investigates basic algorithmic problems in four primary research areas.
Reconstructing the Real World in Motion
We advanced the state of the art in marker-less human motion and performance capture, as well as in general dense 4D reconstruction of the real world in motion, fields to which we made important contributions in the past. Our advances span several dimensions: the generality of scenes that can be handled, accuracy and quality, efficiency and robustness, and the simplicity of the sensors needed. Notably, we researched entirely new ways of fusing and deeply integrating model-based and deep-learning-based scene reconstruction.
For instance, our team developed entirely new approaches that fuse machine learning and generative model-based reconstruction for state-of-the-art multi-view marker-less human motion capture. These approaches succeed in outdoor scenes with challenging lighting while using a very low number of sensors. We also presented the first approach for real-time 3D capture of full-body motion from a single color camera. It combines a new CNN, which uses a tailored location-map scene representation, with a model-based skeleton-fitting approach. Our work also made egocentric real-time pose estimation with a head-mounted fisheye camera feasible for the first time. We further extended these concepts to new methods for multi-person 3D pose estimation from monocular video that, by means of a new pose-map formulation in a CNN, achieve state-of-the-art accuracy at previously unseen real-time performance. The group also contributes widely used data sets (MPI-INF-3DHP, MuCo-3DHP, MarCONI) to train and test monocular 3D pose estimation methods.
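The location-map idea can be sketched as follows: the CNN predicts, per joint, a 2D confidence heatmap plus three location maps that store root-relative x, y, and z coordinates at every pixel; the 3D joint position is then read off at the heatmap maximum. Below is a minimal illustrative sketch; the array names and shapes are our assumptions, not the published implementation:

```python
import numpy as np

def read_3d_pose(heatmaps, loc_x, loc_y, loc_z):
    """Read per-joint 3D positions from CNN outputs.

    heatmaps, loc_x/y/z: arrays of shape (J, H, W) -- per-joint 2D
    confidence maps and three location maps holding root-relative
    x/y/z coordinates at every pixel (names are illustrative).
    """
    num_joints = heatmaps.shape[0]
    pose = np.zeros((num_joints, 3))
    for j in range(num_joints):
        # 2D joint position = argmax of the confidence heatmap.
        v, u = np.unravel_index(np.argmax(heatmaps[j]), heatmaps[j].shape)
        # 3D position = location-map values sampled at that pixel.
        pose[j] = (loc_x[j, v, u], loc_y[j, v, u], loc_z[j, v, u])
    return pose
```

The appeal of this readout is that the fully convolutional network never has to regress 3D coordinates directly; it only has to write consistent values into the location maps near each joint's image position.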
In the past, we made important contributions to marker-less 3D hand tracking. In the reporting period, we developed the first approach to capture, in real time, a 3D hand interacting with objects in cluttered scenes using a single depth camera. We also developed a pioneering approach for real-time 3D hand motion capture from a single color camera. Further, we presented new methods for human-computer interaction and on-body interaction based on hand tracking.
We presented new robust formulations for general deformable shape capture from monocular video, as well as the first methods to reconstruct high-quality static shape and texture of a human template model from monocular color video. The group developed new formulations for high-quality multi-view performance capture of humans in general apparel that succeed in more uncontrolled scenes. Our research also resulted in the first approach for full performance capture of the 3D pose and deforming surface geometry of a human in general everyday apparel from a single color video; it combines model-based and deep-learning-based reconstruction. In follow-up work, we presented the first method for full 3D human performance capture from monocular video that even runs in real time.
We also presented widely cited methods to capture dynamic face geometry, expression, appearance, and illumination from monocular video. An entirely new way of integrating a CNN and a model-based reconstruction approach in an end-to-end trainable architecture enabled face reconstruction from single images at previously unseen speed and accuracy, while even enabling training on unlabeled face images. Follow-up work showed how a full parametric face model can be learned from community image and video data.
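To illustrate the model-based side of such an architecture, the following sketch fits the coefficients of a generic linear shape model (for instance, a PCA basis) to observed 2D landmarks under an assumed orthographic camera. It is a didactic stand-in, not the group's published method, and all names and conventions are our assumptions:

```python
import numpy as np

def fit_landmarks(mean, basis, landmarks_2d, reg=1e-2):
    """Fit coefficients c of a linear shape model to 2D landmarks.

    mean:         (3N,)   mean landmark positions, stacked x,y,z per point
    basis:        (3N, K) linear shape basis (e.g. PCA directions)
    landmarks_2d: (N, 2)  observed image landmarks
    Assumes a fixed orthographic camera that simply keeps x and y.
    Minimizes ||P(mean + basis @ c) - landmarks||^2 + reg * ||c||^2,
    which is a linear least-squares problem in c.
    """
    n_points = landmarks_2d.shape[0]
    # Orthographic projection: keep only the x and y row of each point.
    keep = np.arange(3 * n_points).reshape(n_points, 3)[:, :2].ravel()
    A = basis[keep]                        # (2N, K)
    b = landmarks_2d.ravel() - mean[keep]  # (2N,)
    # Ridge-regularized normal equations keep the fit well-posed.
    k = basis.shape[1]
    return np.linalg.solve(A.T @ A + reg * np.eye(k), A.T @ b)
```

In a full analysis-by-synthesis system, terms of this kind are combined with photometric and regularization energies and optimized jointly with pose and illumination; here only the landmark term is shown for clarity.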
We also maintain one of the largest repositories of reference data sets for general static and dynamic scene reconstruction, and marker-less motion and performance capture.
Large-scale, High-quality 3D Reconstruction with Lightweight Sensors
In the reporting period, we continued to contribute state-of-the-art approaches for high-quality static scene reconstruction with single color and depth cameras. We presented a new, widely used algorithm for real-time dense 3D scanning and bundle adjustment with an RGB-D camera, and one of the first methods to scan very thin geometric structures with a consumer RGB-D camera by using geometric curve structures as fusion primitives. We further developed a new dictionary-based approach to 3D geometry denoising, as well as a state-of-the-art method for object retrieval and pose estimation from single color images.
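Volumetric fusion of depth observations, the classic building block behind such real-time dense RGB-D scanning pipelines, can be sketched as a per-voxel running average of truncated signed distances. The following sketch uses assumed array conventions and is not the API of any particular system:

```python
import numpy as np

def fuse_depth(tsdf, weight, sdf_obs, trunc=0.05):
    """One weighted-average TSDF fusion step (illustrative sketch).

    tsdf, weight: current truncated signed distance field and its
                  per-voxel integration weight.
    sdf_obs:      signed distance to the surface observed in the new
                  depth frame, one value per voxel (NaN = unobserved).
    Classic running-average update used by volumetric RGB-D fusion;
    parameter names are ours, not a specific system's API.
    """
    obs = np.clip(sdf_obs, -trunc, trunc)          # truncate the SDF
    seen = ~np.isnan(sdf_obs)                      # observed voxels
    new_w = weight + seen                          # weight grows where seen
    blended = tsdf * weight + np.where(seen, obs, 0.0)
    tsdf_out = np.where(seen, blended / np.maximum(new_w, 1), tsdf)
    return tsdf_out, new_w
```

Because the update is independent per voxel, it parallelizes trivially, which is why such fusion schemes map well onto GPUs and run in real time.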
Computational Videography and Inverse Rendering
The group continued to contribute new methods for advanced video processing and computational videography. For instance, we showed that combining advanced monocular performance capture with new methods for neural-network-based video synthesis enables entirely new ways of generating and editing human face portrait videos, as well as videos of entire humans, at previously unseen photorealism and computational performance. Beyond their use in creative media generation, our approaches also pave the way for new photo-realistic VR and AR experiences, such as photoreal headset removal for VR telepresence, or human avatar reenactment for video-realistic telepresence. We also developed new advanced approaches for inverse rendering, i.e., the estimation of lighting and reflectance from monocular color and depth video. Examples are the first real-time method for BRDF estimation from monocular RGB video, as well as the first real-time approach for live user-guided intrinsic video.
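Intrinsic decomposition factors each frame into reflectance times shading, I = R · S. As a didactic stand-in for the real-time user-guided method, the following Retinex-style sketch assumes shading varies smoothly and estimates it by blurring the luminance; all parameters and names are illustrative:

```python
import numpy as np

def intrinsic_decompose(img, blur=5):
    """Toy intrinsic decomposition: img = reflectance * shading.

    img: (H, W, 3) array with values in (0, 1].
    Estimates shading as a box-blurred luminance (smooth-shading
    assumption) and recovers reflectance as img / shading. A didactic
    Retinex-style sketch, far simpler than a real-time user-guided
    system.
    """
    lum = img.mean(axis=2)
    # Separable box blur via 1D convolutions on an edge-padded image.
    k = blur
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    pad = np.pad(lum, k, mode='edge')
    sm = np.apply_along_axis(lambda r: np.convolve(r, kernel, 'same'), 1, pad)
    sm = np.apply_along_axis(lambda c: np.convolve(c, kernel, 'same'), 0, sm)
    shading = np.clip(sm[k:-k, k:-k], 1e-4, None)
    reflectance = img / shading[..., None]
    return reflectance, shading
```

Real intrinsic video methods replace the naive smoothness prior with temporally consistent, edge-aware energies (and, in the user-guided case, sparse scribble constraints), but the multiplicative factorization itself is the same.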
Foundational Algorithms for Visual Computing and Machine Learning
We investigate foundational algorithms of cross-cutting relevance for real-world reconstruction, real-world synthesis, and inverse rendering. For instance, we developed new scalable and efficient approaches for solving general matching problems, as well as new programming tools to implement efficient GPU solvers for high-dimensional non-convex energy minimization problems in visual computing. We also investigate new ways to design memory-efficient yet performant learning architectures, in particular CNNs, new ways to better understand and explain their emergent structural properties, and new ways to combine them with expert-designed algorithms and model representations in end-to-end trainable architectures.
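Many of these visual computing energies take the non-linear least-squares form E(x) = ||r(x)||^2, which is commonly attacked with Gauss-Newton-type iterations. A minimal CPU sketch of that solver pattern, not the group's GPU solver-generation tooling:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=50):
    """Minimal Gauss-Newton solver for E(x) = ||r(x)||^2.

    residual(x) -> (M,) residual vector r(x)
    jacobian(x) -> (M, N) Jacobian dr/dx
    Didactic sketch: each step solves the normal equations for the
    update dx that minimizes the linearized energy.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        # Normal equations J^T J dx = -J^T r (tiny damping for safety).
        dx = np.linalg.solve(J.T @ J + 1e-9 * np.eye(x.size), -J.T @ r)
        x = x + dx
        if np.linalg.norm(dx) < 1e-10:
            break
    return x
```

As a usage example, fitting y = a * exp(b * t) to noiseless samples converges in a few iterations from a rough initial guess; GPU solver generators essentially emit the same iteration with the linear solve and Jacobian evaluation specialized to the problem structure.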