Graphics, Vision & Video

Vision and Research Strategy

The research group Graphics, Vision & Video investigates challenging research questions at the intersection of computer graphics, computer vision and machine learning. We investigate new ways to systematically combine the insights from computer vision on reconstruction and scene interpretation with the expertise in graphics on the forward process of efficiently simulating and representing complex visual and physical scene models. We also explore architectural synergies between these visual computing paradigms and machine learning concepts, in particular ways to deeply integrate both in new types of end-to-end trainable architectures. We strive for new algorithms that can learn, and jointly refine, their algorithmic structure and the model representations they employ from a continuous inflow of ever more diverse and sparsely labeled real-world data.
Driving these concepts forward will lead to a new generation of 3D and 4D scene reconstruction, scene interpretation and scene synthesis algorithms. They will offer greatly enhanced robustness, efficiency, accuracy, generality, scalability, semantically meaningful controllability, and advanced explainability. The new methods will advance computer graphics and computer vision in general, and provide new insights relevant to machine learning and human-computer interaction. Their relevance also extends to a rapidly growing number of more general research and application areas that will profoundly transform our future lives. For example, they will revolutionize creative technology for creating and editing visual content, as well as future immersive reality and telepresence environments. They will also lay the foundations for the critically needed visual perception abilities of future intelligent computational and autonomous systems that understand and safely interact with the general human world.

Research Areas and Achievements

Reconstructing the Real World in Motion: For many years, our group has made pioneering contributions to the challenging problem of reconstructing shape, motion and deformation, appearance and illumination models of the real world from video recordings. To achieve this goal, we research entirely new ways of fusing and deeply integrating model-based and deep learning-based approaches to obtain reconstruction methods with higher accuracy, robustness, generalizability and efficiency.
Human modeling is a core area of our work in this domain. In the reporting period, we presented several important new methods for marker-less human motion capture that integrate learning-based and model-based concepts in new ways. Our approaches achieve new levels of robustness via self-supervised in-the-wild learning, enable, for the first time, multi-person 3D motion capture in real time from a single color camera, and include the first real-time approach to physics-based monocular 3D human motion capture. We also developed new methodology to reconstruct human body motion from an egocentric head-mounted camera.
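To give a flavor of how model-based and learning-based components interlock in such pipelines, the following minimal sketch (Python/NumPy; the toy kinematic chain and all names are illustrative, not taken from our published systems) fits the joint angles of a two-bone planar skeleton to 2D keypoint detections, as a learned detector would provide them, by minimizing a reprojection-style energy:

    # Minimal sketch: fitting a two-bone planar kinematic chain to 2D
    # keypoints by minimizing a reprojection-style loss. Illustrative only.
    import numpy as np

    BONE_LENGTHS = np.array([1.0, 0.8])  # assumed, fixed bone lengths

    def forward_kinematics(angles):
        """Return the 2D positions of the two joint endpoints."""
        a1, a2 = angles
        j1 = BONE_LENGTHS[0] * np.array([np.cos(a1), np.sin(a1)])
        j2 = j1 + BONE_LENGTHS[1] * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
        return np.stack([j1, j2])

    def loss(angles, detections):
        """Squared distance between model joints and detected keypoints."""
        return np.sum((forward_kinematics(angles) - detections) ** 2)

    def fit(detections, steps=200, lr=0.05, eps=1e-5):
        """Gradient descent with numerical gradients (stand-in for autodiff)."""
        angles = np.zeros(2)
        for _ in range(steps):
            grad = np.zeros(2)
            for i in range(2):
                d = np.zeros(2)
                d[i] = eps
                grad[i] = (loss(angles + d, detections)
                           - loss(angles - d, detections)) / (2 * eps)
            angles -= lr * grad
        return angles

    # Hypothetical "detections", e.g. from a learned 2D keypoint network:
    target = forward_kinematics(np.array([0.6, -0.4]))
    print(fit(target))  # recovers angles close to (0.6, -0.4)

Real capture systems operate on full articulated body models with dozens of degrees of freedom, use learned 2D and 3D evidence, and employ far more sophisticated optimizers, but the analysis-by-synthesis loop follows this pattern.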
We were further able to greatly advance the state of the art in marker-less hand motion capture. The team presented new parametric hand shape and appearance models, as well as new methods enabling, for the first time, motion and surface capture of two interacting hands from a single color or depth camera. We also presented far-reaching new methods, deeply integrating model-based and neural network-based inference, that can be trained in weakly supervised ways and reconstruct, from images or short video clips, shape and appearance models of humans in clothing, in part including a segmentation into individual clothing items. For many years, the GVV team has made pioneering contributions to the area of human performance capture, i.e., the reconstruction of densely deforming 3D surface and appearance models of humans from video. In the reporting period, we presented new methods enabling, for the first time, dense 3D human performance capture from monocular video. Our LiveCap and DeepCap algorithms were the first to achieve this even in real time.
The GVV group further researches the methodological foundations of performance capture of the shape, appearance, illumination and expressions of human faces from monocular video. Our group is widely known for new algorithms that integrate neural network-based components with differentiable face rendering on the basis of parametric scene models. We further advanced these concepts to present new methods that can be trained in an unsupervised way on in-the-wild face imagery, such that both an efficient and accurate reconstruction model and a refined parametric face model emerge.
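The core of such self-supervised capture is an analysis-by-synthesis energy: a parametric face model is decoded and differentiably rendered, compared against the input image, and the gradients update both the per-image parameters and, if desired, the parametric model itself. The following minimal sketch (Python/PyTorch) conveys this structure; the linear model, its dimensions and the simplified projection are illustrative stand-ins, and a full differentiable renderer with illumination and statistical regularizers is omitted:

    # Minimal sketch of analysis-by-synthesis with a linear parametric
    # face model. All names and dimensions are illustrative.
    import torch

    N_VERTS, N_ID = 500, 20
    mean_shape = torch.randn(N_VERTS, 3)             # stand-in mean shape
    id_basis = torch.randn(N_ID, N_VERTS, 3) * 0.01  # stand-in PCA basis
    id_basis.requires_grad_(True)                    # the model itself can be refined

    def decode(alpha):
        """Linear model: shape = mean + sum_k alpha_k * basis_k."""
        return mean_shape + torch.einsum('k,kvd->vd', alpha, id_basis)

    def project(verts, focal=1.2):
        """Simple perspective projection onto the image plane."""
        z = verts[:, 2:3] + 4.0  # push the face in front of the camera
        return focal * verts[:, :2] / z

    # Per-image code alpha, e.g. regressed by a CNN from the input photo.
    alpha = torch.zeros(N_ID, requires_grad=True)
    target_2d = project(decode(torch.randn(N_ID) * 0.5)).detach()  # synthetic target

    opt = torch.optim.Adam([alpha, id_basis], lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        # Landmark/photometric-style energy plus a simple prior on alpha.
        energy = (((project(decode(alpha)) - target_2d) ** 2).mean()
                  + 1e-3 * (alpha ** 2).mean())
        energy.backward()
        opt.step()

Because the energy is differentiable in both the per-image code and the basis, training on large unlabeled image collections can jointly produce a fast reconstruction network and a refined face model.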
Beyond methods for human capture, the group also develops new, widely cited algorithms to capture general deformable surfaces from monocular video. Key insights here were the integration of implicitly learned deformable shape priors into an end-to-end trainable nonrigid structure-from-motion approach, as well as the combination of learned scene features with a model-based deformable capture approach for improved generalization. For many years, the group has also investigated new algorithms to capture models of challenging static scenes from images. Example contributions from the reporting period are a new algorithm to capture scenes with dense networks of thin structures from images by exploiting geometric structure knowledge, as well as a new method to learn a 3D shape model and coarse light transport information of a static scene from community imagery.
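A classical piece of reasoning underlying such deformable capture is the low-rank shape prior of nonrigid structure from motion: if each frame's 3D shape is a linear combination of K basis shapes, the stacked orthographic 2D tracks form a matrix of rank at most 3K. The following minimal sketch (Python/NumPy; all quantities are synthetic) verifies this property numerically; learned shape priors can be seen as replacing the fixed linear basis:

    # Minimal sketch of the low-rank assumption behind many nonrigid
    # structure-from-motion formulations. Illustrative, synthetic data.
    import numpy as np

    rng = np.random.default_rng(0)
    T, P, K = 40, 60, 2
    basis = rng.standard_normal((K, 3, P))   # basis shapes B_k (3 x P)
    coeffs = rng.standard_normal((T, K))     # per-frame coefficients c_tk

    def camera(rng):
        """Random 2x3 orthographic camera (two rows of a rotation)."""
        q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
        return q[:2]

    tracks = np.concatenate([
        camera(rng) @ np.einsum('k,kdp->dp', coeffs[t], basis)
        for t in range(T)])                  # (2T x P) measurement matrix

    s = np.linalg.svd(tracks, compute_uv=False)
    print(s[:3 * K + 2].round(2))            # energy drops sharply after 3K values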
We maintain GVVPerfCapEva, one of the largest online repositories of data sets for static and dynamic scene reconstruction, and marker-less motion and performance capture.

Neural Rendering, Neural Scene Representations, and Computational Videography: The group made pioneering contributions to the area of neural rendering and neural scene representations, an emerging field that investigates how neural network-based scene representations and image formation models for photo-real rendering can be created. The group investigated how traditional explicit, expert-designed scene models and rendering strategies can be combined with neural network-based concepts to achieve solutions of advanced efficiency, realism, and controllability. As one example, the group investigates advanced new methods to combine explicit geometric concepts, such as deformation graphs, with neural scene representations in end-to-end trainable ways. We also contributed a new method to learn implicit appearance fields of scenes from images, on the basis of neural networks embedded in a sparse voxel structure. The approach enables novel view synthesis at previously unseen quality and detail. We further presented entirely new ways to combine sparse monocularly reconstructed models of humans and human faces with neural network-based image synthesis, to create photo-real videos and video modifications of humans and human faces. Such high-quality results can now, for the first time, be created by driving the video result with explicit animation parameters, or even from voice or text input.
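The mechanism common to many implicit appearance field methods is differentiable volume rendering of a coordinate-based network along camera rays. The following minimal sketch (Python/PyTorch; the tiny MLP and all names are illustrative stand-ins, and acceleration structures such as sparse voxel grids are omitted) composites one pixel from samples along a single ray:

    # Minimal sketch of volume-rendering a neural appearance field along
    # one camera ray. Illustrative only.
    import torch

    field = torch.nn.Sequential(             # maps 3D point -> (r, g, b, sigma)
        torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))

    def render_ray(origin, direction, n_samples=64, near=0.5, far=3.0):
        t = torch.linspace(near, far, n_samples)
        points = origin + t[:, None] * direction      # samples along the ray
        out = field(points)
        rgb = torch.sigmoid(out[:, :3])               # per-sample color
        sigma = torch.relu(out[:, 3])                 # per-sample density
        delta = (far - near) / n_samples
        alpha = 1.0 - torch.exp(-sigma * delta)       # opacity per segment
        trans = torch.cumprod(                        # transmittance to each sample
            torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
        weights = alpha * trans                       # contribution per sample
        return (weights[:, None] * rgb).sum(dim=0)    # composited pixel color

    pixel = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
    print(pixel)  # differentiable w.r.t. the field's weights

Because the composited color is differentiable with respect to the network weights, such a field can be optimized directly from posed photographs.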
Researchers from the group have also shown how to combine explicit models of humans and scenes with neural network-based appearance representations in texture space. Examples are a new approach to image-based view interpolation even in strongly specular scenes, as well as far-reaching new methods to encode and re-render the full reflectance field of dynamic human faces or entire human bodies photo-realistically, from novel viewpoints as well as under novel lighting. The aforementioned neural representations also enable new forms of advanced videography, i.e., video synthesis and editing with a much higher level of control than feasible with existing video modification technology. In this context, we also investigate new ways to decompose videos into layers of direct and indirect light transport, which enables advanced new ways of illumination-aware video synthesis and editing.
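The texture-space idea can be summarized compactly: instead of a classical RGB texture, a surface model carries a learned feature texture that is sampled at rasterized UV coordinates and decoded to an image by a small network, with both trained end to end against real footage. A minimal sketch follows (Python/PyTorch; all sizes and names are illustrative, and the UV map here is random rather than rasterized from a mesh):

    # Minimal sketch of a texture-space neural appearance representation.
    # Illustrative only.
    import torch
    import torch.nn.functional as F

    C, RES = 8, 256
    neural_texture = torch.randn(1, C, RES, RES, requires_grad=True)
    decoder = torch.nn.Conv2d(C, 3, kernel_size=1)   # tiny stand-in decoder

    def render(uv):
        """uv: (1, H, W, 2) map in [-1, 1], e.g. rasterized from a mesh."""
        feats = F.grid_sample(neural_texture, uv, align_corners=True)
        return torch.sigmoid(decoder(feats))         # (1, 3, H, W) image

    uv = torch.rand(1, 128, 128, 2) * 2 - 1          # stand-in UV rasterization
    image = render(uv)
    # Training would compare `image` to real video frames and backpropagate
    # into both the decoder and the neural texture itself.
    print(image.shape)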

Foundational Algorithms for Visual Computing and Machine Learning: We investigate foundational algorithms for real-world reconstruction, real-world synthesis and neural scene representation. For instance, we developed new scalable and efficient approaches for solving general matching problems. We also investigate new ways to design memory-efficient yet performant learning architectures, in particular CNNs, new ways to better understand and explain their emergent structural properties, and new ways to combine them with expert-designed algorithms and model representations in end-to-end trainable architectures. The team also develops new algorithms to map visual computing problems onto emerging quantum computing architectures.
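As a concrete instance of the matching problems mentioned above, the following minimal sketch (Python with NumPy and SciPy; the descriptors are synthetic) computes an optimal one-to-one assignment between two small feature sets with the Hungarian algorithm; our scalable solvers and higher-order formulations go well beyond this exact but cubic-time baseline:

    # Minimal sketch of a linear assignment matching problem between two
    # feature sets, solved exactly with the Hungarian algorithm.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(0)
    feats_a = rng.standard_normal((5, 16))   # e.g. descriptors in image A
    feats_b = rng.standard_normal((5, 16))   # e.g. descriptors in image B

    # Pairwise cost: squared Euclidean distance between descriptors.
    cost = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    print(list(zip(rows.tolist(), cols.tolist())))  # optimal one-to-one matching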