The research group Graphics, Vision & Video investigates challenging research questions at the intersection of computer graphics, computer vision and machine learning. We investigate new ways to systematically combine the insights from computer vision on reconstruction and scene interpretation with the expertise in graphics on the forward process of efficiently simulating and representing complex visual and physical scene models. We also explore architectural synergies between these visual computing paradigms and machine learning concepts, particularly new ways to deeply integrate both in new types of end-to-end trainable architectures. We strive for new algorithms that can learn and jointly refine their algorithmic structure and employed model representations on a continuous inflow of ever more diverse and sparsely labeled real-world data.
Driving these concepts forward will lead to a new generation of 3D and 4D scene reconstruction, scene interpretation and scene synthesis algorithms. They will offer greatly enhanced robustness, efficiency, accuracy, generality, scalability, semantically meaningful controllability, and advanced explainability. The new methods will advance computer graphics and computer vision in general, and provide new insights relevant to machine learning and human-computer interaction. But their relevance extends into a rapidly growing number of more general research and application areas that will profoundly transform our future lives. For example, they will revolutionize creative technology to create and edit visual content, as well as future immersive reality and telepresence environments. They will also build foundations for the critically needed visual perception abilities of future intelligent computational and autonomous systems that understand and safely interact with the everyday human world.
Reconstructing the Real World in Motion: For many years, our group has made pioneering contributions to the challenging problem of reconstructing shape, motion and deformation, appearance and illumination models of the real world from video recordings. To achieve this goal, we research entirely new ways of fusing and deeply integrating model-based and deep learning-based scene reconstruction to obtain reconstruction methods with higher performance, accuracy, robustness, generalizability and efficiency.
Human modeling is a core area of our work in this domain. In the reporting period, we presented several important new methods for marker-less human motion capture that integrate learning-based and model-based concepts in new ways. Our approaches achieve new levels of robustness via self-supervised in-the-wild learning, enable, for the first time, multi-person 3D motion capture in real time from a single color camera, and provide the first real-time approach to physics-based monocular 3D human motion capture. We also developed new methodology to reconstruct human body motion from an egocentric head-mounted camera.
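To illustrate the model-based half of such hybrid pipelines, the following is a minimal sketch of fitting a kinematic skeleton to detected 2D keypoints by minimizing reprojection error. The 2-bone chain, bone lengths and camera are toy assumptions, not the group's actual body model; in practice the 2D detections would come from a learned keypoint network rather than from synthetic data.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-segment kinematic chain in the x-z plane; all dimensions are
# illustrative stand-ins, not the group's actual skeleton model.
BONE_LENGTHS = np.array([0.5, 0.4])
FOCAL = 1.0  # toy pinhole focal length

def forward_kinematics(angles):
    """3D joint positions of a planar 2-bone chain rooted at (0, 0, 2)."""
    joints = [np.array([0.0, 0.0, 2.0])]
    theta = 0.0
    for length, a in zip(BONE_LENGTHS, angles):
        theta += a
        offset = length * np.array([np.cos(theta), 0.0, np.sin(theta)])
        joints.append(joints[-1] + offset)
    return np.stack(joints)

def project(points):
    """Pinhole projection of 3D points to 2D."""
    return FOCAL * points[:, :2] / points[:, 2:3]

def reprojection_loss(angles, keypoints_2d):
    """Sum of squared 2D distances between projected and detected joints."""
    return np.sum((project(forward_kinematics(angles)) - keypoints_2d) ** 2)

# Synthetic "detections": project a known ground-truth pose.
gt_angles = np.array([0.3, -0.5])
detections = project(forward_kinematics(gt_angles))

# Model-based fitting: recover joint angles by minimizing reprojection error.
result = minimize(reprojection_loss, x0=np.zeros(2), args=(detections,))
```

In a real system this optimization runs per frame, with many more joints, temporal smoothness terms and, in the physics-based variants, dynamics constraints on top of the purely geometric reprojection objective.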
We were further able to greatly advance the state of the art in marker-less hand motion capture. The team presented advanced new parametric hand shape and appearance models, as well as new methods enabling, for the first time, motion and surface capture of two interacting hands from a single color or depth camera. We also presented far-reaching new methods that deeply integrate model-based and neural network-based inference, can be trained in weakly supervised ways, and are able to reconstruct, from images or short video clips, shape and appearance models of humans in clothing, in some cases including a segmentation into individual clothing items. For many years, the GVV team has made pioneering contributions to the area of human performance capture, i.e., the reconstruction of densely deforming 3D surface and appearance models of humans from video. In the reporting period, we presented new methods enabling, for the first time, dense 3D human performance capture from monocular video. Our LiveCap and DeepCap algorithms were even the first to achieve this at real-time frame rates.
The GVV group further researches the methodological foundations of new methods for performance capture of the shape, appearance, illumination and facial expressions of human faces from monocular video. Our group is widely known for new algorithms integrating neural network-based components and differentiable face rendering components on the basis of parametric scene models. We further advanced these concepts to present new methods that can be trained in an unsupervised way on in-the-wild face imagery, such that both an efficient and accurate reconstruction model and a refined parametric face model emerge.
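The analysis-by-synthesis idea behind such self-supervised face fitting can be sketched with a toy linear parametric model: render an image from model coefficients, compare it photometrically to the observation, and follow the gradient of the photometric loss. The linear "face model" below is an illustrative stand-in for a morphable-model-style representation, not the group's actual pipeline, which uses a full differentiable renderer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear parametric "face model": per-pixel value = mean + basis @ coefficients.
# Dimensions and the model itself are illustrative assumptions.
N_PIXELS, N_COEFFS = 64, 5
mean_face = rng.normal(size=N_PIXELS)
basis = rng.normal(size=(N_PIXELS, N_COEFFS))

def render(coeffs):
    """Differentiable (here: linear) image formation from model parameters."""
    return mean_face + basis @ coeffs

# An "in-the-wild" observation: an image generated by unknown coefficients.
true_coeffs = rng.normal(size=N_COEFFS)
observed = render(true_coeffs)

# Self-supervised fitting: minimize the photometric loss by gradient descent.
coeffs = np.zeros(N_COEFFS)
lr = 0.005
for _ in range(2000):
    residual = render(coeffs) - observed   # photometric error image
    grad = 2.0 * basis.T @ residual        # analytic gradient of the squared loss
    coeffs -= lr * grad
```

The same loop, with a neural network predicting the coefficients and a differentiable renderer replacing the linear map, is what allows both the regressor and the parametric model itself to be refined from unlabeled imagery.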
Beyond methods for human capture, the group also develops new, widely cited algorithms to capture general deformable surfaces from monocular video. Key insights here were the integration of implicitly learned deformable shape priors into an end-to-end trainable non-rigid structure-from-motion approach, as well as the combination of learned scene features with a model-based deformable capture approach for improved generalization. For many years, the group has also investigated new algorithms to capture models of challenging static scenes from images. Example contributions from the reporting period are a new algorithm to capture scenes with dense networks of thin structures from images by exploiting geometric structure knowledge, as well as a new method to learn a 3D shape model and coarse light transport information of a static scene from community imagery.
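The classical structure exploited by low-rank shape priors in non-rigid structure-from-motion can be demonstrated numerically: if every frame's 3D shape is a combination of K basis shapes, the stacked 2D measurement matrix has rank at most 3K, no matter how many frames are observed. The sizes and the random orthographic cameras below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Low-rank assumption of non-rigid structure-from-motion: each frame's shape
# is a linear combination of K basis shapes. Sizes here are illustrative.
K, P, F = 2, 30, 20                        # basis shapes, points, frames
basis_shapes = rng.normal(size=(K, 3, P))

rows = []
for _ in range(F):
    coeffs = rng.normal(size=K)
    shape = np.tensordot(coeffs, basis_shapes, axes=1)  # (3, P) shape this frame
    # Random orthographic camera: first two rows of a random rotation matrix.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    rows.append(q[:2] @ shape)                          # (2, P) projected tracks
W = np.concatenate(rows)                                # (2F, P) measurement matrix

rank = np.linalg.matrix_rank(W)  # at most 3*K, despite 2F = 40 rows
```

Learned priors replace the hand-chosen basis with a network-parameterized deformation space, but the factorization structure that makes monocular reconstruction well-posed is the same.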
We maintain GVVPerfCapEva, one of the largest online repositories of data sets for static and dynamic scene reconstruction, and marker-less motion and performance capture.
Neural Rendering, Neural Scene Representations, and Computational Videography: The group made pioneering contributions to the area of neural rendering and neural scene representations, an emerging field investigating how neural network-based scene representations and image formation models for photo-real rendering can be created. The group investigated how traditional explicit, expert-designed scene models and rendering strategies can be combined with neural network-based concepts to achieve solutions of advanced efficiency, realism, and controllability. As one example, the group investigates advanced new methods to combine explicit geometric concepts, such as deformation graphs, with neural scene representations in end-to-end trainable ways. We also contributed a new method to learn implicit appearance fields of scenes from images on the basis of neural networks in a sparse voxel structure. The approach enables novel view synthesis at previously unseen quality and detail. We also presented entirely new ways to combine sparse monocularly reconstructed models of humans and human faces with neural network-based imagery to create photo-real videos and video modifications of humans and human faces. Such high-quality results can now, for the first time, be created by driving the video result with explicit animation parameters, or even from voice or text input.
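A core operation behind rendering such implicit appearance fields is volume rendering: compositing the densities and colors predicted at sample points along each camera ray. The sketch below uses hand-picked densities and colors as stand-ins for what a trained network (queried, e.g., within a sparse voxel structure) would supply, and shows the standard alpha-compositing rule.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Alpha-composite samples along one ray.

    sigmas: (S,) volume densities; colors: (S, 3) RGB; deltas: (S,) step sizes.
    Returns the rendered pixel color and the per-sample compositing weights.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity of each sample
    # Transmittance: fraction of light surviving all earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return weights @ colors, weights

# Toy stand-in values: empty space, then a dense "surface" at the third sample.
sigmas = np.array([0.0, 0.5, 4.0, 0.1])
colors = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.full(4, 0.25)
pixel, weights = composite_ray(sigmas, colors, deltas)
```

Because every step is differentiable, photometric losses on rendered pixels can be backpropagated to the network that predicts the densities and colors, which is what makes these representations trainable from images alone.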
Researchers from the group have also shown how to combine explicit models of humans and scenes with neural network-based appearance representations in texture space. Examples are a new approach to image-based view interpolation even in strongly specular scenes, as well as far-reaching new methods to encode and re-render the full reflectance field of dynamic human faces or entire human bodies photo-realistically from novel viewpoints as well as under novel lighting. The aforementioned neural representations also enable new forms of advanced videography, i.e., video synthesis and editing with a much higher level of control than is feasible with existing video modification technology. In this context, we also investigate new ways to decompose videos into layers of direct and indirect light transport, which enables advanced new ways of illumination-aware video synthesis and editing.
Foundational Algorithms for Visual Computing and Machine Learning: We investigate foundational algorithms for real-world reconstruction, real-world synthesis and neural scene representation. For instance, we developed new scalable and efficient approaches for solving general matching problems. We also investigate new ways to design memory-efficient yet performant learning architectures, in particular CNNs, new ways to better understand and explain their emergent structural properties, and new ways to combine them with expert-designed algorithms and model representations in end-to-end trainable architectures. The team also develops new algorithms to map visual computing problems onto emerging quantum computing architectures.
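As a minimal concrete instance of the matching problems mentioned above, the following sketch solves an optimal assignment between two small point sets using SciPy's linear-assignment solver. The point sets and cost are illustrative; the group's research targets much larger and more general matching problems, so this only demonstrates the problem class.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Two toy point sets to be put into one-to-one correspondence.
left = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
right = np.array([[0.1, 0.9], [0.9, 0.1], [0.1, 0.1]])

# Pairwise squared Euclidean cost matrix: cost[i, j] = ||left[i] - right[j]||^2.
cost = ((left[:, None, :] - right[None, :, :]) ** 2).sum(axis=-1)

# Hungarian-style optimal assignment minimizing the total matching cost.
row_ind, col_ind = linear_sum_assignment(cost)
total_cost = cost[row_ind, col_ind].sum()
```

Exact solvers like this scale poorly to the dense, higher-order correspondence problems arising in shape and scene matching, which is what motivates research on scalable and efficient alternatives.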