Graphics, Vision & Video

Vision and Research Strategy

The research group Graphics, Vision & Video investigates fundamental algorithmic questions at the boundary between computer vision and computer graphics. Research problems in graphics and vision are increasingly converging. We investigate new ways to systematically combine the insights from computer vision on reconstruction and scene interpretation with the expertise in graphics on the forward process of efficiently simulating and representing complex visual scene models. We also explore architectural synergies between these visual computing paradigms and machine learning concepts, and even their deeper integration. Driving these concepts forward simultaneously will lead to scene reconstruction, scene interpretation and scene synthesis algorithms of previously unseen robustness, efficiency, accuracy, generality and user-friendliness. The new methods will advance computer graphics and computer vision in general, but their relevance extends into a rapidly growing number of more general research and application areas. They will be of profound relevance in technologies transforming our future lives. They will revolutionize future virtual and augmented realities and lay the foundations for essential, currently unavailable visual perception abilities required by future computational and autonomous systems that need to understand and interact with the general, uncontrolled human world.



Research Areas and Achievements

Marker-less 4D Reconstruction and Inverse Rendering in Real World Scenes

4D reconstruction is the process of capturing detailed, space-time coherent dynamic 3D geometry, motion and appearance models of real-world scenes from camera recordings. The need for robust and accurate dynamic scene reconstruction algorithms has dramatically increased in recent years. Earlier research was mainly geared towards content capture for movies, games and 3D media, e.g., methods for human performance capture. Today, 4D reconstruction has become an essential building block for far-reaching technologies, e.g., future virtual and augmented reality systems, or future autonomous systems that need to reconstruct the general uncontrolled human world in order to interact with it. Algorithmically, however, 4D reconstruction is still at an early stage.

In the past, the GVV group pioneered approaches for performance capture of humans in wide and general everyday apparel. Unfortunately, most state-of-the-art 4D reconstruction and performance capture methods are still limited in many critical ways: they often only succeed in controlled (indoor) studio environments, require dense camera systems, can only capture a limited scene range (single subjects, specific object types, simple motions, only simple interactions between objects), and deliver limited reconstruction detail. Current approaches are thus far from being able to capture highly detailed models of real-world scenes in general outdoor environments where lighting and appearance are complex and time-varying, where scenes are dynamic and complex, and where only very few or even a single camera can be afforded for reconstruction.

We therefore rethink the algorithmic foundations of 4D reconstruction to come closer to this latter long-term goal. In particular, we advanced the state of the art in marker-less skeletal motion capture, which we already co-defined in the past. For instance, we developed some of the first methods for real-time skeletal motion capture of the hand from a single RGB-D camera, even while simultaneously reconstructing the object the hand interacts with. We also adapted these methods for new gesture-based man-machine interaction paradigms. Furthermore, we developed new scene representations and combinations of learning- and model-based reconstruction that allow previously unseen full-body motion capture in less controlled scenes (even outdoors) with as few as two cameras. Our approach for egocentric full-body motion capture with head-worn fisheye cameras was a groundbreaking novelty.

We further substantially advanced the reconstruction accuracy, capability and runtime performance of dense 4D reconstruction methods. We developed the first methods for dense multi-view performance capture of humans in general apparel in simple outdoor scenes. We further developed new real-time techniques for dense 4D reconstruction that do not require a template model to start with but build the model up alongside deformable tracking. This greatly enhances usability. Our groundbreaking real-time face performance capture methods, which reconstruct geometry, reflectance and scene illumination from single RGB-D and RGB (Face2Face) video and allow real-time photo-real face expression transfer to another person, set a new benchmark. These methods attracted enormous attention in general media worldwide (e.g., New York Times, Jimmy Kimmel Live, over 3 million views on YouTube). We also pioneered approaches for high-quality personalized face rig reconstruction from a single video, as well as approaches for high-quality lip motion and teeth row reconstruction from monocular video.

A core expertise of our group is the development of new inverse rendering approaches, i.e., methods for reconstructing surface reflectance and illumination in general uncalibrated scenes, together with geometry. We greatly advanced these algorithms and pushed them towards real-time performance in certain scene types. We integrated them more tightly with 4D reconstruction for enhanced stability and reconstruction detail, in particular for face capture but also for advanced static scanning and video processing. Our group also maintains one of the largest repositories of marker-less motion and performance capture data sets, GVVPerfcapEva, which is widely used in the community.
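To make the idea of inverse rendering concrete, the following is a minimal sketch of one small sub-problem: fitting low-order lighting coefficients to observed pixel intensities. It is not the group's method. It assumes a Lambertian surface, known per-pixel normals and albedo, no shadows or clamping, and a first-order spherical-harmonics-style basis with the constants folded into the coefficients. All function names are hypothetical.

```python
import numpy as np

def lighting_basis(normals):
    """Per-normal basis [1, nx, ny, nz] (first-order, constants folded in)."""
    n = np.asarray(normals, dtype=float)
    return np.hstack([np.ones((n.shape[0], 1)), n])

def render_intensity(normals, albedo, light):
    """Forward model: intensity = albedo * (basis @ light), no clamping."""
    return np.asarray(albedo, dtype=float) * (lighting_basis(normals) @ light)

def estimate_lighting(normals, albedo, intensity):
    """Inverse step: least-squares fit of the 4 lighting coefficients."""
    A = lighting_basis(normals) * np.asarray(albedo, dtype=float)[:, None]
    light, *_ = np.linalg.lstsq(A, np.asarray(intensity, dtype=float), rcond=None)
    return light

def random_unit_normals(count, seed=0):
    """Synthetic unit normals, used only to exercise the sketch."""
    rng = np.random.default_rng(seed)
    n = rng.normal(size=(count, 3))
    return n / np.linalg.norm(n, axis=1, keepdims=True)
```

In a full inverse rendering pipeline, reflectance, illumination and geometry are all unknown and are estimated jointly, which makes the problem considerably harder than this linear fit.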


Large-scale, High-quality 3D Reconstruction with Lightweight Sensors

Large-scale, high-quality static 3D scene reconstruction with a moving depth scanner, such as an RGB-D camera, is the basis for many applications in mixed and augmented reality, in robotics, but also in architecture, digital heritage, or real-world navigation. We developed BundleFusion, the first approach for globally consistent online reconstruction of large-scale scenes with an RGB-D camera by means of online global bundle adjustment and loop closure. It leads to reconstruction results that are even on par with high-quality offline approaches. The geometry obtained with such a method is of high quality on a medium scale, but lacks fine-scale surface detail due to the low resolution of the depth channel of commodity depth cameras. To alleviate this problem, we developed new inverse rendering approaches that estimate surface reflectance and scene lighting, even in uncalibrated scenes, from the RGB-D input. This allows us to extract fine-scale geometric detail based on shading cues visible in the color image in near real-time. We also developed new programming tools, such as the Opt domain-specific language, to develop non-linear solvers for optimization problems often found in reconstruction and visual computing in general.
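The optimization problems mentioned above are typically non-linear least-squares problems. As a toy illustration only, the sketch below uses a plain Gauss-Newton iteration to align two 2D point sets with a rotation angle and a translation. It is written in Python with NumPy, not in Opt, and it is far simpler than bundle adjustment over many camera poses. The function names, the 2D setting and the assumption of known point correspondences are all illustrative choices.

```python
import numpy as np

def residuals_and_jacobian(params, src, dst):
    """Residuals R(theta) @ p + t - q for all point pairs, plus the Jacobian."""
    theta, tx, ty = params
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    dR = np.array([[-s, -c], [c, -s]])            # derivative of R w.r.t. theta
    r = (src @ R.T + np.array([tx, ty]) - dst).reshape(-1)  # x0, y0, x1, y1, ...
    J = np.zeros((r.size, 3))
    J[:, 0] = (src @ dR.T).reshape(-1)            # d residual / d theta
    J[0::2, 1] = 1.0                              # d x-residual / d tx
    J[1::2, 2] = 1.0                              # d y-residual / d ty
    return r, J

def gauss_newton_align(src, dst, iterations=20, tol=1e-12):
    """Estimate (theta, tx, ty) by repeatedly solving the normal equations."""
    params = np.zeros(3)
    for _ in range(iterations):
        r, J = residuals_and_jacobian(params, src, dst)
        step = np.linalg.solve(J.T @ J, -J.T @ r)
        params = params + step
        if np.linalg.norm(step) < tol:
            break
    return params
```

Real reconstruction energies have many more unknowns and sparse Jacobians, which is why generating efficient solvers from a high-level description is attractive.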


Advanced User-centric Video Processing

The abundance of mobile devices has made video a widely available commodity. In contrast to photo editing, truly powerful tools to support the user in capturing and editing videos, or for exploring large corpora of video, are not yet available. Advanced video editing often requires more profound scene understanding to go beyond the widely used, but restrictive, paradigm of applying image filters along a timeline. We developed advanced computational videography methods to extract structural scene information from everyday video footage. For instance, we developed one of the first video depth-from-defocus approaches, which enables video refocusing, tilt-shift videography, or segmentation in postprocessing from normal video footage. We also developed the first approach for live intrinsic video that decomposes frames, in real-time, into per-pixel shading and albedo. This enables realistic lighting-aware video editing and augmented reality. We also developed new machine learning approaches to structure and explore large video collections, and new methods that exploit camera noise models for video improvement and HDR imaging.
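The value of an intrinsic decomposition for editing can be sketched as follows. The example assumes the common model in which each frame is the per-pixel product of albedo and shading, and it assumes the shading layer is already given. Estimating that layer is the hard part and is not shown here. The function names and the grayscale shading layer are illustrative assumptions, not the group's implementation.

```python
import numpy as np

def split_albedo(frame, shading, eps=1e-6):
    """Recover albedo from frame = albedo * shading.
    frame: (H, W, 3) color image, shading: (H, W) grayscale layer."""
    return frame / np.maximum(shading, eps)[..., None]

def recompose(albedo, shading):
    """Rebuild a frame from its albedo and shading layers."""
    return albedo * shading[..., None]

def recolor(frame, shading, tint):
    """Tint the albedo only, then reapply the original shading so that
    shadows and illumination gradients stay consistent with the scene."""
    albedo = split_albedo(frame, shading)
    return recompose(albedo * np.asarray(tint, dtype=float), shading)
```

Because the edit is applied to the albedo layer and the original shading is multiplied back in afterwards, the recolored surface keeps the lighting of the input footage.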