In recent years, an ongoing convergence of Computer Vision and Computer Graphics has been observed. In our group we investigate the algorithmic ingredients of two fields of research that draw from the ideas of both scientific disciplines: optical motion capture and free viewpoint video. It is our goal to explore the technical limits of today's video and still camera technology and to exploit their capabilities in order to develop new algorithmic solutions for motion analysis and the generation of photo-realistic immersive video content.
In our work on human motion capture, we research algorithms for the model-based analysis of human motion information from multiple video streams without the use of optical markers. In addition, we investigate methods that enable the fully automatic, marker-free reconstruction of kinematic skeleton models of arbitrary moving subjects from multi-view video footage.
While robust motion estimation from raw video streams is one algorithmic challenge, capturing very rapid human motion that takes place in a large region of space is another. In our work we develop methods that allow us to capture such motion with regular off-the-shelf still cameras. We have experimentally validated our concepts by capturing the athlete's hand motion as well as the flight of the ball during a baseball pitch.
A motion capture approach is an essential component of our work on free viewpoint video. Here, we investigate methods for the model-based reconstruction of 3D videos of human actors from multiple video streams. The goal is to allow a viewer to interactively choose an arbitrary viewpoint onto the 3D rendition of the reconstructed real-world scene. To achieve this goal we estimate not only the motion of a person but also her time-varying surface appearance (skin and clothes). In addition to the appearance under fixed illumination conditions, we can also estimate dynamic reflectance properties from the input video streams. This enables us to augment any virtual environment with 3D video footage that is correctly rendered under the prevailing virtual illumination conditions. If one intends to transmit a free-viewpoint video over capacity-limited distribution channels, efficient ways of encoding it are necessary. We have developed and validated several algorithms that serve this purpose.
In another research project we investigate the real-time generation of novel views of dynamic scenes from multi-view video by means of hardware-accelerated shape-from-silhouette computation.
Our research strongly depends on high-quality input video streams. For the acquisition of the video material, we have built a multi-view video acquisition studio. In all of the projects, there is intensive cooperation between D4 and IRG3 Graphics-Optics-Vision headed by Marcus Magnor. We have contributed to the process of developing a standard for three-dimensional video within the Moving Picture Experts Group (MPEG), a subgroup of the ISO.
Investigators: joint project of researchers in D4 and IRG3
In all our video-related research projects we use multiple high-quality video streams as input data that show the same scene from multiple frame-synchronized camera perspectives. For the acquisition of this multi-view video (MVV) footage we have built a special-purpose recording studio [TLMS03]. Our first camera system consisted of eight synchronized IEEE1394 cameras that could record a scene at a resolution of 640x480 pixels and a maximal frame rate of 30 fps (15 fps in synchronized mode). We have since upgraded to a new setup that features eight cameras, each of which provides a resolution of 1004x1004 pixels and a color depth of 12 bits per pixel. The system can run at a maximal sustained frame rate of 48 fps. A storage back-end with eight frame grabbers and eight parallel RAID systems enables us to stream the video data to hard disk in real time. The studio provides fully controllable lighting conditions and several flexibly arrangeable light sources. Arbitrary camera arrangements are possible. Fig. 0.1 illustrates our MVV recording studio.
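A back-of-the-envelope calculation, using only the figures stated above, shows why a parallel RAID back-end is needed to stream the upgraded setup to disk in real time:

```python
# Sustained data rate of the upgraded studio setup: 8 cameras,
# 1004x1004 pixels, 12 bits per pixel, 48 fps. The numbers come from
# the text; the calculation itself is just illustrative arithmetic.

CAMERAS = 8
WIDTH = HEIGHT = 1004
BITS_PER_PIXEL = 12
FPS = 48

bytes_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL / 8       # ~1.5 MB
bytes_per_second_per_camera = bytes_per_frame * FPS         # ~72.6 MB/s
total_bytes_per_second = bytes_per_second_per_camera * CAMERAS

print(f"per frame:  {bytes_per_frame / 1e6:.2f} MB")
print(f"per camera: {bytes_per_second_per_camera / 1e6:.2f} MB/s")
print(f"total:      {total_bytes_per_second / 1e6:.2f} MB/s")
```

At roughly 580 MB/s of raw sustained throughput, a single commodity hard disk of the time could not keep up, which motivates the eight parallel frame grabber/RAID pairs.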
We have developed an approach for capturing high-speed motion with regular digital photo cameras [TAH+04]. Our method demonstrates that it is possible to capture both the articulated hand motion of the pitcher and the flight parameters of the ball during a baseball pitch. We have captured the high-speed scene using four consumer-grade still cameras and the principle of multi-exposure photography. In multi-exposure photography the camera is set to a long integration time and a stroboscope illuminates the scene with high-frequency light bursts. This way, multiple exposures of the scene are superimposed in one image. We have automatically analyzed the recorded multi-exposure images to capture the flight trajectory as well as the initial flight parameters of a baseball. We have validated our results by means of a physically-based model of the ball's flight. Furthermore, the same principle has been employed to capture the rapid articulated hand motion during the pitch. For motion representation and rendering an anatomical hand model is used. Our results enable a detailed analysis and visualization of baseball pitches and show the dependencies between the hand motion, the initial flight parameters, and the resulting flight trajectory for different pitching techniques (Fig. 0.2).
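The validation step compares the measured trajectory against a physically-based flight model. The exact model of [TAH+04] is not reproduced here; the sketch below assumes the standard ingredients for a spinning ball in air: gravity, aerodynamic drag, and a Magnus (spin) force, integrated with simple Euler steps. The drag and lift coefficients are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of a physically-based baseball flight model:
# gravity + drag + Magnus force, integrated with explicit Euler steps.
# Coefficients CD and CL are assumed values, not those of [TAH+04].

G = np.array([0.0, 0.0, -9.81])   # gravity (m/s^2)
RHO = 1.2                          # air density (kg/m^3)
M = 0.145                          # baseball mass (kg)
R = 0.0366                         # baseball radius (m)
A = np.pi * R**2                   # cross-sectional area (m^2)
CD, CL = 0.35, 0.2                 # assumed drag / lift coefficients

def simulate(p0, v0, spin_axis, dt=1e-3, steps=1000):
    """Integrate the flight from initial position, velocity and spin axis."""
    p, v = np.array(p0, float), np.array(v0, float)
    trajectory = [p.copy()]
    for _ in range(steps):
        speed = np.linalg.norm(v)
        drag = -0.5 * RHO * CD * A * speed * v
        magnus = 0.5 * RHO * CL * A * speed * np.cross(spin_axis, v)
        v = v + dt * (G + (drag + magnus) / M)
        p = p + dt * v
        trajectory.append(p.copy())
    return np.array(trajectory)

# A pitch released at 1.8 m height with 40 m/s forward speed and backspin.
traj = simulate(p0=[0, 0, 1.8], v0=[40.0, 0, 0], spin_axis=[0, 0, 1])
```

Fitting the free parameters of such a model to the measured multi-exposure ball positions yields the initial flight parameters and allows the plausibility of the reconstruction to be checked.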
For the realistic animation of an artificial character, a body model that represents the character's kinematic structure is required. Marker-free optical motion capture approaches exist, but due to their dependence on a specific type of a priori model they can hardly be used to track other subjects, e.g. animals. In order to extend the flexibility provided by marker-based motion capture systems, a novel approach is presented in [dATM+04] that estimates the kinematic structure of a moving human subject without requiring significant a priori knowledge. Our method also enables us to track the motion without the use of optical markers. In addition, as shown in [TdAM+04], the method is general enough to be applicable in a similar form to other moving subjects whose structure can be modeled as a linked kinematic chain, e.g. animals or mechanical devices.
The input to our system consists of sequences of voxel volumes that are reconstructed from multi-view video streams by means of a shape-from-silhouette approach. At each time step the volumes are subdivided by fitting ellipsoidal shells to the voxel data, thereby approximating the shape of the moving subject. Exploiting the temporal dimension, we can identify correspondences between ellipsoids over time and thus identify coherent rigid body parts. Knowing the motion of the rigid bodies over time, the joint locations of the kinematic chain are estimated, and the motion parameters of the recorded subject are calculated based on the derived skeleton.
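The shape-from-silhouette step can be sketched in a few lines: a voxel survives only if it projects into the foreground silhouette of every camera. The toy cameras below are orthographic projections along grid axes; the real system of course uses the calibrated perspective studio cameras.

```python
import numpy as np

# Minimal shape-from-silhouette (voxel carving) sketch with orthographic
# toy cameras. A voxel is kept only if its footprint lies inside the
# foreground silhouette of every view.

def carve(grid_shape, silhouettes_and_axes):
    """silhouettes_and_axes: list of (2D bool mask, projection axis)."""
    volume = np.ones(grid_shape, dtype=bool)
    for sil, axis in silhouettes_and_axes:
        # Move the projection axis to the front; the remaining two axes
        # line up with the silhouette image.
        proj = np.moveaxis(volume, axis, 0)
        proj &= sil[None, :, :]            # carve away background voxels
        volume = np.moveaxis(proj, 0, axis)
    return volume

# Two orthogonal square silhouettes carve a 4x4x4 box out of an 8^3 grid.
sil = np.zeros((8, 8), dtype=bool)
sil[2:6, 2:6] = True
volume = carve((8, 8, 8), [(sil, 0), (sil, 1)])
```

The resulting boolean volume is the kind of voxel data to which the ellipsoidal shells are then fitted.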
As shown in Fig. 0.3, our approach enables the automatic construction of a kinematic skeleton model of arbitrary moving subjects, such as humans (c) and animals (a, b), with practically no a priori information about the body structure.
Investigator: Ming Li
Our goal is to reconstruct and render dynamic scenes from several video streams on-the-fly. The visual hull, a concept introduced by Laurentini [Lau94], proves to be an efficient shape approximation for this purpose. Two different approaches to visual hull reconstruction exist: voxel-based and polyhedron-based. We have built an on-line visual hull reconstruction and rendering system [LMS03] based on the latter approach since it is more suitable for fast rendering on common graphics hardware.
The 3D reconstruction is rather straightforward. In each frame we extract a 2D polygon from the silhouette outline and project it into 3D space to form a generalized cone. The 3D intersection of all the back-projected cones produces a polyhedral visual hull. For rendering, we employ the shadow mapping technique to avoid ``projecting-through'' artifacts when projecting textures onto the reconstructed visual hulls. A dynamic texture packing technique is also proposed that improves rendering performance by utilizing region-of-interest information. We have also developed a method for visual hull rendering that combines the strengths of two complementary hardware-accelerated approaches: direct constructive solid geometry (CSG) rendering and texture mapping-based visual cone trimming. The former approach completely eliminates the aliasing artifacts inherent in the latter, whereas the rapid speed of the latter approach compensates for the performance deficiency of the former [LMS04b]. A novel approach to reconstruct photo-hulls, i.e. multi-view photo-consistent and not only silhouette-consistent geometry, that runs completely on graphics hardware has also been developed [LMS04a].
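The back-projection of a silhouette polygon into a generalized cone can be illustrated as follows: each polygon vertex, together with the camera center, defines one edge ray of the cone. The pinhole intrinsic matrix `K` below is an assumed example, not one of the studio's calibrated cameras.

```python
import numpy as np

# Hedged sketch of silhouette-cone construction: back-project 2D polygon
# vertices through an assumed pinhole camera to unit ray directions.
# The cone is the union of all rays from the camera center through the
# silhouette polygon; its faces connect consecutive edge rays.

K = np.array([[800.0,   0.0, 320.0],     # assumed focal length / center
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

def cone_rays(polygon_2d, K):
    """Back-project 2D silhouette vertices to unit ray directions."""
    K_inv = np.linalg.inv(K)
    rays = []
    for (u, v) in polygon_2d:
        d = K_inv @ np.array([u, v, 1.0])   # direction in camera coords
        rays.append(d / np.linalg.norm(d))
    return np.array(rays)

rays = cone_rays([(320, 240), (400, 240), (360, 300)], K)
```

Intersecting the polyhedral cones of all cameras (e.g. with standard polygon clipping) then yields the polyhedral visual hull described above.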
Investigator: Christian Theobalt
In our research on free-viewpoint video we combine the strengths of a marker-free silhouette-based human motion capture algorithm and a multi-view texture generation method in order to reconstruct 3D videos of human actors from multi-view video [CTMS03,TCMS04a,MT04]. The reconstructed 3D videos can be played back in real-time, and the viewer can interactively choose an arbitrary viewpoint onto the scene (Fig. 0.4a). During acquisition, the motion capture approach fits an adaptable a priori shape model to the silhouettes of a moving person. The motion capture algorithm employs an energy function that is efficiently implemented in graphics hardware (Fig. 0.4a). During rendering, the model is displayed in the sequence of captured poses and all video frames are blended into one consistent surface texture. We have sped up the method by implementing the motion capture algorithm as a distributed client-server system [TCMS03b]. Slight inaccuracies in captured body poses can be eliminated if, in addition to silhouette data, texture information is also considered during tracking. In a predictor-corrector scheme, 3D flow fields are reconstructed from 2D optical flows (Fig. 0.4). These flow fields enable subtle corrective pose updates [TCMS03a,TCMS04b] (Fig. 0.4e). We have demonstrated the performance and robustness of our approach even on motion as complex as ballet dance.
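The idea behind the silhouette-matching energy can be stated compactly: the better the model pose, the smaller the non-overlap (XOR) area between the rendered model silhouette and the observed image silhouette. On the GPU this count is obtained with hardware support; the NumPy version below is only a conceptual sketch with toy silhouettes.

```python
import numpy as np

# Toy version of a silhouette-matching energy for model-based pose
# estimation: count pixels covered by exactly one of the two silhouettes.
# A perfectly aligned pose yields energy 0; misalignment increases it.

def silhouette_xor_energy(rendered, observed):
    """Number of pixels where the two boolean silhouettes disagree."""
    return int(np.logical_xor(rendered, observed).sum())

observed = np.zeros((6, 6), dtype=bool); observed[1:5, 1:5] = True
good_pose = np.zeros((6, 6), dtype=bool); good_pose[1:5, 1:5] = True
bad_pose  = np.zeros((6, 6), dtype=bool); bad_pose[2:6, 2:6] = True
```

An optimizer over the pose parameters of the body model minimizes this energy for every frame; the texture-based flow correction then refines the silhouette-optimal pose.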
Multi-view video recording creates vast amounts of uncompressed multi-camera video data. In order to store and transmit these data on consumer-level systems, new compression methods have to be devised. We combine 3D image analysis and video compression techniques to achieve substantial data reduction while maintaining good quality and playback speed.
We aid motion compensation with knowledge of the reconstructed geometry, the given camera positions, and the segmented multi-view video. For that purpose we fit an animated 3D model to the scene [CTMS03] and create a texture parameterization that is constant over time (Fig. 0.5 shows such a texture parameterization). Naturally, this approach is based on the assumption that the 3D object's surface does not change drastically over time or camera angles (apart from slight movements of, e.g., loose clothing). A system overview is presented in [TZMS04]. GPU-assisted projective texturing into the texture map yields a so-called MVV texture, i.e. partial video texture maps from the different camera views (Fig. 0.6).
We have designed two compression methods for MVV textures: a two-level hierarchical, residual image compression based on an extracted master texture (MVV-2D-Master-Diff), and a 4D-SPIHT based approach (MVV-4D-SPIHT).
MVV-2D-Master-Diff [ZLA+04] proved to be quick to compute and allows random access to MVV texture elements. Its 2D wavelet compression [BISK] is somewhat slow, but it scales with the progress of 2D wavelet codec speeds and could thus achieve real-time decoding in the foreseeable future.
MVV-4D-SPIHT [ZLMS04], on the other hand, yields excellent compression, as it exploits all 4D correlations. The codec is based on a custom-built 4D shape-adaptive wavelet compressor that utilizes an adaptation of the common SPIHT algorithm.
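The master-texture idea behind MVV-2D-Master-Diff can be illustrated with a minimal sketch: one reference texture is extracted, and each frame is stored as a residual against it. Residuals of a slowly changing surface are close to zero and therefore compress well; the wavelet coding stage itself is omitted here, and the median-based master extraction is an illustrative choice, not necessarily the method of [ZLA+04].

```python
import numpy as np

# Conceptual sketch of master-texture + residual coding: a single master
# texture plus small per-frame difference images. The actual codec
# additionally applies 2D wavelet compression to the residuals.

def encode(frames):
    master = np.median(frames, axis=0)          # one reference texture
    residuals = [f - master for f in frames]    # small per-frame diffs
    return master, residuals

def decode(master, residuals):
    return [master + r for r in residuals]

frames = [np.full((4, 4), 100.0),
          np.full((4, 4), 102.0),
          np.full((4, 4), 101.0)]
master, residuals = encode(frames)
restored = decode(master, residuals)
```

Because the residuals have much smaller dynamic range than the frames themselves, an entropy or wavelet coder spends far fewer bits on them, which is the source of the data reduction.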
Free Viewpoint Video of Human Actors (Sect. 0.1.5) allows photo-realistic rendering of real-world people under the illumination conditions that prevailed at the time of recording. If one wants to augment a virtual environment with 3D video footage one has to make sure that it is correctly rendered under the novel virtual lighting situation. To do this, a reflectance model has to be estimated that mathematically describes the physics of light interaction at the body surface.
In our work on joint motion and reflectance capture, we extend our original free viewpoint video approach for human actors in order to capture relightable 3D videos [TAdA+05]. In addition to the motion we estimate a dynamic surface reflectance model from multi-view video footage. The dynamic surface reflectance model consists of a parametric BRDF (bidirectional reflectance distribution function) model for each point on the body surface and a time-varying normal map that captures the dynamic changes in surface appearance (e.g., wrinkles in clothing). Two types of MVV sequences are recorded for each person (Sect. 0.1.1). In the reflectance estimation sequence (RES) the person turns on the spot and is illuminated by only one spot light. These data are used to infer a parametric BRDF model for each surface point. The dynamic scene sequence (DSS) captures the actual human motion that one wants to create a 3D video from. From the DSS the motion (as explained in Sect. 0.1.5) as well as the time-varying normal maps are estimated. We also present an algorithm to warp-correct input video images in order to guarantee multi-view photo-consistency in conjunction with inexact object geometry. Fig. 0.7 shows screen-shots of different real-world people rendered from novel viewpoints. Subtle details in surface appearance are faithfully captured and it is even possible to dynamically change the apparel of a person in a free viewpoint video.
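Relighting with such a representation amounts to evaluating, per surface point, the fitted BRDF under the new virtual light and the normal taken from the time-varying normal map. The sketch below uses a simple Phong-style model as a stand-in for the parametric BRDF of [TAdA+05]; the per-point parameters kd, ks and n are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of relighting one surface point: a Phong-style parametric
# reflectance model (stand-in for the fitted BRDF) evaluated with a normal
# from the dynamic normal map under a single white point light.

def shade(normal, light_dir, view_dir, kd, ks, n):
    """Radiance reflected toward the viewer for one white point light."""
    normal = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    diffuse = kd * max(np.dot(normal, l), 0.0)
    r = 2.0 * np.dot(normal, l) * normal - l     # mirror reflection of l
    specular = ks * max(np.dot(r, v), 0.0) ** n
    return diffuse + specular

# Light and viewer head-on to the surface: full diffuse + full specular.
value = shade(np.array([0.0, 0.0, 1.0]),
              np.array([0.0, 0.0, 1.0]),
              np.array([0.0, 0.0, 1.0]),
              kd=0.6, ks=0.3, n=20)
```

Evaluating this per texel, with the light direction taken from the virtual environment, is what makes the 3D video footage blend consistently into the prevailing virtual illumination.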