
Optical Motion Capture and Free Viewpoint Video

In recent years, an ongoing convergence of Computer Vision and Computer Graphics has been observed. In our group we investigate the algorithmic ingredients of two fields of research that draw from the ideas of both scientific disciplines, optical motion capture and free viewpoint video. It is our goal to explore the technical limits of today's video and still camera technology and to exploit their capabilities in order to develop new algorithmic solutions for motion analysis and the generation of photo-realistic immersive video content.

In our work on human motion capture, we research algorithms for the model-based analysis of human motion from multiple video streams without the use of optical markers. In addition, we investigate methods that enable the fully automatic, marker-free reconstruction of kinematic skeleton models of arbitrary moving subjects from multi-view video footage.

While robust motion estimation from raw video streams is one algorithmic challenge, capturing very rapid human motion that takes place in a large region of space is another. In our work we develop methods that allow us to capture this motion with regular off-the-shelf still cameras. We have experimentally validated our concepts by capturing the athlete's hand motion as well as the flight of the ball during a baseball pitch.

A motion capture approach is an essential component of our work on free viewpoint video. Here, we investigate methods for model-based reconstruction of 3D videos of human actors from multiple video streams. The goal is to give a viewer the possibility to interactively choose an arbitrary viewpoint onto the 3D rendition of the reconstructed real-world scene. To achieve this goal we not only estimate the motion of a person but also her time-varying surface appearance (skin and clothes). In addition to the appearance under fixed illumination conditions we can also estimate dynamic reflectance properties from the input video streams. This enables us to augment any virtual environment with 3D video footage which is correctly rendered under the virtual illumination conditions that prevail. If one intends to transmit a free-viewpoint video over capacity-limited distribution channels, efficient ways for encoding it are necessary. We have developed and validated several algorithms which serve this purpose.

In another research project we investigate the real-time generation of novel views of dynamic scenes from multi-view video by means of hardware-accelerated shape-from-silhouette computation.

Our research strongly depends on high-quality input video streams. For acquisition of the video material, we built a multi-view video acquisition studio. In all of the projects, there is close cooperation between D4 and IRG3 Graphics-Optics-Vision headed by Marcus Magnor. We have contributed to the development of a standard for three-dimensional video within the Moving Picture Experts Group (MPEG), a working group of ISO/IEC.

A Studio for Multi-View Video Recording

Investigators: joint project of researchers in D4 and IRG3

In all our video-related research projects we use multiple high-quality video streams as input data that show the same scene from multiple frame-synchronized camera perspectives. For acquisition of this multi-view video (MVV) footage we have built a special-purpose recording studio [TLMS03]. Our first camera system consisted of eight synchronized IEEE1394 cameras that could record a scene at a resolution of 640x480 pixels and a maximal frame rate of 30 fps (15 fps in synchronized mode). We have upgraded to a new setup that features eight cameras, each of which provides a resolution of 1004x1004 pixels and a color depth of 12 bits per pixel. The system can run at a maximal sustained frame rate of 48 fps. A storage backend with eight frame grabbers and eight parallel RAID systems enables us to stream the video data to hard disk in real time. The studio provides fully controllable lighting conditions and several flexibly arrangeable light sources. Arbitrary camera arrangements are possible. Fig. 0.1 illustrates our MVV recording studio.
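
As a rough plausibility check of the storage requirement, the raw data rate of the upgraded setup can be estimated from the figures above. The sketch below assumes one tightly packed 12-bit sample per pixel and no container overhead; the actual on-disk format is not stated here, so the result is only an estimate, and the function names are purely illustrative.

```python
# Back-of-the-envelope raw data rate of the upgraded camera setup.
# Assumption (not from the text): one tightly packed 12-bit sample per pixel.
CAMERAS = 8
WIDTH = HEIGHT = 1004
BITS_PER_PIXEL = 12
FPS = 48


def bytes_per_frame(width=WIDTH, height=HEIGHT, bits=BITS_PER_PIXEL):
    """Size of one raw frame in bytes."""
    return width * height * bits // 8


def bytes_per_second(cameras=CAMERAS, fps=FPS):
    """Sustained raw data rate of all cameras together, in bytes per second."""
    return cameras * fps * bytes_per_frame()


if __name__ == "__main__":
    print(f"per camera:  {FPS * bytes_per_frame() / 1e6:.1f} MB/s")
    print(f"all cameras: {bytes_per_second() / 1e6:.1f} MB/s")
```

Under these assumptions each camera produces roughly 73 MB per second, about 580 MB per second in total, which is consistent with dedicating one frame grabber and one RAID system to each camera.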

Figure 0.1: (a) Recording area of the studio. (b) Camera on mounting pole. (c) Control PC with storage backend.

Capturing Rapid Motion with Regular Still Cameras

Investigators: Christian Theobalt, Irene Albrecht and Jörg Haber

Figure 0.2: (a) Captured flight trajectory of the ball. (b) Measured articulated hand motion when the ball leaves the pitcher's hand.

We have developed an approach for capturing high-speed motion with regular digital photo cameras [TAH+04]. Our method demonstrates that it is possible to capture both the articulated hand motion of the pitcher and the flight parameters of the ball during a baseball pitch. We have captured the high-speed scene using four consumer-grade still cameras and the principle of multi-exposure photography. In multi-exposure photography the camera is set to a long integration time and a stroboscope illuminates the scene with high-frequency light bursts. This way, multiple exposures of the scene are superimposed in one image. We have automatically analyzed the recorded multi-exposure images to capture the flight trajectory as well as the initial flight parameters of a baseball. We have validated our results by means of a physically-based model of the ball's flight. Furthermore, the same principle has been employed to capture the rapid articulated hand motion during the pitch. For motion representation and rendering an anatomical hand model is used. Our results enable a detailed analysis and visualization of baseball pitches and show the dependencies between the hand motion, the initial flight parameters, and the resulting flight trajectory for different pitching techniques (Fig. 0.2).
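
To illustrate how initial flight parameters can be recovered from strobe-timed ball positions, the following minimal sketch fits an initial position and velocity by least squares. It assumes a drag-free, spin-free ballistic model (gravity only), which is a simplification and not necessarily the physically-based model used in [TAH+04]. The function names, the 100 Hz flash rate, and the sample values are hypothetical.

```python
# Least-squares fit of initial position and velocity to timed 3D ball positions.
# Assumption: gravity-only flight, p(t) = p0 + v0*t + 0.5*g*t^2 (no drag, no spin).
G = (0.0, 0.0, -9.81)  # gravity in m/s^2, z axis pointing up


def fit_line(ts, ys):
    """Ordinary least squares for y = a + b*t; returns (a, b)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    b = cov / var_t
    return mean_y - b * mean_t, b


def fit_initial_state(ts, points):
    """Estimate (p0, v0) from flash times ts and 3D points (one per flash)."""
    p0, v0 = [], []
    for axis in range(3):
        # Remove the known gravity term so that the remainder is linear in t.
        ys = [p[axis] - 0.5 * G[axis] * t * t for t, p in zip(ts, points)]
        a, b = fit_line(ts, ys)
        p0.append(a)
        v0.append(b)
    return tuple(p0), tuple(v0)


def simulate(p0, v0, t):
    """Position at time t under the same gravity-only model."""
    return tuple(p + v * t + 0.5 * g * t * t for p, v, g in zip(p0, v0, G))


if __name__ == "__main__":
    # Synthetic data: ten flashes of a hypothetical 100 Hz stroboscope.
    ts = [i / 100.0 for i in range(10)]
    points = [simulate((0.0, 0.0, 1.8), (38.0, 0.5, -1.0), t) for t in ts]
    print(fit_initial_state(ts, points))
```

With real measurements, the residual between the fitted and the observed positions would indicate how much drag and spin, which this sketch ignores, contribute to the trajectory.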

Marker-Free Kinematic Skeleton Estimation

Investigators: Edilson de Aguiar and Christian Theobalt

For realistic animation of an artificial character, a body model that represents the character's kinematic structure is required. Marker-free optical motion capture approaches exist, but due to their dependence on a specific type of a priori model they can hardly be used to track other subjects, e.g. animals. In order to extend the flexibility provided by marker-based motion capture systems, we present a novel approach in [dATM+04] that is able to estimate the kinematic structure of a moving human subject without requiring significant a priori knowledge. Our method also enables us to track the motion without the use of optical markers. In addition, as shown in [TdAM+04], the method is general enough to be applicable in a similar form to other moving subjects whose structure can be modeled as a linked kinematic chain, e.g. animals or mechanical devices.

Input to our system are sequences of voxel volumes that are reconstructed from multi-view video streams by means of a shape-from-silhouette approach. At each time step the volumes are subdivided by fitting ellipsoidal shells to the voxel data, thereby approximating the shape of the moving subject. Exploiting the temporal dimension, we can identify correspondences between ellipsoids over time and thus identify coherent rigid body parts. Knowing the motion of the rigid bodies over time, the joint locations of the kinematic chain are estimated, and the motion parameters of the recorded subject are calculated based on the derived skeleton.
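
The shape-from-silhouette step that produces the voxel volumes can be illustrated with a toy voxel carving routine. This is only a sketch under strong assumptions: it uses orthographic cameras aligned with the coordinate axes and silhouettes given as sets of foreground pixels, whereas a real setup uses calibrated perspective cameras. All names are hypothetical.

```python
# Toy shape-from-silhouette: a voxel survives only if it projects into the
# silhouette of every view.
# Assumption: orthographic cameras looking along the x, y and z axes.
from itertools import product

# Each projection maps a voxel index (x, y, z) to a pixel in one view.
PROJECTIONS = {
    "x": lambda v: (v[1], v[2]),  # camera looking along x sees (y, z)
    "y": lambda v: (v[0], v[2]),  # camera looking along y sees (x, z)
    "z": lambda v: (v[0], v[1]),  # camera looking along z sees (x, y)
}


def carve(size, silhouettes):
    """Return the voxels of a size^3 grid that are consistent with all silhouettes.

    silhouettes maps a view name to the set of foreground pixels of that view.
    """
    hull = set()
    for voxel in product(range(size), repeat=3):
        if all(PROJECTIONS[view](voxel) in pixels
               for view, pixels in silhouettes.items()):
            hull.add(voxel)
    return hull


if __name__ == "__main__":
    # Silhouettes of a 2x2x2 cube in the corner of a 4x4x4 grid.
    square = {(a, b) for a in range(2) for b in range(2)}
    print(len(carve(4, {"x": square, "y": square, "z": square})))
    # With a single view the result is a coarser, conservative volume.
    print(len(carve(4, {"z": square})))
```

The example also shows why several views are needed: each additional silhouette can only remove voxels, so the reconstructed volume tightens as cameras are added.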

As shown in Fig. 0.3, our approach enables the automatic construction of a kinematic skeleton model of arbitrary moving subjects, such as humans (c) and animals (a, b), with practically no a priori information about the body structure.

Figure 0.3: Voxel set reconstructed from multi-view video streams and the estimated skeleton (joints shown in blue and bones in white) for (a) a bird, (b) a monster and (c) a human actor.

Hardware-accelerated Real-time Scene Reconstruction and Rendering

Investigator: Ming Li

Our goal is to reconstruct and render dynamic scenes from several video streams on-the-fly. The visual hull, a concept that was introduced by Laurentini [Lau94], proves to be an efficient shape approximation for this purpose. There are two approaches to visual hull reconstruction: voxel-based and polyhedron-based. We have built an on-line visual hull reconstruction and rendering system [LMS03] based on the latter approach, since it is better suited to fast rendering on common graphics hardware.

The 3D reconstruction is rather straightforward. In each frame we extract a 2D polygon from the silhouette outline and back-project it into 3D space to form a generalized cone. The 3D intersection of all the back-projections produces a polyhedral visual hull. For rendering, we employ the shadow mapping technique to avoid "projecting-through" artifacts when projecting textures onto the reconstructed visual hulls. A dynamic texture packing technique is also proposed to improve rendering performance by utilizing region-of-interest information. We have also developed a method for visual hull rendering that combines the strengths of two complementary hardware-accelerated approaches: direct constructive solid geometry (CSG) rendering and texture mapping-based visual cone trimming. The former approach completely eliminates the aliasing artifacts inherent in the latter, whereas the speed of the latter approach compensates for the performance deficiency of the former [LMS04b]. A novel approach to reconstruct photo hulls, i.e. geometry that is multi-view photo-consistent and not only silhouette-consistent, that runs completely on graphics hardware has also been developed [LMS04a].
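
The depth comparison behind shadow mapping, which prevents a camera image from being projected through the front surface onto occluded parts of the hull, can be sketched on the CPU as follows. This is an illustration of the principle only and not the hardware implementation described above. It assumes that each camera has a depth map of the reconstructed hull, stored as a dictionary from pixel to the depth of the nearest surface; the names and the tolerance value are hypothetical.

```python
# Shadow-mapping style visibility test for projective texturing.
# Assumption: depth_map maps a pixel to the depth of the nearest hull surface
# as seen from one camera; pixels that see no surface are absent.


def is_visible(pixel, depth, depth_map, eps=1e-3):
    """True if a surface point at `depth` is the front-most one at `pixel`."""
    nearest = depth_map.get(pixel)
    return nearest is not None and depth <= nearest + eps


def texture_sources(point_views, depth_maps):
    """Cameras that may texture a surface point.

    point_views maps a camera name to the (pixel, depth) of the point in
    that camera; depth_maps maps a camera name to its depth map.
    """
    return [cam for cam, (pixel, depth) in point_views.items()
            if is_visible(pixel, depth, depth_maps[cam])]


if __name__ == "__main__":
    depth_maps = {"cam0": {(5, 5): 2.0}}
    front_point = {"cam0": ((5, 5), 2.0)}  # lies on the visible surface
    back_point = {"cam0": ((5, 5), 3.0)}   # same viewing ray, but occluded
    print(texture_sources(front_point, depth_maps))
    print(texture_sources(back_point, depth_maps))
```

Only the front point is assigned the camera as a texture source; the occluded point on the same viewing ray is rejected, which is exactly the case that would otherwise produce a projecting-through artifact.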

Free Viewpoint Video of Human Actors

Investigator: Christian Theobalt

Figure 0.4: (a) Rendered free-viewpoint video of a ballet dancer. (b) Silhouette XOR error function used for pose computation. (c) Underlying kinematic body model. (d) Corrective 3D flow vectors. (e) Free-viewpoint video without and (f) with corrective pose update.

In our research on free-viewpoint video we combine the strengths of a marker-free silhouette-based human motion capture algorithm and multi-view texture generation in order to reconstruct 3D videos of human actors from multi-view video [CTMS03,TCMS04a,MT04]. The reconstructed 3D videos can be played back in real time, and the viewer can interactively choose an arbitrary viewpoint onto the scene (Fig. 0.4a). During acquisition, the motion capture approach fits an adaptable a priori shape model to the silhouettes of a moving person. The motion capture algorithm employs an energy function that is efficiently implemented in graphics hardware (Fig. 0.4b). During rendering, the model is displayed in the sequence of captured poses and all video frames are blended into one consistent surface texture. We have sped up the method by implementing the motion capture algorithm as a distributed client-server system [TCMS03b]. Slight inaccuracies in captured body poses can be eliminated if, in addition to silhouette data, texture information is also considered during tracking. In a predictor-corrector scheme, 3D flow fields are reconstructed from 2D optical flows (Fig. 0.4d). These flow fields enable subtle corrective pose updates [TCMS03a,TCMS04b] (Fig. 0.4e, f). We have demonstrated the performance and robustness of our approach even on motion as complex as ballet dance.
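
The silhouette XOR error function (Fig. 0.4b) can be written down compactly. The following CPU sketch counts the pixels in which the rendered model silhouette and the segmented image silhouette disagree, summed over all camera views. It assumes that silhouettes are equally sized binary masks given as lists of rows; the system described above evaluates this measure in graphics hardware, and the pose optimization that minimizes it is omitted here. The function names are hypothetical.

```python
# Silhouette XOR energy: number of pixels in which the rendered model
# silhouette and the segmented image silhouette disagree.
# Assumption: masks are equally sized lists of rows containing 0 or 1.


def xor_error(model_mask, image_mask):
    """Count the disagreeing pixels of one camera view."""
    return sum(m ^ i
               for model_row, image_row in zip(model_mask, image_mask)
               for m, i in zip(model_row, image_row))


def pose_energy(model_masks, image_masks):
    """Sum the XOR error over all camera views; lower means a better fit."""
    return sum(xor_error(m, i) for m, i in zip(model_masks, image_masks))


if __name__ == "__main__":
    model = [[0, 1, 1],
             [0, 1, 0]]
    image = [[0, 1, 0],
             [1, 1, 0]]
    print(xor_error(model, image))                    # two pixels differ
    print(pose_energy([model, model], [image, image]))
```

A pose estimator would render the body model for a candidate pose in every camera view, evaluate this energy, and search the pose parameters for a minimum.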

Encoding of 3D Video

Investigators: Gernot Ziegler and Hendrik Lensch

Figure 0.5: (left) 3D model and its skeleton, (middle) body part grouping, (right) texture parameterization.

Figure 0.6: The texture parameterization and projective texturing convert camera views into partial texture maps (MVV texture elements).

Multi-view video produces vast amounts of uncompressed video data from multiple cameras. In order to store and transmit this data on consumer-level systems, new compression methods have to be devised. We combine 3D image analysis and video compression techniques to achieve substantial data reduction while maintaining good quality and playback speed.

In MVV compression, research aims at exploiting the inherent correlation between the different camera views of the same scene.

We aid motion compensation with knowledge of the recorded geometry, the given camera positions, and the segmented multi-view video. For that purpose we fit an animated 3D model to the scene [CTMS03] and create a texture parameterization that is constant over time (Fig. 0.5 shows such a texture parameterization). Naturally, this approach is based on the assumption that the 3D object's surface will not change drastically over time or across camera angles (apart from slight movements of, e.g., loose clothing). A system overview is presented in [TZMS04]. GPU-assisted projective texturing into the texture map yields a so-called MVV texture, i.e. partial video texture maps from the different camera views (Fig. 0.6).

We have designed two compression methods for MVV textures: a two-level hierarchical residual image compression based on an extracted master texture (MVV-2D-Master-Diff), and a 4D-SPIHT-based approach (MVV-4D-SPIHT).

MVV-2D-Master-Diff [ZLA+04] proved to be quick to calculate and allows random access to MVV texture elements. Its 2D wavelet compression [BISK] is somewhat slow, but its speed scales with the progress of 2D wavelet codecs, so that real-time decoding could be achieved in the foreseeable future.

MVV-4D-SPIHT [ZLMS04], on the other hand, yields excellent compression, as it exploits all 4D correlations. The codec is based on a custom-built 4D-shape-adaptive wavelet compressor that utilizes an adaptation of the common SPIHT algorithm.
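
The master-plus-residual idea behind MVV-2D-Master-Diff can be sketched as follows. The example assumes that the master texture is the rounded per-texel mean over all MVV texture elements and that every element is a complete, flat list of integer intensities. How [ZLA+04] actually extracts the master texture and handles partial texture maps is not described here, and the wavelet coding of the residuals is omitted. All names are hypothetical.

```python
# Master texture plus per-element residuals.
# Assumption: the master is the rounded per-texel mean of all texture elements.


def extract_master(textures):
    """Per-texel mean over all texture elements, rounded to integers."""
    n = len(textures)
    return [round(sum(texels) / n) for texels in zip(*textures)]


def encode(textures):
    """Split the textures into one master and one residual per element."""
    master = extract_master(textures)
    residuals = [[t - m for t, m in zip(tex, master)] for tex in textures]
    return master, residuals


def decode_element(master, residual):
    """Reconstruct a single element from the master and its own residual."""
    return [m + r for m, r in zip(master, residual)]


if __name__ == "__main__":
    textures = [[100, 50, 200],
                [104, 50, 190],
                [96, 53, 210]]
    master, residuals = encode(textures)
    print(master)
    print(residuals)
    print(decode_element(master, residuals[2]) == textures[2])
```

Because each element depends only on the master texture and its own residual, any single element can be decoded without touching the others, which matches the random access property mentioned above. The residuals have small magnitudes when the surface appearance changes little across views and time, which is what makes them cheaper to encode.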

Joint Motion and Reflectance Capture

Investigators: Christian Theobalt, Naveed Ahmed, Edilson de Aguiar, Gernot Ziegler and Hendrik Lensch

Free Viewpoint Video of Human Actors (Sect. 0.1.5) allows photo-realistic rendering of real-world people under the illumination conditions that prevailed at the time of recording. If one wants to augment a virtual environment with 3D video footage one has to make sure that it is correctly rendered under the novel virtual lighting situation. To do this, a reflectance model has to be estimated that mathematically describes the physics of light interaction at the body surface.

Figure 0.7: (a) Wrinkles on T-shirt are reproduced using time-varying normal maps. (b, c) Person rendered from different viewpoints and illuminations (colored dots: light source positions, colors are light source colors). (d, e) Estimated BRDF for one type of clothing can be used as the apparel of a person even for the motion sequences in which the person was originally dressed differently.

In our work on joint motion and reflectance capture, we extend our original free viewpoint video approach for human actors in order to capture relightable 3D videos [TAdA+05]. In addition to the motion, we estimate a dynamic surface reflectance model from multi-view video footage. The dynamic surface reflectance model consists of a parametric BRDF (bidirectional reflectance distribution function) model for each point on the body surface and a time-varying normal map that captures the dynamic changes in surface appearance (e.g., wrinkles in clothing). Two types of MVV sequences are recorded for each person (Sect. 0.1.1). In the reflectance estimation sequence (RES) the person turns on the spot and is illuminated by a single spotlight only. These data are used to infer a parametric BRDF model for each surface point. The dynamic scene sequence (DSS) captures the actual human motion from which one wants to create a 3D video. From the DSS, the motion (as explained in Sect. 0.1.5) as well as the time-varying normal maps are estimated. We also present an algorithm to warp-correct input video images in order to guarantee multi-view photo-consistency in conjunction with inexact object geometry. Fig. 0.7 shows screenshots of different real-world people rendered from novel viewpoints. Subtle details in surface appearance are faithfully captured, and it is even possible to dynamically change the apparel of a person in a free viewpoint video.
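
To make the role of the two components concrete, the sketch below relights a single surface point from a set of reflectance parameters and a surface normal. It uses the Phong reflection model as a stand-in; which parametric BRDF model is actually estimated is not stated here, so the choice of model, the parameter names `kd`, `ks` and `shininess`, and all values are assumptions made for illustration.

```python
# Relighting one surface point with a parametric reflectance model.
# Assumption: Phong reflection model as a stand-in for the estimated BRDF.
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(to_light, normal):
    """Mirror the unit light direction about the unit normal."""
    d = dot(to_light, normal)
    return tuple(2.0 * d * n - l for n, l in zip(normal, to_light))


def shade(normal, to_light, to_viewer, kd, ks, shininess, light=1.0):
    """Reflected intensity for one light source of intensity `light`."""
    n, l, v = normalize(normal), normalize(to_light), normalize(to_viewer)
    cos_in = max(dot(n, l), 0.0)
    if cos_in == 0.0:
        return 0.0  # light arrives from behind the surface
    specular = max(dot(reflect(l, n), v), 0.0) ** shininess
    return light * (kd * cos_in + ks * specular)


if __name__ == "__main__":
    up = (0.0, 0.0, 1.0)
    # Flat surface patch, light and viewer directly above it.
    print(shade(up, up, up, kd=0.6, ks=0.3, shininess=10))
    # Same patch, but the normal map tilts the normal (e.g. a wrinkle).
    print(shade((0.0, 1.0, 1.0), up, up, kd=0.6, ks=0.3, shininess=10))
```

The reflectance parameters stay fixed per surface point, while the normal passed to `shade` would come from the time-varying normal map, so a wrinkle changes the shading without changing the material. Swapping the parameters of one garment for those of another corresponds to the change of apparel shown in Fig. 0.7 (d, e).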


References

[CTMS03] Joel Carranza, Christian Theobalt, Marcus Magnor, and Hans-Peter Seidel.
Free-viewpoint video of human actors.
ACM Transactions on Graphics, 22(3):569-577, July 2003.
(Proc. ACM SIGGRAPH '03).

[dATM+04] Edilson de Aguiar, Christian Theobalt, Marcus Magnor, Holger Theisel, and Hans-Peter Seidel.
M3: Marker-free model reconstruction and motion tracking from 3D voxel data.
In Daniel Cohen-Or, Hyeong-Seok Ko, Demetri Terzopoulos, and Joe Warren, editors, 12th Pacific Conference on Computer Graphics and Applications, PG 2004: proceedings, pages 101-110, Seoul, Korea, October 2004. IEEE.

[Lau94] A. Laurentini.
The visual hull concept for silhouette-based image understanding.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):150-162, February 1994.

[LMS03] Ming Li, Marcus Magnor, and Hans-Peter Seidel.
Online accelerated rendering of visual hulls in real scenes.
Journal of WSCG, 11(2):290-297, 2003.

[LMS04a] Ming Li, Marcus Magnor, and Hans-Peter Seidel.
Hardware-accelerated rendering of photo hulls.
In Marie-Paule Cani and Mel Slater, editors, The European Association for Computer Graphics 25th Annual Conference EUROGRAPHICS 2004, volume 23 of Computer Graphics Forum, pages 635-642, Grenoble, France, 2004. Blackwell.

[LMS04b] Ming Li, Marcus Magnor, and Hans-Peter Seidel.
A hybrid hardware-accelerated algorithm for high quality rendering of visual hulls.
In Wolfgang Heidrich and Ravin Balakrishnan, editors, Graphics Interface 2004: proceedings, pages 41-48, London, Canada, 2004. Canadian Information Processing Society.

[MT04] Marcus Magnor and Christian Theobalt.
Model-based analysis of multi-view video data.
In 2004 Southwest Symposium on Image Analysis and Interpretation, pages 41-45, Lake Tahoe, USA, March 2004. IEEE.

[TAdA+05] Christian Theobalt, Naveed Ahmed, Edilson de Aguiar, Gernot Ziegler, Hendrik Lensch, Marcus Magnor, and Hans-Peter Seidel.
Joint motion and reflectance capture for creating relightable 3D videos.
Research Report MPI-I-2005-4-004, Max-Planck-Institut für Informatik, Saarbrücken, Germany, April 2005.

[TAH+04] Christian Theobalt, Irene Albrecht, Jörg Haber, Marcus Magnor, and Hans-Peter Seidel.
Pitching a baseball - tracking high-speed motion with multi-exposure images.
ACM Transactions on Graphics, 23(3):540-547, August 2004.
(Proc. ACM SIGGRAPH '04).

[TCMS03a] Christian Theobalt, Joel Carranza, Marcus Magnor, and Hans-Peter Seidel.
Enhancing silhouette-based human motion capture with 3D motion fields.
In Jon Rokne, Reinhard Klein, and Wenping Wang, editors, 11th Pacific Conference on Computer Graphics and Applications (PG-03), pages 185-193, Canmore, Canada, October 2003. IEEE.

[TCMS03b] Christian Theobalt, Joel Carranza, Marcus Magnor, and Hans-Peter Seidel.
A parallel framework for silhouette-based human motion capture.
In Thomas Ertl, Bernd Girod, Günther Greiner, Heinrich Niemann, Hans-Peter Seidel, Eckehard Steinbach, and Rüdiger Westermann, editors, Vision, Modeling and Visualization 2003 (VMV-03): proceedings, pages 207-214, Munich, Germany, November 2003. Aka.

[TCMS04a] Christian Theobalt, Joel Carranza, Marcus Magnor, and Hans-Peter Seidel.
3D video - being part of the movie.
ACM SIGGRAPH Computer Graphics, 38(3):18-20, August 2004.

[TCMS04b] Christian Theobalt, Joel Carranza, Marcus Magnor, and Hans-Peter Seidel.
Combining 3D flow fields with silhouette-based human motion capture for immersive video.
Graphical Models, 66:333-351, September 2004.

[TdAM+04] Christian Theobalt, Edilson de Aguiar, Marcus Magnor, Holger Theisel, and Hans-Peter Seidel.
Marker-free kinematic skeleton estimation from sequences of volume data.
In Rynson Lau and George Baciu, editors, ACM Symposium on Virtual Reality Software and Technology (VRST 2004), pages 57-64, Hong Kong, 2004. ACM.

[TLMS03] Christian Theobalt, Ming Li, Marcus Magnor, and Hans-Peter Seidel.
A flexible and versatile studio for multi-view video recording.
In Peter Hall and Philip Willis, editors, Vision, Video and Graphics 2003, pages 9-16, Bath, UK, July 2003. Eurographics.

[TZMS04] Christian Theobalt, Gernot Ziegler, Marcus Magnor, and Hans-Peter Seidel.
Model-based free-viewpoint video: Acquisition, rendering, and encoding.
In Picture Coding Symposium 2004 (PCS-04), pages SpecialSession5,1-6, Davis, USA, December 2004. UC Davis.

[ZLA+04] Gernot Ziegler, Hendrik P. A. Lensch, Naveed Ahmed, Marcus Magnor, and Hans-Peter Seidel.
Multi-video compression in texture space.
In 11th IEEE International Conference on Image Processing (ICIP 2004), pages 2467-2470, Singapore, October 2004. IEEE Signal Processing Society, IEEE.

[ZLMS04] Gernot Ziegler, Hendrik P. A. Lensch, Marcus Magnor, and Hans-Peter Seidel.
Multi-video compression in texture space using 4D SPIHT.
In 2004 IEEE 6th Workshop on Multimedia Signal Processing, pages 39-42, Siena, Italy, September 2004. IEEE Signal Processing Society, IEEE.