Zhi Li receives PhD

On Friday, 08 May 2026, Zhi Li defended her thesis, titled “Monocular 3D Human-Environment Understanding: From Interaction to Reconstruction”. From June 2022 until November 2025 she was a PhD student in Computer Science at the Saarland Informatics Campus, Saarbrücken, and the Max Planck Institute for Informatics under the supervision of Prof. Bernt Schiele, head of the Department “Computer Vision and Machine Learning”. The doctoral degree is awarded by Saarland University.

Abstract of the thesis:
Understanding and reconstructing 3D humans and their environments from monocular observations is a fundamental yet profoundly challenging problem in computer vision. Without stereo or multi-view information, monocular systems must infer depth, motion, and spatial structure from inherently ambiguous visual cues. However, the ubiquity of monocular cameras in autonomous systems, robotics, and consumer devices makes this setting not only practical but also essential. This thesis explores a unified framework for monocular 3D human-environment understanding that progresses through a series of self-supervised or weakly-supervised models, moving from capturing human motion to adapting to changing environments and ultimately reconstructing them.

The first part investigates how environmental cues can be exploited to improve human motion understanding from monocular inputs. Specifically, physical constraints—such as ground contact, support, and body-environment proximity—are leveraged to guide pose estimation. A factorised correction-based framework is proposed for multi-person monocular 3D pose estimation, enabling stable optimisation over imperfect initial predictions. Based on this foundation, a contact-guided motion capture method is introduced, sampling from pose manifolds while enforcing dense contact consistency with the scene. These methods demonstrate how even limited monocular information can be enriched through structured interaction with the surrounding environment. 

Beyond human-centric modelling, the next stage examines how human motion itself can be used to recover environmental changes. In dynamic or deformable settings, static scene assumptions no longer hold. To address this, a joint reconstruction framework is developed to simultaneously estimate 3D human motion and environment deformations from monocular video. This approach captures mutual influence: humans adapt to the scene, and their movements reveal the scene’s pliability and evolution. Grounded in optimisation, this formulation models environment deformation through human motion, providing a pathway toward high-fidelity dynamic scene reconstruction.

As scenes evolve—both spatially and across domains—monocular systems must remain robust to distribution shifts. To address this, a source-free test-time domain adaptation framework is proposed for monocular depth estimation. A self-supervised optimisation strategy is employed to adapt depth predictions to unseen target domains during inference, without access to source domain data or annotations. By leveraging geometric consistency and photometric cues available at test time, this method effectively mitigates domain shifts commonly encountered in outdoor driving scenarios. Unlike prior approaches that require offline retraining or access to labelled source data, this solution is plug-and-play, efficient, and enhances generalisation in a fully unsupervised setting. 

The final stage turns toward full scene reconstruction from a single view. Methods are developed for semantic 3D occupancy prediction from monocular images, enabling feed-forward single-frame inference without reliance on ground-truth occupancy or LiDAR supervision. The approach begins with a NeRF-based volumetric rendering formulation to align 3D semantic predictions with 2D annotations through differentiable rendering losses. Within this framework, a multi-task interaction strategy is specifically designed to improve the synergy between semantic supervision and geometric reconstruction. By integrating semantic and geometric reasoning in a unified formulation, this method enables rich 3D scene understanding with minimal supervision. Despite being trained only with 2D supervision, the system can recover meaningful volumetric structure from single images, offering a practical step toward self-supervised monocular 3D reconstruction.

Across these stages, the contributions in this thesis form a coherent progression toward robust, self-supervised 3D perception from monocular visual input. From capturing interaction to reconstructing structure, the presented framework demonstrates how machines can perceive and interpret the 3D world through the narrow lens of a single camera, without requiring expensive sensors or annotations. This opens new possibilities in dynamic scene understanding, human-centric computing, and embodied AI.