Computer Vision and Machine Learning

Wenbin Li (PhD Student)

Personal Information

Research Interests

  • Robotics
  • Activity Modeling
  • Material Recognition
  • Machine Learning


  • 2013-present: PhD student at Max Planck Institute for Informatics and Saarland University, Germany
  • 2010-present: Graduate student at Graduate School for Computer Science, Saarland University, Germany
  • 2010-2013: M.Sc. in Computer Science, Saarland University, Germany
  • 2006-2010: B.Sc. in Science and Technology of Intelligence, Beijing University of Posts and Telecommunications, China

For more information, please visit my personal homepage.


Learning Manipulation under Physics Constraints with Visual Perception
W. Li, A. Leonardis, J. Bohg and M. Fritz
Technical Report, 2019
(arXiv: 1904.09860)
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. In this work, we consider the problem of autonomous block stacking and explore solutions to learning manipulation under physics constraints with visual perception inherent to the task. Inspired by the intuitive physics in humans, we first present an end-to-end learning-based approach to predict stability directly from appearance, contrasting it with a more traditional model-based approach with explicit 3D representations and physical simulation. We study the model's behavior together with an accompanying human subject test. It is then integrated into a real-world robotic system to guide the placement of a single wood block into the scene without collapsing the existing tower structure. To further automate the process of consecutive block stacking, we present an alternative approach where the model learns the physics constraint through interaction with the environment, bypassing the dedicated physics learning as in the former part of this work. In particular, we are interested in the type of tasks that require the agent to reach a given goal state that may be different for every new trial. To this end, we propose a deep reinforcement learning framework that learns policies for stacking tasks which are parametrized by a target structure.
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz and A. Leonardis
Computer Vision - ECCV 2018 Workshops, 2018
From Perception over Anticipation to Manipulation
W. Li
PhD Thesis, Universität des Saarlandes, 2018
From autonomous driving cars to surgical robots, robotic systems have enjoyed significant growth over the past decade. With the rapid development in robotics alongside the evolution in related fields such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic systems. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real-world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular, we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate the work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realize accuracy-specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools, motivated by the absence of dexterous hands on widely available general-purpose robots. Afterwards, we embed the tool use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First, we explore how to guide a robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on synthetic data and then on real-world block stacking.
Further, we introduce the target stacking task where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing the need to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.
Visual Stability Prediction and Its Application to Manipulation
W. Li, A. Leonardis and M. Fritz
AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, 2017
Visual Stability Prediction for Robotic Manipulation
W. Li, A. Leonardis and M. Fritz
IEEE International Conference on Robotics and Automation (ICRA 2017), 2017
Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning
W. Li, J. Bohg and M. Fritz
Technical Report, 2017
(arXiv: 1711.00267)
Understanding physical phenomena is a key component of human intelligence and enables physical interaction with previously unseen environments. In this paper, we study how an artificial agent can autonomously acquire this intuition through interaction with the environment. We created a synthetic block stacking environment with physics simulation in which the agent can learn a policy end-to-end through trial and error. We thereby bypass the need to explicitly model physical knowledge within the policy. We are specifically interested in tasks that require the agent to reach a given goal state that may be different for every new trial. To this end, we propose a deep reinforcement learning framework that learns policies which are parametrized by a goal. We validated the model on a toy example navigating in a grid world with different target positions and in a block stacking task with different target structures of the final tower. In contrast to prior work, our policies show better generalization across different goals.
Recognition of Ongoing Complex Activities by Sequence Prediction Over a Hierarchical Label Space
W. Li and M. Fritz
2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), 2016
To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction
W. Li, S. Azimi, A. Leonardis and M. Fritz
Technical Report, 2016
(arXiv: 1604.00066)
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that such skills are acquired by infants from observations at a very early stage. In this paper, we contrast a more traditional approach of taking a model-based route with explicit 3D representations and physical simulation with an end-to-end approach that directly predicts stability and related quantities from appearance. We ask whether, and to what extent and quality, such a skill can be acquired directly in a data-driven way, bypassing the need for an explicit simulation. We present a learning-based approach based on simulated data that predicts the stability of towers comprised of wooden blocks under different conditions, as well as quantities related to the potential fall of the towers. The evaluation is carried out on synthetic data and compared to human judgments on the same stimuli.
Teaching Robots the Use of Human Tools from Demonstration with Non-dexterous End-effectors
W. Li and M. Fritz
2015 IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS 2015), 2015
Learning Multi-scale Representations for Material Classification
W. Li
Pattern Recognition (GCPR 2014), 2014
Learning Multi-scale Representations for Material Classification
W. Li and M. Fritz
Technical Report, 2014
(arXiv: 1408.2938)
The recent progress in sparse coding and deep learning has made unsupervised feature learning methods a strong competitor to hand-crafted descriptors. In computer vision, success stories of learned features have been predominantly reported for object recognition tasks. In this paper, we investigate if and how feature learning can be used for material recognition. We propose two strategies to incorporate scale information into the learning procedure resulting in a novel multi-scale coding procedure. Our results show that our learned features for material recognition outperform hand-crafted descriptors on the FMD and the KTH-TIPS2 material classification benchmarks.
Recognizing Materials from Virtual Examples
W. Li and M. Fritz
Computer Vision - ECCV 2012, 2012