# Personal Information

## Research Interests

• Robotics
• Activity Modeling
• Material Recognition
• Machine Learning

## Education

• 2013-present: PhD student at Max Planck Institute for Informatics and Saarland University, Germany
• 2010-present: Graduate student at Graduate School for Computer Science, Saarland University, Germany
• 2010-2013: M.Sc. in Computer Science, Saarland University, Germany
• 2006-2010: B.Sc. in Science and Technology of Intelligence, Beijing University of Posts and Telecommunications, China

# Publications

## 2019
Learning to Reconstruct People in Clothing from a Single RGB Camera
T. Alldieck, M. A. Magnor, B. L. Bhatnagar, C. Theobalt and G. Pons-Moll
32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
(Accepted/in press)
In the Wild Human Pose Estimation using Explicit 2D Features and Intermediate 3D Representations
I. Habibie, W. Xu, D. Mehta, G. Pons-Moll and C. Theobalt
32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
(Accepted/in press)
Semantic Projection Network for Zero- and Few-Label Semantic Segmentation
Y. Xian, S. Choudhury, Y. He, B. Schiele and Z. Akata
32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
(Accepted/in press)
f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning
Y. Xian, S. Sharma, B. Schiele and Z. Akata
32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
(Accepted/in press)
SimulCap: Single-View Human Performance Capture with Cloth Simulation
T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll and Y. Liu
32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
(Accepted/in press)
LiveCap: Real-time Human Performance Capture from Monocular Video
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics, 2019
(Accepted/in press)
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
X. Zhang, Y. Sugano and A. Bulling
CHI 2019, CHI Conference on Human Factors in Computing Systems, 2019
(Accepted/in press)
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 41, Number 1, 2019
Fashion is Taking Shape: Understanding Clothing Preference Based on Body Shape From Online Sources
H. Sattar, G. Pons-Moll and M. Fritz
2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019), 2019
Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods
A. Bhattacharyya, M. Fritz and B. Schiele
International Conference on Learning Representations (ICLR 2019), 2019
(Accepted/in press)
Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage
P. Müller, D. Buschek, M. X. Huang and A. Bulling
Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2019
(Accepted/in press)
PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features
J. Steil, M. Koelle, W. Heuten, S. Boll and A. Bulling
Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2019
(Accepted/in press)
Privacy-Aware Eye Tracking Using Differential Privacy
J. Steil, I. Hagestedt, M. X. Huang and A. Bulling
Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2019
(Accepted/in press)
Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour
M. X. Huang, J. Li, G. Ngai, H. V. Leong and A. Bulling
Technical Report, 2019
(arXiv: 1901.06572)
Abstract
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
M. X. Huang and A. Bulling
Technical Report, 2019
(arXiv: 1903.04047)
Abstract
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that requires only users' eye movements to reduce calibration distortion, operating in the background while users naturally look at an interface. Our method exploits the fact that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of a distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating the accuracy decrease of eye trackers during long-term use.
## 2018
Sequential Attacks on Agents for Long-Term Adversarial Goals
E. Tretschk, S. J. Oh and M. Fritz
2nd ACM Computer Science in Cars Symposium (CSCS 2018), 2018
Detailed Human Avatars from Monocular Video
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt and G. Pons-Moll
3DV 2018, International Conference on 3D Vision, 2018
Single-Shot Multi-person 3D Pose Estimation from Monocular RGB
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll and C. Theobalt
3DV 2018, International Conference on 3D Vision, 2018
Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation
M. Omran, C. Lassner, G. Pons-Moll, P. Gehler and B. Schiele
3DV 2018, International Conference on 3D Vision, 2018
Video Object Segmentation with Language Referring Expressions
A. Khoreva, A. Rohrbach and B. Schiele
ACCV 2018, 14th Asian Conference on Computer Vision, 2018
(Accepted/in press)
NightOwls: A Pedestrians at Night Dataset
L. Neumann, M. Karg, S. Zhang, C. Scharfenberger, E. Piegert, S. Mistr, O. Prokofyeva, R. Thiel, A. Vedaldi, A. Zisserman and B. Schiele
ACCV 2018, 14th Asian Conference on Computer Vision, 2018
(Accepted/in press)
Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges and G. Pons-Moll
ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2018), Volume 37, Number 6, 2018
Quick Bootstrapping of a Personalized Gaze Model from Real-Use Interactions
M. X. Huang, J. Li, G. Ngai and H. V. Leong
ACM Transactions on Intelligent Systems and Technology, Volume 9, Number 4, 2018
Adversarial Scene Editing: Automatic Object Removal from Weak Supervision
R. Shetty, M. Fritz and B. Schiele
Advances in Neural Information Processing Systems 31, 2018
Abstract
While great progress has been made recently in automatic image manipulation, it has been limited to object-centric images such as faces or structured scene datasets. In this work, we take a step towards general scene-level image editing by developing an automatic, interaction-free object removal model. Our model learns to find and remove objects from general scene images using image-level labels and unpaired data in a generative adversarial network (GAN) framework. We achieve this with two key contributions: a two-stage editor architecture consisting of a mask generator and an image in-painter that cooperate to remove objects, and a novel GAN-based prior for the mask generator that allows us to flexibly incorporate knowledge about object shapes. We show experimentally on two datasets that our method effectively removes a wide variety of objects using weak supervision only.
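The two-stage editor described above (a mask generator followed by an image in-painter) can be illustrated with a minimal, non-learned sketch. The thresholding "mask generator" and mean-fill "in-painter" below are stand-ins for the paper's trained networks, chosen only to make the pipeline shape concrete:

```python
def generate_mask(image, target):
    # Stand-in for the learned mask generator: flag pixels equal to a
    # known "object" value (a trained model would predict this mask).
    return [[pix == target for pix in row] for row in image]

def inpaint(image, mask):
    # Stand-in for the learned in-painter: fill masked pixels with the
    # mean of the unmasked background.
    background = [p for row, mrow in zip(image, mask)
                  for p, m in zip(row, mrow) if not m]
    fill = sum(background) / len(background)
    return [[fill if m else p for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]

# Tiny grayscale "scene" where 9.0 marks the object to remove.
img = [[1.0, 1.0, 9.0],
       [1.0, 9.0, 1.0],
       [1.0, 1.0, 1.0]]
mask = generate_mask(img, 9.0)   # stage 1: locate the object
edited = inpaint(img, mask)      # stage 2: fill the hole
print(edited)
```

In the actual model both stages are adversarially trained and co-operate end-to-end; the sketch only shows how a predicted mask hands off to an in-painter.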
Unsupervised Learning of Shape and Pose with Differentiable Point Clouds
E. Insafutdinov and A. Dosovitskiy
Advances in Neural Information Processing Systems 31 (NIPS 2018), 2018
VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements
M. Khamis, C. Oechsner, F. Alt and A. Bulling
AVI 2018, International Conference on Advanced Visual Interfaces, 2018
JAMI: Fast Computation of Conditional Mutual Information for ceRNA Network Analysis
A. Horňáková, M. List, J. Vreeken and M. H. Schulz
Bioinformatics, Volume 34, Number 17, 2018
Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild
M. Khamis, A. Baier, N. Henze, F. Alt and A. Bulling
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
Which one is me? Identifying Oneself on Public Displays
M. Khamis, C. Becker, A. Bulling and F. Alt
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
Training Person-Specific Gaze Estimators from Interactions with Multiple Devices
X. Zhang, M. X. Huang, Y. Sugano and A. Bulling
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
E. Wood, T. Baltrusaitis, L.-P. Morency, P. Robinson and A. Bulling
Computer Graphics Forum (Proc. EUROGRAPHICS 2018), Volume 37, Number 2, 2018
Grounding Visual Explanations
L. A. Hendricks, R. Hu, T. Darrell and Z. Akata
Computer Vision -- ECCV 2018, 2018
Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes
Y. He, B. Schiele and M. Fritz
Computer Vision -- ECCV 2018, 2018
Textual Explanations for Self-Driving Vehicles
J. Kim, A. Rohrbach, T. Darrell, J. Canny and Z. Akata
Computer Vision -- ECCV 2018, 2018
Abstract
Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. We propose a new approach to introspective explanations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to vehicle control commands, i.e., acceleration and change of course. The controller's attention identifies image regions that potentially influence the network's output. Second, we use an attention-based video-to-text model to produce textual explanations of model actions. The attention maps of the controller and the explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong and weak alignment. Finally, we explore a version of our model that generates rationalizations, and compare them with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at https://github.com/JinkyuKimUCB/explainable-deep-driving
A Hybrid Model for Identity Obfuscation by Face Replacement
Q. Sun, A. Tewari, W. Xu, M. Fritz, C. Theobalt and B. Schiele
Computer Vision -- ECCV 2018, 2018
Recovering Accurate 3D Human Pose in the Wild Using IMUs and a Moving Camera
T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn and G. Pons-Moll
Computer Vision -- ECCV 2018, 2018
GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User
M. Khamis, A. Kienle, F. Alt and A. Bulling
DroNet’18, 4th ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, 2018
Demo of XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, H. Rhodin, W. Xu, G. Pons-Moll and C. Theobalt
ECCV 2018 Demo Sessions, 2018
A Vision-grounded Dataset for Predicting Typical Locations for Verbs
N. Mukuze, A. Rohrbach, V. Demberg and B. Schiele
Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
Eye Movements During Everyday Behavior Predict Personality Traits
S. Hoppe, T. Loetscher, S. Morey and A. Bulling
Frontiers in Human Neuroscience, Volume 12, 2018
Objects, Relationships, and Context in Visual Data
H. Zhang and Q. Sun
ICMR’18, International Conference on Multimedia Retrieval, 2018
Video Based Reconstruction of 3D People Models
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
PoseTrack: A Benchmark for Human Pose Estimation and Tracking
M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Accurate and Diverse Sampling of Sequences based on a “Best of Many” Sample Objective
A. Bhattacharyya, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty
A. Bhattacharyya, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs
E. Laude, J.-H. Lange, J. Schüpfer, C. Domokos, L. Leal-Taixé, F. R. Schmidt, B. Andres and D. Cremers
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Disentangled Person Image Generation
L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images
T. Orekondy, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Learning 3D Shape Completion from Laser Scan Data with Weak Supervision
D. Stutz and A. Geiger
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Natural and Effective Obfuscation by Head Inpainting
Q. Sun, L. Ma, S. J. Oh, L. Van Gool, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Feature Generating Networks for Zero-Shot Learning
Y. Xian, T. Lorenz, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Fooling Vision and Language Models Despite Localization and Attention Mechanism
X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell and D. Song
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll and Y. Liu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Occluded Pedestrian Detection through Guided Attention in CNNs
S. Zhang, J. Yang and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Learning to Refine Human Pose Estimation
M. Fieraru, A. Khoreva, L. Pishchulin and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), 2018
Image and Video Captioning with Augmented Neural Architectures
R. Shetty, H. R. Tavakoli and J. Laaksonen
IEEE MultiMedia, Volume 25, Number 2, 2018
M. X. Huang, J. Li, G. Ngai, H. V. Leong and K. A. Hua
IEEE Transactions on Multimedia, Volume 20, Number 7, 2018
Reflectance and Natural Illumination from Single-Material Specular Objects Using Deep Learning
S. Georgoulis, K. Rematas, T. Ritschel, E. Gavves, M. Fritz, L. Van Gool and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 8, 2018
Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification
M. Lapin, M. Hein and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 7, 2018
Discriminatively Trained Latent Ordinal Model for Video Classification
K. Sikka and G. Sharma
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 8, 2018
Zero-shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly
Y. Xian, C. H. Lampert, B. Schiele and Z. Akata
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018
(Accepted/in press)
Abstract
Due to the importance of zero-shot learning, i.e. classifying images for which there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and analyze the status quo of the area. The purpose of this paper is three-fold. First, given that there is no agreed-upon zero-shot learning benchmark, we define a new benchmark by unifying both the evaluation protocols and the data splits of publicly available datasets used for this task. This is an important contribution, as published results are often not comparable and sometimes even flawed due to, e.g., pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset, which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of state-of-the-art methods in depth, both in the classic zero-shot setting and in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area, which can be taken as a basis for advancing it.
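As a side note on the generalized zero-shot setting mentioned above: a summary metric commonly used in this line of work (an assumption about the protocol, since the abstract does not name it) is the harmonic mean of per-class accuracy on seen and unseen classes, which is high only when both are high:

```python
def harmonic_mean(acc_seen, acc_unseen):
    # Harmonic mean of seen- and unseen-class accuracies; it penalizes
    # models that trade one off against the other.
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# A balanced model beats one that largely ignores unseen classes,
# even if the latter has higher seen-class accuracy.
print(harmonic_mean(0.6, 0.3))   # -> 0.4
print(harmonic_mean(0.9, 0.05))
```

The illustrative accuracy values are placeholders, not results from the paper.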
Towards Reaching Human Performance in Pedestrian Detection
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 4, 2018
Abstract
Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and background-versus-foreground errors. To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Beyond our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of training and test annotations.
Learning 3D Shape Completion under Weak Supervision
D. Stutz and A. Geiger
International Journal of Computer Vision, 2018
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori and L. Fei-Fei
International Journal of Computer Vision, Volume 126, Number 2-4, 2018
Every Little Movement Has a Meaning of Its Own: Using Past Mouse Movements to Predict the Next Interaction
T. C. K. Kwok, E. Y. Fu, E. Y. Wu, M. X. Huang, G. Ngai and H.-V. Leong
IUI 2018, 23rd International Conference on Intelligent User Interfaces, 2018
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour
P. Müller, M. X. Huang and A. Bulling
IUI 2018, 23rd International Conference on Intelligent User Interfaces, 2018
Explainable AI: The New 42?
R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg and A. Holzinger
Machine Learning and Knowledge Extraction (CD-MAKE 2018), 2018
Tracing Cell Lineages in Videos of Lens-free Microscopy
M. Rempfler, V. Stierle, K. Ditzel, S. Kumar, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Analysis, Volume 48, 2018
Cross-Species Learning: A Low-Cost Approach to Learning Human Fight from Animal Fight
E. Y. Fu, M. X. Huang, H. V. Leong and G. Ngai
MM’18, 26th ACM Multimedia Conference, 2018
The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned
M. Khamis, F. Alt and A. Bulling
MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018
Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors
J. Steil, P. Müller, Y. Sugano and A. Bulling
MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018
NRST: Non-rigid Surface Tracking from Monocular Video
M. Habermann, W. Xu, H. Rhodin, M. Zollhöfer, G. Pons-Moll and C. Theobalt
Pattern Recognition (GCPR 2018), 2018
Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction
M. Barz, F. Daiber, D. Sonntag and A. Bulling
Proceedings ETRA 2018, 2018
Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimulus Trajectory is Partially Hidden
T. Mattusch, M. Mirzamohammad, M. Khamis, A. Bulling and F. Alt
Proceedings ETRA 2018, 2018
Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour
P. Müller, M. X. Huang, X. Zhang and A. Bulling
Proceedings ETRA 2018, 2018
Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings
S. Park, X. Zhang, A. Bulling and O. Hilliges
Proceedings ETRA 2018, 2018
Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets
J. Steil, M. X. Huang and A. Bulling
Proceedings ETRA 2018, 2018
Revisiting Data Normalization for Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano and A. Bulling
Proceedings ETRA 2018, 2018
A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation
R. Shetty, B. Schiele and M. Fritz
Proceedings of the 27th USENIX Security Symposium, 2018
Partial Optimality and Fast Lower Bounds for Weighted Correlation Clustering
J.-H. Lange, A. Karrenbauer and B. Andres
Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
A. Khan, I. Steiner, Y. Sugano, A. Bulling and R. Macdonald
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
Generating Counterfactual Explanations with Natural Language
L. A. Hendricks, R. Hu, T. Darrell and Z. Akata
Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), 2018
(arXiv: 1806.09809)
Abstract
Natural language explanations of deep neural network decisions provide an intuitive way for an AI agent to articulate a reasoning process. Current textual explanations learn to discuss class-discriminative features in an image. However, it is also helpful to understand which attributes might change a classification decision if present in an image (e.g., "This is not a Scarlet Tanager because it does not have black wings."). We call such textual explanations counterfactual explanations, and propose an intuitive method to generate them by inspecting which evidence in an input is missing, but might contribute to a different classification decision if present in the image. To demonstrate our method we consider a fine-grained image classification task in which we take as input an image and a counterfactual class and output text which explains why the image does not belong to the counterfactual class. We then analyze our generated counterfactual explanations both qualitatively and quantitatively using proposed automatic metrics.
Advanced Steel Microstructure Classification by Deep Learning Methods
S. M. Azimi, D. Britz, M. Engstler, M. Fritz and F. Mücklich
Scientific Reports, Volume 8, 2018
Abstract
The inner structure of a material is called its microstructure. It stores the genesis of a material and determines all of its physical and chemical properties. While microstructural characterization is widespread and well understood, microstructural classification is mostly done manually by human experts, which leaves room for considerable uncertainty. Since a microstructure can be a combination of different phases with complex substructures, its automatic classification is very challenging, and only little work in this field has been carried out. Most prior related work applies features designed and engineered by experts and classifies the microstructure in a step separate from feature extraction. Recently, deep learning methods have shown surprisingly good performance in vision applications by learning features from data jointly with the classification step. In this work, we propose a deep learning method for microstructure classification, demonstrated on certain microstructural constituents of low-carbon steel. This novel method employs pixel-wise segmentation via fully convolutional neural networks (FCNNs) accompanied by a max-voting scheme. Our system achieves 93.94% classification accuracy, drastically outperforming the state-of-the-art accuracy of 48.89% and indicating the effectiveness of pixel-wise approaches. Beyond the results presented in this paper, this line of research offers a more robust and, above all, objective approach to the difficult task of assessing steel quality.
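The max-voting step mentioned above can be sketched in a few lines: given per-pixel class predictions from the segmentation network within one segmented region, the region's class is simply the majority pixel label. The class ids below are illustrative placeholders, not the paper's actual label set:

```python
from collections import Counter

def max_vote(pixel_labels):
    # Majority vote over per-pixel class predictions for one region.
    return Counter(pixel_labels).most_common(1)[0][0]

# Hypothetical per-pixel predictions from a segmentation network for a
# single segmented region (class ids are made up for illustration).
region_pixels = [1, 1, 0, 1, 2, 1, 1, 0]
print(max_vote(region_pixels))  # -> 1
```

Voting over all pixels of a region makes the final label robust to scattered per-pixel misclassifications, which is one reason pixel-wise approaches work well here.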
Towards Reverse-Engineering Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Sixth International Conference on Learning Representations (ICLR 2018), 2018
(Accepted/in press)
Long-Term Image Boundary Prediction
A. Bhattacharyya, M. Malinowski, B. Schiele and M. Fritz
Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz and A. Leonardis
Visual Learning and Embodied Agents in Simulation Environments (ECCV 2018 Workshop), 2018
(arXiv: 1809.03707)
Abstract
In-depth scene descriptions and question answering tasks have greatly increased the scope of today's definition of scene understanding. While such tasks are in principle open ended, current formulations primarily focus on describing only the current state of the scenes under consideration. In contrast, in this paper, we focus on the future states of the scenes which are also conditioned on actions. We posit this as a question answering task, where an answer has to be given about a future scene state, given observations of the current scene, and a question that includes a hypothetical action. Our solution is a hybrid model which integrates a physics engine into a question answering architecture in order to anticipate future scene states resulting from object-object interactions caused by an action. We demonstrate first results on this challenging new problem and compare to baselines, where we outperform fully data-driven end-to-end learning approaches.
Higher-order Projected Power Iterations for Scalable Multi-Matching
F. Bernard, J. Thunberg, P. Swoboda and C. Theobalt
Technical Report, 2018
(arXiv: 1811.10541)
Abstract
The matching of multiple objects (e.g. shapes or images) is a fundamental problem in vision and graphics. In order to robustly handle ambiguities, noise and repetitive patterns in challenging real-world settings, it is essential to take geometric consistency between points into account. Computationally, the multi-matching problem is difficult. It can be phrased as simultaneously solving multiple (NP-hard) quadratic assignment problems (QAPs) that are coupled via cycle-consistency constraints. The main limitations of existing multi-matching methods are that they either ignore geometric consistency and thus have limited robustness, or they are restricted to small-scale problems due to their (relatively) high computational cost. We address these shortcomings by introducing a Higher-order Projected Power Iteration method, which is (i) efficient and scales to tens of thousands of points, (ii) straightforward to implement, (iii) able to incorporate geometric consistency, and (iv) guarantees cycle-consistent multi-matchings. Experimentally we show that our approach is superior to existing methods.
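For intuition only, the core primitive behind such projected power methods is the classical power iteration, sketched below on a toy matrix; the paper's higher-order, cycle-consistency-constrained variant is substantially more involved:

```python
import math

def power_iteration(matrix, iters=100):
    # Repeatedly apply the matrix and renormalize; for a matrix with a
    # unique largest-magnitude eigenvalue, the iterate converges to the
    # dominant eigenvector.
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy symmetric matrix with eigenvalues 3 and 1; the dominant
# eigenvector is proportional to [1, 1].
A = [[2.0, 1.0], [1.0, 2.0]]
print(power_iteration(A))
```

A projected variant would additionally map each iterate back onto a feasible set (here, cycle-consistent assignments) after every multiplication, which is where the method's guarantees come from.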
Bayesian Prediction of Future Street Scenes through Importance Sampling based Optimization
A. Bhattacharyya, M. Fritz and B. Schiele
Technical Report, 2018
(arXiv: 1806.06939)
Abstract
For autonomous agents to successfully operate in the real world, anticipation of future events and states of their environment is a key competence. This problem can be formalized as a sequence prediction problem, where a number of observations are used to predict the sequence into the future. However, real-world scenarios demand a model of uncertainty of such predictions, as future states become increasingly uncertain and multi-modal, in particular on long time horizons. This makes modelling and learning challenging. We cast state-of-the-art semantic segmentation and future prediction models based on deep learning into a Bayesian formulation that in turn allows for a full Bayesian treatment of the prediction problem. We present a new sampling scheme for this model that draws from the success of variational autoencoders by incorporating a recognition network. In the experiments we show that our model outperforms prior work in accuracy of the predicted segmentation and provides calibrated probabilities that also better capture the multi-modal aspects of possible future states of street scenes.
Proceedings PETMEI 2018
A. Bulling, E. Kasneci and C. Lander (Eds.)
ACM, 2018
Primal-Dual Wasserstein GAN
M. Gemici, Z. Akata and M. Welling
Technical Report, 2018
(arXiv: 1805.09575)
Abstract
We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for building latent variable models of the data distribution based on the primal and the dual formulations of the optimal transport (OT) problem. We utilize the primal formulation to learn a flexible inference mechanism and to create an optimal approximate coupling between the data distribution and the generative model. In order to learn the generative model, we use the dual formulation and train the decoder adversarially through a critic network that is regularized by the approximate coupling obtained from the primal. Unlike previous methods that violate various properties of the optimal critic, we regularize the norm and the direction of the gradients of the critic function. Our model shares many of the desirable properties of auto-encoding models in terms of mode coverage and latent structure, while avoiding their undesirable averaging properties, e.g. their inability to capture sharp visual features when modeling real images. We compare our algorithm with several other generative modeling techniques that utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception Scores (IS).
MLCapsule: Guarded Offline Deployment of Machine Learning as a Service
L. Hanzlik, Y. Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes and M. Fritz
Technical Report, 2018
(arXiv: 1808.00590)
Abstract
With the widespread use of machine learning (ML) techniques, ML as a service has become increasingly popular. In this setting, an ML model resides on a server and users can query the model with their data via an API. However, if the user's input is sensitive, sending it to the server is not an option. Equally, the service provider does not want to share the model by sending it to the client, in order to protect its intellectual property and its pay-per-query business model. In this paper, we propose MLCapsule, a guarded offline deployment of machine learning as a service. MLCapsule executes the machine learning model locally on the user's client, and therefore the data never leaves the client. Meanwhile, MLCapsule offers the service provider the same level of control and security over its model as the commonly used server-side execution. In addition, MLCapsule is applicable to offline applications that require local execution. Beyond protecting against direct model access, we demonstrate that MLCapsule allows for implementing defenses against advanced attacks on machine learning models such as model stealing/reverse engineering and membership inference.
Manipulating Attributes of Natural Scenes via Hallucination
L. Karacan, Z. Akata, A. Erdem and E. Erdem
Technical Report, 2018
(arXiv: 1808.07413)
Abstract
In this study, we explore building a two-stage framework that enables users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken in a different season (e.g. during winter), weather condition (e.g. on a cloudy day) or time of the day (e.g. at sunset). Once the scene is hallucinated with the given attributes, the corresponding look is then transferred to the input image while keeping the semantic details intact, giving a photo-realistic manipulation result. As the proposed framework hallucinates what the scene will look like, it does not require any reference style image, as commonly utilized in most appearance or style transfer approaches. Moreover, it allows a given scene to be manipulated simultaneously according to a diverse set of transient attributes within a single model, eliminating the need to train a separate network for each translation task. Our comprehensive set of qualitative and quantitative results demonstrates the effectiveness of our approach against competing methods.
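As rough intuition for the second, look-transfer stage, a classical global colour-transfer heuristic shifts each channel of the input to match the statistics of a target image. This hand-rolled sketch is only a stand-in for the paper's learned transfer (and `transfer_look` is a hypothetical name):

```python
import numpy as np

# Classic global colour transfer: match each channel's mean and standard
# deviation of the source image to those of the (hallucinated) target.
def transfer_look(source, target):
    """Match per-channel mean/std of `source` to those of `target`."""
    out = np.empty_like(source, dtype=float)
    for c in range(source.shape[-1]):
        s, t = source[..., c], target[..., c]
        out[..., c] = (s - s.mean()) / (s.std() + 1e-8) * t.std() + t.mean()
    return out

# A neutral "input" gradient and a warmer "hallucinated" target.
src = np.tile(np.linspace(0.2, 0.8, 16).reshape(4, 4, 1), (1, 1, 3))
tgt = np.tile(np.linspace(0.4, 0.9, 16).reshape(4, 4, 1), (1, 1, 3))
tgt[..., 0] += 0.05  # stronger red channel in the target
result = transfer_look(src, tgt)
print(np.allclose(result.mean(axis=(0, 1)), tgt.mean(axis=(0, 1))))  # True
```

Unlike this global statistic matching, the paper's network transfers the hallucinated look while preserving per-region semantic detail.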
Combinatorial Persistency Criteria for Multicut and Max-Cut
J.-H. Lange, B. Andres and P. Swoboda
Technical Report, 2018
(arXiv: 1812.01426)
Abstract
In combinatorial optimization, partial variable assignments are called persistent if they agree with some optimal solution. We propose persistency criteria for the multicut and max-cut problems, as well as fast combinatorial routines to verify them. The criteria that we derive are based on mappings that improve feasible multicuts and cuts, respectively. Our elementary criteria can be checked enumeratively. The more advanced ones rely on fast algorithms for upper and lower bounds on the respective cut problems and on max-flow techniques for auxiliary min-cut problems. Our methods can be used as a preprocessing technique for reducing problem sizes or for computing partial optimality guarantees for solutions output by heuristic solvers. We show the efficacy of our methods on instances of both problems from computer vision, biomedical image analysis and statistical physics.
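The persistency definition in the first sentence can be checked directly, if exponentially, on toy instances. The sketch below (a brute-force check, not the paper's fast criteria; function names are mine) enumerates all cuts of a small weighted graph to decide whether a partial node assignment for max-cut agrees with some optimal solution:

```python
import itertools

# Brute-force persistency check for max-cut: a partial node assignment is
# persistent iff at least one optimal cut agrees with it.
def maxcut_value(labels, edges):
    """Total weight of edges crossing the bipartition given by `labels`."""
    return sum(w for u, v, w in edges if labels[u] != labels[v])

def is_persistent(partial, n_nodes, edges):
    """Enumerate all 2^n cuts; check if some optimum agrees with `partial`."""
    cuts = [dict(enumerate(bits))
            for bits in itertools.product([0, 1], repeat=n_nodes)]
    best = max(maxcut_value(labels, edges) for labels in cuts)
    return any(maxcut_value(labels, edges) == best and
               all(labels[v] == side for v, side in partial.items())
               for labels in cuts)

# Triangle with one heavy edge: every optimal cut separates nodes 0 and 1.
edges = [(0, 1, 5), (1, 2, 1), (0, 2, 1)]
print(is_persistent({0: 0, 1: 1}, 3, edges))  # True
print(is_persistent({0: 0, 1: 0}, 3, edges))  # False
```

The paper's criteria certify such assignments without enumerating solutions, which is what makes them usable for preprocessing large instances.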
Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation
K. Z. Lin, W. Xu, Q. Sun, C. Theobalt and T.-S. Chua
Technical Report, 2018
(arXiv: 1812.09899)
Abstract
We propose a novel approach to jointly perform 3D object retrieval and pose estimation from monocular images. In order to make the method robust to real-world scene variations in the images, e.g. texture, lighting and background, we learn an embedding space from 3D data that only includes the relevant information, namely the shape and pose. Our method can then be trained for robustness under real-world scene variations without having to render a large training set simulating these variations. Our learned embedding explicitly disentangles a shape vector and a pose vector, which alleviates both pose bias for 3D shape retrieval and categorical bias for pose estimation. Having the learned disentangled embedding, we train a CNN to map the images to the embedding space, and then retrieve the closest 3D shape from the database and estimate the 6D pose of the object using the embedding vectors. Our method achieves 10.8 median error for pose estimation and 0.514 top-1 accuracy for category-agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore outperforms the previous state-of-the-art methods on both tasks.
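The retrieval step described above reduces to nearest-neighbour search on the shape part of the embedding. A minimal sketch, assuming a hypothetical layout where the first `split` dimensions of each vector encode shape and the rest encode pose (names and vectors invented for illustration):

```python
import numpy as np

# Retrieval on a disentangled embedding: compare only the shape sub-vector,
# so pose variation cannot bias the shape match.
def retrieve_shape(query, database, split):
    """Index of the database entry whose shape sub-vector is closest
    to the query's, under cosine similarity."""
    q = query[:split] / np.linalg.norm(query[:split])
    sims = [float(q @ (e[:split] / np.linalg.norm(e[:split])))
            for e in database]
    return int(np.argmax(sims))

db = [np.array([1.0, 0.0, 0.3, 0.7]),   # shape A, some pose
      np.array([0.0, 1.0, 0.9, 0.1])]   # shape B, another pose
query = np.array([0.9, 0.1, 0.5, 0.5])  # shape close to A, unrelated pose
print(retrieve_shape(query, db, split=2))  # 0 (matches shape A)
```

In the paper the CNN maps the image into this space and the pose sub-vector is used separately for 6D pose estimation; only the matching logic is sketched here.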
From Perception over Anticipation to Manipulation
W. Li
PhD Thesis, Universität des Saarlandes, 2018
Abstract
From autonomous driving cars to surgical robots, robotic systems have enjoyed significant growth over the past decade. With the rapid development in robotics, alongside the evolution of related fields such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic systems. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real-world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular, we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate this work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realizes accuracy-specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools – motivated by the absence of dexterous hands on widely available general-purpose robots. Afterwards, we embed the tool-use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First we explore how to guide a robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on synthetic data and then on real-world block stacking.
Further, we introduce the target stacking task, where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing the need to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.
Deep Appearance Maps
M. Maximov, T. Ritschel and M. Fritz
Technical Report, 2018
(arXiv: 1804.00863)
Abstract
We propose a deep representation of appearance, i.e. the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have used deep learning to extract classic appearance representations relating to reflectance model parameters (e.g. Phong) or illumination (e.g. HDR environment maps). We suggest to directly represent appearance itself as a network we call a deep appearance map (DAM). This is a 4D generalization over 2D reflectance maps, which hold the view direction fixed. First, we show how a DAM can be learned from images or video frames and later be used to synthesize appearance, given new surface orientations and viewer positions. Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn). Finally, we generalize this to an appearance estimation-and-segmentation task, where we map from an image showing multiple materials to multiple networks reproducing their appearance, as well as per-pixel segmentation.
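For contrast with the learned representation, the classic Phong model mentioned in the abstract maps surface orientation and view direction to an observed intensity under a point light. A minimal sketch with illustrative parameter values (not taken from the paper):

```python
import numpy as np

# Classic analytic appearance of the kind a DAM replaces: Phong shading as a
# function of surface normal and view direction under one fixed point light.
def phong(normal, view, light, kd=0.6, ks=0.3, shininess=16.0):
    n = normal / np.linalg.norm(normal)
    v = view / np.linalg.norm(view)
    l = light / np.linalg.norm(light)
    r = 2.0 * (n @ l) * n - l                    # mirror reflection of light
    diffuse = kd * max(float(n @ l), 0.0)
    specular = ks * max(float(r @ v), 0.0) ** shininess
    return diffuse + specular

# Light, view and normal aligned: full diffuse plus full specular.
print(round(phong(np.array([0.0, 0.0, 1.0]),
                  np.array([0.0, 0.0, 1.0]),
                  np.array([0.0, 0.0, 1.0])), 3))  # 0.9
```

A DAM replaces this hand-designed function of orientation and view direction with a learned network evaluated over the same 4D domain.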
Image Manipulation against Learned Models: Privacy and Security Implications
S. J. Oh
PhD Thesis, Universität des Saarlandes, 2018
Abstract
Machine learning is transforming the world. Its application areas span privacy-sensitive and security-critical tasks such as human identification and self-driving cars. These applications raise privacy and security related questions that are not fully understood or answered yet: Can automatic person recognisers identify people in photos even when their faces are blurred? How easy is it to find an adversarial input for a self-driving car that makes it drive off the road? This thesis contributes one of the first steps towards a better understanding of such concerns. We observe that many privacy and security critical scenarios for learned models involve input data manipulation: users obfuscate their identity by blurring their faces, and adversaries inject imperceptible perturbations into the input signal. We introduce a data manipulator framework as a tool for collectively describing and analysing privacy and security relevant scenarios involving learned models. A data manipulator introduces a shift in data distribution to achieve privacy or security related goals, and feeds the transformed input to the target model. This framework provides a common perspective on the studies presented in the thesis. We begin the studies from the user’s privacy point of view. We analyse the efficacy of common obfuscation methods like face blurring, and show that they are surprisingly ineffective against state-of-the-art person recognition systems. We then propose alternatives based on head inpainting and adversarial examples. Studying user privacy also leads us to the dual problem: model security. From the model security perspective, a model ought to be robust and reliable against small amounts of data manipulation. In both cases, data are manipulated with the goal of changing the target model’s prediction, so user privacy and model security problems can be described with the same objective. We then study the knowledge aspect of the data manipulation problem.
The more one knows about the target model, the more effective manipulations one can craft. We propose a game theoretic manipulation framework to systematically represent the knowledge level on the target model and derive privacy and security guarantees. We then discuss ways to increase knowledge about a black-box model by only querying it, deriving implications that are relevant to both privacy and security perspectives.
Understanding and Controlling User Linkability in Decentralized Learning
T. Orekondy, S. J. Oh, B. Schiele and M. Fritz
Technical Report, 2018
(arXiv: 1805.05838)
Abstract
Machine learning techniques are widely used by online services (e.g. Google, Apple) in order to analyze and make predictions on user data. As many of the provided services are user-centric (e.g. personal photo collections, speech recognition, personal assistance), user data generated on personal devices is key to providing the service. In order to protect the data and the privacy of the user, federated learning techniques have been proposed, where the data never leaves the user's device and "only" model updates are communicated back to the server. In our work, we propose a new threat model that is not concerned with learning about the content, but rather with the linkability of users during such decentralized learning scenarios. We show that model updates are characteristic for users and therefore lend themselves to linkability attacks. We show identification and matching of users across devices in closed and open world scenarios. In our experiments, we find our attacks to be highly effective, achieving 20x-175x chance-level performance. In order to mitigate the risks of linkability attacks, we study various strategies. As adding random noise does not offer convincing operation points, we propose strategies based on using calibrated domain-specific data; we find that these strategies offer substantial protection against linkability threats with little effect on utility.
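The linkability idea can be illustrated on synthetic data: if each user's model updates share a characteristic direction, sessions from two devices can be matched by similarity alone. This is a toy sketch under invented data, not the paper's attack:

```python
import numpy as np

# Toy linkability attack: treat each user's model update as a feature vector
# and link sessions across two devices by nearest-neighbour cosine matching.
rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

n_users, dim = 5, 64
signature = rng.normal(size=(n_users, dim))  # per-user update direction
device_a = signature + 0.1 * rng.normal(size=signature.shape)  # device A
device_b = signature + 0.1 * rng.normal(size=signature.shape)  # device B

# Link every device-A update to its most similar device-B update.
matches = [max(range(n_users), key=lambda j: cosine(device_a[i], device_b[j]))
           for i in range(n_users)]
print(matches)  # expected to recover the identity matching [0, 1, 2, 3, 4]
```

The mitigation studied in the paper amounts to making updates from different users less separable than the clean signatures assumed here.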
Knockoff Nets: Stealing Functionality of Black-Box Models
T. Orekondy, B. Schiele and M. Fritz
Technical Report, 2018
(arXiv: 1812.02766)
Abstract
Machine Learning (ML) models are increasingly deployed in the wild to perform a wide range of tasks. In this work, we ask to what extent an adversary can steal the functionality of such "victim" models based solely on blackbox interactions: image in, predictions out. In contrast to prior work, we present an adversary lacking knowledge of the train/test data used by the model, its internals, and the semantics over its outputs. We formulate model functionality stealing as a two-step approach: (i) querying a set of input images to the blackbox model to obtain predictions; and (ii) training a "knockoff" with the queried image-prediction pairs. We make multiple remarkable observations: (a) querying random images from a different distribution than that of the blackbox training data results in a well-performing knockoff; (b) this is possible even when the knockoff is represented using a different architecture; and (c) our reinforcement learning approach additionally improves query sample efficiency in certain settings and provides performance gains. We validate model functionality stealing on a range of datasets and tasks, as well as on a popular image analysis API, where we create a reasonable knockoff for as little as $30.
Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation
R. Shetty, B. Schiele and M. Fritz
Technical Report, 2018
(arXiv: 1812.06707)
Abstract
The importance of visual context in scene understanding tasks is well recognized in the computer vision community. However, to what extent computer vision models for image classification and semantic segmentation depend on context to make their predictions is unclear. A model that relies overly on context will fail when encountering objects in context distributions different from the training data, and hence it is important to identify these dependencies before we can deploy the models in the real world.
We propose a method to quantify the sensitivity of black-box vision models to visual context by editing images to remove selected objects and measuring the response of the target models. We apply this methodology on two tasks, image classification and semantic segmentation, and discover undesirable dependencies between objects and context, for example that "sidewalk" segmentation relies heavily on "cars" being present in the image. We propose an object-removal-based data augmentation solution to mitigate this dependency and increase the robustness of classification and segmentation models to contextual variations. Our experiments show that the proposed data augmentation helps these models improve their performance in out-of-context scenarios, while preserving the performance on regular data.
PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis
J. Steil, M. Koelle, W. Heuten, S. Boll and A. Bulling
Technical Report, 2018
(arXiv: 1801.04457)
Abstract
As first-person cameras in head-mounted displays become increasingly prevalent, so does the problem of infringing user and bystander privacy. To address this challenge, we present PrivacEye, a proof-of-concept system that detects privacy-sensitive everyday situations and automatically enables and disables the first-person camera using a mechanical shutter. To close the shutter, PrivacEye detects sensitive situations from first-person camera videos using an end-to-end deep-learning model. To open the shutter without visual input, PrivacEye uses a separate, smaller eye camera to detect changes in users' eye movements to gauge changes in the "privacy level" of the current situation. We evaluate PrivacEye on a dataset of first-person videos recorded in the daily life of 17 participants that they annotated with privacy sensitivity levels.
We discuss the strengths and weaknesses of our proof-of-concept system based on a quantitative technical evaluation as well as qualitative insights from semi-structured interviews.
Disentangling Adversarial Robustness and Generalization
D. Stutz, M. Hein and B. Schiele
Technical Report, 2018
(arXiv: 1812.00740)
Abstract
Obtaining deep networks that are robust against adversarial examples and generalize well is an open problem. A recent hypothesis even states that both robust and accurate models are impossible, i.e., adversarial robustness and generalization are conflicting goals. In an effort to clarify the relationship between robustness and generalization, we assume an underlying, low-dimensional data manifold and show that: 1. regular adversarial examples leave the manifold; 2. adversarial examples constrained to the manifold, i.e., on-manifold adversarial examples, exist; 3. on-manifold adversarial examples are generalization errors, and on-manifold adversarial training boosts generalization; 4. and regular robustness is independent of generalization. These findings imply that both robust and accurate models are possible. However, different models (architectures, training strategies etc.) can exhibit different robustness and generalization characteristics. To confirm our claims, we present extensive experiments on synthetic data (with access to the true manifold) as well as on EMNIST, Fashion-MNIST and CelebA.
Attributing Fake Images to GANs: Analyzing Fingerprints in Generated Images
N. Yu, L. Davis and M. Fritz
Technical Report, 2018
(arXiv: 1811.08180)
Abstract
Research in computer graphics has long been in pursuit of realistic image generation. Recent advances in machine learning with deep generative models have shown increasing success in closing the realism gap by using data-driven and learned components. There is an increasing concern that real and fake images will become more and more difficult to tell apart.
We take a first step towards this larger research challenge by asking whether, and to what extent, a generated fake image can be attributed to a particular Generative Adversarial Network (GAN) of a certain architecture, trained with particular data and random seed. Our analysis shows that single samples from GANs carry highly characteristic fingerprints which make attribution of images to GANs possible. Surprisingly, this is even possible for GANs with the same architecture and the same training data that differ only in the random seed.
Gaze Estimation and Interaction in Real-World Environments
X. Zhang
PhD Thesis, Universität des Saarlandes, 2018
2017
They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers
M. Khamis, L. Bandelow, S. Schick, D. Casadevall, A. Bulling and F. Alt
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging
C. Lander, S. Gehring, M. Löchtefeld, A. Bulling and A. Krüger
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
Long-Term On-Board Prediction of Pedestrians in Traffic Scenes
A. Bhattacharyya, M. Fritz and B. Schiele
1st Conference on Robot Learning (CoRL 2017), 2017
Gradient-free Policy Architecture Search and Adaptation
S. Ebrahimi, A. Rohrbach and T. Darrell
1st Conference on Robot Learning (CoRL 2017), 2017
STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
ArtTrack: Articulated Multi-Person Tracking in the Wild
E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Gaze Embeddings for Zero-Shot Image Classification
N. Karessli, Z. Akata, B. Schiele and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Video Object Segmentation from Static Images
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele and A. Sorkine-Hornung
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Simple Does It: Weakly Supervised Instance and Semantic Segmentation
A. Khoreva, R. Benenson, J. Hosang, M. Hein and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
InstanceCut: from Edges to Instances with MultiCut
A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy and C. Rother
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Exploiting Saliency for Object Segmentation from Image Level Labels
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Generating Descriptions with Grounded and Co-Referenced People
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Domain Based Approach to Social Relation Recognition
Q. Sun, B. Schiele and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Message Passing Algorithm for the Minimum Cost Multicut Problem
P. Swoboda and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Multiple People Tracking by Lifted Multicut and Person Re-identification
S. Tang, M. Andriluka, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Zero-shot learning - The Good, the Bad and the Ugly
Y. Xian, B. Schiele and Z. Akata
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
CityPersons: A Diverse Dataset for Pedestrian Detection
S. Zhang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Abstract
Convnets have enabled significant progress in pedestrian detection recently, but there are still open questions regarding suitable architectures and training data. We revisit CNN design and point out key adaptations, enabling plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset. To achieve further improvement from more and better data, we introduce CityPersons, a new set of person annotations on top of the Cityscapes dataset. The diversity of CityPersons allows us for the first time to train one single CNN model that generalizes well over multiple benchmarks. Moreover, with additional training on CityPersons, we obtain top results using FasterRCNN on Caltech, improving especially for the more difficult cases (heavy occlusion and small scale) and providing higher localization quality.
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), 2017
Visual Stability Prediction and Its Application to Manipulation
W. Li, A. Leonardis and M. Fritz
AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, 2017
Pose Guided Person Image Generation
L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars and L. Van Gool
Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017
Everyday Eye Tracking for Real-World Consumer Behavior Analysis
A. Bulling and M. Wedel
A Handbook of Process Tracing Methods for Decision Research, 2017
(Accepted/in press)
ScreenGlint: Practical, In-situ Gaze Estimation on Smartphones
M. X. Huang, J. Li, G. Ngai and H. V. Leong
CHI’17, 35th Annual ACM Conference on Human Factors in Computing Systems, 2017
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
M. Klauck, Y. Sugano and A. Bulling
CHI 2017 Extended Abstracts, 2017
Lucid Data Dreaming for Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
DAVIS Challenge on Video Object Segmentation 2017, 2017
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
M. Khamis, M. Hassib, E. von Zezschwitz, A. Bulling and F. Alt
ICMI’17, 19th ACM International Conference on Multimodal Interaction, 2017
What Is Around The Camera?
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Adversarial Image Perturbation for Privacy Protection -- A Game Theory Perspective
S. J. Oh, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images
T. Orekondy, B. Schiele and M. Fritz
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Efficient Algorithms for Moral Lineage Tracing
M. Rempfler, J.-H. Lange, F. Jug, C. Blasse, E. W. Myers, B. H. Menze and B. Andres
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Paying Attention to Descriptions Generated by Image Captioning Models
H. R. Tavakoli, R. Shetty, A. Borji and J. Laaksonen
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling
H. Sattar, A. Bulling and M. Fritz
2017 IEEE International Conference on Computer Vision Workshops (MBCC @ICCV 2017), 2017
Abstract
Previous work focused on predicting visual search targets from human fixations but, in the real world, a specific target is often not known, e.g. when searching for a present for a friend. In this work we instead study the problem of predicting the mental picture, i.e. only an abstract idea instead of a specific target. This task is significantly more challenging, given that mental pictures of the same target category can vary widely depending on personal biases, and given that characteristic target attributes can often not be verbalised explicitly. We instead propose to use gaze information as implicit information on users' mental picture and present a novel gaze pooling layer to seamlessly integrate semantic and localized fixation information into a deep image representation. We show that we can robustly predict both the mental picture's category as well as attributes on a novel dataset containing fixation data of 14 users searching for targets on a subset of the DeepFashion dataset. Our results have important implications for future search interfaces and suggest deep gaze pooling as a general-purpose approach for gaze-supported computer vision systems.
Visual Stability Prediction for Robotic Manipulation
W. Li, A. Leonardis and M. Fritz
IEEE International Conference on Robotics and Automation (ICRA 2017), 2017
MARCOnI -- ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes
A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 3, 2017
Novel Views of Objects from a Single Image
K. Rematas, C. Nguyen, T. Ritschel, M. Fritz and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 8, 2017
Expanded Parts Model for Semantic Description of Humans in Still Images
G. Sharma, F. Jurie and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 1, 2017
A Compact Representation of Human Actions by Sliding Coordinate Coding
R. Ding, Q. Sun, M. Liu and H. Liu
International Journal of Advanced Robotic Systems, Volume 14, Number 6, 2017
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
M. Malinowski, M. Rohrbach and M. Fritz
International Journal of Computer Vision, Volume 125, Number 1-3, 2017
Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, Volume 123, Number 1, 2017
Abstract
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full-length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies.
First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015.
Cell Lineage Tracing in Lens-Free Microscopy Videos
M. Rempfler, S. Kumar, V. Stierle, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Computing and Computer Assisted Intervention -- MICCAI 2017, 2017
Building Statistical Shape Spaces for 3D Human Modeling
L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt and B. Schiele
Pattern Recognition, Volume 67, 2017
Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes
Q. Sun, H. Liu and T. Harada
Pattern Recognition, Volume 64, 2017
Learning Dilation Factors for Semantic Segmentation of Street Scenes
Y. He, M. Keuper, B. Schiele and M. Fritz
Pattern Recognition (GCPR 2017), 2017
A Comparative Study of Local Search Algorithms for Correlation Clustering
E. Levinkov, A. Kirillov and B. Andres
Pattern Recognition (GCPR 2017), 2017
Look Together: Using Gaze for Assisting Co-located Collaborative Search
Y. Zhang, K. Pfeuffer, M. K. Chong, J. Alexander, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 21, Number 1, 2017
GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch Passwords and Personal Mobile Devices
M. Khamis, R. Hasholzner, A. Bulling and F. Alt
Pervasive Displays 2017 (PerDis 2017), 2017
Analysis and Optimization of Graph Decompositions by Lifted Multicuts
A. Horňáková, J.-H. Lange and B. Andres
Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017
EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays
M. Khamis, D. Buschek, T. Thieron, F. Alt and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 4, 2017
InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation
M. Tonsen, J. Steil, Y. Sugano and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 3, 2017
Efficiently Summarising Event Sequences with Rich Interleaving Patterns
A. Bhattacharyya and J. Vreeken
Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), 2017
Are you stressed? Your eyes and the mouse can tell
J. Wang, M. X. Huang, G. Ngai and H. V. Leong
Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), 2017
EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays
M. Khamis, A. Hoesl, A. Klimczak, M. Reiss, F. Alt and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery
X. Zhang, Y. Sugano and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Analysis and Improvement of the Visual Object Detection Pipeline
J. Hosang
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects.
In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. To address these problems, we present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. 
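The non-maximum suppression step discussed above is, in its common form, a greedy heuristic. A minimal sketch of that heuristic follows; the corner-coordinate box format and the 0.5 overlap threshold are illustrative assumptions, not the thesis's exact setup:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    """Repeatedly keep the highest-scoring box and drop all boxes
    overlapping it by more than `thresh`; return kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

The learnable alternative proposed in the thesis replaces exactly this hand-crafted post-processing step.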
In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
Learning to Segment in Images and Videos with Different Forms of Supervision
A. Khoreva
PhD Thesis, Universität des Saarlandes, 2017
Abstract: Much progress has been made in image and video segmentation over the last years. To a large extent, the success can be attributed to strong appearance models learned entirely from data, in particular using deep learning methods. However, to perform best, these methods require large representative datasets for training with expensive pixel-level annotations, which in the case of videos are prohibitive to obtain. Therefore, there is a need to relax this constraint and to consider alternative forms of supervision, which are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision. First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate pixel-level approximate ground truth from these weaker forms of annotation to train a network, which allows us to achieve high-quality results comparable to full supervision quality without any modifications of the network architecture or the training procedure. Second, we address the problem of the excessive computational and memory costs inherent to solving video segmentation via graphs. We propose approaches to improve the runtime and memory efficiency as well as the output segmentation quality by learning from the available training data the best representation of the graph.
In particular, we contribute with learning must-link constraints, the topology and edge weights of the graph, as well as enhancing the graph nodes -- superpixels -- themselves. Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data for training convolutional networks. We introduce an architecture which allows training with static images only, and propose an elaborate data synthesis scheme which creates a large number of training examples close to the target domain from the given first-frame mask. With the proposed techniques we show that densely annotated consecutive video data is not necessary to achieve high-quality temporally coherent video segmentation results. In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking, and contributes new ways of training convolutional networks with a limited amount of pixel-level annotated training data.
Lucid Data Dreaming for Multiple Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
Technical Report, 2017
(arXiv: 1703.09554)
Abstract: Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k ~ 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x ~ 100x less annotated data than competing methods. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high-quality appearance- and motion-based models, as well as tune the post-processing stage.
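The "lucid dreaming" idea described above, synthesizing in-domain training pairs from a single annotated frame, can be caricatured in a few lines. The translation-only object motion and crude mean-color background fill below are stand-ins for the paper's much richer synthesis, assumed here purely for illustration:

```python
import numpy as np

def lucid_dream(frame, mask, n=4, max_shift=8, seed=0):
    """Toy synthesis of (image, mask) training pairs from one annotated
    frame: fill the object region with the mean background color, then
    paste the masked object back at randomly shifted positions."""
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    bg = frame.copy()
    bg[mask > 0] = frame[mask == 0].mean(axis=0)  # crude background fill
    samples = []
    for _ in range(n):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        img, new_mask = bg.copy(), np.zeros_like(mask)
        ys, xs = np.nonzero(mask)
        ys2 = np.clip(ys + dy, 0, h - 1)
        xs2 = np.clip(xs + dx, 0, w - 1)
        img[ys2, xs2] = frame[ys, xs]  # paste shifted object pixels
        new_mask[ys2, xs2] = 1
        samples.append((img, new_mask))
    return samples
```

Each synthesized pair stays close to the target video's domain, which is the point of the strategy: a small in-domain training set rather than a large generic one.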
This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and how much general "objectness" knowledge are required for the object tracking task.
Decomposition of Trees and Paths via Correlation
J.-H. Lange and B. Andres
Technical Report, 2017
(arXiv: 1706.06822v2)
Abstract: We study the problem of decomposing (clustering) a tree with respect to costs attributed to pairs of nodes, so as to minimize the sum of costs for those pairs of nodes that are in the same component (cluster). For the general case and for the special case of the tree being a star, we show that the problem is NP-hard. For the special case of the tree being a path, this problem is known to be polynomial-time solvable. We characterize several classes of facets of the combinatorial polytope associated with a formulation of this clustering problem in terms of lifted multicuts. In particular, our results yield a complete totally dual integral (TDI) description of the lifted multicut polytope for paths, which establishes a connection to the combinatorial properties of alternative formulations such as set partitioning.
Image Classification with Limited Training Data and Class Ambiguity
M. Lapin
PhD Thesis, Universität des Saarlandes, 2017
Abstract: Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or the high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm.
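The path case of the decomposition problem from the Lange and Andres report above is small enough to enumerate directly. A toy sketch of the stated objective (the node count and pair costs are made-up values): a decomposition of a path corresponds to a choice of edges to cut, and its cost is the sum of costs over node pairs that end up in the same component:

```python
from itertools import product

def decomposition_cost(labels, costs):
    """Sum the cost of every node pair lying in the same component."""
    return sum(c for (u, v), c in costs.items() if labels[u] == labels[v])

def path_decompositions(n):
    """Every decomposition of a path on n nodes = a choice of which of the
    n-1 edges to cut; components are the resulting contiguous segments."""
    for cuts in product((0, 1), repeat=n - 1):
        labels = [0]
        for cut in cuts:
            labels.append(labels[-1] + cut)
        yield labels

def min_cost_path_decomposition(n, costs):
    """Brute force over all 2^(n-1) decompositions (toy sizes only)."""
    return min(path_decompositions(n), key=lambda l: decomposition_cost(l, costs))
```

For paths this brute force is exponential but the problem is polynomial-time solvable, as the abstract notes; it is the star and general-tree cases that are NP-hard.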
In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high-dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity, where a clear distinction between the classes is no longer possible. Many real-world images are naturally multilabel, yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in the top-k predictions of a learner. Our results indicate consistent improvements over the standard loss functions, which put more penalty on the first incorrect prediction than the proposed losses do. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning
W. Li, J. Bohg and M. Fritz
Technical Report, 2017
(arXiv: 1711.00267)
Abstract: Understanding physical phenomena is a key component of human intelligence and enables physical interaction with previously unseen environments. In this paper, we study how an artificial agent can autonomously acquire this intuition through interaction with the environment. We created a synthetic block stacking environment with physics simulation in which the agent can learn a policy end-to-end through trial and error. Thereby, we bypass the need to explicitly model physical knowledge within the policy. We are specifically interested in tasks that require the agent to reach a given goal state that may be different for every new trial.
To this end, we propose a deep reinforcement learning framework that learns policies which are parametrized by a goal. We validated the model on a toy example navigating in a grid world with different target positions and in a block stacking task with different target structures for the final tower. In contrast to prior work, our policies show better generalization across different goals.
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Image
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Abstract: Computer Vision has undergone major changes over the recent five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases, and has become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in a word's meaning, and various interpretations of the scene and the question.
Person Recognition in Social Media Photos
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Technical Report, 2017
(arXiv: 1710.03224)
Abstract: People nowadays share large parts of their personal lives through social media. Being able to automatically recognise people in personal photos may greatly enhance user convenience by easing photo album organisation. For the human identification task, however, the traditional focus of computer vision has been face recognition and pedestrian re-identification. Person recognition in social media photos sets new challenges for computer vision, including non-cooperative subjects (e.g. backward viewpoints, unusual poses) and great changes in appearance. To tackle this problem, we build a simple person recognition framework that leverages convnet features from multiple image regions (head, body, etc.). We propose new recognition scenarios that focus on the time and appearance gap between training and testing samples. We present an in-depth analysis of the importance of different features according to time and viewpoint generalisability. In the process, we verify that our simple approach achieves the state-of-the-art result on the PIPA benchmark, arguably the largest social-media-based benchmark for person recognition to date, with diverse poses, viewpoints, social groups, and events. Compared to the conference version of the paper, this paper additionally presents (1) analysis of a face recogniser (DeepID2+), (2) a new method, naeil2, that combines the conference version method naeil and DeepID2+ to achieve state-of-the-art results even compared to post-conference works, (3) discussion of related work since the conference version, (4) additional analysis including the head viewpoint-wise breakdown of performance, and (5) results on the open-world setup.
Whitening Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1711.01768)
Abstract: Many deployed learned models are black boxes: given input, they return output.
Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly, as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks -- we show that the revealed internal information helps generate more effective adversarial examples against the black-box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. Our paper suggests that it is actually hard to draw a line between white-box and black-box models.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2017
(arXiv: 1711.07373)
Abstract: Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks for generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously.
We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that supports their decision.
Generation and Grounding of Natural Language Descriptions for Visual Data
A. Rohrbach
PhD Thesis, Universität des Saarlandes, 2017
Abstract: Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand video of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at a variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach which learns from videos and sentences to describe movie clips, relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state of the art in automatic video description and visual grounding, and also contributes large datasets for studying the intersection of computer vision and computational linguistics.
Visual Decoding of Targets During Visual Search From Human Eye Fixations
H. Sattar, M. Fritz and A. Bulling
Technical Report, 2017
(arXiv: 1706.05993)
Abstract: What does human gaze reveal about a user's intents, and to what extent can these intents be inferred or even visualized? Gaze was proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user's mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear if gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded from human gaze fixations alone. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model's predictions using two human studies. Our results show that 62% of the time (chance level: 10%) users were able to correctly select the category of the decoded image. In our second study, we show the importance of a local gaze encoding for decoding the visual search targets of users.
People detection and tracking in crowded scenes
S. Tang
PhD Thesis, Universität des Saarlandes, 2017
Abstract: People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data.
Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect a single person as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi-person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task at different granularities. We present a tracking model that simultaneously clusters object bounding boxes and pixel-level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model to the multi-person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundaries of the state of the art, and results in superior performance.
2016
Multi-Cue Zero-Shot Learning with Strong Supervision
Z. Akata, M. Malinowski, M. Fritz and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
CP-mtML: Coupled Projection Multi-task Metric Learning for Large Scale Face Retrieval
B. Bhattarai, G. Sharma and F. Jurie
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
The Cityscapes Dataset for Semantic Urban Scene Understanding
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Moral Lineage Tracing
F. Jug, E. Levinkov, C. Blasse, E. W. Myers and B. Andres
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Weakly Supervised Object Boundaries
A. Khoreva, R. Benenson, M. Omran, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract: State-of-the-art learning-based boundary detection methods require extensive training data. Since labelling object boundaries is one of the most expensive types of annotations, there is a need to relax the requirement to carefully annotate images, both to make the training more affordable and to extend the amount of training data. In this paper we propose a technique to generate weakly supervised annotations and show that bounding box annotations alone suffice to reach high-quality object boundaries without using any object-specific boundary annotations. With the proposed weak supervision techniques we achieve top performance on the object boundary detection task, outperforming the current fully supervised state-of-the-art methods by a large margin.
Loss Functions for Top-k Error: Analysis and Insights
M. Lapin, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Learning Deep Representations of Fine-Grained Visual Descriptions
S. Reed, Z. Akata, H. Lee and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Deep Reflectance Maps
K. Rematas, T. Ritschel, M. Fritz, E. Gavves and T. Tuytelaars
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract: Undoing the image formation process and therefore decomposing appearance into its intrinsic properties is a challenging task due to the under-constrained nature of this inverse problem. While significant progress has been made on inferring shape, materials and illumination from images alone, progress in an unconstrained setting is still limited. We propose a convolutional neural architecture to estimate reflectance maps of specular materials in natural lighting conditions. We achieve this in an end-to-end learning formulation that directly predicts a reflectance map from the image itself. We show how to improve estimates by facilitating additional supervision in an indirect scheme that first predicts surface orientation and afterwards predicts the reflectance map by learning-based sparse data interpolation. In order to analyze performance on this difficult task, we propose a new challenge of Specular MAterials on SHapes with complex IllumiNation (SMASHINg) using both synthetic and real images. Furthermore, we show the application of our method to a range of image-based editing tasks on real images.
Convexity Shape Constraints for Image Segmentation
L. A. Royer, D. L. Richmond, B. Andres and D. Kainmueller
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
LOMo: Latent Ordinal Model for Facial Analysis in Videos
K. Sikka, G. Sharma and M. Bartlett
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
End-to-end People Detection in Crowded Scenes
R. Stewart and M. Andriluka
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Latent Embeddings for Zero-shot Classification
Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
How Far are We from Solving Pedestrian Detection?
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2016), Volume 35, Number 6, 2016
Learning What and Where to Draw
S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele and H. Lee
Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016
SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull
S. Schneegass, Y. Oualil and A. Bulling
CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, 2016
Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces
P. Xu, Y. Sugano and A. Bulling
CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, 2016
GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices
M. Khamis, F. Alt, M. Hassib, E. von Zezschwitz, R. Hasholzner and A. Bulling
CHI 2016 Extended Abstracts, 2016
On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input
D. Kirst and A. Bulling
CHI 2016 Extended Abstracts, 2016
Abstract: Rotations performed with the index finger and thumb involve some of the most complex motor actions among common multi-touch gestures, yet little is known about the factors affecting performance and ergonomics.
This note presents results from a study where the angle, direction, diameter, and position of rotations were systematically manipulated. Subjects were asked to perform the rotations as quickly as possible without losing contact with the display, and were allowed to skip rotations that were too uncomfortable. The data show surprising interaction effects among the variables, and help us identify whole categories of rotations that are slow and cumbersome for users.
Pervasive Attentive User Interfaces
A. Bulling
Computer, Volume 49, Number 1, 2016
Towards Segmenting Consumer Stereo Videos: Benchmark, Baselines and Ensembles
W.-C. Chiu, F. Galasso and M. Fritz
Computer Vision -- ACCV 2016, 2016
Local Higher-order Statistics (LHS) Describing Images with Statistics of Local Non-binarized Pixel Patterns
G. Sharma and F. Jurie
Computer Vision and Image Understanding, Volume 142, 2016
An Efficient Fusion Move Algorithm for the Minimum Cost Lifted Multicut Problem
T. Beier, B. Andres, U. Köthe and F. A. Hamprecht
Computer Vision -- ECCV 2016, 2016
Generating Visual Explanations
L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele and T. Darrell
Computer Vision -- ECCV 2016, 2016
Abstract: Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. We propose a novel loss function based on sampling and reinforcement learning that learns to generate sentences that realize a global sentence property, such as class specificity.
Our results on a fine-grained bird species classification dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.
DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka and B. Schiele
Computer Vision -- ECCV 2016, 2016
Abstract: The goal of this paper is to advance the state of the art of articulated pose estimation in scenes with multiple people. To that end we contribute on three fronts. We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow us to assemble the proposals into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently, thus leading both to better performance and significant speed-up factors. We evaluate our approach on two single-person and two multi-person pose estimation benchmarks. The proposed approach significantly outperforms the best known multi-person pose estimation results while demonstrating competitive performance on the task of single-person pose estimation. Models and code are available at http://pose.mpi-inf.mpg.de
Faceless Person Recognition: Privacy Implications in Social Media
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Computer Vision -- ECCV 2016, 2016
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele
Computer Vision -- ECCV 2016, 2016
A 3D Morphable Eye Region Model for Gaze Estimation
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson and A. Bulling
Computer Vision -- ECCV 2016, 2016
VConv-DAE: Deep Volumetric Shape Learning Without Object Labels
A. Sharma, O. Grau and M. Fritz
Computer Vision -- ECCV 2016 Workshops, 2016
Abstract: With the advent of affordable depth sensors, 3D capture becomes more and more ubiquitous and has already made its way into commercial products. Yet, capturing the geometry or complete shapes of everyday objects using scanning devices (e.g. Kinect) still comes with several challenges that result in noise or even incomplete shapes. Recent success in deep learning has shown how to learn complex shape distributions in a data-driven way from large-scale 3D CAD model collections and to utilize them for 3D processing on volumetric representations, thereby circumventing problems of topology and tessellation. Prior work has shown encouraging results on problems ranging from shape completion to recognition. We provide an analysis of such approaches and discover that training as well as the resulting representation are strongly and unnecessarily tied to the notion of object labels. Furthermore, deep learning research (Vincent et al., 2008) argues that learning representations with over-complete models is more prone to overfitting than learning from noisy data. Thus, we investigate a fully convolutional volumetric denoising autoencoder that is trained in an unsupervised fashion. It outperforms prior work on recognition as well as on more challenging tasks like denoising and shape completion. In addition, our approach is at least two orders of magnitude faster at test time and thus provides a path to scaling up 3D deep learning.
Multi-Person Tracking by Multicut and Deep Matching
S. Tang, B. Andres, M. Andriluka and B. Schiele
Computer Vision -- ECCV 2016 Workshops, 2016
Improved Image Boundaries for Better Video Segmentation
A. Khoreva, R. Benenson, F. Galasso, M. Hein and B. Schiele
Computer Vision -- ECCV 2016 Workshops, 2016
Abstract: Graph-based video segmentation methods rely on superpixels as a starting point.
While most previous work has focused on the construction of the graph edges and weights, as well as on solving the graph partitioning problem, this paper focuses on better superpixels for video segmentation. We demonstrate by a comparative analysis that superpixels extracted from boundaries perform best, and show that boundary estimation can be significantly improved via image and time domain cues. With superpixels generated from our better boundaries we observe consistent improvements for two video segmentation methods on two different datasets.
Eyewear Computing -- Augmenting the Human with Head-mounted Wearable Assistants
A. Bulling, O. Cakmakci, K. Kunze and J. M. Rehg (Eds.)
Schloss Dagstuhl, 2016
Attention, please!: Comparing Features for Measuring Audience Attention Towards Pervasive Displays
F. Alt, A. Bulling, L. Mecke and D. Buschek
DIS 2016, 11th ACM SIGCHI Designing Interactive Systems Conference, 2016
Sensing and Controlling Human Gaze in Daily Living Space for Human-Harmonized Information Environments
Y. Sato, Y. Sugano, A. Sugimoto, Y. Kuno and H. Koike
Human-Harmonized Information Technology, 2016
Smooth Eye Movement Interaction Using EOG Glasses
M. Dhuliawala, J. Lee, J. Shimizu, A. Bulling, K. Kunze, T. Starner and W. Woo
ICMI’16, 18th ACM International Conference on Multimodal Interaction, 2016
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
S. Nag Chowdhury, M. Malinowski, A. Bulling and M. Fritz
ICMR’16, ACM International Conference on Multimedia Retrieval, 2016
DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation
M. Malinowski, M. Rohrbach and M. Fritz
IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016), 2016
(Accepted/in press)
Abstract
We address an open-ended question answering task about real-world images. With the help of currently available methods developed in Computer Vision and Natural Language Processing, we would like to push an architecture with a global visual representation to its limits. In our contribution, we show how to achieve competitive performance on VQA with global visual features (Residual Net) together with a carefully designed architecture.
A Joint Learning Approach for Cross Domain Age Estimation
B. Bhattarai, G. Sharma, A. Lechervy and F. Jurie
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2016), 2016
Learning to Detect Visual Grasp Affordance
H. Oh Song, M. Fritz, D. Goehring and T. Darrell
IEEE Transactions on Automation Science and Engineering, Volume 13, Number 2, 2016
Label-Embedding for Image Classification
Z. Akata, F. Perronnin, Z. Harchaoui and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 7, 2016
3D Pictorial Structures Revisited: Multiple Human Pose Estimation
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab and S. Ilic
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 10, 2016
Leveraging the Wisdom of the Crowd for Fine-Grained Recognition
J. Deng, J. Krause, M. Stark and L. Fei-Fei
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 4, 2016
What Makes for Effective Detection Proposals?
J. Hosang, R. Benenson, P. Dollár and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 4, 2016
Reconstructing Curvilinear Networks using Path Classifiers and Integer Programming
E. Türetken, F. Benmansour, B. Andres, P. Głowacki and H. Pfister
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 12, 2016
Combining Eye Tracking with Optimizations for Lens Astigmatism in modern wide-angle HMDs
D. Pohl, X. Zhang and A. Bulling
2016 IEEE Virtual Reality Conference (VR), 2016
Recognition of Ongoing Complex Activities by Sequence Prediction Over a Hierarchical Label Space
W. Li and M. Fritz
2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), 2016
Eyewear Computers for Human-Computer Interaction
A. Bulling and K. Kunze
Interactions, Volume 23, Number 3, 2016
Demo hour
H. Jeong, D. Saakes, U. Lee, A. Esteves, E. Velloso, A. Bulling, K. Masai, Y. Sugiura, M. Ogata, K. Kunze, M. Inami, M. Sugimoto, A. Rathnayake and T. Dias
Interactions, Volume 23, Number 1, 2016
Recognizing Fine-grained and Composite Activities Using Hand-centric Features and Script Data
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
International Journal of Computer Vision, Volume 119, Number 3, 2016
Pattern Recognition
B. Rosenhahn and B. Andres (Eds.)
Springer, 2016
Pupil Detection for Head-mounted Eye Tracking in the Wild: An Evaluation of the State of the Art
W. Fuhl, M. Tonsen, A. Bulling and E. Kasneci
Machine Vision and Applications, Volume 27, Number 8, 2016
The Minimum Cost Connected Subgraph Problem in Medical Image Analysis
M. Rempfler, B. Andres and B. H. Menze
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2016, 2016
I-Pic: A Platform for Privacy-Compliant Image Capture
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee and T. T. Wu
MobiSys’16, 4th Annual International Conference on Mobile Systems, Applications, and Services, 2016
Demo: I-Pic: A Platform for Privacy-Compliant Image Capture
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee and T. T. Wu
MobiSys’16, 4th Annual International Conference on Mobile Systems, Applications, and Services, 2016
Long Term Boundary Extrapolation for Deterministic Motion
A. Bhattacharyya, M. Malinowski and M. Fritz
NIPS Workshop on Intuitive Physics, 2016
A Convnet for Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
Pattern Recognition (GCPR 2016), 2016
Abstract
Non-maximum suppression (NMS) is used in virtually all state-of-the-art object detection pipelines. While essential object detection ingredients such as features, classifiers, and proposal methods have been extensively researched, surprisingly little work has aimed to systematically address NMS. The de facto standard for NMS is based on greedy clustering with a fixed distance threshold, which forces a trade-off between recall and precision. We propose a convnet designed to perform NMS on a given set of detections. We report experiments on a synthetic setup and results on crowded pedestrian detection scenes. Our approach overcomes the intrinsic limitations of greedy NMS, obtaining better recall and precision.
Learning to Select Long-Track Features for Structure-From-Motion and Visual SLAM
J. Scheer, M. Fritz and O. Grau
Pattern Recognition (GCPR 2016), 2016
Convexification of Learning from Constraints
I. Shcherbatyi and B. Andres
Pattern Recognition (GCPR 2016), 2016
Special Issue Introduction
D. J. Cook, A. Bulling and Z. Yu
Pervasive and Mobile Computing (Proc. PerCom 2015), Volume 26, 2016
Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces
M. Barz, F. Daiber and A. Bulling
Proceedings ETRA 2016, 2016
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
M. Mansouryar, J. Steil, Y. Sugano and A. Bulling
Proceedings ETRA 2016, 2016
Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions
L. Sesma-Sanchez, Y. Zhang, H. Gellersen and A. Bulling
Proceedings ETRA 2016, 2016
Labelled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments
M. Tonsen, X. Zhang, Y. Sugano and A. Bulling
Proceedings ETRA 2016, 2016
Learning an Appearance-based Gaze Estimator from One Million Synthesised Images
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson and A. Bulling
Proceedings ETRA 2016, 2016
Long-term Memorability of Cued-Recall Graphical Passwords with Saliency Masks
F. Alt, M. Mikusz, S. Schneegass and A. Bulling
Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), 2016
EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?
M. Khamis, L. Trotter, M. Tessman, C. Dannhart, A. Bulling and F. Alt
Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), 2016
Generative Adversarial Text to Image Synthesis
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele and H. Lee
Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
A. Mokarian Forooshani, M. Malinowski and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2016), 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell and M. Rohrbach
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016
Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze
A. L. Simeone, A. Bulling, J. Alexander and H. Gellersen
Proceedings of the 2016 International Working Conference on Advanced Visual Interfaces (AVI 2016), 2016
Commonsense in Parts: Mining Part-Whole Relations from the Web and Image Tags
N. Tandon, C. D. Hariman, J. Urbani, A. Rohrbach, M. Rohrbach and G. Weikum
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User’s Current Visual Field
D. Pohl, X. Zhang, A. Bulling and O. Grau
Proceedings VRST 2016, 2016
Visual Object Class Recognition
M. Stark, B. Schiele and A. Leonardis
Springer Handbook of Robotics, 2016
Interactive Multicut Video Segmentation
E. Levinkov, J. Tompkin, N. Bonneel, S. Kirchhoff, B. Andres and H. Pfister
The 24th Pacific Conference on Computer Graphics and Applications Short Papers Proceedings (Pacific Graphics 2016), 2016
TextPursuits: Using Text for Pursuits-based Interaction and Calibration on Public Displays
M. Khamis, O. Saltuk, A. Hang, K. Stolz, A. Bulling and F. Alt
UbiComp’16, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2016
EyeWear 2016: First Workshop on EyeWear Computing
A. Bulling, O. Cakmakci, K. Kunze and J. M. Rehg
UbiComp’16 Adjunct, 2016
Challenges and Design Space of Gaze-enabled Public Displays
M. Khamis, F. Alt and A. Bulling
UbiComp’16 Adjunct, 2016
Solar System: Smooth Pursuit Interactions Using EOG Glasses
J. Shimizu, J. Lee, M. Dhuliawala, A. Bulling, T. Starner, W. Woo and K. Kunze
UbiComp’16 Adjunct, 2016
AggreGaze: Collective Estimation of Audience Attention on Public Displays
Y. Sugano, X. Zhang and A. Bulling
UIST 2016, 29th Annual Symposium on User Interface Software and Technology, 2016
Spatio-Temporal Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski and M. Fritz
Technical Report, 2016 (arXiv: 1605.07363)
Abstract
Boundary prediction in images as well as video has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future, unobserved frames. This requires our model to learn about the fate of boundaries and to extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show for the first time spatio-temporal boundary extrapolation in this challenging scenario. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We successfully predict boundaries in a billiard scenario without any assumption of a strong parametric model or any object notion. We argue that our model has, with minimal model assumptions, derived a notion of "intuitive physics" that can be applied to novel scenes.
Bayesian Non-Parametrics for Multi-Modal Segmentation
W.-C. Chiu
PhD Thesis, Universität des Saarlandes, 2016
DeLight-Net: Decomposing Reflectance Maps into Specular Materials and Natural Illumination
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, L. Van Gool and T. Tuytelaars
Technical Report, 2016 (arXiv: 1603.08240)
Abstract
In this paper we extract surface reflectance and natural environmental illumination from a reflectance map, i.e. from a single 2D image of a sphere of one material under one illumination. This is a notoriously difficult problem, yet key to various re-rendering applications. With the recent advances in estimating reflectance maps from 2D images, their further decomposition has become increasingly relevant. To this end, we propose a Convolutional Neural Network (CNN) architecture to reconstruct both material parameters (i.e. Phong) as well as illumination (i.e. high-resolution spherical illumination maps), trained solely on synthetic data. We demonstrate the decomposition of both synthetic and real photographs of reflectance maps, in High Dynamic Range (HDR) and, for the first time, in Low Dynamic Range (LDR) as well. Results are compared to previous approaches quantitatively as well as qualitatively, in terms of re-renderings where illumination, material, view, or shape are changed.
Natural Illumination from Multiple Materials Using Deep Learning
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
Technical Report, 2016 (arXiv: 1611.09325)
Abstract
Recovering natural illumination from a single Low Dynamic Range (LDR) image is a challenging task. To remedy this situation we exploit two properties often found in everyday images. First, images rarely show a single material, but rather multiple ones that all reflect the same illumination. However, the appearance of each material is observed only for some surface orientations, not all. Second, parts of the illumination are often directly observed in the background, without being affected by reflection. Typically, this directly observed part of the illumination is even smaller. We propose a deep Convolutional Neural Network (CNN) that combines prior knowledge about the statistics of illumination and reflectance with an input that makes explicit use of these two observations. Our approach maps multiple partial LDR material observations, represented as reflectance maps, and a background image to a spherical High Dynamic Range (HDR) illumination map. For training and testing we propose a new dataset comprising synthetic and real images with multiple materials observed under the same illumination. Qualitative and quantitative evidence shows that both using multiple materials and using a background are essential to improving illumination estimates.
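The two reports above operate on reflectance maps, i.e. images of a sphere of a single material under one illumination, indexed by surface normal. As a purely illustrative aid (not the authors' networks or data), the sketch below renders such a toy reflectance map forward from Blinn-Phong parameters under one directional light; the function name and all parameter values are assumptions.

```python
import numpy as np

def phong_reflectance_map(light_dir, kd, ks, shininess, res=64):
    """Toy reflectance map: intensity of a sphere of one Blinn-Phong
    material under a single directional light, indexed by surface normal."""
    l = np.asarray(light_dir, float)
    l = l / np.linalg.norm(l)
    v = np.array([0.0, 0.0, 1.0])                     # viewer along +z
    u = np.linspace(-1.0, 1.0, res)
    x, y = np.meshgrid(u, u)
    mask = x**2 + y**2 <= 1.0                         # visible hemisphere
    z = np.sqrt(np.clip(1.0 - x**2 - y**2, 0.0, None))
    n = np.stack([x, y, z], axis=-1)                  # unit normals
    diffuse = np.clip(n @ l, 0.0, None)               # Lambertian term
    h = (l + v) / np.linalg.norm(l + v)               # Blinn-Phong half vector
    specular = np.clip(n @ h, 0.0, None) ** shininess
    return np.where(mask, kd * diffuse + ks * specular, 0.0)
```

The decomposition problem studied in the papers is the inverse of this forward model, with a full spherical illumination map instead of a single light.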
RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
Technical Report, 2016 (arXiv: 1604.02388)
Abstract
Beyond their success in classification, neural networks have recently shown strong results on pixel-wise prediction tasks like semantic image segmentation on RGBD data. However, the commonly used deconvolutional layers for upsampling intermediate representations to the full-resolution output still show different failure modes, like imprecise segmentation boundaries and label mistakes, in particular on large, weakly textured objects (e.g. fridge, whiteboard, door). We attribute these errors in part to the rigid way in which current networks aggregate information, which can be either too local (missing context) or too global (inaccurate boundaries). We therefore propose a data-driven pooling layer that integrates with fully convolutional architectures and utilizes boundary detection from RGBD image segmentation approaches. We extend our approach to leverage region-level correspondences across images with an additional temporal pooling stage. We evaluate our approach on the NYU-Depth-V2 dataset, comprised of indoor RGBD video sequences, and compare it to various state-of-the-art baselines. Besides a general improvement over the state of the art, our approach shows particularly good results in terms of the accuracy of the predicted boundaries and in segmenting previously problematic classes.
End-to-End Eye Movement Detection Using Convolutional Neural Networks
S. Hoppe and A. Bulling
Technical Report, 2016 (arXiv: 1609.02452)
Abstract
Common computational methods for automated eye movement detection - i.e. the task of detecting different types of eye movement in a continuous stream of gaze data - are limited in that they either involve thresholding on hand-crafted signal features, require individual detectors each detecting only a single movement, or require pre-segmented data.
We propose a novel approach for eye movement detection that only involves learning a single detector end-to-end, i.e. directly from the continuous gaze data stream and simultaneously for different eye movements, without any manual feature crafting or segmentation. Our method is based on convolutional neural networks (CNNs), which have recently demonstrated superior performance in a variety of tasks in computer vision, signal processing, and machine learning. We further introduce a novel multi-participant dataset that contains scripted and free-viewing sequences of ground-truth-annotated saccades, fixations, and smooth pursuits. We show that our CNN-based method outperforms state-of-the-art baselines by a large margin on this challenging dataset, thereby underlining the significant potential of this approach for holistic, robust, and accurate eye movement protocol analysis.
A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox and B. Schiele
Technical Report, 2016 (arXiv: 1607.06317)
Abstract
Recently, minimum cost multicut formulations have been proposed and proven successful in both motion trajectory segmentation and multi-target tracking scenarios. Both tasks benefit from decomposing a graphical model into an optimal number of connected components based on attractive and repulsive pairwise terms. The two tasks are formulated on different levels of granularity and, accordingly, leverage mostly local information for motion segmentation and mostly high-level information for multi-target tracking. In this paper we argue that point trajectories and their local relationships can contribute to the high-level task of multi-target tracking, and that high-level cues from object detection and tracking are helpful for solving motion segmentation.
We propose a joint graphical model for point trajectories and object detections whose multicuts are solutions to motion segmentation and multi-target tracking problems at once. Results on the FBMS59 motion segmentation benchmark as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark demonstrate the promise of this joint approach.
To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction
W. Li, S. Azimi, A. Leonardis and M. Fritz
Technical Report, 2016 (arXiv: 1604.00066)
Abstract
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that such skills are acquired by infants from observation at a very early stage. In this paper, we contrast the more traditional approach of taking a model-based route with explicit 3D representations and physical simulation with an end-to-end approach that directly predicts stability and related quantities from appearance. We ask whether, and to what extent and quality, such a skill can be acquired directly in a data-driven way, bypassing the need for an explicit simulation. We present a learning-based approach, based on simulated data, that predicts the stability of towers of wooden blocks under different conditions, as well as quantities related to the potential fall of the towers. The evaluation is carried out on synthetic data and compared to human judgments on the same stimuli.
Tutorial on Answering Questions about Images with Deep Learning
M. Malinowski and M. Fritz
Technical Report, 2016 (arXiv: 1610.01076)
Abstract
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged.
In this tutorial, we build a neural-based approach to answering questions about images. We base our tutorial on two datasets: (mostly) DAQUAR and (to a lesser extent) VQA. With small tweaks, the models that we present here can achieve competitive performance on both datasets; in fact, they are among the best methods that use a combination of an LSTM with a global, full-frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the introduced Kraino, to build various architectures that will lead to further performance improvements on this challenging task.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence
D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2016 (arXiv: 1612.04757)
Abstract
Deep models are the de facto standard in visual decision making due to their impressive performance on a wide array of visual tasks. However, they are frequently seen as opaque and are unable to explain their decisions. In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions. We postulate that deep models can do this as well, and propose our Pointing and Justification (PJ-X) model, which can justify its decision with a sentence and point to the evidence by introspecting its decision and explanation process using an attention mechanism. Unfortunately there is no dataset available with reference explanations for visual decision making. We thus collect two datasets in two domains where it is interesting and challenging to explain decisions. First, we extend the visual question answering task to provide not only an answer but also a natural language explanation for the answer. Second, we focus on explaining human activities, which is traditionally more challenging than object classification.
We extensively evaluate our PJ-X model, on both the justification and pointing tasks, by comparing it to prior models and ablations using both automatic and human evaluations.
Articulated People Detection and Pose Estimation in Challenging Real World Environments
L. Pishchulin
PhD Thesis, Universität des Saarlandes, 2016
EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras (Extended Abstract)
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele and C. Theobalt
Technical Report, 2016b (arXiv: 1701.00142)
Abstract
Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center. They often create discomfort through possibly required marker suits, and their recording volume is severely restricted and often constrained to indoor scenes with controlled backgrounds. We therefore propose a new method for real-time, marker-less, and egocentric motion capture which estimates the full-body skeleton pose from a lightweight stereo pair of fisheye cameras attached to a helmet or virtual-reality headset. It combines the strength of a new generative pose estimation framework for fisheye views with a ConvNet-based body-part detector trained on a new automatically annotated and augmented dataset. Our inside-in method captures full-body motion in general indoor and outdoor scenes, including crowded scenes.
Seeing with Humans: Gaze-Assisted Neural Image Captioning
Y. Sugano and A. Bulling
Technical Report, 2016 (arXiv: 1608.05203)
Abstract
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear whether gaze can also be beneficial for scene-centric tasks, such as image captioning.
We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
2015
On the Interplay between Spontaneous Spoken Instructions and Human Visual Behaviour in an Indoor Guidance Task
N. Koleva, S. Hoppe, M. M. Moniri, M. Staudte and A. Bulling
37th Annual Meeting of the Cognitive Science Society (COGSCI 2015), 2015
Scene Viewing and Gaze Analysis during Phonetic Segmentation Tasks
A. Khan, I. Steiner, R. G. Macdonald, Y. Sugano and A. Bulling
Abstracts of the 18th European Conference on Eye Movements (ECEM 2015), 2015
The Feet in Human-Computer Interaction: A Survey of Foot-Based Interaction
E. Velloso, D. Schmidt, J. Alexander, H. Gellersen and A. Bulling
ACM Computing Surveys, Volume 48, Number 2, 2015
Introduction to the Special Issue on Activity Recognition for Interaction
A. Bulling, U. Blanke, D. Tan, J. Rekimoto and G. Abowd
ACM Transactions on Interactive Intelligent Systems, Volume 4, Number 4, 2015
Efficient Output Kernel Learning for Multiple Tasks
P. Jawanpuria, M. Lapin, M. Hein and B. Schiele
Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015
Top-k Multiclass SVM
M. Lapin, M. Hein and B. Schiele
Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015
Rekonstruktion zerebraler Gefässnetzwerke aus in-vivo μMRA mittels physiologischem Vorwissen zur lokalen Gefässgeometrie [Reconstruction of Cerebral Vessel Networks from in-vivo μMRA Using Physiological Prior Knowledge of Local Vessel Geometry]
M. Rempfler, M. Schneider, G. D. Ielacqua, T. Sprenger, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres and B. H. Menze
Bildverarbeitung für die Medizin 2015 (BVM 2015), 2015
A Study on the Natural History of Scanning Behaviour in Patients with Visual Field Defects after Stroke
T. Loetscher, C. Chen, S. Wignall, A. Bulling, S. Hoppe, O. Churches, N. A. Thomas, M. E. R. Nicholls and A. Lee
BMC Neurology, Volume 15, 2015
Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-scale-translate Tasks
J. Turner, J. Alexander, A. Bulling and H. Gellersen
CHI 2015, 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015
The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay
M. Vidal, R. Bismuth, A. Bulling and H. Gellersen
CHI 2015, 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015
Abstract
The eyes are a rich channel for non-verbal communication in our daily interactions. We propose social gaze interaction as a game mechanic to enhance user interactions with virtual characters. We developed a game from the ground up in which characters are designed to react to the player’s gaze in social ways, such as getting annoyed when the player seems distracted or changing their dialogue depending on the player’s apparent focus of attention. Results from a qualitative user study provide insights about how social gaze interaction is intuitive for users and elicits deep feelings of immersion, and highlight the players’ self-consciousness about their own eye movements through their strong reactions to the characters.
Editorial of Special Issue on Shape Representations Meet Visual Recognition
S. Savarese, M. Sun and M. Stark
Computer Vision and Image Understanding, Volume 139, 2015
Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers
M. Barz, A. Bulling and F. Daiber
Technical Report, 2015
Abstract
Head-mounted eye tracking has significant potential for mobile gaze-based interaction with ambient displays, but current interfaces lack information about the tracker's gaze estimation error. Consequently, current interfaces do not exploit the full potential of gaze input, as the inherent estimation error cannot be dealt with. The error depends on the physical properties of the display and constantly varies with changes in the user's position and distance to the display. In this work we present a computational model of gaze estimation error for head-mounted eye trackers. Our model covers the full processing pipeline for mobile gaze estimation, namely mapping of pupil positions to scene camera coordinates, marker-based display detection, and display mapping. We build the model based on a series of controlled measurements of a sample state-of-the-art monocular head-mounted eye tracker. Results show that our model can predict gaze estimation error with a root mean squared error of 17.99 px (1.96°).
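The abstract above reports gaze estimation error both in pixels on the display and in degrees of visual angle, aggregated as a root mean squared error. As an illustrative sketch (not the authors' model), the helpers below compute per-sample pixel error, convert it to visual angle given an assumed pixel pitch and viewing distance, and aggregate with RMSE; all function names and parameters here are our own assumptions.

```python
import numpy as np

def gaze_error_px(estimated, ground_truth):
    """Euclidean on-screen gaze error in pixels for each sample."""
    est = np.asarray(estimated, float)
    gt = np.asarray(ground_truth, float)
    return np.linalg.norm(est - gt, axis=1)

def px_to_deg(err_px, px_per_mm, viewing_distance_mm):
    """Convert on-screen pixel error to visual angle in degrees,
    given the display's pixel pitch and the viewing distance."""
    return np.degrees(np.arctan2(np.asarray(err_px, float) / px_per_mm,
                                 viewing_distance_mm))

def rmse(errors):
    """Root mean squared error over a set of per-sample errors."""
    return float(np.sqrt(np.mean(np.square(errors))))
```

For example, at roughly 1 px/mm and a 573 mm viewing distance, a 10 px error corresponds to about one degree of visual angle.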
GazeProjector: Location-independent Gaze Interaction on and Across Multiple Displays
C. Lander, S. Gehring, A. Krüger, S. Boring and A. Bulling
Technical Report, 2015
Abstract
Mobile gaze-based interaction with multiple displays may occur from arbitrary positions and orientations. However, maintaining high gaze estimation accuracy still represents a significant challenge. To address this, we present GazeProjector, a system that combines accurate point-of-gaze estimation with natural feature tracking on displays to determine the mobile eye tracker’s position relative to a display. The detected eye positions are transformed onto that display, allowing for gaze-based interaction. This allows for seamless gaze estimation and interaction on (1) multiple displays of arbitrary sizes, (2) independently of the user’s position and orientation to the display. In a user study with 12 participants we compared GazeProjector to existing well-established methods such as visual on-screen markers and a state-of-the-art motion capture system. Our results show that our approach is robust to varying head poses, orientations, and distances to the display, while still providing high gaze estimation accuracy across multiple displays without re-calibration. The system represents an important step towards the vision of pervasive gaze-based interfaces.
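GazeProjector transforms gaze estimates from the mobile tracker's scene camera onto a display whose pose is recovered via natural feature tracking; for a planar display, such a transfer between two views of a plane is a homography. The sketch below shows only that standard transfer step, not the authors' implementation: it assumes a 3x3 homography `H` has already been estimated by the tracking stage, and the function name is ours.

```python
import numpy as np

def project_gaze(H, gaze_xy):
    """Map 2D gaze points from scene-camera coordinates into display
    coordinates using a 3x3 planar homography H."""
    pts = np.asarray(gaze_xy, float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    mapped = homog @ H.T                              # apply H to each point
    return mapped[:, :2] / mapped[:, 2:3]             # back to Euclidean
```

With an identity-like H the points pass through unchanged; in practice H would be re-estimated continuously as the user moves, which is what removes the need for per-display re-calibration.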
An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation
E. Velloso, J. Turner, J. Alexander, A. Bulling and H. Gellersen
Human-Computer Interaction -- INTERACT 2015, 2015
Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position
E. Velloso, J. Alexander, A. Bulling and H. Gellersen
Human-Computer Interaction -- INTERACT 2015, 2015
See the Difference: Direct Pre-Image Reconstruction and Pose Estimation by Differentiating HOG
W.-C. Chiu and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts
M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox and B. Andres
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Motion Trajectory Segmentation via Minimum Cost Multicuts
M. Keuper, B. Andres and T. Brox
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
M. Malinowski, M. Rohrbach and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Person Recognition in Personal Photo Collections
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval
G. Sharma and B. Schiele
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Rendering of Eyes for Eye-Shape Registration and Gaze Estimation
E. Wood, T. Baltrusaitis, X. Zhang, Y. Sugano, P. Robinson and A. Bulling
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Evaluation of Output Embeddings for Fine-grained Image Classification
Z. Akata, S. Reed, D. Walter, H. Lee and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Enriching Object Detection with 2D-3D Registration and Continuous Viewpoint Estimation
C. Choy, M. Stark and S. Savarese
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Efficient ConvNet-based Marker-less Motion Capture in General Scenes with a Low Number of Cameras
A. Elhayek, E. de Aguiar, J. Tompson, A. Jain, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Taking a Deeper Look at Pedestrians
J. Hosang, M. Omran, R. Benenson and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Image Retrieval using Scene Graphs
J. Johnson, R. Krishna, M. Stark, J. Li, M. Bernstein and L. Fei-Fei
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Classifier Based Graph Construction for Video Segmentation
A. Khoreva, F. Galasso, M. Hein and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
A Flexible Tensor Block Coordinate Ascent Scheme for Hypergraph Matching
Q. N. Nguyen, A. Gautier and M. Hein
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
A Dataset for Movie Description
A. Rohrbach, M. Rohrbach, N. Tandon and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Prediction of Search Targets from Fixations in Open-world Settings
H. Sattar, S. Müller, M. Fritz and A. Bulling
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Subgraph Decomposition for Multi-target Tracking
S. Tang, B. Andres, M. Andriluka and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Filtered Channel Features for Pedestrian Detection
S. Zhang, R. Benenson and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Appearance-based Gaze Estimation in the Wild
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
3D Object Class Detection in the Wild
B. Pepik, M. Stark, P. Gehler, T. Ritschel and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition Workshops (3DSI 2015), 2015
Joint Segmentation and Activity Discovery using Semantic and Temporal Priors
J. Seiter, W.-C. Chiu, M. Fritz, O. Amft and G. Tröster
IEEE International Conference on Pervasive Computing and Communication (PERCOM 2015), 2015
Teaching Robots the Use of Human Tools from Demonstration with Non-dexterous End-effectors
W. Li and M. Fritz
2015 IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS 2015), 2015
GyroPen: Gyroscopes for Pen-Input with Mobile Phones
T. Deselaers, D. Keysers, J. Hosang and H. Rowley
IEEE Transactions on Human-Machine Systems, Volume 45, Number 2, 2015
Appearance-based Gaze Estimation with Online Calibration from Mouse Operations
Y. Sugano, Y. Matsushita, Y. Sato and H. Koike
IEEE Transactions on Human-Machine Systems, Volume 45, Number 6, 2015
Gaze Estimation From Eye Appearance: A Head Pose-free Method via Eye Image Synthesis
F. Lu, Y. Sugano, T. Okabe and Y. Sato
IEEE Transactions on Image Processing, Volume 24, Number 11, 2015
Detecting Surgical Tools by Modelling Local Appearance and Global Shape
D. Bouget, R. Benenson, M. Omran, L. Riffaud, B. Schiele and P. Jannin
IEEE Transactions on Medical Imaging, Volume 34, Number 12, 2015
Multi-view and 3D Deformable Part Models
B. Pepik, M. Stark, P. Gehler and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 37, Number 11, 2015
Emotion Recognition from Embedded Bodily Expressions and Speech During Dyadic Interactions
P. Müller, S. Amin, P. Verma, M. Andriluka and A. Bulling
International Conference on Affective Computing and Intelligent Interaction (ACII 2015), 2015
A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems
J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy and C. Rother
International Journal of Computer Vision, Volume 115, Number 2, 2015
Abstract
Szeliski et al. published an influential study in 2006 on energy minimization methods for Markov Random Fields (MRF). This study provided valuable insights into choosing the best optimization technique for certain classes of problems. While these insights remain generally useful today, the phenomenal success of random field models means that the kinds of inference problems that have to be solved have changed significantly. Specifically, the models today often include higher order interactions, flexible connectivity structures, large label-spaces of different cardinalities, or learned energy tables. To reflect these changes, we provide a modernized and enlarged study. We present an empirical comparison of 32 state-of-the-art optimization techniques on a corpus of 2,453 energy minimization instances from diverse applications in computer vision. To ensure reproducibility, we evaluate all methods in the OpenGM 2 framework and report extensive results regarding runtime and solution quality. Key insights from our study agree with the results of Szeliski et al. for the types of models they studied. However, on new and challenging types of models our findings disagree and suggest that polyhedral methods and integer programming solvers are competitive in terms of runtime and solution quality over a large range of model types.
Towards Scene Understanding with Detailed 3D Object Representations
Z. Zia, M. Stark and K. Schindler
International Journal of Computer Vision, Volume 112, Number 2, 2015
Walking Reduces Spatial Neglect
T. Loetscher, C. Chen, S. Hoppe, A. Bulling, S. Wignall, C. Owen, N. Thomas and A. Lee
Journal of the International Neuropsychological Society, 2015
Bridging the Gap Between Synthetic and Real Data
M. Fritz
Machine Learning with Interdependent and Non-Identically Distributed Data, 2015
Reconstructing Cerebrovascular Networks under Local Physiological Constraints by Integer Programming
M. Rempfler, M. Schneider, G. D. Ielacqua, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres and B. H. Menze
Medical Image Analysis, Volume 25, Number 1, 2015
Graphical Passwords in the Wild: Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes
F. Alt, S. Schneegass, A. Shirazi, M. Hassib and A. Bulling
MobileHCI’15, 17th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2015
What is Holding Back Convnets for Detection?
B. Pepik, R. Benenson, T. Ritschel and B. Schiele
Pattern Recognition (GCPR 2015), 2015
The Long-short Story of Movie Description
A. Rohrbach, M. Rohrbach and B. Schiele
Pattern Recognition (GCPR 2015), 2015
Eye Tracking for Public Displays in the Wild
Y. Zhang, M. K. Chong, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 19, Number 5, 2015
Characterizing Information Diets of Social Media Users
J. Kulshrestha, M. B. Zafar, L. E. Espin Noboa, K. Gummadi and S. Ghosh
Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM 2015), 2015
The Cityscapes Dataset
M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele
The Future of Datasets in Vision 2015 (CVPR 2015 Workshop), 2015
Latent Max-margin Metric Learning for Comparing Video Face Tubes
G. Sharma and P. Pérez
The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2015), 2015
Hard to Cheat: A Turing Test based on Answering Questions about Images
M. Malinowski and M. Fritz
Twenty-Ninth AAAI Conference on Artificial Intelligence W6, Beyond the Turing Test (AAAI 2015 W6, Beyond the Turing Test), 2015
(arXiv: 1501.03302)
Abstract
Progress in language and image understanding by machines has sparked the interest of the research community in more open-ended, holistic tasks, and refueled an old AI dream of building intelligent machines. We discuss a few prominent challenges that characterize such holistic tasks and argue for "question answering about images" as a particularly appealing instance of such a holistic task. In particular, we point out that it is a version of a Turing Test that is likely to be more robust to over-interpretations, and contrast it with tasks like grounding and generation of descriptions. Finally, we discuss tools to measure progress in this field.
Discovery of Everyday Human Activities From Long-Term Visual Behaviour Using Topic Models
J. Steil and A. Bulling
UbiComp 2015, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Analyzing Visual Attention During Whole Body Interaction with Public Displays
R. Walter, A. Bulling, D. Lindlbauer, M. Schuessler and J. Müller
UbiComp 2015, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Human Visual Behaviour for Collaborative Human-Machine Interaction
A. Bulling
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Orbits: Enabling Gaze Interaction in Smart Watches Using Moving Targets
A. Esteves, E. Velloso, A. Bulling and H. Gellersen
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Recognition of Curiosity Using Eye Movement Analysis
S. Hoppe, T. Loetscher, S. Morey and A. Bulling
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits
M. Khamis, F. Alt and A. Bulling
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Tackling Challenges of Interactive Public Displays Using Gaze
M. Khamis, A. Bulling and F. Alt
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues
F. Alt, A. Bulling, G. Gravanis and D. Buschek
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
Orbits: Gaze Interaction for Smart Watches using Smooth Pursuit Eye Movements
A. Esteves, E. Velloso, A. Bulling and H. Gellersen
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays
C. Lander, S. Gehring, A. Krüger, S. Boring and A. Bulling
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
Self-calibrating Head-mounted Eye Trackers Using Egocentric Visual Saliency
Y. Sugano and A. Bulling
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
What Makes for Effective Detection Proposals?
J. Hosang, R. Benenson, P. Dollár and B. Schiele
Technical Report, 2015
(arXiv: 1502.05082)
Abstract
Current top performing object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity and widespread use of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of twelve proposal methods along with four baselines regarding proposal repeatability, ground truth annotation recall on PASCAL and ImageNet, and impact on DPM and R-CNN detection performance. Our analysis shows that for object detection improving proposal localisation accuracy is as important as improving recall. We introduce a novel metric, the average recall (AR), which rewards both high recall and good localisation and correlates surprisingly well with detector performance. Our findings show common strengths and weaknesses of existing methods, and provide insights and metrics for selecting and tuning proposal methods.
Richer Object Representations for Object Class Detection in Challenging Real World Image
B. Pepik
PhD Thesis, Universität des Saarlandes, 2015
GazeDPM: Early Integration of Gaze Information in Deformable Part Models
I. Shcherbatyi, A. Bulling and M. Fritz
Technical Report, 2015
(arXiv: 1505.05753)
Abstract
An increasing number of works explore collaborative human-computer systems in which human gaze is used to enhance computer vision systems. For object detection these efforts were so far restricted to late integration approaches that have inherent limitations, such as increased precision without increase in recall. We propose an early integration approach in a deformable part model, which constitutes a joint formulation over gaze and visual data. We show that our GazeDPM method improves over the state-of-the-art DPM baseline by 4% and a recent method for gaze-supported object detection by 3% on the public POET dataset. Our approach additionally provides introspection of the learnt models, can reveal salient image structures, and allows us to investigate the interplay between gaze attracting and repelling areas, the importance of view-specific models, as well as viewers' personal biases in gaze patterns. We finally study important practical aspects of our approach, such as the impact of using saliency maps instead of real fixations, the impact of the number of fixations, as well as robustness to gaze estimation error.
Labeled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments
M. Tonsen, X. Zhang, Y. Sugano and A. Bulling
Technical Report, 2015
(arXiv: 1511.05768)
Abstract
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people with different ethnicities, a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, as well as make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution, vision aids, as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.
2014
A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors
A. Bulling, U. Blanke and B. Schiele
ACM Computing Surveys, Volume 46, Number 3, 2014
Pursuits: Spontaneous Eye-based Interaction for Dynamic Interfaces
M. Vidal, A. Bulling and H. Gellersen
ACM SIGMOBILE Mobile Computing and Communications Review, Volume 18, Number 4, 2014
Abstract
Although gaze is an attractive modality for pervasive interaction, real-world implementation of eye-based interfaces poses significant challenges. In particular, user calibration is tedious and time consuming. Pursuits is an innovative interaction technique that enables truly spontaneous interaction with eye-based interfaces. A user can simply walk up to the screen and readily interact with moving targets. Instead of being based on gaze location, Pursuits correlates eye pursuit movements with objects dynamically moving on the interface.
A Multi-world Approach to Question Answering about Real-world Scenes based on Uncertain Input
M. Malinowski and M. Fritz
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
Eye Tracking and Eye-based Human–computer Interaction
P. Majaranta and A. Bulling
Advances in Physiological Computing, Springer, 2014
Ubic: Bridging the Gap Between Digital Cryptography and the Physical World
M. Simkin, A. Bulling, M. Fritz and D. Schröder
Computer Security - ESORICS 2014, 2014
Estimation of Human Body Shape and Posture under Clothing
S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu and J. Lang
Computer Vision and Image Understanding, Volume 127, 2014
Face Detection Without Bells and Whistles
M. Mathias, R. Benenson, M. Pedersoli and L. Van Gool
Computer Vision - ECCV 2014, 2014
Multiple Human Pose Estimation with Temporally Consistent 3D Pictorial Structures
X. Wang, B. Schiele, P. Fua, V. Belagiannis, S. Ilic and N. Navab
Computer Vision - ECCV 2014 Workshops, 2014
First International Workshop on Video Segmentation -- Panel Discussion
T. Brox, F. Galasso, F. Li, J. M. Rehg and B. Schiele
Computer Vision -- ECCV 2014 Workshops, 2014
Ten Years of Pedestrian Detection, What Have We Learned?
R. Benenson, M. Omran, J. Hosang and B. Schiele
Computer Vision - ECCV 2014 Workshops (ECCV 2014 Workshop CVRSUAD), 2014
2D Human Pose Estimation: New Benchmark and State of the Art Analysis
M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
3D Pictorial Structures for Multiple Human Pose Estimation
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab and S. Ilic
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation
F. Galasso, M. Keuper, T. Brox and B. Schiele
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Anytime Recognition of Objects and Scenes
S. Karayev, M. Fritz and T. Darrell
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Scalable Multitask Representation Learning for Scene Classification
M. Lapin, B. Schiele and M. Hein
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Image-based Synthesis and Re-Synthesis of Viewpoints Guided by 3D Models
K. Rematas, T. Ritschel, M. Fritz and T. Tuytelaars
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Are Cars Just 3D Boxes? - Jointly Estimating the 3D Shape of Multiple Objects
M. Z. Zia, M. Stark and K. Schindler
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Cognition-aware Computing
A. Bulling and T. O. Zander
IEEE Pervasive Computing, Volume 13, Number 3, 2014
3D Traffic Scene Understanding from Movable Platforms
A. Geiger, M. Lauer, C. Wojek, C. Stiller and R. Urtasun
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 36, Number 5, 2014
Learning Human Pose Estimation Features with Convolutional Networks
A. Jain, J. Tompson, M. Andriluka, G. W. Taylor and C. Bregler
International Conference on Learning Representations 2014 (ICLR 2014), 2014
(arXiv: 1312.7302)
Abstract
This paper introduces a new architecture for human pose estimation using a multi-layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema show significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably that it is possible to learn strong low-level feature detectors on features that might even just cover a few pixels in the image. Higher-level spatial models improve the overall result somewhat, but to a much lesser extent than expected. Many researchers previously argued that kinematic structure and top-down information are crucial for this domain, but with our purely bottom-up, weak spatial model we could improve on other, more complicated architectures that currently produce the best results. This mirrors what many other researchers, in speech recognition, object recognition, and other domains, have experienced.
Multi-view Priors for Learning Detectors from Sparse Viewpoint Data
B. Pepik, M. Stark, P. Gehler and B. Schiele
International Conference on Learning Representations 2014 (ICLR 2014), 2014
(arXiv: 1312.6095)
Abstract
While the majority of today's object class models provide only 2D bounding boxes, far richer output hypotheses are desirable including viewpoint, fine-grained category, and 3D geometry estimate. However, models trained to provide richer output require larger amounts of training data, preferably well covering the relevant aspects such as viewpoint and fine-grained categories. In this paper, we address this issue from the perspective of transfer learning, and design an object class model that explicitly leverages correlations between visual features. Specifically, our model represents prior distributions over permissible multi-view detectors in a parametric way -- the priors are learned once from training data of a source object class, and can later be used to facilitate the learning of a detector for a target class. As we show in our experiments, this transfer is not only beneficial for detectors based on basic-level category representations, but also enables the robust learning of detectors that represent classes at finer levels of granularity, where training data is typically even scarcer and more unbalanced. As a result, we report largely improved performance in simultaneous 2D object localization and viewpoint estimation on a recent dataset of challenging street scenes.
Detection and Tracking of Occluded People
S. Tang, M. Andriluka and B. Schiele
International Journal of Computer Vision, Volume 110, Number 1, 2014
Introduction to the PETMEI Special Issue
A. Bulling and R. Bednarik
Journal of Eye Movement Research, Volume 7, Number 3, 2014
Computer Vision - ECCV 2014
D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars (Eds.)
Springer, 2014
Candidate Sampling for Neuron Reconstruction from Anisotropic Electron Microscopy Volumes
J. Funke, J. N. P. Martel, S. Gerhard, B. Andres, D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, H. Pfister, A. Cardona and M. Cook
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2014, 2014
Extracting Vascular Networks under Physiological Constraints via Integer Programming
M. Rempfler, M. Schneider, G. D. Ielacqua, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres and B. H. Menze
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2014, 2014
Learning Using Privileged Information: SVM+ and Weighted SVM
M. Lapin, M. Hein and B. Schiele
Neural Networks, Volume 53, 2014
Towards a Visual Turing Challenge
M. Malinowski and M. Fritz
NIPS 2014 Workshop on Learning Semantics, 2014
(arXiv: 1410.8027)
Abstract
As language and visual understanding by machines progresses rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to progress towards more challenging and open tasks and refueled the hope of achieving the old AI dream of building machines that could pass a Turing test in open domains. In order to steadily make progress towards this goal, we realize that quantifying performance becomes increasingly difficult. We therefore ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges and give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset of a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that, despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and rather rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
Expressive Models and Comprehensive Benchmark for 2D Human Pose Estimation
L. Pishchulin, M. Andriluka, P. Gehler and B. Schiele
Parts and Attributes (ECCV 2014 Workshop PA), 2014
Test-time Adaptation for 3D Human Pose Estimation
S. Amin, P. Müller, A. Bulling and M. Andriluka
Pattern Recognition (GCPR 2014), 2014
Learning Must-Link Constraints for Video Segmentation Based on Spectral Clustering
A. Khoreva, F. Galasso, M. Hein and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Learning Multi-scale Representations for Material Classification
W. Li and M. Fritz
Pattern Recognition (GCPR 2014), 2014
Fine-grained Activity Recognition with Holistic and Pose Based Features
L. Pishchulin, M. Andriluka and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Cross-device Gaze-supported Point-to-point Content Transfer
J. Turner, A. Bulling, J. Alexander and H. Gellersen
Proceedings ETRA 2014, 2014
EyeTab: Model-based Gaze Estimation on Unmodified Tablet Computers
E. Wood and A. Bulling
Proceedings ETRA 2014, 2014
In the Blink of an Eye: Combining Head Motion and Eye Blink Frequency for Activity Recognition with Google Glass
S. Ishimaru, K. Kunze, K. Kise, J. Weppner, A. Dengel, P. Lukowicz and A. Bulling
Proceedings of the 5th Augmented Human International Conference (AH 2014), 2014
Object Disambiguation for Augmented Reality Applications
W.-C. Chiu, G. Johnson, D. McCulley, O. Grau and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2014), 2014
How Good are Detection Proposals, really?
J. Hosang, R. Benenson and B. Schiele
Proceedings of the British Machine Vision Conference (BMVC 2014), 2014
Abstract
Current top performing Pascal VOC object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of ten object proposal methods along with four baselines regarding ground truth annotation recall (on Pascal VOC 2007 and ImageNet 2013), repeatability, and impact on DPM detector performance. Our findings show common weaknesses of existing methods, and provide insights to choose the most adequate method for different settings.
Pupil-Canthi-Ratio: A Calibration-free Method for Tracking Horizontal Gaze Direction
Y. Zhang, A. Bulling and H. Gellersen
Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI 2014), 2014
Scalable Multitask Representation Learning for Scene Classification
M. Lapin, B. Schiele and M. Hein
Scene Understanding Workshop (SUNw 2014), 2014
Learning People Detectors for Tracking in Crowded Scenes
S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth and B. Schiele
Scene Understanding Workshop (SUNw 2014), 2014
High-Resolution 3D Layout from a Single View
M. Z. Zia, M. Stark and K. Schindler
Scene Understanding Workshop (SUNw 2014), 2014
SmudgeSafe: Geometric Image Transformations for Smudge-resistant User Authentication
S. Schneegass, F. Steimle, A. Bulling, F. Alt and A. Schmidt
UbiComp’14, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014
GazeHorizon: Enabling Passers-by to Interact with Public Displays by Gaze
Y. Zhang, J. Müller, M. K. Chong, A. Bulling and H. Gellersen
UbiComp’14, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014
Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction
M. Kassner, W. Patera and A. Bulling
UbiComp’14 Adjunct, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014
Physically Grounded 3D Scene Interpretation with Detailed Object Models
M. Z. Zia, M. Stark and K. Schindler
Vision Meets Cognition Workshop: Functionality, Physics, Intentionality, and Causality (CVPR 2014 Workshop FPIC), 2014
Zero-Shot Learning with Structured Embeddings
Z. Akata, H. Lee and B. Schiele
Technical Report, 2014
(arXiv: 1409.8403)
Abstract
Despite significant recent advances in image classification, fine-grained classification remains a challenge. In the present paper, we address the zero-shot and few-shot learning scenarios, as obtaining labeled data is especially difficult for fine-grained classification tasks. First, we embed state-of-the-art image descriptors in a label embedding space using side information such as attributes. We argue that learning a joint embedding space that maximizes the compatibility between the input and output embeddings is highly effective for zero/few-shot learning. We show empirically that such embeddings significantly outperform the current state-of-the-art methods on two challenging datasets (Caltech-UCSD Birds and Animals with Attributes). Second, to reduce the amount of costly manual attribute annotations, we use alternate output embeddings based on word-vector representations, obtained from large text corpora without any supervision. We report that such unsupervised embeddings achieve encouraging results, and lead to further improvements when combined with the supervised ones.
Learning Multi-scale Representations for Material Classification
W. Li and M. Fritz
Technical Report, 2014
(arXiv: 1408.2938)
Abstract
The recent progress in sparse coding and deep learning has made unsupervised feature learning methods a strong competitor to hand-crafted descriptors. In computer vision, success stories of learned features have been predominantly reported for object recognition tasks. In this paper, we investigate if and how feature learning can be used for material recognition. We propose two strategies to incorporate scale information into the learning procedure resulting in a novel multi-scale coding procedure. Our results show that our learned features for material recognition outperform hand-crafted descriptors on the FMD and the KTH-TIPS2 material classification benchmarks.
A Pooling Approach to Modelling Spatial Relations for Image Retrieval and Annotation
M. Malinowski and M. Fritz
Technical Report, 2014
(arXiv: 1411.5190)
Abstract
Over the last two decades we have witnessed strong progress on modeling visual object classes, scenes and attributes that have significantly contributed to automated image understanding. On the other hand, surprisingly little progress has been made on incorporating a spatial representation and reasoning in the inference process. In this work, we propose a pooling interpretation of spatial relations and show how it improves image retrieval and annotation tasks involving spatial language. Due to the complexity of the spatial language, we argue for a learning-based approach that acquires a representation of spatial relations by learning parameters of the pooling operator. We show improvements on previous work on two datasets and two different tasks as well as provide additional insights on a new dataset with an explicit focus on spatial relations.
Estimating Maximally Probable Constrained Relations by Mathematical Programming
L. Qu and B. Andres
Technical Report, 2014
(arXiv: 1408.0838)
Abstract
Estimating a constrained relation is a fundamental problem in machine learning. Special cases are classification (the problem of estimating a map from a set of to-be-classified elements to a set of labels), clustering (the problem of estimating an equivalence relation on a set) and ranking (the problem of estimating a linear order on a set). We contribute a family of probability measures on the set of all relations between two finite, non-empty sets, which offers a joint abstraction of multi-label classification, correlation clustering and ranking by linear ordering. Estimating (learning) a maximally probable measure, given (a training set of) related and unrelated pairs, is a convex optimization problem. Estimating (inferring) a maximally probable relation, given a measure, is a 0-1 linear program. It is solved in linear time for maps. It is NP-hard for equivalence relations and linear orders. Practical solutions for all three cases are shown in experiments with real data. Finally, estimating a maximally probable measure and relation jointly is posed as a mixed-integer nonlinear program. This formulation suggests a mathematical programming approach to semi-supervised learning.
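As the abstract notes, inferring a maximally probable relation is solvable in linear time for maps: the objective decomposes per element into an argmax over labels. A minimal sketch under that reading (the function name and input layout are illustrative assumptions):

```python
def map_inference(prob):
    """Maximally probable map: given prob[i][j], the probability that
    element i is related to label j, independently pick the most probable
    label for each element. Runs in time linear in the number of
    (element, label) pairs, as the paper states for the map case."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob]
```

The equivalence-relation and linear-order cases do not decompose this way, which is why they remain NP-hard.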
Combining Visual Recognition and Computational Linguistics: Linguistic Knowledge for Visual Recognition and Natural Language Descriptions of Visual Content
M. Rohrbach
PhD Thesis, Universität des Saarlandes, 2014
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Senina, M. Rohrbach, W. Qiu, A. Friedrich, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
Technical Report, 2014
(arXiv: 1403.6173)
Abstract
Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description are mainly focused on single sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute both to the visual recognition of objects, proposing a hand-centric approach, and to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus of three levels of detail.
2013
Where Next in Object Recognition and how much Supervision Do We Need?
S. Ebert and B. Schiele
Advanced Topics in Computer Vision, 2013
Transfer Learning in a Transductive Setting
M. Rohrbach, S. Ebert and B. Schiele
Advances in Neural Information Processing Systems 26 (NIPS 2013), 2013
Abstract
Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels, however, is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three main ingredients. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expert-specified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm - so far only used for semi-supervised learning - to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets.
EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour
A. Bulling, C. Weichel and H. Gellersen
CHI 2013, The 31st Annual CHI Conference on Human Factors in Computing Systems, 2013
Abstract
Automatic annotation of life logging data is challenging. In this work we present EyeContext, a system to infer high-level contextual cues from human visual behaviour. We conduct a user study to record eye movements of four participants over a full day of their daily life, totalling 42.5 hours of eye movement data. Participants were asked to self-annotate four non-mutually exclusive cues: social (interacting with somebody vs. no interaction), cognitive (concentrated work vs. leisure), physical (physically active vs. not active), and spatial (inside vs. outside a building). We evaluate a proof-of-concept EyeContext system that combines encoding of eye movements into strings and a spectrum string kernel support vector machine (SVM) classifier. Using person-dependent training, we obtain a top performance of 85.3% precision (98.0% recall) for recognising social interactions. Our results demonstrate the large information content available in long-term human visual behaviour and open up new avenues for research on eye-based behavioural monitoring and life logging.
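The classifier described above combines string-encoded eye movements with a spectrum string kernel. A minimal sketch of such a kernel; the encoding of eye movements into symbols is assumed, and the function name is illustrative:

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """k-spectrum string kernel: the inner product of the k-mer count
    vectors of two strings (here, symbol sequences encoding saccades,
    fixations, etc.). Can be plugged into a kernel SVM as-is."""
    grams_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    grams_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Missing k-mers contribute zero, since Counter returns 0 for them.
    return sum(count * grams_t[g] for g, count in grams_s.items())
```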
MotionMA: Motion Modelling and Analysis by Demonstration
E. Velloso, A. Bulling and H. Gellersen
CHI 2013, The 31st Annual CHI Conference on Human Factors in Computing Systems, 2013
SideWays: A Gaze Interface for Spontaneous Interaction with Situated Displays
Y. Zhang, A. Bulling and H. Gellersen
CHI 2013, The 31st Annual CHI Conference on Human Factors in Computing Systems, 2013
Pursuits: Eye-based Interaction with Moving Targets
M. Vidal, K. Pfeuffer, A. Bulling and H. W. Gellersen
CHI 2013 Extended Abstracts, 2013
Abstract
Eye-based interaction has commonly been based on estimation of eye gaze direction, to locate objects for interaction. We introduce Pursuits, a novel and very different eye tracking method that instead is based on following the trajectory of eye movement and comparing this with trajectories of objects in the field of view. Because the eyes naturally follow the trajectory of moving objects of interest, our method is able to detect what the user is looking at, by matching eye movement and object movement. We illustrate Pursuits with three applications that demonstrate how the method facilitates natural interaction with moving targets.
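The trajectory-matching idea described above can be sketched as correlating the gaze trace with each target's trace over a time window. A minimal sketch under that reading; the threshold, averaging scheme, and array shapes are assumptions, not the paper's parameters:

```python
import numpy as np

def pursuit_match(eye_xy, targets_xy, threshold=0.8):
    """Return the index of the moving target whose on-screen trajectory
    best correlates with the gaze trajectory over the same time window,
    or None if no target exceeds the correlation threshold.

    eye_xy: (T, 2) array of gaze samples; targets_xy: list of (T, 2)
    arrays, one per displayed target."""
    best_idx, best_score = None, threshold
    for i, tgt in enumerate(targets_xy):
        # Pearson correlation computed per axis, then averaged.
        rx = np.corrcoef(eye_xy[:, 0], tgt[:, 0])[0, 1]
        ry = np.corrcoef(eye_xy[:, 1], tgt[:, 1])[0, 1]
        score = (rx + ry) / 2.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

Because only relative motion is compared, no gaze-to-screen calibration is needed, which is what makes the technique attractive for spontaneous interaction.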
A Category-level 3D Object Dataset: Putting the Kinect to Work
A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko and T. Darrell
Consumer Depth Cameras for Computer Vision, 2013
Multi-view Pictorial Structures for 3D Human Pose Estimation
S. Amin, M. Andriluka, M. Rohrbach and B. Schiele
Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), 2013
Learning Smooth Pooling Regions for Visual Recognition
M. Malinowski and M. Fritz
Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), 2013
Abstract
From the early HMAX model to Spatial Pyramid Matching, spatial pooling has played an important role in visual recognition pipelines. By aggregating local statistics, it equips the recognition pipelines with a certain degree of robustness to translation and deformation yet preserving spatial information. Despite its predominance in current recognition systems, we have seen little progress to fully adapt the pooling strategy to the task at hand. In this paper, we propose a flexible parameterization of the spatial pooling step and learn the pooling regions together with the classifier. We investigate a smoothness regularization term that in conjunction with an efficient learning scheme makes learning scalable. Our framework can work with both popular pooling operators: sum-pooling and max-pooling. Finally, we show benefits of our approach for object recognition tasks based on visual words and higher level event recognition tasks based on object-bank features. In both cases, we improve over the hand-crafted spatial pooling step showing the importance of its adaptation to the task.
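The learnable pooling step described above can be read as replacing a fixed spatial grid with one weight mask per pooling region, learned jointly with the classifier. A minimal sketch of the sum-pooling variant under that reading; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def weighted_sum_pool(codes, weights):
    """Pool local codes with per-region spatial weight masks.

    codes:   (H, W, K) array of local feature codes (e.g. visual-word
             assignments at each spatial position).
    weights: (R, H, W) array, one learnable mask per pooling region.
    Returns an (R, K) pooled representation. A uniform mask over a grid
    cell recovers standard sum-pooling; learning `weights` jointly with
    the classifier is the adaptation the paper argues for."""
    H, W, K = codes.shape
    return weights.reshape(weights.shape[0], -1) @ codes.reshape(H * W, K)
```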
Segmenting Planar Superpixel Adjacency Graphs w.r.t. Non-planar Superpixel Affinity Graphs
B. Andres, J. Yarkony, B. S. Manjunath, S. Kirchhoff, E. Turetken, C. C. Fowlkes and H. Pfister
Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 2013), 2013
AutoBAP: Automatic Coding of Body Action and Posture Units from Wearable Sensors
E. Velloso, A. Bulling and H. Gellersen
2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII 2013), 2013
Eye Pull, Eye Push: Moving Objects between Large Screens and Personal Devices with Gaze & Touch
J. Turner, J. Alexander, A. Bulling, D. Schmidt and H. Gellersen
Human-Computer Interaction – INTERACT 2013, 2013
Abstract
Previous work has validated the eyes and mobile input as a viable approach for pointing at, and selecting out of reach objects. This work presents Eye Pull, Eye Push, a novel interaction concept for content transfer between public and personal devices using gaze and touch. We present three techniques that enable this interaction: Eye Cut & Paste, Eye Drag & Drop, and Eye Summon & Cast. We outline and discuss several scenarios in which these techniques can be used. In a user study we found that participants responded well to the visual feedback provided by Eye Drag & Drop during object movement. In contrast, we found that although Eye Summon & Cast significantly improved performance, participants had difficulty coordinating their hands and eyes during interaction.
A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis
F. Galasso, N. S. Nagaraja, T. Jiménez Cárdenas, T. Brox and B. Schiele
ICCV 2013, IEEE International Conference on Computer Vision, 2013
Sequential Bayesian Model Update under Structured Scene Prior for Semantic Road Scenes Labeling
E. Levinkov and M. Fritz
ICCV 2013, IEEE International Conference on Computer Vision, 2013
Handling Occlusions with Franken-classifiers
M. Mathias, R. Benenson, R. Timofte and L. van Gool
ICCV 2013, IEEE International Conference on Computer Vision, 2013
Translating Video Content to Natural Language Descriptions
M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal and B. Schiele
ICCV 2013, IEEE International Conference on Computer Vision, 2013
Abstract
Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset, which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
Learning People Detectors for Tracking in Crowded Scenes
S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth and B. Schiele
ICCV 2013, IEEE International Conference on Computer Vision, 2013
Abstract
People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a common failure case for tracking in crowded scenes. To explicitly address remaining failure cases of the tracker we explore two methods. First, we analyze typical failure cases of trackers and train a detector explicitly on those failure cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both detection accuracy as well as tracker performance, improving the state-of-the-art on standard benchmarks.
Seeking the Strongest Rigid Detector
R. Benenson, M. Mathias, T. Tuytelaars and L. van Gool
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
Multi-class Video Co-segmentation with a Generative Multi-video Model
W.-C. Chiu and M. Fritz
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
A Comparative Study of Modern Inference Techniques for Discrete Energy Minimization Problems
J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis and C. Rother
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
Occlusion Patterns for Object Class Detection
B. Pepik, M. Stark, P. Gehler and B. Schiele
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
Poselet Conditioned Pictorial Structures
L. Pishchulin, M. Andriluka, P. Gehler and B. Schiele
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
Reconstructing Loopy Curvilinear Structures Using Integer Programming
E. Turetken, F. Benmansour, B. Andres, H. Pfister and P. Fua
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
Explicit Occlusion Modeling for 3D Object Class Representations
Z. Zia, M. Stark and K. Schindler
2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), 2013
3D Object Representations for Fine-grained Categorization
J. Krause, M. Stark, J. Deng and L. Fei-Fei
2013 IEEE International Conference on Computer Vision Workshops (ICCVW 2013), 2013
Monocular Visual Scene Understanding: Understanding Multi-object Traffic Scenes
C. Wojek, S. Walk, S. Roth, K. Schindler and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 35, Number 4, 2013
Detailed 3D Representations for Object Recognition and Modeling
Z. Zia, M. Stark, B. Schiele and K. Schindler
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 35, Number 11, 2013
Learnable Pooling Regions for Image Classification
M. Malinowski and M. Fritz
International Conference on Learning Representations Workshop Proceedings (ICLR 2013), 2013
(arXiv: 1301.3516)
Abstract
Biologically inspired, from the early HMAX model to Spatial Pyramid Matching, pooling has played an important role in visual recognition pipelines. Spatial pooling, by grouping of local codes, equips these methods with a certain degree of robustness to translation and deformation yet preserving important spatial information. Despite the predominance of this approach in current recognition systems, we have seen little progress to fully adapt the pooling strategy to the task at hand. This paper proposes a model for learning a task-dependent pooling scheme -- including previously proposed hand-crafted pooling schemes as a particular instantiation. In our work, we investigate the role of different regularization terms showing that the smooth regularization term is crucial to achieve strong performance using the presented architecture. Finally, we propose an efficient and parallel method to train the model. Our experiments show improved performance over hand-crafted pooling schemes on the CIFAR-10 and CIFAR-100 datasets -- in particular improving the state-of-the-art to 56.29% on the latter.
Traffic Sign Recognition - How far are we from the solution?
M. Mathias, R. Timofte, R. Benenson and L. Van Gool
2013 International Joint Conference on Neural Networks (IJCNN 2013), 2013
I Know What You Are Reading - Recognition of Document Types Using Mobile Eye Tracking
K. Kunze, Y. Utsumi, S. Yuki, K. Kise and A. Bulling
ISWC’13, ACM International Symposium on Wearable Computers, 2013
Pattern Recognition
J. Weickert, M. Hein and B. Schiele (Eds.)
Springer, 2013
Signal Processing Technologies for Activity-aware Smart Textiles
D. Roggen, G. Tröster and A. Bulling
Multidisciplinary Know-How for Smart-Textiles Developers, 2013
Abstract
Garments made of smart textiles have an enormous potential for embedding sensors in close proximity to the body in an unobtrusive and comfortable manner. Combined with signal processing and pattern recognition technologies, complex high-level information about human behaviors or situations can be inferred from the sensor data. The goal of this chapter is to introduce the reader to the design of activity-aware systems that use body-worn sensors, such as those that can be made available through smart textiles. We start this chapter by emphasizing recent trends towards ‘wearable’ sensing and computing and we present several examples of activity-aware applications. Then we outline the role that smart textiles can play in activity-aware applications, but also the challenges that they pose. We conclude by discussing the design process followed to devise activity-aware systems: the choice of sensors, the available data processing methods, and the evaluation techniques. We discuss recent data processing methods that address the challenges resulting from the use of smart textiles.
Dynamic Feature Selection for Classification on a Budget
S. Karayev, M. Fritz and T. Darrell
Prediction with Sequential Models (ICML 2013 Workshop), 2013
Eye Drop: An Interaction Concept for Gaze-supported Point-to-point Content Transfer
J. Turner, A. Bulling, J. Alexander and H. Gellersen
Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia (MUM 2013), 2013
Qualitative Activity Recognition of Weight Lifting Exercises
E. Velloso, A. Bulling, H. Gellersen, W. Ugulino and H. Fuks
Proceedings of the 4th Augmented Human International Conference (AH 2013), 2013
Abstract
Research on human activity recognition has traditionally focused on discriminating between different activities, i.e. to predict ‘which’ activity was performed at a specific point in time. The quality of executing an activity, the ‘how (well)’, has only received little attention so far, even though it potentially provides useful information for a large variety of applications, such as sports training. In this work we first define quality of execution and investigate three aspects that pertain to qualitative activity recognition: the problem of specifying correct execution, the automatic and robust detection of execution mistakes, and how to provide feedback on the quality of execution to the user. We illustrate our approach on the example problem of qualitatively assessing and providing feedback on weight lifting exercises. In two user studies we try out a sensor- and a model-based approach to qualitative activity recognition. Our results underline the potential of model-based assessment and the positive impact of real-time user feedback on the quality of execution.
Towards Scene Understanding with Detailed 3D Object Representations
Z. Zia, M. Stark and K. Schindler
Scene Understanding Workshop (SUNw 2013), 2013
Collecting a Large-scale Dataset of Fine-grained Cars
J. Krause, J. Deng, M. Stark and L. Fei-Fei
Second Workshop on Fine-Grained Visual Categorization (FGVC2), 2013
Modeling Instance Appearance for Recognition - Can We Do Better Than EM?
A. Chou, H. Wang, M. Stark and D. Koller
Structured Prediction : Tractability, Learning, and Inference (CVPR 2013 Workshop SPTLI), 2013
Grounding Action Descriptions in Videos
M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele and M. Pinkal
Transactions of the Association for Computational Linguistics, Volume 1, 2013
Pursuits: Spontaneous Interaction with Displays based on Smooth Pursuit Eye Movement and Moving Targets
M. Vidal, A. Bulling and H. Gellersen
UbiComp’13, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2013
Pursuit Calibration: Making Gaze Calibration Less Tedious and More Flexible
K. Pfeuffer, M. Vidal, J. Turner, A. Bulling and H. Gellersen
UIST’13, ACM Symposium on User Interface Software and Technology, 2013
Abstract
Eye gaze is a compelling interaction modality but requires a user calibration before interaction can commence. State of the art procedures require the user to fixate on a succession of calibration markers, a task that is often experienced as difficult and tedious. We present a novel approach, pursuit calibration, that instead uses moving targets for calibration. Users naturally perform smooth pursuit eye movements when they follow a moving target, and we use correlation of eye and target movement to detect the user’s attention and to sample data for calibration. Because the method knows when the user is attending to a target, the calibration can be performed implicitly, which enables more flexible design of the calibration task. We demonstrate this in application examples and user studies, and show that pursuit calibration is tolerant to interruption, can blend naturally with applications, and is able to calibrate users without their awareness.
3rd International Workshop on Pervasive Eye Tracking and Mobile Eye-based Interaction
A. Bulling and R. Bednarik (Eds.)
petmei.org, 2013
Proceedings of the 4th Augmented Human International Conference
A. Schmidt, A. Bulling and C. Holz (Eds.)
ACM, 2013
Abstract
We are very happy to present the proceedings of the 4th Augmented Human International Conference (Augmented Human 2013). Augmented Human 2013 focuses on augmenting human capabilities through technology for increased well-being and enjoyable human experience. The conference is in cooperation with ACM SIGCHI, with its proceedings to be archived in ACM’s Digital Library. With technological advances, computing has progressively moved beyond the desktop into new physical and social contexts. As physical artifacts gain new computational behaviors, they become reprogrammable, customizable, repurposable, and interoperable in rich ecologies and diverse contexts. They also become more complex, and require intense design effort in order to be functional, usable, and enjoyable. Designing such systems requires interdisciplinary thinking. Their creation must not only encompass software, electronics, and mechanics, but also the system’s physical form and behavior, its social and physical milieu, and beyond.
2012
Timely Object Recognition
S. Karayev, T. Baumgartner, M. Fritz and T. Darrell
Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012
Human Context: Modeling Human-Human Interactions for Monocular 3D Pose Estimation
M. Andriluka and L. Sigal
Articulated Motion and Deformable Objects (AMDO 2012), 2012
Semi-supervised Learning on a Budget: Scaling Up to Large Datasets
S. Ebert, M. Fritz and B. Schiele
Computer Vision - ACCV 2012, 2012
Video Segmentation with Superpixels
F. Galasso, R. Cipolla and B. Schiele
Computer Vision - ACCV 2012, 2012
The Pooled NBNN Kernel: Beyond Image-to-Class and Image-to-Image
K. Rematas, M. Fritz and T. Tuytelaars
Computer Vision - ACCV 2012, 2012
What Makes a Good Detector? - Structured Priors for Learning from Few Examples
T. Gao, M. Stark and D. Koller
Computer Vision - ECCV 2012, 2012
A Discrete Chain Graph Model for 3d+t Cell Tracking with High Misdetection Robustness
B. X. Kausler, S. Martin, B. Andres, M. Lindner, U. Köthe, H. Leitte, H. Wittbrodt, L. Hufnagel and F. A. Hamprecht
Computer Vision - ECCV 2012, 2012
Recognizing Materials from Virtual Examples
W. Li and M. Fritz
Computer Vision - ECCV 2012, 2012
3D2PM - 3D Deformable Part Models
B. Pepik, P. Gehler, M. Stark and B. Schiele
Computer Vision - ECCV 2012, 2012
Script Data for Attribute-based Recognition of Composite Activities
M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal and B. Schiele
Computer Vision - ECCV 2012, 2012
Sparselet Models for Efficient Multiclass Object Detection
H. O. Song, S. Zickler, T. Althoff, R. B. Girshick, M. Fritz, C. Geyer, P. F. Felzenszwalb and T. Darrell
Computer Vision - ECCV 2012, 2012
3D Object Detection with Multiple Kinects
W. Susanto, M. Rohrbach and B. Schiele
Computer Vision - ECCV 2012, 2012
Fine-grained Categorization for 3D Scene Understanding
M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele and D. Koller
Electronic Proceedings of the British Machine Vision Conference 2012 (BMVC 2012), 2012
Detection and Tracking of Occluded People
S. Tang, M. Andriluka and B. Schiele
Electronic Proceedings of the British Machine Vision Conference 2012 (BMVC 2012), 2012
RALF: A Reinforced Active Learning Formulation for Object Class Recognition
S. Ebert, M. Fritz and B. Schiele
2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), 2012
Teaching 3D Geometry to Deformable Part Models
B. Pepik, M. Stark, P. Gehler and B. Schiele
2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), 2012
Abstract
Current object class recognition systems typically target 2D bounding box localization, encouraged by benchmark data sets, such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications such as 3D scene understanding or 3D object tracking would benefit from more fine-grained object hypotheses incorporating 3D geometric information, such as viewpoints or the locations of individual parts. In this paper, we help narrow the representational gap between the ideal input of a scene understanding system and object class detector output, by designing a detector particularly tailored towards 3D geometric reasoning. In particular, we extend the successful discriminatively trained deformable part models to include both estimates of viewpoint and 3D parts that are consistent across viewpoints. We experimentally verify that adding 3D geometric information comes at minimal performance loss w.r.t. 2D bounding box localization, but outperforms prior work in 3D viewpoint estimation and ultra-wide baseline matching.
Articulated People Detection and Pose Estimation: Reshaping the Future
L. Pishchulin, A. Jain, M. Andriluka, T. Thormaehlen and B. Schiele
2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), 2012
Abstract
State-of-the-art methods for human detection and pose estimation require many training samples for best performance. While large, manually collected datasets exist, the captured variations w.r.t. appearance, shape and pose are often uncontrolled thus limiting the overall performance. In order to overcome this limitation we propose a new technique to extend an existing training set that allows us to explicitly control pose and shape variations. For this we build on recent advances in computer graphics to generate samples with realistic appearance and background while modifying body shape and pose. We validate the effectiveness of our approach on the task of articulated human detection and articulated pose estimation. We report close to state-of-the-art results on the popular Image Parsing human pose estimation benchmark and demonstrate superior performance for articulated human detection. In addition we define a new challenge of combined articulated human detection and pose estimation in real-world scenes.
A Database for Fine Grained Activity Detection of Cooking Activities
M. Rohrbach, S. Amin, M. Andriluka and B. Schiele
2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012
Pedestrian Detection: An Evaluation of the State of the Art
P. Dollár, C. Wojek and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 34, Number 4, 2012
Discriminative Appearance Models for Pictorial Structures
M. Andriluka, S. Roth and B. Schiele
International Journal of Computer Vision, Volume 99, Number 3, 2012
Abstract
In this paper we consider people detection and articulated pose estimation, two closely related and challenging problems in computer vision. Conceptually, both of these problems can be addressed within the pictorial structures framework (Felzenszwalb and Huttenlocher in Int. J. Comput. Vis. 61(1):55–79, 2005; Fischler and Elschlager in IEEE Trans. Comput. C-22(1):67–92, 1973), even though previous approaches have not shown such generality. A principal difficulty for such a general approach is to model the appearance of body parts. The model has to be discriminative enough to enable reliable detection in cluttered scenes and general enough to capture highly variable appearance. Therefore, as the first important component of our approach, we propose a discriminative appearance model based on densely sampled local descriptors and AdaBoost classifiers. Secondly, we interpret the normalized margin of each classifier as likelihood in a generative model and compute marginal posteriors for each part using belief propagation. Thirdly, non-Gaussian relationships between parts are represented as Gaussians in the coordinate system of the joint between the parts. Additionally, in order to cope with shortcomings of tree-based pictorial structures models, we augment our model with additional repulsive factors in order to discourage overcounting of image evidence. We demonstrate that the combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance for several datasets on a variety of tasks: people detection, upper body pose estimation, and full body pose estimation.
A Geometric Approach To Robotic Laundry Folding
S. Miller, J. van den Berg, M. Fritz, T. Darrell, K. Goldberg and P. Abbeel
International Journal of Robotics Research, Volume 31, Number 2, 2012
Kernel Density Topic Models: Visual Topics Without Visual Words
K. Rematas, M. Fritz and T. Tuytelaars
NIPS 2012 Workshop on Modern Nonparametric Methods in Machine Learning, 2012
Active Metric Learning for Object Recognition
S. Ebert, M. Fritz and B. Schiele
Pattern Recognition (DAGM/OAGM 2012), 2012
Semi-supervised Learning for Image Classification
S. Ebert
PhD Thesis, Universität des Saarlandes, 2012
Abstract
Object class recognition is an active topic in computer vision still presenting many challenges. In most approaches, this task is addressed by supervised learning algorithms that need a large quantity of labels to perform well. This leads either to small datasets (< 10,000 images) that capture only a subset of the real-world class distribution (but with a controlled and verified labeling procedure), or to large datasets that are more representative but also add more label noise. Therefore, semi-supervised learning is a promising direction. It requires only a few labels while simultaneously making use of the vast amount of images available today. We address object class recognition with semi-supervised learning. These algorithms depend on the underlying structure given by the data, the image description and the similarity measure, and on the quality of the labels. This insight leads to the main research questions of this thesis: Is the structure given by labeled and unlabeled data more important than the algorithm itself? Can we improve this neighborhood structure by a better similarity metric or with more representative unlabeled data? Is there a connection between the quality of labels and the overall performance, and how can we get more representative labels? We answer all these questions, i.e., we provide an extensive evaluation, we propose several graph improvements, and we introduce a novel active learning framework to get more representative labels.
2011
South by South-east or Sitting at the Desk: Can Orientation be a Place?
U. Blanke, R. Rehner and B. Schiele
15th Annual International Symposium on Wearable Computers (ISWC 2011), 2011
Abstract
Location is key information for context-aware systems. While coarse-grained indoor location estimates may be obtained quite easily (e.g. based on WiFi or GSM), finer-grained estimates typically require additional infrastructure (e.g. ultrasound). This work explores an approach to estimate significant places, e.g., at the fridge, with no additional setup or infrastructure. We use a pocket-based inertial measurement sensor, which can be found in many recent phones. We analyze how the spatial layout, such as the geographic orientation of buildings and the arrangement and type of furniture, can serve as the basis to estimate typical places in a daily scenario. Initial experiments reveal that our approach can detect fine-grained locations without relying on any infrastructure or additional devices.
Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
P. Gehler, C. Rother, M. Kiefel, L. Zhang and B. Schölkopf
Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011
Abstract
We address the challenging task of decoupling material properties from lighting properties given a single image. In the last two decades virtually all works have concentrated on exploiting edge information to address this problem. We take a different route by introducing a new prior on reflectance, that models reflectance values as being drawn from a sparse set of basis colors. This results in a Random Field model with global, latent variables (basis colors) and pixel-accurate output reflectance values. We show that without edge information high-quality results can be achieved, that are on par with methods exploiting this source of information. Finally, we are able to improve on state-of-the-art results by integrating edge information into our model. We believe that our new approach is an excellent starting point for future developments in this field.
Joint 3D Estimation of Objects and Scene Layout
A. Geiger, C. Wojek and R. Urtasun
Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011
Disparity Statistics for Pedestrian Detection: Combining Appearance, Motion and Stereo
S. Walk, K. Schindler and B. Schiele
Computer Vision - ECCV 2010, 2011
Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes
C. Wojek, S. Roth, K. Schindler and B. Schiele
Computer Vision - ECCV 2010, 2011
Practical 3-D Object Detection Using Category and Instance-level Appearance Models
K. Saenko, S. Karayev, Y. Jia, A. Shyr, A. Janoch, J. Long, M. Fritz and T. Darrell
2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), 2011
Perception for the Manipulation of Socks
P. C. Wang, S. Miller, M. Fritz, T. Darrell and P. Abbeel
2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), 2011
A Probabilistic Model for Recursive Factorized Image Features
S. Karayev, M. Fritz, S. Fidler and T. Darrell
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011
Learning People Detection Models from Few Training Samples
L. Pishchulin, A. Jain, C. Wojek, M. Andriluka, T. Thormaehlen and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011
Evaluating Knowledge Transfer and Zero-shot Learning in a Large-scale Setting
M. Rohrbach, M. Stark and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011
Monocular 3D Scene Understanding with Explicit Occlusion Reasoning
C. Wojek, S. Walk, S. Roth and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011
Warp that Smile on your Face: Optimal and Smooth Deformations for Face Recognition
T. Gass, L. Pishchulin, P. Dreuw and H. Ney
IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011
A Category-level 3-D Object Dataset: Putting the Kinect to Work
A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko and T. Darrell
2011 IEEE International Conference on Computer Vision (ICCV 2011), 2011
The NBNN Kernel
T. Tuytelaars, M. Fritz, K. Saenko and T. Darrell
IEEE International Conference on Computer Vision (ICCV 2011), 2011
Revisiting 3D Geometric Models for Accurate Object Shape and Pose
M. Z. Zia, M. Stark, B. Schiele and K. Schindler
IEEE International Conference on Computer Vision Workshops (3dRR 2011), 2011
Abstract
Geometric 3D reasoning has received renewed attention recently, in the context of visual scene understanding. The level of geometric detail, however, is typically limited to qualitative or coarse-grained quantitative representations. This is linked to the fact that today's object class detectors are tuned towards robust 2D matching rather than accurate 3D pose estimation, encouraged by 2D bounding box-based benchmarks such as Pascal VOC. In this paper, we therefore revisit ideas from the early days of computer vision, namely, 3D geometric object class representations for recognition. These representations can recover geometrically far more accurate object hypotheses than just 2D bounding boxes, including relative 3D positions of object parts. In combination with recent robust techniques for shape description and inference, our approach outperforms state-of-the-art results in 3D pose estimation, while at the same time improving 2D localization. In a series of experiments, we analyze our approach in detail, and demonstrate novel applications enabled by our geometric object class representation, such as fine-grained categorization of cars according to their 3D geometry and ultra-wide baseline matching.
Visual Grasp Affordances From Appearance-based Cues
H. O. Song, M. Fritz, C. Gu and T. Darrell
2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops 2011), 2011
I Spy with my Little Eye: Learning Optimal Filters for Cross-Modal Stereo under Projected Patterns
W.-C. Chiu, U. Blanke and M. Fritz
2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops 2011), 2011
The Benefits of Dense Stereo for Pedestrian Detection
C. G. Keller, M. Enzweiler, M. Rohrbach, D. F. Llorca, C. Schnörr and D. M. Gavrila
IEEE Transactions on Intelligent Transportation Systems, Volume 12, Number 4, 2011
Abstract
This paper presents a novel pedestrian detection system for intelligent vehicles. We propose the use of dense stereo for both the generation of regions of interest and pedestrian classification. Dense stereo allows the dynamic estimation of camera parameters and the road profile, which, in turn, provides strong scene constraints on possible pedestrian locations. For classification, we extract spatial features (gradient orientation histograms) directly from dense depth and intensity images. Both modalities are represented in terms of individual feature spaces, in which discriminative classifiers (linear support vector machines) are learned. We refrain from the construction of a joint feature space but instead employ a fusion of depth and intensity on the classifier level. Our experiments involve challenging image data captured in complex urban environments (i.e., undulating roads and speed bumps). Our results show a performance improvement by up to a factor of 7.5 at the classification level and up to a factor of 5 at the tracking level (reduction in false alarms at constant detection rates) over a system with static scene constraints and intensity-only classification.
Weakly Supervised Recognition of Daily Life Activities with Wearable Sensors
M. Stikic, D. Larlus, S. Ebert and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 33, Number 12, 2011
Pick your Neighborhood - Improving Labels and Neighborhood Structure for Label Propagation
S. Ebert, M. Fritz and B. Schiele
Pattern Recognition (DAGM 2011), 2011
Image Warping for Face Recognition: From Local Optimality Towards Global Optimization
L. Pishchulin, T. Gass, P. Dreuw and H. Ney
Pattern Recognition (Proc. IbPRIA 2011), 2011
The Fast and the Flexible: Extended Pseudo Two-dimensional Warping for Face Recognition
L. Pishchulin, T. Gass, P. Dreuw and H. Ney
Pattern Recognition and Image Analysis (IbPRIA 2011), 2011
Recognition of Hearing Needs From Body and Eye Movements to Improve Hearing Instruments
B. Tessendorf, A. Bulling, D. Roggen, T. Stiefmeier, M. Feilner, P. Derleth and G. Tröster
Pervasive Computing, 2011
Abstract
Hearing instruments (HIs) have emerged as true pervasive computers as they continuously adapt the hearing program to the user's context. However, current HIs are not able to distinguish different hearing needs in the same acoustic environment. In this work, we explore how information derived from body and eye movements can be used to improve the recognition of such hearing needs. We conduct an experiment to provoke an acoustic environment in which different hearing needs arise: active conversation, and working while colleagues are having a conversation in a noisy office environment. We record body movements on nine body locations, eye movements using electrooculography (EOG), and sound using commercial HIs for eleven participants. Using a support vector machine (SVM) classifier and person-independent training, we improve the accuracy from 77% based on sound to 92% using body movements. With a view to a future implementation into a HI, we then perform a detailed analysis of the sensors attached to the head. We achieve the best accuracy of 86% using eye movements, compared to 84% for head movements. Our work demonstrates the potential of additional sensor modalities for future HIs and motivates investigating the wider applicability of this approach to further hearing situations and needs.
Learning Output Kernels with Block Coordinate Descent
F. Dinuzzo, C. S. Ong, P. Gehler and G. Pillonetto
Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011
Abstract
We propose a method to learn simultaneously a vector-valued function and a kernel between its components. The obtained kernel can be used both to improve learning performance and to reveal structures in the output space which may be important in their own right. Our method is based on the solution of a suitable regularization problem over a reproducing kernel Hilbert space of vector-valued functions. Although the regularized risk functional is non-convex, we show that it is invex, implying that all local minimizers are global minimizers. We derive a block-wise coordinate descent method that efficiently exploits the structure of the objective functional. Then, we empirically demonstrate that the proposed method can improve classification accuracy. Finally, we provide a visual interpretation of the learned kernel matrix for some well known datasets.
Improving the Kinect by Cross-modal Stereo
W.-C. Chiu, U. Blanke and M. Fritz
Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), 2011
Branch&Rank: Non-linear Object Detection
A. Lehmann, P. Gehler and L. Van Gool
Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), 2011
Abstract
Branch&rank is an object detection scheme that overcomes the inherent limitation of branch&bound: this method works with arbitrary (classifier) functions, whereas tight bounds exist only for simple functions. Objects are usually detected with fewer than 100 classifier evaluations, which paves the way for using strong (and thus costly) classifiers: we utilize non-linear SVMs with RBF-χ² kernels without a cascade-like approximation. Our approach features two key components: a ranking function that operates on sets of hypotheses, and a grouping of these into different tasks. Detection efficiency results from adaptively sub-dividing the object search space into successively smaller sets. This is inherited from branch&bound, while the ranking function supersedes a tight bound, which is often unavailable (except for overly simple function classes). The grouping makes the system effective: it separates image classification from object recognition, yet combines them in a single, structured SVM formulation. A novel aspect of branch&rank is that a better ranking function is expected to decrease the number of classifier calls during detection. We demonstrate the algorithmic properties using the VOC'07 dataset.
Explicit Occlusion Reasoning for 3D Object Detection
D. Meger, C. Wojek, B. Schiele and J. J. Little
Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), 2011
In Good Shape: Robust People Detection Based on Appearance and Shape
L. Pishchulin, A. Jain, C. Wojek, T. Thormaehlen and B. Schiele
Proceedings of the British Machine Vision Conference 2011 (BMVC 2011), 2011
Benchmark Datasets for Pose Estimation and Tracking
M. Andriluka, L. Sigal and M. Black
Visual Analysis of Humans: Looking at People, 2011
2010
Back to the Future: Learning Shape Models from 3D CAD Data
M. Stark, M. Goesele and B. Schiele
21st British Machine Vision Conference (BMVC 2010), 2010
Abstract
Recognizing 3D objects from arbitrary view points is one of the most fundamental problems in computer vision. A major challenge lies in the transition between the 3D geometry of objects and 2D representations that can be robustly matched to natural images. Most approaches thus rely on 2D natural images either as the sole source of training data for building an implicit 3D representation, or by enriching 3D models with natural image features. In this paper, we go back to the ideas from the early days of computer vision, by using 3D object models as the only source of information for building a multi-view object class detector. In particular, we use these models for learning 2D shape that can be robustly matched to 2D natural images. Our experiments confirm the validity of our approach, which outperforms current state-of-the-art techniques on a multi-view detection data set.
All for one or one for all? – Combining Heterogeneous Features for Activity Spotting
U. Blanke, M. Kreil, B. Schiele, P. Lukowicz, B. Sick and T. Gruber
7th IEEE International Workshop on Context Modeling and Reasoning (CoMoRea 2010), in conjunction with the 8th IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops 2010), 2010
Size Matters: Metric Visual Search Constraints from Monocular Metadata
M. Fritz, K. Saenko and T. Darrell
Advances in Neural Information Processing Systems 23 (NIPS 2010), 2010
Multi-Modal Learning
D. Skocaj, M. Kristan, A. Vrecko, A. Leonardis, M. Fritz, M. Stark, B. Schiele, S. Hongeng and J. L. Wyatt
Cognitive Systems, 2010
Tutor-based Learning of Visual Categories Using Different Levels of Supervision
M. Fritz, G.-J. M. Kruijff and B. Schiele
Computer Vision and Image Understanding, Volume 114, Number 5, 2010
Extracting Structures in Image Collections for Object Recognition
S. Ebert, D. Larlus and B. Schiele
Computer Vision - ECCV 2010, 2010
Abstract
Many computer vision methods rely on annotated image databases without taking advantage of the increasing number of unlabeled images available. This paper explores an alternative approach involving unsupervised structure discovery and semi-supervised learning (SSL) in image collections. Focusing on object classes, the first part of the paper contributes with an extensive evaluation of state-of-the-art image representations underlining the decisive influence of the local neighborhood structure, its direct consequences on SSL results, and the importance of developing powerful object representations. In a second part, we propose and explore promising directions to improve results by looking at the local topology between images and feature combination strategies.
Combining Language Sources and Robust Semantic Relatedness for Attribute-Based Knowledge Transfer
M. Rohrbach, M. Stark, G. Szarvas and B. Schiele
First International Workshop on Parts and Attributes in Conjunction with ECCV 2010, 2010
Abstract
Knowledge transfer between object classes has been identified as an important tool for scalable recognition. However, determining which knowledge to transfer where remains a key challenge. While most approaches employ varying levels of human supervision, we follow the idea of mining linguistic knowledge bases to automatically infer transferable knowledge. In contrast to previous work, we explicitly aim to design robust semantic relatedness measures and to combine different language sources for attribute-based knowledge transfer. On the challenging Animals with Attributes (AwA) data set, we report largely improved attribute-based zero-shot object class recognition performance that matches the performance of human supervision.
Vision Based Victim Detection from Unmanned Aerial Vehicles
M. Andriluka, P. Schnitzspan, J. Meyer, S. Kohlbrecher, K. Petersen, O. von Stryk, S. Roth and B. Schiele
2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010
Abstract
Finding injured humans is one of the primary goals of any search and rescue operation. The aim of this paper is to address the task of automatically finding people lying on the ground in images taken from the on-board camera of an unmanned aerial vehicle (UAV). In this paper we evaluate various state-of-the-art visual people detection methods in the context of vision-based victim detection from a UAV. The top performing approaches in this comparison are those that rely on flexible part-based representations and discriminatively trained part detectors. We discuss their strengths and weaknesses and demonstrate that by combining multiple models we can increase the reliability of the system. We also demonstrate that the detection performance can be substantially improved by integrating the height and pitch information provided by on-board sensors. Jointly, these improvements allow us to significantly boost the detection performance over the current de facto standard, providing a substantial step towards making autonomous victim detection for UAVs practical.
Monocular 3D Pose Estimation and Tracking by Detection
M. Andriluka, S. Roth and B. Schiele
2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010
Abstract
Automatic recovery of 3D human pose from monocular image sequences is a challenging and important research topic with numerous applications. Although current methods are able to recover 3D pose for a single person in controlled environments, they are severely challenged by real-world scenarios, such as crowded street scenes. To address this problem, we propose a three-stage process building on a number of recent advances. The first stage obtains an initial estimate of the 2D articulation and viewpoint of the person from single frames. The second stage allows early data association across frames based on tracking-by-detection. These two stages successfully accumulate the available 2D image evidence into robust estimates of 2D limb positions over short image sequences (= tracklets). The third and final stage uses those tracklet-based estimates as robust image observations to reliably recover 3D pose. We demonstrate state-of-the-art performance on the HumanEva II benchmark, and also show the applicability of our approach to articulated 3D tracking in realistic street conditions.
Multi-cue Pedestrian Classification with Partial Occlusion Handling
M. Enzweiler, A. Eigenstetter, B. Schiele and D. M. Gavrila
2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010
What helps Where - and Why? Semantic Relatedness for Knowledge Transfer
M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych and B. Schiele
2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010
Abstract
Remarkable performance has been reported for recognizing single object classes. Scalability to large numbers of classes, however, remains an important challenge for today's recognition methods. Several authors have promoted knowledge transfer between classes as a key ingredient to address this challenge. However, in previous work the decision which knowledge to transfer has required either manual supervision or at least a few training examples, limiting the scalability of these approaches. In this work we explicitly address the question of how to automatically decide which information to transfer between classes without the need of any human intervention. For this we tap into linguistic knowledge bases to provide the semantic link between sources (what) and targets (where) of knowledge transfer. We provide a rigorous experimental evaluation of different knowledge bases and state-of-the-art techniques from Natural Language Processing which goes far beyond the limited use of language in related work. We also give insights into the applicability (why) of different knowledge sources and similarity measures for knowledge transfer.
Automatic Discovery of Meaningful Object Parts with Latent CRFs
P. Schnitzspan, S. Roth and B. Schiele
2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010
Abstract
Object recognition is challenging due to high intra-class variability caused, e.g., by articulation, viewpoint changes, and partial occlusion. Successful methods need to strike a balance between being flexible enough to model such variation and discriminative enough to detect objects in cluttered, real world scenes. Motivated by these challenges we propose a latent conditional random field (CRF) based on a flexible assembly of parts. By modeling part labels as hidden nodes and developing an EM algorithm for learning from class labels alone, this new approach enables the automatic discovery of semantically meaningful object part representations. To increase the flexibility and expressiveness of the model, we learn the pairwise structure of the underlying graphical model at the level of object part interactions. Efficient gradient-based techniques are used to estimate the structure of the domain of interest and carried forward to the multi-label or object part case. Our experiments illustrate the meaningfulness of the discovered parts and demonstrate state-of-the-art performance of the approach.
New Features and Insights for Pedestrian Detection
S. Walk, N. Majer, K. Schindler and B. Schiele
2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010
Dead Reckoning from the Pocket - An Experimental Study
U. Steinhoff and B. Schiele
2010 IEEE International Conference on Pervasive Computing and Communications (PerCom 2010), 2010
Towards Human Motion Capturing using Gyroscopeless Orientation Estimation
U. Blanke and B. Schiele
International Symposium on Wearable Computers 2010 (ISWC 2010), 2010
Remember and Transfer what you have Learned - Recognizing Composite Activities based on Activity Spotting
U. Blanke and B. Schiele
International Symposium on Wearable Computers 2010 (ISWC 2010), 2010
A Semantic World Model for Urban Search and Rescue Based on Heterogeneous Sensors
J. Meyer, P. Schnitzspan, S. Kohlbrecher, K. Petersen, O. Schwahn, M. Andriluka, U. Klingauf, S. Roth, B. Schiele and O. von Stryk
RoboCup 2010, 14th International RoboCup Symposium, 2010
Combining Language Sources and Robust Semantic Relatedness for Attribute-based Knowledge Transfer
M. Rohrbach, M. Stark, G. Szarvas and B. Schiele
Trends and Topics in Computer Vision (ECCV 2010 Workshops), 2010
Real-time Full-body Visual Traits Recognition from Image Sequences
C. Jung, R. Tausch and C. Wojek
Vision, Modeling, and Visualization Workshop (VMV 2010), 2010
2004
A Model for Human Interruptability: Experimental Evaluation and Automatic Estimation from Wearable Sensors
N. Kern, S. Antifakos, B. Schiele and A. Schwaninger
Eighth International Symposium on Wearable Computers (ISWC 2004), 2004
Less Contact: Heart-rate Detection Without Even Touching the User
F. Michahelles, R. Wicki and B. Schiele
Eighth International Symposium on Wearable Computers (ISWC 2004), 2004