2016
Multi-Cue Zero-Shot Learning with Strong Supervision
Z. Akata, M. Malinowski, M. Fritz and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
CP-mtML: Coupled Projection Multi-task Metric Learning for Large Scale Face Retrieval
B. Bhattarai, G. Sharma and F. Jurie
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
The Cityscapes Dataset for Semantic Urban Scene Understanding
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Moral Lineage Tracing
F. Jug, E. Levinkov, C. Blasse, E. W. Myers and B. Andres
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Weakly Supervised Object Boundaries
A. Khoreva, R. Benenson, M. Omran, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract
State-of-the-art learning-based boundary detection methods require extensive training data. Since labelling object boundaries is one of the most expensive types of annotations, there is a need to relax the requirement of carefully annotated images, both to make training more affordable and to extend the amount of training data. In this paper we propose a technique to generate weakly supervised annotations and show that bounding box annotations alone suffice to reach high-quality object boundaries without using any object-specific boundary annotations. With the proposed weak supervision techniques we achieve top performance on the object boundary detection task, outperforming the current fully supervised state-of-the-art methods by a large margin.
Loss Functions for Top-k Error: Analysis and Insights
M. Lapin, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Learning Deep Representations of Fine-Grained Visual Descriptions
S. Reed, Z. Akata, H. Lee and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Deep Reflectance Maps
K. Rematas, T. Ritschel, M. Fritz, E. Gavves and T. Tuytelaars
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract
Undoing the image formation process and therefore decomposing appearance into its intrinsic properties is a challenging task due to the under-constrained nature of this inverse problem. While significant progress has been made on inferring shape, materials and illumination from images only, progress in an unconstrained setting is still limited. We propose a convolutional neural architecture to estimate reflectance maps of specular materials in natural lighting conditions. We achieve this in an end-to-end learning formulation that directly predicts a reflectance map from the image itself. We show how to improve estimates by facilitating additional supervision in an indirect scheme that first predicts surface orientation and afterwards predicts the reflectance map by a learning-based sparse data interpolation. In order to analyze performance on this difficult task, we propose a new challenge of Specular MAterials on SHapes with complex IllumiNation (SMASHINg) using both synthetic and real images. Furthermore, we show the application of our method to a range of image-based editing tasks on real images.
Convexity Shape Constraints for Image Segmentation
L. A. Royer, D. L. Richmond, B. Andres and D. Kainmueller
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
LOMo: Latent Ordinal Model for Facial Analysis in Videos
K. Sikka, G. Sharma and M. Bartlett
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
End-to-end People Detection in Crowded Scenes
R. Stewart and M. Andriluka
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Latent Embeddings for Zero-shot Classification
Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
How Far are We from Solving Pedestrian Detection?
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2016), Volume 35, Number 6, 2016
Learning What and Where to Draw
S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele and H. Lee
Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016
SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull
S. Schneegass, Y. Oualil and A. Bulling
CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, 2016
Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces
P. Xu, Y. Sugano and A. Bulling
CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, 2016
GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices
M. Khamis, F. Alt, M. Hassib, E. von Zezschwitz, R. Hasholzner and A. Bulling
CHI 2016 Extended Abstracts, 2016
On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input
D. Kirst and A. Bulling
CHI 2016 Extended Abstracts, 2016
Abstract
Rotations performed with the index finger and thumb involve some of the most complex motor actions among common multi-touch gestures, yet little is known about the factors affecting performance and ergonomics. This note presents results from a study where the angle, direction, diameter, and position of rotations were systematically manipulated. Subjects were asked to perform the rotations as quickly as possible without losing contact with the display, and were allowed to skip rotations that were too uncomfortable. The data show surprising interaction effects among the variables, and help us identify whole categories of rotations that are slow and cumbersome for users.
Pervasive Attentive User Interfaces
A. Bulling
Computer, Volume 49, Number 1, 2016
Towards Segmenting Consumer Stereo Videos: Benchmark, Baselines and Ensembles
W.-C. Chiu, F. Galasso and M. Fritz
Computer Vision - ACCV 2016, 2016
(Accepted/in press)
Local Higher-order Statistics (LHS) Describing Images with Statistics of Local Non-binarized Pixel Patterns
G. Sharma and F. Jurie
Computer Vision and Image Understanding, Volume 142, 2016
An Efficient Fusion Move Algorithm for the Minimum Cost Lifted Multicut Problem
T. Beier, B. Andres, U. Köthe and F. A. Hamprecht
Computer Vision - ECCV 2016, 2016
Generating Visual Explanations
L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele and T. Darrell
Computer Vision -- ECCV 2016, 2016
Abstract
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. We propose a novel loss function based on sampling and reinforcement learning that learns to generate sentences that realize a global sentence property, such as class specificity. Our results on a fine-grained bird species classification dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.
DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka and B. Schiele
Computer Vision -- ECCV 2016, 2016
Abstract
The goal of this paper is to advance the state-of-the-art of articulated pose estimation in scenes with multiple people. To that end we contribute on three fronts. We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow the proposals to be assembled into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently, leading both to better performance and to significant speed-ups. We evaluate our approach on two single-person and two multi-person pose estimation benchmarks. The proposed approach significantly outperforms the best known multi-person pose estimation results while demonstrating competitive performance on the task of single person pose estimation. Models and code available at http://pose.mpi-inf.mpg.de
Faceless Person Recognition: Privacy Implications in Social Media
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Computer Vision -- ECCV 2016, 2016
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele
Computer Vision -- ECCV 2016, 2016
A 3D Morphable Eye Region Model for Gaze Estimation
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson and A. Bulling
Computer Vision -- ECCV 2016, 2016
VConv-DAE: Deep Volumetric Shape Learning Without Object Labels
A. Sharma, O. Grau and M. Fritz
Computer Vision - ECCV 2016 Workshops, 2016
Abstract
With the advent of affordable depth sensors, 3D capture becomes more and more ubiquitous and already has made its way into commercial products. Yet, capturing the geometry or complete shapes of everyday objects using scanning devices (e.g. Kinect) still comes with several challenges that result in noise or even incomplete shapes. Recent success in deep learning has shown how to learn complex shape distributions in a data-driven way from large-scale 3D CAD model collections and to utilize them for 3D processing on volumetric representations, thereby circumventing problems of topology and tessellation. Prior work has shown encouraging results on problems ranging from shape completion to recognition. We provide an analysis of such approaches and discover that training as well as the resulting representation are strongly and unnecessarily tied to the notion of object labels. Furthermore, deep learning research argues (Vincent et al., 2008) that learning representations with over-complete models is more prone to overfitting than learning from noisy data. Thus, we investigate a fully convolutional volumetric denoising autoencoder that is trained in an unsupervised fashion. It outperforms prior work on recognition as well as more challenging tasks like denoising and shape completion. In addition, our approach is at least two orders of magnitude faster at test time and thus provides a path to scaling up 3D deep learning.
Multi-Person Tracking by Multicut and Deep Matching
S. Tang, B. Andres, M. Andriluka and B. Schiele
Computer Vision - ECCV 2016 Workshops, 2016
Improved Image Boundaries for Better Video Segmentation
A. Khoreva, R. Benenson, F. Galasso, M. Hein and B. Schiele
Computer Vision -- ECCV 2016 Workshops, 2016
Abstract
Graph-based video segmentation methods rely on superpixels as a starting point. While most previous work has focused on the construction of the graph edges and weights as well as on solving the graph partitioning problem, this paper focuses on better superpixels for video segmentation. We demonstrate by a comparative analysis that superpixels extracted from boundaries perform best, and show that boundary estimation can be significantly improved via image and time domain cues. With superpixels generated from our better boundaries we observe consistent improvement for two video segmentation methods on two different datasets.
Eyewear Computing -- Augmenting the Human with Head-mounted Wearable Assistants
A. Bulling, O. Cakmakci, K. Kunze and J. M. Rehg (Eds.)
Schloss Dagstuhl, 2016
Attention, please!: Comparing Features for Measuring Audience Attention Towards Pervasive Displays
F. Alt, A. Bulling, L. Mecke and D. Buschek
DIS 2016, 11th ACM SIGCHI Designing Interactive Systems Conference, 2016
Sensing and Controlling Human Gaze in Daily Living Space for Human-Harmonized Information Environments
Y. Sato, Y. Sugano, A. Sugimoto, Y. Kuno and H. Koike
Human-Harmonized Information Technology, 2016
Smooth Eye Movement Interaction Using EOG Glasses
M. Dhuliawala, J. Lee, J. Shimizu, A. Bulling, K. Kunze, T. Starner and W. Woo
ICMI’16, 18th ACM International Conference on Multimodal Interaction, 2016
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
S. Nag Chowdhury, M. Malinowski, A. Bulling and M. Fritz
ICMR’16, ACM International Conference on Multimedia Retrieval, 2016
Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation
M. Malinowski, M. Rohrbach and M. Fritz
IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016), 2016
(Accepted/in press)
Abstract
We are addressing an open-ended question answering task about real-world images. With the help of currently available methods developed in Computer Vision and Natural Language Processing, we would like to push an architecture with a global visual representation to its limits. In our contribution, we show how to achieve competitive performance on VQA with global visual features (Residual Net) together with a carefully designed architecture.
A Joint Learning Approach for Cross Domain Age Estimation
B. Bhattarai, G. Sharma, A. Lechervy and F. Jurie
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2016), 2016
Learning to Detect Visual Grasp Affordance
H. Oh Song, M. Fritz, D. Goehring and T. Darrell
IEEE Transactions on Automation Science and Engineering, Volume 13, Number 2, 2016
Label-Embedding for Image Classification
Z. Akata, F. Perronnin, Z. Harchaoui and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 7, 2016
3D Pictorial Structures Revisited: Multiple Human Pose Estimation
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab and S. Ilic
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 10, 2016
Leveraging the Wisdom of the Crowd for Fine-Grained Recognition
J. Deng, J. Krause, M. Stark and L. Fei-Fei
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 4, 2016
What Makes for Effective Detection Proposals?
J. Hosang, R. Benenson, P. Dollár and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 4, 2016
Reconstructing Curvilinear Networks using Path Classifiers and Integer Programming
E. T. Turetken, F. Benmansour, B. Andres, P. Głowacki and H. Pfister
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 12, 2016
Combining Eye Tracking with Optimizations for Lens Astigmatism in modern wide-angle HMDs
D. Pohl, X. Zhang and A. Bulling
2016 IEEE Virtual Reality Conference (VR), 2016
Recognition of Ongoing Complex Activities by Sequence Prediction Over a Hierarchical Label Space
W. Li and M. Fritz
2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), 2016
Eyewear Computers for Human-Computer Interaction
A. Bulling and K. Kunze
Interactions, Volume 23, Number 3, 2016
Demo hour
H. Jeong, D. Saakes, U. Lee, A. Esteves, E. Velloso, A. Bulling, K. Masai, Y. Sugiura, M. Ogata, K. Kunze, M. Inami, M. Sugimoto, A. Rathnayake and T. Dias
Interactions, Volume 23, Number 1, 2016
Recognizing Fine-grained and Composite Activities Using Hand-centric Features and Script Data
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
International Journal of Computer Vision, Volume 119, Number 3, 2016
Pattern Recognition
B. Rosenhahn and B. Andres (Eds.)
Springer, 2016
Pupil Detection for Head-mounted Eye Tracking in the Wild: An Evaluation of the State of the Art
W. Fuhl, M. Tonsen, A. Bulling and E. Kasneci
Machine Vision and Applications, Volume 27, Number 8, 2016
The Minimum Cost Connected Subgraph Problem in Medical Image Analysis
M. Rempfler, B. Andres and B. H. Menze
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2016, 2016
Demo: I-Pic: A Platform for Privacy-Compliant Image Capture
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee and T. T. Wu
MobiSys’16, 4th Annual International Conference on Mobile Systems, Applications, and Services, 2016
I-Pic: A Platform for Privacy-Compliant Image Capture
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee and T. T. Wu
MobiSys’16, 4th Annual International Conference on Mobile Systems, Applications, and Services, 2016
Long Term Boundary Extrapolation for Deterministic Motion
A. Bhattacharyya, M. Malinowski and M. Fritz
NIPS Workshop on Intuitive Physics, 2016
A Convnet for Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
Pattern Recognition (GCPR 2016), 2016
Abstract
Non-maximum suppression (NMS) is used in virtually all state-of-the-art object detection pipelines. While essential object detection ingredients such as features, classifiers, and proposal methods have been extensively researched, surprisingly little work has aimed to systematically address NMS. The de-facto standard for NMS is based on greedy clustering with a fixed distance threshold, which forces a trade-off between recall and precision. We propose a convnet designed to perform NMS on a given set of detections. We report experiments on a synthetic setup, and results on crowded pedestrian detection scenes. Our approach overcomes the intrinsic limitations of greedy NMS, obtaining better recall and precision.
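For reference, the greedy clustering baseline mentioned in this abstract (fixed overlap threshold) can be sketched in a few lines of NumPy. This is an illustrative sketch of the standard baseline only, not the convnet proposed in the paper; all names are hypothetical.

import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Standard greedy NMS: keep the highest-scoring box, suppress overlapping ones.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) detection scores.
    Returns indices of the kept detections.
    """
    order = np.argsort(scores)[::-1]  # process detections by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection-over-union of the current box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # The fixed threshold here is exactly the recall/precision trade-off the paper targets.
        order = order[1:][iou <= iou_threshold]
    return keep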
Learning to Select Long-Track Features for Structure-From-Motion and Visual SLAM
J. Scheer, M. Fritz and O. Grau
Pattern Recognition (GCPR 2016), 2016
Convexification of Learning from Constraints
I. Shcherbatyi and B. Andres
Pattern Recognition (GCPR 2016), 2016
Special Issue Introduction
D. J. Cook, A. Bulling and Z. Yu
Pervasive and Mobile Computing (Proc. PerCom 2015), Volume 26, 2016
Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces
M. Barz, F. Daiber and A. Bulling
Proceedings ETRA 2016, 2016
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
M. Mansouryar, J. Steil, Y. Sugano and A. Bulling
Proceedings ETRA 2016, 2016
Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions
L. Sesma-Sanchez, Y. Zhang, H. Gellersen and A. Bulling
Proceedings ETRA 2016, 2016
Labelled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments
M. Tonsen, X. Zhang, Y. Sugano and A. Bulling
Proceedings ETRA 2016, 2016
Learning an Appearance-based Gaze Estimator from One Million Synthesised Images
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson and A. Bulling
Proceedings ETRA 2016, 2016
Long-term Memorability of Cued-Recall Graphical Passwords with Saliency Masks
F. Alt, M. Mikusz, S. Schneegass and A. Bulling
Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), 2016
EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?
M. Khamis, L. Trotter, M. Tessman, C. Dannhart, A. Bulling and F. Alt
Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), 2016
Generative Adversarial Text to Image Synthesis
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele and H. Lee
Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
A. Mokarian Forooshani, M. Malinowski and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2016), 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell and M. Rohrbach
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016
Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze
A. L. Simeone, A. Bulling, J. Alexander and H. Gellersen
Proceedings of the 2016 International Working Conference on Advanced Visual Interfaces (AVI 2016), 2016
Commonsense in Parts: Mining Part-Whole Relations from the Web and Image Tags
N. Tandon, C. D. Hariman, J. Urbani, A. Rohrbach, M. Rohrbach and G. Weikum
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User’s Current Visual Field
D. Pohl, X. Zhang, A. Bulling and O. Grau
Proceedings VRST 2016, 2016
Visual Object Class Recognition
M. Stark, B. Schiele and A. Leonardis
Springer Handbook of Robotics, 2016
Interactive Multicut Video Segmentation
E. Levinkov, J. Tompkin, N. Bonneel, S. Kirchhoff, B. Andres and H. Pfister
The 24th Pacific Conference on Computer Graphics and Applications Short Papers Proceedings (Pacific Graphics 2016), 2016
TextPursuits: Using Text for Pursuits-based Interaction and Calibration on Public Displays
M. Khamis, O. Saltuk, A. Hang, K. Stolz, A. Bulling and F. Alt
UbiComp’16, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2016
EyeWear 2016: First Workshop on EyeWear Computing
A. Bulling, O. Cakmakci, K. Kunze and J. M. Rehg
UbiComp’16 Adjunct, 2016
Challenges and Design Space of Gaze-enabled Public Displays
M. Khamis, F. Alt and A. Bulling
UbiComp’16 Adjunct, 2016
Solar System: Smooth Pursuit Interactions Using EOG Glasses
J. Shimizu, J. Lee, M. Dhuliawala, A. Bulling, T. Starner, W. Woo and K. Kunze
UbiComp’16 Adjunct, 2016
AggreGaze: Collective Estimation of Audience Attention on Public Displays
Y. Sugano, X. Zhang and A. Bulling
UIST 2016, 29th Annual Symposium on User Interface Software and Technology, 2016
Lifting of Multicuts
B. Andres, A. Fuksova and J.-H. Lange
Technical Report, 2016
(arXiv: 1503.03791)
Abstract
For every simple, undirected graph $G = (V, E)$, a one-to-one relation exists between the decompositions and the multicuts of $G$. A decomposition of $G$ is a partition $\Pi$ of $V$ such that, for every $U \in \Pi$, the subgraph of $G$ induced by $U$ is connected. A multicut of $G$ is an $M \subseteq E$ such that, for every (chordless) cycle $C \subseteq E$ of $G$, $|M \cap C| \neq 1$. The multicut related to a decomposition is the set of edges that straddle distinct components. The characteristic function $x \in \{0, 1\}^E$ of a multicut $M = x^{-1}(1)$ of $G$ makes explicit, for every pair $\{v,w\} \in E$ of neighboring nodes, whether $v$ and $w$ are in distinct components. In order to make explicit also for non-neighboring nodes, specifically, for all $\{v,w\} \in E'$ with $E \subseteq E' \subseteq {V \choose 2}$, whether $v$ and $w$ are in distinct components, we define a lifting of the multicuts of $G$ to multicuts of $G' = (V, E')$. We show that, if $G$ is connected, the convex hull of the characteristic functions of those multicuts of $G'$ that are lifted from $G$ is an $|E'|$-dimensional polytope in $\mathbb{R}^{E'}$. We establish properties of trivial facets of this polytope.
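The one-to-one relation between decompositions and (lifted) multicuts stated in this abstract can be made concrete with a small plain-Python sketch: given a node partition, the induced multicut is the set of edges whose endpoints lie in distinct components, and lifting simply evaluates the same membership rule on a larger edge set E' ⊇ E. Helper names and the toy graph are illustrative, not taken from the paper.

def induced_multicut(edge_set, partition):
    """Edges of `edge_set` whose endpoints lie in distinct components of `partition`."""
    return {e for e in edge_set if len({partition[v] for v in e}) == 2}

# Toy example: a path 0-1-2-3 decomposed into components {0, 1} and {2, 3}.
E = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
partition = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}
print(induced_multicut(E, partition))       # {frozenset({1, 2})}

# Lifting: apply the same rule to a superset E' of E, making cut/join
# relations explicit also for non-neighboring node pairs.
E_lifted = E + [frozenset({0, 2}), frozenset({0, 3}), frozenset({1, 3})]
print(induced_multicut(E_lifted, partition))  # all pairs straddling the two components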
Long-Term Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski, B. Schiele and M. Fritz
Technical Report, 2016
(arXiv: 1611.08841)
Abstract
Boundary prediction in images and videos has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future unobserved frames. This requires our model to learn about the fate of boundaries and extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show for the first time spatio-temporal boundary extrapolation that, in contrast to prior work on RGB extrapolation, maintains a crisp result. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We argue that our model has, with minimal modelling assumptions, derived a notion of "intuitive physics".
Spatio-Temporal Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1605.07363)
Abstract
Boundary prediction in images as well as video has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future unobserved frames. This requires our model to learn about the fate of boundaries and extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show for the first time spatio-temporal boundary extrapolation in this challenging scenario. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We successfully predict boundaries in a billiard scenario without any assumption of a strong parametric model or any object notion. We argue that our model has, with minimal modelling assumptions, derived a notion of 'intuitive physics' that can be applied to novel scenes.
Bayesian Non-Parametrics for Multi-Modal Segmentation
W.-C. Chiu
PhD Thesis, Universität des Saarlandes, 2016
Natural Illumination from Multiple Materials Using Deep Learning
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
Technical Report, 2016
(arXiv: 1611.09325)
Abstract
Recovering natural illumination from a single Low-Dynamic Range (LDR) image is a challenging task. To remedy this situation we exploit two properties often found in everyday images. First, images rarely show a single material, but rather multiple ones that all reflect the same illumination. However, the appearance of each material is observed only for some surface orientations, not all. Second, parts of the illumination are often directly observed in the background, without being affected by reflection. Typically, this directly observed part of the illumination is even smaller. We propose a deep Convolutional Neural Network (CNN) that combines prior knowledge about the statistics of illumination and reflectance with an input that makes explicit use of these two observations. Our approach maps multiple partial LDR material observations, represented as reflectance maps, and a background image to a spherical High-Dynamic Range (HDR) illumination map. For training and testing we propose a new dataset comprising synthetic and real images with multiple materials observed under the same illumination. Qualitative and quantitative evidence shows that both using multiple materials and using the background are essential to improve illumination estimates.
DeLight-Net: Decomposing Reflectance Maps into Specular Materials and Natural Illumination
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, L. Van Gool and T. Tuytelaars
Technical Report, 2016
(arXiv: 1603.08240)
Abstract
In this paper we extract surface reflectance and natural environmental illumination from a reflectance map, i.e. from a single 2D image of a sphere of one material under one illumination. This is a notoriously difficult problem, yet key to various re-rendering applications. With the recent advances in estimating reflectance maps from 2D images, their further decomposition has become increasingly relevant. To this end, we propose a Convolutional Neural Network (CNN) architecture to reconstruct both material parameters (i.e. Phong) as well as illumination (i.e. high-resolution spherical illumination maps), that is solely trained on synthetic data. We demonstrate the decomposition of synthetic as well as real photographs of reflectance maps, both in High Dynamic Range (HDR) and, for the first time, in Low Dynamic Range (LDR) as well. Results are compared to previous approaches quantitatively as well as qualitatively in terms of re-renderings where illumination, material, view or shape are changed.
RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
Technical Report, 2016
(arXiv: 1604.02388)
Abstract
Beyond the success in classification, neural networks have recently shown strong results on pixel-wise prediction tasks like image semantic segmentation on RGBD data. However, the commonly used deconvolutional layers for upsampling intermediate representations to the full-resolution output still show different failure modes, like imprecise segmentation boundaries and label mistakes, in particular on large, weakly textured objects (e.g. fridge, whiteboard, door). We attribute these errors in part to the rigid way in which current networks aggregate information, which can be either too local (missing context) or too global (inaccurate boundaries). Therefore we propose a data-driven pooling layer that integrates with fully convolutional architectures and utilizes boundary detection from RGBD image segmentation approaches. We extend our approach to leverage region-level correspondences across images with an additional temporal pooling stage. We evaluate our approach on the NYU-Depth-V2 dataset comprised of indoor RGBD video sequences and compare it to various state-of-the-art baselines. Besides a general improvement over the state of the art, our approach shows particularly good results in terms of accuracy of the predicted boundaries and in segmenting previously problematic classes.
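One reading of the data-driven pooling idea described above is pooling per-pixel predictions over segments obtained from an RGBD segmentation instead of a fixed grid. The NumPy sketch below illustrates only that basic operation under this assumption (names are hypothetical, not the authors' code): scores are averaged within each segment and broadcast back, so predictions become consistent within regions and sharper along their boundaries.

import numpy as np

def segment_average_pooling(scores, segments):
    """Average per-pixel class scores within each segment and broadcast back.

    scores:   (H, W, C) per-pixel class scores from a fully convolutional network.
    segments: (H, W) integer segment ids from an RGBD segmentation.
    Returns   (H, W, C) pooled scores that are constant within each segment.
    """
    pooled = np.empty_like(scores)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        pooled[mask] = scores[mask].mean(axis=0)  # one mean score vector per segment
    return pooled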
End-to-End Eye Movement Detection Using Convolutional Neural Networks
S. Hoppe and A. Bulling
Technical Report, 2016
(arXiv: 1609.02452)
Abstract
Common computational methods for automated eye movement detection - i.e. the task of detecting different types of eye movement in a continuous stream of gaze data - are limited in that they either involve thresholding on hand-crafted signal features, require individual detectors each only detecting a single movement, or require pre-segmented data. We propose a novel approach for eye movement detection that only involves learning a single detector end-to-end, i.e. directly from the continuous gaze data stream and simultaneously for different eye movements without any manual feature crafting or segmentation. Our method is based on convolutional neural networks (CNN) that recently demonstrated superior performance in a variety of tasks in computer vision, signal processing, and machine learning. We further introduce a novel multi-participant dataset that contains scripted and free-viewing sequences of ground-truth annotated saccades, fixations, and smooth pursuits. We show that our CNN-based method outperforms state-of-the-art baselines by a large margin on this challenging dataset, thereby underlining the significant potential of this approach for holistic, robust, and accurate eye movement protocol analysis.
Articulated Multi-person Tracking in the Wild
E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres and B. Schiele
Technical Report, 2016
(arXiv: 1612.01465)
Abstract
In this paper we propose an approach for articulated tracking of multiple people in unconstrained videos. Our starting point is a model that resembles existing architectures for single-frame pose estimation but is several orders of magnitude faster. We achieve this in two ways: (1) by simplifying and sparsifying the body-part relationship graph and leveraging recent methods for faster inference, and (2) by offloading a substantial share of computation onto a feed-forward convolutional architecture that is able to detect and associate body joints of the same person even in clutter. We use this model to generate proposals for body joint locations and formulate articulated tracking as spatio-temporal grouping of such proposals. This allows us to jointly solve the association problem for all people in the scene by propagating evidence from strong detections through time and enforcing constraints that each proposal can be assigned to one person only. We report results on the public MPII Human Pose benchmark and on a new dataset of videos with multiple people. We demonstrate that our model achieves state-of-the-art results while using only a fraction of the time and is able to leverage temporal information to improve the state of the art for crowded scenes.
A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox and B. Schiele
Technical Report, 2016
(arXiv: 1607.06317)
Abstract
Recently, Minimum Cost Multicut Formulations have been proposed and proven to be successful in both motion trajectory segmentation and multi-target tracking scenarios. Both tasks benefit from decomposing a graphical model into an optimal number of connected components based on attractive and repulsive pairwise terms. The two tasks are formulated on different levels of granularity and, accordingly, leverage mostly local information for motion segmentation and mostly high-level information for multi-target tracking. In this paper we argue that point trajectories and their local relationships can contribute to the high-level task of multi-target tracking and also argue that high-level cues from object detection and tracking are helpful to solve motion segmentation. We propose a joint graphical model for point trajectories and object detections whose multicuts are solutions to motion segmentation and multi-target tracking problems at once. Results on the FBMS59 motion segmentation benchmark as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark demonstrate the promise of this joint approach.
InstanceCut: from Edges to Instances with MultiCut
A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy and C. Rother
Technical Report, 2016
(arXiv: 1611.08272)
Abstract
This work addresses the task of instance-aware semantic segmentation. Our key motivation is to design a simple method with a new modelling paradigm, which therefore has a different trade-off between advantages and disadvantages compared to known approaches. Our approach, which we term InstanceCut, represents the problem by two output modalities: (i) an instance-agnostic semantic segmentation and (ii) all instance boundaries. The former is computed from a standard convolutional neural network for semantic segmentation, and the latter is derived from a new instance-aware edge detection model. To reason globally about the optimal partitioning of an image into instances, we combine these two modalities into a novel MultiCut formulation. We evaluate our approach on the challenging CityScapes dataset. Despite the conceptual simplicity of our approach, we achieve the best result among all published methods, and perform particularly well for rare object classes.
Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification
M. Lapin, M. Hein and B. Schiele
Technical Report, 2016
(arXiv: 1612.03663)
Abstract
Top-k error is currently a popular performance measure on large scale image classification benchmarks such as ImageNet and Places. Despite its wide acceptance, our understanding of this metric is limited as most of the previous research is focused on its special case, the top-1 error. In this work, we explore two directions that shed more light on the top-k error. First, we provide an in-depth analysis of established and recently proposed single-label multiclass methods along with a detailed account of efficient optimization algorithms for them. Our results indicate that the softmax loss and the smooth multiclass SVM are surprisingly competitive in top-k error uniformly across all k, which can be explained by our analysis of multiclass top-k calibration. Further improvements for a specific k are possible with a number of proposed top-k loss functions. Second, we use the top-k methods to explore the transition from multiclass to multilabel learning. In particular, we find that it is possible to obtain effective multilabel classifiers on Pascal VOC using a single label per image for training, while the gap between multiclass and multilabel methods on MS COCO is more significant. Finally, our contribution of efficient algorithms for training with the considered top-k and multilabel loss functions is of independent interest.
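The top-k error analysed in this abstract has a simple operational definition: a sample counts as correct if its ground-truth label is among the k highest-scoring classes. The NumPy sketch below shows that standard definition for illustration only; the function name and the random-score example are hypothetical, not the authors' code.

import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k top-scoring classes.

    scores: (N, C) class scores; labels: (N,) ground-truth class indices.
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]       # k highest-scoring class indices per sample
    correct = (top_k == labels[:, None]).any(axis=1)  # true label appears among them?
    return 1.0 - correct.mean()

# Example: top-1 vs. top-5 error on random scores (should be roughly 0.9 and 0.5 for 10 classes).
rng = np.random.default_rng(0)
scores = rng.standard_normal((1000, 10))
labels = rng.integers(0, 10, size=1000)
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=5))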
Visual Stability Prediction and Its Application to Manipulation
W. Li, A. Leonardis and M. Fritz
Technical Report, 2016
(arXiv: 1609.04861)
Abstract
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that such skills are acquired by infants from observations at a very early stage. In this paper, we contrast a more traditional approach of taking a model-based route with explicit 3D representations and physical simulation with an end-to-end approach that directly predicts stability from appearance. We ask the question if, and to what extent and quality, such a skill can directly be acquired in a data-driven way, bypassing the need for an explicit simulation at run-time. We present a learning-based approach based on simulated data that predicts the stability of towers comprised of wooden blocks under different conditions, as well as quantities related to the potential fall of the towers. We first evaluate the approach on synthetic data and compare the results to human judgments on the same stimuli. Further, we extend this approach to reason about future states of such towers, which in turn enables successful stacking.
To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction
W. Li, S. Azimi, A. Leonardis and M. Fritz
Technical Report, 2016
(arXiv: 1604.00066)
Abstract
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that such skills are acquired by infants from observations at a very early stage. In this paper, we contrast a more traditional approach of taking a model-based route with explicit 3D representations and physical simulation with an end-to-end approach that directly predicts stability and related quantities from appearance. We ask the question if, and to what extent and quality, such a skill can directly be acquired in a data-driven way, bypassing the need for an explicit simulation. We present a learning-based approach based on simulated data that predicts the stability of towers comprised of wooden blocks under different conditions, as well as quantities related to the potential fall of the towers. The evaluation is carried out on synthetic data and compared to human judgments on the same stimuli.
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
M. Malinowski, M. Rohrbach and M. Fritz
Technical Report, 2016
(arXiv: 1605.02697)
Abstract
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We provide additional insights into the problem by analyzing how much information is contained only in the language part, for which we provide a new human baseline. To study human consensus, which is related to the ambiguities inherent in this challenging task, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Moreover, we also extend our analysis to VQA, a large-scale dataset for question answering about images, where we investigate some particular design choices and show the importance of stronger visual models. At the same time, we achieve strong performance with a model that still uses a global image representation. Finally, based on this analysis, we refine our Ask Your Neurons model on DAQUAR, which also leads to better performance on this challenging task.
Tutorial on Answering Questions about Images with Deep Learning
M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1610.01076)
Abstract
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answer questions about images. We base our tutorial on two datasets: (mostly on) DAQUAR, and (a bit on) VQA. With small tweaks the models that we present here can achieve a competitive performance on both datasets; in fact, they are among the best methods that use a combination of an LSTM with a global, full-frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the introduced Kraino, to build various architectures that will lead to further performance improvements on this challenging task.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence
D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2016
(arXiv: 1612.04757)
Abstract
Deep models are the de facto standard for visual decision models due to their impressive performance on a wide array of visual tasks. However, they are frequently seen as opaque and are unable to explain their decisions. In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions. We postulate that deep models can do this as well and propose our Pointing and Justification (PJ-X) model, which can justify its decision with a sentence and point to the evidence by introspecting its decision and explanation process using an attention mechanism. Unfortunately, there is no dataset available with reference explanations for visual decision making. We thus collect two datasets in two domains where it is interesting and challenging to explain decisions. First, we extend the visual question answering task to not only provide an answer but also a natural language explanation for the answer. Second, we focus on explaining human activities, which is traditionally more challenging than object classification. We extensively evaluate our PJ-X model, both on the justification and pointing tasks, by comparing it to prior models and ablations using both automatic and human evaluations.
Articulated People Detection and Pose Estimation in Challenging Real World Environments
L. Pishchulin
PhD Thesis, Universität des Saarlandes, 2016
Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling
H. Sattar, A. Bulling and M. Fritz
Technical Report, 2016
(arXiv: 1611.10162)
Abstract
Previous work focused on predicting visual search targets from human fixations but, in the real world, a specific target is often not known, e.g. when searching for a present for a friend. In this work we instead study the problem of predicting the mental picture, i.e. only an abstract idea instead of a specific target. This task is significantly more challenging given that mental pictures of the same target category can vary widely depending on personal biases, and given that characteristic target attributes can often not be verbalised explicitly. We instead propose to use gaze information as implicit information on users' mental picture and present a novel gaze pooling layer to seamlessly integrate semantic and localized fixation information into a deep image representation. We show that we can robustly predict both the mental picture's category as well as attributes on a novel dataset containing fixation data of 14 users searching for targets on a subset of the DeepFashion dataset. Our results have important implications for future search interfaces and suggest deep gaze pooling as a general-purpose approach for gaze-supported computer vision systems.
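A gaze pooling layer of the kind described above can be loosely pictured as weighting spatial CNN features with a fixation density map before pooling them into an image representation. The NumPy sketch below illustrates only that intuition; it is an assumption-laden simplification, not the authors' implementation, and all names are hypothetical.

import numpy as np

def gaze_pooling(feature_map, fixation_density):
    """Pool a spatial feature map weighted by where the user looked.

    feature_map:      (H, W, C) convolutional features of the viewed image.
    fixation_density: (H, W) non-negative map of fixation locations (e.g. blurred fixation points).
    Returns           (C,) image representation emphasising fixated regions.
    """
    w = fixation_density / (fixation_density.sum() + 1e-8)  # normalise to a spatial distribution
    return (feature_map * w[..., None]).sum(axis=(0, 1))    # gaze-weighted average of features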
Seeing with Humans: Gaze-Assisted Neural Image Captioning
Y. Sugano and A. Bulling
Technical Report, 2016
(arXiv: 1608.05203)
Abstract
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
A Message Passing Algorithm for the Minimum Cost Multicut Problem
P. Swoboda and B. Andres
Technical Report, 2016
(arXiv: 1612.05441)
Abstract
We propose a dual decomposition and linear program relaxation of the NP-hard minimum cost multicut problem. Unlike other polyhedral relaxations of the multicut polytope, it is amenable to efficient optimization by message passing. Like other polyhedral relaxations, it can be tightened efficiently by cutting planes. We define an algorithm that alternates between message passing and efficient separation of cycle and odd-wheel inequalities. This algorithm is more efficient than state-of-the-art algorithms based on linear programming, including algorithms written in the framework of leading commercial software, as we show in experiments with large instances of the problem from applications in computer vision, biomedical image analysis and data mining.
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
Technical Report, 2016
(arXiv: 1611.08860)
Abstract
While appearance-based gaze estimation methods have traditionally exploited information encoded solely from the eyes, recent results from a multi-region method indicated that using the full face image can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through evaluation on the recent MPIIGaze and EYEDIAP gaze estimation datasets, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.
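The spatial weighting mechanism mentioned in this abstract can be pictured as a weight map applied element-wise to the convolutional feature maps of the face image before regression. The sketch below (NumPy, hypothetical names, a simplifying assumption of sigmoid weights) shows only this weighting step, not the full gaze estimation network described in the paper.

import numpy as np

def apply_spatial_weights(feature_maps, weight_logits):
    """Suppress or enhance facial regions via a spatial weight map.

    feature_maps:  (H, W, C) activations computed from the full face image.
    weight_logits: (H, W) unnormalised weights predicted by a small sub-network.
    Returns        (H, W, C) re-weighted activations passed on to the gaze regressor.
    """
    weights = 1.0 / (1.0 + np.exp(-weight_logits))  # squash weights into (0, 1)
    return feature_maps * weights[..., None]        # broadcast the weight map over channels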
2015
Efficient Output Kernel Learning for Multiple Tasks
P. Jawanpuria, M. Lapin, M. Hein and B. Schiele
Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015
Top-k Multiclass SVM
M. Lapin, M. Hein and B. Schiele
Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015
Bridging the Gap Between Synthetic and Real Data
M. Fritz
Machine Learning with Interdependent and Non-Identically Distributed Data, 2015