They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers
M. Khamis, L. Bandelow, S. Schick, D. Casadevall, A. Bulling and F. Alt
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging
C. Lander, S. Gehring, M. Löchtefeld, A. Bulling and A. Krüger
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
Long-Term On-Board Prediction of Pedestrians in Traffic Scenes
A. Bhattacharyya, M. Fritz and B. Schiele
1st Conference on Robot Learning (CoRL 2017), 2017
STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
ArtTrack: Articulated Multi-Person Tracking in the Wild
E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Gaze Embeddings for Zero-Shot Image Classification
N. Karessli, Z. Akata, B. Schiele and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Simple Does It: Weakly Supervised Instance and Semantic Segmentation
A. Khoreva, R. Benenson, J. Hosang, M. Hein and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Video Object Segmentation from Static Images
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele and A. Sorkine-Hornung
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
InstanceCut: from Edges to Instances with MultiCut
A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy and C. Rother
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Exploiting Saliency for Object Segmentation from Image Level Labels
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Generating Descriptions with Grounded and Co-Referenced People
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Domain Based Approach to Social Relation Recognition
Q. Sun, B. Schiele and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Message Passing Algorithm for the Minimum Cost Multicut Problem
P. Swoboda and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Multiple People Tracking by Lifted Multicut and Person Re-identification
S. Tang, M. Andriluka, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Zero-shot learning - The Good, the Bad and the Ugly
Y. Xian, B. Schiele and Z. Akata
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
CityPersons: A Diverse Dataset for Pedestrian Detection
S. Zhang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Convnets have enabled significant progress in pedestrian detection recently, but there are still open questions regarding suitable architectures and training data. We revisit CNN design and point out key adaptations, enabling plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset. To achieve further improvement from more and better data, we introduce CityPersons, a new set of person annotations on top of the Cityscapes dataset. The diversity of CityPersons allows us for the first time to train one single CNN model that generalizes well over multiple benchmarks. Moreover, with additional training with CityPersons, we obtain top results using FasterRCNN on Caltech, improving especially for more difficult cases (heavy occlusion and small scale) and providing higher localization quality.
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), 2017
Visual Stability Prediction and Its Application to Manipulation
W. Li, A. Leonardis and M. Fritz
AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, 2017
Everyday Eye Tracking for Real-World Consumer Behavior Analysis
A. Bulling and M. Wedel
A Handbook of Process Tracing Methods for Decision Research, 2017
(Accepted/in press)
ScreenGlint: Practical, In-situ Gaze Estimation on Smartphones
M. X. Huang, J. Li, G. Ngai and H. V. Leong
CHI’17, 35th Annual ACM Conference on Human Factors in Computing Systems, 2017
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
M. Klauck, Y. Sugano and A. Bulling
CHI 2017 Extended Abstracts, 2017
Lucid Data Dreaming for Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
DAVIS Challenge on Video Object Segmentation 2017, 2017
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
M. Khamis,, M. Hassib, E. von Zezschwitz, A. Bulling and F. Alt
ICMI’17, 19th ACM International Conference on Multimodal Interaction, 2017
What Is Around The Camera?
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Adversarial Image Perturbation for Privacy Protection -- A Game Theory Perspective
S. J. Oh, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images
T. Orekondy, B. Schiele and M. Fritz
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Efficient Algorithms for Moral Lineage Tracing
M. Rempfler, J.-H. Lange, F. Jug, C. Blasse, E. W. Myers, B. H. Menze and B. Andres
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Paying Attention to Descriptions Generated by Image Captioning Models
H. R. Tavakoli, R. Shetty, A. Borji and J. Laaksonen
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling
H. Sattar, A. Bulling and M. Fritz
2017 IEEE International Conference on Computer Vision Workshops (MBCC @ICCV 2017), 2017
Previous work focused on predicting visual search targets from human fixations but, in the real world, a specific target is often not known, e.g. when searching for a present for a friend. In this work we instead study the problem of predicting the mental picture, i.e. only an abstract idea instead of a specific target. This task is significantly more challenging given that mental pictures of the same target category can vary widely depending on personal biases, and given that characteristic target attributes can often not be verbalised explicitly. We instead propose to use gaze information as implicit information on users' mental picture and present a novel gaze pooling layer to seamlessly integrate semantic and localized fixation information into a deep image representation. We show that we can robustly predict both the mental picture's category as well as attributes on a novel dataset containing fixation data of 14 users searching for targets on a subset of the DeepFahion dataset. Our results have important implications for future search interfaces and suggest deep gaze pooling as a general-purpose approach for gaze-supported computer vision systems.
Visual Stability Prediction for Robotic Manipulation
W. Li, A. Leonardis and M. Fritz
IEEE International Conference on Robotics and Automation (ICRA 2017), 2017
MARCOnI -- ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes
A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 3, 2017
MARCOnI-ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes
A. Elhayek, E. de Aguiar, A. Jain, J. Thompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 33, Number 3, 2017
Reflectance and Natural Illumination from Single-Material Specular
S. Georgoulis, K. Rematas, T. Ritschel, E. Gavves, M. Fritz, L. Van Gool, and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
Novel Views of Objects from a Single Image
K. Rematas, C. Nguyen, T. Ritschel, M. Fritz and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 8, 2017
Expanded Parts Model for Semantic Description of Humans in Still Images
G. Sharma, F. Jurie and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 1, 2017
Discriminatively Trained Latent Ordinal Model for Video Classification
K. Sikka and G. Sharma
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
Towards Reaching Human Performance in Pedestrian Detection
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and background- versus-foreground errors. To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Other than our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of training and test annotations.
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
(Accepted/in press)
A Compact Representation of Human Actions by Sliding Coordinate Coding
R. Ding, Q. Sun, M. Liu and H. Liu
International Journal of Advanced Robotic Systems, Volume 14, Number 6, 2017
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
M. Malinowski, M. Rohrbach and M. Fritz
International Journal of Computer Vision, Volume First Online, 2017
Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, Volume 123, Number 1, 2017
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015.
Cell Lineage Tracing in Lens-Free Microscopy Videos
M. Rempfler, S. Kumar, V. Stierle, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Computing and Computer Assisted Intervention -- MICCAI 2017, 2017
Building Statistical Shape Spaces for 3D Human Modeling
L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt and B. Schiele
Pattern Recognition, Volume 67, 2017
Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes
Q. Sun, H. Liu and T. Harada
Pattern Recognition, Volume 64, 2017
Learning Dilation Factors for Semantic Segmentation of Street Scenes
Y. He, M. Keuper, B. Schiele and M. Fritz
Pattern Recognition (GCPR 2017), 2017
A Comparative Study of Local Search Algorithms for Correlation Clustering
E. Levinkov, A. Kirillov and B. Andres
Pattern Recognition (GCPR 2017), 2017
Look Together: Using Gaze for Assisting Co-located Collaborative Search
Y. Zhang, K. Pfeuffer, M. K. Chong, J. Alexander, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 21, Number 1, 2017
GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices
M. Khamis, R. Hasholzner, A. Bulling and F. Alt
Pervasive Displays 2017 (PerDis 2017), 2017
Analysis and Optimization of Graph Decompositions by Lifted Multicuts
A. Horňáková, J.-H. Lange and B. Andres
Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017
EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays
M. Khamis, D. Buschek, T. Thieron, F. Alt and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 4, 2017
(Accepted/in press)
InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation
M. Tonsen, J. Steil, Y. Sugano and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 3, 2017
Efficiently Summarising Event Sequences with Rich Interleaving Patterns
A. Bhattacharyya and J. Vreeken
Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), 2017
EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays
M. Khamis, A. Hoesl, A. Klimczak, M. Reiss, F. Alt and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery
X. Zhang, Y. Sugano and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Advanced Steel Microstructure Classification by Deep Learning Methods
S. M. Azimi, D. Britz, M. Engstler, M. Fritz and F. Mücklich
Technical Report, 2017
(arXiv: 1706.06480)
The inner structure of a material is called microstructure. It stores the genesis of a material and determines all its physical and chemical properties. While microstructural characterization is widely spread and well known, the microstructural classification is mostly done manually by human experts, which opens doors for huge uncertainties. Since the microstructure could be a combination of different phases with complex substructures its automatic classification is very challenging and just a little work in this field has been carried out. Prior related works apply mostly designed and engineered features by experts and classify microstructure separately from feature extraction step. Recently Deep Learning methods have shown surprisingly good performance in vision applications by learning the features from data together with the classification step. In this work, we propose a deep learning method for microstructure classification in the examples of certain microstructural constituents of low carbon steel. This novel method employs pixel-wise segmentation via Fully Convolutional Neural Networks (FCNN) accompanied by max-voting scheme. Our system achieves 93.94% classification accuracy, drastically outperforming the state-of-the-art method of 48.89% accuracy, indicating the effectiveness of pixel-wise approaches. Beyond the success presented in this paper, this line of research offers a more robust and first of all objective way for the difficult task of steel quality appreciation.
Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty
A. Bhattacharyya, M. Fritz and B. Schiele
Technical Report, 2017
(arXiv: 1711.09026)
Progress towards advanced systems for assisted and autonomous driving is leveraging recent advances in recognition and segmentation methods. Yet, we are still facing challenges in bringing reliable driving to inner cities, as those are composed of highly dynamic scenes observed from a moving platform at considerable speeds. Anticipation becomes a key element in order to react timely and prevent accidents. In this paper we argue that it is necessary to predict at least 1 second and we thus propose a new model that jointly predicts ego motion and people trajectories over such large time horizons. We pay particular attention to modeling the uncertainty of our estimates arising from the non-deterministic nature of natural traffic scenes. Our experimental results show that it is indeed possible to predict people trajectories at the desired time horizons and that our uncertainty estimates are informative of the prediction error. We also show that both sequence modeling of trajectories as well as our novel method of long term odometry prediction are essential for best performance.
Analysis and Improvement of the Visual Object Detection Pipeline
J. Hosang
PhD Thesis, Universität des Saarlandes, 2017
Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects. In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. To address these problems, we present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
A. Khan, I. Steiner, Y. Sugano, A. Bulling and R. Macdonald
Technical Report, 2017
(arXiv: 1712.04798)
Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as close to the manual segmentation as possible. This corpus is an effort to capture the human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.
Learning to Segment in Images and Videos with Different Forms of Supervision
A. Khoreva, B. Schiele, R. Szeliski and T. Brox
PhD Thesis, Universität des Saarlandes, 2017
Much progress has been made in image and video segmentation over the last years. To a large extent, the success can be attributed to the strong appearance models completely learned from data, in particular using deep learning methods. However,to perform best these methods require large representative datasets for training with expensive pixel-level annotations, which in case of videos are prohibitive to obtain. Therefore, there is a need to relax this constraint and to consider alternative forms of supervision, which are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision. First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate pixel-level approximate groundtruth from these weaker forms of annotations to train a network, which allows to achieve high-quality results comparable to the full supervision quality without any modifications of the network architecture or the training procedure. Second, we address the problem of the excessive computational and memory costs inherent to solving video segmentation via graphs. We propose approaches to improve the runtime and memory efficiency as well as the output segmentation quality by learning from the available training data the best representation of the graph. In particular, we contribute with learning must-link constraints, the topology and edge weights of the graph as well as enhancing the graph nodes - superpixels - themselves. Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data for training convolutional networks. We introduce an architecture which allows training with static images only and propose an elaborate data synthesis scheme which creates a large number of training examples close to the target domain from the given first frame mask. With the proposed techniques we show that densely annotated consequent video data is not necessary to achieve high-quality temporally coherent video segmentationresults. In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking and contributes with the new ways of training convolutional networks with a limited amount of pixel-level annotated training data.
Lucid Data Dreaming for Multiple Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
Technical Report, 2017
(arXiv: 1703.09554)
Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k ~ 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x ~ 100x less annotated data than competing methods. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the object tracking task.
Decomposition of Trees and Paths via Correlation
J.-H. Lange and B. Andres
Technical Report, 2017
(arXiv: 1706.06822v2)
We study the problem of decomposing (clustering) a tree with respect to costs attributed to pairs of nodes, so as to minimize the sum of costs for those pairs of nodes that are in the same component (cluster). For the general case and for the special case of the tree being a star, we show that the problem is NP-hard. For the special case of the tree being a path, this problem is known to be polynomial time solvable. We characterize several classes of facets of the combinatorial polytope associated with a formulation of this clustering problem in terms of lifted multicuts. In particular, our results yield a complete totally dual integral (TDI) description of the lifted multicut polytope for paths, which establishes a connection to the combinatorial properties of alternative formulations such as set partitioning.
Image Classification with Limited Training Data and Class Ambiguity
M. Lapin
PhD Thesis, Universität des Saarlandes, 2017
Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm. In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity where clear distinction between the classes is no longer possible. Many real world images are naturally multilabel yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in top k predictions of a learner. Our results indicate consistent improvements over the standard loss functions that put more penalty on the first incorrect prediction compared to the proposed losses. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
Discrete-Continuous Splitting for Weakly Supervised Learning
E. Laude, J.-H. Lange, F. R. Schmidt, B. Andres and D. Cremers
Technical Report, 2017
(arXiv: 1705.05020)
This paper introduces a novel algorithm for a class of weakly supervised learning tasks. The considered tasks are posed as joint optimization problems in the continuous model parameters and the (a-priori unknown) discrete label variables. In contrast to prior approaches such as convex relaxations, we decompose the nonconvex problem into purely discrete and purely continuous subproblems in a way that is amenable to distributed optimization by the Alternating Direction Method of Multipliers (ADMM). This approach preserves integrality of the discrete label variables and, for a reparameterized variant of the algorithm using kernels, guarantees global convergence to a critical point. The resulting method implicitly alternates between a discrete and a continuous variable update, however, it is inherently different from a discrete-continuous coordinate descent scheme (hard EM). In diverse experiments we show that our method can learn a classifier from weak supervision that takes the form of hard and soft constraints on the labeling and outperforms hard EM in this task.
Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning
W. Li, J. Bohg and M. Fritz
Technical Report, 2017
(arXiv: 1711.00267)
Understanding physical phenomena is a key component of human intelligence and enables physical interaction with previously unseen environments. In this paper, we study how an artificial agent can autonomously acquire this intuition through interaction with the environment. We created a synthetic block stacking environment with physics simulation in which the agent can learn a policy end-to-end through trial and error. Thereby, we bypass to explicitly model physical knowledge within the policy. We are specifically interested in tasks that require the agent to reach a given goal state that may be different for every new trial. To this end, we propose a deep reinforcement learning framework that learns policies which are parametrized by a goal. We validated the model on a toy example navigating in a grid world with different target positions and in a block stacking task with different target structures of the final tower. In contrast to prior work, our policies show better generalization across different goals.
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Image
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Computer Vision has undergone major changes over the recent five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where the scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset together with methods, termed a symbolic-based and a neural-based visual question answering architectures, that address the problem. The symbolic-based method relies on a semantic parser, a database of visual facts, and a bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases. It also becomes the standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embraces uncertainty in word's meaning, and various interpretations of the scene and the question.
Pose Guided Person Image Generation
L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars and L. Van Gool
Technical Report, 2017
(arXiv: 1705.09368)
This paper proposes the novel Pose Guided Person Generation Network (PG$^2$) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose. Our generation framework PG$^2$ utilizes the pose information explicitly and consists of two key stages: pose integration and image refinement. In the first stage the condition image and the target pose are fed into a U-Net-like network to generate an initial but coarse image of the person with the target pose. The second stage then refines the initial and blurry result by training a U-Net-like generator in an adversarial way. Extensive experimental results on both 128$\times$64 re-identification images and 256$\times$256 fashion photos show that our model generates high-quality person images with convincing details.
Disentangled Person Image Generation
L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1712.02621)
Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.
Single-Shot Multi-Person 3D Body Pose Estimation From Monocular RGB Input
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll and C. Theobalt
Technical Report, 2017
(arXiv: 1712.03453)
We propose a new efficient single-shot method for multi-person 3D pose estimation in general scenes from a monocular RGB camera. Our fully convolutional DNN-based approach jointly infers 2D and 3D joint locations on the basis of an extended 3D location map supported by body part associations. This new formulation enables the readout of full body poses at a subset of visible joints without the need for explicit bounding box tracking. It therefore succeeds even under strong partial body occlusions by other people and objects in the scene. We also contribute the first training data set showing real images of sophisticated multi-person interactions and occlusions. To this end, we leverage multi-view video-based performance capture of individual people for ground truth annotation and a new image compositing for user-controlled synthesis of large corpora of real multi-person images. We also propose a new video-recorded multi-person test set with ground truth 3D annotations. Our method achieves state-of-the-art performance on challenging multi-person scenes.
Whitening Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1711.01768)
Many deployed learned models are black boxes: given input, returns output. Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks -- we show that the revealed internal information helps generate more effective adversarial examples against the black box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. Our paper suggests that it is actually hard to draw a line between white box and black box models.
Person Recognition in Social Media Photos
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Technical Report, 2017
(arXiv: 1710.03224)
People nowadays share large parts of their personal lives through social media. Being able to automatically recognise people in personal photos may greatly enhance user convenience by easing photo album organisation. For human identification task, however, traditional focus of computer vision has been face recognition and pedestrian re-identification. Person recognition in social media photos sets new challenges for computer vision, including non-cooperative subjects (e.g. backward viewpoints, unusual poses) and great changes in appearance. To tackle this problem, we build a simple person recognition framework that leverages convnet features from multiple image regions (head, body, etc.). We propose new recognition scenarios that focus on the time and appearance gap between training and testing samples. We present an in-depth analysis of the importance of different features according to time and viewpoint generalisability. In the process, we verify that our simple approach achieves the state of the art result on the PIPA benchmark, arguably the largest social media based benchmark for person recognition to date with diverse poses, viewpoints, social groups, and events. Compared the conference version of the paper, this paper additionally presents (1) analysis of a face recogniser (DeepID2+), (2) new method naeil2 that combines the conference version method naeil and DeepID2+ to achieve state of the art results even compared to post-conference works, (3) discussion of related work since the conference version, (4) additional analysis including the head viewpoint-wise breakdown of performance, and (5) results on the open-world setup.
Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images
T. Orekondy, M. Fritz and B. Schiele
Technical Report, 2017
(arXiv: 1712.01066)
Images convey a broad spectrum of personal information. If such images are shared on social media platforms, this personal information is leaked which conflicts with the privacy of depicted persons. Therefore, we aim for automated approaches to redact such private information and thereby protect privacy of the individual. By conducting a user study we find that obfuscating the image regions related to the private information leads to privacy while retaining utility of the images. Moreover, by varying the size of the regions different privacy-utility trade-offs can be achieved. Our findings argue for a "redaction by segmentation" paradigm. Hence, we propose the first sizable dataset of private images "in the wild" annotated with pixel and instance level labels across a broad range of privacy classes. We present the first model for automatic redaction of diverse private information.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2017
(arXiv: 1711.07373)
Deep models are the defacto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks of generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously. We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that support their decision.
Generation and Grounding of Natural Language Descriptions for Visual Data
A. Rohrbach
PhD Thesis, universität des Saarlandes, 2017
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand video of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach, which learns from videos and sentences to describe movie clips relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state-of-the-art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.
Visual Decoding of Targets During Visual Search From Human Eye Fixations
H. Sattar, M. Fritz and A. Bulling
Technical Report, 2017
(arXiv: 1706.05993)
What does human gaze reveal about a users' intents and to which extend can these intents be inferred or even visualized? Gaze was proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user's mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear if gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded only from human gaze fixations. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model's predictions using two human studies. Our results show that 62% (Chance level = 10%) of the time users were able to select the categories of the decoded image right. In our second studies we show the importance of a local gaze encoding for decoding visual search targets of user
A^4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation
R. Shetty, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1711.01921)
Text-based analysis methods allow to reveal privacy relevant author attributes such as gender, age and identify of the text's author. Such methods can compromise the privacy of an anonymous author even when the author tries to remove privacy sensitive content. In this paper, we propose an automatic method, called Adversarial Author Attribute Anonymity Neural Translation ($A^4NT$), to combat such text-based adversaries. We combine sequence-to-sequence language models used in machine translation and generative adversarial networks to obfuscate author attributes. Unlike machine translation techniques which need paired data, our method can be trained on unpaired corpora of text containing different authors. Importantly, we propose and evaluate techniques to impose constraints on our $A^4NT$ to preserve the semantics of the input text. $A^4NT$ learns to make minimal changes to the input text to successfully fool author attribute classifiers, while aiming to maintain the meaning of the input. We show through experiments on two different datasets and three settings that our proposed method is effective in fooling the author attribute classifiers and thereby improving the anonymity of authors.
Natural and Effective Obfuscation by Head Inpainting
Q. Sun, L. Ma, S. J. Oh, L. Van Gool, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1711.09001)
As more and more personal photos are shared online, being able to obfuscate identities in such photos is becoming a necessity for privacy protection. People have largely resorted to blacking out or blurring head regions, but they result in poor user experience while being surprisingly ineffective against state of the art person recognizers. In this work, we propose a novel head inpainting obfuscation technique. Generating a realistic head inpainting in social media photos is challenging because subjects appear in diverse activities and head orientations. We thus split the task into two sub-tasks: (1) facial landmark generation from image context (e.g. body pose) for seamless hypothesis of sensible head pose, and (2) facial landmark conditioned head inpainting. We verify that our inpainting method generates realistic person images, while achieving superior obfuscation performance against automatic person recognizers.
People Detection and Tracking in Crowded Scenes
S. Tang, B. Schiele, M. Black and L. V. Gool
PhD Thesis, Universität des Saarlandes, 2017
People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data. Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect single person as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task in different granularity. We present a tracking model that simultaneously clusters object bounding boxes and pixel level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model for the multi person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundary of state-of-the-arts and results in superior performance.
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
E. Wood, T. Baltrusaitis, L.-P. Morency, P. Robinson and A. Bulling
Technical Report, 2017
(arXiv: 1704.08763)
We present GazeDirector, a new approach for eye gaze redirection that uses model-fitting. Our method first tracks the eyes by fitting a multi-part eye region model to video frames using analysis-by-synthesis, thereby recovering eye region shape, texture, pose, and gaze simultaneously. It then redirects gaze by 1) warping the eyelids from the original image using a model-derived flow field, and 2) rendering and compositing synthesized 3D eyeballs onto the output image in a photorealistic manner. GazeDirector allows us to change where people are looking without person-specific training data, and with full articulation, i.e. we can precisely specify new gaze directions in 3D. Quantitatively, we evaluate both model-fitting and gaze synthesis, with experiments for gaze estimation and redirection on the Columbia gaze dataset. Qualitatively, we compare GazeDirector against recent work on gaze redirection, showing better results especially for large redirection angles. Finally, we demonstrate gaze redirection on YouTube videos by introducing new 3D gaze targets and by manipulating visual behavior.
Feature Generating Networks for Zero-Shot Learning
Y. Xian, T. Lorenz, B. Schiele and Z. Akata
Technical Report, 2017
(arXiv: 1712.00981)
Suffering from the extreme training data imbalance between seen and unseen classes, most of existing state-of-the-art approaches fail to achieve satisfactory results for the challenging generalized zero-shot learning task. To circumvent the need for labeled examples of unseen classes, we propose a novel generative adversarial network (GAN) that synthesizes CNN features conditioned on class-level semantic information, offering a shortcut directly from a semantic descriptor of a class to a class-conditional feature distribution. Our proposed approach, pairing a Wasserstein GAN with a classification loss, is able to generate sufficiently discriminative CNN features to train softmax classifiers or any multimodal embedding method. Our experimental results demonstrate a significant boost in accuracy over the state of the art on five challenging datasets -- CUB, FLO, SUN, AWA and ImageNet -- in both the zero-shot learning and generalized zero-shot learning settings.
Zero-Shot Learning -- A Comprehensive Evaluation of the Good, the Bad and the Ugly
Y. Xian, C. H. Lampert, B. Schiele and Z. Akata
Technical Report, 2017
(arXiv: 1707.00600)
Due to the importance of zero-shot learning, i.e. classifying images where there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed upon zero-shot learning benchmark, we first define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g. pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of the state-of-the-art methods in depth, both in the classic zero-shot setting but also in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area which can be taken as a basis for advancing it.
Towards Segmenting Consumer Stereo Videos: Benchmark, Baselines and Ensembles
W.-C. Chiu, F. Galasso and M. Fritz
Computer Vision -- ACCV 2016, 2016