They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers
M. Khamis, L. Bandelow, S. Schick, D. Casadevall, A. Bulling and F. Alt
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging
C. Lander, S. Gehring, M. Löchtefeld, A. Bulling and A. Krüger
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
Long-Term On-Board Prediction of Pedestrians in Traffic Scenes
A. Bhattacharyya, M. Fritz and B. Schiele
1st Conference on Robot Learning (CoRL 2017), 2017
Gradient-free Policy Architecture Search and Adaptation
S. Ebrahimi, A. Rohrbach and T. Darrell
1st Conference on Robot Learning (CoRL 2017), 2017
STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
ArtTrack: Articulated Multi-Person Tracking in the Wild
E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Gaze Embeddings for Zero-Shot Image Classification
N. Karessli, Z. Akata, B. Schiele and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Video Object Segmentation from Static Images
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele and A. Sorkine-Hornung
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Simple Does It: Weakly Supervised Instance and Semantic Segmentation
A. Khoreva, R. Benenson, J. Hosang, M. Hein and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
InstanceCut: from Edges to Instances with MultiCut
A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy and C. Rother
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Exploiting Saliency for Object Segmentation from Image Level Labels
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Generating Descriptions with Grounded and Co-Referenced People
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Domain Based Approach to Social Relation Recognition
Q. Sun, B. Schiele and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Message Passing Algorithm for the Minimum Cost Multicut Problem
P. Swoboda and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Multiple People Tracking by Lifted Multicut and Person Re-identification
S. Tang, M. Andriluka, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Zero-shot learning - The Good, the Bad and the Ugly
Y. Xian, B. Schiele and Z. Akata
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
CityPersons: A Diverse Dataset for Pedestrian Detection
S. Zhang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Convnets have enabled significant progress in pedestrian detection recently, but there are still open questions regarding suitable architectures and training data. We revisit CNN design and point out key adaptations, enabling plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset. To achieve further improvement from more and better data, we introduce CityPersons, a new set of person annotations on top of the Cityscapes dataset. The diversity of CityPersons allows us for the first time to train one single CNN model that generalizes well over multiple benchmarks. Moreover, with additional training with CityPersons, we obtain top results using FasterRCNN on Caltech, improving especially for more difficult cases (heavy occlusion and small scale) and providing higher localization quality.
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), 2017
Visual Stability Prediction and Its Application to Manipulation
W. Li, A. Leonardis and M. Fritz
AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, 2017
Pose Guided Person Image Generation
L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars and L. Van Gool
Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017
ScreenGlint: Practical, In-situ Gaze Estimation on Smartphones
M. X. Huang, J. Li, G. Ngai and H. V. Leong
CHI’17, 35th Annual ACM Conference on Human Factors in Computing Systems, 2017
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
M. Klauck, Y. Sugano and A. Bulling
CHI 2017 Extended Abstracts, 2017
Lucid Data Dreaming for Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
DAVIS Challenge on Video Object Segmentation 2017, 2017
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
M. Khamis, M. Hassib, E. von Zezschwitz, A. Bulling and F. Alt
ICMI’17, 19th ACM International Conference on Multimodal Interaction, 2017
What Is Around The Camera?
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Adversarial Image Perturbation for Privacy Protection -- A Game Theory Perspective
S. J. Oh, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images
T. Orekondy, B. Schiele and M. Fritz
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Efficient Algorithms for Moral Lineage Tracing
M. Rempfler, J.-H. Lange, F. Jug, C. Blasse, E. W. Myers, B. H. Menze and B. Andres
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Paying Attention to Descriptions Generated by Image Captioning Models
H. R. Tavakoli, R. Shetty, A. Borji and J. Laaksonen
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling
H. Sattar, A. Bulling and M. Fritz
2017 IEEE International Conference on Computer Vision Workshops (MBCC @ICCV 2017), 2017
Previous work focused on predicting visual search targets from human fixations but, in the real world, a specific target is often not known, e.g. when searching for a present for a friend. In this work we instead study the problem of predicting the mental picture, i.e. only an abstract idea instead of a specific target. This task is significantly more challenging given that mental pictures of the same target category can vary widely depending on personal biases, and given that characteristic target attributes can often not be verbalised explicitly. We instead propose to use gaze information as implicit information on users' mental picture and present a novel gaze pooling layer to seamlessly integrate semantic and localized fixation information into a deep image representation. We show that we can robustly predict both the mental picture's category and its attributes on a novel dataset containing fixation data of 14 users searching for targets on a subset of the DeepFashion dataset. Our results have important implications for future search interfaces and suggest deep gaze pooling as a general-purpose approach for gaze-supported computer vision systems.
Visual Stability Prediction for Robotic Manipulation
W. Li, A. Leonardis and M. Fritz
IEEE International Conference on Robotics and Automation (ICRA 2017), 2017
MARCOnI -- ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes
A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 3, 2017
Novel Views of Objects from a Single Image
K. Rematas, C. Nguyen, T. Ritschel, M. Fritz and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 8, 2017
Expanded Parts Model for Semantic Description of Humans in Still Images
G. Sharma, F. Jurie and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 1, 2017
A Compact Representation of Human Actions by Sliding Coordinate Coding
R. Ding, Q. Sun, M. Liu and H. Liu
International Journal of Advanced Robotic Systems, Volume 14, Number 6, 2017
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
M. Malinowski, M. Rohrbach and M. Fritz
International Journal of Computer Vision, Volume 125, Number 1-3, 2017
Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, Volume 123, Number 1, 2017
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015.
Cell Lineage Tracing in Lens-Free Microscopy Videos
M. Rempfler, S. Kumar, V. Stierle, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Computing and Computer Assisted Intervention -- MICCAI 2017, 2017
Building Statistical Shape Spaces for 3D Human Modeling
L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt and B. Schiele
Pattern Recognition, Volume 67, 2017
Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes
Q. Sun, H. Liu and T. Harada
Pattern Recognition, Volume 64, 2017
Learning Dilation Factors for Semantic Segmentation of Street Scenes
Y. He, M. Keuper, B. Schiele and M. Fritz
Pattern Recognition (GCPR 2017), 2017
A Comparative Study of Local Search Algorithms for Correlation Clustering
E. Levinkov, A. Kirillov and B. Andres
Pattern Recognition (GCPR 2017), 2017
Look Together: Using Gaze for Assisting Co-located Collaborative Search
Y. Zhang, K. Pfeuffer, M. K. Chong, J. Alexander, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 21, Number 1, 2017
GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices
M. Khamis, R. Hasholzner, A. Bulling and F. Alt
Pervasive Displays 2017 (PerDis 2017), 2017
Analysis and Optimization of Graph Decompositions by Lifted Multicuts
A. Horňáková, J.-H. Lange and B. Andres
Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017
EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays
M. Khamis, D. Buschek, T. Thieron, F. Alt and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 4, 2017
InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation
M. Tonsen, J. Steil, Y. Sugano and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 3, 2017
Efficiently Summarising Event Sequences with Rich Interleaving Patterns
A. Bhattacharyya and J. Vreeken
Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), 2017
Are you stressed? Your eyes and the mouse can tell
J. Wang, M. X. Huang, G. Ngai and H. V. Leong
Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), 2017
EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays
M. Khamis, A. Hoesl, A. Klimczak, M. Reiss, F. Alt and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery
X. Zhang, Y. Sugano and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Analysis and Improvement of the Visual Object Detection Pipeline
J. Hosang
PhD Thesis, Universität des Saarlandes, 2017
Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects. In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. 
We present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
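For readers unfamiliar with the post-processing step this abstract refers to, the following is a minimal sketch of conventional greedy non-maximum suppression, i.e. the hand-crafted baseline and not the learnable alternative developed in the thesis. The function names, the `(x1, y1, x2, y2)` box format and the default threshold of 0.5 are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already kept
    (higher-scoring) box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the second box overlaps the first and is dropped
```

The fixed threshold is what makes this step a heuristic: it is set by hand and cannot adapt to crowded scenes, where two correct detections may legitimately overlap strongly.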
Learning to Segment in Images and Videos with Different Forms of Supervision
A. Khoreva
PhD Thesis, Universität des Saarlandes, 2017
Much progress has been made in image and video segmentation over the last years. To a large extent, the success can be attributed to the strong appearance models completely learned from data, in particular using deep learning methods. However, to perform best, these methods require large representative datasets for training with expensive pixel-level annotations, which in the case of videos are prohibitively expensive to obtain. Therefore, there is a need to relax this constraint and to consider alternative forms of supervision, which are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision. First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate pixel-level approximate groundtruth from these weaker forms of annotations to train a network, which allows us to achieve high-quality results comparable to the full supervision quality without any modifications of the network architecture or the training procedure. Second, we address the problem of the excessive computational and memory costs inherent to solving video segmentation via graphs. We propose approaches to improve the runtime and memory efficiency as well as the output segmentation quality by learning from the available training data the best representation of the graph. In particular, we contribute by learning must-link constraints, the topology and edge weights of the graph as well as enhancing the graph nodes - superpixels - themselves. Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data for training convolutional networks. 
We introduce an architecture which allows training with static images only and propose an elaborate data synthesis scheme which creates a large number of training examples close to the target domain from the given first frame mask. With the proposed techniques we show that densely annotated consecutive video data is not necessary to achieve high-quality temporally coherent video segmentation results. In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking and contributes new ways of training convolutional networks with a limited amount of pixel-level annotated training data.
Lucid Data Dreaming for Multiple Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
Technical Report, 2017
(arXiv: 1703.09554)
Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k ~ 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x ~ 100x less annotated data than competing methods. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach reaches competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the object tracking task.
Decomposition of Trees and Paths via Correlation
J.-H. Lange and B. Andres
Technical Report, 2017
(arXiv: 1706.06822v2)
We study the problem of decomposing (clustering) a tree with respect to costs attributed to pairs of nodes, so as to minimize the sum of costs for those pairs of nodes that are in the same component (cluster). For the general case and for the special case of the tree being a star, we show that the problem is NP-hard. For the special case of the tree being a path, this problem is known to be polynomial time solvable. We characterize several classes of facets of the combinatorial polytope associated with a formulation of this clustering problem in terms of lifted multicuts. In particular, our results yield a complete totally dual integral (TDI) description of the lifted multicut polytope for paths, which establishes a connection to the combinatorial properties of alternative formulations such as set partitioning.
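To make the path case concrete: on a path, every component is a contiguous run of nodes, so the objective stated above can be minimized with a simple dynamic program over cut positions. The sketch below is one straightforward way to do this and is not claimed to be the algorithm the authors have in mind; the function name and the upper-triangular cost-matrix input (`cost[u][v]` for `u < v`) are assumptions for illustration.

```python
from typing import List, Tuple


def decompose_path(cost: List[List[float]]) -> Tuple[float, List[List[int]]]:
    """Minimum-cost decomposition of a path with nodes 0..n-1.

    cost[u][v] (for u < v) is paid whenever u and v end up in the same
    component. Components of a path are contiguous intervals of nodes.
    """
    n = len(cost)
    # within[i][j]: total cost of all pairs inside the interval i..j.
    within = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            within[i][j] = within[i][j - 1] + sum(cost[u][j] for u in range(i, j))

    # best[k]: minimum cost of decomposing the first k nodes;
    # cut[k]: start index of the last component in such a decomposition.
    best = [0.0] * (n + 1)
    cut = [0] * (n + 1)
    for k in range(1, n + 1):
        best[k], cut[k] = min((best[i] + within[i][k - 1], i) for i in range(k))

    # Walk the stored cut positions backwards to recover the components.
    components: List[List[int]] = []
    k = n
    while k > 0:
        components.append(list(range(cut[k], k)))
        k = cut[k]
    components.reverse()
    return best[n], components


# Negative costs reward joining a pair, positive costs penalise it.
example = [
    [0, -2, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, -1],
    [0, 0, 0, 0],
]
value, parts = decompose_path(example)
```

As written the sketch runs in cubic time in the number of nodes, which is enough to illustrate why the path case is tractable while the general tree case is not.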
Image Classification with Limited Training Data and Class Ambiguity
M. Lapin
PhD Thesis, Universität des Saarlandes, 2017
Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm. In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity where clear distinction between the classes is no longer possible. Many real world images are naturally multilabel yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in top k predictions of a learner. Our results indicate consistent improvements over the standard loss functions that put more penalty on the first incorrect prediction compared to the proposed losses. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
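The tolerance in top k predictions mentioned above is commonly evaluated with the top-k error: a prediction counts as correct if the true label is among the k highest-scoring classes. The sketch below computes only this evaluation measure, not any of the loss functions proposed in the thesis; the function name and the list-of-score-rows input format are assumptions for illustration.

```python
from typing import List, Sequence


def top_k_error(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of examples whose true label is not among the k top-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top: List[int] = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top:
            errors += 1
    return errors / len(labels)


scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 1]
err_top1 = top_k_error(scores, labels, k=1)
err_top2 = top_k_error(scores, labels, k=2)
```

In this toy example both true labels are ranked second, so they count as errors for k = 1 but not for k = 2, which is the kind of ambiguity between plausible classes that a top-k criterion tolerates.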
Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning
W. Li, J. Bohg and M. Fritz
Technical Report, 2017
(arXiv: 1711.00267)
Understanding physical phenomena is a key component of human intelligence and enables physical interaction with previously unseen environments. In this paper, we study how an artificial agent can autonomously acquire this intuition through interaction with the environment. We created a synthetic block stacking environment with physics simulation in which the agent can learn a policy end-to-end through trial and error. Thereby, we bypass the need to explicitly model physical knowledge within the policy. We are specifically interested in tasks that require the agent to reach a given goal state that may be different for every new trial. To this end, we propose a deep reinforcement learning framework that learns policies which are parametrized by a goal. We validated the model on a toy example navigating in a grid world with different target positions and in a block stacking task with different target structures of the final tower. In contrast to prior work, our policies show better generalization across different goals.
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Images
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Computer Vision has undergone major changes over the last five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where the scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with methods, termed symbolic-based and neural-based visual question answering architectures, that address the problem. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases. It has also become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in word meaning, and various interpretations of the scene and the question.
Person Recognition in Social Media Photos
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Technical Report, 2017
(arXiv: 1710.03224)
People nowadays share large parts of their personal lives through social media. Being able to automatically recognise people in personal photos may greatly enhance user convenience by easing photo album organisation. For the human identification task, however, the traditional focus of computer vision has been face recognition and pedestrian re-identification. Person recognition in social media photos sets new challenges for computer vision, including non-cooperative subjects (e.g. backward viewpoints, unusual poses) and great changes in appearance. To tackle this problem, we build a simple person recognition framework that leverages convnet features from multiple image regions (head, body, etc.). We propose new recognition scenarios that focus on the time and appearance gap between training and testing samples. We present an in-depth analysis of the importance of different features according to time and viewpoint generalisability. In the process, we verify that our simple approach achieves the state of the art result on the PIPA benchmark, arguably the largest social media based benchmark for person recognition to date with diverse poses, viewpoints, social groups, and events. Compared to the conference version of the paper, this paper additionally presents (1) analysis of a face recogniser (DeepID2+), (2) a new method naeil2 that combines the conference version method naeil and DeepID2+ to achieve state of the art results even compared to post-conference works, (3) discussion of related work since the conference version, (4) additional analysis including the head viewpoint-wise breakdown of performance, and (5) results on the open-world setup.
Whitening Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1711.01768)
Many deployed learned models are black boxes: given an input, they return an output. Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks -- we show that the revealed internal information helps generate more effective adversarial examples against the black box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. Our paper suggests that it is actually hard to draw a line between white box and black box models.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2017
(arXiv: 1711.07373)
Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks of generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously. We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that supports their decision.
Generation and Grounding of Natural Language Descriptions for Visual Data
A. Rohrbach
PhD Thesis, Universität des Saarlandes, 2017
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand videos of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at a variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach which learns from videos and sentences to describe movie clips relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state-of-the-art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.
Visual Decoding of Targets During Visual Search From Human Eye Fixations
H. Sattar, M. Fritz and A. Bulling
Technical Report, 2017
(arXiv: 1706.05993)
What does human gaze reveal about a user's intents and to which extent can these intents be inferred or even visualized? Gaze was proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user's mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear if gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded only from human gaze fixations. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model's predictions using two human studies. Our results show that 62% (chance level = 10%) of the time users were able to correctly select the categories of the decoded image. In our second study we show the importance of a local gaze encoding for decoding visual search targets of user
People detection and tracking in crowded scenes
S. Tang
PhD Thesis, Universität des Saarlandes, 2017
People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data. Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect single persons as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi-person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task at different granularities. We present a tracking model that simultaneously clusters object bounding boxes and pixel level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model to the multi-person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. 
In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundary of the state of the art and results in superior performance.
Towards Segmenting Consumer Stereo Videos: Benchmark, Baselines and Ensembles
W.-C. Chiu, F. Galasso and M. Fritz
Computer Vision -- ACCV 2016, 2016