2017
Learning Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Gaze Embeddings for Zero-Shot Image Classification
N. Karessli, Z. Akata, B. Schiele and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Simple Does It: Weakly Supervised Instance and Semantic Segmentation
A. Khoreva, R. Benenson, J. Hosang, M. Hein and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Learning Video Object Segmentation from Static Images
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele and A. Sorkine-Hornung
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Exploiting Saliency for Object Segmentation from Image Level Labels
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Generating Descriptions with Grounded and Co-Referenced People
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Zero-shot learning - The Good, the Bad and the Ugly
Y. Xian, B. Schiele and Z. Akata
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
M. Klauck, Y. Sugano and A. Bulling
CHI 2017 Extended Abstracts, 2017
(Accepted/in press)
Visual Stability Prediction for Robotic Manipulation
W. Li, A. Leonardis and M. Fritz
IEEE International Conference on Robotics and Automation (ICRA 2017), 2017
(Accepted/in press)
MARCOnI-ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes
A. Elhayek, E. de Aguiar, A. Jain, J. Thompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 3, 2017
Expanded Parts Model for Semantic Description of Humans in Still Images
G. Sharma, F. Jurie and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 1, 2017
Towards Reaching Human Performance in Pedestrian Detection
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
Abstract
Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and background-versus-foreground errors. To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Other than our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of training and test annotations.
Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, Volume 123, Number 1, 2017
Abstract
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015.
Building Statistical Shape Spaces for 3D Human Modeling
L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt and B. Schiele
Pattern Recognition, Volume 67, 2017
Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes
Q. Sun, H. Liu and T. Harada
Pattern Recognition, Volume 64, 2017
Look Together: Using Gaze for Assisting Co-located Collaborative Search
Y. Zhang, K. Pfeuffer, M. K. Chong, J. Alexander, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 21, Number 1, 2017
Efficiently Summarising Event Sequences with Rich Interleaving Patterns
A. Bhattacharyya and J. Vreeken
Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), 2017
(Accepted/in press)
Lucid Data Dreaming for Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
Technical Report, 2017
(arXiv: 1703.09554)
Abstract
Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k ~ 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x ~ 100x less annotated data than competing methods. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the object tracking task.
Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images
T. Orekondy, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1703.10660)
Abstract
With an increasing number of users sharing information online, privacy implications entailing such actions are a major concern. For explicit content, such as user profile or GPS data, devices (e.g. mobile phones) as well as web services (e.g. Facebook) offer to set privacy settings in order to enforce the users' privacy preferences. We propose the first approach that extends this concept to image content in the spirit of a Visual Privacy Advisor. First, we categorize personal information in images into 68 image attributes and collect a dataset, which allows us to train models that predict such information directly from images. Second, we run a user study to understand the privacy preferences of different users w.r.t. such attributes. Third, we propose models that predict a user-specific privacy score from images in order to enforce the users' privacy preferences. Our model is trained to predict the user-specific privacy risk and even outperforms the judgment of the users, who often fail to follow their own privacy preferences on image data.
Efficient Algorithms for Moral Lineage Tracing
M. Rempfler, J.-H. Lange, F. Jug, C. Blasse, E. W. Myers, B. H. Menze and B. Andres
Technical Report, 2017
(arXiv: 1702.04111)
Abstract
Lineage tracing, the joint segmentation and tracking of living cells as they move and divide in a sequence of light microscopy images, is a challenging task. Jug et al. have proposed a mathematical abstraction of this task, the moral lineage tracing problem (MLTP), whose feasible solutions define a segmentation of every image and a lineage forest of cells. Their branch-and-cut algorithm, however, is prone to many cuts and slow convergence for large instances. To address this problem, we make three contributions: Firstly, we improve the branch-and-cut algorithm by separating tighter cutting planes. Secondly, we define two primal feasible local search algorithms for the MLTP. Thirdly, we show in experiments that our algorithms decrease the runtime on the problem instances of Jug et al. considerably and find solutions on larger instances in reasonable time.
Generation and Grounding of Natural Language Descriptions for Visual Data
A. Rohrbach
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand video of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at a variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach which learns from videos and sentences to describe movie clips, relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state-of-the-art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.