# Personal Information

## Research Interests

• Human-Computer Interaction
• Ubiquitous Computing
• Eye Tracking
• Machine Learning and Pattern Recognition
• Egocentric Computer Vision

## Education

• PhD in Information Technology and Electrical Engineering (October 2006 - June 2010)
Swiss Federal Institute of Technology (ETH) Zurich, Switzerland
• MSc in Computer Science (October 2001 - June 2006)
Technical University of Karlsruhe, Germany

## Short Bio

Andreas Bulling is Full Professor of Human-Computer Interaction and Cognitive Systems at the University of Stuttgart and head of the Perceptual User Interfaces Group at the Max Planck Institute for Informatics. He received his MSc (Dipl.-Inform.) in Computer Science from the Karlsruhe Institute of Technology (KIT), Germany, focusing on embedded systems, robotics, and biomedical engineering. He holds a PhD in Information Technology and Electrical Engineering from the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland. Andreas was previously a Feodor Lynen Research Fellow and a Marie Curie Research Fellow in the Computer Laboratory at the University of Cambridge, UK, a postdoctoral research associate in the School of Computing and Communications at Lancaster University, UK, and a Junior Research Fellow at Wolfson College, Cambridge. Andreas is a member of the UbiComp steering committee and serves on the editorial boards of the Proceedings of the ACM on Interactive, Mobile, Wearable, and Ubiquitous Technologies, ACM Transactions on Interactive Intelligent Systems, and the Journal of Eye Movement Research. He has also served as co-chair, TPC member, and reviewer for major conferences, most recently as TPC co-chair for ACM UbiComp 2016 and IEEE PerCom 2015 and as associate chair for ACM ETRA 2016 and 2018 and for ACM CHI 2013, 2014, 2018, and 2019. He received an ERC Starting Grant in 2018.

# Publications

2023
A Polyhedral Study of Lifted Multicuts
B. Andres, S. Di Gregorio, J. Irmai and J.-H. Lange
Discrete Optimization, Volume 47, 2023
H. Chen, R. Tao, Y. Fan, Y. Wang, M. Savvides, J. Wang, B. Raj, X. Xie and B. Schiele
Eleventh International Conference on Learning Representations (ICLR 2023), 2023
(Accepted/in press)
Neural Architecture Design and Robustness: A Dataset
S. Jung, J. Lukasik and M. Keuper
Eleventh International Conference on Learning Representations (ICLR 2023), 2023
(Accepted/in press)
Abstract
Deep learning models have proven to be successful in a wide range of machine learning tasks. Yet, they are often highly sensitive to perturbations on the input data which can lead to incorrect decisions with high confidence, hampering their deployment for practical use-cases. Thus, finding architectures that are (more) robust against perturbations has received much attention in recent years. Just like the search for well-performing architectures in terms of clean accuracy, this usually involves a tedious trial-and-error process with one additional challenge: the evaluation of a network's robustness is significantly more expensive than its evaluation for clean accuracy. Thus, the aim of this paper is to facilitate better streamlined research on architectural design choices with respect to their impact on robustness as well as, for example, the evaluation of surrogate measures for robustness. We therefore borrow one of the most commonly considered search spaces for neural architecture search for image classification, NAS-Bench-201, which contains a manageable size of 6466 non-isomorphic network designs. We evaluate all these networks on a range of common adversarial attacks and corruption types and introduce a database on neural architecture design and robustness evaluations. We further present three exemplary use cases of this dataset, in which we (i) benchmark robustness measurements based on Jacobian and Hessian matrices for their robustness predictability, (ii) perform neural architecture search on robust accuracies, and (iii) provide an initial analysis of how architectural design choices affect robustness. We find that carefully crafting the topology of a network can have substantial impact on its robustness, where networks with the same parameter count range in mean adversarial robust accuracy from 20%-41%.
FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning
Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, B. Schiele and X. Xie
Eleventh International Conference on Learning Representations (ICLR 2023), 2023
(Accepted/in press)
Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
D. Dai, A. B. Vasudevan, J. Matas and L. Van Gool
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Number 1, 2023
Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation
E. Levinkov, A. Kardoost, B. Andres and M. Keuper
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Number 1, 2023
Abstract
Minimum cost lifted multicut problem is a generalization of the multicut problem and is a means to optimizing a decomposition of a graph w.r.t. both positive and negative edge costs. Its main advantage is that multicut-based formulations do not require the number of components given a priori; instead, it is deduced from the solution. However, the standard multicut cost function is limited to pairwise relationships between nodes, while several important applications either require or can benefit from a higher-order cost function, i.e. hyper-edges. In this paper, we propose a pseudo-boolean formulation for a multiple model fitting problem. It is based on a formulation of any-order minimum cost lifted multicuts, which allows to partition an undirected graph with pairwise connectivity such as to minimize costs defined over any set of hyper-edges. As the proposed formulation is NP-hard and the branch-and-bound algorithm is too slow in practice, we propose an efficient local search algorithm for inference into resulting problems. We demonstrate versatility and effectiveness of our approach in several applications: geometric multiple model fitting, homography and motion estimation, motion segmentation.
Urban Scene Semantic Segmentation With Low-Cost Coarse Annotation
A. Das, Y. Xian, Y. He, Z. Akata and B. Schiele
2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), 2023
Intra-Source Style Augmentation for Improved Domain Generalization
Y. Li, D. Zhang, M. Keuper and A. Khoreva
2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), 2023
Revisiting Consistency Regularization for Semi-supervised Learning
Y. Fan, A. Kukleva, D. Dai and B. Schiele
International Journal of Computer Vision, Volume 131, 2023
Learning Comprehensive Global Features in Person Re-identification: Ensuring Discriminativeness of more Local Regions
J. Xi, J. Huang, S. Zheng, Q. Zhou, B. Schiele, X.-S. Hua and Q. Sun
Pattern Recognition, Volume 134, 2023
Online Hyperparameter Optimization for Class-Incremental Learning
Y. Liu, Y. Li, B. Schiele and Q. Sun
Proceedings of the 37th AAAI Conference on Artificial Intelligence, 2023
(Accepted/in press)
Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering
D. M. H. Nguyen, H. Nguyen, M. T. N. Truong, T. Cao, B. T. Nguyen, N. Ho, P. Swoboda, S. Albarqouni, P. Xie and D. Sonntag
Proceedings of the 37th AAAI Conference on Artificial Intelligence, 2023
(Accepted/in press)
L. H. Abdel Khaliq
PhD Thesis, Universität des Saarlandes, 2023
2022
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval
A. Chaudhuri, M. Mancini, Y. Chen, Z. Akata and A. Dutta
33rd British Machine Vision Conference (BMVC 2022), 2022
Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment
Y. Ma, Y. Chen and Z. Akata
33rd British Machine Vision Conference (BMVC 2022), 2022
SP-ViT: Learning 2D Spatial Priors for Vision Transformers
Y. Zhou, W. Xiang, C. Li, B. Wang, X. Wei, L. Zhang, M. Keuper and X. Hua
33rd British Machine Vision Conference (BMVC 2022), 2022
Relational Proxies: Emergent Relationships as Fine-Grained Discriminators
A. Chaudhuri, M. Mancini, Z. Akata and A. Dutta
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Robust Models are less Over-Confident
J. Grabinski, P. Gavrikov, J. Keuper and M. Keuper
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Trading off Image Quality for Robustness is not Necessary with Regularized Deterministic Autoencoders
A. Saseendran, K. Skubch and M. Keuper
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Motion Transformer with Global Intention Localization and Local Movement Refinement
S. Shi, L. Jiang, D. Dai and B. Schiele
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds
H. Wang, L. Ding, S. Dong, S. Shi, A. Li, J. Li, Z. Li and L. Wang
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
USB: A Unified Semi-supervised Learning Benchmark for Classification
Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo, H. Qi, Z. Wu, Y.-F. Li, S. Nakamura, W. Ye, M. Savvides, B. Raj, T. Shinozaki, B. Schiele, J. Wang, X. Xie and Y. Zhang
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Towards Efficient 3D Object Detection with Knowledge Distillation
J. Yang, S. Shi, R. Ding, Z. Wang and X. Qi
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
Abstracting Sketches Through Simple Primitives
S. Alaniz, M. Mancini, A. Dutta, D. Marcos and Z. Akata
Computer Vision -- ECCV 2022, 2022
MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection
X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu and H. Li
Computer Vision -- ECCV 2022, 2022
Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation using Bounding Boxes
J. Chibane, F. Engelmann, A. T. Tran and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
Learned Vertex Descent: A New Direction for 3D Human Model Fitting
E. Corona, G. Pons-Moll, G. Alenyà and F. Moreno-Noguer
Computer Vision -- ECCV 2022, 2022
DODA: Data-Oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation
R. Ding, J. Yang, L. Jiang and X. Qi
Computer Vision -- ECCV 2022, 2022
TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation
R. Gong, M. Danelljan, D. Dai, D. P. Paudel, A. Chhatkuli, F. Yu and L. Van Gool
Computer Vision -- ECCV 2022, 2022
Class-Agnostic Object Counting Robust to Intraclass Diversity
S. Gong, S. Zhang, J. Yang, D. Dai and B. Schiele
Computer Vision -- ECCV 2022, 2022
FrequencyLowCut Pooling - Plug & Play against Catastrophic Overfitting
J. Grabinski, S. Jung, J. Keuper and M. Keuper
Computer Vision -- ECCV 2022, 2022
Improving Robustness by Enhancing Weak Subnets
Y. Guo, D. Stutz and B. Schiele
Computer Vision -- ECCV 2022, 2022
A Comparative Study of Graph Matching Algorithms in Computer Vision
S. Haller, L. Feineis, L. Hutschenreiter, F. Bernard, C. Rother, D. Kainmüller, P. Swoboda and B. Savchynskyy
Computer Vision -- ECCV 2022, 2022
HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
Computer Vision -- ECCV 2022, 2022
Skeleton-Free Pose Transfer for Stylized 3D Characters
Z. Liao, J. Yang, J. Saito, G. Pons-Moll and Y. Zhou
Computer Vision -- ECCV 2022, 2022
CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video
W. Lin, A. Kukleva, K. Sun, H. Possegger, H. Kuehne and H. Bischof
Computer Vision -- ECCV 2022, 2022
Learning Where To Look - Generative NAS is Surprisingly Efficient
J. Lukasik, S. Jung and M. Keuper
Computer Vision -- ECCV 2022, 2022
Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning
O.-B. Mercea, T. Hummel, A. S. Koepke and Z. Akata
Computer Vision -- ECCV 2022, 2022
HULC: 3D HUman Motion Capture with Pose Manifold SampLing and Dense Contact Guidance
S. Shimada, V. Golyanik, Z. Li, P. Pérez, W. Xu and C. Theobalt
Computer Vision -- ECCV 2022, 2022
Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields
G. Tiwari, D. Antic, J. E. Lenssen, N. Sarafianos, T. Tung and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
CHORE: Contact, Human and Object Reconstruction from a Single RGB Image
X. Xie, B. L. Bhatnagar and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
COUCH: Towards Controllable Human-Chair Interactions
X. Zhang, B. L. Bhatnagar, S. Starke, V. Guzov and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
K. Zhou, B. L. Bhatnagar, J. E. Lenssen and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
H. Cao, X. Hong, H. Tost, A. Meyer-Lindenberg and E. Schwarz
Frontiers in Psychiatry, Volume 13, 2022
Semantic Image Synthesis with Semantically Coupled VQ-Model
S. Alaniz, T. Hummel and Z. Akata
ICLR Workshop on Deep Generative Models for Highly Structured Data (ICLR 2022 DGM4HSD), 2022
RAMA: A Rapid Multicut Algorithm on GPU
A. Abbas and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
FastDOG: Fast Discrete Optimization on GPU
A. Abbas and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
BEHAVE: Dataset and Method for Tracking Human Object Interactions
B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
B-cos Networks: Alignment is All We Need for Interpretability
M. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation
S. Cai, A. Obukhov, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Decoupling Zero-Shot Semantic Segmentation
J. Ding, N. Xue, G.-S. Xia and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment the novel categories that have not been seen in the training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem, and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only with texts. While simple, the pixel-level ZS3 formulation shows the limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple the ZS3 into two sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments. 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter subtask performs at segment-level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by large margins, e.g., 35 points on the PASCAL VOC and 3 points on the COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking
A. Doering, D. Chen, S. Zhang, B. Schiele and J. Gall
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
Y. Fan, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced SSL. To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our code will be made publicly available.
Bi-level Alignment for Cross-Domain Crowd Counting
S. Gong, S. Zhang, J. Yang, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
LiDAR Snowfall Simulation for Robust 3D Object Detection
M. Hahner, C. Sakaridis, M. Bijelic, F. Heide, F. Yu, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer.
Large Loss Matters in Weakly Supervised Multi-Label Classification
Y. Kim, J. M. Kim, Z. Akata and J. Lee
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Stratified Transformer for 3D Point Cloud Segmentation
X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi and J. Jia
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding
X. Ma, Z. Wang, Y. Zhan, Y. Zheng, Z. Wang, D. Dai and C.-W. Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
Although considerable progress has been made in semantic scene understanding under clear weather, it is still a tough problem under adverse weather conditions, such as dense fog, due to the uncertainty caused by imperfect observations. Besides, difficulties in collecting and labeling foggy images hinder the progress of this field. Considering the success in semantic scene understanding under clear weather, we think it is reasonable to transfer knowledge learned from clear images to the foggy domain. As such, the problem becomes to bridge the domain gap between clear images and foggy images. Unlike previous methods that mainly focus on closing the domain gap caused by fog -- defogging the foggy images or fogging the clear images, we propose to alleviate the domain gap by considering fog influence and style variation simultaneously. The motivation is based on our finding that the style-related gap and the fog-related gap can be divided and closed respectively, by adding an intermediate domain. Thus, we propose a new pipeline to cumulatively adapt style, fog and the dual-factor (style and fog). Specifically, we devise a unified framework to disentangle the style factor and the fog factor separately, and then the dual-factor from images in different domains. Furthermore, we collaborate the disentanglement of three factors with a novel cumulative loss to thoroughly disentangle these three factors. Our method achieves the state-of-the-art performance on three benchmarks and shows generalization ability in rainy and snowy scenes.
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
O.-B. Mercea, L. Riesch, A. S. Koepke and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking
D. H. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
Multi-Camera Multi-Object Tracking is currently drawing attention in the computer vision field due to its superior performance in real-world applications such as video surveillance with crowded scenes or in vast space. In this work, we propose a mathematically elegant multi-camera multiple object tracking approach based on a spatial-temporal lifted multicut formulation. Our model utilizes state-of-the-art tracklets produced by single-camera trackers as proposals. As these tracklets may contain ID-Switch errors, we refine them through a novel pre-clustering obtained from 3D geometry projections. As a result, we derive a better tracking graph without ID switches and more precise affinity costs for the data association phase. Tracklets are then matched to multi-camera trajectories by solving a global lifted multicut formulation that incorporates short and long-range temporal interactions on tracklets located in the same camera as well as inter-camera ones. Experimental results on the WildTrack dataset yield near-perfect result, outperforming state-of-the-art trackers on Campus while being on par on the PETS-09 dataset. We will make our implementations available upon acceptance of the paper.
S. Rao, M. Böhle and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching
P. Roetzer, P. Swoboda, D. Cremers and F. Bernard
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
T. Sun, M. Segù, J. Postels, Y. Wang, L. Van Gool, B. Schiele, F. Tombari and F. Yu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Generalized Few-shot Semantic Segmentation
Z. Tian, X. Lai, L. Jiang, S. Liu, M. Shu, H. Zhao and J. Jia
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Scribble-Supervised LiDAR Semantic Segmentation
O. Unal, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Sound and Visual Representation Learning with Multiple Pretraining Tasks
A. B. Vasudevan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
RBGNet: Ray-based Grouping for 3D Object Detection
H. Wang, S. Shi, Z. Yang, R. Fang, Q. Qian, H. Li, B. Schiele and L. Wang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Q. Wang, O. Fink, L. Van Gool and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
A Unified Query-based Paradigm for Point Cloud Understanding
Z. Yang, L. Jiang, Y. Sun, B. Schiele and J. Jia
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Adiabatic Quantum Computing for Multi Object Tracking
J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform
S. Li, X. Chen, Y. Liu, D. Dai, C. Stachniss and J. Gall
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Improving Depth Estimation Using Map-Based Depth Priors
V. Patil, A. Liniger, D. Dai and L. Van Gool
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization
N. Vödisch, O. Unal, K. Li, L. Van Gool and D. Dai
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Learnable Online Graph Representations for 3D Multi-Object Tracking
J.-N. Zaech, D. Dai, A. Liniger, M. Danelljan and L. Van Gool
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
Y. Chen, M. Mancini, X. Zhu and Z. Akata
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
DWDN: Deep Wiener Deconvolution Network for Non-Blind Image Deblurring
J. Dong, S. Roth and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 12, 2022
Q. Sun, Y. Liu, Z. Chen, T.-S. Chua and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 3, 2022
Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation
Y. Xian, B. Korbar, M. Douze, L. Torresani, B. Schiele and Z. Akata
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 12, 2022
Hyperspectral Image Super-Resolution with RGB Image Super-Resolution as an Auxiliary Task
K. Li, D. Dai and L. Van Gool
2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022), 2022
ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks
D. H. M. Nguyen, D. M. Nguyen, T. T. N. Mai, T. Nguyen, K. T. Tran, A. T. Nguyen, B. T. Pham and B. T. Nguyen
Information Sciences, Volume 591, 2022
MoCapDeform: Monocular 3D Human Motion Capture in Deformable Scenes
Z. Li, S. Shimada, B. Schiele, C. Theobalt and V. Golyanik
International Conference on 3D Vision, 2022
Abstract
3D human motion capture from monocular RGB images respecting interactions of a subject with complex and possibly deformable environments is a very challenging, ill-posed and under-explored problem. Existing methods address it only weakly and do not model possible surface deformations often occurring when humans interact with scene surfaces. In contrast, this paper proposes MoCapDeform, i.e., a new framework for monocular 3D human motion capture that is the first to explicitly model non-rigid deformations of a 3D scene for improved 3D human pose estimation and deformable environment reconstruction. MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the camera space. It first localises a subject in the input monocular video along with dense contact labels using a new raycasting based strategy. Next, our human-environment interaction constraints are leveraged to jointly optimise global 3D human poses and non-rigid surface deformations. MoCapDeform achieves superior accuracy than competing methods on several datasets, including our newly recorded one with deforming background scenes.
PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection
S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang and H. Li
International Journal of Computer Vision, Volume 131, 2022
OASIS: Only Adversarial Supervision for Semantic Image Synthesis
V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Journal of Computer Vision, Volume 130, 2022
Attribute Prototype Network for Any-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
International Journal of Computer Vision, Volume 130, 2022
DPER: Direct Parameter Estimation for Randomly Missing Data
T. T. Nguyen, K. M. Nguyen-Duy, D. H. M. Nguyen, B. T. Nguyen and B. A. Wade
Knowledge-Based Systems, Volume 240, 2022
Aliasing and Adversarial Robust Generalization of CNNs
J. Grabinski, J. Keuper and M. Keuper
Machine Learning, Volume 111, 2022
Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks
S. Jung and M. Keuper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022), 2022
Abstract
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial optimization problem of partitioning a real-valued edge-weighted graph such as to minimize the total cost of the partition. While graph convolutional neural networks (GNN) have proven to be promising in the context of combinatorial optimization, most of them are only tailored to or tested on positive-valued edge weights, i.e. they do not comply to the nature of the multicut problem. We therefore adapt various GNN architectures including Graph Convolutional Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to facilitate the efficient encoding of real-valued edge costs. Moreover, we employ a reformulation of the multicut ILP constraints to a polynomial program as loss function that allows to learn feasible multicut solutions in a scalable way. Thus, we provide the first approach towards end-to-end trainable multicuts. Our findings support that GNN approaches can produce good solutions in practice while providing lower computation times and largely improved scalability compared to LP solvers and optimized heuristics, especially when considering large instances.
TATL: Task Agnostic Transfer Learning for Skin Attributes Detection
D. H. M. Nguyen, T. T. Nguyen, H. Vu, Q. Pham, B. T. Nguyen, D. Sonntag and M.-D. Nguyen
Medical Image Analysis, Volume 78, 2022
Impact of Realistic Properties of the Point Spread Function on Classification Tasks to Reveal a Possible Distribution Shift
P. Müller, A. Braun and M. Keuper
NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2022 Workshop DistShift), 2022
Optimizing Edge Detection for Image Segmentation with Multicut Penalties
S. Jung, S. Ziegler, A. Kardoost and M. Keuper
Pattern Recognition (DAGM GCPR 2022), 2022
Abstract
The Minimum Cost Multicut Problem (MP) is a popular way for obtaining a graph decomposition by optimizing binary edge labels over edge costs. While the formulation of a MP from independently estimated costs per edge is highly flexible and intuitive, solving the MP is NP-hard and time-expensive. As a remedy, recent work proposed to predict edge probabilities with awareness to potential conflicts by incorporating cycle constraints in the prediction process. We argue that such formulation, while providing a first step towards end-to-end learnable edge weights, is suboptimal, since it is built upon a loose relaxation of the MP. We therefore propose an adaptive CRF that allows to progressively consider more violated constraints and, in consequence, to issue solutions with higher validity. Experiments on the BSDS500 benchmark for natural image segmentation as well as on electron microscopic recordings show that our approach yields more precise edge detection and image segmentation.
Keypoint Message Passing for Video-Based Person Re-identification
D. Chen, A. Doering, S. Zhang, J. Yang, J. Gall and B. Schiele
Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022
PlanT: Explainable Planning Transformers via Object-Level Representations
K. Renz, K. Chitta, O.-B. Mercea, A. S. Koepke, Z. Akata and A. Geiger
Proceedings of the 6th Annual Conference on Robot Learning (CoRL 2022), 2022
Abstract
Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
T. Broedermann, C. Sakaridis, D. Dai and L. Van Gool
Technical Report, 2022
(arXiv: 2206.15157)
Abstract
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning
H. Chen, Y. Fan, Y. Wang, J. Wang, B. Schiele, X. Xie, M. Savvides and B. Raj
Technical Report, 2022
(arXiv: 2211.11086)
Abstract
Semi-supervised learning (SSL) has shown great promise in leveraging unlabeled data to improve model performance. While standard SSL assumes uniform data distribution, we consider a more realistic and challenging setting called imbalanced SSL, where imbalanced class distributions occur in both labeled and unlabeled data. Although there are existing endeavors to tackle this challenge, their performance degenerates when facing severe imbalance since they cannot reduce the class imbalance sufficiently and effectively. In this paper, we study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance by simply supplementing labeled data with pseudo-labels, according to the difference in class distribution from the most frequent class. Such a simple baseline turns out to be highly effective in reducing class imbalance. It outperforms existing methods by a significant margin, e.g., 12.8%, 13.6%, and 16.7% over previous SOTA on CIFAR100-LT, FOOD101-LT, and ImageNet127 respectively. The reduced imbalance results in faster convergence and better pseudo-label accuracy of SimiS. The simplicity of our method also makes it possible to combine it with other re-balancing techniques to improve the performance further. Moreover, our method shows great robustness to a wide range of data distributions, which holds enormous potential in practice. Code will be publicly available.
Leveraging Self-Supervised Training for Unintentional Action Recognition
E. Duka, A. Kukleva and B. Schiele
Technical Report, 2022
(arXiv: 2209.11870)
Abstract
Unintentional actions are rare occurrences that are difficult to define precisely and that are highly dependent on the temporal context of the action. In this work, we explore such actions and seek to identify the points in videos where the actions transition from intentional to unintentional. We propose a multi-stage framework that exploits inherent biases such as motion speed, motion direction, and order to recognize unintentional actions. To enhance representations via self-supervised training for the task of unintentional action recognition, we propose temporal transformations, called Temporal Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The multi-stage approach models the temporal information on both the level of individual frames and full clips. These enhanced representations show strong performance for unintentional action recognition tasks. We provide an extensive ablation study of our framework and report results that significantly improve over the state-of-the-art.
Normalization Perturbation: A Simple Domain Generalization Method for Real-World Domain Shifts
Q. Fan, M. Segu, Y.-W. Tai, F. Yu, C.-K. Tang, B. Schiele and D. Dai
Technical Report, 2022
(arXiv: 2211.04393)
Abstract
Improving a model's generalizability against domain shifts is crucial, especially for safety-critical applications such as autonomous driving. Real-world domain styles can vary substantially due to environment changes and sensor noise, but deep models only know the training domain style. Such a domain style gap impedes model generalization on diverse real-world domains. Our proposed Normalization Perturbation (NP) can effectively overcome this domain style overfitting problem. We observe that this problem is mainly caused by the biased distribution of low-level features learned in shallow CNN layers. Thus, we propose to perturb the channel statistics of source domain features to synthesize various latent styles, so that the trained deep model can perceive diverse potential domains and generalize well even without observations of target domain data in training. We further explore the style-sensitive channels for effective style synthesis. Normalization Perturbation only relies on a single source domain and is surprisingly effective and extremely easy to implement. Extensive experiments verify the effectiveness of our method for generalizing models under real-world domain shifts.
Visually Plausible Human-Object Interaction Capture from Wearable Sensors
V. Guzov, T. Sattler and G. Pons-Moll
Technical Report, 2022
(arXiv: 2205.02830)
Abstract
In their everyday lives, humans naturally modify the surrounding environment through interactions, e.g., moving a chair to sit on it. To reproduce such interactions in virtual spaces (e.g., metaverse), we need to be able to capture and model them, including changes in the scene geometry, ideally from ego-centric input alone (head camera and body-worn inertial sensors). This is an extremely hard problem, especially since the object/scene might not be visible from the head camera (e.g., a human not looking at a chair while sitting down, or not looking at the door handle while opening a door). In this paper, we present HOPS, the first method to capture interactions such as dragging objects and opening doors from ego-centric data alone. Central to our method is reasoning about human-object interactions, which allows tracking objects even when they are not visible from the head camera. HOPS localizes and registers both the human and the dynamic object in a pre-scanned static scene. HOPS is an important first step towards advanced AR/VR applications based on immersive virtual universes, and can provide human-centric training data to teach machines to interact with their surroundings. The supplementary video, data, and code will be available on our project page at http://virtualhumans.mpi-inf.mpg.de/hops/
Lifted Edges as Connectivity Priors for Multicut and Disjoint Paths
A. Horňáková
PhD Thesis, Universität des Saarlandes, 2022
L. Hoyer, D. Dai, H. Wang and L. Van Gool
Technical Report, 2022
(arXiv: 2212.01322)
Deep Gradient Learning for Efficient Camouflaged Object Detection
G.-P. Ji, D.-P. Fan, Y.-C. Chou, D. Dai, A. Liniger and L. Van Gool
Technical Report, 2022
(arXiv: 2205.12853)
Abstract
This paper introduces DGNet, a novel deep framework that exploits object gradient supervision for camouflaged object detection (COD). It decouples the task into two connected branches, i.e., a context and a texture encoder. The essential connection is the gradient-induced transition, representing a soft grouping between context and texture features. Benefiting from the simple but efficient framework, DGNet outperforms existing state-of-the-art COD models by a large margin. Notably, our efficient version, DGNet-S, runs in real-time (80 fps) and achieves comparable results to the cutting-edge model JCSOD-CVPR$_{21}$ with only 6.82% parameters. Application results also show that the proposed DGNet performs well in polyp segmentation, defect detection, and transparent object segmentation tasks. Codes will be made available at https://github.com/GewelsJI/DGNet.
Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation
V. Lazova, V. Guzov, K. Olszewski, S. Tulyakov and G. Pons-Moll
Technical Report, 2022
(arXiv: 2204.10850)
Abstract
We present a novel method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis. While NeRF-based approaches are effective for novel view synthesis, such models memorize the radiance for every point in a scene within a neural network. Since these models are scene-specific and lack a 3D scene representation, classical editing such as shape manipulation, or combining scenes is not possible. Hence, editing and combining NeRF-based scenes has not been demonstrated. With the aim of obtaining interpretable and controllable scene representations, our model couples learnt scene-specific feature volumes with a scene agnostic neural rendering network. With this hybrid representation, we decouple neural rendering from scene-specific geometry and appearance. We can generalize to novel scenes by optimizing only the scene-specific 3D feature representation, while keeping the parameters of the rendering network fixed. The rendering function learnt during the initial training stage can thus be easily applied to new scenes, making our approach more flexible. More importantly, since the feature volumes are independent of the rendering model, we can manipulate and combine scenes by editing their corresponding feature volumes. The edited volume can then be plugged into the rendering model to synthesize high-quality novel views. We demonstrate various scene manipulations, including mixing scenes, deforming objects and inserting objects into scenes, while still producing photo-realistic results.
Discovering Class-Specific GAN Controls for Semantic Image Synthesis
E. Schönfeld, J. Borges, V. Sushko, B. Schiele and A. Khoreva
Technical Report, 2022
(arXiv: 2212.01455)
Abstract
Prior work has extensively studied the latent space structure of GANs for unconditional image synthesis, enabling global editing of generated images by the unsupervised discovery of interpretable latent directions. However, the discovery of latent directions for conditional GANs for semantic image synthesis (SIS) has remained unexplored. In this work, we specifically focus on addressing this gap. We propose a novel optimization method for finding spatially disentangled class-specific directions in the latent space of pretrained SIS models. We show that the latent directions found by our method can effectively control the local appearance of semantic classes, e.g., changing their internal structure, texture or color independently from each other. Visual inspection and quantitative evaluation of the discovered GAN controls on various datasets demonstrate that our method discovers a diverse set of unique and semantically meaningful latent directions for class-specific edits.
MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge -- Motion Prediction
S. Shi, L. Jiang, D. Dai and B. Schiele
Technical Report, 2022
(arXiv: 2209.10033)
Abstract
In this report, we present the 1st place solution for the motion prediction track of the 2022 Waymo Open Dataset Challenges. We propose a novel Motion Transformer framework for multimodal motion prediction, which introduces a small set of novel motion query pairs for generating better multimodal future trajectories by jointly performing intention localization and iterative motion refinement. A simple model ensemble strategy with non-maximum-suppression is adopted to further boost the final performance. Our approach achieves 1st place on the motion prediction leaderboard of the 2022 Waymo Open Dataset Challenges, outperforming other methods by remarkable margins. Code will be available at https://github.com/sshaoshuai/MTR.
Understanding and Improving Robustness and Uncertainty Estimation in Deep Learning
D. Stutz
PhD Thesis, Universität des Saarlandes, 2022
Structured Prediction Problem Archive
P. Swoboda, A. Horňáková, P. Rötzer, B. Savchynskyy and A. Abbas
Technical Report, 2022
(arXiv: 2202.03574)
Abstract
Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, descriptions of the considered problems and problem formats, and a short summary of problem characteristics including size, number of instances etc. For reference we also give a non-exhaustive selection of algorithms proposed in the literature for their solution. We hope that this central repository will make benchmarking and comparison to established works easier. We welcome submission of interesting new datasets and algorithms for inclusion in our archive.
On Fragile Features and Batch Normalization in Adversarial Training
N. P. Walter, D. Stutz and B. Schiele
Technical Report, 2022
(arXiv: 2204.12393)
Abstract
Modern deep learning architectures utilize batch normalization (BN) to stabilize training and improve accuracy. It has been shown that the BN layers alone are surprisingly expressive. In the context of robustness against adversarial examples, however, BN is argued to increase vulnerability. That is, BN helps to learn fragile features. Nevertheless, BN is still used in adversarial training, which is the de-facto standard to learn robust features. In order to shed light on the role of BN in adversarial training, we investigate to what extent the expressiveness of BN can be used to robustify fragile features in comparison to random features. On CIFAR10, we find that adversarially fine-tuning just the BN layers can result in non-trivial adversarial robustness. Adversarially training only the BN layers from scratch, in contrast, is not able to convey meaningful adversarial robustness. Our results indicate that fragile features can be used to learn models with moderate adversarial robustness, while random features cannot.
Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes
Y.-H. Wu, D. Zhang, L. Zhang, X. Zhan, D. Dai, Y. Liu and M.-M. Cheng
Technical Report, 2022
(arXiv: 2208.08621)
Abstract
Current efficient LiDAR-based detection frameworks are lacking in exploiting object relations, which are naturally present in both spatial and temporal manners. To this end, we introduce a simple, efficient, and effective two-stage detector, termed Ret3D. At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules to capture the spatial and temporal relations, respectively. More specifically, the intra-frame relation module (IntraRM) encapsulates the intra-frame objects into a sparse graph and thus allows us to refine the object features through efficient message passing. On the other hand, the inter-frame relation module (InterRM) densely connects each object in its corresponding tracked sequences dynamically, and leverages such temporal information to further enhance its representations efficiently through a lightweight transformer network. We instantiate our novel designs of IntraRM and InterRM with general center-based or anchor-based detectors and evaluate them on the Waymo Open Dataset (WOD). With negligible extra overhead, Ret3D achieves state-of-the-art performance, being 5.5% and 3.2% higher than the recent competitor in terms of the LEVEL 1 and LEVEL 2 mAPH metrics on vehicle detection, respectively.
Tracking Human Object Interaction from Single RGB Camera
X. Xie
PhD Thesis, Universität des Saarlandes, 2022
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
K. Zhou, B. Lal Bhatnagar, J. E. Lenssen and G. Pons-Moll
Technical Report, 2022
(arXiv: 2205.07982)
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous works focus on static grasps and contacts. At the core of our method are TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects. We will release our code and trained model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch/
Hypergraph Transformer for Skeleton-based Action Recognition
Y. Zhou, C. Li, Z.-Q. Cheng, Y. Geng, X. Xie and M. Keuper
Technical Report, 2022
(arXiv: 2211.09590)
Abstract
Skeleton-based action recognition aims to predict human actions given human joint coordinates with skeletal interconnections. To model such off-grid data points and their co-occurrences, Transformer-based formulations would be a natural choice. However, Transformers still lag behind state-of-the-art methods using graph convolutional networks (GCNs). Transformers assume that the input is permutation-invariant and homogeneous (partially alleviated by positional encoding), which ignores an important characteristic of skeleton data, i.e., bone connectivity. Furthermore, each type of body joint has a clear physical meaning in human motion, i.e., motion retains an intrinsic relationship regardless of the joint coordinates, which is not explored in Transformers. In fact, certain re-occurring groups of body joints are often involved in specific actions, such as the subconscious hand movement for keeping balance. Vanilla attention is incapable of describing such underlying relations that are persistent and beyond pair-wise. In this work, we aim to exploit these unique aspects of skeleton data to close the performance gap between Transformers and GCNs. Specifically, we propose a new self-attention (SA) extension, named Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order relations into the model. The K-hop relative positional embeddings are also employed to take bone connectivity into account. We name the resulting model Hyperformer, and it achieves comparable or better performance w.r.t. accuracy and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the significantly improved performance reached by our Hyperformer demonstrates the underestimated potential of Transformer models in this field.
2021
Real-time Deep Dynamic Characters
M. Habermann, L. Liu, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2021), Volume 40, Number 4, 2021
Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach
A. Abbas and P. Swoboda
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021
Fine-Grained Zero-Shot Learning with DNA as Side Information
S. Badirli, Z. Akata, G. Mohler, C. Picard and M. M. Dundar
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021
RMM: Reinforced Memory Management for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021
Shape your Space: A Gaussian Mixture Regularization Approach to Deterministic Autoencoders
A. Saseendran, K. Skubch, S. Falkner and M. Keuper
Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), 2021
Monocular 3D Multi-Person Pose Estimation via Predicting Factorized Correction Factors
Y. Guo, L. Ma, Z. Li, X. Wang and F. Wang
Computer Vision and Image Understanding, Volume 213, 2021
Learning to Teach and Learn for Semi-supervised Few-shot Image Classification
X. Li, J. Huang, Y. Liu, Q. Zhou, S. Zheng, B. Schiele and Q. Sun
Computer Vision and Image Understanding, Volume 212, 2021
mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets
R. Gong, D. Dai, Y. Chen, W. Li and L. Van Gool
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather
M. Hahner, C. Sakaridis, D. Dai and L. Van Gool
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths
A. Horňáková, T. Kaiser, P. Swoboda, M. Rolinek, B. Rosenhahn and R. Henschel
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
M. Kayser, O.-M. Camburu, L. Salewski, C. Emde, V. Do, Z. Akata and T. Lukasiewicz
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Keep CALM and Improve Visual Feature Attribution
J. M. Kim, J. Choe, Z. Akata and S. J. Oh
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting
A. Kukleva, H. Kuehne and B. Schiele
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection
F. Rezaeianaran, R. Shetty, R. Aljundi, D. O. Reino, S. Zhang and B. Schiele
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding
C. Sakaridis, D. Dai and L. Van Gool
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Relating Adversarially Robust Generalization to Flat Minima
D. Stutz, M. Hein and B. Schiele
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
G. Sun, T. Probst, D. P. Paudel, N. Popovic, M. Kanakis, J. Patel, D. Dai and L. Van Gool
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing
G. Tiwari, N. Sarafianos, T. Tung and G. Pons-Moll
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
Q. Wang, D. Dai, L. Hoyer, L. Van Gool and O. Fink
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data
N. Yu, V. Skripniuk, S. Abdelnabi and M. Fritz
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Dual Contrastive Loss and Attention for GANs
N. Yu, G. Liu, A. Dundar, A. Tao, B. Catanzaro, L. Davis and M. Fritz
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
End-to-End Urban Driving by Imitating a Reinforcement Learning Coach
Z. Zhang, A. Liniger, D. Dai, F. Yu and L. Van Gool
ICCV 2021, IEEE/CVF International Conference on Computer Vision, 2021
Learning Decision Trees Recurrently Through Communication
S. Alaniz, D. Marcos, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
A. Bhattacharyya, D. O. Reino, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Convolutional Dynamic Alignment Networks for Interpretable Classifications
M. D. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
Y. Chen, Y. Xian, A. S. Koepke and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Stereo Radiance Fields (SRF): Learning View Synthesis from Sparse Views of Novel Scenes
J. Chibane, A. Bansal, V. Lazova and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Learning Spatially-Variant MAP Models for Non-blind Image Deblurring
J. Dong, S. Roth and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors
V. Guzov, A. Mir, T. Sattler and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Adaptive Aggregation Networks for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Open World Compositional Zero-Shot Learning
M. Mancini, M. F. Naeem, Y. Xian and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Learning Graph Embeddings for Compositional Zero-shot Learning
M. F. Naeem, Y. Xian, F. Tombari and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
SMPLicit: Topology-aware Generative Model for Clothed People
G. Pons-Moll, F. Moreno-Noguer, E. Corona, A. Pumarola and G. Alenyà
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
D-NeRF: Neural Radiance Fields for Dynamic Scenes
A. Pumarola, E. Corona, G. Pons-Moll and F. Moreno-Noguer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs
H.-P. Wang, N. Yu and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Deep Outlier Handling for Image Deblurring
J. Dong and J. Pan
IEEE Transactions on Image Processing, Volume 30, 2021
Y. Liu, Q. Sun, X. He, A.-A. Liu, Y. Su and T.-S. Chua
IEEE Transactions on Neural Networks and Learning Systems, Volume 32, Number 6, 2021
A Deeper Look into DeepCap
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Number 4, 2021
Abstract
Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of DeepCap where we provide more detailed explanations, comparisons and results as well as applications.
Future Moment Assessment for Action Query
Q. Ke, M. Fritz and B. Schiele
IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021
Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences
R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox and H. Kuehne
IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021
EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data
T. Nguyen, D. H. M. Nguyen, H. Nguyen, B. T. Nguyen and B. A. Wade
Information Sciences, Volume 567, 2021
You Only Need Adversarial Supervision for Semantic Image Synthesis
E. Schönfeld, V. Sushko, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Conference on Learning Representations (ICLR 2021), 2021
Norm-Aware Embedding for Efficient Person Search and Tracking
D. Chen, S. Zhang, J. Yang and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
Guest Editorial: Special Issue on “Computer Vision for All Seasons: Adverse Weather and Lighting Conditions”
D. Dai, R. T. Tan, V. Patel, J. Matas, B. Schiele and L. Van Gool
International Journal of Computer Vision, Volume 129, 2021
DLOW: Domain Flow and Applications
R. Gong, W. Li, Y. Chen, D. Dai and L. Van Gool
International Journal of Computer Vision, Volume 129, 2021
Semantic Bottlenecks: Quantifying and Improving Inspectability of Deep Representations
M. Losch, M. Fritz and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
Guided Attention in CNNs for Occluded Pedestrian Detection and Re-identification
S. Zhang, D. Chen, J. Yang and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
SampleFix: Learning to Correct Programs by Sampling Diverse Fixes
H. Hajipour, A. Bhattacharyya, C.-A. Staicu and M. Fritz
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), 2021
DARTS for Inverse Problems: a Study on Stability
J. Geiping, J. Lukasik, M. Keuper and M. Moeller
NeurIPS 2021 Workshop on Deep Learning and Inverse Problems (NeurIPS 2021 Deep Inverse Workshop), 2021
Internalized Biases in Fréchet Inception Distance
S. Jung and M. Keuper
NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2021 Workshop DistShift), 2021
(SP)²Net for Generalized Zero-Label Semantic Segmentation
A. Das, Y. Xian, Y. He, B. Schiele and Z. Akata
Pattern Recognition (GCPR 2021), 2021
Revisiting Consistency Regularization for Semi-supervised Learning
Y. Fan, A. Kukleva and B. Schiele
Pattern Recognition (GCPR 2021), 2021
Efficient Message Passing for 0–1 ILPs with Binary Decision Diagrams
J.-H. Lange and P. Swoboda
Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021
Bit Error Robustness for Energy-Efficient DNN Accelerators
D. Stutz, N. Chandramoorthy, M. Hein and B. Schiele
Proceedings of the 4th MLSys Conference, 2021
Abstract
Deep neural network (DNN) accelerators have received considerable attention in<br>recent years due to the energy they save compared to mainstream hardware. Low-voltage<br>operation of DNN accelerators further reduces energy consumption<br>significantly but causes bit-level failures in the memory storing the<br>quantized DNN weights. In this paper, we show that a combination of robust<br>fixed-point quantization, weight clipping, and random bit error training<br>(RandBET) improves robustness against random bit errors in (quantized) DNN<br>weights significantly. This leads to high energy savings from both low-voltage<br>operation as well as low-precision quantization. Our approach generalizes<br>across operating voltages and accelerators, as demonstrated on bit errors from<br>profiled SRAM arrays. We also discuss why weight clipping alone is already a<br>quite effective way to achieve robustness against bit errors. Moreover, we<br>specifically discuss the involved trade-offs regarding accuracy, robustness and<br>precision: Without losing more than 1% in accuracy compared to a normally<br>trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher<br>energy savings of, e.g., 30%, are possible at the cost of 2.5% accuracy, even<br>for 4-bit DNNs.<br>
Compositional Mixture Representations for Vision and Text
S. Alaniz, M. Federici and Z. Akata
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), 2021
Probabilistic Compositional Embeddings for Multimodal Image Retrieval
A. Neculai, Y. Chen and Z. Akata
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), 2021
A Closer Look at Self-training for Zero-Label Semantic Segmentation
G. Pastore, F. Cermelli, Y. Xian, M. Mancini, Z. Akata and B. Caputo
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), 2021
InfoScrub: Towards Attribute Privacy by Targeted Obfuscation
H.-P. Wang, T. Orekondy and M. Fritz
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), 2021
Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis
Y. He, N. Yu, M. Keuper and M. Fritz
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI 2021), 2021
Spectral Distribution Aware Image Generation
S. Jung and M. Keuper
Thirty-Fifth AAAI Conference on Artificial Intelligence Technical Tracks 2, 2021
FastDOG: Fast Discrete Optimization on GPU
A. Abbas and P. Swoboda
Technical Report, 2021
(arXiv: 2111.10270)
Abstract
We present a massively parallel Lagrange decomposition method for solving 0-1<br>integer linear programs occurring in structured prediction. We propose a new<br>iterative update scheme for solving the Lagrangean dual and a perturbation<br>technique for decoding primal solutions. For representing subproblems we follow<br>Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and<br>dual algorithms require little synchronization between subproblems and<br>optimization over BDDs needs only elementary operations without complicated<br>control flow. This allows us to exploit the parallelism offered by GPUs for all<br>components of our method. We present experimental results on combinatorial<br>problems from MAP inference for Markov Random Fields, quadratic assignment and<br>cell tracking for developmental biology. Our highly parallel GPU implementation<br>improves upon the running times of the algorithms from Lange et al. (2021) by<br>up to an order of magnitude. In particular, we come close to or outperform some<br>state-of-the-art specialized heuristics while being problem agnostic.<br>
Long-term future prediction under uncertainty and multi-modality
A. Bhattacharyya
PhD Thesis, Universität des Saarlandes, 2021
Optimising for Interpretability: Convolutional Dynamic Alignment Networks
M. D. Böhle, M. Fritz and B. Schiele
Technical Report, 2021
(arXiv: 2109.13004)
Abstract
We introduce a new family of neural network models called Convolutional<br>Dynamic Alignment Networks (CoDA Nets), which are performant classifiers with a<br>high degree of inherent interpretability. Their core building blocks are<br>Dynamic Alignment Units (DAUs), which are optimised to transform their inputs<br>with dynamically computed weight vectors that align with task-relevant<br>patterns. As a result, CoDA Nets model the classification prediction through a<br>series of input-dependent linear transformations, allowing for linear<br>decomposition of the output into individual input contributions. Given the<br>alignment of the DAUs, the resulting contribution maps align with<br>discriminative input patterns. These model-inherent decompositions are of high<br>visual quality and outperform existing attribution methods under quantitative<br>metrics. Further, CoDA Nets constitute performant classifiers, achieving on par<br>results to ResNet and VGG models on e.g. CIFAR-10 and TinyImagenet. Lastly,<br>CoDA Nets can be combined with conventional neural network models to yield<br>powerful classifiers that more easily scale to complex datasets such as<br>Imagenet whilst exhibiting an increased interpretable depth, i.e., the output<br>can be explained well in terms of contributions from intermediate layers within<br>the network.<br>
Where and When: Space-Time Attention for Audio-Visual Explanations
Y. Chen, T. Hummel, A. S. Koepke and Z. Akata
Technical Report, 2021
(arXiv: 2105.01517)
Abstract
Explaining the decision of a multi-modal decision-maker requires determining<br>the evidence from both modalities. Recent advances in XAI provide explanations<br>for models trained on still images. However, when it comes to modeling multiple<br>sensory modalities in a dynamic world, it remains underexplored how to<br>demystify the dynamics of a complex multi-modal model. In this work,<br>we take a crucial step forward and explore learnable explanations for<br>audio-visual recognition. Specifically, we propose a novel space-time attention<br>network that uncovers the synergistic dynamics of audio and visual data over<br>both space and time. Our model is capable of predicting the audio-visual video<br>events, while justifying its decision by localizing where the relevant visual<br>cues appear, and when the predicted sounds occur in videos. We benchmark our<br>model on three audio-visual video event datasets, comparing extensively to<br>multiple recent multi-modal representation learners and intrinsic explanation<br>models. Experimental results demonstrate the clearly superior performance of our<br>model over the existing methods on audio-visual video event recognition.<br>Moreover, we conduct an in-depth study to analyze the explainability of our<br>model based on robustness analysis via perturbation tests and pointing games<br>using human annotations.<br>
(SP)²Net for Generalized Zero-Label Semantic Segmentation
A. Das
PhD Thesis, Universität des Saarlandes, 2021
Self-Supervised Representation Learning to Recognize Unintentional Actions
E. Duka
PhD Thesis, Universität des Saarlandes, 2021
R. Gong, M. Danelljan, D. Dai, W. Wang, D. P. Paudel, A. Chhatkuli, F. Yu and L. Van Gool
Technical Report, 2021
(arXiv: 2109.04813)
Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
L. Hoyer, D. Dai, Q. Wang, Y. Chen and L. Van Gool
Technical Report, 2021
(arXiv: 2108.12545)
Abstract
Training deep networks for semantic segmentation requires large amounts of<br>labeled training data, which presents a major challenge in practice, as<br>labeling segmentation masks is a highly labor-intensive process. To address<br>this issue, we present a framework for semi-supervised and domain-adaptive<br>semantic segmentation, which is enhanced by self-supervised monocular depth<br>estimation (SDE) trained only on unlabeled image sequences.<br> In particular, we utilize SDE as an auxiliary task comprehensively across the<br>entire learning framework: First, we automatically select the most useful<br>samples to be annotated for semantic segmentation based on the correlation of<br>sample diversity and difficulty between SDE and semantic segmentation. Second,<br>we implement a strong data augmentation by mixing images and labels using the<br>geometry of the scene. Third, we transfer knowledge from features learned<br>during SDE to semantic segmentation by means of transfer and multi-task<br>learning. And fourth, we exploit additional labeled synthetic data with<br>Cross-Domain DepthMix and Matching Geometry Sampling to align synthetic and<br>real data.<br> We validate the proposed model on the Cityscapes dataset, where all four<br>contributions demonstrate significant performance gains, and achieve<br>state-of-the-art results for semi-supervised semantic segmentation as well as<br>for semi-supervised domain adaptation. In particular, with only 1/30 of the<br>Cityscapes labels, our method achieves 92% of the fully-supervised baseline<br>performance and even 97% when exploiting additional data from GTA. The source<br>code is available at<br>https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.<br>
Learning Graph Embeddings for Open World Compositional Zero-Shot Learning
M. Mancini, M. F. Naeem, Y. Xian and Z. Akata
Technical Report, 2021
(arXiv: 2105.01017)
Abstract
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions<br>of state and object visual primitives seen during training. A problem with<br>standard CZSL is the assumption of knowing which unseen compositions will be<br>available at test time. In this work, we overcome this assumption operating on<br>the open world setting, where no limit is imposed on the compositional space at<br>test time, and the search space contains a large number of unseen compositions.<br>To address this problem, we propose a new approach, Compositional Cosine Graph<br>Embeddings (Co-CGE), based on two principles. First, Co-CGE models the<br>dependency between states, objects and their compositions through a graph<br>convolutional neural network. The graph propagates information from seen to<br>unseen concepts, improving their representations. Second, since not all unseen<br>compositions are equally feasible, and less feasible ones may damage the<br>learned representations, Co-CGE estimates a feasibility score for each unseen<br>composition, using the scores as margins in a cosine similarity-based loss and<br>as weights in the adjacency matrix of the graphs. Experiments show that our<br>approach achieves state-of-the-art performances in standard CZSL while<br>outperforming previous methods in the open world scenario.<br>
From Pixels to People
M. Omran
PhD Thesis, Universität des Saarlandes, 2021
Abstract
Humans are at the centre of a significant amount of research in computer vision.<br>Endowing machines with the ability to perceive people from visual data is an immense<br>scientific challenge with a high degree of direct practical relevance. Success in automatic<br>perception can be measured at different levels of abstraction, and this will depend on<br>which intelligent behaviour we are trying to replicate: the ability to localise persons in<br>an image or in the environment, understanding how persons are moving at the skeleton<br>and at the surface level, interpreting their interactions with the environment including<br>with other people, and perhaps even anticipating future actions. In this thesis we tackle<br>different sub-problems of the broad research area referred to as "looking at people",<br>aiming to perceive humans in images at different levels of granularity.<br>We start with bounding box-level pedestrian detection: We present a retrospective<br>analysis of methods published in the decade preceding our work, identifying various<br>strands of research that have advanced the state of the art. With quantitative experiments,<br>we demonstrate the critical role of developing better feature representations<br>and having the right training distribution. We then contribute two methods based<br>on the insights derived from our analysis: one that combines the strongest aspects of<br>past detectors and another that focuses purely on learning representations. The latter<br>method outperforms more complicated approaches, especially those based on hand-crafted<br>features. We conclude our work on pedestrian detection with a forward-looking<br>analysis that maps out potential avenues for future research.<br>We then turn to pixel-level methods: Perceiving humans requires us to both separate<br>them precisely from the background and identify their surroundings. To this end, we<br>introduce Cityscapes, a large-scale dataset for street scene understanding. This has since<br>established itself as a go-to benchmark for segmentation and detection. We additionally<br>develop methods that relax the requirement for expensive pixel-level annotations, focusing<br>on the task of boundary detection, i.e. identifying the outlines of relevant objects and<br>surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and<br>labelling to fine-grained spatial understanding. We contribute a method for recovering<br>3D human shape and pose, which marries the advantages of learning-based and model-based<br>approaches.<br>We conclude the thesis with a detailed discussion of benchmarking practices in<br>computer vision. Among other things, we argue that the design of future datasets<br>should be driven by the general goal of combinatorial robustness besides task-specific<br>considerations.
Specialized Head Relational Framework for Spatio-Temporal Action Localization
S. Paturri
PhD Thesis, Universität des Saarlandes, 2021
Adversarial Content Manipulation for Analyzing and Improving Model Robustness
R. Shetty
PhD Thesis, Universität des Saarlandes, 2021
Adversarial Robustness of Convolutional Dynamic Alignment Networks
N. Singh
PhD Thesis, Universität des Saarlandes, 2021
Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators
D. Stutz, N. Chandramoorthy, M. Hein and B. Schiele
Technical Report, 2021
(arXiv: 2104.08323)
Abstract
Deep neural network (DNN) accelerators have received considerable attention in<br>recent years due to the potential to save energy compared to mainstream<br>hardware. Low-voltage operation of DNN accelerators further reduces<br>energy consumption significantly but causes bit-level failures in the<br>memory storing the quantized DNN weights. Furthermore, DNN accelerators have<br>been shown to be vulnerable to adversarial attacks on voltage controllers or<br>individual bits. In this paper, we show that a combination of robust<br>fixed-point quantization, weight clipping, as well as random bit error training<br>(RandBET) or adversarial bit error training (AdvBET) improves robustness<br>against random or adversarial bit errors in quantized DNN weights<br>significantly. This not only leads to high energy savings for low-voltage<br>operation as well as low-precision quantization, but also improves the security of<br>DNN accelerators. Our approach generalizes across operating voltages and<br>accelerators, as demonstrated on bit errors from profiled SRAM arrays, and<br>achieves robustness against both targeted and untargeted bit-level attacks.<br>Without losing more than 0.8%/2% in test accuracy, we can reduce energy<br>consumption on CIFAR10 by 20%/30% for 8/4-bit quantization using RandBET.<br>Allowing up to 320 adversarial bit errors, AdvBET reduces test error from above<br>90% (chance level) to 26.22% on CIFAR10.<br>
K. Zhou, B. L. Bhatnagar, B. Schiele and G. Pons-Moll
Technical Report, 2021
(arXiv: 2102.01161)
Abstract
Most learning methods for 3D data (point clouds, meshes) suffer significant<br>performance drops when the data is not carefully aligned to a canonical<br>orientation. Aligning real world 3D data collected from different sources is<br>non-trivial and requires manual intervention. In this paper, we propose the<br>Adjoint Rigid Transform (ART) Network, a neural module which can be integrated<br>with a variety of 3D networks to significantly boost their performance. ART<br>learns to rotate input shapes to a learned canonical orientation, which is<br>crucial for a lot of tasks such as shape reconstruction, interpolation,<br>non-rigid registration, and latent disentanglement. ART achieves this with<br>self-supervision and a rotation equivariance constraint on predicted rotations.<br>The remarkable result is that with only self-supervision, ART facilitates<br>learning a unique canonical orientation for both rigid and nonrigid shapes,<br>which leads to a notable boost in performance of aforementioned tasks. We will<br>release our code and pre-trained models for further research.<br>
2020
Hierarchical Online Instance Matching for Person Search
D. Chen, S. Zhang, W. Ouyang, J. Yang and B. Schiele
AAAI Technical Track: Vision, 2020
Manipulating Attributes of Natural Scenes via Hallucination
L. Karacan, Z. Akata, A. Erdem and E. Erdem
ACM Transactions on Graphics, Volume 39, Number 1, 2020
XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2020), Volume 39, Number 4, 2020
LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration
B. L. Bhatnagar, C. Sminchisescu, C. Theobalt and G. Pons-Moll
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators
D. Chen, T. Orekondy and M. Fritz
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
Neural Unsigned Distance Fields for Implicit Function Learning
J. Chibane, A. Mir and G. Pons-Moll
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring
J. Dong, S. Roth and B. Schiele
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
Attribute Prototype Network for Zero-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
GAN-Leaks: A Taxonomy of Membership Inference Attacks against GANs
D. Chen, N. Yu, Y. Zhang and M. Fritz
CCS ’20, ACM SIGSAC Conference on Computer and Communications Security, 2020
Combining Implicit Function Learning and Parametric Models for 3D Human Reconstruction
B. L. Bhatnagar, C. Sminchisescu, C. Theobalt and G. Pons-Moll
Computer Vision -- ECCV 2020, 2020
Kinematic 3D Object Detection in Monocular Video
G. Brazil, G. Pons-Moll, X. Liu and B. Schiele
Computer Vision -- ECCV 2020, 2020
NASA: Neural Articulated Shape Approximation
B. Deng, J. P. Lewis, T. Jeruzalski, G. Pons-Moll, G. Hinton, M. Norouzi and A. Tagliasacchi
Computer Vision -- ECCV 2020, 2020
Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation
Y. He, S. Rahimian, B. Schiele and M. Fritz
Computer Vision -- ECCV 2020, 2020
An Ensemble of Epoch-wise Empirical Bayes for Few-shot Learning
Y. Liu, B. Schiele and Q. Sun
Computer Vision -- ECCV 2020, 2020
Towards Recognizing Unseen Categories in Unseen Domains
M. Mancini, Z. Akata, E. Ricci and B. Caputo
Computer Vision -- ECCV 2020, 2020
Deep Graph Matching via Blackbox Differentiation of Combinatorial Solvers
M. Rolínek, P. Swoboda, D. Zietlow, A. Paulus, V. Musil and G. Martius
Computer Vision -- ECCV 2020, 2020
Towards Automated Testing and Robustification by Semantic Adversarial Data Generation
R. Shetty, M. Fritz and B. Schiele
Computer Vision -- ECCV 2020, 2020
SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing
G. Tiwari, B. L. Bhatnagar, T. Tung and G. Pons-Moll
Computer Vision -- ECCV 2020, 2020
Inclusive GAN: Improving Data and Minority Coverage in Generative Models
N. Yu, K. Li, P. Zhou, J. Malik, L. Davis and M. Fritz
Computer Vision -- ECCV 2020, 2020
Unsupervised Shape and Pose Disentanglement for 3D Meshes
K. Zhou, B. L. Bhatnagar and G. Pons-Moll
Computer Vision -- ECCV 2020, 2020
Implicit Feature Networks for Texture Completion from Partial 3D Data
J. Chibane and G. Pons-Moll
Computer Vision -- ECCV Workshops 2020, 2020
Synthetic Convolutional Features for Improved Semantic Segmentation
Y. He, B. Schiele and M. Fritz
Computer Vision -- ECCV Workshops 2020, 2020
S. Rao, D. Stutz and B. Schiele
Computer Vision -- ECCV Workshops 2020, 2020
SHARP 2020: The 1st Shape Recovery from Partial Textured 3D Scans Challenge Results
A. Saint, A. Kacem, K. Cherenkova, K. Papadopoulos, J. Chibane, G. Pons-Moll, G. Gusev, D. Fofi, D. Aouada and B. Ottersten
Computer Vision -- ECCV Workshops 2020, 2020
Body Shape Privacy in Images: Understanding Privacy and Preventing Automatic Shape Extraction
H. Sattar, K. Krombholz, G. Pons-Moll and M. Fritz
Computer Vision -- ECCV Workshops 2020, 2020
Abstract
Modern approaches to pose and body shape estimation have recently achieved<br>strong performance even under challenging real-world conditions. Even from a<br>single image of a clothed person, a realistic looking body shape can be<br>inferred that captures a user's weight group and body shape type well. This<br>opens up a whole spectrum of applications -- in particular in fashion -- where<br>virtual try-on and recommendation systems can make use of these new and<br>automatized cues. However, a realistic depiction of the undressed body is<br>regarded as highly private and therefore might not be consented to by most people.<br>Hence, we ask if the automatic extraction of such information can be<br>effectively evaded. While adversarial perturbations have been shown to be<br>effective for manipulating the output of machine learning models -- in<br>particular, end-to-end deep learning approaches -- state of the art shape<br>estimation methods are composed of multiple stages. We perform the first<br>investigation of different strategies that can be used to effectively<br>manipulate the automatic shape estimation while preserving the overall<br>appearance of the original image.<br>
Generalized Many-Way Few-Shot Video Classification
Y. Xian, B. Korbar, M. Douze, B. Schiele, Z. Akata and L. Torresani
Computer Vision -- ECCV Workshops 2020, 2020
Sparse Recovery with Integrality Constraints
J.-H. Lange, M. E. Pfetsch, B. M. Seib and A. M. Tillmann
Discrete Applied Mathematics, Volume 283, 2020
Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing
V. Agarwal, R. Shetty and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Normalizing Flows With Multi-Scale Autoregressive Priors
A. Bhattacharyya, S. Mahajan, M. Fritz, B. Schiele and S. Roth
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Norm-Aware Embedding for Efficient Person Search
D. Chen, S. Zhang, J. Yang and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion
J. Chibane, T. Alldieck and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Evaluating Weakly Supervised Object Localization Methods Right
J. Choe, S. J. Oh, S. Lee, S. Chun, Z. Akata and H. Shim
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
DeepCap: Monocular Human Performance Capture Using Weak Supervision
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Learning Interactions and Relationships between Movie Characters
A. Kukleva, M. Tapaswi and I. Laptev
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Mnemonics Training: Multi-Class Incremental Learning Without Forgetting
Y. Liu, Y. Su, A.-A. Liu, B. Schiele and Q. Sun
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Learning to Dress 3D People in Generative Clothing
Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang and M. J. Black
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Learning to Transfer Texture from Clothing Images to 3D Humans
A. Mir, T. Alldieck and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style
C. Patel, Z. Liao and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
A U-Net Based Discriminator for Generative Adversarial Networks
E. Schönfeld, B. Schiele and A. Khoreva
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering
M. Keuper, S. Tang, B. Andres, T. Brox and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 42, Number 1, 2020
Person Recognition in Personal Photo Collections
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 42, Number 1, 2020
SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera
D. Tome, T. Alldieck, P. Peluse, G. Pons-Moll, L. Agapito, H. Badino and F. de la Torre
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020
DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll and Y. Liu
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 42, Number 10, 2020
Learning Robust Representations via Multi-View Information Bottleneck
M. Federici, A. Dutta, P. Forré, N. Kushman and Z. Akata
International Conference on Learning Representations (ICLR 2020), 2020
Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks
T. Orekondy, B. Schiele and M. Fritz
International Conference on Learning Representations (ICLR 2020), 2020
Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval
A. Dutta and Z. Akata
International Journal of Computer Vision, Volume 128, 2020
Deep Gaze Pooling: Inferring and Visually Decoding Search Intents from Human Gaze Fixations
H. Sattar, M. Fritz and A. Bulling
Neurocomputing, Volume 387, 2020
Haar Wavelet based Block Autoregressive Flows for Trajectories
A. Bhattacharyya, C.-N. Straehle, M. Fritz and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Analyzing the Dependency of ConvNets on Spatial Information
Y. Fan, Y. Xian, M. M. Losch and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Long-Term Anticipation of Activities with Cycle Consistency
Y. A. Farha, Q. Ke, B. Schiele and J. Gall
Pattern Recognition (GCPR 2020), 2020
On the Lifted Multicut Polytope for Trees
J.-H. Lange and B. Andres
Pattern Recognition (GCPR 2020), 2020
Semantic Bottlenecks: Quantifying & Improving Inspectability of Deep Representations
M. Losch, M. Fritz and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Long-Tailed Recognition Using Class-Balanced Experts
S. Sharma, N. Yu, M. Fritz and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Anticipating Averted Gaze in Dyadic Interactions
P. Müller, E. Sood and A. Bulling
Proceedings ETRA 2020 Full Papers, 2020
Diverse and Relevant Visual Storytelling with Scene Graph Embeddings
X. Hong, R. Shetty, A. Sayeed, K. Mehra, V. Demberg and B. Schiele
Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL 2020), 2020
Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning
A. M. G. Salem, A. Bhattacharyya, M. Backes, M. Fritz and Y. Zhang
Proceedings of the 29th USENIX Security Symposium, 2020
Lifted Disjoint Paths with Application in Multiple Object Tracking
A. Horňáková, R. Henschel, B. Rosenhahn and P. Swoboda
Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020
Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks
D. Stutz, M. Hein and B. Schiele
Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020
A Primal-Dual Solver for Large-Scale Tracking-by-Assignment
S. Haller, M. Prakash, L. Hutschenreiter, T. Pietzsch, C. Rother, F. Jug, P. Swoboda and B. Savchynskyy
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
L. Salewski, A. S. Koepke, H. P. A. Lensch and Z. Akata
xxAI -- Beyond Explainable AI (xxAI @ICML 2020), 2020
PoseTrackReID: Dataset Description
A. Doering, D. Chen, S. Zhang, B. Schiele and J. Gall
Technical Report, 2020
(arXiv: 2011.06243)
Abstract
Current datasets for video-based person re-identification (re-ID) do not<br>include structural knowledge in form of human pose annotations for the persons<br>of interest. Nonetheless, pose information is very helpful to disentangle<br>useful feature information from background or occlusion noise. Especially<br>real-world scenarios, such as surveillance, contain a lot of occlusions in<br>human crowds or by obstacles. On the other hand, video-based person re-ID can<br>benefit other tasks such as multi-person pose tracking in terms of robust<br>feature matching. For that reason, we present PoseTrackReID, a large-scale<br>dataset for multi-person pose tracking and video-based person re-ID. With<br>PoseTrackReID, we want to bridge the gap between person re-ID and multi-person<br>pose tracking. Additionally, this dataset provides a good benchmark for current<br>state-of-the-art methods on multi-frame person re-ID.<br>
Analyzing the Dependency of ConvNets on Spatial Information
Y. Fan, Y. Xian, M. M. Losch and B. Schiele
Technical Report, 2020
(arXiv: 2002.01827)
Abstract
Intuitively, image classification should profit from using spatial<br>information. Recent work, however, suggests that this might be overrated in<br>standard CNNs. In this paper, we are pushing the envelope and aim to further<br>investigate the reliance on spatial information. We propose spatial shuffling<br>and GAP+FC to destroy spatial information during both training and testing<br>phases. Interestingly, we observe that spatial information can be deleted from<br>later layers with small performance drops, which indicates spatial information<br>at later layers is not necessary for good performance. For example, test<br>accuracy of VGG-16 only drops by 0.03% and 2.66% with spatial information<br>completely removed from the last 30% and 53% layers on CIFAR100, respectively.<br>Evaluation on several object recognition datasets (CIFAR100, Small-ImageNet,<br>ImageNet) with a wide range of CNN architectures (VGG16, ResNet50, ResNet152)<br>shows an overall consistent pattern.<br>
Improved Methods and Analysis for Semantic Image Segmentation
Y. He
PhD Thesis, Universität des Saarlandes, 2020
Abstract
Modern deep learning has enabled amazing developments of computer vision in recent years (Hinton and Salakhutdinov, 2006; Krizhevsky et al., 2012). As a fundamental task, semantic segmentation aims to predict class labels for each pixel of images, which empowers machine perception of the visual world. In spite of recent successes of fully convolutional networks (Long et al., 2015), several challenges remain to be addressed. In this thesis, we focus on this topic, under different kinds of input formats and various types of scenes. Specifically, our study contains two aspects: (1) Data-driven neural modules for improved performance. (2) Leverage of datasets w.r.t. training systems with higher performances and better data privacy guarantees. In the first part of this thesis, we improve semantic segmentation by designing new modules which are compatible with existing architectures. First, we develop a spatio-temporal data-driven pooling, which brings additional information of data (i.e. superpixels) into neural networks, benefiting the training of neural networks as well as the inference on novel data. We investigate our approach in RGB-D videos for segmenting indoor scenes, where depth provides complementary cues to colors and our model performs particularly well. Second, we design learnable dilated convolutions, which are the extension of standard dilated convolutions, whose dilation factors (Yu and Koltun, 2016) need to be carefully determined by hand to obtain decent performance. We present a method to learn dilation factors together with filter weights of convolutions to avoid a complicated search of dilation factors. We conduct extensive studies on challenging street scenes, across various baselines with different complexity as well as several datasets at varying image resolutions. In the second part, we investigate how to utilize expensive training data.
First, we start from the generative modelling and study the network architectures and the learning pipeline for generating multiple examples. We aim to improve the diversity of generated examples but also to preserve the comparable quality of the examples. Second, we develop a generative model for synthesizing features of a network. With a mixture of real images and synthetic features, we are able to train a segmentation model with better generalization capability. Our approach is evaluated on different scene parsing tasks to demonstrate the effectiveness of the proposed method. Finally, we study membership inference on the semantic segmentation task. We propose the first membership inference attack system against black-box semantic segmentation models, that tries to infer if a data pair is used as training data or not. From our observations, information on training data is indeed leaking. To mitigate the leakage, we leverage our synthetic features to perform prediction obfuscations, reducing the posterior distribution gaps between a training and a testing set. Consequently, our study provides not only an approach for detecting illegal use of data, but also the foundations for a safer use of semantic segmentation models.
Towards Accurate Multi-Person Pose Estimation in the Wild
E. Insafutdinov
PhD Thesis, Universität des Saarlandes, 2020
Multicut Optimization Guarantees & Geometry of Lifted Multicuts
J.-H. Lange
PhD Thesis, Universität des Saarlandes, 2020
Learning to Transfer Texture from Clothing Images to 3D Humans
A. Mir
PhD Thesis, Universität des Saarlandes, 2020
Sensing, Interpreting, and Anticipating Human Social Behaviour in the Real World
P. Müller
PhD Thesis, Universität des Saarlandes, 2020
Understanding and Controlling Leakage in Machine Learning
T. Orekondy
PhD Thesis, Universität des Saarlandes, 2020
Long-Tailed Recognition Using Class-Balanced Experts and Diverse Ensembles
S. Sharma
PhD Thesis, Universität des Saarlandes, 2020
V. Skripniuk
PhD Thesis, Universität des Saarlandes, 2020
Learning Size Sensitive Cloth Model
G. Tiwari
PhD Thesis, Universität des Saarlandes, 2020
N. Walter
PhD Thesis, Universität des Saarlandes, 2020
Learning from Limited Labeled Data - Zero-Shot and Few-Shot Learning
Y. Xian
PhD Thesis, Universität des Saarlandes, 2020
Unsupervised Shape and Pose Disentanglement for 3D Meshes
K. Zhou
PhD Thesis, Universität des Saarlandes, 2020
2019
LiveCap: Real-time Human Performance Capture from Monocular Video
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics, Volume 38, Number 2, 2019
Modeling Conceptual Understanding in Image Reference Games
R. Corona, S. Alaniz and Z. Akata
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019
Combining Generative and Discriminative Models for Hybrid Inference
V. Garcia Satorras, Z. Akata and M. Welling
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019
Learning to Self-Train for Semi-Supervised Few-Shot Classification
X. Li, Q. Sun, Y. Liu, Q. Zhou, S. Zheng, T.-S. Chua and B. Schiele
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019
Everyday Eye Tracking for Real-World Consumer Behavior Analysis
A. Bulling and M. Wedel
A Handbook of Process Tracing Methods for Decision Research, 2019
Conditional Flow Variational Autoencoders for Structured Sequence Prediction
A. Bhattacharyya, M. Hanselmann, M. Fritz, B. Schiele and C.-N. Straehle
Bayesian Deep Learning NeurIPS 2019 Workshop, 2019
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
X. Zhang, Y. Sugano and A. Bulling
CHI 2019, CHI Conference on Human Factors in Computing Systems, 2019
XNect Demo (v2): Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, H.-P. Seidel, P. Fua, M. Elgharib, H. Rhodin, G. Pons-Moll and C. Theobalt
CVPR 2019 Demonstrations, 2019
Towards Reverse-Engineering Black-Box Neural Networks
S. J. Oh, B. Schiele and M. Fritz
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 2019
InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation
J. Steil, M. Tonsen, Y. Sugano and A. Bulling
GetMobile, Volume 23, Number 2, 2019
P. Müller and A. Bulling
ICMI ’19, International Conference on Multimodal Interaction, 2019
Abstract
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets -- an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to what extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation into the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.
Learning to Reconstruct People in Clothing from a Single RGB Camera
T. Alldieck, M. A. Magnor, B. L. Bhatnagar, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-based Image Retrieval
A. Dutta and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
In the Wild Human Pose Estimation using Explicit 2D Features and Intermediate 3D Representations
I. Habibie, W. Xu, D. Mehta, G. Pons-Moll and C. Theobalt
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Time-Conditioned Action Anticipation in One Shot
Q. Ke, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Combinatorial Persistency Criteria for Multicut and Max-Cut
J.-H. Lange, B. Andres and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Knockoff Nets: Stealing Functionality of Black-Box Models
T. Orekondy, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders
E. Schönfeld, S. Ebrahimi, S. Sinha, T. Darrell and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation
R. Shetty, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
D. Stutz, M. Hein and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Meta-Transfer Learning for Few-Shot Learning
Q. Sun, Y. Liu, T.-S. Chua and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
MAP Inference via Block-Coordinate Frank-Wolfe Algorithm
P. Swoboda and V. Kolmogorov
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
A Convex Relaxation for Multi-Graph Matching
P. Swoboda, D. Kainmüller, A. Mokarian, C. Theobalt and F. Bernard
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning
Y. Xian, S. Sharma, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Abstract
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they cannot make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems, i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strengths of VAEs and GANs and, in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.
Semantic Projection Network for Zero- and Few-Label Semantic Segmentation
Y. Xian, S. Choudhury, Y. He, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture
N. Yu, C. Barnes, E. Shechtman, S. Amirghodsi and M. Lukáč
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
SimulCap: Single-View Human Performance Capture with Cloth Simulation
T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll and Y. Liu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard
S. Abdelnabi, M. X. Huang and A. Bulling
IEEE International Conference on Systems, Man, and Cybernetics (SMC 2019), 2019
Zero-shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly
Y. Xian, C. H. Lampert, B. Schiele and Z. Akata
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 41, Number 9, 2019
Abstract
Due to the importance of zero-shot learning, i.e. classifying images where there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed upon zero-shot learning benchmark, we first define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g. pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of the state-of-the-art methods in depth, both in the classic zero-shot setting and in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area which can be taken as a basis for advancing it.
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 41, Number 1, 2019
Fashion is Taking Shape: Understanding Clothing Preference Based on Body Shape From Online Sources
H. Sattar, G. Pons-Moll and M. Fritz
2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019), 2019
360-Degree Textures of People in Clothing from a Single Image
V. Lazova, E. Insafutdinov and G. Pons-Moll
International Conference on 3D Vision, 2019
Bottleneck Potentials in Markov Random Fields
A. Abbas and P. Swoboda
International Conference on Computer Vision (ICCV 2019), 2019
Tex2Shape: Detailed Full Human Body Geometry from a Single Image
T. Alldieck, G. Pons-Moll, C. Theobalt and M. A. Magnor
International Conference on Computer Vision (ICCV 2019), 2019
Abstract
We present a simple yet effective method to infer detailed full human body shape from only a single photograph. Our model can infer full-body shape including face, hair, and clothing including wrinkles at interactive frame-rates. Results feature details even on parts that are occluded in the input image. Our main idea is to turn shape regression into an aligned image-to-image translation problem. The input to our method is a partial texture map of the visible region obtained from off-the-shelf methods. From a partial texture, we estimate detailed normal and vector displacement maps, which can be applied to a low-resolution smooth body model to add detail and clothing. Despite being trained purely with synthetic data, our model generalizes well to real-world photographs. Numerous results demonstrate the versatility and robustness of our method.
HiPPI: Higher-Order Projected Power Iterations for Scalable Multi-Matching
F. Bernard, J. Thunberg, P. Swoboda and C. Theobalt
International Conference on Computer Vision (ICCV 2019), 2019
Multi-Garment Net: Learning to Dress 3D People from Images
B. L. Bhatnagar, G. Tiwari, C. Theobalt and G. Pons-Moll
International Conference on Computer Vision (ICCV 2019), 2019
AMASS: Archive of Motion Capture as Surface Shapes
N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll and M. J. Black
International Conference on Computer Vision (ICCV 2019), 2019
Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking
S. Sharma, P. T. Varigonda, P. Bindal, A. Sharma and A. Jain
International Conference on Computer Vision (ICCV 2019), 2019
Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints
N. Yu, L. Davis and M. Fritz
International Conference on Computer Vision (ICCV 2019), 2019
Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods
A. Bhattacharyya, M. Fritz and B. Schiele
International Conference on Learning Representations (ICLR 2019), 2019
Lucid Data Dreaming for Video Object Segmentation
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
International Journal of Computer Vision, Volume 127, Number 9, 2019
Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour
M. X. Huang, J. Li, G. Ngai, H. V. Leong and A. Bulling
MM ’19, 27th ACM International Conference on Multimedia, 2019
Abstract
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.
Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders
X. Hong, E. Chang and V. Demberg
Multilingual Surface Realisation (MSR 2019), 2019
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
M. X. Huang and A. Bulling
Proceedings ETRA 2019, 2019
Abstract
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that only requires users' eye movements to reduce calibration distortion in the background while users naturally look at an interface. Our method exploits that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of a distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating the accuracy decrease of eye trackers during long-term use.
Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage
P. Müller, D. Buschek, M. X. Huang and A. Bulling
Proceedings ETRA 2019, 2019
Privacy-Aware Eye Tracking Using Differential Privacy
J. Steil, I. Hagestedt, M. X. Huang and A. Bulling
Proceedings ETRA 2019, 2019
PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features
J. Steil, M. Koelle, W. Heuten, S. Boll and A. Bulling
Proceedings ETRA 2019, 2019
Detecting Stress from Mouse-Gaze Attraction
J. Wang, E. Y. Fu, G. Ngai, H. V. Leong and M. X. Huang
Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC 2019), 2019
Gradient-Leaks: Understanding Deanonymization in Federated Learning
T. Orekondy, S. J. Oh, Y. Zhang, B. Schiele and M. Fritz
The 2nd International Workshop on Federated Learning for Data Privacy and Confidentiality (FL-NeurIPS 2019), 2019
(Accepted/in press)
Bottleneck Potentials in Markov Random Fields
A. Abbas and P. Swoboda
Technical Report, 2019
(arXiv: 1904.08080)
Abstract
We consider general discrete Markov Random Fields (MRFs) with additional bottleneck potentials which penalize the maximum (instead of the sum) over local potential value taken by the MRF-assignment. Bottleneck potentials or analogous constructions have been considered in (i) combinatorial optimization (e.g. bottleneck shortest path problem, the minimum bottleneck spanning tree problem, bottleneck function minimization in greedoids), (ii) inverse problems with $L_{\infty}$-norm regularization, and (iii) valued constraint satisfaction on the $(\min,\max)$-pre-semirings. Bottleneck potentials for general discrete MRFs are a natural generalization of the above direction of modeling work to Maximum-A-Posteriori (MAP) inference in MRFs. To this end, we propose MRFs whose objective consists of two parts: terms that factorize according to (i) $(\min,+)$, i.e. potentials as in plain MRFs, and (ii) $(\min,\max)$, i.e. bottleneck potentials. To solve the ensuing inference problem, we propose high-quality relaxations and efficient algorithms for solving them. We empirically show the efficacy of our approach on large scale seismic horizon tracking problems.
V. Agarwal
PhD Thesis, Universität des Saarlandes, 2019
Abstract
In the past few years, Visual Question Answering (VQA) has seen immense progress both in terms of accuracy and network architectures. From a simple end-to-end neural network-based architecture to complex modular architectures that incorporate interpretability and explainability, VQA has been a very dynamic area of research. Recent work has shown that, despite significant progress, VQA models are notoriously brittle to linguistic variations in the questions, wherein a small rephrasing of the question leads the VQA models to change their answer. However, the variations in the images, by editing them in a semantic fashion, have not been studied before (to the best of our knowledge). In my thesis, we explore how consistent these models are when we manipulate the images in a semantic fashion, wherein we remove objects irrelevant to answering the question from the images. Ideally, under this manipulation, the model should not change its answer. We construct consistency metrics based on how often models flip their answer. Our findings reveal that a compositional model, though having slightly lower accuracy than an attention model, is more robust to such manipulations. We also show that fine-tuning the model using the generated edited samples in a strategic manner can help make the model more consistent and robust. In the next phase, we target the task of counting in particular, wherein we hope to teach counting to the model by modulating the frequency of an object. We use the same method to generate the dataset but this time we remove the object being counted in the question, one instance at a time. Hence we expect the answer to change. We evaluate the most robust model’s predictions on this set and see a significant drop in accuracy.
We show that fine-tuning the model using the edited counting set significantly improves the performance when evaluated on our edited counting set. In addition, this edited set marginally improves the model’s accuracy on the original set.
“Best-of-Many-Samples” Distribution Matching
A. Bhattacharyya, M. Fritz and B. Schiele
Technical Report, 2019
(arXiv: 1909.12598)
Abstract
Generative Adversarial Networks (GANs) can achieve state-of-the-art sample quality in generative modelling tasks but suffer from the mode collapse problem. Variational Autoencoders (VAEs) on the other hand explicitly maximize a reconstruction-based data log-likelihood forcing it to cover all modes, but suffer from poorer sample quality. Recent works have proposed hybrid VAE-GAN frameworks which integrate a GAN-based synthetic likelihood to the VAE objective to address both the mode collapse and sample quality issues, with limited success. This is because the VAE objective forces a trade-off between the data log-likelihood and divergence to the latent prior. The synthetic likelihood ratio term also shows instability during training. We propose a novel objective with a "Best-of-Many-Samples" reconstruction cost and a stable direct estimate of the synthetic likelihood. This enables our hybrid VAE-GAN framework to achieve high data log-likelihood and low divergence to the latent prior at the same time and shows significant improvement over both hybrid VAE-GANs and plain GANs in mode coverage and quality.
Semantic Projection Network for Zero- and Few-label Semantic Segmentation
S. Choudhury
PhD Thesis, Universität des Saarlandes, 2019
Hippocampus Segmentation Combining T1 MR Images with High-Resolution T2 MR Images
A. Dima
PhD Thesis, Universität des Saarlandes, 2019
Analyzing the Dependency of ConvNets on Spatial Information
Y. Fan
PhD Thesis, Universität des Saarlandes, 2019
Black-Box Adversarial Attacks in Machine Learning
J. Klesen
PhD Thesis, Universität des Saarlandes, 2019
Texture Completion of People in Diverse Clothing
V. Lazova
PhD Thesis, Universität des Saarlandes, 2019
LCC: Learning to Customize and Combine Neural Networks for Few-Shot Learning
Y. Liu, Q. Sun, A.-A. Liu, Y. Su, B. Schiele and T.-S. Chua
Technical Report, 2019
(arXiv: 1904.08479)
Abstract
Meta-learning has been shown to be an effective strategy for few-shot learning. The key idea is to leverage a large number of similar few-shot tasks in order to meta-learn how to best initiate a (single) base-learner for novel few-shot tasks. While meta-learning how to initialize a base-learner has shown promising results, it is well known that hyperparameter settings such as the learning rate and the weighting of the regularization term are important to achieve best performance. We thus propose to also meta-learn these hyperparameters and in fact learn a time- and layer-varying scheme for learning a base-learner on novel tasks. Additionally, we propose to learn not only a single base-learner but an ensemble of several base-learners to obtain more robust results. While ensembles of learners have been shown to improve performance in various settings, this is challenging for few-shot learning tasks due to the limited number of training samples. Therefore, our approach also aims to meta-learn how to effectively combine several base-learners. We conduct extensive experiments and report top performance for five-class few-shot recognition tasks on two challenging benchmarks: miniImageNet and Fewshot-CIFAR100 (FC100).
Learning Manipulation under Physics Constraints with Visual Perception
W. Li, A. Leonardis, J. Bohg and M. Fritz
Technical Report, 2019
(arXiv: 1904.09860)
Abstract
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. In this work, we consider the problem of autonomous block stacking and explore solutions to learning manipulation under physics constraints with visual perception inherent to the task. Inspired by the intuitive physics in humans, we first present an end-to-end learning-based approach to predict stability directly from appearance, contrasting a more traditional model-based approach with explicit 3D representations and physical simulation. We study the model's behavior together with an accompanying human subject test. It is then integrated into a real-world robotic system to guide the placement of a single wood block into the scene without collapsing the existing tower structure. To further automate the process of consecutive block stacking, we present an alternative approach where the model learns the physics constraint through the interaction with the environment, bypassing the dedicated physics learning as in the former part of this work. In particular, we are interested in the type of tasks that require the agent to reach a given goal state that may be different for every new trial. Thereby we propose a deep reinforcement learning framework that learns policies for stacking tasks which are parametrized by a target structure.
Interpretability Beyond Classification Output: Semantic Bottleneck Networks
M. Losch, M. Fritz and B. Schiele
Technical Report, 2019
(arXiv: 1907.10882)
Abstract
Today's deep learning systems deliver high performance based on end-to-end training. While they deliver strong performance, these systems are hard to interpret. To address this issue, we propose Semantic Bottleneck Networks (SBN): deep networks with semantically interpretable intermediate layers that all downstream results are based on. As a consequence, the analysis on what the final prediction is based on is transparent to the engineer and failure cases and modes can be analyzed and avoided by high-level reasoning. We present a case study on street scene segmentation to demonstrate the feasibility and power of SBN. In particular, we start from a well performing classic deep network which we adapt to house a SB-Layer containing task related semantic concepts (such as object-parts and materials). Importantly, we can recover state-of-the-art performance despite a drastic dimensionality reduction from 1000s (non-semantic feature) to 10s (semantic concept) channels. Additionally, we show how the activations of the SB-Layer can be used for both the interpretation of failure cases of the network as well as for confidence prediction of the resulting output. For the first time, e.g., we show interpretable segmentation results for most predictions at over 99% accuracy.
A Novel BiLevel Paradigm for Image-to-Image Translation
L. Ma, Q. Sun, B. Schiele and L. Van Gool
Technical Report, 2019
(arXiv: 1904.09028)
Abstract
Image-to-image (I2I) translation is a pixel-level mapping that requires a large amount of paired training data and often suffers from the problems of high diversity and strong category bias in image scenes. In order to tackle these problems, we propose a novel BiLevel (BiL) learning paradigm that alternates the learning of two models, at an instance-specific (IS) and a general-purpose (GP) level, respectively. In each scene, the IS model learns to maintain the specific scene attributes. It is initialized by the GP model that learns from all the scenes to obtain the generalizable translation knowledge. This GP initialization gives the IS model an efficient starting point, thus enabling its fast adaptation to the new scene with scarce training data. We conduct extensive I2I translation experiments on human face and street view datasets. Quantitative results validate that our approach can significantly boost the performance of classical I2I translation models, such as PG2 and Pix2Pix. Our visualization results show both higher image quality and more appropriate instance-specific details, e.g., the translated image of a person looks more like that person in terms of identity.
XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll and C. Theobalt
Technical Report, 2019
(arXiv: 1907.00837)
Abstract
We present a real-time approach for multi-person 3D motion capture at over 30 fps using a single RGB camera. It operates in generic scenes and is robust to difficult occlusions both by other people and objects. Our method operates in subsequent stages. The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long and short range skip connections to improve the information flow allowing for a drastically faster network without compromising accuracy. In the second stage, a fully-connected neural network turns the possibly partial (on account of occlusion) 2D pose and 3D pose features for each subject into a complete 3D pose estimate per individual. The third stage applies space-time skeletal model fitting to the predicted 2D and 3D pose per subject to further reconcile the 2D and 3D pose, and enforce temporal coherence. Our method returns the full skeletal pose in joint angles for each subject. This is a further key distinction from previous work that neither extracted global body positions nor joint angle results of a coherent skeleton in real time for multi-person scenes. The proposed system runs on consumer hardware at a previously unseen speed of more than 30 fps given 512x320 images as input while achieving state-of-the-art accuracy, which we will demonstrate on a range of challenging real-world scenes.
Defending Membership Inference Attacks on Classification Models with Differential Privacy
S. Rahimian
PhD Thesis, Universität des Saarlandes, 2019
Shape Evasion: Preventing Body Shape Inference of Multi-Stage Approaches
H. Sattar, K. Krombholz, G. Pons-Moll and M. Fritz
Technical Report, 2019
(arXiv: 1905.11503)
Abstract
Modern approaches to pose and body shape estimation have recently achieved strong performance even under challenging real-world conditions. Even from a single image of a clothed person, a realistic looking body shape can be inferred that captures a user's weight group and body shape type well. This opens up a whole spectrum of applications -- in particular in fashion -- where virtual try-on and recommendation systems can make use of these new and automatized cues. However, a realistic depiction of the undressed body is regarded as highly private and therefore might not be consented to by most people. Hence, we ask if the automatic extraction of such information can be effectively evaded. While adversarial perturbations have been shown to be effective for manipulating the output of machine learning models -- in particular, end-to-end deep learning approaches -- state of the art shape estimation methods are composed of multiple stages. We perform the first investigation of different strategies that can be used to effectively manipulate the automatic shape estimation while preserving the overall appearance of the original image.
Intents and Preferences Prediction Based on Implicit Human Cues
H. Sattar
PhD Thesis, Universität des Saarlandes, 2019
Abstract
Visual search is an important task, and it is part of daily human life. Thus, it has been a long-standing goal in Computer Vision to develop methods aiming at analysing human search intent and preferences. As the target of the search only exists in the mind of the person, search intent prediction remains challenging for machine perception. In this thesis, we focus on advancing techniques for search target and preference prediction from implicit human cues. First, we propose a search target inference algorithm from human fixation data recorded during visual search. In contrast to previous work that has focused on individual instances as a search target in a closed world, we propose the first approach to predict the search target in open-world settings by learning the compatibility between observed fixations and potential search targets. Second, we further broaden the scope of search target prediction to categorical classes, such as object categories and attributes. However, state of the art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism – incorporating both spatial and temporal aspects of human gaze behaviour. Third, we go one step further and investigate the feasibility of combining our gaze embedding approach with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Fourth, for the first time, we study the effect of body shape on people's outfit preferences. We propose a novel and robust multi-photo approach to estimate the body shapes of each user and build a conditional model of clothing categories given body-shape. We demonstrate that in real-world data, clothing categories and body-shapes are correlated.
We show that our approach estimates a realistic looking body shape that captures a user’s weight group and body shape type, even from a single image of a clothed person. However, an accurate depiction of the naked body is considered highly private and therefore might not be consented to by most people. First, we study the perception of such technology via a user study. Then, in the last part of this thesis, we ask if the automatic extraction of such information can be effectively evaded. In summary, this thesis addresses several different tasks that aim to enable the vision system to analyse human search intent and preferences in real-world scenarios. In particular, the thesis proposes several novel ideas and models in visual search target prediction from human fixation data, studies for the first time the correlation between shape and clothing categories, opening a new direction in clothing recommendation systems, and introduces a new topic in privacy and computer vision, aimed at preventing automatic 3D shape extraction from images.
Mobile Eye Tracking for Everyone
J. Steil
PhD Thesis, Universität des Saarlandes, 2019
Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training
D. Stutz, M. Hein and B. Schiele
Technical Report, 2019
(arXiv: 1910.06259)
Abstract
Adversarial training is the standard to train models robust against adversarial examples. However, especially for complex datasets, adversarial training incurs a significant loss in accuracy and is known to generalize poorly to stronger attacks, e.g., larger perturbations or other threat models. In this paper, we introduce confidence-calibrated adversarial training (CCAT) where the key idea is to enforce that the confidence on adversarial examples decays with their distance to the attacked examples. We show that CCAT better preserves the accuracy of normal training while robustness against adversarial examples is achieved via confidence thresholding, i.e., detecting adversarial examples based on their confidence. Most importantly, in strong contrast to adversarial training, the robustness of CCAT generalizes to larger perturbations and other threat models not encountered during training. For evaluation, we extend the commonly used robust test error to our detection setting, present an adaptive attack with backtracking and allow the attacker to select, per test example, the worst-case adversarial example from multiple black- and white-box attacks. We present experimental results using $L_\infty$, $L_2$, $L_1$ and $L_0$ attacks on MNIST, SVHN and CIFAR10.
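The two ideas named in the abstract above, a training target whose confidence decays with the perturbation size and detection by confidence thresholding, can be illustrated with a minimal sketch. The power-law decay, the parameter `rho`, and the function names `ccat_target` and `is_rejected` are illustrative assumptions; the abstract does not specify them.

```python
def ccat_target(label, num_classes, delta_norm, epsilon, rho=10.0):
    """Soft target that moves from one-hot to uniform as the perturbation grows.

    Assumption: the mixing weight decays as a power of the normalized
    perturbation norm; the abstract only states that confidence should decay
    with distance to the attacked example.
    """
    lam = (1.0 - min(1.0, delta_norm / epsilon)) ** rho
    uniform = 1.0 / num_classes
    return [
        lam * (1.0 if k == label else 0.0) + (1.0 - lam) * uniform
        for k in range(num_classes)
    ]


def is_rejected(probabilities, threshold):
    """Confidence thresholding: flag an input whose top probability is too low."""
    return max(probabilities) < threshold


# Hypothetical usage with 4 classes and a perturbation budget of 0.3.
clean = ccat_target(label=1, num_classes=4, delta_norm=0.0, epsilon=0.3)
attacked = ccat_target(label=1, num_classes=4, delta_norm=0.3, epsilon=0.3)
print(clean, is_rejected(clean, threshold=0.5))
print(attacked, is_rejected(attacked, threshold=0.5))
```

Under these assumptions, an unperturbed input keeps its one-hot target, while an input perturbed up to the full budget is trained towards a uniform distribution and therefore falls below any threshold above 1/num_classes.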
2018
Sequential Attacks on Agents for Long-Term Adversarial Goals
E. Tretschk, S. J. Oh and M. Fritz
2nd ACM Computer Science in Cars Symposium (CSCS 2018), 2018
Detailed Human Avatars from Monocular Video
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt and G. Pons-Moll
3DV 2018, International Conference on 3D Vision, 2018
Single-Shot Multi-person 3D Pose Estimation from Monocular RGB
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll and C. Theobalt
3DV 2018, International Conference on 3D Vision, 2018
Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation
M. Omran, C. Lassner, G. Pons-Moll, P. Gehler and B. Schiele
3DV 2018, International Conference on 3D Vision, 2018
Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges and G. Pons-Moll
ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2018), Volume 37, Number 6, 2018
Quick Bootstrapping of a Personalized Gaze Model from Real-Use Interactions
M. X. Huang, J. Li, G. Ngai and H. Va Leong
ACM Transactions on Intelligent Systems and Technology, Volume 9, Number 4, 2018
Unsupervised Learning of Shape and Pose with Differentiable Point Clouds
E. Insafutdinov and A. Dosovitskiy
Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018
Adversarial Scene Editing: Automatic Object Removal from Weak Supervision
R. Shetty, M. Fritz and B. Schiele
Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 2018
Abstract
While great progress has been made recently in automatic image manipulation, it has been limited to object centric images like faces or structured scene datasets. In this work, we take a step towards general scene-level image editing by developing an automatic interaction-free object removal model. Our model learns to find and remove objects from general scene images using image-level labels and unpaired data in a generative adversarial network (GAN) framework. We achieve this with two key contributions: a two-stage editor architecture consisting of a mask generator and image in-painter that co-operate to remove objects, and a novel GAN based prior for the mask generator that allows us to flexibly incorporate knowledge about object shapes. We experimentally show on two datasets that our method effectively removes a wide variety of objects using weak supervision only.
VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements
M. Khamis, C. Oechsner, F. Alt and A. Bulling
AVI 2018, International Conference on Advanced Visual Interfaces, 2018
JAMI: Fast Computation of Conditional Mutual Information for ceRNA Network Analysis
A. Horňáková, M. List, J. Vreeken and M. H. Schulz
Bioinformatics, Volume 34, Number 17, 2018
Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild
M. Khamis, A. Baier, N. Henze, F. Alt and A. Bulling
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
Which one is me? Identifying Oneself on Public Displays
M. Khamis, C. Becker, A. Bulling and F. Alt
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
Training Person-Specific Gaze Estimators from Interactions with Multiple Devices
X. Zhang, M. X. Huang, Y. Sugano and A. Bulling
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
E. Wood, T. Baltrusaitis, L.-P. Morency, P. Robinson and A. Bulling
Computer Graphics Forum (Proc. EUROGRAPHICS 2018), Volume 37, Number 2, 2018
Video Object Segmentation with Language Referring Expressions
A. Khoreva, A. Rohrbach and B. Schiele
Computer Vision - ACCV 2018, 2018
NightOwls: A Pedestrians at Night Dataset
L. Neumann, M. Karg, S. Zhang, C. Scharfenberger, E. Piegert, S. Mistr, O. Prokofyeva, R. Thiel, A. Vedaldi, A. Zisserman and B. Schiele
Computer Vision - ACCV 2018, 2018
Grounding Visual Explanations
L. A. Hendricks, R. Hu, T. Darrell and Z. Akata
Computer Vision -- ECCV 2018, 2018
Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes
Y. He, B. Schiele and M. Fritz
Computer Vision -- ECCV 2018, 2018
Textual Explanations for Self-Driving Vehicles
J. Kim, A. Rohrbach, T. Darrell, J. Canny and Z. Akata
Computer Vision -- ECCV 2018, 2018
Abstract
Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller’s output, namely rationalizations. We propose a new approach to introspective explanations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to the vehicle control commands, i.e., acceleration and change of course. The controller’s attention identifies image regions that potentially influence the network’s output. Second, we use an attention-based video-to-text model to produce textual explanations of model actions. The attention maps of controller and explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong- and weak-alignment. Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at https://github.com/JinkyuKimUCB/explainable-deep-driving
A Hybrid Model for Identity Obfuscation by Face Replacement
Q. Sun, A. Tewari, W. Xu, M. Fritz, C. Theobalt and B. Schiele
Computer Vision -- ECCV 2018, 2018
Recovering Accurate 3D Human Pose in the Wild Using IMUs and a Moving Camera
T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn and G. Pons-Moll
Computer Vision -- ECCV 2018, 2018
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz and A. Leonardis
Computer Vision - ECCV 2018 Workshops, 2018
GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User
M. Khamis, A. Kienle, F. Alt and A. Bulling
DroNet’18, 4th ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, 2018
Demo of XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, H. Rhodin, W. Xu, G. Pons-Moll and C. Theobalt
ECCV 2018 Demo Sessions, 2018
A Vision-grounded Dataset for Predicting Typical Locations for Verbs
N. Mukuze, A. Rohrbach, V. Demberg and B. Schiele
Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
Eye Movements During Everyday Behavior Predict Personality Traits
S. Hoppe, T. Loetscher, S. Morey and A. Bulling
Frontiers in Human Neuroscience, Volume 12, 2018
Objects, Relationships, and Context in Visual Data
H. Zhang and Q. Sun
ICMR’18, International Conference on Multimedia Retrieval, 2018
Video Based Reconstruction of 3D People Models
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
PoseTrack: A Benchmark for Human Pose Estimation and Tracking
M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Accurate and Diverse Sampling of Sequences based on a “Best of Many” Sample Objective
A. Bhattacharyya, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty
A. Bhattacharyya, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs
E. Laude, J.-H. Lange, J. Schüpfer, C. Domokos, L. Leal-Taixé, F. R. Schmidt, B. Andres and D. Cremers
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Disentangled Person Image Generation
L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images
T. Orekondy, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Learning 3D Shape Completion from Laser Scan Data with Weak Supervision
D. Stutz and A. Geiger
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Natural and Effective Obfuscation by Head Inpainting
Q. Sun, L. Ma, S. J. Oh, L. Van Gool, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Feature Generating Networks for Zero-Shot Learning
Y. Xian, T. Lorenz, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Fooling Vision and Language Models Despite Localization and Attention Mechanism
X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell and D. Song
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll and Y. Liu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Occluded Pedestrian Detection through Guided Attention in CNNs
S. Zhang, J. Yang and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Learning to Refine Human Pose Estimation
M. Fieraru, A. Khoreva, L. Pishchulin and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), 2018
Image and Video Captioning with Augmented Neural Architectures
R. Shetty, H. R. Tavakoli and J. Laaksonen
IEEE MultiMedia, Volume 25, Number 2, 2018
M. X. Huang, J. Li, G. Ngai, H. V. Leong and K. A. Hua
IEEE Transactions on Multimedia, Volume 20, Number 7, 2018
Reflectance and Natural Illumination from Single-Material Specular Objects Using Deep Learning
S. Georgoulis, K. Rematas, T. Ritschel, E. Gavves, M. Fritz, L. Van Gool and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 8, 2018
Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification
M. Lapin, M. Hein and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 7, 2018
Discriminatively Trained Latent Ordinal Model for Video Classification
K. Sikka and G. Sharma
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 8, 2018
Towards Reaching Human Performance in Pedestrian Detection
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 4, 2018
Abstract
Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and background-versus-foreground errors. To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Other than our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of training and test annotations.
Learning 3D Shape Completion under Weak Supervision
D. Stutz and A. Geiger
International Journal of Computer Vision, Volume 128, 2018
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori and L. Fei-Fei
International Journal of Computer Vision, Volume 126, Number 2-4, 2018
Every Little Movement Has a Meaning of Its Own: Using Past Mouse Movements to Predict the Next Interaction
T. C. K. Kwok, E. Y. Fu, E. Y. Wu, M. X. Huang, G. Ngai and H.-V. Leong
IUI 2018, 23rd International Conference on Intelligent User Interfaces, 2018
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour
P. Müller, M. X. Huang and A. Bulling
IUI 2018, 23rd International Conference on Intelligent User Interfaces, 2018
Explainable AI: The New 42?
R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg and A. Holzinger
Machine Learning and Knowledge Extraction (CD-MAKE 2018), 2018
Tracing Cell Lineages in Videos of Lens-free Microscopy
M. Rempfler, V. Stierle, K. Ditzel, S. Kumar, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Analysis, Volume 48, 2018
Cross-Species Learning: A Low-Cost Approach to Learning Human Fight from Animal Fight
E. Y. Fu, M. X. Huang, H. V. Leong and G. Ngai
MM’18, 26th ACM Multimedia Conference, 2018
The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned
M. Khamis, F. Alt and A. Bulling
MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018
Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors
J. Steil, P. Müller, Y. Sugano and A. Bulling
MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018
NRST: Non-rigid Surface Tracking from Monocular Video
M. Habermann, W. Xu, H. Rhodin, M. Zollhöfer, G. Pons-Moll and C. Theobalt
Pattern Recognition (GCPR 2018), 2018
Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction
M. Barz, F. Daiber, D. Sonntag and A. Bulling
Proceedings ETRA 2018, 2018
Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimuli’s Trajectory is Partially Hidden
T. Mattusch, M. Mirzamohammad, M. Khamis, A. Bulling and F. Alt
Proceedings ETRA 2018, 2018
Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour
P. Müller, M. X. Huang, X. Zhang and A. Bulling
Proceedings ETRA 2018, 2018
Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings
S. Park, X. Zhang, A. Bulling and O. Hilliges
Proceedings ETRA 2018, 2018
Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets
J. Steil, M. X. Huang and A. Bulling
Proceedings ETRA 2018, 2018
Revisiting Data Normalization for Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano and A. Bulling
Proceedings ETRA 2018, 2018
A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation
R. Shetty, B. Schiele and M. Fritz
Proceedings of the 27th USENIX Security Symposium, 2018
Partial Optimality and Fast Lower Bounds for Weighted Correlation Clustering
J.-H. Lange, A. Karrenbauer and B. Andres
Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
A. Khan, I. Steiner, Y. Sugano, A. Bulling and R. Macdonald
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
Generating Counterfactual Explanations with Natural Language
L. A. Hendricks, R. Hu, T. Darrell and Z. Akata
Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), 2018
(arXiv: 1806.09809)
Abstract
Natural language explanations of deep neural network decisions provide an intuitive way for an AI agent to articulate a reasoning process. Current textual explanations learn to discuss class discriminative features in an image. However, it is also helpful to understand which attributes might change a classification decision if present in an image (e.g., "This is not a Scarlet Tanager because it does not have black wings."). We call such textual explanations counterfactual explanations, and propose an intuitive method to generate counterfactual explanations by inspecting which evidence in an input is missing, but might contribute to a different classification decision if present in the image. To demonstrate our method we consider a fine-grained image classification task in which we take as input an image and a counterfactual class and output text which explains why the image does not belong to a counterfactual class. We then analyze our generated counterfactual explanations both qualitatively and quantitatively using proposed automatic metrics.
Advanced Steel Microstructure Classification by Deep Learning Methods
S. M. Azimi, D. Britz, M. Engstler, M. Fritz and F. Mücklich
Scientific Reports, Volume 8, 2018
Abstract
The inner structure of a material is called its microstructure. It stores the genesis of a material and determines all its physical and chemical properties. While microstructural characterization is widespread and well known, microstructural classification is mostly done manually by human experts, which introduces large uncertainties. Since the microstructure can be a combination of different phases with complex substructures, its automatic classification is very challenging, and little work has been carried out in this field. Prior related works mostly apply features designed and engineered by experts and classify microstructures separately from the feature extraction step. Recently, deep learning methods have shown surprisingly good performance in vision applications by learning the features from data together with the classification step. In this work, we propose a deep learning method for microstructure classification, using certain microstructural constituents of low carbon steel as examples. This novel method employs pixel-wise segmentation via Fully Convolutional Neural Networks (FCNN) accompanied by a max-voting scheme. Our system achieves 93.94% classification accuracy, drastically outperforming the state-of-the-art method of 48.89% accuracy, indicating the effectiveness of pixel-wise approaches. Beyond the success presented in this paper, this line of research offers a more robust and, above all, objective way for the difficult task of steel quality assessment.
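The max-voting step mentioned in the abstract above can be read as a majority vote over per-pixel class predictions. The following is a minimal sketch of that reading only; the function name `max_vote` and the class labels in the usage example are hypothetical and not taken from the paper.

```python
from collections import Counter


def max_vote(pixel_labels):
    """Return the class predicted most often among the pixels of one object.

    Assumption: each pixel of a segmented object carries one predicted class
    label, and the object is assigned the most frequent label.
    """
    if not pixel_labels:
        raise ValueError("pixel_labels must not be empty")
    counts = Counter(pixel_labels)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical per-pixel predictions for a single object.
predictions = ["martensite", "bainite", "martensite", "martensite", "pearlite"]
print(max_vote(predictions))
```

Aggregating this way lets a few misclassified pixels be outvoted by the majority, which is one plausible reason a pixel-wise approach can yield a robust object-level decision.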
Towards Reverse-Engineering Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Sixth International Conference on Learning Representations (ICLR 2018), 2018
Long-Term Image Boundary Prediction
A. Bhattacharyya, M. Malinowski, B. Schiele and M. Fritz
Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Bottleneck Potentials in Markov Random Fields
A. Abbas
PhD Thesis, Universität des Saarlandes, 2018
Higher-order Projected Power Iterations for Scalable Multi-Matching
F. Bernard, J. Thunberg, P. Swoboda and C. Theobalt
Technical Report, 2018
(arXiv: 1811.10541)
Abstract
The matching of multiple objects (e.g. shapes or images) is a fundamental problem in vision and graphics. In order to robustly handle ambiguities, noise and repetitive patterns in challenging real-world settings, it is essential to take geometric consistency between points into account. Computationally, the multi-matching problem is difficult. It can be phrased as simultaneously solving multiple (NP-hard) quadratic assignment problems (QAPs) that are coupled via cycle-consistency constraints. The main limitations of existing multi-matching methods are that they either ignore geometric consistency and thus have limited robustness, or they are restricted to small-scale problems due to their (relatively) high computational cost. We address these shortcomings by introducing a Higher-order Projected Power Iteration method, which is (i) efficient and scales to tens of thousands of points, (ii) straightforward to implement, (iii) able to incorporate geometric consistency, and (iv) guarantees cycle-consistent multi-matchings. Experimentally we show that our approach is superior to existing methods.
Bayesian Prediction of Future Street Scenes through Importance Sampling based Optimization
A. Bhattacharyya, M. Fritz and B. Schiele
Technical Report, 2018
(arXiv: 1806.06939)
Abstract
For autonomous agents to successfully operate in the real world, anticipation of future events and states of their environment is a key competence. This problem can be formalized as a sequence prediction problem, where a number of observations are used to predict the sequence into the future. However, real-world scenarios demand a model of uncertainty of such predictions, as future states become increasingly uncertain and multi-modal -- in particular on long time horizons. This makes modelling and learning challenging. We cast state of the art semantic segmentation and future prediction models based on deep learning into a Bayesian formulation that in turn allows for a full Bayesian treatment of the prediction problem. We present a new sampling scheme for this model that draws from the success of variational autoencoders by incorporating a recognition network. In the experiments we show that our model outperforms prior work in accuracy of the predicted segmentation and provides calibrated probabilities that also better capture the multi-modal aspects of possible future states of street scenes.
Proceedings PETMEI 2018
A. Bulling, E. Kasneci and C. Lander (Eds.)
ACM, 2018
Primal-Dual Wasserstein GAN
M. Gemici, Z. Akata and M. Welling
Technical Report, 2018
(arXiv: 1805.09575)
Abstract
We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for building latent variable models of the data distribution based on the primal and the dual formulations of the optimal transport (OT) problem. We utilize the primal formulation to learn a flexible inference mechanism and to create an optimal approximate coupling between the data distribution and the generative model. In order to learn the generative model, we use the dual formulation and train the decoder adversarially through a critic network that is regularized by the approximate coupling obtained from the primal. Unlike previous methods that violate various properties of the optimal critic, we regularize the norm and the direction of the gradients of the critic function. Our model shares many of the desirable properties of auto-encoding models in terms of mode coverage and latent structure, while avoiding their undesirable averaging properties, e.g. their inability to capture sharp visual features when modeling real images. We compare our algorithm with several other generative modeling techniques that utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception Scores (IS).
MLCapsule: Guarded Offline Deployment of Machine Learning as a Service
L. Hanzlik, Y. Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes and M. Fritz
Technical Report, 2018
(arXiv: 1808.00590)
Abstract
With the widespread use of machine learning (ML) techniques, ML as a service has become increasingly popular. In this setting, an ML model resides on a server and users can query the model with their data via an API. However, if the user's input is sensitive, sending it to the server is not an option. Equally, the service provider does not want to share the model by sending it to the client for protecting its intellectual property and pay-per-query business model. In this paper, we propose MLCapsule, a guarded offline deployment of machine learning as a service. MLCapsule executes the machine learning model locally on the user's client and therefore the data never leaves the client. Meanwhile, MLCapsule offers the service provider the same level of control and security of its model as the commonly used server-side execution. In addition, MLCapsule is applicable to offline applications that require local execution. Beyond protecting against direct model access, we demonstrate that MLCapsule allows for implementing defenses against advanced attacks on machine learning models such as model stealing/reverse engineering and membership inference.
Manipulating Attributes of Natural Scenes via Hallucination
L. Karacan, Z. Akata, A. Erdem and E. Erdem
Technical Report, 2018
(arXiv: 1808.07413)
Abstract
In this study, we explore building a two-stage framework for enabling users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken at a different season (e.g. during winter), weather condition (e.g. on a cloudy day) or time of the day (e.g. at sunset). Once the scene is hallucinated with the given attributes, the corresponding look is then transferred to the input image while preserving the semantic details intact, giving a photo-realistic manipulation result. As the proposed framework hallucinates what the scene will look like, it does not require any reference style image as commonly utilized in most of the appearance or style transfer approaches. Moreover, it allows a given scene to be manipulated simultaneously according to a diverse set of transient attributes within a single model, eliminating the need to train multiple networks for each translation task. Our comprehensive set of qualitative and quantitative results demonstrates the effectiveness of our approach against the competing methods.
Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation
K. Z. Lin, W. Xu, Q. Sun, C. Theobalt and T.-S. Chua
Technical Report, 2018
(arXiv: 1812.09899)
Abstract
We propose a novel approach to jointly perform 3D object retrieval and pose estimation from monocular images. In order to make the method robust to real world scene variations in the images, e.g. texture, lighting and background, we learn an embedding space from 3D data that only includes the relevant information, namely the shape and pose. Our method can then be trained for robustness under real world scene variations without having to render a large training set simulating these variations. Our learned embedding explicitly disentangles a shape vector and a pose vector, which alleviates both pose bias for 3D shape retrieval and categorical bias for pose estimation. Having the learned disentangled embedding, we train a CNN to map the images to the embedding space, and then retrieve the closest 3D shape from the database and estimate the 6D pose of the object using the embedding vectors. Our method achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore outperforms the previous state-of-the-art methods on both tasks.
From Perception over Anticipation to Manipulation
W. Li
PhD Thesis, Universität des Saarlandes, 2018
Abstract
From autonomous driving cars to surgical robots, robotic systems have enjoyed significant growth over the past decade. With the rapid development in robotics alongside the evolution in the related fields, such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic systems. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate the work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realize accuracy-specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools, motivated by the absence of dexterous hands on widely available general purpose robots. Afterwards, we embed the tool use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First we explore how to guide a robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on synthetic data and then on real-world block stacking.
Further, we introduce the target stacking task where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing the need to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.
Deep Appearance Maps
M. Maximov, T. Ritschel and M. Fritz
Technical Report, 2018
(arXiv: 1804.00863)
Abstract
We propose a deep representation of appearance, i.e. the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have used deep learning to extract classic appearance representations relating to reflectance model parameters (e.g. Phong) or illumination (e.g. HDR environment maps). We suggest to directly represent appearance itself as a network we call a deep appearance map (DAM). This is a 4D generalization over 2D reflectance maps, which held the view direction fixed. First, we show how a DAM can be learned from images or video frames and later be used to synthesize appearance, given new surface orientations and viewer positions. Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn). Finally, we generalize this to an appearance estimation-and-segmentation task, where we map from an image showing multiple materials to multiple networks reproducing their appearance, as well as per-pixel segmentation.
Computational Modelling of Visual Attention during Reading
A. Nurkas
PhD Thesis, Universität des Saarlandes, 2018
Image Manipulation against Learned Models: Privacy and Security Implications
S. J. Oh
PhD Thesis, Universität des Saarlandes, 2018
Abstract
Machine learning is transforming the world. Its application areas span privacy sensitive and security critical tasks such as human identification and self-driving cars. These applications raise privacy and security related questions that are not fully understood or answered yet: Can automatic person recognisers identify people in photos even when their faces are blurred? How easy is it to find an adversarial input for a self-driving car that makes it drive off the road? This thesis contributes one of the first steps towards a better understanding of such concerns. We observe that many privacy and security critical scenarios for learned models involve input data manipulation: users obfuscate their identity by blurring their faces and adversaries inject imperceptible perturbations to the input signal. We introduce a data manipulator framework as a tool for collectively describing and analysing privacy and security relevant scenarios involving learned models. A data manipulator introduces a shift in data distribution for achieving privacy or security related goals, and feeds the transformed input to the target model. This framework provides a common perspective on the studies presented in the thesis. We begin the studies from the user’s privacy point of view. We analyse the efficacy of common obfuscation methods like face blurring, and show that they are surprisingly ineffective against state of the art person recognition systems. We then propose alternatives based on head inpainting and adversarial examples. By studying the user privacy, we also study the dual problem: model security. In model security perspective, a model ought to be robust and reliable against small amounts of data manipulation. In both cases, data are manipulated with the goal of changing the target model prediction. User privacy and model security problems can be described with the same objective. We then study the knowledge aspect of the data manipulation problem. The more one knows about the target model, the more effective manipulations one can craft. We propose a game theoretic manipulation framework to systematically represent the knowledge level on the target model and derive privacy and security guarantees. We then discuss ways to increase knowledge about a black-box model by only querying it, deriving implications that are relevant to both privacy and security perspectives.
Understanding and Controlling User Linkability in Decentralized Learning
T. Orekondy, S. J. Oh, B. Schiele and M. Fritz
Technical Report, 2018
(arXiv: 1805.05838)
Abstract
Machine Learning techniques are widely used by online services (e.g. Google, Apple) in order to analyze and make predictions on user data. As many of the provided services are user-centric (e.g. personal photo collections, speech recognition, personal assistance), user data generated on personal devices is key to provide the service. In order to protect the data and the privacy of the user, federated learning techniques have been proposed where the data never leaves the user's device and "only" model updates are communicated back to the server. In our work, we propose a new threat model that is not concerned with learning about the content - but rather is concerned with the linkability of users during such decentralized learning scenarios. We show that model updates are characteristic for users and therefore lend themselves to linkability attacks. We show identification and matching of users across devices in closed and open world scenarios. In our experiments, we find our attacks to be highly effective, achieving 20x-175x chance-level performance. In order to mitigate the risks of linkability attacks, we study various strategies. As adding random noise does not offer convincing operation points, we propose strategies based on using calibrated domain-specific data; we find these strategies offer substantial protection against linkability threats with little effect on utility.
End-to-end Learning for Graph Decomposition
J. Song, B. Andres, M. Black, O. Hilliges and S. Tang
Technical Report, 2018
(arXiv: 1812.09737)
Abstract
We propose a novel end-to-end trainable framework for the graph decomposition problem. The minimum cost multicut problem is first converted to an unconstrained binary cubic formulation where cycle consistency constraints are incorporated into the objective function. The new optimization problem can be viewed as a Conditional Random Field (CRF) in which the random variables are associated with the binary edge labels of the initial graph and the hard constraints are introduced in the CRF as high-order potentials. The parameters of a standard Neural Network and the fully differentiable CRF are optimized in an end-to-end manner. Furthermore, our method utilizes the cycle constraints as meta-supervisory signals during the learning of the deep feature representations by taking the dependencies between the output random variables into account. We present analyses of the end-to-end learned representations, showing the impact of the joint training, on the task of clustering images of MNIST. We also validate the effectiveness of our approach both for the feature learning and the final clustering on the challenging task of real-world multi-person pose estimation.
PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis
J. Steil, M. Koelle, W. Heuten, S. Boll and A. Bulling
Technical Report, 2018
(arXiv: 1801.04457)
Abstract
As first-person cameras in head-mounted displays become increasingly prevalent, so does the problem of infringing user and bystander privacy. To address this challenge, we present PrivacEye, a proof-of-concept system that detects privacy-sensitive everyday situations and automatically enables and disables the first-person camera using a mechanical shutter. To close the shutter, PrivacEye detects sensitive situations from first-person camera videos using an end-to-end deep-learning model. To open the shutter without visual input, PrivacEye uses a separate, smaller eye camera to detect changes in users' eye movements to gauge changes in the "privacy level" of the current situation. We evaluate PrivacEye on a dataset of first-person videos recorded in the daily life of 17 participants that they annotated with privacy sensitivity levels. We discuss the strengths and weaknesses of our proof-of-concept system based on a quantitative technical evaluation as well as qualitative insights from semi-structured interviews.
Gaze Estimation and Interaction in Real-World Environments
X. Zhang
PhD Thesis, Universität des Saarlandes, 2018
Abstract
Following a period of expedited progress in the capabilities of digital systems, society begins to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weakly supervised learning to rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics. 
We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.
2017
They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers
M. Khamis, L. Bandelow, S. Schick, D. Casadevall, A. Bulling and F. Alt
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging
C. Lander, S. Gehring, M. Löchtefeld, A. Bulling and A. Krüger
16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), 2017
Long-Term On-Board Prediction of Pedestrians in Traffic Scenes
A. Bhattacharyya, M. Fritz and B. Schiele
1st Conference on Robot Learning (CoRL 2017), 2017
S. Ebrahimi, A. Rohrbach and T. Darrell
1st Conference on Robot Learning (CoRL 2017), 2017
STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
ArtTrack: Articulated Multi-Person Tracking in the Wild
E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Gaze Embeddings for Zero-Shot Image Classification
N. Karessli, Z. Akata, B. Schiele and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Learning Video Object Segmentation from Static Images
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele and A. Sorkine-Hornung
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Simple Does It: Weakly Supervised Instance and Semantic Segmentation
A. Khoreva, R. Benenson, J. Hosang, M. Hein and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
InstanceCut: from Edges to Instances with MultiCut
A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy and C. Rother
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Exploiting Saliency for Object Segmentation from Image Level Labels
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Generating Descriptions with Grounded and Co-Referenced People
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Domain Based Approach to Social Relation Recognition
Q. Sun, B. Schiele and M. Fritz
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
A Message Passing Algorithm for the Minimum Cost Multicut Problem
P. Swoboda and B. Andres
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Multiple People Tracking by Lifted Multicut and Person Re-identification
S. Tang, M. Andriluka, B. Andres and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Zero-shot learning - The Good, the Bad and the Ugly
Y. Xian, B. Schiele and Z. Akata
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
CityPersons: A Diverse Dataset for Pedestrian Detection
S. Zhang, R. Benenson and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Abstract
Convnets have enabled significant progress in pedestrian detection recently, but there are still open questions regarding suitable architectures and training data. We revisit CNN design and point out key adaptations, enabling plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset. To achieve further improvement from more and better data, we introduce CityPersons, a new set of person annotations on top of the Cityscapes dataset. The diversity of CityPersons allows us for the first time to train one single CNN model that generalizes well over multiple benchmarks. Moreover, with additional training with CityPersons, we obtain top results using FasterRCNN on Caltech, improving especially for more difficult cases (heavy occlusion and small scale) and providing higher localization quality.
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), 2017
Visual Stability Prediction and Its Application to Manipulation
W. Li, A. Leonardis and M. Fritz
AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, 2017
Pose Guided Person Image Generation
L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars and L. Van Gool
Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017
ScreenGlint: Practical, In-situ Gaze Estimation on Smartphones
M. X. Huang, J. Li, G. Ngai and H. V. Leong
CHI’17, 35th Annual ACM Conference on Human Factors in Computing Systems, 2017
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
M. Klauck, Y. Sugano and A. Bulling
CHI 2017 Extended Abstracts, 2017
Lucid Data Dreaming for Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
DAVIS Challenge on Video Object Segmentation 2017, 2017
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
M. Khamis, M. Hassib, E. von Zezschwitz, A. Bulling and F. Alt
ICMI’17, 19th ACM International Conference on Multimodal Interaction, 2017
What Is Around The Camera?
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Adversarial Image Perturbation for Privacy Protection -- A Game Theory Perspective
S. J. Oh, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images
T. Orekondy, B. Schiele and M. Fritz
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Efficient Algorithms for Moral Lineage Tracing
M. Rempfler, J.-H. Lange, F. Jug, C. Blasse, E. W. Myers, B. H. Menze and B. Andres
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz and B. Schiele
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Paying Attention to Descriptions Generated by Image Captioning Models
H. R. Tavakoli, R. Shetty, A. Borji and J. Laaksonen
IEEE International Conference on Computer Vision (ICCV 2017), 2017
Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling
H. Sattar, A. Bulling and M. Fritz
2017 IEEE International Conference on Computer Vision Workshops (MBCC @ICCV 2017), 2017
Abstract
Previous work focused on predicting visual search targets from human fixations but, in the real world, a specific target is often not known, e.g. when searching for a present for a friend. In this work we instead study the problem of predicting the mental picture, i.e. only an abstract idea instead of a specific target. This task is significantly more challenging given that mental pictures of the same target category can vary widely depending on personal biases, and given that characteristic target attributes can often not be verbalised explicitly. We instead propose to use gaze information as implicit information on users' mental picture and present a novel gaze pooling layer to seamlessly integrate semantic and localized fixation information into a deep image representation. We show that we can robustly predict both the mental picture's category as well as attributes on a novel dataset containing fixation data of 14 users searching for targets on a subset of the DeepFashion dataset. Our results have important implications for future search interfaces and suggest deep gaze pooling as a general-purpose approach for gaze-supported computer vision systems.
Visual Stability Prediction for Robotic Manipulation
W. Li, A. Leonardis and M. Fritz
IEEE International Conference on Robotics and Automation (ICRA 2017), 2017
MARCOnI -- ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes
A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 3, 2017
Novel Views of Objects from a Single Image
K. Rematas, C. Nguyen, T. Ritschel, M. Fritz and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 8, 2017
Expanded Parts Model for Semantic Description of Humans in Still Images
G. Sharma, F. Jurie and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Number 1, 2017
A Compact Representation of Human Actions by Sliding Coordinate Coding
R. Ding, Q. Sun, M. Liu and H. Liu
International Journal of Advanced Robotic Systems, Volume 14, Number 6, 2017
M. Malinowski, M. Rohrbach and M. Fritz
International Journal of Computer Vision, Volume 125, Number 1-3, 2017
Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, Volume 123, Number 1, 2017
Abstract
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015.
Cell Lineage Tracing in Lens-Free Microscopy Videos
M. Rempfler, S. Kumar, V. Stierle, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Computing and Computer Assisted Intervention -- MICCAI 2017, 2017
Building Statistical Shape Spaces for 3D Human Modeling
L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt and B. Schiele
Pattern Recognition, Volume 67, 2017
Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes
Q. Sun, H. Liu and T. Harada
Pattern Recognition, Volume 64, 2017
Learning Dilation Factors for Semantic Segmentation of Street Scenes
Y. He, M. Keuper, B. Schiele and M. Fritz
Pattern Recognition (GCPR 2017), 2017
A Comparative Study of Local Search Algorithms for Correlation Clustering
E. Levinkov, A. Kirillov and B. Andres
Pattern Recognition (GCPR 2017), 2017
Look Together: Using Gaze for Assisting Co-located Collaborative Search
Y. Zhang, K. Pfeuffer, M. K. Chong, J. Alexander, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 21, Number 1, 2017
GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices
M. Khamis, R. Hasholzner, A. Bulling and F. Alt
Pervasive Displays 2017 (PerDis 2017), 2017
Analysis and Optimization of Graph Decompositions by Lifted Multicuts
A. Horňáková, J.-H. Lange and B. Andres
Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017
EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays
M. Khamis, D. Buschek, T. Thieron, F. Alt and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 4, 2017
InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation
M. Tonsen, J. Steil, Y. Sugano and A. Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Number 3, 2017
Efficiently Summarising Event Sequences with Rich Interleaving Patterns
A. Bhattacharyya and J. Vreeken
Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), 2017
Are you stressed? Your eyes and the mouse can tell
J. Wang, M. X. Huang, G. Ngai and H. V. Leong
Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), 2017
EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays
M. Khamis, A. Hoesl, A. Klimczak, M. Reiss, F. Alt and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery
X. Zhang, Y. Sugano and A. Bulling
UIST’17, 30th Annual Symposium on User Interface Software and Technology, 2017
Learning to Track Humans in Videos
M. Fieraru
PhD Thesis, Universität des Saarlandes, 2017
Analysis and Improvement of the Visual Object Detection Pipeline
J. Hosang
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects. In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. 
We present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
Learning to Segment in Images and Videos with Different Forms of Supervision
A. Khoreva
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Much progress has been made in image and video segmentation over the last years. To a large extent, the success can be attributed to the strong appearance models completely learned from data, in particular using deep learning methods. However, to perform best these methods require large representative datasets for training with expensive pixel-level annotations, which in the case of videos are prohibitive to obtain. Therefore, there is a need to relax this constraint and to consider alternative forms of supervision, which are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision. First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate pixel-level approximate groundtruth from these weaker forms of annotations to train a network, which makes it possible to achieve high-quality results comparable to the full supervision quality without any modifications of the network architecture or the training procedure. Second, we address the problem of the excessive computational and memory costs inherent to solving video segmentation via graphs. We propose approaches to improve the runtime and memory efficiency as well as the output segmentation quality by learning from the available training data the best representation of the graph. In particular, we contribute by learning must-link constraints, the topology and edge weights of the graph, and by enhancing the graph nodes - superpixels - themselves. Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data for training convolutional networks.
We introduce an architecture which allows training with static images only and propose an elaborate data synthesis scheme which creates a large number of training examples close to the target domain from the given first frame mask. With the proposed techniques we show that densely annotated consecutive video data is not necessary to achieve high-quality temporally coherent video segmentation results. In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking and contributes new ways of training convolutional networks with a limited amount of pixel-level annotated training data.
Lucid Data Dreaming for Multiple Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
Technical Report, 2017
(arXiv: 1703.09554)
Abstract
Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k ~ 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x ~ 100x less annotated data than competing methods. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach makes it possible to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the object tracking task.
Image Classification with Limited Training Data and Class Ambiguity
M. Lapin
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm. In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity where clear distinction between the classes is no longer possible. Many real world images are naturally multilabel yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in top k predictions of a learner. Our results indicate consistent improvements over the standard loss functions that put more penalty on the first incorrect prediction compared to the proposed losses. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning
W. Li, J. Bohg and M. Fritz
Technical Report, 2017
(arXiv: 1711.00267)
Abstract
Understanding physical phenomena is a key component of human intelligence and enables physical interaction with previously unseen environments. In this paper, we study how an artificial agent can autonomously acquire this intuition through interaction with the environment. We created a synthetic block stacking environment with physics simulation in which the agent can learn a policy end-to-end through trial and error. We thereby bypass the need to explicitly model physical knowledge within the policy. We are specifically interested in tasks that require the agent to reach a given goal state that may be different for every new trial. To this end, we propose a deep reinforcement learning framework that learns policies which are parametrized by a goal. We validated the model on a toy example navigating in a grid world with different target positions and in a block stacking task with different target structures of the final tower. In contrast to prior work, our policies show better generalization across different goals.
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Image
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Computer Vision has undergone major changes over the recent five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where the scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with methods, termed symbolic-based and neural-based visual question answering architectures, that address the problem. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases. It has also become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in word meaning, and various interpretations of the scene and the question.
Person Recognition in Social Media Photos
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Technical Report, 2017
(arXiv: 1710.03224)
Abstract
People nowadays share large parts of their personal lives through social media. Being able to automatically recognise people in personal photos may greatly enhance user convenience by easing photo album organisation. For the human identification task, however, the traditional focus of computer vision has been face recognition and pedestrian re-identification. Person recognition in social media photos sets new challenges for computer vision, including non-cooperative subjects (e.g. backward viewpoints, unusual poses) and great changes in appearance. To tackle this problem, we build a simple person recognition framework that leverages convnet features from multiple image regions (head, body, etc.). We propose new recognition scenarios that focus on the time and appearance gap between training and testing samples. We present an in-depth analysis of the importance of different features according to time and viewpoint generalisability. In the process, we verify that our simple approach achieves the state-of-the-art result on the PIPA benchmark, arguably the largest social media based benchmark for person recognition to date with diverse poses, viewpoints, social groups, and events. Compared to the conference version of the paper, this paper additionally presents (1) analysis of a face recogniser (DeepID2+), (2) new method naeil2 that combines the conference version method naeil and DeepID2+ to achieve state-of-the-art results even compared to post-conference works, (3) discussion of related work since the conference version, (4) additional analysis including the head viewpoint-wise breakdown of performance, and (5) results on the open-world setup.
Whitening Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Technical Report, 2017
(arXiv: 1711.01768)
Abstract
Many deployed learned models are black boxes: given an input, they return an output. Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks -- we show that the revealed internal information helps generate more effective adversarial examples against the black box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. Our paper suggests that it is actually hard to draw a line between white box and black box models.
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2017
(arXiv: 1711.07373)
Abstract
Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks of generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously. We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that supports their decision.
Generation and Grounding of Natural Language Descriptions for Visual Data
A. Rohrbach
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand videos of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at a variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach, which learns from videos and sentences to describe movie clips relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state of the art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.
Visual Decoding of Targets During Visual Search From Human Eye Fixations
H. Sattar, M. Fritz and A. Bulling
Technical Report, 2017
(arXiv: 1706.05993)
Abstract
What does human gaze reveal about a user's intents and to what extent can these intents be inferred or even visualized? Gaze was proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user's mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear if gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded only from human gaze fixations. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model's predictions using two human studies. Our results show that 62% of the time (chance level = 10%) users were able to select the categories of the decoded image correctly. In our second study we show the importance of a local gaze encoding for decoding visual search targets of user
People detection and tracking in crowded scenes
S. Tang
PhD Thesis, Universität des Saarlandes, 2017
Abstract
People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data. Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect single persons as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi-person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task at different granularities. We present a tracking model that simultaneously clusters object bounding boxes and pixel level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model for the multi-person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. 
In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundary of the state of the art, and results in superior performance.
Unconstrained Appearance-based Gaze Estimation from a Freely Moving Camera
G. Wang
PhD Thesis, Universität des Saarlandes, 2017
2016
Multi-Cue Zero-Shot Learning with Strong Supervision
Z. Akata, M. Malinowski, M. Fritz and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
CP-mtML: Coupled Projection Multi-task Metric Learning for Large Scale Face Retrieval
B. Bhattarai, G. Sharma and F. Jurie
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
The Cityscapes Dataset for Semantic Urban Scene Understanding
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Moral Lineage Tracing
F. Jug, E. Levinkov, C. Blasse, E. W. Myers and B. Andres
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Weakly Supervised Object Boundaries
A. Khoreva, R. Benenson, M. Omran, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract
State-of-the-art learning-based boundary detection methods require extensive training data. Since labelling object boundaries is one of the most expensive types of annotations, there is a need to relax the requirement to carefully annotate images, both to make training more affordable and to extend the amount of training data. In this paper we propose a technique to generate weakly supervised annotations and show that bounding box annotations alone suffice to reach high-quality object boundaries without using any object-specific boundary annotations. With the proposed weak supervision techniques we achieve the top performance on the object boundary detection task, outperforming by a large margin the current fully supervised state-of-the-art methods.
Loss Functions for Top-k Error: Analysis and Insights
M. Lapin, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Learning Deep Representations of Fine-Grained Visual Descriptions
S. Reed, Z. Akata, H. Lee and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Deep Reflectance Maps
K. Rematas, T. Ritschel, M. Fritz, E. Gavves and T. Tuytelaars
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract
Undoing the image formation process and therefore decomposing appearance into its intrinsic properties is a challenging task due to the under-constrained nature of this inverse problem. While significant progress has been made on inferring shape, materials and illumination from images only, progress in an unconstrained setting is still limited. We propose a convolutional neural architecture to estimate reflectance maps of specular materials in natural lighting conditions. We achieve this in an end-to-end learning formulation that directly predicts a reflectance map from the image itself. We show how to improve estimates by facilitating additional supervision in an indirect scheme that first predicts surface orientation and afterwards predicts the reflectance map by a learning-based sparse data interpolation. In order to analyze performance on this difficult task, we propose a new challenge of Specular MAterials on SHapes with complex IllumiNation (SMASHINg) using both synthetic and real images. Furthermore, we show the application of our method to a range of image-based editing tasks on real images.
Convexity Shape Constraints for Image Segmentation
L. A. Royer, D. L. Richmond, B. Andres and D. Kainmueller
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
LOMo: Latent Ordinal Model for Facial Analysis in Videos
K. Sikka, G. Sharma and M. Bartlett
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
End-to-end People Detection in Crowded Scenes
R. Stewart and M. Andriluka
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Latent Embeddings for Zero-shot Classification
Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
How Far are We from Solving Pedestrian Detection?
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2016), Volume 35, Number 6, 2016
Learning What and Where to Draw
S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele and H. Lee
Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016
SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull
S. Schneegass, Y. Oualil and A. Bulling
CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, 2016
Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces
P. Xu, Y. Sugano and A. Bulling
CHI 2016, 34th Annual ACM Conference on Human Factors in Computing Systems, 2016
GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices
M. Khamis, F. Alt, M. Hassib, E. von Zezschwitz, R. Hasholzner and A. Bulling
CHI 2016 Extended Abstracts, 2016
On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input
D. Kirst and A. Bulling
CHI 2016 Extended Abstracts, 2016
Abstract
Rotations performed with the index finger and thumb involve some of the most complex motor action among common multi-touch gestures, yet little is known about the factors affecting performance and ergonomics. This note presents results from a study where the angle, direction, diameter, and position of rotations were systematically manipulated. Subjects were asked to perform the rotations as quickly as possible without losing contact with the display, and were allowed to skip rotations that were too uncomfortable. The data show surprising interaction effects among the variables, and help us identify whole categories of rotations that are slow and cumbersome for users.
Pervasive Attentive User Interfaces
A. Bulling
Computer, Volume 49, Number 1, 2016
Towards Segmenting Consumer Stereo Videos: Benchmark, Baselines and Ensembles
W.-C. Chiu, F. Galasso and M. Fritz
Computer Vision -- ACCV 2016, 2016
Local Higher-order Statistics (LHS) Describing Images with Statistics of Local Non-binarized Pixel Patterns
G. Sharma and F. Jurie
Computer Vision and Image Understanding, Volume 142, 2016
An Efficient Fusion Move Algorithm for the Minimum Cost Lifted Multicut Problem
T. Beier, B. Andres, U. Köthe and F. A. Hamprecht
Computer Vision -- ECCV 2016, 2016
Generating Visual Explanations
L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele and T. Darrell
Computer Vision -- ECCV 2016, 2016
Abstract
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. We propose a novel loss function based on sampling and reinforcement learning that learns to generate sentences that realize a global sentence property, such as class specificity. Our results on a fine-grained bird species classification dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.
DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka and B. Schiele
Computer Vision -- ECCV 2016, 2016
Abstract
The goal of this paper is to advance the state of the art of articulated pose estimation in scenes with multiple people. To that end we contribute on three fronts. We propose (1) improved body part detectors that generate effective bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms that allow assembling the proposals into a variable number of consistent body part configurations; and (3) an incremental optimization strategy that explores the search space more efficiently, thus leading both to better performance and significant speed-up factors. We evaluate our approach on two single-person and two multi-person pose estimation benchmarks. The proposed approach significantly outperforms the best known multi-person pose estimation results while demonstrating competitive performance on the task of single-person pose estimation. Models and code available at http://pose.mpi-inf.mpg.de
Faceless Person Recognition: Privacy Implications in Social Media
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
Computer Vision -- ECCV 2016, 2016
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele
Computer Vision -- ECCV 2016, 2016
A 3D Morphable Eye Region Model for Gaze Estimation
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson and A. Bulling
Computer Vision -- ECCV 2016, 2016
VConv-DAE: Deep Volumetric Shape Learning Without Object Labels
A. Sharma, O. Grau and M. Fritz
Computer Vision -- ECCV 2016 Workshops, 2016
Abstract
With the advent of affordable depth sensors, 3D capture becomes more and more ubiquitous and already has made its way into commercial products. Yet, capturing the geometry or complete shapes of everyday objects using scanning devices (e.g. Kinect) still comes with several challenges that result in noise or even incomplete shapes. Recent success in deep learning has shown how to learn complex shape distributions in a data-driven way from large scale 3D CAD model collections and to utilize them for 3D processing on volumetric representations, thereby circumventing problems of topology and tessellation. Prior work has shown encouraging results on problems ranging from shape completion to recognition. We provide an analysis of such approaches and discover that training as well as the resulting representation are strongly and unnecessarily tied to the notion of object labels. Furthermore, deep learning research argues [Vincent08] that learning representations with over-complete models is more prone to overfitting compared to approaches that learn from noisy data. Thus, we investigate a fully convolutional volumetric denoising auto encoder that is trained in an unsupervised fashion. It outperforms prior work on recognition as well as more challenging tasks like denoising and shape completion. In addition, our approach is at least two orders of magnitude faster at test time and thus provides a path to scaling up 3D deep learning.
Multi-Person Tracking by Multicut and Deep Matching
S. Tang, B. Andres, M. Andriluka and B. Schiele
Computer Vision -- ECCV 2016 Workshops, 2016
Improved Image Boundaries for Better Video Segmentation
A. Khoreva, R. Benenson, F. Galasso, M. Hein and B. Schiele
Computer Vision -- ECCV 2016 Workshops, 2016
Abstract
Graph-based video segmentation methods rely on superpixels as starting point. While most previous work has focused on the construction of the graph edges and weights as well as solving the graph partitioning problem, this paper focuses on better superpixels for video segmentation. We demonstrate by a comparative analysis that superpixels extracted from boundaries perform best, and show that boundary estimation can be significantly improved via image and time domain cues. With superpixels generated from our better boundaries we observe consistent improvement for two video segmentation methods in two different datasets.
Eyewear Computing -- Augmenting the Human with Head-mounted Wearable Assistants
A. Bulling, O. Cakmakci, K. Kunze and J. M. Rehg (Eds.)
Schloss Dagstuhl, 2016
Attention, please!: Comparing Features for Measuring Audience Attention Towards Pervasive Displays
F. Alt, A. Bulling, L. Mecke and D. Buschek
DIS 2016, 11th ACM SIGCHI Designing Interactive Systems Conference, 2016
Sensing and Controlling Human Gaze in Daily Living Space for Human-Harmonized Information Environments
Y. Sato, Y. Sugano, A. Sugimoto, Y. Kuno and H. Koike
Human-Harmonized Information Technology, 2016
Smooth Eye Movement Interaction Using EOG Glasses
M. Dhuliawala, J. Lee, J. Shimizu, A. Bulling, K. Kunze, T. Starner and W. Woo
ICMI’16, 18th ACM International Conference on Multimodal Interaction, 2016
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
S. Nag Chowdhury, M. Malinowski, A. Bulling and M. Fritz
ICMR’16, ACM International Conference on Multimedia Retrieval, 2016
Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation
M. Malinowski, M. Rohrbach and M. Fritz
IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016), 2016
(Accepted/in press)
Abstract
We are addressing an open-ended question answering task about real-world images. With the help of currently available methods developed in Computer Vision and Natural Language Processing, we would like to push an architecture with a global visual representation to its limits. In our contribution, we show how to achieve competitive performance on VQA with global visual features (Residual Net) together with a carefully designed architecture.
A Joint Learning Approach for Cross Domain Age Estimation
B. Bhattarai, G. Sharma, A. Lechervy and F. Jurie
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2016), 2016
Learning to Detect Visual Grasp Affordance
H. Oh Song, M. Fritz, D. Goehring and T. Darrell
IEEE Transactions on Automation Science and Engineering, Volume 13, Number 2, 2016
Label-Embedding for Image Classification
Z. Akata, F. Perronnin, Z. Harchaoui and C. Schmid
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 7, 2016
3D Pictorial Structures Revisited: Multiple Human Pose Estimation
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab and S. Ilic
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 10, 2016
Leveraging the Wisdom of the Crowd for Fine-Grained Recognition
J. Deng, J. Krause, M. Stark and L. Fei-Fei
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 4, 2016
What Makes for Effective Detection Proposals?
J. Hosang, R. Benenson, P. Dollár and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 4, 2016
Reconstructing Curvilinear Networks using Path Classifiers and Integer Programming
E. T. Turetken, F. Benmansour, B. Andres, P. Głowacki and H. Pfister
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 38, Number 12, 2016
Combining Eye Tracking with Optimizations for Lens Astigmatism in modern wide-angle HMDs
D. Pohl, X. Zhang and A. Bulling
2016 IEEE Virtual Reality Conference (VR), 2016
Recognition of Ongoing Complex Activities by Sequence Prediction Over a Hierarchical Label Space
W. Li and M. Fritz
2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), 2016
Eyewear Computers for Human-Computer Interaction
A. Bulling and K. Kunze
Interactions, Volume 23, Number 3, 2016
Demo hour
H. Jeong, D. Saakes, U. Lee, A. Esteves, E. Velloso, A. Bulling, K. Masai, Y. Sugiura, M. Ogata, K. Kunze, M. Inami, M. Sugimoto, A. Rathnayake and T. Dias
Interactions, Volume 23, Number 1, 2016
Recognizing Fine-grained and Composite Activities Using Hand-centric Features and Script Data
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
International Journal of Computer Vision, Volume 119, Number 3, 2016
Pattern Recognition
B. Rosenhahn and B. Andres (Eds.)
Springer, 2016
Pupil Detection for Head-mounted Eye Tracking in the Wild: An Evaluation of the State of the Art
W. Fuhl, M. Tonsen, A. Bulling and E. Kasneci
Machine Vision and Applications, Volume 27, Number 8, 2016
The Minimum Cost Connected Subgraph Problem in Medical Image Analysis
M. Rempfler, B. Andres and B. H. Menze
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2016, 2016
Demo: I-Pic: A Platform for Privacy-Compliant Image Capture
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee and T. T. Wu
MobiSys’16, 4th Annual International Conference on Mobile Systems, Applications, and Services, 2016
I-Pic: A Platform for Privacy-Compliant Image Capture
P. Aditya, R. Sen, P. Druschel, S. J. Oh, R. Benenson, M. Fritz, B. Schiele, B. Bhattacharjee and T. T. Wu
MobiSys’16, 4th Annual International Conference on Mobile Systems, Applications, and Services, 2016
Long Term Boundary Extrapolation for Deterministic Motion
A. Bhattacharyya, M. Malinowski and M. Fritz
NIPS Workshop on Intuitive Physics, 2016
A Convnet for Non-maximum Suppression
J. Hosang, R. Benenson and B. Schiele
Pattern Recognition (GCPR 2016), 2016
Abstract
Non-maximum suppression (NMS) is used in virtually all state-of-the-art object detection pipelines. While essential object detection ingredients such as features, classifiers, and proposal methods have been extensively researched, surprisingly little work has aimed to systematically address NMS. The de facto standard for NMS is based on greedy clustering with a fixed distance threshold, which forces a trade-off between recall and precision. We propose a convnet designed to perform NMS of a given set of detections. We report experiments on a synthetic setup, and results on crowded pedestrian detection scenes. Our approach overcomes the intrinsic limitations of greedy NMS, obtaining better recall and precision.
Learning to Select Long-Track Features for Structure-From-Motion and Visual SLAM
J. Scheer, M. Fritz and O. Grau
Pattern Recognition (GCPR 2016), 2016
Convexification of Learning from Constraints
I. Shcherbatyi and B. Andres
Pattern Recognition (GCPR 2016), 2016
Special Issue Introduction
D. J. Cook, A. Bulling and Z. Yu
Pervasive and Mobile Computing (Proc. PerCom 2015), Volume 26, 2016
Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces
M. Barz, F. Daiber and A. Bulling
Proceedings ETRA 2016, 2016
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
M. Mansouryar, J. Steil, Y. Sugano and A. Bulling
Proceedings ETRA 2016, 2016
Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions
L. Sesma-Sanchez, Y. Zhang, H. Gellersen and A. Bulling
Proceedings ETRA 2016, 2016
Labelled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments
M. Tonsen, X. Zhang, Y. Sugano and A. Bulling
Proceedings ETRA 2016, 2016
Learning an Appearance-based Gaze Estimator from One Million Synthesised Images
E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson and A. Bulling
Proceedings ETRA 2016, 2016
F. Alt, M. Mikusz, S. Schneegass and A. Bulling
Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), 2016
EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?
M. Khamis, L. Trotter, M. Tessman, C. Dannhart, A. Bulling and F. Alt
Proceedings of the 15th International Conference on Mobile and Ubiquitous Multimedia (MUM 2016), 2016
Generative Adversarial Text to Image Synthesis
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele and H. Lee
Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
A. Mokarian Forooshani, M. Malinowski and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2016), 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell and M. Rohrbach
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016
Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze
A. L. Simeone, A. Bulling, J. Alexander and H. Gellersen
Proceedings of the 2016 International Working Conference on Advanced Visual Interfaces (AVI 2016), 2016
Commonsense in Parts: Mining Part-Whole Relations from the Web and Image Tags
N. Tandon, C. D. Hariman, J. Urbani, A. Rohrbach, M. Rohrbach and G. Weikum
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User’s Current Visual Field
D. Pohl, X. Zhang, A. Bulling and O. Grau
Proceedings VRST 2016, 2016
Visual Object Class Recognition
M. Stark, B. Schiele and A. Leonardis
Springer Handbook of Robotics, 2016
Interactive Multicut Video Segmentation
E. Levinkov, J. Tompkin, N. Bonneel, S. Kirchhoff, B. Andres and H. Pfister
The 24th Pacific Conference on Computer Graphics and Applications Short Papers Proceedings (Pacific Graphics 2016), 2016
TextPursuits: Using Text for Pursuits-based Interaction and Calibration on Public Displays
M. Khamis, O. Saltuk, A. Hang, K. Stolz, A. Bulling and F. Alt
UbiComp’16, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2016
EyeWear 2016: First Workshop on EyeWear Computing
A. Bulling, O. Cakmakci, K. Kunze and J. M. Rehg
Challenges and Design Space of Gaze-enabled Public Displays
M. Khamis, F. Alt and A. Bulling
Solar System: Smooth Pursuit Interactions Using EOG Glasses
J. Shimizu, J. Lee, M. Dhuliawala, A. Bulling, T. Starner, W. Woo and K. Kunze
AggreGaze: Collective Estimation of Audience Attention on Public Displays
Y. Sugano, X. Zhang and A. Bulling
UIST 2016, 29th Annual Symposium on User Interface Software and Technology, 2016
Advanced Microstructure Classification of Steel by Classic and Deep Learning Methods
S. Azimi
PhD Thesis, Universität des Saarlandes, 2016
Spatio-Temporal Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1605.07363)
Abstract
Boundary prediction in images as well as video has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future unobserved frames. This requires our model to learn about the fate of boundaries and extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show for the first time spatio-temporal boundary extrapolation in this challenging scenario. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We successfully predict boundaries in a billiard scenario without any assumptions of a strong parametric model or any object notion. We argue that our model has, with minimalistic model assumptions, derived a notion of 'intuitive physics' that can be applied to novel scenes.
Bayesian Non-Parametrics for Multi-Modal Segmentation
W.-C. Chiu
PhD Thesis, Universität des Saarlandes, 2016
Natural Illumination from Multiple Materials Using Deep Learning
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars and L. Van Gool
Technical Report, 2016
(arXiv: 1611.09325)
Abstract
Recovering natural illumination from a single Low-Dynamic Range (LDR) image is a challenging task. To remedy this situation we exploit two properties often found in everyday images. First, images rarely show a single material, but rather multiple ones that all reflect the same illumination. However, the appearance of each material is observed only for some surface orientations, not all. Second, parts of the illumination are often directly observed in the background, without being affected by reflection. Typically, this directly observed part of the illumination is even smaller. We propose a deep Convolutional Neural Network (CNN) that combines prior knowledge about the statistics of illumination and reflectance with an input that makes explicit use of these two observations. Our approach maps multiple partial LDR material observations represented as reflectance maps and a background image to a spherical High-Dynamic Range (HDR) illumination map. For training and testing we propose a new data set comprising synthetic and real images with multiple materials observed under the same illumination. Qualitative and quantitative evidence shows that both using multiple materials and using a background are essential to improve illumination estimations.
DeLight-Net: Decomposing Reflectance Maps into Specular Materials and Natural Illumination
S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, L. Van Gool and T. Tuytelaars
Technical Report, 2016
(arXiv: 1603.08240)
Abstract
In this paper we are extracting surface reflectance and natural environmental illumination from a reflectance map, i.e. from a single 2D image of a sphere of one material under one illumination. This is a notoriously difficult problem, yet key to various re-rendering applications. With the recent advances in estimating reflectance maps from 2D images their further decomposition has become increasingly relevant. To this end, we propose a Convolutional Neural Network (CNN) architecture to reconstruct both material parameters (i.e. Phong) as well as illumination (i.e. high-resolution spherical illumination maps), that is solely trained on synthetic data. We demonstrate decomposition of synthetic as well as real photographs of reflectance maps, both in High Dynamic Range (HDR) and, for the first time, in Low Dynamic Range (LDR) as well. Results are compared to previous approaches quantitatively as well as qualitatively in terms of re-renderings where illumination, material, view or shape are changed.
RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling
Y. He, W.-C. Chiu, M. Keuper and M. Fritz
Technical Report, 2016
(arXiv: 1604.02388)
Abstract
Beyond the success in classification, neural networks have recently shown strong results on pixel-wise prediction tasks like image semantic segmentation on RGBD data. However, the commonly used deconvolutional layers for upsampling intermediate representations to the full-resolution output still show different failure modes, like imprecise segmentation boundaries and label mistakes, in particular on large, weakly textured objects (e.g. fridge, whiteboard, door). We attribute these errors in part to the rigid way current networks aggregate information, which can be either too local (missing context) or too global (inaccurate boundaries). Therefore we propose a data-driven pooling layer that integrates with fully convolutional architectures and utilizes boundary detection from RGBD image segmentation approaches. We extend our approach to leverage region-level correspondences across images with an additional temporal pooling stage. We evaluate our approach on the NYU-Depth-V2 dataset comprised of indoor RGBD video sequences and compare it to various state-of-the-art baselines. Besides a general improvement over the state-of-the-art, our approach shows particularly good results in terms of accuracy of the predicted boundaries and in segmenting previously problematic classes.
End-to-End Eye Movement Detection Using Convolutional Neural Networks
S. Hoppe and A. Bulling
Technical Report, 2016
(arXiv: 1609.02452)
Abstract
Common computational methods for automated eye movement detection - i.e. the task of detecting different types of eye movement in a continuous stream of gaze data - are limited in that they either involve thresholding on hand-crafted signal features, require individual detectors each only detecting a single movement, or require pre-segmented data. We propose a novel approach for eye movement detection that only involves learning a single detector end-to-end, i.e. directly from the continuous gaze data stream and simultaneously for different eye movements without any manual feature crafting or segmentation. Our method is based on convolutional neural networks (CNN) that recently demonstrated superior performance in a variety of tasks in computer vision, signal processing, and machine learning. We further introduce a novel multi-participant dataset that contains scripted and free-viewing sequences of ground-truth annotated saccades, fixations, and smooth pursuits. We show that our CNN-based method outperforms state-of-the-art baselines by a large margin on this challenging dataset, thereby underlining the significant potential of this approach for holistic, robust, and accurate eye movement protocol analysis.
Dense-CNN: Fully Convolutional Neural Networks for Human Body Pose Estimation
E. Insafutdinov
PhD Thesis, Universität des Saarlandes, 2016
Gaze Embeddings for Fine-Grained Zero-Shot Image Classification
N. Karessli
PhD Thesis, Universität des Saarlandes, 2016
A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox and B. Schiele
Technical Report, 2016
(arXiv: 1607.06317)
Abstract
Recently, Minimum Cost Multicut Formulations have been proposed and proven to be successful in both motion trajectory segmentation and multi-target tracking scenarios. Both tasks benefit from decomposing a graphical model into an optimal number of connected components based on attractive and repulsive pairwise terms. The two tasks are formulated on different levels of granularity and, accordingly, leverage mostly local information for motion segmentation and mostly high-level information for multi-target tracking. In this paper we argue that point trajectories and their local relationships can contribute to the high-level task of multi-target tracking and also argue that high-level cues from object detection and tracking are helpful to solve motion segmentation. We propose a joint graphical model for point trajectories and object detections whose Multicuts are solutions to motion segmentation and multi-target tracking problems at once. Results on the FBMS59 motion segmentation benchmark as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark demonstrate the promise of this joint approach.
To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction
W. Li, S. Azimi, A. Leonardis and M. Fritz
Technical Report, 2016
(arXiv: 1604.00066)
Abstract
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. Developmental psychology has shown that such skills are acquired by infants from observations at a very early stage. In this paper, we contrast a more traditional approach of taking a model-based route with explicit 3D representations and physical simulation with an end-to-end approach that directly predicts stability and related quantities from appearance. We ask whether, and to what extent and quality, such a skill can directly be acquired in a data-driven way, bypassing the need for an explicit simulation. We present a learning-based approach based on simulated data that predicts stability of towers comprised of wooden blocks under different conditions and quantities related to the potential fall of the towers. The evaluation is carried out on synthetic data and compared to human judgments on the same stimuli.
M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1610.01076)
Abstract
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answer questions about images. We base our tutorial on two datasets: (mostly on) DAQUAR, and (a bit on) VQA. With small tweaks the models that we present here can achieve a competitive performance on both datasets; in fact, they are among the best methods that use a combination of LSTM with a global, full frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the introduced Kraino, to build various architectures that will lead to a further performance improvement on this challenging task.
Deep Learning for Filling Blanks in Image Captions
A. Mokarian Forooshani
PhD Thesis, Universität des Saarlandes, 2016
Attentive Explanations: Justifying Decisions and Pointing to the Evidence
D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell and M. Rohrbach
Technical Report, 2016
(arXiv: 1612.04757)
Abstract
Deep models are the de facto standard in visual decision models due to their impressive performance on a wide array of visual tasks. However, they are frequently seen as opaque and are unable to explain their decisions. In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions. We postulate that deep models can do this as well and propose our Pointing and Justification (PJ-X) model which can justify its decision with a sentence and point to the evidence by introspecting its decision and explanation process using an attention mechanism. Unfortunately there is no dataset available with reference explanations for visual decision making. We thus collect two datasets in two domains where it is interesting and challenging to explain decisions. First, we extend the visual question answering task to not only provide an answer but also a natural language explanation for the answer. Second, we focus on explaining human activities which is traditionally more challenging than object classification. We extensively evaluate our PJ-X model, both on the justification and pointing tasks, by comparing it to prior models and ablations using both automatic and human evaluations.
Articulated People Detection and Pose Estimation in Challenging Real World Environments
L. Pishchulin
PhD Thesis, Universität des Saarlandes, 2016
EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras (Extended Abstract)
H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele and C. Theobalt
Technical Report, 2016
(arXiv: 1701.00142)
Abstract
Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center. They often create discomfort by possibly needed marker suits, and their recording volume is severely restricted and often constrained to indoor scenes with controlled backgrounds. We therefore propose a new method for real-time, marker-less and egocentric motion capture which estimates the full-body skeleton pose from a lightweight stereo pair of fisheye cameras that are attached to a helmet or virtual-reality headset. It combines the strength of a new generative pose estimation framework for fisheye views with a ConvNet-based body-part detector trained on a new automatically annotated and augmented dataset. Our inside-in method captures full-body motion in general indoor and outdoor scenes, and also crowded scenes.
Seeing with Humans: Gaze-Assisted Neural Image Captioning
Y. Sugano and A. Bulling
Technical Report, 2016
(arXiv: 1608.05203)
Abstract
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.
2015
On the Interplay between Spontaneous Spoken Instructions and Human Visual Behaviour in an Indoor Guidance Task
N. Koleva, S. Hoppe, M. M. Moniri, M. Staudte and A. Bulling
37th Annual Meeting of the Cognitive Science Society (COGSCI 2015), 2015
Scene Viewing and Gaze Analysis during Phonetic Segmentation Tasks
A. Khan, I. Steiner, R. G. Macdonald, Y. Sugano and A. Bulling
Abstracts of the 18th European Conference on Eye Movements (ECEM 2015), 2015
The Feet in Human-Computer Interaction: A Survey of Foot-Based Interaction
E. Velloso, D. Schmidt, J. Alexander, H. Gellersen and A. Bulling
ACM Computing Surveys, Volume 48, Number 2, 2015
Introduction to the Special Issue on Activity Recognition for Interaction
A. Bulling, U. Blanke, D. Tan, J. Rekimoto and G. Abowd
ACM Transactions on Interactive Intelligent Systems, Volume 4, Number 4, 2015
Efficient Output Kernel Learning for Multiple Tasks
P. Jawanpuria, M. Lapin, M. Hein and B. Schiele
Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015
Top-k Multiclass SVM
M. Lapin, M. Hein and B. Schiele
Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015
Rekonstruktion zerebraler Gefässnetzwerke aus in-vivo μMRA mittels physiologischem Vorwissen zur lokalen Gefässgeometrie
M. Rempfler, M. Schneider, G. D. Ielacqua, T. Sprenger, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres and B. H. Menze
Bildverarbeitung für die Medizin 2015 (BVM 2015), 2015
A Study on the Natural History of Scanning Behaviour in Patients with Visual Field Defects after Stroke
T. Loetscher, C. Chen, S. Wignall, A. Bulling, S. Hoppe, O. Churches, N. A. Thomas, M. E. R. Nicholls and A. Lee
BMC Neurology, Volume 15, 2015
Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-scale-translate Tasks
J. Turner, J. Alexander, A. Bulling and H. Gellersen
CHI 2015, 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015
The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay
M. Vidal, R. Bismuth, A. Bulling and H. Gellersen
CHI 2015, 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015
Abstract
The eyes are a rich channel for non-verbal communication in our daily interactions. We propose social gaze interaction as a game mechanic to enhance user interactions with virtual characters. We develop a game from the ground-up in which characters are designed to be reactive to the player’s gaze in social ways, such as getting annoyed when the player seems distracted or changing their dialogue depending on the player’s apparent focus of attention. Results from a qualitative user study provide insights about how social gaze interaction is intuitive for users, elicits deep feelings of immersion, and highlight the players’ self-consciousness of their own eye movements through their strong reactions to the characters.
Editorial of Special Issue on Shape Representations Meet Visual Recognition
S. Savarese, M. Sun and M. Stark
Computer Vision and Image Understanding, Volume 139, 2015
Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers
M. Barz, A. Bulling and F. Daiber
Technical Report, 2015
Abstract
Head-mounted eye tracking has significant potential for mobile gaze-based interaction with ambient displays but current interfaces lack information about the tracker's gaze estimation error. Consequently, current interfaces do not exploit the full potential of gaze input as the inherent estimation error cannot be dealt with. The error depends on the physical properties of the display and constantly varies with changes in position and distance of the user to the display. In this work we present a computational model of gaze estimation error for head-mounted eye trackers. Our model covers the full processing pipeline for mobile gaze estimation, namely mapping of pupil positions to scene camera coordinates, marker-based display detection, and display mapping. We build the model based on a series of controlled measurements of a sample state-of-the-art monocular head-mounted eye tracker. Results show that our model can predict gaze estimation error with a root mean squared error of 17.99 px (1.96°).
GazeProjector: Location-independent Gaze Interaction on and Across Multiple Displays
C. Lander, S. Gehring, A. Krüger, S. Boring and A. Bulling
Technical Report, 2015
Abstract
Mobile gaze-based interaction with multiple displays may occur from arbitrary positions and orientations. However, maintaining high gaze estimation accuracy still represents a significant challenge. To address this, we present GazeProjector, a system that combines accurate point-of-gaze estimation with natural feature tracking on displays to determine the mobile eye tracker’s position relative to a display. The detected eye positions are transformed onto that display allowing for gaze-based interaction. This allows for seamless gaze estimation and interaction on (1) multiple displays of arbitrary sizes, (2) independently of the user’s position and orientation to the display. In a user study with 12 participants we compared GazeProjector to existing well-established methods such as visual on-screen markers and a state-of-the-art motion capture system. Our results show that our approach is robust to varying head poses, orientations, and distances to the display, while still providing high gaze estimation accuracy across multiple displays without re-calibration. The system represents an important step towards the vision of pervasive gaze-based interfaces.
Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position
E. Velloso, J. Alexander, A. Bulling and H. Gellersen
Human-Computer Interaction -- INTERACT 2015, 2015
An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation
E. Velloso, J. Turner, J. Alexander, A. Bulling and H. Gellersen
Human-Computer Interaction -- INTERACT 2015, 2015
See the Difference: Direct Pre-Image Reconstruction and Pose Estimation by Differentiating HOG
W.-C. Chiu and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts
M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox and B. Andres
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Motion Trajectory Segmentation via Minimum Cost Multicuts
M. Keuper, B. Andres and T. Brox
ICCV 2015, IEEE International Conference on Computer Vision, 2015
M. Malinowski, M. Rohrbach and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Person Recognition in Personal Photo Collections
S. J. Oh, R. Benenson, M. Fritz and B. Schiele
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval
G. Sharma and B. Schiele
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Rendering of Eyes for Eye-Shape Registration and Gaze Estimation
E. Wood, T. Baltrusaitis, X. Zhang, Y. Sugano, P. Robinson and A. Bulling
ICCV 2015, IEEE International Conference on Computer Vision, 2015
Evaluation of Output Embeddings for Fine-grained Image Classification
Z. Akata, S. Reed, D. Walter, H. Lee and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Enriching Object Detection with 2D-3D Registration and Continuous Viewpoint Estimation
C. Choy, M. Stark and S. Savarese
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Efficient ConvNet-based Marker-less Motion Capture in General Scenes with a Low Number of Cameras
A. Elhayek, E. de Aguiar, J. Tompson, A. Jain, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele and C. Theobalt
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Taking a Deeper Look at Pedestrians
J. Hosang, M. Omran, R. Benenson and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Image Retrieval using Scene Graphs
J. Johnson, R. Krishna, M. Stark, J. Li, M. Bernstein and L. Fei-Fei
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Classifier Based Graph Construction for Video Segmentation
A. Khoreva, F. Galasso, M. Hein and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
A Flexible Tensor Block Coordinate Ascent Scheme for Hypergraph Matching
Q. N. Nguyen, A. Gautier and M. Hein
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
A Dataset for Movie Description
A. Rohrbach, M. Rohrbach, N. Tandon and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Prediction of Search Targets from Fixations in Open-world Settings
H. Sattar, S. Müller, M. Fritz and A. Bulling
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Subgraph Decomposition for Multi-target Tracking
S. Tang, B. Andres, M. Andriluka and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Filtered Channel Features for Pedestrian Detection
S. Zhang, R. Benenson and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
Appearance-based Gaze Estimation in the Wild
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
3D Object Class Detection in the Wild
B. Pepik, M. Stark, P. Gehler, T. Ritschel and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition Workshops (3DSI 2015), 2015
Joint Segmentation and Activity Discovery using Semantic and Temporal Priors
J. Seiter, W.-C. Chiu, M. Fritz, O. Amft and G. Tröster
IEEE International Conference on Pervasive Computing and Communication (PERCOM 2015), 2015
Teaching Robots the Use of Human Tools from Demonstration with Non-dexterous End-effectors
W. Li and M. Fritz
2015 IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS 2015), 2015
GyroPen: Gyroscopes for Pen-Input with Mobile Phones
T. Deselaers, D. Keysers, J. Hosang and H. Rowley
IEEE Transactions on Human-Machine Systems, Volume 45, Number 2, 2015
Appearance-based Gaze Estimation with Online Calibration from Mouse Operations
Y. Sugano, Y. Matsushita, Y. Sato and H. Koike
IEEE Transactions on Human-Machine Systems, Volume 45, Number 6, 2015
Gaze Estimation From Eye Appearance: A Head Pose-free Method via Eye Image Synthesis
F. Lu, Y. Sugano, T. Okabe and Y. Sato
IEEE Transactions on Image Processing, Volume 24, Number 11, 2015
Detecting Surgical Tools by Modelling Local Appearance and Global Shape
D. Bouget, R. Benenson, M. Omran, L. Riffaud, B. Schiele and P. Jannin
IEEE Transactions on Medical Imaging, Volume 34, Number 12, 2015
Multi-view and 3D Deformable Part Models
B. Pepik, M. Stark, P. Gehler and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 37, Number 11, 2015
Emotion Recognition from Embedded Bodily Expressions and Speech During Dyadic Interactions
P. Müller, S. Amin, P. Verma, M. Andriluka and A. Bulling
International Conference on Affective Computing and Intelligent Interaction (ACII 2015), 2015
A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems
J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy and C. Rother
International Journal of Computer Vision, Volume 115, Number 2, 2015
Abstract
Szeliski et al. published an influential study in 2006 on energy minimization methods for Markov Random Fields (MRF). This study provided valuable insights in choosing the best optimization technique for certain classes of problems. While these insights remain generally useful today, the phenomenal success of random field models means that the kinds of inference problems that have to be solved changed significantly. Specifically, the models today often include higher order interactions, flexible connectivity structures, large label-spaces of different cardinalities, or learned energy tables. To reflect these changes, we provide a modernized and enlarged study. We present an empirical comparison of 32 state-of-the-art optimization techniques on a corpus of 2,453 energy minimization instances from diverse applications in computer vision. To ensure reproducibility, we evaluate all methods in the OpenGM 2 framework and report extensive results regarding runtime and solution quality. Key insights from our study agree with the results of Szeliski et al. for the types of models they studied. However, on new and challenging types of models our findings disagree and suggest that polyhedral methods and integer programming solvers are competitive in terms of runtime and solution quality over a large range of model types.
Towards Scene Understanding with Detailed 3D Object Representations
Z. Zia, M. Stark and K. Schindler
International Journal of Computer Vision, Volume 112, Number 2, 2015
Walking Reduces Spatial Neglect
T. Loetscher, C. Chen, S. Hoppe, A. Bulling, S. Wignall, C. Owen, N. Thomas and A. Lee
Journal of the International Neuropsychological Society, 2015
Bridging the Gap Between Synthetic and Real Data
M. Fritz
Machine Learning with Interdependent and Non-Identically Distributed Data, 2015
Reconstructing Cerebrovascular Networks under Local Physiological Constraints by Integer Programming
M. Rempfler, M. Schneider, G. D. Ielacqua, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres and B. H. Menze
Medical Image Analysis, Volume 25, Number 1, 2015
Graphical Passwords in the Wild: Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes
F. Alt, S. Schneegass, A. Shirazi, M. Hassib and A. Bulling
MobileHCI’15, 17th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2015
What is Holding Back Convnets for Detection?
B. Pepik, R. Benenson, T. Ritschel and B. Schiele
Pattern Recognition (GCPR 2015), 2015
The Long-short Story of Movie Description
A. Rohrbach, M. Rohrbach and B. Schiele
Pattern Recognition (GCPR 2015), 2015
Eye Tracking for Public Displays in the Wild
Y. Zhang, M. K. Chong, A. Bulling and H. Gellersen
Personal and Ubiquitous Computing, Volume 19, Number 5, 2015
Characterizing Information Diets of Social Media Users
J. Kulshrestha, M. B. Zafar, L. E. Espin Noboa, K. Gummadi and S. Ghosh
Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM 2015), 2015
The Cityscapes Dataset
M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele
The Future of Datasets in Vision 2015 (CVPR 2015 Workshop), 2015
Latent Max-margin Metric Learning for Comparing Video Face Tubes
G. Sharma and P. Pérez
The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2015), 2015
Hard to Cheat: A Turing Test based on Answering Questions about Images
M. Malinowski and M. Fritz
Twenty-Ninth AAAI Conference on Artificial Intelligence W6, Beyond the Turing Test (AAAI 2015 W6, Beyond the Turing Test), 2015
(arXiv: 1501.03302)
Abstract
Progress in language and image understanding by machines has sparked the interest of the research community in more open-ended, holistic tasks, and refueled an old AI dream of building intelligent machines. We discuss a few prominent challenges that characterize such holistic tasks and argue for "question answering about images" as a particularly appealing instance of such a holistic task. In particular, we point out that it is a version of a Turing Test that is likely to be more robust to over-interpretations and contrast it with tasks like grounding and generation of descriptions. Finally, we discuss tools to measure progress in this field.
Discovery of Everyday Human Activities From Long-Term Visual Behaviour Using Topic Models
J. Steil and A. Bulling
UbiComp 2015, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Analyzing Visual Attention During Whole Body Interaction with Public Displays
R. Walter, A. Bulling, D. Lindbauer, M. Schuessler and J. Müller
UbiComp 2015, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Human Visual Behaviour for Collaborative Human-Machine Interaction
A. Bulling
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Orbits: Enabling Gaze Interaction in Smart Watches Using Moving Targets
A. Esteves, E. Velloso, A. Bulling and H. Gellersen
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Recognition of Curiosity Using Eye Movement Analysis
S. Hoppe, T. Loetscher, S. Morey and A. Bulling
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits
M. Khamis, F. Alt and A. Bulling
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
Tackling Challenges of Interactive Public Displays Using Gaze
M. Khamis, A. Bulling and F. Alt
UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015
GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues
F. Alt, A. Bulling, G. Gravanis and D. Buschek
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
Orbits: Gaze Interaction for Smart Watches using Smooth Pursuit Eye Movements
A. Esteves, E. Velloso, A. Bulling and H. Gellersen
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays
C. Lander, S. Gehring, A. Krüger, S. Boring and A. Bulling
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
Self-calibrating Head-mounted Eye Trackers Using Egocentric Visual Saliency
Y. Sugano and A. Bulling
UIST’15, 28th Annual ACM Symposium on User Interface Software and Technology, 2015
Learning Probability Measures in 01-Variables Using Multi-Linear Polynomial Lifting
S. Brust
PhD Thesis, Universität des Saarlandes, 2015
Long-Range Connectivity in the Multicut Problem
B. Grochulla
PhD Thesis, Universität des Saarlandes, 2015
What Makes for Effective Detection Proposals?
J. Hosang, R. Benenson, P. Dollár and B. Schiele
Technical Report, 2015
(arXiv: 1502.05082)
Abstract
Current top performing object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity and widespread use of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of twelve proposal methods along with four baselines regarding proposal repeatability, ground truth annotation recall on PASCAL and ImageNet, and impact on DPM and R-CNN detection performance. Our analysis shows that for object detection improving proposal localisation accuracy is as important as improving recall. We introduce a novel metric, the average recall (AR), which rewards both high recall and good localisation and correlates surprisingly well with detector performance. Our findings show common strengths and weaknesses of existing methods, and provide insights and metrics for selecting and tuning proposal methods.
Structured Forests: From Edges to Contours
PhD Thesis, Universität des Saarlandes, 2015
Learning to Choose Optimal Viewpoints for Pose Estimation of 3D Objects
P. Müller
PhD Thesis, Universität des Saarlandes, 2015
Contextual Media Retrieval Using Natural Language Queries
S. Nag Chowdhury
PhD Thesis, Universität des Saarlandes, 2015
Richer Object Representations for Object Class Detection in Challenging Real World Images
B. Pepik
PhD Thesis, Universität des Saarlandes, 2015
Convexification of Learning From Constraints
I. Shcherbatyi
PhD Thesis, Universität des Saarlandes, 2015
GazeDPM: Early Integration of Gaze Information in Deformable Part Models
I. Shcherbatyi, A. Bulling and M. Fritz
Technical Report, 2015
(arXiv: 1505.05753)
Abstract
An increasing number of works explore collaborative human-computer systems in which human gaze is used to enhance computer vision systems. For object detection these efforts were so far restricted to late integration approaches that have inherent limitations, such as increased precision without increase in recall. We propose an early integration approach in a deformable part model, which constitutes a joint formulation over gaze and visual data. We show that our GazeDPM method improves over the state-of-the-art DPM baseline by 4% and a recent method for gaze-supported object detection by 3% on the public POET dataset. Our approach additionally provides introspection of the learnt models, can reveal salient image structures, and allows us to investigate the interplay between gaze attracting and repelling areas, the importance of view-specific models, as well as viewers' personal biases in gaze patterns. We finally study important practical aspects of our approach, such as the impact of using saliency maps instead of real fixations, the impact of the number of fixations, as well as robustness to gaze estimation error.
Labeled Pupils in the Wild: A Dataset for Studying Pupil Detection in Unconstrained Environments
M. Tonsen, X. Zhang, Y. Sugano and A. Bulling
Technical Report, 2015
(arXiv: 1511.05768)
Abstract
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people with different ethnicities, a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, as well as make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution, vision aids, as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.
Latent Embedding for Zero-shot Image Classification
Y. Xian
PhD Thesis, Universität des Saarlandes, 2015
2014
A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors
A. Bulling, U. Blanke and B. Schiele
ACM Computing Surveys, Volume 46, Number 3, 2014
Pursuits: Spontaneous Eye-based Interaction for Dynamic Interfaces
M. Vidal, A. Bulling and H. Gellersen
ACM SIGMOBILE Mobile Computing and Communications Review, Volume 18, Number 4, 2014
Abstract
Although gaze is an attractive modality for pervasive interaction, real-world implementation of eye-based interfaces poses significant challenges. In particular, user calibration is tedious and time consuming. Pursuits is an innovative interaction technique that enables truly spontaneous interaction with eye-based interfaces. A user can simply walk up to the screen and readily interact with moving targets. Instead of being based on gaze location, Pursuits correlates eye pursuit movements with objects dynamically moving on the interface.
A Multi-world Approach to Question Answering about Real-world Scenes based on Uncertain Input
M. Malinowski and M. Fritz
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
Eye Tracking and Eye-based Human–computer Interaction
P. Majaranta and A. Bulling
Ubic: Bridging the Gap Between Digital Cryptography and the Physical World
M. Simkin, A. Bulling, M. Fritz and D. Schröder
Computer Security - ESORICS 2014, 2014
Estimation of Human Body Shape and Posture under Clothing
S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu and J. Lang
Computer Vision and Image Understanding, Volume 127, 2014
Face Detection Without Bells and Whistles
M. Mathias, R. Benenson, M. Pedersoli and L. Van Gool
Computer Vision - ECCV 2014, 2014
Multiple Human Pose Estimation with Temporally Consistent 3D Pictorial Structures
X. Wang, B. Schiele, P. Fua, V. Belagiannis, S. Ilic and N. Navab
Computer Vision - ECCV 2014 Workshops, 2014
First International Workshop on Video Segmentation -- Panel Discussion
T. Brox, F. Galasso, F. Li, J. M. Rehg and B. Schiele
Computer Vision -- ECCV 2014 Workshops, 2014
Ten Years of Pedestrian Detection, What Have We Learned?
R. Benenson, M. Omran, J. Hosang and B. Schiele
Computer Vision - ECCV 2014 Workshops (ECCV 2014 Workshop CVRSUAD), 2014
2D Human Pose Estimation: New Benchmark and State of the Art Analysis
M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
3D Pictorial Structures for Multiple Human Pose Estimation
V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab and S. Ilic
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Spectral Graph Reduction for Efficient Image and Streaming Video Segmentation
F. Galasso, M. Keuper, T. Brox and B. Schiele
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Anytime Recognition of Objects and Scenes
S. Karayev, M. Fritz and T. Darrell
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Scalable Multitask Representation Learning for Scene Classification
M. Lapin, B. Schiele and M. Hein
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Image-based Synthesis and Re-Synthesis of Viewpoints Guided by 3D Models
K. Rematas, T. Ritschel, M. Fritz and T. Tuytelaars
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Are Cars Just 3D Boxes? - Jointly Estimating the 3D Shape of Multiple Objects
M. Z. Zia, M. Stark and K. Schindler
2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 2014
Cognition-aware Computing
A. Bulling and T. O. Zander
IEEE Pervasive Computing, Volume 13, Number 3, 2014
3D Traffic Scene Understanding from Movable Platforms
A. Geiger, M. Lauer, C. Wojek, C. Stiller and R. Urtasun
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 36, Number 5, 2014
Learning Human Pose Estimation Features with Convolutional Networks
A. Jain, J. Tompson, M. Andriluka, G. W. Taylor and C. Bregler
International Conference on Learning Representations 2014 (ICLR 2014), 2014
(arXiv: 1312.7302)
Abstract
This paper introduces a new architecture for human pose estimation using a multi-layer convolutional network architecture and a modified learning technique that learns low-level features and higher-level weak spatial models. Unconstrained human pose estimation is one of the hardest problems in computer vision, and our new architecture and learning schema shows significant improvement over the current state-of-the-art results. The main contribution of this paper is showing, for the first time, that a specific variation of deep learning is able to outperform all existing traditional architectures on this task. The paper also discusses several lessons learned while researching alternatives, most notably, that it is possible to learn strong low-level feature detectors on features that might even just cover a few pixels in the image. Higher-level spatial models improve somewhat the overall result, but to a much lesser extent than expected. Many researchers previously argued that the kinematic structure and top-down information is crucial for this domain, but with our purely bottom up, and weak spatial model, we could improve other more complicated architectures that currently produce the best results. This mirrors what many other researchers, like those in the speech recognition, object recognition, and other domains have experienced.
Multi-view Priors for Learning Detectors from Sparse Viewpoint Data
B. Pepik, M. Stark, P. Gehler and B. Schiele
International Conference on Learning Representations 2014 (ICLR 2014), 2014
(arXiv: 1312.6095)
Abstract
While the majority of today's object class models provide only 2D bounding boxes, far richer output hypotheses are desirable including viewpoint, fine-grained category, and 3D geometry estimate. However, models trained to provide richer output require larger amounts of training data, preferably well covering the relevant aspects such as viewpoint and fine-grained categories. In this paper, we address this issue from the perspective of transfer learning, and design an object class model that explicitly leverages correlations between visual features. Specifically, our model represents prior distributions over permissible multi-view detectors in a parametric way -- the priors are learned once from training data of a source object class, and can later be used to facilitate the learning of a detector for a target class. As we show in our experiments, this transfer is not only beneficial for detectors based on basic-level category representations, but also enables the robust learning of detectors that represent classes at finer levels of granularity, where training data is typically even scarcer and more unbalanced. As a result, we report largely improved performance in simultaneous 2D object localization and viewpoint estimation on a recent dataset of challenging street scenes.
Detection and Tracking of Occluded People
S. Tang, M. Andriluka and B. Schiele
International Journal of Computer Vision, Volume 110, Number 1, 2014
Introduction to the PETMEI Special Issue
A. Bulling and R. Bednarik
Journal of Eye Movement Research, Volume 7, Number 3, 2014
Computer Vision - ECCV 2014
D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars (Eds.)
Springer, 2014
Candidate Sampling for Neuron Reconstruction from Anisotropic Electron Microscopy Volumes
J. Funke, J. N. P. Martel, S. Gerhard, B. Andres, D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, H. Pfister, A. Cardona and M. Cook
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2014, 2014
Extracting Vascular Networks under Physiological Constraints via Integer Programming
M. Rempfler, M. Schneider, G. D. Ielacqua, X. Xiao, S. R. Stock, J. Klohs, G. Székely, B. Andres and B. H. Menze
Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2014, 2014
Learning Using Privileged Information: SVM+ and Weighted SVM
M. Lapin, M. Hein and B. Schiele
Neural Networks, Volume 53, 2014
Towards a Visual Turing Challenge
M. Malinowski and M. Fritz
NIPS 2014 Workshop on Learning Semantics, 2014
(arXiv: 1410.8027)
Abstract
As language and visual understanding by machines progresses rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to progress towards more challenging and open tasks and refueled the hope at achieving the old AI dream of building machines that could pass a Turing test in open domains. In order to steadily make progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges as well as try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset of question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue despite the success of unique ground-truth annotation, we likely have to step away from carefully curated dataset and rather rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
Expressive Models and Comprehensive Benchmark for 2D Human Pose Estimation
L. Pishchulin, M. Andriluka, P. Gehler and B. Schiele
Parts and Attributes (ECCV 2014 Workshop PA), 2014
Test-time Adaptation for 3D Human Pose Estimation
S. Amin, P. Müller, A. Bulling and M. Andriluka
Pattern Recognition (GCPR 2014), 2014
Learning Must-Link Constraints for Video Segmentation Based on Spectral Clustering
A. Khoreva, F. Galasso, M. Hein and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Learning Multi-scale Representations for Material Classification
W. Li
Pattern Recognition (GCPR 2014), 2014
Fine-grained Activity Recognition with Holistic and Pose Based Features
L. Pishchulin, M. Andriluka and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Cross-device Gaze-supported Point-to-point Content Transfer
J. Turner, A. Bulling, J. Alexander and H. Gellersen
Proceedings ETRA 2014, 2014
EyeTab: Model-based Gaze Estimation on Unmodified Tablet Computers
E. Wood and A. Bulling
Proceedings ETRA 2014, 2014
S. Ishimaru, K. Kunze, K. Kise, J. Weppner, A. Dengel, P. Lukowicz and A. Bulling
Proceedings of the 5th Augmented Human International Conference (AH 2014), 2014
Object Disambiguation for Augmented Reality Applications
W.-C. Chiu, G. Johnson, D. McCulley, O. Grau and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2014), 2014
How Good are Detection Proposals, really?
J. Hosang, R. Benenson and B. Schiele
Proceedings of the British Machine Vision Conference (BMVC 2014), 2014
Abstract
Current top performing Pascal VOC object detectors employ detection proposals to guide the search for objects thereby avoiding exhaustive sliding window search across images. Despite the popularity of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of ten object proposal methods along with four baselines regarding ground truth annotation recall (on Pascal VOC 2007 and ImageNet 2013), repeatability, and impact on DPM detector performance. Our findings show common weaknesses of existing methods, and provide insights to choose the most adequate method for different settings.
Pupil-Canthi-Ratio: A Calibration-free Method for Tracking Horizontal Gaze Direction
Y. Zhang, A. Bulling and H. Gellersen
Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI 2014), 2014
Scalable Multitask Representation Learning for Scene Classification
M. Lapin, B. Schiele and M. Hein
Scene Understanding Workshop (SUNw 2014), 2014
Learning People Detectors for Tracking in Crowded Scenes
S. Tang, M. Andriluka, A. Milan, K. Schindler, S. Roth and B. Schiele
Scene Understanding Workshop (SUNw 2014), 2014
High-Resolution 3D Layout from a Single View
M. Z. Zia, M. Stark and K. Schindler
Scene Understanding Workshop (SUNw 2014), 2014
SmudgeSafe: Geometric Image Transformations for Smudge-resistant User Authentication
S. Schneegass, F. Steimle, A. Bulling, F. Alt and A. Schmidt
UbiComp’14, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014
GazeHorizon: Enabling Passers-by to Interact with Public Displays by Gaze
Y. Zhang, J. Müller, M. K. Chong, A. Bulling and H. Gellersen
UbiComp’14, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2014
Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction
M. Kassner, W. Patera and A. Bulling