2019
LiveCap: Real-time Human Performance Capture from Monocular Video
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics, Volume 38, Number 2, 2019
Modeling Conceptual Understanding in Image Reference Games
R. Corona, S. Alaniz and Z. Akata
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019
Combining Generative and Discriminative Models for Hybrid Inference
V. Garcia Satorras, Z. Akata and M. Welling
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019
Learning to Self-Train for Semi-Supervised Few-Shot Classification
X. Li, Q. Sun, Y. Liu, Q. Zhou, S. Zheng, T.-S. Chua and B. Schiele
Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019
Everyday Eye Tracking for Real-World Consumer Behavior Analysis
A. Bulling and M. Wedel
A Handbook of Process Tracing Methods for Decision Research, 2019
Conditional Flow Variational Autoencoders for Structured Sequence Prediction
A. Bhattacharyya, M. Hanselmann, M. Fritz, B. Schiele and C.-N. Straehle
Bayesian Deep Learning NeurIPS 2019 Workshop, 2019
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
X. Zhang, Y. Sugano and A. Bulling
CHI 2019, CHI Conference on Human Factors in Computing Systems, 2019
XNect Demo (v2): Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, H.-P. Seidel, P. Fua, M. Elgharib, H. Rhodin, G. Pons-Moll and C. Theobalt
CVPR 2019 Demonstrations, 2019
Towards Reverse-Engineering Black-Box Neural Networks
S. J. Oh, B. Schiele and M. Fritz
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 2019
InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation
J. Steil, M. Tonsen, Y. Sugano and A. Bulling
GetMobile, Volume 23, Number 2, 2019
P. Müller and A. Bulling
ICMI ’19, International Conference on Multimodal Interaction, 2019
Abstract
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets -- an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.
Learning to Reconstruct People in Clothing from a Single RGB Camera
T. Alldieck, M. A. Magnor, B. L. Bhatnagar, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-based Image Retrieval
A. Dutta and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
In the Wild Human Pose Estimation using Explicit 2D Features and Intermediate 3D Representations
I. Habibie, W. Xu, D. Mehta, G. Pons-Moll and C. Theobalt
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Time-Conditioned Action Anticipation in One Shot
Q. Ke, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Combinatorial Persistency Criteria for Multicut and Max-Cut
J.-H. Lange, B. Andres and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Knockoff Nets: Stealing Functionality of Black-Box Models
T. Orekondy, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders
E. Schönfeld, S. Ebrahimi, S. Sinha, T. Darrell and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation
R. Shetty, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
D. Stutz, M. Hein, and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Meta-Transfer Learning for Few-Shot Learning
Q. Sun, Y. Liu, T.-S. Chua and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
MAP Inference via Block-Coordinate Frank-Wolfe Algorithm
P. Swoboda and V. Kolmogorov
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Abstract
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they can not make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strength of VAE and GANs and in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.
A Convex Relaxation for Multi-Graph Matching
P. Swoboda, D. Kainmüller, A. Mokarian, C. Theobalt and F. Bernard
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Semantic Projection Network for Zero- and Few-Label Semantic Segmentation
Y. Xian, S. Choudhury, Y. He, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning
Y. Xian, S. Sharma, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Abstract
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they can not make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strength of VAE and GANs and in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.
Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture
N. Yu, C. Barnes, E. Shechtman, S. Amirghodsi and M. Lukáč
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
SimulCap : Single-View Human Performance Capture with Cloth Simulation
T. Yu, Z. Zheng, Y. Zhong, J. Zhao, D. Quionhai, G. Pons-Moll and Y. Liu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019
Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard
S. Abdelnabi, M. X. Huang and A. Bulling
IEEE International Conference on Systems, Man, and Cybernetics (SMC 2019), 2019
Zero-shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly
Y. Xian, C. H. Lampert, B. Schiele and Z. Akata
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 41, Number 9, 2019
Abstract
Due to the importance of zero-shot learning, i.e. classifying images where there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed upon zero-shot learning benchmark, we first define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g. pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of the state-of-the-art methods in depth, both in the classic zero-shot setting but also in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area which can be taken as a basis for advancing it.
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano, M. Fritz and A. Bulling
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 41, Number 1, 2019
Fashion is Taking Shape: Understanding Clothing Preference Based on Body Shape From Online Sources
H. Sattar, G. Pons-Moll and M. Fritz
2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019), 2019
360-Degree Textures of People in Clothing from a Single Image
V. Lazova, E. Insafutdinov and G. Pons-Moll
International Conference on 3D Vision, 2019
Bottleneck Potentials in Markov Random Fields
A. Abbas and P. Swoboda
International Conference on Computer Vision (ICCV 2019), 2019
Tex2Shape: Detailed Full Human Body Geometry from a Single Image
T. Alldieck, G. Pons-Moll, C. Theobalt and M. A. Magnor
International Conference on Computer Vision (ICCV 2019), 2019
Abstract
We present a simple yet effective method to infer detailed full human body shape from only a single photograph. Our model can infer full-body shape including face, hair, and clothing including wrinkles at interactive frame-rates. Results feature details even on parts that are occluded in the input image. Our main idea is to turn shape regression into an aligned image-to-image translation problem. The input to our method is a partial texture map of the visible region obtained from off-the-shelf methods. From a partial texture, we estimate detailed normal and vector displacement maps, which can be applied to a low-resolution smooth body model to add detail and clothing. Despite being trained purely with synthetic data, our model generalizes well to real-world photographs. Numerous results demonstrate the versatility and robustness of our method.
HiPPI: Higher-Order Projected Power Iterations for Scalable Multi-Matching
F. Bernard, J. Thunberg, P. Swoboda and C. Theobalt
International Conference on Computer Vision (ICCV 2019), 2019
Multi-Garment Net: Learning to Dress 3D People from Images
B. L. Bhatnagar, G. Tiwari, C. Theobalt and G. Pons-Moll
International Conference on Computer Vision (ICCV 2019), 2019
AMASS: Archive of Motion Capture as Surface Shapes
N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll and M. J. Black
International Conference on Computer Vision (ICCV 2019), 2019
Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking
S. Sharma, P. T. Varigonda, P. Bindal, A. Sharma and A. Jain
International Conference on Computer Vision (ICCV 2019), 2019
Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints
N. Yu, L. Davis and M. Fritz
International Conference on Computer Vision (ICCV 2019), 2019
Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods
A. Bhattacharyya, M. Fritz and B. Schiele
International Conference on Learning Representations (ICLR 2019), 2019
Lucid Data Dreaming for Video Object Segmentation
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
International Journal of Computer Vision, Volume 127, Number 9, 2019
Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour
M. X. Huang, J. Li, G. Ngai, H. V. Leong and A. Bulling
MM ’19, 27th ACM International Conference on Multimedia, 2019
Abstract
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.
Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders
X. Hong, E. Chang and V. Demberg
Multilingual Surface Realisation (MSR 2019), 2019
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
M. X. Huang and A. Bulling
Proceedings ETRA 2019, 2019
Abstract
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that only requires users' eye movements to reduce calibration distortion in the background while users naturally look at an interface. Our method exploits that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating accuracy decrease of eye tracker during long-term use.
Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage
P. Müller, D. Buschek, M. X. Huang and A. Bulling
Proceedings ETRA 2019, 2019
Privacy-Aware Eye Tracking Using Differential Privacy
J. Steil, I. Hagestedt, M. X. Huang and A. Bulling
Proceedings ETRA 2019, 2019
PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features
J. Steil, M. Koelle, W. Heuten, S. Boll and A. Bulling
Proceedings ETRA 2019, 2019
Detecting Stress from Mouse-Gaze Attraction
J. Wang, E. Y. Fu, G. Ngai, H. Va Leong and M. X. Huang
Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC 2019), 2019
Gradient-Leaks: Understanding Deanonymization in Federated Learning
T. Orekondy, S. J. Oh, Y. Zhang, B. Schiele and M. Fritz
The 2nd International Workshop on Federated Learning for Data Privacy and Confidentiality (FL-NeurIPS 2019), 2019
(Accepted/in press)
Bottleneck Potentials in Markov Random Fields
A. Abbas and P. Swoboda
Technical Report, 2019
(arXiv: 1904.08080)
Abstract
We consider general discrete Markov Random Fields(MRFs) with additional bottleneck potentials which penalize the maximum (instead of the sum) over local potential value taken by the MRF-assignment. Bottleneck potentials or analogous constructions have been considered in (i) combinatorial optimization (e.g. bottleneck shortest path problem, the minimum bottleneck spanning tree problem, bottleneck function minimization in greedoids), (ii) inverse problems with $L_{\infty}$-norm regularization, and (iii) valued constraint satisfaction on the $(\min,\max)$-pre-semirings. Bottleneck potentials for general discrete MRFs are a natural generalization of the above direction of modeling work to Maximum-A-Posteriori (MAP) inference in MRFs. To this end, we propose MRFs whose objective consists of two parts: terms that factorize according to (i) $(\min,+)$, i.e. potentials as in plain MRFs, and (ii) $(\min,\max)$, i.e. bottleneck potentials. To solve the ensuing inference problem, we propose high-quality relaxations and efficient algorithms for solving them. We empirically show efficacy of our approach on large scale seismic horizon tracking problems.
“Best-of-Many-Samples” Distribution Matching
A. Bhattacharyya, M. Fritz and B. Schiele
Technical Report, 2019
(arXiv: 1909.12598)
Abstract
Generative Adversarial Networks (GANs) can achieve state-of-the-art sample quality in generative modelling tasks but suffer from the mode collapse problem. Variational Autoencoders (VAE) on the other hand explicitly maximize a reconstruction-based data log-likelihood forcing it to cover all modes, but suffer from poorer sample quality. Recent works have proposed hybrid VAE-GAN frameworks which integrate a GAN-based synthetic likelihood to the VAE objective to address both the mode collapse and sample quality issues, with limited success. This is because the VAE objective forces a trade-off between the data log-likelihood and divergence to the latent prior. The synthetic likelihood ratio term also shows instability during training. We propose a novel objective with a "Best-of-Many-Samples" reconstruction cost and a stable direct estimate of the synthetic likelihood. This enables our hybrid VAE-GAN framework to achieve high data log-likelihood and low divergence to the latent prior at the same time and shows significant improvement over both hybrid VAE-GANS and plain GANs in mode coverage and quality.
SampleFix: Learning to Correct Programs by Sampling Diverse Fixes
H. Hajipour, A. Bhattacharyya and M. Fritz
Technical Report, 2019
(arXiv: 1906.10502)
Abstract
Automatic program correction is an active topic of research, which holds the potential of dramatically improving productivity of programmers during the software development process and correctness of software in general. Recent advances in machine learning, deep learning and NLP have rekindled the hope to eventually fully automate the process of repairing programs. A key challenge is ambiguity, as multiple codes -- or fixes -- can implement the same functionality. In addition, datasets by nature fail to capture the variance introduced by such ambiguities. Therefore, we propose a deep generative model to automatically correct programming errors by learning a distribution of potential fixes. Our model is formulated as a deep conditional variational autoencoder that samples diverse fixes for the given erroneous programs. In order to account for ambiguity and inherent lack of representative datasets, we propose a novel regularizer to encourage the model to generate diverse fixes. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over the state-of-the-art approaches by fixing up to 65% of the mistakes.
LCC: Learning to Customize and Combine Neural Networks for Few-Shot Learning
Y. Liu, Q. Sun, A.-A. Liu, Y. Su, B. Schiele and T.-S. Chua
Technical Report, 2019
(arXiv: 1904.08479)
Abstract
Meta-learning has been shown to be an effective strategy for few-shot learning. The key idea is to leverage a large number of similar few-shot tasks in order to meta-learn how to best initiate a (single) base-learner for novel few-shot tasks. While meta-learning how to initialize a base-learner has shown promising results, it is well known that hyperparameter settings such as the learning rate and the weighting of the regularization term are important to achieve best performance. We thus propose to also meta-learn these hyperparameters and in fact learn a time- and layer-varying scheme for learning a base-learner on novel tasks. Additionally, we propose to learn not only a single base-learner but an ensemble of several base-learners to obtain more robust results. While ensembles of learners have shown to improve performance in various settings, this is challenging for few-shot learning tasks due to the limited number of training samples. Therefore, our approach also aims to meta-learn how to effectively combine several base-learners. We conduct extensive experiments and report top performance for five-class few-shot recognition tasks on two challenging benchmarks: miniImageNet and Fewshot-CIFAR100 (FC100).
Learning Manipulation under Physics Constraints with Visual Perception
W. Li, A. Leonardis, J. Bohg and M. Fritz
Technical Report, 2019
(arXiv: 1904.09860)
Abstract
Understanding physical phenomena is a key competence that enables humans and animals to act and interact under uncertain perception in previously unseen environments containing novel objects and their configurations. In this work, we consider the problem of autonomous block stacking and explore solutions to learning manipulation under physics constraints with visual perception inherent to the task. Inspired by the intuitive physics in humans, we first present an end-to-end learning-based approach to predict stability directly from appearance, contrasting a more traditional model-based approach with explicit 3D representations and physical simulation. We study the model's behavior together with an accompanied human subject test. It is then integrated into a real-world robotic system to guide the placement of a single wood block into the scene without collapsing existing tower structure. To further automate the process of consecutive blocks stacking, we present an alternative approach where the model learns the physics constraint through the interaction with the environment, bypassing the dedicated physics learning as in the former part of this work. In particular, we are interested in the type of tasks that require the agent to reach a given goal state that may be different for every new trial. Thereby we propose a deep reinforcement learning framework that learns policies for stacking tasks which are parametrized by a target structure.
Interpretability Beyond Classification Output: Semantic Bottleneck Networks
M. Losch, M. Fritz and B. Schiele
Technical Report, 2019
(arXiv: 1907.10882)
Abstract
Today's deep learning systems deliver high performance based on end-to-end training. While they deliver strong performance, these systems are hard to interpret. To address this issue, we propose Semantic Bottleneck Networks (SBN): deep networks with semantically interpretable intermediate layers that all downstream results are based on. As a consequence, the analysis on what the final prediction is based on is transparent to the engineer and failure cases and modes can be analyzed and avoided by high-level reasoning. We present a case study on street scene segmentation to demonstrate the feasibility and power of SBN. In particular, we start from a well performing classic deep network which we adapt to house a SB-Layer containing task related semantic concepts (such as object-parts and materials). Importantly, we can recover state of the art performance despite a drastic dimensionality reduction from 1000s (non-semantic feature) to 10s (semantic concept) channels. Additionally we show how the activations of the SB-Layer can be used for both the interpretation of failure cases of the network as well as for confidence prediction of the resulting output. For the first time, e.g., we show interpretable segmentation results for most predictions at over 99% accuracy.
A Novel BiLevel Paradigm for Image-to-Image Translation
L. Ma, Q. Sun, B. Schiele and L. Van Gool
Technical Report, 2019
(arXiv: 1904.09028)
Abstract
Image-to-image (I2I) translation is a pixel-level mapping that requires a large number of paired training data and often suffers from the problems of high diversity and strong category bias in image scenes. In order to tackle these problems, we propose a novel BiLevel (BiL) learning paradigm that alternates the learning of two models, respectively at an instance-specific (IS) and a general-purpose (GP) level. In each scene, the IS model learns to maintain the specific scene attributes. It is initialized by the GP model that learns from all the scenes to obtain the generalizable translation knowledge. This GP initialization gives the IS model an efficient starting point, thus enabling its fast adaptation to the new scene with scarce training data. We conduct extensive I2I translation experiments on human face and street view datasets. Quantitative results validate that our approach can significantly boost the performance of classical I2I translation models, such as PG2 and Pix2Pix. Our visualization results show both higher image quality and more appropriate instance-specific details, e.g., the translated image of a person looks more like that person in terms of identity.
XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll and C. Theobalt
Technical Report, 2019
(arXiv: 1907.00837)
Abstract
We present a real-time approach for multi-person 3D motion capture at over 30 fps using a single RGB camera. It operates in generic scenes and is robust to difficult occlusions both by other people and objects. Our method operates in subsequent stages. The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long and short range skip connections to improve the information flow allowing for a drastically faster network without compromising accuracy. In the second stage, a fully-connected neural network turns the possibly partial (on account of occlusion) 2D pose and 3D pose features for each subject into a complete 3D pose estimate per individual. The third stage applies space-time skeletal model fitting to the predicted 2D and 3D pose per subject to further reconcile the 2D and 3D pose, and enforce temporal coherence. Our method returns the full skeletal pose in joint angles for each subject. This is a further key distinction from previous work that neither extracted global body positions nor joint angle results of a coherent skeleton in real time for multi-person scenes. The proposed system runs on consumer hardware at a previously unseen speed of more than 30 fps given 512x320 images as input while achieving state-of-the-art accuracy, which we will demonstrate on a range of challenging real-world scenes.
Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning
A. M. G. Salem, A. Bhattacharyya, M. Backes, M. Fritz and Y. Zhang
Technical Report, 2019
(arXiv: 1904.01067)
Abstract
Machine learning (ML) has progressed rapidly during the past decade and the major factor that drives such development is the unprecedented large-scale data. As data generation is a continuous process, this leads to ML service providers updating their models frequently with newly-collected data in an online learning scenario. In consequence, if an ML model is queried with the same set of data samples at two different points in time, it will provide different results. In this paper, we investigate whether the change in the output of a black-box ML model before and after being updated can leak information of the dataset used to perform the update. This constitutes a new attack surface against black-box ML models and such information leakage severely damages the intellectual property and data privacy of the ML model owner/provider. In contrast to membership inference attacks, we use an encoder-decoder formulation that allows inferring diverse information ranging from detailed characteristics to full reconstruction of the dataset. Our new attacks are facilitated by state-of-the-art deep learning techniques. In particular, we propose a hybrid generative model (BM-GAN) that is based on generative adversarial networks (GANs) but includes a reconstructive loss that allows generating accurate samples. Our experiments show effective prediction of dataset characteristics and even full reconstruction in challenging conditions.
Intents and Preferences Prediction Based on Implicit Human Cues
H. Sattar
PhD Thesis, Universität des Saarlandes, 2019
Abstract
Visual search is an important task, and it is part of daily human life. Thus, it has been a long-standing goal in Computer Vision to develop methods aiming at analysing human search intent and preferences. As the target of the search only exists in mind of the person, search intent prediction remains challenging for machine perception. In this thesis, we focus on advancing techniques for search target and preference prediction from implicit human cues. First, we propose a search target inference algorithm from human fixation data recorded during visual search. In contrast to previous work that has focused on individual instances as a search target in a closed world, we propose the first approach to predict the search target in open-world settings by learning the compatibility between observed fixations and potential search targets. Second, we further broaden the scope of search target prediction to categorical classes, such as object categories and attributes. However, state of the art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism – incorporating both spatial and temporal aspects of human gaze behaviour. Third, we go one step further and investigate the feasibility of combining our gaze embedding approach, with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Forth, for the first time, we studied the effect of body shape on people preferences of outfits. We propose a novel and robust multi-photo approach to estimate the body shapes of each user and build a conditional model of clothing categories given body-shape. We demonstrate that in real-world data, clothing categories and body-shapes are correlated. We show that our approach estimates a realistic looking body shape that captures a user’s weight group and body shape type, even from a single image of a clothed person. However, an accurate depiction of the naked body is considered highly private and therefore, might not be consented by most people. First, we studied the perception of such technology via a user study. Then, in the last part of this thesis, we ask if the automatic extraction of such information can be effectively evaded. In summary, this thesis addresses several different tasks that aims to enable the vision system to analyse human search intent and preferences in real-world scenarios. In particular, the thesis proposes several novel ideas and models in visual search target prediction from human fixation data, for the first time studied the correlation between shape and clothing categories opening a new direction in clothing recommendation systems, and introduces a new topic in privacy and computer vision, aimed at preventing automatic 3D shape extraction from images.
Shape Evasion: Preventing Body Shape Inference of Multi-Stage Approaches
H. Sattar, K. Krombholz, G. Pons-Moll and M. Fritz
Technical Report, 2019
(arXiv: 1905.11503)
Abstract
Modern approaches to pose and body shape estimation have recently achieved strong performance even under challenging real-world conditions. Even from a single image of a clothed person, a realistic looking body shape can be inferred that captures a users' weight group and body shape type well. This opens up a whole spectrum of applications -- in particular in fashion -- where virtual try-on and recommendation systems can make use of these new and automatized cues. However, a realistic depiction of the undressed body is regarded highly private and therefore might not be consented by most people. Hence, we ask if the automatic extraction of such information can be effectively evaded. While adversarial perturbations have been shown to be effective for manipulating the output of machine learning models -- in particular, end-to-end deep learning approaches -- state of the art shape estimation methods are composed of multiple stages. We perform the first investigation of different strategies that can be used to effectively manipulate the automatic shape estimation while preserving the overall appearance of the original image.
Mobile Eye Tracking for Everyone
J. Steil
PhD Thesis, Universität des Saarlandes, 2019
Abstract
Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training
D. Stutz, M. Hein and B. Schiele
Technical Report, 2019
(arXiv: 1910.06259)
Abstract
Adversarial training is the standard to train models robust against adversarial examples. However, especially for complex datasets, adversarial training incurs a significant loss in accuracy and is known to generalize poorly to stronger attacks, e.g., larger perturbations or other threat models. In this paper, we introduce confidence-calibrated adversarial training (CCAT) where the key idea is to enforce that the confidence on adversarial examples decays with their distance to the attacked examples. We show that CCAT preserves better the accuracy of normal training while robustness against adversarial examples is achieved via confidence thresholding, i.e., detecting adversarial examples based on their confidence. Most importantly, in strong contrast to adversarial training, the robustness of CCAT generalizes to larger perturbations and other threat models, not encountered during training. For evaluation, we extend the commonly used robust test error to our detection setting, present an adaptive attack with backtracking and allow the attacker to select, per test example, the worst-case adversarial example from multiple black- and white-box attacks. We present experimental results using $L_\infty$, $L_2$, $L_1$ and $L_0$ attacks on MNIST, SVHN and Cifar10.
2018
Video Object Segmentation with Language Referring Expressions
A. Khoreva, A. Rohrbach and B. Schiele
Computer Vision - ACCV 2018, 2018
NightOwls: A Pedestrians at Night Dataset
L. Neumann, M. Karg, S. Zhang, C. Scharfenberger, E. Piegert, S. Mistr, O. Prokofyeva, R. Thiel, A. Vedaldi, A. Zisserman and B. Schiele
Computer Vision - ACCV 2018, 2018
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz and A. Leonardis
Computer Vision - ECCV 2018 Workshops, 2018
NRST: Non-rigid Surface Tracking from Monocular Video
M. Habermann, W. Xu, H. Rohdin, M. Zollhöfer, G. Pons-Moll and C. Theobalt
Pattern Recognition (GCPR 2018), 2018