Sequential Attacks on Agents for Long-Term Adversarial Goals
E. Tretschk, S. J. Oh and M. Fritz
2. ACM Computer Science in Cars Symposium (CSCS 2018), 2018
Detailed Human Avatars from Monocular Video
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt and G. Pons-Moll
3DV 2018 , International Conference on 3D Vision, 2018
Single-Shot Multi-person 3D Pose Estimation from Monocular RGB
D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll and C. Theobalt
3DV 2018 , International Conference on 3D Vision, 2018
Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation
M. Omran, C. Lassner,, G. Pons-Moll, P. Gehler and B. Schiele
3DV 2018 , International Conference on 3D Vision, 2018
Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges and G. Pons-Moll
ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2018), Volume 37, Number 6, 2018
Quick Bootstrapping of a Personalized Gaze Model from Real-Use Interactions
M. X. Huang, J. Li, G. Ngai and H. Va Leong
ACM Transactions on Intelligent Systems and Technology, Volume 9, Number 4, 2018
Adversarial Scene Editing: Automatic Object Removal from Weak Supervision
R. Shetty, M. Fritz and B. Schiele
Advances in Neural Information Processing Systems 31, 2018
While great progress has been made recently in automatic image manipulation, it has been limited to object centric images like faces or structured scene datasets. In this work, we take a step towards general scene-level image editing by developing an automatic interaction-free object removal model. Our model learns to find and remove objects from general scene images using image-level labels and unpaired data in a generative adversarial network (GAN) framework. We achieve this with two key contributions: a two-stage editor architecture consisting of a mask generator and image in-painter that co-operate to remove objects, and a novel GAN based prior for the mask generator that allows us to flexibly incorporate knowledge about object shapes. We experimentally show on two datasets that our method effectively removes a wide variety of objects using weak supervision only
Unsupervised Learning of Shape and Pose with Differentiable Point Clouds
E. Insafutdinov and A. Dosovitskiy
Advances in Neural Information Processing Systems 31 (NIPS 2018), 2018
VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements
M. Khamis, C. Oechsner, F. Alt and A. Bulling
AVI 2018, International Conference on Advanced Visual Interfaces, 2018
Conditional Flow Variational Autoencoders for Structured Sequence Prediction
A. Bhattacharyya, M. Hanselmann, M. Fritz, B. Schiele and C.-N. Straehle
Bayesian Deep Learning NeurIPS 2019 Workshop, 2018
(Accepted/in press)
JAMI: Fast Computation of Conditional Mutual Information for ceRNA Network Analysis
A. Horňáková, M. List, J. Vreeken and M. H. Schulz
Bioinformatics, Volume 34, Number 17, 2018
Which one is me? Identifying Oneself on Public Displays
M. Khamis, C. Becker, A. Bulling and F. Alt
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild
M. Khamis, A. Baier, N. Henze, F. Alt and A. Bulling
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
Training Person-Specific Gaze Estimators from Interactions with Multiple Devices
X. Zhang, M. X. Huang, Y. Sugano and A. Bulling
CHI 2018, CHI Conference on Human Factors in Computing Systems, 2018
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
E. Wood, T. Baltrusaitis, L.-P. Morency, P. Robinson and A. Bulling
Computer Graphics Forum (Proc. EUROGRAPHICS 2018), Volume 37, Number 2, 2018
Grounding Visual Explanations
L. A. Hendricks, R. Hu, T. Darrell and Z. Akata
Computer Vision -- ECCV 2018, 2018
Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes
Y. He, B. Schiele and M. Fritz
Computer Vision -- ECCV 2018, 2018
Textual Explanations for Self-Driving Vehicles
J. Kim, A. Rohrbach, T. Darrell, J. Canny and Z. Akata
Computer Vision -- ECCV 2018, 2018
Deep neural perception and control networks have become key com- ponents of self-driving vehicles. User acceptance is likely to benefit from easy- to-interpret textual explanations which allow end-users to understand what trig- gered a particular behavior. Explanations may be triggered by the neural con- troller, namely introspective explanations , or informed by the neural controller’s output, namely rationalizations . We propose a new approach to introspective ex- planations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to the vehicle control commands, i . e ., acceleration and change of course. The controller’s at- tention identifies image regions that potentially influence the network’s output. Second, we use an attention-based video-to-text model to produce textual ex- planations of model actions. The attention maps of controller and explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong- and weak-alignment. Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD- X) dataset. Code is available at https://github.com/JinkyuKimUCB/explainable-deep-driving
A Hybrid Model for Identity Obfuscation by Face Replacement
Q. Sun, A. Tewari, W. Xu, M. Fritz, C. Theobalt and B. Schiele
Computer Vision -- ECCV 2018, 2018
Recovering Accurate {3D} Human Pose in the Wild Using {IMUs} and a Moving Camera
T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn and G. Pons-Moll
Computer Vision -- ECCV 2018, 2018
GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User
M. Khamis, A. Kienle, F. Alt and A. Bulling
DroNet’18, 4th ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, 2018
Demo of XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
D. Mehta, O. Sotnychenko, F. Mueller, H. Rhodin, W. Xu, G. Pons-Moll and C. Theobalt
ECCV 2018 Demo Sessions, 2018
A Vision-grounded Dataset for Predicting Typical Locations for Verbs
N. Mukuze, A. Rohrbach, V. Demberg and B. Schiele
Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
Eye Movements During Everyday Behavior Predict Personality Traits
S. Hoppe, T. Loetscher, S. Morey and A. Bulling
Frontiers in Human Neuroscience, Volume 12, 2018
Objects, Relationships, and Context in Visual Data
H. Zhang and Q. Sun
ICMR’18, International Conference on Multimedia Retrieval, 2018
Video Based Reconstruction of 3D People Models
T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
PoseTrack: A Benchmark for Human Pose Estimation and Tracking
M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty
A. Bhattacharyya, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Accurate and Diverse Sampling of Sequences based on a “Best of Many” Sample Objective
A. Bhattacharyya, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs
E. Laude, J.-H. Lange, J. Schüpfer, C. Domokos, L. Leal-Taixé, F. R. Schmidt, B. Andres and D. Cremers
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Disentangled Person Image Generation
L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images
T. Orekondy, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell and M. Rohrbach
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Learning 3D Shape Completion from Laser Scan Data with Weak Supervision
D. Stutz and A. Geiger
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Natural and Effective Obfuscation by Head Inpainting
Q. Sun, L. Ma, S. J. Oh, L. Van Gool, B. Schiele and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Feature Generating Networks for Zero-Shot Learning
Y. Xian, T. Lorenz, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Fooling Vision and Language Models Despite Localization and Attention Mechanism
X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell and D. Song
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll and Y. Liu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Occluded Pedestrian Detection through Guided Attention in CNNs
S. Zhang, J. Yang and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018
Learning to Refine Human Pose Estimation
M. Fieraru, A. Khoreva, L. Pishchulin and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), 2018
Image and Video Captioning with Augmented Neural Architectures
R. Shetty, H. R. Tavakoli and J. Laaksonen
IEEE MultiMedia, Volume 25, Number 2, 2018
Fast-PADMA: Rapidly Adapting Facial Affect Model from Similar Individuals
M. X. Huang, J. Li, G. Ngai, H. V. Leong and K. A. Hua
IEEE Transactions on Multimedia, Volume 20, Number 7, 2018
Reflectance and Natural Illumination from Single-Material Specular Objects Using Deep Learning
S. Georgoulis, K. Rematas, T. Ritschel, E. Gavves, M. Fritz, L. Van Gool and T. Tuytelaars
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 8, 2018
Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification
M. Lapin, M. Hein and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 7, 2018
Discriminatively Trained Latent Ordinal Model for Video Classification
K. Sikka and G. Sharma
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 8, 2018
Towards Reaching Human Performance in Pedestrian Detection
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Number 4, 2018
Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and background- versus-foreground errors. To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets for pedestrian detection, and discuss which factors affect their performance. Other than our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of training and test annotations.
Learning 3D Shape Completion under Weak Supervision
D. Stutz and A. Geiger
International Journal of Computer Vision, 2018
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori and L. Fei-Fei
International Journal of Computer Vision, Volume 126, Number 2-4, 2018
Every Little Movement Has a Meaning of Its Own: Using Past Mouse Movements to Predict the Next Interaction
T. C. K. Kwok, E. Y. Fu, E. Y. Wu, M. X. Huang, G. Ngai and H.-V. Leong
IUI 2018, 23rd International Conference on Intelligent User Interfaces, 2018
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour
P. Müller, M. X. Huang and A. Bulling
IUI 2018, 23rd International Conference on Intelligent User Interfaces, 2018
Explainable AI: The New 42?
R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg and A. Holzinger
Machine Learning and Knowledge Extraction (CD-MAKE 2018), 2018
Tracing Cell Lineages in Videos of Lens-free Microscopy
M. Rempfler, V. Stierle, K. Ditzel, S. Kumar, P. Paulitschke, B. Andres and B. H. Menze
Medical Image Analysis, Volume 48, 2018
Cross-Species Learning: A Low-Cost Approach to Learning Human Fight from Animal Fight
E. Y. Fu, M. X. Huang, H. V. Leong and G. Ngai
MM’18, 26th ACM Multimedia Conference, 2018
The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned
M. Khamis, F. Alt and A. Bulling
MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018
Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors
J. Steil, P. Müller, Y. Sugano and A. Bulling
MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018
Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction
M. Barz, F. Daiber, D. Sonntag and A. Bulling
Proceedings ETRA 2018, 2018
Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimuli’s Trajectory is Partially Hidden
T. Mattusch, M. Mirzamohammad, M. Khamis, A. Bulling and F. Alt
Proceedings ETRA 2018, 2018
Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour
P. Müller, M. X. Huang, X. Zhang and A. Bulling
Proceedings ETRA 2018, 2018
Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings
S. Park, X. Zhang, A. Bulling and O. Hilliges
Proceedings ETRA 2018, 2018
Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets
J. Steil, M. X. Huang and A. Bulling
Proceedings ETRA 2018, 2018
Revisiting Data Normalization for Appearance-Based Gaze Estimation
X. Zhang, Y. Sugano and A. Bulling
Proceedings ETRA 2018, 2018
A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation
R. Shetty, B. Schiele and M. Fritz
Proceedings of the 27th USENIX Security Symposium, 2018
Partial Optimality and Fast Lower Bounds for Weighted Correlation Clustering
J.-H. Lange, A. Karrenbauer and B. Andres
Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
A. Khan, I. Steiner, Y. Sugano, A. Bulling and R. Macdonald
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
Generating Counterfactual Explanations with Natural Language
L. A. Hendricks, R. Hu, T. Darrell and Z. Akata
Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), 2018
(arXiv: 1806.09809)
Natural language explanations of deep neural network decisions provide an intuitive way for a AI agent to articulate a reasoning process. Current textual explanations learn to discuss class discriminative features in an image. However, it is also helpful to understand which attributes might change a classification decision if present in an image (e.g., "This is not a Scarlet Tanager because it does not have black wings.") We call such textual explanations counterfactual explanations, and propose an intuitive method to generate counterfactual explanations by inspecting which evidence in an input is missing, but might contribute to a different classification decision if present in the image. To demonstrate our method we consider a fine-grained image classification task in which we take as input an image and a counterfactual class and output text which explains why the image does not belong to a counterfactual class. We then analyze our generated counterfactual explanations both qualitatively and quantitatively using proposed automatic metrics.
Advanced Steel Microstructure Classification by Deep Learning Methods
S. M. Azimi, D. Britz, M. Engstler, M. Fritz and F. Mücklich
Scientific Reports, Volume 8, 2018
The inner structure of a material is called microstructure. It stores the genesis of a material and determines all its physical and chemical properties. While microstructural characterization is widely spread and well known, the microstructural classification is mostly done manually by human experts, which opens doors for huge uncertainties. Since the microstructure could be a combination of different phases with complex substructures its automatic classification is very challenging and just a little work in this field has been carried out. Prior related works apply mostly designed and engineered features by experts and classify microstructure separately from feature extraction step. Recently Deep Learning methods have shown surprisingly good performance in vision applications by learning the features from data together with the classification step. In this work, we propose a deep learning method for microstructure classification in the examples of certain microstructural constituents of low carbon steel. This novel method employs pixel-wise segmentation via Fully Convolutional Neural Networks (FCNN) accompanied by max-voting scheme. Our system achieves 93.94% classification accuracy, drastically outperforming the state-of-the-art method of 48.89% accuracy, indicating the effectiveness of pixel-wise approaches. Beyond the success presented in this paper, this line of research offers a more robust and first of all objective way for the difficult task of steel quality appreciation.
Towards Reverse-Engineering Black-Box Neural Networks
S. J. Oh, M. Augustin, B. Schiele and M. Fritz
Sixth International Conference on Learning Representations (ICLR 2018), 2018
Long-Term Image Boundary Prediction
A. Bhattacharyya, M. Malinowski, B. Schiele and M. Fritz
Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Higher-order Projected Power Iterations for Scalable Multi-Matching
F. Bernard, J. Thunberg, P. Swoboda and C. Theobalt
Technical Report, 2018
(arXiv: 1811.10541)
The matching of multiple objects (e.g. shapes or images) is a fundamental problem in vision and graphics. In order to robustly handle ambiguities, noise and repetitive patterns in challenging real-world settings, it is essential to take geometric consistency between points into account. Computationally, the multi-matching problem is difficult. It can be phrased as simultaneously solving multiple (NP-hard) quadratic assignment problems (QAPs) that are coupled via cycle-consistency constraints. The main limitations of existing multi-matching methods are that they either ignore geometric consistency and thus have limited robustness, or they are restricted to small-scale problems due to their (relatively) high computational cost. We address these shortcomings by introducing a Higher-order Projected Power Iteration method, which is (i) efficient and scales to tens of thousands of points, (ii) straightforward to implement, (iii) able to incorporate geometric consistency, and (iv) guarantees cycle-consistent multi-matchings. Experimentally we show that our approach is superior to existing methods.
Bayesian Prediction of Future Street Scenes through Importance Sampling based Optimization
A. Bhattacharyya, M. Fritz and B. Schiele
Technical Report, 2018
(arXiv: 1806.06939)
For autonomous agents to successfully operate in the real world, anticipation of future events and states of their environment is a key competence. This problem can be formalized as a sequence prediction problem, where a number of observations are used to predict the sequence into the future. However, real-world scenarios demand a model of uncertainty of such predictions, as future states become increasingly uncertain and multi-modal -- in particular on long time horizons. This makes modelling and learning challenging. We cast state of the art semantic segmentation and future prediction models based on deep learning into a Bayesian formulation that in turn allows for a full Bayesian treatment of the prediction problem. We present a new sampling scheme for this model that draws from the success of variational autoencoders by incorporating a recognition network. In the experiments we show that our model outperforms prior work in accuracy of the predicted segmentation and provides calibrated probabilities that also better capture the multi-modal aspects of possible future states of street scenes.
Proceedings PETMEI 2018
A. Bulling, E. Kasneci and C. Lander (Eds.)
ACM, 2018
Primal-Dual Wasserstein GAN
M. Gemici, Z. Akata and M. Welling
Technical Report, 2018
(arXiv: 1805.09575)
We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for building latent variable models of the data distribution based on the primal and the dual formulations of the optimal transport (OT) problem. We utilize the primal formulation to learn a flexible inference mechanism and to create an optimal approximate coupling between the data distribution and the generative model. In order to learn the generative model, we use the dual formulation and train the decoder adversarially through a critic network that is regularized by the approximate coupling obtained from the primal. Unlike previous methods that violate various properties of the optimal critic, we regularize the norm and the direction of the gradients of the critic function. Our model shares many of the desirable properties of auto-encoding models in terms of mode coverage and latent structure, while avoiding their undesirable averaging properties, e.g. their inability to capture sharp visual features when modeling real images. We compare our algorithm with several other generative modeling techniques that utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception Scores (IS).
MLCapsule: Guarded Offline Deployment of Machine Learning as a Service
L. Hanzlik, Y. Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes and M. Fritz
Technical Report, 2018
(arXiv: 1808.00590)
With the widespread use of machine learning (ML) techniques, ML as a service has become increasingly popular. In this setting, an ML model resides on a server and users can query the model with their data via an API. However, if the user's input is sensitive, sending it to the server is not an option. Equally, the service provider does not want to share the model by sending it to the client for protecting its intellectual property and pay-per-query business model. In this paper, we propose MLCapsule, a guarded offline deployment of machine learning as a service. MLCapsule executes the machine learning model locally on the user's client and therefore the data never leaves the client. Meanwhile, MLCapsule offers the service provider the same level of control and security of its model as the commonly used server-side execution. In addition, MLCapsule is applicable to offline applications that require local execution. Beyond protecting against direct model access, we demonstrate that MLCapsule allows for implementing defenses against advanced attacks on machine learning models such as model stealing/reverse engineering and membership inference.
Manipulating Attributes of Natural Scenes via Hallucination
L. Karacan, Z. Akata, A. Erdem and E. Erdem
Technical Report, 2018
(arXiv: 1808.07413)
In this study, we explore building a two-stage framework for enabling users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken at a different season (e.g. during winter), weather condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the scene is hallucinated with the given attributes, the corresponding look is then transferred to the input image while preserving the semantic details intact, giving a photo-realistic manipulation result. As the proposed framework hallucinates what the scene will look like, it does not require any reference style image as commonly utilized in most of the appearance or style transfer approaches. Moreover, it allows to simultaneously manipulate a given scene according to a diverse set of transient attributes within a single model, eliminating the need of training multiple networks per each translation task. Our comprehensive set of qualitative and quantitative results demonstrate the effectiveness of our approach against the competing methods.
Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation
K. Z. Lin, W. Xu, Q. Sun, C. Theobalt and T.-S. Chua
Technical Report, 2018
(arXiv: 1812.09899)
We propose a novel approach to jointly perform 3D object retrieval and pose estimation from monocular images.In order to make the method robust to real world scene variations in the images, e.g. texture, lighting and background,we learn an embedding space from 3D data that only includes the relevant information, namely the shape and pose.Our method can then be trained for robustness under real world scene variations without having to render a large training set simulating these variations. Our learned embedding explicitly disentangles a shape vector and a pose vector, which alleviates both pose bias for 3D shape retrieval and categorical bias for pose estimation. Having the learned disentangled embedding, we train a CNN to map the images to the embedding space, and then retrieve the closest 3D shape from the database and estimate the 6D pose of the object using the embedding vectors. Our method achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore outperforms the previous state-of-the-art methods on both tasks.
From Perception over Anticipation to Manipulation
W. Li
PhD Thesis, Universität des Saarlandes, 2018
From autonomous driving cars to surgical robots, robotic system has enjoyed significant growth over the past decade. With the rapid development in robotics alongside the evolution in the related fields, such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic system. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate the work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realize accuracy specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools – motivated by the absence of dexterous hands on widely available general purpose robots. Afterwards, we embed the tool use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First we explore how to guide robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on a synthetic data and then on real-world block stacking. Further, we introduce the target stacking task where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.
Deep Appearance Maps
M. Maximov, T. Ritschel and M. Fritz
Technical Report, 2018
(arXiv: 1804.00863)
We propose a deep representation of appearance, i. e. the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have used deep learning to extract classic appearance representations relating to reflectance model parameters (e. g. Phong) or illumination (e. g. HDR environment maps). We suggest to directly represent appearance itself as a network we call a deep appearance map (DAM). This is a 4D generalization over 2D reflectance maps, which held the view direction fixed. First, we show how a DAM can be learned from images or video frames and later be used to synthesize appearance, given new surface orientations and viewer positions. Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn). Finally, we generalize this to an appearance estimation-and-segmentation task, where we map from an image showing multiple materials to multiple networks reproducing their appearance, as well as per-pixel segmentation.
Image Manipulation against Learned Models Privacy and Security Implications
S. J. Oh
PhD Thesis, Universität des Saarlandes, 2018
Machine learning is transforming the world. Its application areas span privacy sensitive and security critical tasks such as human identification and self-driving cars. These applications raise privacy and security related questions that are not fully understood or answered yet: Can automatic person recognisers identify people in photos even when their faces are blurred? How easy is it to find an adversarial input for a self-driving car that makes it drive off the road? This thesis contributes one of the first steps towards a better understanding of such concerns. We observe that many privacy and security critical scenarios for learned models involve input data manipulation: users obfuscate their identity by blurring their faces and adversaries inject imperceptible perturbations to the input signal. We introduce a data manipulator framework as a tool for collectively describing and analysing privacy and security relevant scenarios involving learned models. A data manipulator introduces a shift in data distribution for achieving privacy or security related goals, and feeds the transformed input to the target model. This framework provides a common perspective on the studies presented in the thesis. We begin the studies from the user’s privacy point of view. We analyse the efficacy of common obfuscation methods like face blurring, and show that they are surprisingly ineffective against state of the art person recognition systems. We then propose alternatives based on head inpainting and adversarial examples. By studying the user privacy, we also study the dual problem: model security. In model security perspective, a model ought to be robust and reliable against small amounts of data manipulation. In both cases, data are manipulated with the goal of changing the target model prediction. User privacy and model security problems can be described with the same objective. We then study the knowledge aspect of the data manipulation problem. The more one knows about the target model, the more effective manipulations one can craft. We propose a game theoretic manipulation framework to systematically represent the knowledge level on the target model and derive privacy and security guarantees. We then discuss ways to increase knowledge about a black-box model by only querying it, deriving implications that are relevant to both privacy and security perspectives.
Understanding and Controlling User Linkability in Decentralized Learning
T. Orekondy, S. J. Oh, B. Schiele and M. Fritz
Technical Report, 2018
(arXiv: 1805.05838)
Machine Learning techniques are widely used by online services (e.g. Google, Apple) in order to analyze and make predictions on user data. As many of the provided services are user-centric (e.g. personal photo collections, speech recognition, personal assistance), user data generated on personal devices is key to provide the service. In order to protect the data and the privacy of the user, federated learning techniques have been proposed where the data never leaves the user's device and "only" model updates are communicated back to the server. In our work, we propose a new threat model that is not concerned with learning about the content - but rather is concerned with the linkability of users during such decentralized learning scenarios. We show that model updates are characteristic for users and therefore lend themselves to linkability attacks. We show identification and matching of users across devices in closed and open world scenarios. In our experiments, we find our attacks to be highly effective, achieving 20x-175x chance-level performance. In order to mitigate the risks of linkability attacks, we study various strategies. As adding random noise does not offer convincing operation points, we propose strategies based on using calibrated domain-specific data; we find these strategies offers substantial protection against linkability threats with little effect to utility.
End-to-end Learning for Graph Decomposition
J. Song, B. Andres, M. Black, O. Hilliges and S. Tang
Technical Report, 2018
(arXiv: 1812.09737)
We propose a novel end-to-end trainable framework for the graph decomposition problem. The minimum cost multicut problem is first converted to an unconstrained binary cubic formulation where cycle consistency constraints are incorporated into the objective function. The new optimization problem can be viewed as a Conditional Random Field (CRF) in which the random variables are associated with the binary edge labels of the initial graph and the hard constraints are introduced in the CRF as high-order potentials. The parameters of a standard Neural Network and the fully differentiable CRF are optimized in an end-to-end manner. Furthermore, our method utilizes the cycle constraints as meta-supervisory signals during the learning of the deep feature representations by taking the dependencies between the output random variables into account. We present analyses of the end-to-end learned representations, showing the impact of the joint training, on the task of clustering images of MNIST. We also validate the effectiveness of our approach both for the feature learning and the final clustering on the challenging task of real-world multi-person pose estimation.
PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis
J. Steil, M. Koelle, W. Heuten, S. Boll and A. Bulling
Technical Report, 2018
(arXiv: 1801.04457)
As first-person cameras in head-mounted displays become increasingly prevalent, so does the problem of infringing user and bystander privacy. To address this challenge, we present PrivacEye, a proof-of-concept system that detects privacysensitive everyday situations and automatically enables and disables the first-person camera using a mechanical shutter. To close the shutter, PrivacEye detects sensitive situations from first-person camera videos using an end-to-end deep-learning model. To open the shutter without visual input, PrivacEye uses a separate, smaller eye camera to detect changes in users' eye movements to gauge changes in the "privacy level" of the current situation. We evaluate PrivacEye on a dataset of first-person videos recorded in the daily life of 17 participants that they annotated with privacy sensitivity levels. We discuss the strengths and weaknesses of our proof-of-concept system based on a quantitative technical evaluation as well as qualitative insights from semi-structured interviews.
Gaze Estimation and Interaction in Real-World Environments
X. Zhang
PhD Thesis, Universität des Saarlandes, 2018a
Gaze estimation and interaction in real-world environments
X. Zhang
PhD Thesis, Universität des Saarlandes, 2018b
Following a period of expedited progress in the capabilities of digital systems, the society begins to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weekly-supervised learning to rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics. We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.