Publications
2024
- “CloSe: A 3D Clothing Segmentation Dataset and Model,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Interaction Replica: Tracking Human–Object Interaction and Scene Changes From Human Motion,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Generating Continual Human Motion in Diverse 3D Scenes,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Recent Trends in 3D Reconstruction of General Non-Rigid Scenes,” Computer Graphics Forum (Proc. EUROGRAPHICS 2024), 2024.
- “Improving Feature Stability during Upsampling - Spectral Artifacts and the Importance of Spatial Context,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “Good Teachers Explain: Explanation-Enhanced Knowledge Distillation,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “HowToCaption: Prompting LLMs to Transform Video Annotations at Scale,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “GiT: Towards Generalist Vision Transformer through Universal Language Interface,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “Improving 2D Feature Representations by 3D-Aware Fine-Tuning,” in Computer Vision -- ECCV 2024, Milano, Italy.
- “OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “Point Transformer V3: Simpler, Faster, Stronger,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “GEARS: Local Geometry-aware Hand-object Interaction Synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA.
- “Enhanced Long-Tailed Recognition With Contrastive CutMix Augmentation,” IEEE Transactions on Image Processing, vol. 33, 2024.
- “Better Understanding Differences in Attribution Methods via Systematic Evaluations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
- “CosPGD: An Efficient White-Box Adversarial Attack for Pixel-Wise Prediction Tasks,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Implicit Representations for Constrained Image Segmentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “MultiMax: Sparse and Multi-Modal Attention Learning,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “As large as it gets - Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters,” Transactions on Machine Learning Research, vol. 2024, 2024.
- “Efficient and Differentiable Combinatorial Optimization for Visual Computing,” Universität des Saarlandes, Saarbrücken, 2024.
- “Towards Designing Inherently Interpretable Deep Neural Networks for Image Classification,” Universität des Saarlandes, Saarbrücken, 2024.
- “Advancing Image and Video Recognition with Less Supervision,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic assistance, and autonomous driving. However, the demand for more data to train models for emerging tasks is increasing dramatically. Deep learning models heavily depend on the quality and quantity of data, necessitating high-quality labeled datasets. Yet, each task requires different types of annotations for training and evaluation, posing challenges in obtaining comprehensive supervision. The acquisition of annotations is not only resource-intensive in terms of time and cost but also introduces biases, such as granularity in classification, where distinctions like specific breeds versus generic categories may arise. Furthermore, because the world is dynamic, previously annotated data can become irrelevant, and new categories and rare occurrences continually emerge, making it impossible to label every aspect of the world.
Therefore, this thesis aims to explore various supervision scenarios to mitigate the need for full supervision and reduce data acquisition costs. Specifically, we investigate learning without labels, referred to as self-supervised and unsupervised methods, to better understand video and image representations. To learn from data without labels, we leverage injected priors such as motion speed, direction, action order in videos, or semantic information granularity to obtain powerful data representations. Further, we study scenarios involving reduced supervision levels. To reduce annotation costs, first, we propose to omit precise annotations for one modality in multimodal learning, namely in text-video and image-video settings, and transfer available knowledge to large corpora of video data. Second, we study semi-supervised learning scenarios, where only a subset of annotated data alongside unlabeled data is available, and propose to revisit regularization constraints and improve generalization to unlabeled data. Additionally, we address scenarios where parts of the available data are inherently limited due to privacy and security reasons or naturally rare events, which not only restrict annotations but also limit the overall data volume. For these scenarios, we propose methods that carefully balance between previously obtained knowledge and incoming limited data by introducing a calibration method or combining a space reservation technique with orthogonality constraints. Finally, we explore multimodal and unimodal open-world scenarios where the model is asked to generalize beyond the given set of object or action classes. Specifically, we propose a new challenging setting on multimodal egocentric videos and propose an adaptation method for vision-language models to generalize to the egocentric domain.
Moreover, we study unimodal image recognition in an open-set setting and propose to disentangle the open-set detection and image classification tasks, which effectively improves generalization in different settings.
In summary, this thesis investigates challenges arising when full supervision for training models is not available. We develop methods to understand learning dynamics and the role of biases in data, while also proposing novel setups to advance training with less supervision.
2023
- “Class-Incremental Exemplar Compression for Class-Incremental Learning,” in 36th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Transitivity Recovering Decompositions: Interpretable and Robust Fine-Grained Relationships,” in Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 2023.
- “LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching,” in Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 2023.
- “Differentiable Architecture Search: a One-Shot Method?,” in AutoML Conference 2023, Potsdam/Berlin, Germany, 2023.
- “A Polyhedral Study of Lifted Multicuts,” Discrete Optimization, vol. 47, 2023.
- “SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda.
- “Neural Architecture Design and Robustness: A Dataset,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda.
Abstract
Deep learning models have proven to be successful in a wide
range of machine learning tasks. Yet, they are often highly sensitive to
perturbations on the input data which can lead to incorrect decisions
with high confidence, hampering their deployment for practical
use-cases. Thus, finding architectures that are (more) robust against
perturbations has received much attention in recent years. Just like the
search for well-performing architectures in terms of clean accuracy,
this usually involves a tedious trial-and-error process with one
additional challenge: the evaluation of a network's robustness is
significantly more expensive than its evaluation for clean accuracy.
Thus, the aim of this paper is to facilitate better streamlined research
on architectural design choices with respect to their impact on
robustness as well as, for example, the evaluation of surrogate measures
for robustness. We therefore borrow one of the most commonly considered
search spaces for neural architecture search for image classification,
NAS-Bench-201, which contains a manageable size of 6466 non-isomorphic
network designs. We evaluate all these networks on a range of common
adversarial attacks and corruption types and introduce a database on
neural architecture design and robustness evaluations. We further
present three exemplary use cases of this dataset, in which we (i)
benchmark robustness measurements based on Jacobian and Hessian matrices
for their robustness predictability, (ii) perform neural architecture
search on robust accuracies, and (iii) provide an initial analysis of
how architectural design choices affect robustness. We find that
carefully crafting the topology of a network can have substantial impact
on its robustness, where networks with the same parameter count range in
mean adversarial robust accuracy from 20%-41%.
- “FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning,” in Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda.
- “Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Federated Incremental Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “A Meta-Learning Approach to Predicting Performance and Data Requirements,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Continual Detection Transformer for Incremental Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Object Pop-Up: Can We Infer 3D Objects and their Poses from Human Interactions Alone?,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Virtual Sparse Convolution for Multimodal 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “Visibility Aware Human-Object Interaction Tracking from Single RGB Camera,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “ConQueR: Query Contrast Voxel-DETR for 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, Canada, 2023.
- “TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “SSB: Simple but Strong Baseline for Boosting Performance of Open-Set Semi-Supervised Learning,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “Robustifying Token Attention for Vision Transformers,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “Studying How to Efficiently and Effectively Guide Models with Explanations,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “DARTH: Holistic Test-time Adaptation for Multiple Object Tracking,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “Learning by Sorting: Self-supervised Learning with Group Ordering Constraints,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “SimNP: Learning Self-Similarity Priors Between Neural Points,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “NSF: Neural Surface Fields for Human Modeling from Monocular Depth,” in IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 2023.
- “On the Unreasonable Vulnerability of Transformers for Image Restoration – and an Easy Fix,” in IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), Paris, France, 2023.
- “Classification Robustness to Common Optical Aberrations,” in IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2023), Paris, France, 2023.
- “HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection,” in IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 2023.
- “Test-time Domain Adaptation for Monocular Depth Estimation,” in IEEE International Conference on Robotics and Automation (ICRA 2023), London, UK, 2023.
- “TrafficBots: Towards World Models for Autonomous Driving Simulation and Motion Prediction,” in IEEE International Conference on Robotics and Automation (ICRA 2023), London, UK, 2023.
- “LayerNet: High-Resolution Semantic 3D Reconstruction of Clothed People,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 2, 2023.
- “Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, 2023.
- “A Deeper Look into DeepCap,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, 2023.
Abstract
Human performance capture is a highly important computer vision problem with
many applications in movie production and virtual/augmented reality. Many
previous performance capture approaches either required expensive multi-view
setups or did not recover dense space-time coherent geometry with
frame-to-frame correspondences. We propose a novel deep learning approach for
monocular dense human performance capture. Our method is trained in a weakly
supervised manner based on multi-view supervision completely removing the need
for training data with 3D ground truth annotations. The network architecture is
based on two separate networks that disentangle the task into a pose estimation
and a non-rigid surface deformation step. Extensive qualitative and
quantitative evaluations show that our approach outperforms the state of the
art in terms of quality and robustness. This work is an extended version of
DeepCap where we provide more detailed explanations, comparisons and results as
well as applications.
- “Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, 2023.
Abstract
The minimum cost lifted multicut problem is a generalization of the multicut problem and is a means of optimizing a decomposition of a graph w.r.t. both positive and negative edge costs. Its main advantage is that multicut-based formulations do not require the number of components to be given a priori; instead, it is deduced from the solution. However, the standard multicut cost function is limited to pairwise relationships between nodes, while several important applications either require or can benefit from a higher-order cost function, i.e. hyper-edges. In this paper, we propose a pseudo-boolean formulation for a multiple model fitting problem. It is based on a formulation of any-order minimum cost lifted multicuts, which allows partitioning an undirected graph with pairwise connectivity so as to minimize costs defined over any set of hyper-edges. As the proposed formulation is NP-hard and the branch-and-bound algorithm is too slow in practice, we propose an efficient local search algorithm for inference in the resulting problems. We demonstrate the versatility and effectiveness of our approach in several applications: geometric multiple model fitting, homography and motion estimation, and motion segmentation.
- “Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, 2023.
- “Urban Scene Semantic Segmentation With Low-Cost Coarse Annotation,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa Village, HI, USA, 2023.
- “Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa Village, HI, USA, 2023.
- “Jointly Learning Band Selection and Filter Array Design for Hyperspectral Imaging,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa Village, HI, USA, 2023.
- “Intra-Source Style Augmentation for Improved Domain Generalization,” in 2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), Waikoloa Village, HI, USA, 2023.
- “Revisiting Consistency Regularization for Semi-supervised Learning,” International Journal of Computer Vision, vol. 131, 2023.
- “Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation,” International Journal of Computer Vision, 2023.
- “Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization,” International Journal of Computer Vision, 2023.
- “3D Object Detection for Autonomous Driving: A Comprehensive Survey,” International Journal of Computer Vision, 2023.
- “Improving Primary-Vertex Reconstruction with a Minimum-Cost Lifted Multicut Graph Partitioning Algorithm,” Journal of Instrumentation, vol. 18, 2023.
- “Towards Understanding Climate Change Perceptions: A Social Media Dataset,” in NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning, New Orleans, LA, USA, 2023.
- “Learning Comprehensive Global Features in Person Re-identification: Ensuring Discriminativeness of more Local Regions,” Pattern Recognition, vol. 134, 2023.
- “An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” in Pattern Recognition (DAGM GCPR 2023), Heidelberg, Germany, 2023.
- “FullFormer: Generating Shapes Inside Shapes,” in Pattern Recognition (DAGM GCPR 2023), Heidelberg, Germany, 2023.
- “Online Hyperparameter Optimization for Class-Incremental Learning,” in Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023.
- “Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering,” in Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023.
- “Learning Context-Aware Classifier for Semantic Segmentation,” in Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 2023.
- “ClusterFuG: Clustering Fully connected Graphs by Multicut,” in Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, Hawaii, USA, 2023.
- “Discovering Class-Specific GAN Controls for Semantic Image Synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2023), Vancouver, Canada, 2023.
- “Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences,” Transactions of the Association for Computational Linguistics, vol. 11, 2023.
- “Improving Native CNN Robustness with Filter Frequency Regularization,” Transactions on Machine Learning Research, vol. 2023, 2023.
- “Implicit Representations for Image Segmentation,” in UniReps: The First Workshop on Unifying Representations in Neural Models, New Orleans, LA, USA, 2023.
- “Modelling 3D Humans: Pose, Shape, Clothing and Interactions,” Universität des Saarlandes, Saarbrücken, 2023.
- “Learning from Imperfect Data: Incremental Learning and Few-shot Learning,” Universität des Saarlandes, Saarbrücken, 2023.
- “Improving Quality and Controllability in GAN-based Image Synthesis,” Universität des Saarlandes, Saarbrücken, 2023.
2022
- “Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- “Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- “SP-ViT: Learning 2D Spatial Priors for Vision Transformers,” in 33rd British Machine Vision Conference (BMVC 2022), London, UK, 2022.
- “Relational Proxies: Emergent Relationships as Fine-Grained Discriminators,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Robust Models are less Over-Confident,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Trading off Image Quality for Robustness is not Necessary with Regularized Deterministic Autoencoders,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Motion Transformer with Global Intention Localization and Local Movement Refinement,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “USB: A Unified Semi-supervised Learning Benchmark for Classification,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Towards Efficient 3D Object Detection with Knowledge Distillation,” in Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 2022.
- “Abstracting Sketches Through Simple Primitives,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation using Bounding Boxes,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Learned Vertex Descent: A New Direction for 3D Human Model Fitting,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “DODA: Data-Oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Class-Agnostic Object Counting Robust to Intraclass Diversity,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “FrequencyLowCut Pooling - Plug & Play against Catastrophic Overfitting,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Improving Robustness by Enhancing Weak Subnets,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “A Comparative Study of Graph Matching Algorithms in Computer Vision,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Skeleton-Free Pose Transfer for Stylized 3D Characters,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Learning Where To Look - Generative NAS is Surprisingly Efficient,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “HULC: 3D HUman Motion Capture with Pose Manifold SampLing and Dense Contact Guidance,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “CHORE: Contact, Human and Object Reconstruction from a Single RGB Image,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “COUCH: Towards Controllable Human-Chair Interactions,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement,” in Computer Vision -- ECCV 2022, Tel Aviv, Israel, 2022.
- “Advancing Translational Research in Neuroscience through Multi-task Learning,” Frontiers in Psychiatry, vol. 13, 2022.
- “Semantic Image Synthesis with Semantically Coupled VQ-Model,” in ICLR Workshop on Deep Generative Models for Highly Structured Data (ICLR 2022 DGM4HSD), Virtual, 2022.
- “RAMA: A Rapid Multicut Algorithm on GPU,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “FastDOG: Fast Discrete Optimization on GPU,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “BEHAVE: Dataset and Method for Tracking Human Object Interactions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “B-cos Networks: Alignment is All We Need for Interpretability,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Decoupling Zero-Shot Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment the novel categories
that have not been seen in the training. Existing works formulate ZS3 as a
pixel-level zero-shot classification problem, and transfer semantic knowledge
from seen classes to unseen ones with the help of language models pre-trained
only with texts. While simple, the pixel-level ZS3 formulation shows the
limited capability to integrate vision-language models that are often
pre-trained with image-text pairs and currently demonstrate great potential for
vision tasks. Inspired by the observation that humans often perform
segment-level semantic labeling, we propose to decouple the ZS3 into two
sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments.
2) a zero-shot classification task on segments. The former sub-task does not
involve category information and can be directly transferred to group pixels
for unseen classes. The latter subtask performs at segment-level and provides a
natural way to leverage large-scale vision-language models pre-trained with
image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we
propose a simple and effective zero-shot semantic segmentation model, called
ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by
large margins, e.g., 35 points on the PASCAL VOC and 3 points on the COCO-Stuff
in terms of mIoU for unseen classes. Code will be released at
github.com/dingjiansw101/ZegFormer.
- “PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with
decoupled representation learning and classifier learning for imbalanced SSL.
To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE)
for classifier learning. Furthermore, the current evaluation protocol for
imbalanced SSL focuses only on balanced test sets, which has limited
practicality in real-world scenarios. Therefore, we further conduct a
comprehensive evaluation under various shifted test distributions. In
experiments, we show that our approach outperforms other methods over a large
range of shifted distributions, achieving state-of-the-art performance on
benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our
code will be made publicly available.
- “Bi-level Alignment for Cross-Domain Crowd Counting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “LiDAR Snowfall Simulation for Robust 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
As acquiring pixel-wise annotations of real-world images for semantic
segmentation is a costly process, a model can instead be trained with more
accessible synthetic data and adapted to real images without requiring their
annotations. This process is studied in unsupervised domain adaptation (UDA).
Even though a large number of methods propose new adaptation strategies, they
are mostly based on outdated network architectures. As the influence of recent
network architectures has not been systematically studied, we first benchmark
different network architectures for UDA and then propose a novel UDA method,
DAFormer, based on the benchmark results. The DAFormer network consists of a
Transformer encoder and a multi-level context-aware feature fusion decoder. It
is enabled by three simple but crucial training strategies to stabilize the
training and to avoid overfitting DAFormer to the source domain: While the Rare
Class Sampling on the source domain improves the quality of pseudo-labels by
mitigating the confirmation bias of self-training towards common classes, the
Thing-Class ImageNet Feature Distance and a learning rate warmup promote
feature transfer from ImageNet pretraining. DAFormer significantly improves the
state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for
Synthia->Cityscapes and enables learning even difficult classes such as train,
bus, and truck well. The implementation is available at
github.com/lhoyer/DAFormer.
- “Large Loss Matters in Weakly Supervised Multi-Label Classification,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Stratified Transformer for 3D Point Cloud Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
Although considerable progress has been made in semantic scene understanding
under clear weather, it is still a tough problem under adverse weather
conditions, such as dense fog, due to the uncertainty caused by imperfect
observations. Besides, difficulties in collecting and labeling foggy images
hinder the progress of this field. Considering the success in semantic scene
understanding under clear weather, we think it is reasonable to transfer
knowledge learned from clear images to the foggy domain. As such, the problem
becomes one of bridging the domain gap between clear images and foggy images. Unlike
previous methods that mainly focus on closing the domain gap caused by fog --
defogging the foggy images or fogging the clear images, we propose to alleviate
the domain gap by considering fog influence and style variation simultaneously.
The motivation is based on our finding that the style-related gap and the
fog-related gap can be divided and closed respectively, by adding an
intermediate domain. Thus, we propose a new pipeline to cumulatively adapt
style, fog and the dual-factor (style and fog). Specifically, we devise a
unified framework to disentangle the style factor and the fog factor
separately, and then the dual-factor from images in different domains.
Furthermore, we combine the disentanglement of the three factors with a novel
cumulative loss to thoroughly disentangle these three factors. Our method
achieves the state-of-the-art performance on three benchmarks and shows
generalization ability in rainy and snowy scenes.
- “Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
Abstract
Multi-Camera Multi-Object Tracking is currently drawing attention in the
computer vision field due to its superior performance in real-world
applications such as video surveillance with crowded scenes or in vast space.
In this work, we propose a mathematically elegant multi-camera multiple object
tracking approach based on a spatial-temporal lifted multicut formulation. Our
model utilizes state-of-the-art tracklets produced by single-camera trackers as
proposals. As these tracklets may contain ID-Switch errors, we refine them
through a novel pre-clustering obtained from 3D geometry projections. As a
result, we derive a better tracking graph without ID switches and more precise
affinity costs for the data association phase. Tracklets are then matched to
multi-camera trajectories by solving a global lifted multicut formulation that
incorporates short and long-range temporal interactions on tracklets located in
the same camera as well as inter-camera ones. Experimental results on the
WildTrack dataset yield near-perfect results, outperforming state-of-the-art
trackers on Campus while being on par on the PETS-09 dataset. We will make our
implementations available upon acceptance of the paper.
- “Towards Better Understanding Attribution Methods,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Generalized Few-shot Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Scribble-Supervised LiDAR Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Sound and Visual Representation Learning with Multiple Pretraining Tasks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “RBGNet: Ray-based Grouping for 3D Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Continual Test-Time Domain Adaptation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “A Unified Query-based Paradigm for Point Cloud Understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Adiabatic Quantum Computing for Multi Object Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 2022.
- “Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “Improving Depth Estimation Using Map-Based Depth Priors,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “Learnable Online Graph Representations for 3D Multi-Object Tracking,” IEEE Robotics and Automation Letters, vol. 7, no. 2, 2022.
- “Optimising for Interpretability: Convolutional Dynamic Alignment Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, 2022.
- “Semi-Supervised and Unsupervised Deep Visual Learning: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- “DWDN: Deep Wiener Deconvolution Network for Non-Blind Image Deblurring,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, 2022.
- “Meta-Transfer Learning through Hard Tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, 2022.
- “Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, 2022.
- “Hyperspectral Image Super-Resolution with RGB Image Super-Resolution as an Auxiliary Task,” in 2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022), Waikoloa Village, HI, USA, 2022.
- “ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks,” Information Sciences, vol. 591, 2022.
- “MoCapDeform: Monocular 3D Human Motion Capture in Deformable Scenes,” in International Conference on 3D Vision, Hybrid / Prague, Czechia, 2022.
Abstract
3D human motion capture from monocular RGB images respecting interactions of
a subject with complex and possibly deformable environments is a very
challenging, ill-posed and under-explored problem. Existing methods address it
only weakly and do not model possible surface deformations often occurring when
humans interact with scene surfaces. In contrast, this paper proposes
MoCapDeform, i.e., a new framework for monocular 3D human motion capture that
is the first to explicitly model non-rigid deformations of a 3D scene for
improved 3D human pose estimation and deformable environment reconstruction.
MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the
camera space. It first localises a subject in the input monocular video along
with dense contact labels using a new raycasting based strategy. Next, our
human-environment interaction constraints are leveraged to jointly optimise
global 3D human poses and non-rigid surface deformations. MoCapDeform achieves
higher accuracy than competing methods on several datasets, including our
newly recorded one with deforming background scenes.
- “PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection,” International Journal of Computer Vision, vol. 131, 2022.
- “OASIS: Only Adversarial Supervision for Semantic Image Synthesis,” International Journal of Computer Vision, vol. 130, 2022.
- “Attribute Prototype Network for Any-Shot Learning,” International Journal of Computer Vision, vol. 130, 2022.
- “DPER: Direct Parameter Estimation for Randomly Missing Data,” Knowledge-Based Systems, vol. 240, 2022.
- “Aliasing and Adversarial Robust Generalization of CNNs,” Machine Learning, vol. 111, 2022.
- “Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks,” in Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022), Grenoble, France, 2022.
Abstract
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial
optimization problem of partitioning a real-valued edge-weighted graph such as
to minimize the total cost of the partition. While graph convolutional neural
networks (GNN) have proven to be promising in the context of combinatorial
optimization, most of them are only tailored to or tested on positive-valued
edge weights, i.e. they do not comply with the nature of the multicut problem. We
therefore adapt various GNN architectures including Graph Convolutional
Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to
facilitate the efficient encoding of real-valued edge costs. Moreover, we
employ a reformulation of the multicut ILP constraints to a polynomial program
as a loss function that allows learning feasible multicut solutions in a scalable
way. Thus, we provide the first approach towards end-to-end trainable
multicuts. Our findings support that GNN approaches can produce good solutions
in practice while providing lower computation times and largely improved
scalability compared to LP solvers and optimized heuristics, especially when
considering large instances.
- “TATL: Task Agnostic Transfer Learning for Skin Attributes Detection,” Medical Image Analysis, vol. 78, 2022.
- “Impact of Realistic Properties of the Point Spread Function on Classification Tasks to Reveal a Possible Distribution Shift,” in NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2022 Workshop DistShift), New Orleans, LA, USA, 2022.
- “Optimizing Edge Detection for Image Segmentation with Multicut Penalties,” in Pattern Recognition (DAGM GCPR 2022), Konstanz, Germany, 2022.
Abstract
The Minimum Cost Multicut Problem (MP) is a popular way for obtaining a graph
decomposition by optimizing binary edge labels over edge costs. While the
formulation of a MP from independently estimated costs per edge is highly
flexible and intuitive, solving the MP is NP-hard and time-expensive. As a
remedy, recent work proposed to predict edge probabilities with awareness to
potential conflicts by incorporating cycle constraints in the prediction
process. We argue that such formulation, while providing a first step towards
end-to-end learnable edge weights, is suboptimal, since it is built upon a
loose relaxation of the MP. We therefore propose an adaptive CRF that allows us to
progressively consider more violated constraints and, in consequence, to issue
solutions with higher validity. Experiments on the BSDS500 benchmark for
natural image segmentation as well as on electron microscopic recordings show
that our approach yields more precise edge detection and image segmentation.
- “Keypoint Message Passing for Video-Based Person Re-identification,” in Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Conference, 2022.
- “PlanT: Explainable Planning Transformers via Object-Level Representations,” in Proceedings of the 6th Annual Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 2022.
Abstract
Planning an optimal route in a complex environment requires efficient
reasoning about the surrounding scene. While human drivers prioritize important
objects and ignore details not relevant to the decision, learning-based
planners typically extract features from dense, high-dimensional grid
representations containing all vehicle and road context information. In this
paper, we propose PlanT, a novel approach for planning in the context of
self-driving that uses a standard transformer architecture. PlanT is based on
imitation learning with a compact object-level input representation. On the
Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the
driving score of the expert) while being 5.3x faster than equivalent
pixel-based planning baselines during inference. Combining PlanT with an
off-the-shelf perception module provides a sensor-based driving system that is
more than 10 points better in terms of driving score than the existing state of
the art. Furthermore, we propose an evaluation protocol to quantify the ability
of planners to identify relevant objects, providing insights regarding their
decision-making. Our results indicate that PlanT can focus on the most relevant
object in the scene, even when this object is geometrically distant.
- “Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization,” in Proceedings of The Workshop on Automatic Summarization for Creative Writing (COLING 2022), Gyeongju, Republic of Korea, 2022.
- “An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning,” 2022. [Online]. Available: https://arxiv.org/abs/2211.11086.
Abstract
Semi-supervised learning (SSL) has shown great promise in leveraging
unlabeled data to improve model performance. While standard SSL assumes uniform
data distribution, we consider a more realistic and challenging setting called
imbalanced SSL, where imbalanced class distributions occur in both labeled and
unlabeled data. Although there are existing endeavors to tackle this challenge,
their performance degenerates when facing severe imbalance since they cannot
reduce the class imbalance sufficiently and effectively. In this paper, we
study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance
by simply supplementing labeled data with pseudo-labels, according to the
difference in class distribution from the most frequent class. Such a simple
baseline turns out to be highly effective in reducing class imbalance. It
outperforms existing methods by a significant margin, e.g., 12.8%, 13.6%, and
16.7% over previous SOTA on CIFAR100-LT, FOOD101-LT, and ImageNet127
respectively. The reduced imbalance results in faster convergence and better
pseudo-label accuracy of SimiS. The simplicity of our method also makes it
possible to be combined with other re-balancing techniques to improve the
performance further. Moreover, our method shows great robustness to a wide
range of data distributions, which holds enormous potential in practice. Code
will be publicly available.
- “Leveraging Self-Supervised Training for Unintentional Action Recognition,” 2022. [Online]. Available: https://arxiv.org/abs/2209.11870.
Abstract
Unintentional actions are rare occurrences that are difficult to define
precisely and that are highly dependent on the temporal context of the action.
In this work, we explore such actions and seek to identify the points in videos
where the actions transition from intentional to unintentional. We propose a
multi-stage framework that exploits inherent biases such as motion speed,
motion direction, and order to recognize unintentional actions. To enhance
representations via self-supervised training for the task of unintentional
action recognition we propose temporal transformations, called Temporal
Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The
multi-stage approach models the temporal information on both the level of
individual frames and full clips. These enhanced representations show strong
performance for unintentional action recognition tasks. We provide an extensive
ablation study of our framework and report results that significantly improve
over the state-of-the-art.
- “Normalization Perturbation: A Simple Domain Generalization Method for Real-World Domain Shifts,” 2022. [Online]. Available: https://arxiv.org/abs/2211.04393.
Abstract
Improving a model's generalizability against domain shifts is crucial,
especially for safety-critical applications such as autonomous driving.
Real-world domain styles can vary substantially due to environment changes and
sensor noises, but deep models only know the training domain style. Such a domain
style gap impedes model generalization on diverse real-world domains. Our
proposed Normalization Perturbation (NP) can effectively overcome this domain
style overfitting problem. We observe that this problem is mainly caused by the
biased distribution of low-level features learned in shallow CNN layers. Thus,
we propose to perturb the channel statistics of source domain features to
synthesize various latent styles, so that the trained deep model can perceive
diverse potential domains and generalize well even without observations of
target domain data in training. We further explore the style-sensitive channels
for effective style synthesis. Normalization Perturbation only relies on a
single source domain and is surprisingly effective and extremely easy to
implement. Extensive experiments verify the effectiveness of our method for
generalizing models under real-world domain shifts.
- “Visually Plausible Human-Object Interaction Capture from Wearable Sensors,” 2022. [Online]. Available: https://arxiv.org/abs/2205.02830.
Abstract
In everyday life, humans naturally modify the surrounding environment
through interactions, e.g., moving a chair to sit on it. To reproduce such
interactions in virtual spaces (e.g., metaverse), we need to be able to capture
and model them, including changes in the scene geometry, ideally from
ego-centric input alone (head camera and body-worn inertial sensors). This is
an extremely hard problem, especially since the object/scene might not be
visible from the head camera (e.g., a human not looking at a chair while
sitting down, or not looking at the door handle while opening a door). In this
paper, we present HOPS, the first method to capture interactions such as
dragging objects and opening doors from ego-centric data alone. Central to our
method is reasoning about human-object interactions, allowing us to track objects
even when they are not visible from the head camera. HOPS localizes and
registers both the human and the dynamic object in a pre-scanned static scene.
HOPS is an important first step towards advanced AR/VR applications based on
immersive virtual universes, and can provide human-centric training data to
teach machines to interact with their surroundings. The supplementary video,
data, and code will be available on our project page at
virtualhumans.mpi-inf.mpg.de/hops/
- “Lifted Edges as Connectivity Priors for Multicut and Disjoint Paths,” Universität des Saarlandes, Saarbrücken, 2022.
- “Deep Gradient Learning for Efficient Camouflaged Object Detection,” 2022. [Online]. Available: https://arxiv.org/pdf/2205.12853.pdf.
Abstract
This paper introduces DGNet, a novel deep framework that exploits object
gradient supervision for camouflaged object detection (COD). It decouples the
task into two connected branches, i.e., a context and a texture encoder. The
essential connection is the gradient-induced transition, representing a soft
grouping between context and texture features. Benefiting from the simple but
efficient framework, DGNet outperforms existing state-of-the-art COD models by
a large margin. Notably, our efficient version, DGNet-S, runs in real-time (80
fps) and achieves comparable results to the cutting-edge model
JCSOD-CVPR$_{21}$ with only 6.82% of the parameters. Application results also show
that the proposed DGNet performs well in polyp segmentation, defect detection,
and transparent object segmentation tasks. Codes will be made available at
github.com/GewelsJI/DGNet.
- “MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge -- Motion Prediction,” 2022. [Online]. Available: https://arxiv.org/abs/2209.10033.
Abstract
In this report, we present the 1st place solution for the motion prediction track
in the 2022 Waymo Open Dataset Challenges. We propose a novel Motion Transformer
framework for multimodal motion prediction, which introduces a small set of
novel motion query pairs for generating better multimodal future trajectories
by jointly performing the intention localization and iterative motion
refinement. A simple model ensemble strategy with non-maximum-suppression is
adopted to further boost the final performance. Our approach achieves the 1st
place on the motion prediction leaderboard of 2022 Waymo Open Dataset
Challenges, outperforming other methods with remarkable margins. Code will be
available at github.com/sshaoshuai/MTR.
- “Understanding and Improving Robustness and Uncertainty Estimation in Deep Learning,” Universität des Saarlandes, Saarbrücken, 2022.
Abstract
Deep learning is becoming increasingly relevant for many high-stakes applications such as autonomous driving or medical diagnosis where wrong decisions can have massive impact on human lives. Unfortunately, deep neural networks are typically assessed solely based on generalization, e.g., accuracy on a fixed test set. However, this is clearly insufficient for safe deployment as potential malicious actors and distribution shifts or the effects of quantization and unreliable hardware are disregarded. Thus, recent work additionally evaluates performance on potentially manipulated or corrupted inputs as well as after quantization and deployment on specialized hardware. In such settings, it is also important to obtain reasonable estimates of the model's confidence alongside its predictions. This thesis studies robustness and uncertainty estimation in deep learning along three main directions: First, we consider so-called adversarial examples, slightly perturbed inputs causing severe drops in accuracy. Second, we study weight perturbations, focusing particularly on bit errors in quantized weights. This is relevant for deploying models on special-purpose hardware for efficient inference, so-called accelerators. Finally, we address uncertainty estimation to improve robustness and provide meaningful statistical performance guarantees for safe deployment. In detail, we study the existence of adversarial examples with respect to the underlying data manifold. In this context, we also investigate adversarial training which improves robustness by augmenting training with adversarial examples at the cost of reduced accuracy. We show that regular adversarial examples leave the data manifold in an almost orthogonal direction. While we find no inherent trade-off between robustness and accuracy, this contributes to a higher sample complexity as well as severe overfitting of adversarial training. 
Using a novel measure of flatness in the robust loss landscape with respect to weight changes, we also show that robust overfitting is caused by converging to particularly sharp minima. In fact, we find a clear correlation between flatness and good robust generalization. Further, we study random and adversarial bit errors in quantized weights. In accelerators, random bit errors occur in the memory when reducing voltage with the goal of improving energy-efficiency. Here, we consider a robust quantization scheme, use weight clipping as regularization and perform random bit error training to improve bit error robustness, allowing considerable energy savings without requiring hardware changes. In contrast, adversarial bit errors are maliciously introduced through hardware- or software-based attacks on the memory, with severe consequences on performance. We propose a novel adversarial bit error attack to study this threat and use adversarial bit error training to improve robustness and thereby also the accelerator's security. Finally, we view robustness in the context of uncertainty estimation. By encouraging low-confidence predictions on adversarial examples, our confidence-calibrated adversarial training successfully rejects adversarial, corrupted as well as out-of-distribution examples at test time. Thereby, we are also able to improve the robustness-accuracy trade-off compared to regular adversarial training. However, even robust models do not provide any guarantee for safe deployment. To address this problem, conformal prediction allows the model to predict confidence sets with user-specified guarantee of including the true label. Unfortunately, as conformal prediction is usually applied after training, the model is trained without taking this calibration step into account. To address this limitation, we propose conformal training which allows training conformal predictors end-to-end with the underlying model. 
This not only improves the obtained uncertainty estimates but also enables optimizing application-specific objectives without losing the provided guarantee. Besides our work on robustness or uncertainty, we also address the problem of 3D shape completion of partially observed point clouds. Specifically, we consider an autonomous driving or robotics setting where vehicles are commonly equipped with LiDAR or depth sensors and obtaining a complete 3D representation of the environment is crucial. However, ground truth shapes that are essential for applying deep learning techniques are extremely difficult to obtain. Thus, we propose a weakly-supervised approach that can be trained on the incomplete point clouds while offering efficient inference. In summary, this thesis contributes to our understanding of robustness against both input and weight perturbations. To this end, we also develop methods to improve robustness alongside uncertainty estimation for safe deployment of deep learning methods in high-stakes applications. In the particular context of autonomous driving, we also address 3D shape completion of sparse point clouds.
- “Structured Prediction Problem Archive,” 2022. [Online]. Available: https://arxiv.org/abs/2202.03574.
Abstract
Structured prediction problems are one of the fundamental tools in machine
learning. In order to facilitate algorithm development for their numerical
solution, we collect in one place a large number of datasets in easy to read
formats for a diverse set of problem classes. We provide archival links to
datasets, description of the considered problems and problem formats, and a
short summary of problem characteristics including size, number of instances
etc. For reference we also give a non-exhaustive selection of algorithms
proposed in the literature for their solution. We hope that this central
repository will make benchmarking and comparison to established works easier.
We welcome submission of interesting new datasets and algorithms for inclusion
in our archive.
- “On Fragile Features and Batch Normalization in Adversarial Training,” 2022. [Online]. Available: https://arxiv.org/abs/2204.12393.
Abstract
Modern deep learning architectures utilize batch normalization (BN) to
stabilize training and improve accuracy. It has been shown that the BN layers
alone are surprisingly expressive. In the context of robustness against
adversarial examples, however, BN is argued to increase vulnerability. That is,
BN helps to learn fragile features. Nevertheless, BN is still used in
adversarial training, which is the de-facto standard to learn robust features.
In order to shed light on the role of BN in adversarial training, we
investigate to what extent the expressiveness of BN can be used to robustify
fragile features in comparison to random features. On CIFAR10, we find that
adversarially fine-tuning just the BN layers can result in non-trivial
adversarial robustness. Adversarially training only the BN layers from scratch,
in contrast, is not able to convey meaningful adversarial robustness. Our
results indicate that fragile features can be used to learn models with
moderate adversarial robustness, while random features cannot.
- “Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes,” 2022. [Online]. Available: https://arxiv.org/abs/2208.08621.
Abstract
Current efficient LiDAR-based detection frameworks are lacking in exploiting
object relations, which are naturally present in both spatial and temporal manners.
To this end, we introduce a simple, efficient, and effective two-stage
detector, termed Ret3D. At the core of Ret3D is the utilization of novel
intra-frame and inter-frame relation modules to capture the spatial and
temporal relations accordingly. More specifically, the intra-frame relation module
(IntraRM) encapsulates the intra-frame objects into a sparse graph and thus
allows us to refine the object features through efficient message passing. On
the other hand, the inter-frame relation module (InterRM) densely connects each
object in its corresponding tracked sequences dynamically, and leverages such
temporal information to further enhance its representations efficiently through
a lightweight transformer network. We instantiate our novel designs of IntraRM
and InterRM with general center-based or anchor-based detectors and evaluate
them on the Waymo Open Dataset (WOD). With negligible extra overhead, Ret3D
achieves the state-of-the-art performance, being 5.5% and 3.2% higher than the
recent competitor in terms of the LEVEL 1 and LEVEL 2 mAPH metrics on vehicle
detection, respectively.
- “TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement,” 2022. [Online]. Available: https://arxiv.org/abs/2205.07982.
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction
sequences using a data prior. Existing hand trackers, especially those that
rely on very few cameras, often produce visually unrealistic results with
hand-object intersection or missing contacts. Although correcting such errors
requires reasoning about temporal aspects of interaction, most previous work
focuses on static grasps and contacts. At the core of our method are TOCH fields, a
novel spatio-temporal representation for modeling correspondences between hands
and objects during interaction. The key component is a point-wise
object-centric representation which encodes the hand position relative to the
object. Leveraging this novel representation, we learn a latent manifold of
plausible TOCH fields with a temporal denoising auto-encoder. Experiments
demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object
interaction models, which are limited to static grasps and contacts. More
importantly, our method produces smooth interactions even before and after
contact. Using a single trained TOCH model, we quantitatively and qualitatively
demonstrate its usefulness for 1) correcting erroneous reconstruction results
from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising,
and 3) grasp transfer across objects. We will release our code and trained
model on our project page at virtualhumans.mpi-inf.mpg.de/toch/
- “Hypergraph Transformer for Skeleton-based Action Recognition,” 2022. [Online]. Available: https://arxiv.org/abs/2211.09590.
Abstract
Skeleton-based action recognition aims to predict human actions given human
joint coordinates with skeletal interconnections. To model such off-grid data
points and their co-occurrences, Transformer-based formulations would be a
natural choice. However, Transformers still lag behind state-of-the-art methods
using graph convolutional networks (GCNs). Transformers assume that the input
is permutation-invariant and homogeneous (partially alleviated by positional
encoding), which ignores an important characteristic of skeleton data, i.e.,
bone connectivity. Furthermore, each type of body joint has a clear physical
meaning in human motion, i.e., motion retains an intrinsic relationship
regardless of the joint coordinates, which is not explored in Transformers. In
fact, certain re-occurring groups of body joints are often involved in specific
actions, such as the subconscious hand movement for keeping balance. Vanilla
attention is incapable of describing such underlying relations that are
persistent and beyond pair-wise. In this work, we aim to exploit these unique
aspects of skeleton data to close the performance gap between Transformers and
GCNs. Specifically, we propose a new self-attention (SA) extension, named
Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order
relations into the model. The K-hop relative positional embeddings are also
employed to take bone connectivity into account. We name the resulting model
Hyperformer, and it achieves comparable or better performance w.r.t. accuracy
and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D
120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the
significantly improved performance reached by our Hyperformer demonstrates the
underestimated potential of Transformer models in this field.
2021
- “Real-time Deep Dynamic Characters,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2021), vol. 40, no. 4, 2021.
- “Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021.
- “Fine-Grained Zero-Shot Learning with DNA as Side Information,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021.
- “RMM: Reinforced Memory Management for Class-Incremental Learning,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021.
- “Shape your Space: A Gaussian Mixture Regularization Approach to Deterministic Autoencoders,” in Advances in Neural Information Processing Systems 34 pre-proceedings (NeurIPS 2021), Virtual Event, 2021.
- “Monocular 3D Multi-Person Pose Estimation via Predicting Factorized Correction Factors,” Computer Vision and Image Understanding, vol. 213, 2021.
- “Learning to Teach and Learn for Semi-supervised Few-shot Image Classification,” Computer Vision and Image Understanding, vol. 212, 2021.
- “mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Keep CALM and Improve Visual Feature Attribution,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Relating Adversarially Robust Generalization to Flat Minima,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Task Switching Network for Multi-task Learning,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Dual Contrastive Loss and Attention for GANs,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “End-to-End Urban Driving by Imitating a Reinforcement Learning Coach,” in ICCV 2021, IEEE/CVF International Conference on Computer Vision, Virtual Event, 2021.
- “Learning Decision Trees Recurrently Through Communication,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Convolutional Dynamic Alignment Networks for Interpretable Classifications,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Distilling Audio-Visual Knowledge by Compositional Contrastive Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
- “Stereo Radiance Fields (SRF): Learning View Synthesis from Sparse Views of Novel Scenes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Learning Spatially-Variant MAP Models for Non-blind Image Deblurring,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Adaptive Aggregation Networks for Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Open World Compositional Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
- “Learning Graph Embeddings for Compositional Zero-shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “SMPLicit: Topology-aware Generative Model for Clothed People,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
- “D-NeRF: Neural Radiance Fields for Dynamic Scenes,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA (Virtual), 2021.
- “Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual Conference, 2021.
- “Deep Outlier Handling for Image Deblurring,” IEEE Transactions on Image Processing, vol. 30, 2021.
- “Generating Face Images With Attributes for Free,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, 2021.
- “Future Moment Assessment for Action Query,” in IEEE Winter Conference on Applications of Computer Vision (WACV 2021), Virtual Event, 2021.
- “Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences,” in IEEE Winter Conference on Applications of Computer Vision (WACV 2021), Virtual, 2021.
- “EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data,” Information Sciences, vol. 567, 2021.
- “You Only Need Adversarial Supervision for Semantic Image Synthesis,” in International Conference on Learning Representations (ICLR 2021), Vienna, Austria (Virtual), 2021.
- “Norm-Aware Embedding for Efficient Person Search and Tracking,” International Journal of Computer Vision, vol. 129, 2021.
- “Guest Editorial: Special Issue on ‘Computer Vision for All Seasons: Adverse Weather and Lighting Conditions,’” International Journal of Computer Vision, vol. 129, 2021.
- “DLOW: Domain Flow and Applications,” International Journal of Computer Vision, vol. 129, 2021.
- “Semantic Bottlenecks: Quantifying and Improving Inspectability of Deep Representations,” International Journal of Computer Vision, vol. 129, 2021.
- “Guided Attention in CNNs for Occluded Pedestrian Detection and Re-identification,” International Journal of Computer Vision, vol. 129, 2021.
- “SampleFix: Learning to Correct Programs by Sampling Diverse Fixes,” in Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), Virtual Event, 2021.
- “DARTS for Inverse Problems: a Study on Stability,” in NeurIPS 2021 Workshop on Deep Learning and Inverse Problems (NeurIPS 2021 Deep Inverse Workshop), Virtual, 2021.
- “Internalized Biases in Fréchet Inception Distance,” in NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2021 Workshop DistShift), Virtual, 2021.
- “(SP)2Net for Generalized Zero-Label Semantic Segmentation,” in Pattern Recognition (GCPR 2021), Bonn, Germany, 2022.
- “Revisiting Consistency Regularization for Semi-supervised Learning,” in Pattern Recognition (GCPR 2021), Bonn, Germany, 2022.
- “Efficient Message Passing for 0–1 ILPs with Binary Decision Diagrams,” in Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 2021.
- “Bit Error Robustness for Energy-Efficient DNN Accelerators,” in Proceedings of the 4th MLSys Conference, Virtual Conference, 2021.
Abstract
Deep neural network (DNN) accelerators have received considerable attention in
recent years because they save energy compared to mainstream hardware. Low-voltage
operation of DNN accelerators makes it possible to further reduce energy consumption
significantly but causes bit-level failures in the memory storing the
quantized DNN weights. In this paper, we show that a combination of robust
fixed-point quantization, weight clipping, and random bit error training
(RandBET) improves robustness against random bit errors in (quantized) DNN
weights significantly. This leads to high energy savings from both low-voltage
operation as well as low-precision quantization. Our approach generalizes
across operating voltages and accelerators, as demonstrated on bit errors from
profiled SRAM arrays. We also discuss why weight clipping alone is already a
quite effective way to achieve robustness against bit errors. Moreover, we
specifically discuss the involved trade-offs regarding accuracy, robustness and
precision: Without losing more than 1% in accuracy compared to a normally
trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher
energy savings of, e.g., 30%, are possible at the cost of 2.5% accuracy, even
for 4-bit DNNs.
- “Compositional Mixture Representations for Vision and Text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), New Orleans, LA, USA, 2022.
- “Probabilistic Compositional Embeddings for Multimodal Image Retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), New Orleans, LA, USA, 2022.
- “A Closer Look at Self-training for Zero-Label Semantic Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), Virtual Workshop, 2021.
- “InfoScrub: Towards Attribute Privacy by Targeted Obfuscation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), Virtual Workshop, 2021.
- “Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI 2021), Montreal, Canada, 2021.
- “Spectral Distribution Aware Image Generation,” in Thirty-Fifth AAAI Conference on Artificial Intelligence Technical Tracks 2, Virtual Conference, 2021.
- “FastDOG: Fast Discrete Optimization on GPU,” 2021. [Online]. Available: https://arxiv.org/abs/2111.10270.
Abstract
We present a massively parallel Lagrange decomposition method for solving 0-1
integer linear programs occurring in structured prediction. We propose a new
iterative update scheme for solving the Lagrangean dual and a perturbation
technique for decoding primal solutions. For representing subproblems we follow
Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and
dual algorithms require little synchronization between subproblems and
optimization over BDDs needs only elementary operations without complicated
control flow. This allows us to exploit the parallelism offered by GPUs for all
components of our method. We present experimental results on combinatorial
problems from MAP inference for Markov Random Fields, quadratic assignment and
cell tracking for developmental biology. Our highly parallel GPU implementation
improves upon the running times of the algorithms from Lange et al. (2021) by
up to an order of magnitude. In particular, we come close to or outperform some
state-of-the-art specialized heuristics while being problem agnostic.
- “Long-term future prediction under uncertainty and multi-modality,” Universität des Saarlandes, Saarbrücken, 2021.
- “Where and When: Space-Time Attention for Audio-Visual Explanations,” 2021. [Online]. Available: https://arxiv.org/abs/2105.01517.
Abstract
Explaining the decision of a multi-modal decision-maker requires determining
the evidence from both modalities. Recent advances in XAI provide explanations
for models trained on still images. However, when it comes to modeling multiple
sensory modalities in a dynamic world, it remains underexplored how to
demystify the mysterious dynamics of a complex multi-modal model. In this work,
we take a crucial step forward and explore learnable explanations for
audio-visual recognition. Specifically, we propose a novel space-time attention
network that uncovers the synergistic dynamics of audio and visual data over
both space and time. Our model is capable of predicting the audio-visual video
events, while justifying its decision by localizing where the relevant visual
cues appear, and when the predicted sounds occur in videos. We benchmark our
model on three audio-visual video event datasets, comparing extensively to
multiple recent multi-modal representation learners and intrinsic explanation
models. Experimental results demonstrate the clearly superior performance of our
model over the existing methods on audio-visual video event recognition.
Moreover, we conduct an in-depth study to analyze the explainability of our
model based on robustness analysis via perturbation tests and pointing games
using human annotations.
- “TADA: Taxonomy Adaptive Domain Adaptation,” 2021. [Online]. Available: https://arxiv.org/abs/2109.04813.
Abstract
Traditional domain adaptation addresses the task of adapting a model to a
novel target domain under limited or no additional supervision. While tackling
the input domain gap, the standard domain adaptation settings assume no domain
change in the output space. In semantic prediction tasks, different datasets
are often labeled according to different semantic taxonomies. In many
real-world settings, the target domain task requires a different taxonomy than
the one imposed by the source domain. We therefore introduce the more general
taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent
taxonomies between the two domains. We further propose an approach that jointly
addresses the image-level and label-level domain adaptation. On the
label-level, we employ a bilateral mixed sampling strategy to augment the
target domain, and a relabelling method to unify and align the label spaces. We
address the image-level domain gap by proposing an uncertainty-rectified
contrastive learning method, leading to more domain-invariant and class
discriminative features. We extensively evaluate the effectiveness of our
framework under different TADA settings: open taxonomy, coarse-to-fine
taxonomy, and partially-overlapping taxonomy. Our framework outperforms
previous state-of-the-art by a large margin, while being capable of adapting to
target taxonomies.
- “Learning Graph Embeddings for Open World Compositional Zero-Shot Learning,” 2021. [Online]. Available: https://arxiv.org/abs/2105.01017.
Abstract
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions
of state and object visual primitives seen during training. A problem with
standard CZSL is the assumption of knowing which unseen compositions will be
available at test time. In this work, we overcome this assumption by operating in
the open world setting, where no limit is imposed on the compositional space at
test time, and the search space contains a large number of unseen compositions.
To address this problem, we propose a new approach, Compositional Cosine Graph
Embeddings (Co-CGE), based on two principles. First, Co-CGE models the
dependency between states, objects and their compositions through a graph
convolutional neural network. The graph propagates information from seen to
unseen concepts, improving their representations. Second, since not all unseen
compositions are equally feasible, and less feasible ones may damage the
learned representations, Co-CGE estimates a feasibility score for each unseen
composition, using the scores as margins in a cosine similarity-based loss and
as weights in the adjacency matrix of the graphs. Experiments show that our
approach achieves state-of-the-art performances in standard CZSL while
outperforming previous methods in the open world scenario.
- “From Pixels to People,” Universität des Saarlandes, Saarbrücken, 2021.
Abstract
Humans are at the centre of a significant amount of research in computer vision.
Endowing machines with the ability to perceive people from visual data is an immense
scientific challenge with a high degree of direct practical relevance. Success in automatic
perception can be measured at different levels of abstraction, and this will depend on
which intelligent behaviour we are trying to replicate: the ability to localise persons in
an image or in the environment, understanding how persons are moving at the skeleton
and at the surface level, interpreting their interactions with the environment including
with other people, and perhaps even anticipating future actions. In this thesis we tackle
different sub-problems of the broad research area referred to as "looking at people",
aiming to perceive humans in images at different levels of granularity.
We start with bounding box-level pedestrian detection: We present a retrospective
analysis of methods published in the decade preceding our work, identifying various
strands of research that have advanced the state of the art. With quantitative
experiments, we demonstrate the critical role of developing better feature representations
and having the right training distribution. We then contribute two methods based
on the insights derived from our analysis: one that combines the strongest aspects of
past detectors and another that focuses purely on learning representations. The latter
method outperforms more complicated approaches, especially those based on
hand-crafted features. We conclude our work on pedestrian detection with a forward-looking
analysis that maps out potential avenues for future research.
We then turn to pixel-level methods: Perceiving humans requires us to both separate
them precisely from the background and identify their surroundings. To this end, we
introduce Cityscapes, a large-scale dataset for street scene understanding. This has since
established itself as a go-to benchmark for segmentation and detection. We additionally
develop methods that relax the requirement for expensive pixel-level annotations, focusing
on the task of boundary detection, i.e. identifying the outlines of relevant objects and
surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and
labelling to fine-grained spatial understanding. We contribute a method for recovering
3D human shape and pose, which marries the advantages of learning-based and
model-based approaches.
We conclude the thesis with a detailed discussion of benchmarking practices in
computer vision. Among other things, we argue that the design of future datasets
should be driven by the general goal of combinatorial robustness besides task-specific
considerations.
- “Adversarial Content Manipulation for Analyzing and Improving Model Robustness,” Universität des Saarlandes, Saarbrücken, 2021.
- “Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes,” 2021. [Online]. Available: https://arxiv.org/abs/2102.01161.
Abstract
Most learning methods for 3D data (point clouds, meshes) suffer significant
performance drops when the data is not carefully aligned to a canonical
orientation. Aligning real world 3D data collected from different sources is
non-trivial and requires manual intervention. In this paper, we propose the
Adjoint Rigid Transform (ART) Network, a neural module which can be integrated
with a variety of 3D networks to significantly boost their performance. ART
learns to rotate input shapes to a learned canonical orientation, which is
crucial for a lot of tasks such as shape reconstruction, interpolation,
non-rigid registration, and latent disentanglement. ART achieves this with
self-supervision and a rotation equivariance constraint on predicted rotations.
The remarkable result is that with only self-supervision, ART facilitates
learning a unique canonical orientation for both rigid and nonrigid shapes,
which leads to a notable boost in performance of aforementioned tasks. We will
release our code and pre-trained models for further research.
2020
- “Hierarchical Online Instance Matching for Person Search,” in AAAI Technical Track: Vision, New York, NY, USA, 2020.
- “Manipulating Attributes of Natural Scenes via Hallucination,” ACM Transactions on Graphics, vol. 39, no. 1, 2020.
- “XNect: Real-time Multi-person 3D Motion Capture with a Single RGB Camera,” ACM Transactions on Graphics, vol. 39, no. 4, 2020.
- “XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2020), vol. 39, no. 4, 2020.
- “LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
- “GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
- “Neural Unsigned Distance Fields for Implicit Function Learning,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
- “Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
- “Attribute Prototype Network for Zero-Shot Learning,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual Event, 2020.
- “GAN-Leaks: A Taxonomy of Membership Inference Attacks against GANs,” in CCS ’20, ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, 2020.
- “Combining Implicit Function Learning and Parametric Models for 3D Human Reconstruction,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Kinematic 3D Object Detection in Monocular Video,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “NASA: Neural Articulated Shape Approximation,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “An Ensemble of Epoch-wise Empirical Bayes for Few-shot Learning,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Towards Recognizing Unseen Categories in Unseen Domains,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Deep Graph Matching via Blackbox Differentiation of Combinatorial Solvers,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Towards Automated Testing and Robustification by Semantic Adversarial Data Generation,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Inclusive GAN: Improving Data and Minority Coverage in Generative Models,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Unsupervised Shape and Pose Disentanglement for 3D Meshes,” in Computer Vision -- ECCV 2020, Glasgow, UK, 2020.
- “Implicit Feature Networks for Texture Completion from Partial 3D Data,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2020.
- “Synthetic Convolutional Features for Improved Semantic Segmentation,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
- “Adversarial Training Against Location-Optimized Adversarial Patches,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
- “SHARP 2020: The 1st Shape Recovery from Partial Textured 3D Scans Challenge Results,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
- “Body Shape Privacy in Images: Understanding Privacy and Preventing Automatic Shape Extraction,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2021.
Abstract
Modern approaches to pose and body shape estimation have recently achieved
strong performance even under challenging real-world conditions. Even from a
single image of a clothed person, a realistic looking body shape can be
inferred that captures a user's weight group and body shape type well. This
opens up a whole spectrum of applications -- in particular in fashion -- where
virtual try-on and recommendation systems can make use of these new and
automatized cues. However, a realistic depiction of the undressed body is
regarded as highly private, and most people might therefore not consent to it.
Hence, we ask if the automatic extraction of such information can be
effectively evaded. While adversarial perturbations have been shown to be
effective for manipulating the output of machine learning models -- in
particular, end-to-end deep learning approaches -- state-of-the-art shape
estimation methods are composed of multiple stages. We perform the first
investigation of different strategies that can be used to effectively
manipulate the automatic shape estimation while preserving the overall
appearance of the original image.
- “Generalized Many-Way Few-Shot Video Classification,” in Computer Vision -- ECCV Workshops 2020, Glasgow, UK, 2020.
- “Sparse Recovery with Integrality Constraints,” Discrete Applied Mathematics, vol. 283, 2020.
- “Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Normalizing Flows With Multi-Scale Autoregressive Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Norm-Aware Embedding for Efficient Person Search,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Evaluating Weakly Supervised Object Localization Methods Right,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “DeepCap: Monocular Human Performance Capture Using Weak Supervision,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Learning Interactions and Relationships between Movie Characters,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Mnemonics Training: Multi-Class Incremental Learning Without Forgetting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Learning to Dress 3D People in Generative Clothing,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Learning to Transfer Texture from Clothing Images to 3D Humans,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “A U-Net Based Discriminator for Generative Adversarial Networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA (Virtual), 2020.
- “Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, 2020.
- “Person Recognition in Personal Photo Collections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, 2020.
- “SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- “DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, 2020.
- “Learning Robust Representations via Multi-View Information Bottleneck,” in International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 2020.
- “Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks,” in International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 2020.
- “Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval,” International Journal of Computer Vision, vol. 128, 2020.
- “Deep Gaze Pooling: Inferring and Visually Decoding Search Intents from Human Gaze Fixations,” Neurocomputing, vol. 387, 2020.
- “Haar Wavelet based Block Autoregressive Flows for Trajectories,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
- “Analyzing the Dependency of ConvNets on Spatial Information,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
- “Long-Term Anticipation of Activities with Cycle Consistency,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
- “On the Lifted Multicut Polytope for Trees,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
- “Semantic Bottlenecks: Quantifying & Improving Inspectability of Deep Representations,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
- “Long-Tailed Recognition Using Class-Balanced Experts,” in Pattern Recognition (GCPR 2020), Tübingen, Germany, 2021.
- “Anticipating Averted Gaze in Dyadic Interactions,” in Proceedings ETRA 2020 Full Papers, Stuttgart, Germany, 2020.
- “Diverse and Relevant Visual Storytelling with Scene Graph Embeddings,” in Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL 2020), Online, 2020.
- “Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning,” in Proceedings of the 29th USENIX Security Symposium, Virtual Event, 2020.
- “Lifted Disjoint Paths with Application in Multiple Object Tracking,” in Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Conference, 2020.
- “Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks,” in Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Conference, 2020.
- “A Primal-Dual Solver for Large-Scale Tracking-by-Assignment,” in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020), Virtual Conference, 2020.
- “CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations,” in xxAI -- Beyond Explainable AI (xxAI @ICML 2020), Vienna, Austria (Virtually), 2022.
- “PoseTrackReID: Dataset Description,” 2020. [Online]. Available: https://arxiv.org/abs/2011.06243.
Abstract
Current datasets for video-based person re-identification (re-ID) do not
include structural knowledge in form of human pose annotations for the persons
of interest. Nonetheless, pose information is very helpful to disentangle
useful feature information from background or occlusion noise. Especially
real-world scenarios, such as surveillance, contain a lot of occlusions in
human crowds or by obstacles. On the other hand, video-based person re-ID can
benefit other tasks such as multi-person pose tracking in terms of robust
feature matching. For that reason, we present PoseTrackReID, a large-scale
dataset for multi-person pose tracking and video-based person re-ID. With
PoseTrackReID, we want to bridge the gap between person re-ID and multi-person
pose tracking. Additionally, this dataset provides a good benchmark for current
state-of-the-art methods on multi-frame person re-ID.
- “Analyzing the Dependency of ConvNets on Spatial Information,” 2020. [Online]. Available: https://arxiv.org/abs/2002.01827.
Abstract
Intuitively, image classification should profit from using spatial
information. Recent work, however, suggests that this might be overrated in
standard CNNs. In this paper, we are pushing the envelope and aim to further
investigate the reliance on spatial information. We propose spatial shuffling
and GAP+FC to destroy spatial information during both training and testing
phases. Interestingly, we observe that spatial information can be deleted from
later layers with small performance drops, which indicates spatial information
at later layers is not necessary for good performance. For example, test
accuracy of VGG-16 only drops by 0.03% and 2.66% with spatial information
completely removed from the last 30% and 53% layers on CIFAR100, respectively.
Evaluation on several object recognition datasets (CIFAR100, Small-ImageNet,
ImageNet) with a wide range of CNN architectures (VGG16, ResNet50, ResNet152)
shows an overall consistent pattern.
- “Improved Methods and Analysis for Semantic Image Segmentation,” Universität des Saarlandes, Saarbrücken, 2020.
Abstract
Modern deep learning has enabled amazing developments of computer vision in recent years (Hinton and Salakhutdinov, 2006; Krizhevsky et al., 2012). As a fundamental task, semantic segmentation aims to predict class labels for each pixel of images, which empowers machines' perception of the visual world. In spite of recent successes of fully convolutional networks (Long et al., 2015), several challenges remain to be addressed. In this thesis, we focus on this topic, under different kinds of input formats and various types of scenes. Specifically, our study contains two aspects: (1) Data-driven neural modules for improved performance. (2) Leverage of datasets w.r.t. training systems with higher performances and better data privacy guarantees. In the first part of this thesis, we improve semantic segmentation by designing new modules which are compatible with existing architectures. First, we develop a spatio-temporal data-driven pooling, which brings additional information of data (i.e. superpixels) into neural networks, benefiting the training of neural networks as well as the inference on novel data. We investigate our approach in RGB-D videos for segmenting indoor scenes, where depth provides complementary cues to colors and our model performs particularly well. Second, we design learnable dilated convolutions, which are the extension of standard dilated convolutions, whose dilation factors (Yu and Koltun, 2016) need to be carefully determined by hand to obtain decent performance. We present a method to learn dilation factors together with filter weights of convolutions to avoid a complicated search of dilation factors. We explore extensive studies on challenging street scenes, across various baselines with different complexity as well as several datasets at varying image resolutions. In the second part, we investigate how to utilize expensive training data.
First, we start from the generative modelling and study the network architectures and the learning pipeline for generating multiple examples. We aim to improve the diversity of generated examples but also to preserve the comparable quality of the examples. Second, we develop a generative model for synthesizing features of a network. With a mixture of real images and synthetic features, we are able to train a segmentation model with better generalization capability. Our approach is evaluated on different scene parsing tasks to demonstrate the effectiveness of the proposed method. Finally, we study membership inference on the semantic segmentation task. We propose the first membership inference attack system against black-box semantic segmentation models, that tries to infer if a data pair is used as training data or not. From our observations, information on training data is indeed leaking. To mitigate the leakage, we leverage our synthetic features to perform prediction obfuscations, reducing the posterior distribution gaps between a training and a testing set. Consequently, our study provides not only an approach for detecting illegal use of data, but also the foundations for a safer use of semantic segmentation models.
- “Towards Accurate Multi-Person Pose Estimation in the Wild,” Universität des Saarlandes, Saarbrücken, 2020.
- “Multicut Optimization Guarantees & Geometry of Lifted Multicuts,” Universität des Saarlandes, Saarbrücken, 2020.
- “Sensing, Interpreting, and Anticipating Human Social Behaviour in the Real World,” Universität des Saarlandes, Saarbrücken, 2020.
- “Understanding and Controlling Leakage in Machine Learning,” Universität des Saarlandes, Saarbrücken, 2020.
- “Learning from Limited Labeled Data - Zero-Shot and Few-Shot Learning,” Universität des Saarlandes, Saarbrücken, 2020.
2019
- “LiveCap: Real-time Human Performance Capture from Monocular Video,” ACM Transactions on Graphics, vol. 38, no. 2, 2019.
- “Modeling Conceptual Understanding in Image Reference Games,” in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019.
- “Combining Generative and Discriminative Models for Hybrid Inference,” in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019.
- “Learning to Self-Train for Semi-Supervised Few-Shot Classification,” in Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019.
- “Everyday Eye Tracking for Real-World Consumer Behavior Analysis,” in A Handbook of Process Tracing Methods for Decision Research, 2nd ed., New York, NY: Taylor & Francis, 2019.
- “Conditional Flow Variational Autoencoders for Structured Sequence Prediction,” in Bayesian Deep Learning NeurIPS 2019 Workshop, Vancouver, Canada, 2019.
- “Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications,” in CHI 2019, CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 2019.
- “XNect Demo (v2): Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” in CVPR 2019 Demonstrations, Long Beach, CA, USA, 2019.
- “Towards Reverse-Engineering Black-Box Neural Networks,” in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Berlin: Springer, 2019.
- “InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation,” GetMobile, vol. 23, no. 2, 2019.
- “Emergent Leadership Detection Across Datasets,” in ICMI ’19, International Conference on Multimodal Interaction, Suzhou, China, 2019.
Abstract
Automatic detection of emergent leaders in small groups from nonverbal
behaviour is a growing research topic in social signal processing but existing
methods were evaluated on single datasets -- an unrealistic assumption for
real-world applications in which systems are required to also work in settings
unseen at training time. It therefore remains unclear whether current methods
for emergent leadership detection generalise to similar but new settings and to
which extent. To overcome this limitation, we are the first to study a
cross-dataset evaluation setting for the emergent leadership detection task. We
provide evaluations for within- and cross-dataset prediction using two current
datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the
robustness of commonly used feature channels (visual focus of attention, body
pose, facial action units, speaking activity) and online prediction in the
cross-dataset setting. Our evaluations show that using pose and eye contact
based features, cross-dataset prediction is possible with an accuracy of 0.68,
as such providing another important piece of the puzzle towards emergent
leadership detection in the real world.
- “Learning to Reconstruct People in Clothing from a Single RGB Camera,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-based Image Retrieval,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “In the Wild Human Pose Estimation using Explicit 2D Features and Intermediate 3D Representations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Time-Conditioned Action Anticipation in One Shot,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Combinatorial Persistency Criteria for Multicut and Max-Cut,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Knockoff Nets: Stealing Functionality of Black-Box Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Disentangling Adversarial Robustness and Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Meta-Transfer Learning for Few-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “MAP Inference via Block-Coordinate Frank-Wolfe Algorithm,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “A Convex Relaxation for Multi-Graph Matching,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
Abstract
When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they can not make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strength of VAE and GANs and in addition, via an unconditional discriminator, learns the marginal feature distribution of unlabeled images. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.
- “Semantic Projection Network for Zero- and Few-Label Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Texture Mixer: A Network for Controllable Synthesis and Interpolation of Texture,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “SimulCap: Single-View Human Performance Capture with Cloth Simulation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
- “Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard,” in IEEE International Conference on Systems, Man, and Cybernetics (SMC 2019), Bari, Italy, 2019.
- “Zero-shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, 2019.
Abstract
Due to the importance of zero-shot learning, i.e. classifying images where
there is a lack of labeled training data, the number of proposed approaches has
recently increased steadily. We argue that it is time to take a step back and
to analyze the status quo of the area. The purpose of this paper is three-fold.
First, given the fact that there is no agreed upon zero-shot learning
benchmark, we first define a new benchmark by unifying both the evaluation
protocols and data splits of publicly available datasets used for this task.
This is an important contribution as published results are often not comparable
and sometimes even flawed due to, e.g. pre-training on zero-shot test classes.
Moreover, we propose a new zero-shot learning dataset, the Animals with
Attributes 2 (AWA2) dataset which we make publicly available both in terms of
image features and the images themselves. Second, we compare and analyze a
significant number of the state-of-the-art methods in depth, both in the
classic zero-shot setting but also in the more realistic generalized zero-shot
setting. Finally, we discuss in detail the limitations of the current status of
the area which can be taken as a basis for advancing it.
- “MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, 2019.
- “Fashion is Taking Shape: Understanding Clothing Preference Based on Body Shape From Online Sources,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV 2019), Waikoloa Village, HI, USA, 2019.
- “360-Degree Textures of People in Clothing from a Single Image,” in International Conference on 3D Vision, Québec City, Canada, 2019.
- “Bottleneck Potentials in Markov Random Fields,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
- “Tex2Shape: Detailed Full Human Body Geometry from a Single Image,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
Abstract
We present a simple yet effective method to infer detailed full human body
shape from only a single photograph. Our model can infer full-body shape
including face, hair, and clothing including wrinkles at interactive
frame-rates. Results feature details even on parts that are occluded in the
input image. Our main idea is to turn shape regression into an aligned
image-to-image translation problem. The input to our method is a partial
texture map of the visible region obtained from off-the-shelf methods. From a
partial texture, we estimate detailed normal and vector displacement maps,
which can be applied to a low-resolution smooth body model to add detail and
clothing. Despite being trained purely with synthetic data, our model
generalizes well to real-world photographs. Numerous results demonstrate the
versatility and robustness of our method.
- “HiPPI: Higher-Order Projected Power Iterations for Scalable Multi-Matching,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
- “Multi-Garment Net: Learning to Dress 3D People from Images,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
- “AMASS: Archive of Motion Capture as Surface Shapes,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
- “Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
- “Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints,” in International Conference on Computer Vision (ICCV 2019), Seoul, Korea, 2019.
- “Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods,” in International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 2019.
- “Lucid Data Dreaming for Video Object Segmentation,” International Journal of Computer Vision, vol. 127, no. 9, 2019.
- “Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour,” in MM ’19, 27th ACM International Conference on Multimedia, Nice, France, 2019.
Abstract
Internal thought refers to the process of directing attention away from a
primary visual task to internal cognitive processing. Internal thought is a
pervasive mental activity and closely related to primary task performance. As
such, automatic detection of internal thought has significant potential for
user modelling in intelligent interfaces, particularly for e-learning
applications. Despite the close link between the eyes and the human mind, only
a few studies have investigated vergence behaviour during internal thought and
none has studied moment-to-moment detection of internal thought from gaze.
While prior studies relied on long-term data analysis and required a large
number of gaze characteristics, we describe a novel method that is
computationally light-weight and that only requires eye vergence information
that is readily available from binocular eye trackers. We further propose a
novel paradigm to obtain ground truth internal thought annotations that
exploits human blur perception. We evaluate our method for three increasingly
challenging detection tasks: (1) during a controlled math-solving task, (2)
during natural viewing of lecture videos, and (3) during daily activities, such
as coding, browsing, and reading. Results from these evaluations demonstrate
the performance and robustness of vergence-based detection of internal thought
and, as such, open up new directions for research on interfaces that adapt to
shifts of mental attention.
- “Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders,” in Multilingual Surface Realisation (MSR 2019), Hong Kong, China, 2019.
- “SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
Abstract
Recent methods to automatically calibrate stationary eye trackers were shown
to effectively reduce inherent calibration distortion. However, these methods
require additional information, such as mouse clicks or on-screen content. We
propose the first method that only requires users' eye movements to reduce
calibration distortion in the background while users naturally look at an
interface. Our method exploits that calibration distortion makes straight
saccade trajectories appear curved between the saccadic start and end points.
We show that this curving effect is systematic and the result of distorted gaze
projection plane. To mitigate calibration distortion, our method undistorts
this plane by straightening saccade trajectories using image warping. We show
that this approach improves over the common six-point calibration and is
promising for reducing distortion. As such, it provides a non-intrusive
solution to alleviating accuracy decrease of eye tracker during long-term use.
- “Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
- “Privacy-Aware Eye Tracking Using Differential Privacy,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
- “PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features,” in Proceedings ETRA 2019, Denver, CO, USA, 2019.
- “Detecting Stress from Mouse-Gaze Attraction,” in Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC 2019), Limassol, Cyprus, 2019.
- “Gradient-Leaks: Understanding Deanonymization in Federated Learning,” in The 2nd International Workshop on Federated Learning for Data Privacy and Confidentiality (FL-NeurIPS 2019), Vancouver, Canada, 2019.
- “Bottleneck Potentials in Markov Random Fields,” 2019. [Online]. Available: http://arxiv.org/abs/1904.08080.
Abstract
We consider general discrete Markov Random Fields (MRFs) with additional
bottleneck potentials which penalize the maximum (instead of the sum) over
local potential value taken by the MRF-assignment. Bottleneck potentials or
analogous constructions have been considered in (i) combinatorial optimization
(e.g. bottleneck shortest path problem, the minimum bottleneck spanning tree
problem, bottleneck function minimization in greedoids), (ii) inverse problems
with $L_{\infty}$-norm regularization, and (iii) valued constraint satisfaction
on the $(\min,\max)$-pre-semirings. Bottleneck potentials for general discrete
MRFs are a natural generalization of the above direction of modeling work to
Maximum-A-Posteriori (MAP) inference in MRFs. To this end, we propose MRFs
whose objective consists of two parts: terms that factorize according to (i)
$(\min,+)$, i.e. potentials as in plain MRFs, and (ii) $(\min,\max)$, i.e.
bottleneck potentials. To solve the ensuing inference problem, we propose
high-quality relaxations and efficient algorithms for solving them. We
empirically show efficacy of our approach on large scale seismic horizon
tracking problems.
- “‘Best-of-Many-Samples’ Distribution Matching,” 2019. [Online]. Available: http://arxiv.org/abs/1909.12598.
Abstract
Generative Adversarial Networks (GANs) can achieve state-of-the-art sample
quality in generative modelling tasks but suffer from the mode collapse
problem. Variational Autoencoders (VAE) on the other hand explicitly maximize a
reconstruction-based data log-likelihood forcing it to cover all modes, but
suffer from poorer sample quality. Recent works have proposed hybrid VAE-GAN
frameworks which integrate a GAN-based synthetic likelihood to the VAE
objective to address both the mode collapse and sample quality issues, with
limited success. This is because the VAE objective forces a trade-off between
the data log-likelihood and divergence to the latent prior. The synthetic
likelihood ratio term also shows instability during training. We propose a
novel objective with a "Best-of-Many-Samples" reconstruction cost and a stable
direct estimate of the synthetic likelihood. This enables our hybrid VAE-GAN
framework to achieve high data log-likelihood and low divergence to the latent
prior at the same time and shows significant improvement over both hybrid
VAE-GANs and plain GANs in mode coverage and quality.
- “LCC: Learning to Customize and Combine Neural Networks for Few-Shot Learning,” 2019. [Online]. Available: http://arxiv.org/abs/1904.08479.
Abstract
Meta-learning has been shown to be an effective strategy for few-shot
learning. The key idea is to leverage a large number of similar few-shot tasks
in order to meta-learn how to best initiate a (single) base-learner for novel
few-shot tasks. While meta-learning how to initialize a base-learner has shown
promising results, it is well known that hyperparameter settings such as the
learning rate and the weighting of the regularization term are important to
achieve best performance. We thus propose to also meta-learn these
hyperparameters and in fact learn a time- and layer-varying scheme for learning
a base-learner on novel tasks. Additionally, we propose to learn not only a
single base-learner but an ensemble of several base-learners to obtain more
robust results. While ensembles of learners have shown to improve performance
in various settings, this is challenging for few-shot learning tasks due to the
limited number of training samples. Therefore, our approach also aims to
meta-learn how to effectively combine several base-learners. We conduct
extensive experiments and report top performance for five-class few-shot
recognition tasks on two challenging benchmarks: miniImageNet and
Fewshot-CIFAR100 (FC100).
- “Learning Manipulation under Physics Constraints with Visual Perception,” 2019. [Online]. Available: http://arxiv.org/abs/1904.09860.
Abstract
Understanding physical phenomena is a key competence that enables humans and
animals to act and interact under uncertain perception in previously unseen
environments containing novel objects and their configurations. In this work,
we consider the problem of autonomous block stacking and explore solutions to
learning manipulation under physics constraints with visual perception inherent
to the task. Inspired by the intuitive physics in humans, we first present an
end-to-end learning-based approach to predict stability directly from
appearance, contrasting a more traditional model-based approach with explicit
3D representations and physical simulation. We study the model's behavior
together with an accompanied human subject test. It is then integrated into a
real-world robotic system to guide the placement of a single wood block into
the scene without collapsing existing tower structure. To further automate the
process of consecutive blocks stacking, we present an alternative approach
where the model learns the physics constraint through the interaction with the
environment, bypassing the dedicated physics learning as in the former part of
this work. In particular, we are interested in the type of tasks that require
the agent to reach a given goal state that may be different for every new
trial. Thereby we propose a deep reinforcement learning framework that learns
policies for stacking tasks which are parametrized by a target structure.
- “Interpretability Beyond Classification Output: Semantic Bottleneck Networks,” 2019. [Online]. Available: http://arxiv.org/abs/1907.10882.
Abstract
Today's deep learning systems deliver high performance based on end-to-end
training. While they deliver strong performance, these systems are hard to
interpret. To address this issue, we propose Semantic Bottleneck Networks
(SBN): deep networks with semantically interpretable intermediate layers that
all downstream results are based on. As a consequence, the analysis on what the
final prediction is based on is transparent to the engineer and failure cases
and modes can be analyzed and avoided by high-level reasoning. We present a
case study on street scene segmentation to demonstrate the feasibility and
power of SBN. In particular, we start from a well performing classic deep
network which we adapt to house a SB-Layer containing task related semantic
concepts (such as object-parts and materials). Importantly, we can recover
state of the art performance despite a drastic dimensionality reduction from
1000s (non-semantic feature) to 10s (semantic concept) channels. Additionally,
we show how the activations of the SB-Layer can be used for both the
interpretation of failure cases of the network as well as for confidence
prediction of the resulting output. For the first time, e.g., we show
interpretable segmentation results for most predictions at over 99% accuracy.
- “A Novel BiLevel Paradigm for Image-to-Image Translation,” 2019. [Online]. Available: http://arxiv.org/abs/1904.09028.
Abstract
Image-to-image (I2I) translation is a pixel-level mapping that requires a
large number of paired training data and often suffers from the problems of
high diversity and strong category bias in image scenes. In order to tackle
these problems, we propose a novel BiLevel (BiL) learning paradigm that
alternates the learning of two models, respectively at an instance-specific
(IS) and a general-purpose (GP) level. In each scene, the IS model learns to
maintain the specific scene attributes. It is initialized by the GP model that
learns from all the scenes to obtain the generalizable translation knowledge.
This GP initialization gives the IS model an efficient starting point, thus
enabling its fast adaptation to the new scene with scarce training data. We
conduct extensive I2I translation experiments on human face and street view
datasets. Quantitative results validate that our approach can significantly
boost the performance of classical I2I translation models, such as PG2 and
Pix2Pix. Our visualization results show both higher image quality and more
appropriate instance-specific details, e.g., the translated image of a person
looks more like that person in terms of identity.
- “XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” 2019. [Online]. Available: http://arxiv.org/abs/1907.00837.
Abstract
We present a real-time approach for multi-person 3D motion capture at over 30
fps using a single RGB camera. It operates in generic scenes and is robust to
difficult occlusions both by other people and objects. Our method operates in
subsequent stages. The first stage is a convolutional neural network (CNN) that
estimates 2D and 3D pose features along with identity assignments for all
visible joints of all individuals. We contribute a new architecture for this
CNN, called SelecSLS Net, that uses novel selective long and short range skip
connections to improve the information flow allowing for a drastically faster
network without compromising accuracy. In the second stage, a fully-connected
neural network turns the possibly partial (on account of occlusion) 2D pose and
3D pose features for each subject into a complete 3D pose estimate per
individual. The third stage applies space-time skeletal model fitting to the
predicted 2D and 3D pose per subject to further reconcile the 2D and 3D pose,
and enforce temporal coherence. Our method returns the full skeletal pose in
joint angles for each subject. This is a further key distinction from previous
work that neither extracted global body positions nor joint angle results of a
coherent skeleton in real time for multi-person scenes. The proposed system
runs on consumer hardware at a previously unseen speed of more than 30 fps
given 512x320 images as input while achieving state-of-the-art accuracy, which
we will demonstrate on a range of challenging real-world scenes.
- “Shape Evasion: Preventing Body Shape Inference of Multi-Stage Approaches,” 2019. [Online]. Available: http://arxiv.org/abs/1905.11503.
Abstract
Modern approaches to pose and body shape estimation have recently achieved
strong performance even under challenging real-world conditions. Even from a
single image of a clothed person, a realistic looking body shape can be
inferred that captures a user's weight group and body shape type well. This
opens up a whole spectrum of applications -- in particular in fashion -- where
virtual try-on and recommendation systems can make use of these new and
automatized cues. However, a realistic depiction of the undressed body is
regarded as highly private and therefore might not be consented to by most people.
Hence, we ask if the automatic extraction of such information can be
effectively evaded. While adversarial perturbations have been shown to be
effective for manipulating the output of machine learning models -- in
particular, end-to-end deep learning approaches -- state of the art shape
estimation methods are composed of multiple stages. We perform the first
investigation of different strategies that can be used to effectively
manipulate the automatic shape estimation while preserving the overall
appearance of the original image.
- “Intents and Preferences Prediction Based on Implicit Human Cues,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Visual search is an important task, and it is part of daily human life. Thus, it has been a long-standing goal in Computer Vision to develop methods aiming at analysing human search intent and preferences. As the target of the search only exists in the mind of the person, search intent prediction remains challenging for machine perception. In this thesis, we focus on advancing techniques for search target and preference prediction from implicit human cues. First, we propose a search target inference algorithm from human fixation data recorded during visual search. In contrast to previous work that has focused on individual instances as a search target in a closed world, we propose the first approach to predict the search target in open-world settings by learning the compatibility between observed fixations and potential search targets. Second, we further broaden the scope of search target prediction to categorical classes, such as object categories and attributes. However, state of the art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism – incorporating both spatial and temporal aspects of human gaze behaviour. Third, we go one step further and investigate the feasibility of combining our gaze embedding approach with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Fourth, for the first time, we studied the effect of body shape on people's preferences of outfits. We propose a novel and robust multi-photo approach to estimate the body shapes of each user and build a conditional model of clothing categories given body-shape. We demonstrate that in real-world data, clothing categories and body-shapes are correlated.
We show that our approach estimates a realistic looking body shape that captures a user’s weight group and body shape type, even from a single image of a clothed person. However, an accurate depiction of the naked body is considered highly private and therefore, might not be consented to by most people. First, we studied the perception of such technology via a user study. Then, in the last part of this thesis, we ask if the automatic extraction of such information can be effectively evaded. In summary, this thesis addresses several different tasks that aim to enable the vision system to analyse human search intent and preferences in real-world scenarios. In particular, the thesis proposes several novel ideas and models in visual search target prediction from human fixation data, for the first time studies the correlation between shape and clothing categories opening a new direction in clothing recommendation systems, and introduces a new topic in privacy and computer vision, aimed at preventing automatic 3D shape extraction from images.
- “Mobile Eye Tracking for Everyone,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Eye tracking and gaze-based human-computer interfaces have become a practical modality in desktop settings, since remote eye tracking is efficient and affordable. However, remote eye tracking remains constrained to indoor, laboratory-like conditions, in which lighting and user position need to be controlled. Mobile eye tracking has the potential to overcome these limitations and to allow people to move around freely and to use eye tracking on a daily basis during their everyday routine. However, mobile eye tracking currently faces two fundamental challenges that prevent it from being practically usable and that, consequently, have to be addressed before mobile eye tracking can truly be used by everyone: Mobile eye tracking needs to be advanced and made fully functional in unconstrained environments, and it needs to be made socially acceptable. Numerous sensing and analysis methods were initially developed for remote eye tracking and have been successfully applied for decades. Unfortunately, these methods are limited in terms of functionality and correctness, or even unsuitable for application in mobile eye tracking. Therefore, the majority of fundamental definitions, eye tracking methods, and gaze estimation approaches cannot be borrowed from remote eye tracking without adaptation. For example, the definitions of specific eye movements, like classical fixations, need to be extended to mobile settings where natural user and head motion are omnipresent. Corresponding analytical methods need to be adjusted or completely reimplemented based on novel approaches encoding the human gaze behaviour. Apart from these technical challenges, an entirely new, and yet under-explored, topic required for the breakthrough of mobile eye tracking as everyday technology is the overcoming of social obstacles. A first crucial key issue to defuse social objections is the building of acceptance towards mobile eye tracking. 
Hence, it is essential to replace the bulky appearance of current head-mounted eye trackers with an unobtrusive, appealing, and trendy design. The second high-priority theme of increasing importance for everyone is privacy and its protection, given that research and industry have not focused on or taken care of this problem at all. To establish true confidence, future devices have to find a fine balance between protecting users’ and bystanders’ privacy and attracting and convincing users of their necessity, utility, and potential with useful and beneficial features. The solution of technical challenges and social obstacles is the prerequisite for the development of a variety of novel and exciting applications in order to establish mobile eye tracking as a new paradigm, which eases our everyday life. This thesis addresses core technical challenges of mobile eye tracking that currently prevent it from being widely adopted. Specifically, this thesis proves that 3D data used for the calibration of mobile eye trackers improves gaze estimation and significantly reduces the parallax error. Further, it presents the first effective fixation detection method for head-mounted devices that is robust against the prevalence of user and gaze target motion. In order to achieve social acceptability, this thesis proposes an innovative and unobtrusive design for future mobile eye tracking devices and builds the first prototype with fully frame-embedded eye cameras combined with a calibration-free deep-trained appearance-based gaze estimation approach. To protect users’ and bystanders’ privacy in the presence of head-mounted eye trackers, this thesis presents another first-of-its-kind prototype. It is able to identify privacy-sensitive situations to automatically enable and disable the eye tracker’s first-person camera by means of a mechanical shutter, leveraging the combination of deep scene and eye movement features.
Nevertheless, solving technical challenges and social obstacles alone is not sufficient to make mobile eye tracking attractive for the masses. The key to success is the development of convincingly useful, innovative, and essential applications. To extend the protection of users’ privacy on the software side as well, this thesis presents the first privacy-aware VR gaze interface using differential privacy. This method adds noise to recorded eye tracking data so that privacy-sensitive information like a user’s gender or identity is protected without impeding the utility of the data itself. In addition, the first large-scale online survey is conducted to understand users’ concerns with eye tracking. To develop and evaluate novel applications, this thesis presents the first publicly available long-term eye tracking datasets. They are used to show the unsupervised detection of users’ activities from eye movements alone using novel and efficient video-based encoding approaches as well as to propose the first proof-of-concept method to forecast users’ attentive behaviour during everyday mobile interactions from phone-integrated and body-worn sensors. This opens up possibilities for the development of a variety of novel and exciting applications. With more advanced features, accompanied by technological progress and sensor miniaturisation, eye tracking is increasingly integrated into conventional glasses as well as virtual and augmented reality (VR/AR) head-mounted displays, becoming an integral component of mobile interfaces. This thesis paves the way for the development of socially acceptable, privacy-aware, but highly functional mobile eye tracking devices and novel applications, so that mobile eye tracking can develop its full potential to become an everyday technology for everyone.
- “Confidence-Calibrated Adversarial Training and Detection: More Robust Models Generalizing Beyond the Attack Used During Training,” 2019. [Online]. Available: http://arxiv.org/abs/1910.06259.
Abstract
Adversarial training is the standard to train models robust against
adversarial examples. However, especially for complex datasets, adversarial
training incurs a significant loss in accuracy and is known to generalize
poorly to stronger attacks, e.g., larger perturbations or other threat models.
In this paper, we introduce confidence-calibrated adversarial training (CCAT)
where the key idea is to enforce that the confidence on adversarial examples
decays with their distance to the attacked examples. We show that CCAT
preserves better the accuracy of normal training while robustness against
adversarial examples is achieved via confidence thresholding, i.e., detecting
adversarial examples based on their confidence. Most importantly, in strong
contrast to adversarial training, the robustness of CCAT generalizes to larger
perturbations and other threat models, not encountered during training. For
evaluation, we extend the commonly used robust test error to our detection
setting, present an adaptive attack with backtracking and allow the attacker to
select, per test example, the worst-case adversarial example from multiple
black- and white-box attacks. We present experimental results using $L_\infty$,
$L_2$, $L_1$ and $L_0$ attacks on MNIST, SVHN and Cifar10.
2018
- “Sequential Attacks on Agents for Long-Term Adversarial Goals,” in 2. ACM Computer Science in Cars Symposium (CSCS 2018), Munich, Germany, 2018.
- “Detailed Human Avatars from Monocular Video,” in 3DV 2018, International Conference on 3D Vision, Verona, Italy, 2018.
- “Single-Shot Multi-person 3D Pose Estimation from Monocular RGB,” in 3DV 2018, International Conference on 3D Vision, Verona, Italy, 2018.
- “Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation,” in 3DV 2018, International Conference on 3D Vision, Verona, Italy, 2018.
- “Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH Asia 2018), vol. 37, no. 6, 2018.
- “Quick Bootstrapping of a Personalized Gaze Model from Real-Use Interactions,” ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 4, 2018.
- “Unsupervised Learning of Shape and Pose with Differentiable Point Clouds,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, 2018.
- “Adversarial Scene Editing: Automatic Object Removal from Weak Supervision,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, 2018.
Abstract
While great progress has been made recently in automatic image manipulation,
it has been limited to object centric images like faces or structured scene
datasets. In this work, we take a step towards general scene-level image
editing by developing an automatic interaction-free object removal model. Our
model learns to find and remove objects from general scene images using
image-level labels and unpaired data in a generative adversarial network (GAN)
framework. We achieve this with two key contributions: a two-stage editor
architecture consisting of a mask generator and image in-painter that
co-operate to remove objects, and a novel GAN based prior for the mask
generator that allows us to flexibly incorporate knowledge about object shapes.
We experimentally show on two datasets that our method effectively removes a
wide variety of objects using weak supervision only.
- “VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements,” in AVI 2018, International Conference on Advanced Visual Interfaces, Grosseto, Italy, 2018.
- “JAMI: Fast Computation of Conditional Mutual Information for ceRNA Network Analysis,” Bioinformatics, vol. 34, no. 17, 2018.
- “Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild,” in CHI 2018, CHI Conference on Human Factors in Computing Systems, Montréal, Canada, 2018.
- “Which one is me? Identifying Oneself on Public Displays,” in CHI 2018, CHI Conference on Human Factors in Computing Systems, Montréal, Canada, 2018.
- “Training Person-Specific Gaze Estimators from Interactions with Multiple Devices,” in CHI 2018, CHI Conference on Human Factors in Computing Systems, Montréal, Canada, 2018.
- “GazeDirector: Fully Articulated Eye Gaze Redirection in Video,” Computer Graphics Forum (Proc. EUROGRAPHICS 2018), vol. 37, no. 2, 2018.
- “Video Object Segmentation with Language Referring Expressions,” in Computer Vision - ACCV 2018, Perth, Australia, 2019.
- “NightOwls: A Pedestrians at Night Dataset,” in Computer Vision - ACCV 2018, Perth, Australia, 2019.
- “Grounding Visual Explanations,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
- “Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
- “Textual Explanations for Self-Driving Vehicles,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
Abstract
Deep neural perception and control networks have become key components of
self-driving vehicles. User acceptance is likely to benefit from
easy-to-interpret textual explanations which allow end-users to understand what
triggered a particular behavior. Explanations may be triggered by the neural
controller, namely introspective explanations, or informed by the neural
controller’s output, namely rationalizations. We propose a new approach to
introspective explanations which consists of two parts. First, we use a visual
(spatial) attention model to train a convolutional network end-to-end from
images to the vehicle control commands, i.e., acceleration and change of
course. The controller’s attention identifies image regions that potentially
influence the network’s output. Second, we use an attention-based video-to-text
model to produce textual explanations of model actions. The attention maps of
controller and explanation model are aligned so that explanations are grounded
in the parts of the scene that mattered to the controller. We explore two
approaches to attention alignment, strong- and weak-alignment. Finally, we
explore a version of our model that generates rationalizations, and compare
with introspective explanations on the same video segments. We evaluate these
models on a novel driving dataset with ground-truth human explanations, the
Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at
github.com/JinkyuKimUCB/explainable-deep-driving
- “A Hybrid Model for Identity Obfuscation by Face Replacement,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
- “Recovering Accurate 3D Human Pose in the Wild Using IMUs and a Moving Camera,” in Computer Vision -- ECCV 2018, Munich, Germany, 2018.
- “Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions,” in Computer Vision - ECCV 2018 Workshops, Munich, Germany, 2019.
- “GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User,” in DroNet’18, 4th ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, Munich, Germany, 2018.
- “Demo of XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera,” in ECCV 2018 Demo Sessions, Munich, Germany, 2018.
- “A Vision-grounded Dataset for Predicting Typical Locations for Verbs,” in Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
- “Eye Movements During Everyday Behavior Predict Personality Traits,” Frontiers in Human Neuroscience, vol. 12, 2018.
- “Objects, Relationships, and Context in Visual Data,” in ICMR’18, International Conference on Multimedia Retrieval, Yokohama, Japan, 2018.
- “Video Based Reconstruction of 3D People Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “PoseTrack: A Benchmark for Human Pose Estimation and Tracking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Accurate and Diverse Sampling of Sequences based on a ‘Best of Many’ Sample Objective,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Long-Term On-Board Prediction of People in Traffic Scenes under Uncertainty,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Disentangled Person Image Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Multimodal Explanations: Justifying Decisions and Pointing to the Evidence,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Learning 3D Shape Completion from Laser Scan Data with Weak Supervision,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Natural and Effective Obfuscation by Head Inpainting,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Feature Generating Networks for Zero-Shot Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Fooling Vision and Language Models Despite Localization and Attention Mechanism,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Occluded Pedestrian Detection through Guided Attention in CNNs,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 2018.
- “Learning to Refine Human Pose Estimation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), Salt Lake City, UT, USA, 2018.
- “Image and Video Captioning with Augmented Neural Architectures,” IEEE MultiMedia, vol. 25, no. 2, 2018.
- “Fast-PADMA: Rapidly Adapting Facial Affect Model from Similar Individuals,” IEEE Transactions on Multimedia, vol. 20, no. 7, 2018.
- “Reflectance and Natural Illumination from Single-Material Specular Objects Using Deep Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, 2018.
- “Analysis and Optimization of Loss Functions for Multiclass, Top-k, and Multilabel Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, 2018.
- “Discriminatively Trained Latent Ordinal Model for Video Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, 2018.
- “Towards Reaching Human Performance in Pedestrian Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, 2018.
Abstract
Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods
and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech
pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and
background-versus-foreground errors.
To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can
improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets
for pedestrian detection, and discuss which factors affect their performance.
Other than our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of
training and test annotations.
- “Learning 3D Shape Completion under Weak Supervision,” International Journal of Computer Vision, vol. 128, 2018.
- “Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos,” International Journal of Computer Vision, vol. 126, no. 2–4, 2018.
- “Every Little Movement Has a Meaning of Its Own: Using Past Mouse Movements to Predict the Next Interaction,” in IUI 2018, 23rd International Conference on Intelligent User Interfaces, Tokyo, Japan, 2018.
- “Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour,” in IUI 2018, 23rd International Conference on Intelligent User Interfaces, Tokyo, Japan, 2018.
- “Explainable AI: The New 42?,” in Machine Learning and Knowledge Extraction (CD-MAKE 2018), Hamburg, Germany, 2018.
- “Tracing Cell Lineages in Videos of Lens-free Microscopy,” Medical Image Analysis, vol. 48, 2018.
- “Cross-Species Learning: A Low-Cost Approach to Learning Human Fight from Animal Fight,” in MM’18, 26th ACM Multimedia Conference, Seoul, Korea, 2018.
- “The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned,” in MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, Barcelona, Spain, 2018.
- “Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors,” in MobileHCI 2018, 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, Barcelona, Spain, 2018.
- “NRST: Non-rigid Surface Tracking from Monocular Video,” in Pattern Recognition (GCPR 2018), Stuttgart, Germany, 2019.
- “Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
- “Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimuli’s Trajectory is Partially Hidden,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
- “Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
- “Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
- “Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
- “Revisiting Data Normalization for Appearance-Based Gaze Estimation,” in Proceedings ETRA 2018, Warsaw, Poland, 2018.
- “A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation,” in Proceedings of the 27th USENIX Security Symposium, Baltimore, MD, USA, 2018.
- “Partial Optimality and Fast Lower Bounds for Weighted Correlation Clustering,” in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- “A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
- “Generating Counterfactual Explanations with Natural Language,” in Proceedings of the 2018 ICML Workshop on Human Interpretability in Machine Learning (WHI 2018), Stockholm, Sweden, 2018.
Abstract
Natural language explanations of deep neural network decisions provide an
intuitive way for an AI agent to articulate a reasoning process. Current textual
explanations learn to discuss class discriminative features in an image.
However, it is also helpful to understand which attributes might change a
classification decision if present in an image (e.g., "This is not a Scarlet
Tanager because it does not have black wings.") We call such textual
explanations counterfactual explanations, and propose an intuitive method to
generate counterfactual explanations by inspecting which evidence in an input
is missing, but might contribute to a different classification decision if
present in the image. To demonstrate our method we consider a fine-grained
image classification task in which we take as input an image and a
counterfactual class and output text which explains why the image does not
belong to a counterfactual class. We then analyze our generated counterfactual
explanations both qualitatively and quantitatively using proposed automatic
metrics.
- “Advanced Steel Microstructure Classification by Deep Learning Methods,” Scientific Reports, vol. 8, 2018.
Abstract
The inner structure of a material is called microstructure. It stores the
genesis of a material and determines all its physical and chemical properties.
While microstructural characterization is widely spread and well known, the
microstructural classification is mostly done manually by human experts, which
opens doors for huge uncertainties. Since the microstructure could be a
combination of different phases with complex substructures its automatic
classification is very challenging and just a little work in this field has
been carried out. Prior related works mostly apply features designed and
engineered by experts and classify microstructure separately from the feature
extraction step. Recently Deep Learning methods have shown surprisingly good
performance in vision applications by learning the features from data together
with the classification step. In this work, we propose a deep learning method
for microstructure classification in the examples of certain microstructural
constituents of low carbon steel. This novel method employs pixel-wise
segmentation via Fully Convolutional Neural Networks (FCNN) accompanied by
max-voting scheme. Our system achieves 93.94% classification accuracy,
drastically outperforming the state-of-the-art method of 48.89% accuracy,
indicating the effectiveness of pixel-wise approaches. Beyond the success
presented in this paper, this line of research offers a more robust and first
of all objective way for the difficult task of steel quality appreciation.
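The max-voting step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that a segmentation network has already produced one class label per pixel of an object, and that the object takes the most frequent label. The function name `max_vote`, the `ignore_label` parameter, and the class names in the example are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def max_vote(pixel_labels, ignore_label=None):
    """Return the most frequent label among an object's pixel predictions.

    pixel_labels: iterable of per-pixel class labels (hypothetical input,
        e.g. the flattened segmentation output inside one object mask).
    ignore_label: optional label to leave out of the vote (assumption).
    Returns None if no pixel is left to vote.
    """
    counts = Counter(label for label in pixel_labels if label != ignore_label)
    if not counts:
        return None
    # most_common(1) yields a single (label, count) pair with the top count.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Illustrative per-pixel predictions for one object.
    predictions = ["martensite", "bainite", "martensite", "pearlite", "martensite"]
    print(max_vote(predictions))
```

Voting over pixels makes the object-level decision tolerant of isolated pixel errors, which is one plausible reading of why the abstract pairs pixel-wise segmentation with a voting scheme.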
- “Towards Reverse-Engineering Black-Box Neural Networks,” in Sixth International Conference on Learning Representations (ICLR 2018), Vancouver, Canada, 2018.
- “Long-Term Image Boundary Prediction,” in Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2018.
- “Higher-order Projected Power Iterations for Scalable Multi-Matching,” 2018. [Online]. Available: http://arxiv.org/abs/1811.10541.
Abstract
The matching of multiple objects (e.g. shapes or images) is a fundamental
problem in vision and graphics. In order to robustly handle ambiguities, noise
and repetitive patterns in challenging real-world settings, it is essential to
take geometric consistency between points into account. Computationally, the
multi-matching problem is difficult. It can be phrased as simultaneously
solving multiple (NP-hard) quadratic assignment problems (QAPs) that are
coupled via cycle-consistency constraints. The main limitations of existing
multi-matching methods are that they either ignore geometric consistency and
thus have limited robustness, or they are restricted to small-scale problems
due to their (relatively) high computational cost. We address these
shortcomings by introducing a Higher-order Projected Power Iteration method,
which is (i) efficient and scales to tens of thousands of points, (ii)
straightforward to implement, (iii) able to incorporate geometric consistency,
and (iv) guarantees cycle-consistent multi-matchings. Experimentally we show
that our approach is superior to existing methods.
- “Bayesian Prediction of Future Street Scenes through Importance Sampling based Optimization,” 2018. [Online]. Available: http://arxiv.org/abs/1806.06939.
Abstract
For autonomous agents to successfully operate in the real world, anticipation
of future events and states of their environment is a key competence. This
problem can be formalized as a sequence prediction problem, where a number of
observations are used to predict the sequence into the future. However,
real-world scenarios demand a model of uncertainty of such predictions, as
future states become increasingly uncertain and multi-modal -- in particular on
long time horizons. This makes modelling and learning challenging. We cast
state of the art semantic segmentation and future prediction models based on
deep learning into a Bayesian formulation that in turn allows for a full
Bayesian treatment of the prediction problem. We present a new sampling scheme
for this model that draws from the success of variational autoencoders by
incorporating a recognition network. In the experiments we show that our model
outperforms prior work in accuracy of the predicted segmentation and provides
calibrated probabilities that also better capture the multi-modal aspects of
possible future states of street scenes.
- Eds., Proceedings PETMEI 2018. ACM, 2018.
- “Primal-Dual Wasserstein GAN,” 2018. [Online]. Available: http://arxiv.org/abs/1805.09575.
Abstract
We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for
building latent variable models of the data distribution based on the primal
and the dual formulations of the optimal transport (OT) problem. We utilize the
primal formulation to learn a flexible inference mechanism and to create an
optimal approximate coupling between the data distribution and the generative
model. In order to learn the generative model, we use the dual formulation and
train the decoder adversarially through a critic network that is regularized by
the approximate coupling obtained from the primal. Unlike previous methods that
violate various properties of the optimal critic, we regularize the norm and
the direction of the gradients of the critic function. Our model shares many of
the desirable properties of auto-encoding models in terms of mode coverage and
latent structure, while avoiding their undesirable averaging properties, e.g.
their inability to capture sharp visual features when modeling real images. We
compare our algorithm with several other generative modeling techniques that
utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception
Scores (IS).
- “MLCapsule: Guarded Offline Deployment of Machine Learning as a Service,” 2018. [Online]. Available: http://arxiv.org/abs/1808.00590.
Abstract
With the widespread use of machine learning (ML) techniques, ML as a service
has become increasingly popular. In this setting, an ML model resides on a
server and users can query the model with their data via an API. However, if
the user's input is sensitive, sending it to the server is not an option.
Equally, the service provider does not want to share the model by sending it to
the client for protecting its intellectual property and pay-per-query business
model. In this paper, we propose MLCapsule, a guarded offline deployment of
machine learning as a service. MLCapsule executes the machine learning model
locally on the user's client and therefore the data never leaves the client.
Meanwhile, MLCapsule offers the service provider the same level of control and
security of its model as the commonly used server-side execution. In addition,
MLCapsule is applicable to offline applications that require local execution.
Beyond protecting against direct model access, we demonstrate that MLCapsule
allows for implementing defenses against advanced attacks on machine learning
models such as model stealing/reverse engineering and membership inference.
- “Manipulating Attributes of Natural Scenes via Hallucination,” 2018. [Online]. Available: http://arxiv.org/abs/1808.07413.
Abstract
In this study, we explore building a two-stage framework for enabling users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network which can hallucinate images of a scene
as if they were taken at a different season (e.g. during winter), weather
condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is then
transferred to the input image while preserving the semantic details intact,
giving a photo-realistic manipulation result. As the proposed framework
hallucinates what the scene will look like, it does not require any reference
style image as commonly utilized in most of the appearance or style transfer
approaches. Moreover, it allows to simultaneously manipulate a given scene
according to a diverse set of transient attributes within a single model,
eliminating the need of training multiple networks per each translation task.
Our comprehensive set of qualitative and quantitative results demonstrate the
effectiveness of our approach against the competing methods.
- “Learning a Disentangled Embedding for Monocular 3D Shape Retrieval and Pose Estimation,” 2018. [Online]. Available: http://arxiv.org/abs/1812.09899.
Abstract
We propose a novel approach to jointly perform 3D object retrieval and pose
estimation from monocular images. In order to make the method robust to real
world scene variations in the images, e.g. texture, lighting and background, we
learn an embedding space from 3D data that only includes the relevant
information, namely the shape and pose. Our method can then be trained for
robustness under real world scene variations without having to render a large
training set simulating these variations. Our learned embedding explicitly
disentangles a shape vector and a pose vector, which alleviates both pose bias
for 3D shape retrieval and categorical bias for pose estimation. Having the
learned disentangled embedding, we train a CNN to map the images to the
embedding space, and then retrieve the closest 3D shape from the database and
estimate the 6D pose of the object using the embedding vectors. Our method
achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for
category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore
outperforms the previous state-of-the-art methods on both tasks.
- “From Perception over Anticipation to Manipulation,” Universität des Saarlandes, Saarbrücken, 2018.
Abstract
From autonomous driving cars to surgical robots, robotic systems have enjoyed significant growth over the past decade. With the rapid development in robotics alongside the evolution in the related fields, such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic systems. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real-world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate the work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realizes accuracy-specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools, motivated by the absence of dexterous hands on widely available general purpose robots. Afterwards, we embed the tool use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First we explore how to guide a robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on synthetic data and then on real-world block stacking.
Further, we introduce the target stacking task where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing the need to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.
- “Deep Appearance Maps,” 2018. [Online]. Available: http://arxiv.org/abs/1804.00863.
Abstract
We propose a deep representation of appearance, i. e. the relation of color,
surface orientation, viewer position, material and illumination. Previous
approaches have used deep learning to extract classic appearance
representations relating to reflectance model parameters (e. g. Phong) or
illumination (e. g. HDR environment maps). We suggest to directly represent
appearance itself as a network we call a deep appearance map (DAM). This is a
4D generalization over 2D reflectance maps, which held the view direction
fixed. First, we show how a DAM can be learned from images or video frames and
later be used to synthesize appearance, given new surface orientations and
viewer positions. Second, we demonstrate how another network can be used to map
from an image or video frames to a DAM network to reproduce this appearance,
without using a lengthy optimization such as stochastic gradient descent
(learning-to-learn). Finally, we generalize this to an appearance
estimation-and-segmentation task, where we map from an image showing multiple
materials to multiple networks reproducing their appearance, as well as
per-pixel segmentation.
- “Image Manipulation against Learned Models: Privacy and Security Implications,” Universität des Saarlandes, Saarbrücken, 2018.
Abstract
Machine learning is transforming the world. Its application areas span privacy
sensitive and security critical tasks such as human identification and self-driving
cars. These applications raise privacy and security related questions that are not
fully understood or answered yet: Can automatic person recognisers identify people
in photos even when their faces are blurred? How easy is it to find an adversarial
input for a self-driving car that makes it drive off the road?
This thesis contributes one of the first steps towards a better understanding of
such concerns. We observe that many privacy and security critical scenarios for
learned models involve input data manipulation: users obfuscate their identity by
blurring their faces and adversaries inject imperceptible perturbations to the input
signal. We introduce a data manipulator framework as a tool for collectively describing
and analysing privacy and security relevant scenarios involving learned models.
A data manipulator introduces a shift in data distribution for achieving privacy or
security related goals, and feeds the transformed input to the target model. This
framework provides a common perspective on the studies presented in the thesis.
We begin the studies from the user’s privacy point of view. We analyse the
efficacy of common obfuscation methods like face blurring, and show that they
are surprisingly ineffective against state of the art person recognition systems. We
then propose alternatives based on head inpainting and adversarial examples. By
studying the user privacy, we also study the dual problem: model security. In model
security perspective, a model ought to be robust and reliable against small amounts
of data manipulation. In both cases, data are manipulated with the goal of changing
the target model prediction. User privacy and model security problems can be
described with the same objective.
We then study the knowledge aspect of the data manipulation problem. The more
one knows about the target model, the more effective manipulations one can craft.
We propose a game theoretic manipulation framework to systematically represent
the knowledge level on the target model and derive privacy and security guarantees.
We then discuss ways to increase knowledge about a black-box model by only querying
it, deriving implications that are relevant to both privacy and security perspectives.
- “Understanding and Controlling User Linkability in Decentralized Learning,” 2018. [Online]. Available: http://arxiv.org/abs/1805.05838.
Abstract
Machine Learning techniques are widely used by online services (e.g. Google,
Apple) in order to analyze and make predictions on user data. As many of the
provided services are user-centric (e.g. personal photo collections, speech
recognition, personal assistance), user data generated on personal devices is
key to provide the service. In order to protect the data and the privacy of the
user, federated learning techniques have been proposed where the data never
leaves the user's device and "only" model updates are communicated back to the
server. In our work, we propose a new threat model that is not concerned with
learning about the content - but rather is concerned with the linkability of
users during such decentralized learning scenarios.
We show that model updates are characteristic for users and therefore lend
themselves to linkability attacks. We show identification and matching of users
across devices in closed and open world scenarios. In our experiments, we find
our attacks to be highly effective, achieving 20x-175x chance-level
performance.
In order to mitigate the risks of linkability attacks, we study various
strategies. As adding random noise does not offer convincing operation points,
we propose strategies based on using calibrated domain-specific data; we find
these strategies offer substantial protection against linkability threats with
little effect on utility.
- “End-to-end Learning for Graph Decomposition,” 2018. [Online]. Available: http://arxiv.org/abs/1812.09737.
Abstract
We propose a novel end-to-end trainable framework for the graph decomposition
problem. The minimum cost multicut problem is first converted to an
unconstrained binary cubic formulation where cycle consistency constraints are
incorporated into the objective function. The new optimization problem can be
viewed as a Conditional Random Field (CRF) in which the random variables are
associated with the binary edge labels of the initial graph and the hard
constraints are introduced in the CRF as high-order potentials. The parameters
of a standard Neural Network and the fully differentiable CRF are optimized in
an end-to-end manner. Furthermore, our method utilizes the cycle constraints as
meta-supervisory signals during the learning of the deep feature
representations by taking the dependencies between the output random variables
into account. We present analyses of the end-to-end learned representations,
showing the impact of the joint training, on the task of clustering images of
MNIST. We also validate the effectiveness of our approach both for the feature
learning and the final clustering on the challenging task of real-world
multi-person pose estimation.
- “PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis,” 2018. [Online]. Available: http://arxiv.org/abs/1801.04457.
Abstract
As first-person cameras in head-mounted displays become increasingly
prevalent, so does the problem of infringing user and bystander privacy. To
address this challenge, we present PrivacEye, a proof-of-concept system that
detects privacy-sensitive everyday situations and automatically enables and
disables the first-person camera using a mechanical shutter. To close the
shutter, PrivacEye detects sensitive situations from first-person camera videos
using an end-to-end deep-learning model. To open the shutter without visual
input, PrivacEye uses a separate, smaller eye camera to detect changes in
users' eye movements to gauge changes in the "privacy level" of the current
situation. We evaluate PrivacEye on a dataset of first-person videos recorded
in the daily life of 17 participants that they annotated with privacy
sensitivity levels. We discuss the strengths and weaknesses of our
proof-of-concept system based on a quantitative technical evaluation as well as
qualitative insights from semi-structured interviews.
- “Gaze Estimation and Interaction in Real-World Environments,” Universität des Saarlandes, Saarbrücken, 2018.
Abstract
Following a period of expedited progress in the capabilities of digital systems, society is beginning to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weakly supervised learning-to-rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics.
We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.
2017
- “They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers,” in 16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), Stuttgart, Germany, 2017.
- “EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging,” in 16th International Conference on Mobile and Ubiquitous Multimedia (MUM 2017), Stuttgart, Germany, 2017.
- “Long-Term On-Board Prediction of Pedestrians in Traffic Scenes,” in 1st Conference on Robot Learning (CoRL 2017), Mountain View, CA, USA, 2017.
- “Gradient-free Policy Architecture Search and Adaptation,” in 1st Conference on Robot Learning (CoRL 2017), Mountain View, CA, USA, 2017.
- “STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Learning Non-maximum Suppression,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “ArtTrack: Articulated Multi-Person Tracking in the Wild,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Gaze Embeddings for Zero-Shot Image Classification,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Learning Video Object Segmentation from Static Images,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Simple Does It: Weakly Supervised Instance and Semantic Segmentation,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “InstanceCut: from Edges to Instances with MultiCut,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Exploiting Saliency for Object Segmentation from Image Level Labels,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Generating Descriptions with Grounded and Co-Referenced People,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “A Domain Based Approach to Social Relation Recognition,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “A Message Passing Algorithm for the Minimum Cost Multicut Problem,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Multiple People Tracking by Lifted Multicut and Person Re-identification,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “Zero-shot learning - The Good, the Bad and the Ugly,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
- “CityPersons: A Diverse Dataset for Pedestrian Detection,” in 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 2017.
Abstract
Convnets have enabled significant progress in pedestrian detection recently,
but there are still open questions regarding suitable architectures and
training data. We revisit CNN design and point out key adaptations, enabling
plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset.
To achieve further improvement from more and better data, we introduce
CityPersons, a new set of person annotations on top of the Cityscapes dataset.
The diversity of CityPersons allows us for the first time to train one single
CNN model that generalizes well over multiple benchmarks. Moreover, with
additional training with CityPersons, we obtain top results using FasterRCNN on
Caltech, improving especially for more difficult cases (heavy occlusion and
small scale) and providing higher localization quality.
- “It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation,” in 30th IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), Honolulu, HI, USA, 2017.
- “Visual Stability Prediction and Its Application to Manipulation,” in AAAI 2017 Spring Symposia 05, Interactive Multisensory Object Perception for Embodied Agents, Palo Alto, CA, 2017.
- “Pose Guided Person Image Generation,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 2017.
- “ScreenGlint: Practical, In-situ Gaze Estimation on Smartphones,” in CHI’17, 35th Annual ACM Conference on Human Factors in Computing Systems, Denver, CO, USA, 2017.
- “Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications,” in CHI 2017 Extended Abstracts, Denver, CO, USA, 2017.
- “Lucid Data Dreaming for Object Tracking,” in DAVIS Challenge on Video Object Segmentation 2017, Honolulu, HI, USA, 2017.
- “GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication,” in ICMI’17, 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 2017.
- “What Is Around The Camera?,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
- “Adversarial Image Perturbation for Privacy Protection -- A Game Theory Perspective,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
- “Towards a Visual Privacy Advisor: Understanding and Predicting Privacy Risks in Images,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
- “Efficient Algorithms for Moral Lineage Tracing,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
- “Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
- “Paying Attention to Descriptions Generated by Image Captioning Models,” in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.
- “Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling,” in 2017 IEEE International Conference on Computer Vision Workshops (MBCC @ICCV 2017), Venice, Italy, 2017.
Abstract
Previous work focused on predicting visual search targets from human
fixations but, in the real world, a specific target is often not known, e.g.
when searching for a present for a friend. In this work we instead study the
problem of predicting the mental picture, i.e. only an abstract idea instead of
a specific target. This task is significantly more challenging given that
mental pictures of the same target category can vary widely depending on
personal biases, and given that characteristic target attributes can often not
be verbalised explicitly. We instead propose to use gaze information as
implicit information on users' mental picture and present a novel gaze pooling
layer to seamlessly integrate semantic and localized fixation information into
a deep image representation. We show that we can robustly predict both the
mental picture's category as well as attributes on a novel dataset containing
fixation data of 14 users searching for targets on a subset of the DeepFashion
dataset. Our results have important implications for future search interfaces
and suggest deep gaze pooling as a general-purpose approach for gaze-supported
computer vision systems.
- “Visual Stability Prediction for Robotic Manipulation,” in IEEE International Conference on Robotics and Automation (ICRA 2017), Singapore, 2017.
- “MARCOnI -- ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, 2017.
- “Novel Views of Objects from a Single Image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, 2017.
- “Expanded Parts Model for Semantic Description of Humans in Still Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, 2017.
- “A Compact Representation of Human Actions by Sliding Coordinate Coding,” International Journal of Advanced Robotic Systems, vol. 14, no. 6, 2017.
- “Ask Your Neurons: A Deep Learning Approach to Visual Question Answering,” International Journal of Computer Vision, vol. 125, no. 1–3, 2017.
- “Movie Description,” International Journal of Computer Vision, vol. 123, no. 1, 2017.
Abstract
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
- “Cell Lineage Tracing in Lens-Free Microscopy Videos,” in Medical Image Computing and Computer Assisted Intervention -- MICCAI 2017, Quebec City, Canada, 2017.
- “Building Statistical Shape Spaces for 3D Human Modeling,” Pattern Recognition, vol. 67, 2017.
- “Online Growing Neural Gas for Anomaly Detection in Changing Surveillance Scenes,” Pattern Recognition, vol. 64, 2017.
- “Learning Dilation Factors for Semantic Segmentation of Street Scenes,” in Pattern Recognition (GCPR 2017), Basel, Switzerland, 2017.
- “A Comparative Study of Local Search Algorithms for Correlation Clustering,” in Pattern Recognition (GCPR 2017), Basel, Switzerland, 2017.
- “Look Together: Using Gaze for Assisting Co-located Collaborative Search,” Personal and Ubiquitous Computing, vol. 21, no. 1, 2017.
- “GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices,” in Pervasive Displays 2017 (PerDis 2017), Lugano, Switzerland, 2017.
- “Analysis and Optimization of Graph Decompositions by Lifted Multicuts,” in Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 2017.
- “EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, 2017.
- “InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, 2017.
- “Efficiently Summarising Event Sequences with Rich Interleaving Patterns,” in Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM 2017), Houston, TX, USA, 2017.
- “Are you stressed? Your eyes and the mouse can tell,” in Seventh International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, 2017.
- “EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays,” in UIST’17, 30th Annual Symposium on User Interface Software and Technology, Quebec City, Canada, 2017.
- “Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery,” in UIST’17, 30th Annual Symposium on User Interface Software and Technology, Quebec City, Canada, 2017.
- “Analysis and Improvement of the Visual Object Detection Pipeline,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects. In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. 
To address these problems, we present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
- “Learning to Segment in Images and Videos with Different Forms of Supervision,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Much progress has been made in image and video segmentation
over the last years. To a large extent, the success can be attributed to
the strong appearance models completely learned from data, in particular
using deep learning methods. However, to perform best these methods require
large representative datasets for training with expensive pixel-level
annotations, which in case of videos are prohibitive to obtain. Therefore,
there is a need to relax this constraint and to consider alternative forms
of supervision, which are easier and cheaper to collect. In this thesis,
we aim to develop algorithms for learning to segment in images and videos
with different levels of supervision.
First, we develop approaches for training convolutional networks with weaker
forms of supervision, such as bounding boxes or image labels, for object
boundary estimation and semantic/instance labelling tasks. We propose to
generate pixel-level approximate groundtruth from these weaker forms of
annotations to train a network, which allows us to achieve high-quality
results comparable to the full supervision quality without any
modifications of the network architecture or the training procedure.
Second, we address the problem of the excessive computational and memory
costs inherent to solving video segmentation via graphs. We propose
approaches to improve the runtime and memory efficiency as well as the
output segmentation quality by learning from the available training data
the best representation of the graph. In particular, we contribute with
learning must-link constraints, the topology and edge weights of the graph
as well as enhancing the graph nodes - superpixels - themselves.
Third, we tackle the task of pixel-level object tracking and address the
problem of the limited amount of densely annotated video data for training
convolutional networks. We introduce an architecture which allows training
with static images only and propose an elaborate data synthesis scheme
which creates a large number of training examples close to the target
domain from the given first frame mask. With the proposed techniques we
show that densely annotated consequent video data is not necessary to
achieve high-quality temporally coherent video segmentation results.
In summary, this thesis advances the state of the art in weakly supervised
image segmentation, graph-based video segmentation and pixel-level object
tracking and contributes with the new ways of training convolutional
networks with a limited amount of pixel-level annotated training data.
- “Lucid Data Dreaming for Multiple Object Tracking,” 2017. [Online]. Available: http://arxiv.org/abs/1703.09554.
Abstract
Convolutional networks reach top quality in pixel-level object tracking but
require a large amount of training data (1k ~ 10k) to deliver such results. We
propose a new training strategy which achieves state-of-the-art results across
three evaluation datasets while using 20x ~ 100x less annotated data than
competing methods. Instead of using large training sets hoping to generalize
across domains, we generate in-domain training data using the provided
annotation on the first frame of each video to synthesize ("lucid dream")
plausible future video frames. In-domain per-video training data allows us to
train high quality appearance- and motion-based models, as well as tune the
post-processing stage. This approach allows us to reach competitive results even
when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the tracking task a smaller training set
that is closer to the target domain is more effective. This changes the mindset
regarding how many training samples and general "objectness" knowledge are
required for the object tracking task.
- “Image Classification with Limited Training Data and Class Ambiguity,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm. In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity where clear distinction between the classes is no longer possible. Many real world images are naturally multilabel yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in top k predictions of a learner. Our results indicate consistent improvements over the standard loss functions that put more penalty on the first incorrect prediction compared to the proposed losses. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
- “Acquiring Target Stacking Skills by Goal-Parameterized Deep Reinforcement Learning,” 2017. [Online]. Available: http://arxiv.org/abs/1711.00267.
Abstract
Understanding physical phenomena is a key component of human intelligence and
enables physical interaction with previously unseen environments. In this
paper, we study how an artificial agent can autonomously acquire this intuition
through interaction with the environment. We created a synthetic block stacking
environment with physics simulation in which the agent can learn a policy
end-to-end through trial and error. Thereby, we bypass the need to explicitly model
physical knowledge within the policy. We are specifically interested in tasks
that require the agent to reach a given goal state that may be different for
every new trial. To this end, we propose a deep reinforcement learning
framework that learns policies which are parametrized by a goal. We validated
the model on a toy example navigating in a grid world with different target
positions and in a block stacking task with different target structures of the
final tower. In contrast to prior work, our policies show better generalization
across different goals.
- “Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Image,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Computer Vision has undergone major changes over the last five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where the scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases. It has also become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in word meaning, and various interpretations of the scene and the question.
- “Person Recognition in Social Media Photos,” 2017. [Online]. Available: http://arxiv.org/abs/1710.03224.
Abstract
People nowadays share large parts of their personal lives through social
media. Being able to automatically recognise people in personal photos may
greatly enhance user convenience by easing photo album organisation. For the human
identification task, however, the traditional focus of computer vision has been
face recognition and pedestrian re-identification. Person recognition in social
media photos sets new challenges for computer vision, including non-cooperative
subjects (e.g. backward viewpoints, unusual poses) and great changes in
appearance. To tackle this problem, we build a simple person recognition
framework that leverages convnet features from multiple image regions (head,
body, etc.). We propose new recognition scenarios that focus on the time and
appearance gap between training and testing samples. We present an in-depth
analysis of the importance of different features according to time and
viewpoint generalisability. In the process, we verify that our simple approach
achieves the state of the art result on the PIPA benchmark, arguably the
largest social media based benchmark for person recognition to date with
diverse poses, viewpoints, social groups, and events.
Compared to the conference version of the paper, this paper additionally
presents (1) analysis of a face recogniser (DeepID2+), (2) new method naeil2
that combines the conference version method naeil and DeepID2+ to achieve state
of the art results even compared to post-conference works, (3) discussion of
related work since the conference version, (4) additional analysis including
the head viewpoint-wise breakdown of performance, and (5) results on the
open-world setup.
- “Whitening Black-Box Neural Networks,” 2017. [Online]. Available: http://arxiv.org/abs/1711.01768.
Abstract
Many deployed learned models are black boxes: given an input, they return an output.
Internal information about the model, such as the architecture, optimisation
procedure, or training data, is not disclosed explicitly as it might contain
proprietary information or make the system more vulnerable. This work shows
that such attributes of neural networks can be exposed from a sequence of
queries. This has multiple implications. On the one hand, our work exposes the
vulnerability of black-box neural networks to different types of attacks -- we
show that the revealed internal information helps generate more effective
adversarial examples against the black box model. On the other hand, this
technique can be used for better protection of private content from automatic
recognition models using adversarial examples. Our paper suggests that it is
actually hard to draw a line between white box and black box models.
- “Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract),” 2017. [Online]. Available: http://arxiv.org/abs/1711.07373.
Abstract
Deep models are the de facto standard in visual decision problems due to their
impressive performance on a wide array of visual tasks. On the other hand,
their opaqueness has led to a surge of interest in explainable systems. In this
work, we emphasize the importance of model explanation in various forms such as
visual pointing and textual justification. The lack of data with justification
annotations is one of the bottlenecks of generating multimodal explanations.
Thus, we propose two large-scale datasets with annotations that visually and
textually justify a classification decision for various activities, i.e. ACT-X,
and for question answering, i.e. VQA-X. We also introduce a multimodal
methodology for generating visual and textual explanations simultaneously. We
quantitatively show that training with the textual explanations not only yields
better textual justification models, but also models that better localize the
evidence that supports their decision.
- “Generation and Grounding of Natural Language Descriptions for Visual Data,” Universität des Saarlandes, Saarbrücken, 2017.
Abstract
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition app