D2
Computer Vision and Machine Learning
2021
Real-time Deep Dynamic Characters
M. Habermann, L. Liu, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2021), Volume 40, Number 4, 2021
Combinatorial Optimization for Panoptic Segmentation: A Fully Differentiable Approach
A. Abbas and P. Swoboda
Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), 2021
RMM: Reinforced Memory Management for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), 2021
Shape your Space: A Gaussian Mixture Regularization Approach to Deterministic Autoencoders
A. Saseendran, K. Skubch, S. Falkner and M. Keuper
Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), 2021
Monocular 3D Multi-Person Pose Estimation via Predicting Factorized Correction Factors
Y. Guo, L. Ma, Z. Li, X. Wang and F. Wang
Computer Vision and Image Understanding, Volume 213, 2021
Learning to Teach and Learn for Semi-supervised Few-shot Image Classification
X. Li, J. Huang, Y. Liu, Q. Zhou, S. Zheng, B. Schiele and Q. Sun
Computer Vision and Image Understanding, Volume 212, 2021
Learning Decision Trees Recurrently Through Communication
S. Alaniz, D. Marcos, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
A. Bhattacharyya, D. O. Reino, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Convolutional Dynamic Alignment Networks for Interpretable Classifications
M. D. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
Y. Chen, Y. Xian, A. S. Koepke and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Stereo Radiance Fields (SRF): Learning View Synthesis from Sparse Views of Novel Scenes
J. Chibane, A. Bansal, V. Lazova and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Learning Spatially-Variant MAP Models for Non-blind Image Deblurring
J. Dong, S. Roth and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors
V. Guzov, A. Mir, T. Sattler and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Adaptive Aggregation Networks for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Open World Compositional Zero-Shot Learning
M. Mancini, M. F. Naeem, Y. Xian and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Learning Graph Embeddings for Compositional Zero-shot Learning
M. F. Naeem, Y. Xian, F. Tombari and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
SMPLicit: Topology-aware Generative Model for Clothed People
E. Corona, A. Pumarola, G. Alenyà, G. Pons-Moll and F. Moreno-Noguer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
D-NeRF: Neural Radiance Fields for Dynamic Scenes
A. Pumarola, E. Corona, G. Pons-Moll and F. Moreno-Noguer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Hijack-GAN: Unintended-Use of Pretrained, Black-Box GANs
H.-P. Wang, N. Yu and M. Fritz
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets
R. Gong, D. Dai, Y. Chen, W. Li and L. Van Gool
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather
M. Hahner, C. Sakaridis, D. Dai and L. Van Gool
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths
A. Horňáková, T. Kaiser, P. Swoboda, M. Rolinek, B. Rosenhahn and R. Henschel
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
M. Kayser, O.-M. Camburu, L. Salewski, C. Emde, V. Do, Z. Akata and T. Lukasiewicz
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Keep CALM and Improve Visual Feature Attribution
J. M. Kim, J. Choe, Z. Akata and S. J. Oh
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting
A. Kukleva, H. Kuehne and B. Schiele
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection
F. Rezaeianaran, R. Shetty, R. Aljundi, D. O. Reino, S. Zhang and B. Schiele
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding
C. Sakaridis, D. Dai and L. Van Gool
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Relating Adversarially Robust Generalization to Flat Minima
D. Stutz, M. Hein and B. Schiele
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Task Switching Network for Multi-task Learning
G. Sun, T. Probst, D. P. Paudel, N. Popovic, M. Kanakis, J. Patel, D. Dai and L. Van Gool
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing
G. Tiwari, N. Sarafianos, T. Tung and G. Pons-Moll
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
Q. Wang, D. Dai, L. Hoyer, L. Van Gool and O. Fink
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data
N. Yu, V. Skripniuk, S. Abdelnabi and M. Fritz
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Dual Contrastive Loss and Attention for GANs
N. Yu, G. Liu, A. Dundar, A. Tao, B. Catanzaro, L. Davis and M. Fritz
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
End-to-End Urban Driving by Imitating a Reinforcement Learning Coach
Z. Zhang, A. Liniger, D. Dai, F. Yu and L. Van Gool
IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021
Deep Outlier Handling for Image Deblurring
J. Dong and J. Pan
IEEE Transactions on Image Processing, Volume 30, 2021
Generating Face Images With Attributes for Free
Y. Liu, Q. Sun, X. He, A.-A. Liu, Y. Su and T.-S. Chua
IEEE Transactions on Neural Networks and Learning Systems, Volume 32, Number 6, 2021
A Deeper Look into DeepCap
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
Abstract
Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time-coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground-truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation step and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of DeepCap, providing more detailed explanations, comparisons and results, as well as applications.
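A minimal PyTorch sketch of the two-network disentanglement described in the abstract; the encoder, feature dimensions, joint and graph-node counts, and the commented multi-view supervision are illustrative assumptions, not the actual DeepCap architecture:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """First stage: regress skeletal pose (axis-angle per joint) from an image feature."""
    def __init__(self, feat_dim=512, num_joints=23):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_joints * 3)

    def forward(self, feat):
        return self.head(feat)  # (B, num_joints * 3)

class DefNet(nn.Module):
    """Second stage: regress non-rigid 3D offsets for the nodes of an embedded graph."""
    def __init__(self, feat_dim=512, num_nodes=500):
        super().__init__()
        self.num_nodes = num_nodes
        self.head = nn.Linear(feat_dim, num_nodes * 3)

    def forward(self, feat):
        return self.head(feat).view(-1, self.num_nodes, 3)

# feat = encoder(image)        # any image encoder (assumed)
# pose = PoseNet()(feat)       # drives the skeleton of the template character
# offsets = DefNet()(feat)     # deforms the template surface non-rigidly
# Weak supervision: the posed, deformed template is rendered into each training
# camera and compared to 2D keypoints/silhouettes, so no 3D labels are required.
```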
Future Moment Assessment for Action Query
Q. Ke, M. Fritz and B. Schiele
IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021
Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences
R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox and H. Kuehne
IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021
EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data
T. Nguyen, D. H. M. Nguyen, H. Nguyen, B. T. Nguyen and B. A. Wade
Information Sciences, Volume 567, 2021
You Only Need Adversarial Supervision for Semantic Image Synthesis
E. Schönfeld, V. Sushko, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Conference on Learning Representations (ICLR 2021), 2021
Norm-Aware Embedding for Efficient Person Search and Tracking
D. Chen, S. Zhang, J. Yang and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
Guest Editorial: Special Issue on “Computer Vision for All Seasons: Adverse Weather and Lighting Conditions”
D. Dai, R. T. Tan, V. Patel, J. Matas, B. Schiele and L. Van Gool
International Journal of Computer Vision, Volume 129, 2021
DLOW: Domain Flow and Applications
R. Gong, W. Li, Y. Chen, D. Dai and L. Van Gool
International Journal of Computer Vision, Volume 129, 2021
Semantic Bottlenecks: Quantifying and Improving Inspectability of Deep Representations
M. Losch, M. Fritz and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
Guided Attention in CNNs for Occluded Pedestrian Detection and Re-identification
S. Zhang, D. Chen, J. Yang and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
SampleFix: Learning to Correct Programs by Sampling Diverse Fixes
H. Hajipour, A. Bhattacharyya, C.-A. Staicu and M. Fritz
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), 2021
DARTS for Inverse Problems: a Study on Stability
J. Geiping, J. Lukasik, M. Keuper and M. Moeller
NeurIPS 2021 Workshop on Deep Learning and Inverse Problems (NeurIPS 2021 Deep Inverse Workshop), 2021
(Accepted/in press)
Internalized Biases in Fréchet Inception Distance
S. Jung and M. Keuper
NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2021 Workshop DistShift), 2021
(Accepted/in press)
Efficient Message Passing for 0–1 ILPs with Binary Decision Diagrams
J.-H. Lange and P. Swoboda
Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021
Bit Error Robustness for Energy-Efficient DNN Accelerators
D. Stutz, N. Chandramoorthy, M. Hein and B. Schiele
Proceedings of the 4th MLSys Conference, 2021
Abstract
Deep neural network (DNN) accelerators received considerable attention in recent years due to the potential to save energy compared to mainstream hardware. Low-voltage operation of DNN accelerators reduces energy consumption further, but causes bit-level failures in the memory storing the quantized DNN weights. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, and random bit error training (RandBET) significantly improves robustness against random bit errors in (quantized) DNN weights. This leads to high energy savings from both low-voltage operation and low-precision quantization. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors from profiled SRAM arrays. We also discuss why weight clipping alone is already a quite effective way to achieve robustness against bit errors. Moreover, we specifically discuss the involved trade-offs regarding accuracy, robustness and precision: without losing more than 1% in accuracy compared to a normally trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher energy savings of, e.g., 30% are possible at the cost of 2.5% accuracy, even for 4-bit DNNs.
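To make the failure model concrete, here is a minimal NumPy sketch of the kind of perturbation RandBET trains against: weights are clipped, quantized to fixed point, and each stored bit is flipped independently with some probability. The quantization scheme and error rate are illustrative assumptions, not the exact setup of the paper.

```python
import numpy as np

def quantize(w, bits=8, w_max=0.5):
    """Symmetric fixed-point quantization of clipped weights to signed integers."""
    scale = (2 ** (bits - 1) - 1) / w_max
    q = np.round(np.clip(w, -w_max, w_max) * scale)
    return q.astype(np.int64), scale

def inject_bit_errors(q, bits=8, p=0.01, seed=None):
    """Flip each of the low `bits` stored bits independently with probability p."""
    rng = np.random.default_rng(seed)
    u = q.astype(np.uint64) & np.uint64((1 << bits) - 1)  # two's-complement view
    flips = rng.random(q.shape + (bits,)) < p
    mask = (flips * (1 << np.arange(bits))).sum(axis=-1).astype(np.uint64)
    u = u ^ mask
    sign = (u >> np.uint64(bits - 1)) & np.uint64(1)      # sign-extend back
    return u.astype(np.int64) - (sign.astype(np.int64) << bits)

w = np.random.default_rng(0).normal(scale=0.1, size=(4, 4))
q, scale = quantize(w)                  # weight clipping shrinks the value range
w_err = inject_bit_errors(q, p=0.01, seed=0) / scale  # weights seen during training
```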
A Closer Look at Self-training for Zero-Label Semantic Segmentation
G. Pastore, F. Cermelli, Y. Xian, M. Mancini, Z. Akata and B. Caputo
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), 2021
InfoScrub: Towards Attribute Privacy by Targeted Obfuscation
H.-P. Wang, T. Orekondy and M. Fritz
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), 2021
Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis
Y. He, N. Yu, M. Keuper and M. Fritz
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI 2021), 2021
Spectral Distribution Aware Image Generation
S. Jung and M. Keuper
Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
RAMA: A Rapid Multicut Algorithm on GPU
A. Abbas and P. Swoboda
Technical Report, 2021a
(arXiv: 2109.01838)
Abstract
We propose a highly parallel primal-dual algorithm for the multicut (a.k.a. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision. Our algorithm consists of three steps executed recursively: (1) finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) performing message passing between the edges and cycles to optimize the Lagrange relaxation coming from the found violated cycles, producing reduced costs, and (3) contracting edges with high reduced costs through matrix-matrix multiplications. Our algorithm produces primal solutions and dual lower bounds that estimate the distance to optimum. We implement our algorithm on GPUs and show resulting improvements in execution speed of one to two orders of magnitude, without sacrificing solution quality, compared to traditional serial algorithms that run on CPUs. We can solve very large-scale benchmark problems with up to $\mathcal{O}(10^8)$ variables in a few seconds with small primal-dual gaps. We make our code available at https://github.com/pawelswoboda/RAMA.
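As a toy illustration of step (3), the following sketch contracts strongly attractive edges serially with a union-find structure; the actual solver performs contractions in parallel on GPU via matrix-matrix multiplications and interleaves them with cycle separation and message passing, none of which is reproduced here.

```python
def greedy_contraction(num_nodes, edges):
    """edges: (u, v, cost) triples; positive cost = attractive (prefer to join)."""
    parent = list(range(num_nodes))

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, cost in sorted(edges, key=lambda e: -e[2]):
        if cost <= 0:          # only contract attractive edges
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv    # merge the two clusters
    return [find(i) for i in range(num_nodes)]  # cluster id per node

print(greedy_contraction(4, [(0, 1, 2.0), (1, 2, -1.5), (2, 3, 0.7), (0, 3, -0.2)]))
```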
FastDOG: Fast Discrete Optimization on GPU
A. Abbas and P. Swoboda
Technical Report, 2021b
(arXiv: 2111.10270)
Abstract
We present a massively parallel Lagrange decomposition method for solving 0-1 integer linear programs occurring in structured prediction. We propose a new iterative update scheme for solving the Lagrangean dual and a perturbation technique for decoding primal solutions. For representing subproblems we follow Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and dual algorithms require little synchronization between subproblems and optimization over BDDs needs only elementary operations without complicated control flow. This allows us to exploit the parallelism offered by GPUs for all components of our method. We present experimental results on combinatorial problems from MAP inference for Markov Random Fields, quadratic assignment and cell tracking for developmental biology. Our highly parallel GPU implementation improves upon the running times of the algorithms from Lange et al. (2021) by up to an order of magnitude. In particular, we come close to or outperform some state-of-the-art specialized heuristics while being problem agnostic.
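A minimal sketch of the underlying Lagrange decomposition on a tiny 0-1 ILP: the objective is split between two constraints that each receive a variable copy, and multipliers push the copies toward agreement. Brute-force oracles and plain subgradient steps stand in for the paper's BDD subproblems and min-marginal-based updates.

```python
import itertools
import numpy as np

c = np.array([1.0, -2.0, 0.5])              # objective over x in {0,1}^3
cons = [lambda x: x[0] + x[1] >= 1,         # constraint of subproblem 1
        lambda x: x[1] + x[2] <= 1]         # constraint of subproblem 2

def solve_sub(cost, feasible):
    """Brute-force oracle: argmin cost . x over feasible binary vectors."""
    best = min((x for x in itertools.product((0, 1), repeat=3) if feasible(x)),
               key=lambda x: np.dot(cost, x))
    return np.array(best)

lam = np.zeros(3)                           # multipliers on the coupling x1 == x2
for it in range(100):
    xa = solve_sub(c / 2 + lam, cons[0])
    xb = solve_sub(c / 2 - lam, cons[1])
    lam += (xa - xb) / (it + 1)             # subgradient ascent on the dual
xa = solve_sub(c / 2 + lam, cons[0])
xb = solve_sub(c / 2 - lam, cons[1])
print(np.dot(c / 2 + lam, xa) + np.dot(c / 2 - lam, xb))  # dual lower bound
```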
Long-term future prediction under uncertainty and multi-modality
A. Bhattacharyya
PhD Thesis, Universität des Saarlandes, 2021
Optimising for Interpretability: Convolutional Dynamic Alignment Networks
M. D. Böhle, M. Fritz and B. Schiele
Technical Report, 2021
(arXiv: 2109.13004)
Abstract
We introduce a new family of neural network models called Convolutional Dynamic Alignment Networks (CoDA Nets), which are performant classifiers with a high degree of inherent interpretability. Their core building blocks are Dynamic Alignment Units (DAUs), which are optimised to transform their inputs with dynamically computed weight vectors that align with task-relevant patterns. As a result, CoDA Nets model the classification prediction through a series of input-dependent linear transformations, allowing for linear decomposition of the output into individual input contributions. Given the alignment of the DAUs, the resulting contribution maps align with discriminative input patterns. These model-inherent decompositions are of high visual quality and outperform existing attribution methods under quantitative metrics. Further, CoDA Nets constitute performant classifiers, achieving results on par with ResNet and VGG models on, e.g., CIFAR-10 and TinyImageNet. Lastly, CoDA Nets can be combined with conventional neural network models to yield powerful classifiers that more easily scale to complex datasets such as ImageNet, whilst exhibiting an increased interpretable depth, i.e., the output can be explained well in terms of contributions from intermediate layers within the network.
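A minimal sketch of a single Dynamic Alignment Unit in the spirit described above: an input-dependent weight vector with bounded norm is applied linearly to the input, so the output decomposes exactly into per-dimension contributions. The linear weighting function and dimensions are illustrative assumptions.

```python
import torch

class DynamicAlignmentUnit(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_w = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        w = self.to_w(x)                                     # dynamic weights w(x)
        w = w / w.norm(dim=-1, keepdim=True).clamp(min=1.0)  # cap ||w(x)|| at 1
        contributions = w * x                                # linear decomposition
        return contributions.sum(dim=-1), contributions      # output = w(x)^T x

dau = DynamicAlignmentUnit(8)
y, contrib = dau(torch.randn(2, 8))
assert torch.allclose(y, contrib.sum(dim=-1))  # output is exactly the contribution sum
```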
Where and When: Space-Time Attention for Audio-Visual Explanations
Y. Chen, T. Hummel, A. S. Koepke and Z. Akata
Technical Report, 2021
(arXiv: 2105.01517)
Abstract
Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to explain the dynamics of a complex multi-modal model. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model is capable of predicting audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear and when the predicted sounds occur in videos. We benchmark our model on three audio-visual video event datasets, comparing extensively to multiple recent multi-modal representation learners and intrinsic explanation models. Experimental results demonstrate the clearly superior performance of our model over existing methods on audio-visual video event recognition. Moreover, we conduct an in-depth study to analyze the explainability of our model based on robustness analysis via perturbation tests and pointing games using human annotations.
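A hedged sketch of what audio-conditioned space-time attention can look like: an audio query attends over visual features across time and space, and marginalizing the attention map yields "where" and "when" saliencies. Dimensions and the fusion scheme are placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def space_time_attention(audio_q, visual):
    """audio_q: (B, D); visual: (B, T, S, D) with S spatial positions per frame."""
    B, T, S, D = visual.shape
    scores = torch.einsum('bd,btsd->bts', audio_q, visual) / D ** 0.5
    attn = F.softmax(scores.reshape(B, T * S), dim=-1).reshape(B, T, S)
    context = torch.einsum('bts,btsd->bd', attn, visual)  # attended visual evidence
    where = attn.sum(dim=1)   # spatial saliency, (B, S): where the cues appear
    when = attn.sum(dim=2)    # temporal saliency, (B, T): when the sound occurs
    return context, where, when

ctx, where, when = space_time_attention(torch.randn(2, 128), torch.randn(2, 8, 49, 128))
```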
Text-image synergy for multimodal retrieval and annotation
S. N. Chowdhury
PhD Thesis, Universität des Saarlandes, 2021
Abstract
Text and images are the two most common data modalities found on the Internet. Understanding the synergy between text and images, that is, seamlessly analyzing information from these modalities may be trivial for humans, but is challenging for software systems. In this dissertation we study problems where deciphering text-image synergy is crucial for finding solutions. We propose methods and ideas that establish semantic connections between text and images in multimodal contents, and empirically show their effectiveness in four interconnected problems: Image Retrieval, Image Tag Refinement, Image-Text Alignment, and Image Captioning. Our promising results and observations open up interesting scopes for future research involving text-image data understanding.
TADA: Taxonomy Adaptive Domain Adaptation
R. Gong, M. Danelljan, D. Dai, W. Wang, D. P. Paudel, A. Chhatkuli, F. Yu and L. Van Gool
Technical Report, 2021
(arXiv: 2109.04813)
Abstract
Traditional domain adaptation addresses the task of adapting a model to a novel target domain under limited or no additional supervision. While tackling the input domain gap, the standard domain adaptation settings assume no domain change in the output space. In semantic prediction tasks, different datasets are often labeled according to different semantic taxonomies. In many real-world settings, the target domain task requires a different taxonomy than the one imposed by the source domain. We therefore introduce the more general taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent taxonomies between the two domains. We further propose an approach that jointly addresses image-level and label-level domain adaptation. On the label level, we employ a bilateral mixed sampling strategy to augment the target domain, and a relabelling method to unify and align the label spaces. We address the image-level domain gap by proposing an uncertainty-rectified contrastive learning method, leading to more domain-invariant and class-discriminative features. We extensively evaluate the effectiveness of our framework under different TADA settings: open taxonomy, coarse-to-fine taxonomy, and partially-overlapping taxonomy. Our framework outperforms the previous state of the art by a large margin, while being capable of adapting to target taxonomies.
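A minimal sketch of the label-unification step: per-dataset taxonomies are mapped into one shared label space before adaptation. The class names and mapping below are made-up examples, not the taxonomies studied in the paper.

```python
# hypothetical taxonomies of a source and a target dataset
SOURCE_TO_UNIFIED = {"person": "person", "rider": "person",
                     "car": "vehicle", "truck": "vehicle"}
TARGET_TO_UNIFIED = {"pedestrian": "person", "vehicle": "vehicle"}

def relabel(labels, mapping):
    """Map dataset-specific class names into the unified taxonomy."""
    return [mapping.get(c, "ignore") for c in labels]  # unmapped classes are ignored

print(relabel(["rider", "truck", "sky"], SOURCE_TO_UNIFIED))  # ['person', 'vehicle', 'ignore']
```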
Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
L. Hoyer, D. Dai, Q. Wang, Y. Chen and L. Van Gool
Technical Report, 2021
(arXiv: 2108.12545)
Abstract
Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi-supervised and domain-adaptive semantic segmentation, which is enhanced by self-supervised monocular depth estimation (SDE) trained only on unlabeled image sequences. In particular, we utilize SDE as an auxiliary task comprehensively across the entire learning framework: First, we automatically select the most useful samples to be annotated for semantic segmentation based on the correlation of sample diversity and difficulty between SDE and semantic segmentation. Second, we implement a strong data augmentation by mixing images and labels using the geometry of the scene. Third, we transfer knowledge from features learned during SDE to semantic segmentation by means of transfer and multi-task learning. And fourth, we exploit additional labeled synthetic data with Cross-Domain DepthMix and Matching Geometry Sampling to align synthetic and real data. We validate the proposed model on the Cityscapes dataset, where all four contributions demonstrate significant performance gains, and achieve state-of-the-art results for semi-supervised semantic segmentation as well as for semi-supervised domain adaptation. In particular, with only 1/30 of the Cityscapes labels, our method achieves 92% of the fully-supervised baseline performance and even 97% when exploiting additional data from GTA. The source code is available at https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.
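A hedged sketch of the geometry-based mixing idea ("mixing images and labels using the geometry of the scene"): pixels of one image occlude the other wherever their estimated depth is smaller, and the labels follow the same mask. The paper's actual DepthMix augmentation differs in its details.

```python
import numpy as np

def depth_mix(img_a, lbl_a, depth_a, img_b, lbl_b, depth_b):
    """Composite A over B where A is nearer to the camera; labels follow suit."""
    nearer = depth_a < depth_b                        # (H, W) occlusion mask
    img = np.where(nearer[..., None], img_a, img_b)   # (H, W, 3) mixed image
    lbl = np.where(nearer, lbl_a, lbl_b)              # (H, W) mixed label map
    return img, lbl
```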
Learning Graph Embeddings for Open World Compositional Zero-Shot Learning
M. Mancini, M. F. Naeem, Y. Xian and Z. Akata
Technical Report, 2021
(arXiv: 2105.01017)
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions of state and object visual primitives seen during training. A problem with standard CZSL is the assumption of knowing which unseen compositions will be available at test time. In this work, we overcome this assumption by operating in the open world setting, where no limit is imposed on the compositional space at test time and the search space contains a large number of unseen compositions. To address this problem, we propose a new approach, Compositional Cosine Graph Embeddings (Co-CGE), based on two principles. First, Co-CGE models the dependency between states, objects and their compositions through a graph convolutional neural network. The graph propagates information from seen to unseen concepts, improving their representations. Second, since not all unseen compositions are equally feasible, and less feasible ones may damage the learned representations, Co-CGE estimates a feasibility score for each unseen composition, using the scores as margins in a cosine similarity-based loss and as weights in the adjacency matrix of the graphs. Experiments show that our approach achieves state-of-the-art performance in standard CZSL while outperforming previous methods in the open world scenario.
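A minimal sketch of using feasibility scores as margins in a cosine-similarity loss, as described above: less feasible compositions receive larger margins and are thus pushed further from the image embedding. Embedding shapes and the exact margin form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_margin_loss(img_emb, comp_emb, target, feasibility, scale=10.0):
    """img_emb: (B, D); comp_emb: (C, D); feasibility: (C,) in [0, 1]."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(comp_emb, dim=-1).T  # (B, C)
    margin = 1.0 - feasibility                 # infeasible compositions: big margin
    onehot = F.one_hot(target, sim.size(1)).bool()
    logits = scale * torch.where(onehot, sim, sim - margin)  # margin on negatives only
    return F.cross_entropy(logits, target)

loss = cosine_margin_loss(torch.randn(4, 16), torch.randn(10, 16),
                          torch.randint(10, (4,)), torch.rand(10))
```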
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking
D. H. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag and P. Swoboda
Technical Report, 2021
(arXiv: 2111.11892)
Abstract
Multi-camera multi-object tracking is currently drawing attention in the computer vision field due to its superior performance in real-world applications such as video surveillance of crowded scenes or large spaces. In this work, we propose a mathematically elegant multi-camera multiple object tracking approach based on a spatial-temporal lifted multicut formulation. Our model utilizes state-of-the-art tracklets produced by single-camera trackers as proposals. As these tracklets may contain ID-switch errors, we refine them through a novel pre-clustering obtained from 3D geometry projections. As a result, we derive a better tracking graph without ID switches and more precise affinity costs for the data association phase. Tracklets are then matched to multi-camera trajectories by solving a global lifted multicut formulation that incorporates short- and long-range temporal interactions on tracklets located in the same camera as well as inter-camera ones. Experimental results on the WildTrack dataset yield near-perfect results; we outperform state-of-the-art trackers on Campus while being on par on the PETS-09 dataset. We will make our implementations available upon acceptance of the paper.
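A toy sketch of the geometric pre-clustering idea: detections from different cameras are projected to common ground-plane coordinates, and detections that land close together are merged before the lifted multicut stage. The projection and the greedy radius threshold are illustrative stand-ins.

```python
import numpy as np

def precluster(ground_pts, radius=0.5):
    """ground_pts: (N, 2) ground-plane positions; returns a cluster id per point."""
    labels = -np.ones(len(ground_pts), dtype=int)
    next_id = 0
    for i in range(len(ground_pts)):
        if labels[i] == -1:            # open a new cluster
            labels[i] = next_id
            next_id += 1
        for j in range(i + 1, len(ground_pts)):  # greedily absorb nearby points
            if labels[j] == -1 and np.linalg.norm(ground_pts[i] - ground_pts[j]) < radius:
                labels[j] = labels[i]
    return labels

print(precluster(np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])))  # [0 0 1]
```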
From Pixels to People
M. Omran
PhD Thesis, Universität des Saarlandes, 2021
Abstract
Humans are at the centre of a significant amount of research in computer vision. Endowing machines with the ability to perceive people from visual data is an immense scientific challenge with a high degree of direct practical relevance. Success in automatic perception can be measured at different levels of abstraction, and this will depend on which intelligent behaviour we are trying to replicate: the ability to localise persons in an image or in the environment, understanding how persons are moving at the skeleton and at the surface level, interpreting their interactions with the environment including with other people, and perhaps even anticipating future actions. In this thesis we tackle different sub-problems of the broad research area referred to as "looking at people", aiming to perceive humans in images at different levels of granularity. We start with bounding box-level pedestrian detection: We present a retrospective analysis of methods published in the decade preceding our work, identifying various strands of research that have advanced the state of the art. With quantitative experiments, we demonstrate the critical role of developing better feature representations and having the right training distribution. We then contribute two methods based on the insights derived from our analysis: one that combines the strongest aspects of past detectors and another that focuses purely on learning representations. The latter method outperforms more complicated approaches, especially those based on hand-crafted features. We conclude our work on pedestrian detection with a forward-looking analysis that maps out potential avenues for future research. We then turn to pixel-level methods: Perceiving humans requires us to both separate them precisely from the background and identify their surroundings. To this end, we introduce Cityscapes, a large-scale dataset for street scene understanding. This has since established itself as a go-to benchmark for segmentation and detection. We additionally develop methods that relax the requirement for expensive pixel-level annotations, focusing on the task of boundary detection, i.e. identifying the outlines of relevant objects and surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and labelling to fine-grained spatial understanding. We contribute a method for recovering 3D human shape and pose, which marries the advantages of learning-based and model-based approaches. We conclude the thesis with a detailed discussion of benchmarking practices in computer vision. Among other things, we argue that the design of future datasets should be driven by the general goal of combinatorial robustness besides task-specific considerations.
Adversarial Content Manipulation for Analyzing and Improving Model Robustness
R. Shetty
PhD Thesis, Universität des Saarlandes, 2021
Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators
D. Stutz, N. Chandramoorthy, M. Hein and B. Schiele
Technical Report, 2021
(arXiv: 2104.08323)
Abstract
Deep neural network (DNN) accelerators received considerable attention in recent years due to the potential to save energy compared to mainstream hardware. Low-voltage operation of DNN accelerators reduces energy consumption further, but causes bit-level failures in the memory storing the quantized DNN weights. Furthermore, DNN accelerators have been shown to be vulnerable to adversarial attacks on voltage controllers or individual bits. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, as well as random bit error training (RandBET) or adversarial bit error training (AdvBET) significantly improves robustness against random or adversarial bit errors in quantized DNN weights. This not only leads to high energy savings from low-voltage operation and low-precision quantization, but also improves the security of DNN accelerators. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors from profiled SRAM arrays, and achieves robustness against both targeted and untargeted bit-level attacks. Without losing more than 0.8%/2% in test accuracy, we can reduce energy consumption on CIFAR-10 by 20%/30% for 8/4-bit quantization using RandBET. Allowing up to 320 adversarial bit errors, AdvBET reduces test error from above 90% (chance level) to 26.22% on CIFAR-10.
Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes
K. Zhou, B. L. Bhatnagar, B. Schiele and G. Pons-Moll
Technical Report, 2021
(arXiv: 2102.01161)
Abstract
Most learning methods for 3D data (point clouds, meshes) suffer significant performance drops when the data is not carefully aligned to a canonical orientation. Aligning real-world 3D data collected from different sources is non-trivial and requires manual intervention. In this paper, we propose the Adjoint Rigid Transform (ART) Network, a neural module which can be integrated with a variety of 3D networks to significantly boost their performance. ART learns to rotate input shapes to a learned canonical orientation, which is crucial for many tasks such as shape reconstruction, interpolation, non-rigid registration, and latent disentanglement. ART achieves this with self-supervision and a rotation equivariance constraint on predicted rotations. The remarkable result is that with only self-supervision, ART facilitates learning a unique canonical orientation for both rigid and non-rigid shapes, which leads to a notable boost in performance on the aforementioned tasks. We will release our code and pre-trained models for further research.
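A hedged sketch of the canonicalization mechanism: a network predicts a rotation for the input point cloud, the cloud is rotated into the learned canonical frame, and self-supervision asks the canonical output to be invariant to random input rotations. The 6D rotation parameterization below is an assumption, not necessarily ART's choice.

```python
import torch

def rot6d_to_matrix(x6):
    """Map a 6D vector to a rotation matrix via Gram-Schmidt on two columns."""
    a, b = x6[:3], x6[3:]
    e1 = a / a.norm()
    b = b - (e1 @ b) * e1
    e2 = b / b.norm()
    e3 = torch.linalg.cross(e1, e2)
    return torch.stack([e1, e2, e3], dim=-1)   # orthonormal, det = +1

def canonicalize(net, points):
    """points: (N, 3) -> the cloud rotated into the learned canonical frame."""
    r = rot6d_to_matrix(net(points))            # net: (N, 3) -> 6 values (assumed)
    return points @ r

# self-supervised consistency: canonicalize(net, points @ R_rand) should match
# canonicalize(net, points) for random rotations R_rand
```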
2020
Implicit Feature Networks for Texture Completion from Partial 3D Data
J. Chibane and G. Pons-Moll
Computer Vision -- ECCV Workshops 2020, 2020
Synthetic Convolutional Features for Improved Semantic Segmentation
Y. He, B. Schiele and M. Fritz
Computer Vision -- ECCV Workshops 2020, 2020
Adversarial Training Against Location-Optimized Adversarial Patches
S. Rao, D. Stutz and B. Schiele
Computer Vision -- ECCV Workshops 2020, 2020
SHARP 2020: The 1st Shape Recovery from Partial Textured 3D Scans Challenge Results
A. Saint, A. Kacem, K. Cherenkova, K. Papadopoulos, J. Chibane, G. Pons-Moll, G. Gusev, D. Fofi, D. Aouada and B. Ottersten
Computer Vision -- ECCV Workshops 2020, 2020
Body Shape Privacy in Images: Understanding Privacy and Preventing Automatic Shape Extraction
H. Sattar, K. Krombholz, G. Pons-Moll and M. Fritz
Computer Vision -- ECCV Workshops 2020, 2020
Abstract
Modern approaches to pose and body shape estimation have recently achieved strong performance even under challenging real-world conditions. Even from a single image of a clothed person, a realistic-looking body shape can be inferred that captures a user's weight group and body shape type well. This opens up a whole spectrum of applications -- in particular in fashion -- where virtual try-on and recommendation systems can make use of these new and automatized cues. However, a realistic depiction of the undressed body is regarded as highly private and therefore might not be consented to by most people. Hence, we ask if the automatic extraction of such information can be effectively evaded. While adversarial perturbations have been shown to be effective for manipulating the output of machine learning models -- in particular, end-to-end deep learning approaches -- state-of-the-art shape estimation methods are composed of multiple stages. We perform the first investigation of different strategies that can be used to effectively manipulate the automatic shape estimation while preserving the overall appearance of the original image.
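For intuition, a generic one-step adversarial perturbation (FGSM, named plainly; it is not the paper's method, which targets a multi-stage shape estimation pipeline rather than a single end-to-end model):

```python
import torch

def fgsm_perturb(model, image, target, loss_fn, eps=2.0 / 255):
    """Shift the image by eps in the gradient-sign direction to raise the loss."""
    image = image.clone().requires_grad_(True)
    loss_fn(model(image), target).backward()
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()
```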
Haar Wavelet based Block Autoregressive Flows for Trajectories
A. Bhattacharyya, C.-N. Straehle, M. Fritz and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Analyzing the Dependency of ConvNets on Spatial Information
Y. Fan, Y. Xian, M. M. Losch and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Long-Term Anticipation of Activities with Cycle Consistency
Y. A. Farha, Q. Ke, B. Schiele and J. Gall
Pattern Recognition (GCPR 2020), 2020
On the Lifted Multicut Polytope for Trees
J.-H. Lange and B. Andres
Pattern Recognition (GCPR 2020), 2020
Semantic Bottlenecks: Quantifying & Improving Inspectability of Deep Representations
M. Losch, M. Fritz and B. Schiele
Pattern Recognition (GCPR 2020), 2020
Long-Tailed Recognition Using Class-Balanced Experts
S. Sharma, N. Yu, M. Fritz and B. Schiele
Pattern Recognition (GCPR 2020), 2020