2021
Real-time Deep Dynamic Characters
M. Habermann, L. Liu, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
ACM Transactions on Graphics (Proc. ACM SIGGRAPH 2021), Volume 40, Number 4, 2021
RMM: Reinforced Memory Management for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), 2021
Shape your Space: A Gaussian Mixture Regularization Approach to Deterministic Autoencoders
A. Saseendran, K. Skubch, S. Falkner and M. Keuper
Advances in Neural Information Processing Systems 34 Pre-Proceedings (NeurIPS 2021), 2021
Learning Decision Trees Recurrently Through Communication
S. Alaniz, D. Marcos, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers
A. Bhattacharyya, D. O. Reino, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Convolutional Dynamic Alignment Networks for Interpretable Classifications
M. D. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Stereo Radiance Fields (SRF): Learning View Synthesis from Sparse Views of Novel Scenes
J. Chibane, A. Bansal, V. Lazova and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Learning Spatially-Variant MAP Models for Non-blind Image Deblurring
J. Dong, S. Roth and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors
V. Guzov, A. Mir, T. Sattler and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Adaptive Aggregation Networks for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
Learning Graph Embeddings for Compositional Zero-shot Learning
M. F. Naeem, Y. Xian, F. Tombari and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
D-NeRF: Neural Radiance Fields for Dynamic Scenes
A. Pumarola, E. Corona, G. Pons-Moll and F. Moreno-Noguer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021
A Deeper Look into DeepCap
M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll and C. Theobalt
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
Abstract
Human performance capture is a highly important computer vision problem with
many applications in movie production and virtual/augmented reality. Many
previous performance capture approaches either required expensive multi-view
setups or did not recover dense space-time coherent geometry with
frame-to-frame correspondences. We propose a novel deep learning approach for
monocular dense human performance capture. Our method is trained in a weakly
supervised manner based on multi-view supervision, completely removing the need
for training data with 3D ground truth annotations. The network architecture is
based on two separate networks that disentangle the task into a pose estimation
and a non-rigid surface deformation step. Extensive qualitative and
quantitative evaluations show that our approach outperforms the state of the
art in terms of quality and robustness. This work is an extended version of
DeepCap, in which we provide more detailed explanations, comparisons and results, as well as applications.
Future Moment Assessment for Action Query
Q. Ke, M. Fritz and B. Schiele
IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021
Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences
R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox and H. Kuehne
IEEE Winter Conference on Applications of Computer Vision (WACV 2021), 2021
mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets
R. Gong, D. Dai, Y. Chen, W. Li and L. Van Gool
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather
M. Hahner, C. Sakaridis, D. Dai and L. Van Gool
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths
A. Horňáková, T. Kaiser, P. Swoboda, M. Rolinek, B. Rosenhahn and R. Henschel
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
M. Kayser, O.-M. Camburu, L. Salewski, C. Emde, V. Do, Z. Akata and T. Lukasiewicz
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
Keep CALM and Improve Visual Feature Attribution
J. M. Kim, J. Choe, Z. Akata and S. J. Oh
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding
C. Sakaridis, D. Dai and L. Van Gool
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
Q. Wang, D. Dai, L. Hoyer, L. Van Gool and O. Fink
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
End-to-End Urban Driving by Imitating a Reinforcement Learning Coach
Z. Zhang, A. Liniger, D. Dai, F. Yu and L. Van Gool
International Conference on Computer Vision (ICCV 2021), 2021 (Accepted/in press)
You Only Need Adversarial Supervision for Semantic Image Synthesis
E. Schönfeld, V. Sushko, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Conference on Learning Representations (ICLR 2021), 2021
Semantic Bottlenecks: Quantifying and Improving Inspectability of Deep Representations
M. Losch, M. Fritz and B. Schiele
International Journal of Computer Vision, Volume 129, 2021
SampleFix: Learning to Correct Programs by Sampling Diverse Fixes
H. Hajipour, A. Bhattacharyya, C.-A. Staicu and M. Fritz
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), 2021
Internalized Biases in Fréchet Inception Distance
S. Jung and M. Keuper
NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2021 Workshop DistShift), 2021 (Accepted/in press)
Efficient Message Passing for 0–1 ILPs with Binary Decision Diagrams
J.-H. Lange and P. Swoboda
Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021
Bit Error Robustness for Energy-Efficient DNN Accelerators
D. Stutz, N. Chandramoorthy, M. Hein and B. Schiele
Proceedings of the 4th MLSys Conference, 2021
Abstract
Deep neural network (DNN) accelerators have received considerable attention in recent years due to the energy they save compared to mainstream hardware. Low-voltage operation of DNN accelerators allows energy consumption to be reduced further, but causes bit-level failures in the memory storing the quantized DNN weights. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, and random bit error training (RandBET) significantly improves robustness against random bit errors in (quantized) DNN weights. This leads to high energy savings from both low-voltage operation and low-precision quantization. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors from profiled SRAM arrays. We also discuss why weight clipping alone is already a quite effective way to achieve robustness against bit errors. Moreover, we specifically discuss the involved trade-offs regarding accuracy, robustness and precision: without losing more than 1% in accuracy compared to a normally trained 8-bit DNN, we can reduce energy consumption on CIFAR-10 by 20%. Higher energy savings of, e.g., 30%, are possible at the cost of 2.5% in accuracy, even for 4-bit DNNs.
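The core idea of random bit error training can be illustrated with a short sketch: before a forward pass, every bit of the fixed-point quantized weights is flipped independently with some small probability, mimicking low-voltage memory faults. The following minimal NumPy sketch is not the authors' implementation; the symmetric quantizer, the uniform bit-flip model and all function names are illustrative assumptions.

    import numpy as np

    def quantize(w, bits=8, w_max=1.0):
        # Symmetric fixed-point quantization of weights w (assumed clipped to [-w_max, w_max]).
        scale = (2 ** (bits - 1) - 1) / w_max
        q = np.clip(np.round(w * scale), -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1)
        return q.astype(np.int64), scale

    def inject_random_bit_errors(q, bits=8, p=0.01, rng=None):
        # Flip each stored bit of the quantized weights independently with probability p.
        rng = np.random.default_rng() if rng is None else rng
        u = (q.ravel() & ((1 << bits) - 1)).astype(np.int64)   # two's-complement bit pattern
        flips = rng.random((u.size, bits)) < p                 # which bits to flip
        masks = (flips * (1 << np.arange(bits))).sum(axis=1)   # per-weight XOR mask
        u = u ^ masks
        u[u >= 2 ** (bits - 1)] -= 2 ** bits                   # back to the signed range
        return u.reshape(q.shape)

    # In RandBET-style training, the forward pass would then use the dequantized,
    # perturbed weights:  w_noisy = inject_random_bit_errors(q) / scale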
A Closer Look at Self-training for Zero-Label Semantic Segmentation
G. Pastore, F. Cermelli, Y. Xian, M. Mancini, Z. Akata and B. Caputo
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), 2021
InfoScrub: Towards Attribute Privacy by Targeted Obfuscation
H.-P. Wang, T. Orekondy and M. Fritz
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2021), 2021
Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis
Y. He, N. Yu, M. Keuper and M. Fritz
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI 2021), 2021
RAMA: A Rapid Multicut Algorithm on GPU
A. Abbas and P. Swoboda
Technical Report, 2021a (arXiv: 2109.01838)
Abstract
We propose a highly parallel primal-dual algorithm for the multicut (a.k.a.
correlation clustering) problem, a classical graph clustering problem widely
used in machine learning and computer vision. Our algorithm consists of three
steps executed recursively: (1) finding conflicted cycles that correspond to violated inequalities of the underlying multicut relaxation, (2) performing message passing between the edges and cycles to optimize the Lagrange relaxation arising from the found violated cycles, producing reduced costs, and (3) contracting edges with high reduced costs through matrix-matrix multiplications.
Our algorithm produces primal solutions and dual lower bounds that estimate the distance to the optimum. We implement our algorithm on GPUs and obtain improvements in execution speed of one to two orders of magnitude, without sacrificing solution quality, compared to traditional serial algorithms that run on CPUs. We can solve very large-scale benchmark problems with up to
$\mathcal{O}(10^8)$ variables in a few seconds with small primal-dual gaps. We
make our code available at https://github.com/pawelswoboda/RAMA.
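For readers unfamiliar with the problem, the multicut instance addressed above can be stated as a 0-1 program over edge labels; the cycle inequalities below are the ones whose violated ("conflicted") cycles step (1) searches for. This is the standard formulation in generic notation, not necessarily the paper's exact one:

$$\min_{y \in \{0,1\}^E} \sum_{e \in E} c_e\, y_e \quad \text{s.t.} \quad y_e \le \sum_{e' \in C \setminus \{e\}} y_{e'} \quad \forall\ \text{cycles } C \subseteq E,\ \forall e \in C,$$

where $y_e = 1$ indicates that edge $e$ is cut and $c_e$ is its (possibly negative) cost; the inequalities forbid cutting a cycle in exactly one edge, so the cut edges induce a consistent clustering of the nodes.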
FastDOG: Fast Discrete Optimization on GPU
A. Abbas and P. Swoboda
Technical Report, 2021b (arXiv: 2111.10270)
Abstract
We present a massively parallel Lagrange decomposition method for solving 0-1
integer linear programs occurring in structured prediction. We propose a new
iterative update scheme for solving the Lagrangean dual and a perturbation
technique for decoding primal solutions. For representing subproblems we follow
Lange et al. (2021) and use binary decision diagrams (BDDs). Our primal and
dual algorithms require little synchronization between subproblems, and optimization over BDDs needs only elementary operations without complicated
control flow. This allows us to exploit the parallelism offered by GPUs for all
components of our method. We present experimental results on combinatorial
problems from MAP inference for Markov Random Fields, quadratic assignment and
cell tracking for developmental biology. Our highly parallel GPU implementation
improves upon the running times of the algorithms from Lange et al. (2021) by
up to an order of magnitude. In particular, we come close to or outperform some
state-of-the-art specialized heuristics while being problem agnostic.
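As background, and in one standard parametrization rather than the paper's exact notation, Lagrange decomposition of a 0-1 ILP $\min_{x \in \{0,1\}^n} \langle c, x \rangle$ whose constraints are split into subproblems $j = 1, \dots, m$ (each with feasible set $\mathcal{X}_j$ over its variable subset $I_j$, here represented by a BDD) yields the dual

$$\max_{\lambda}\ \sum_{j=1}^{m} \min_{x^j \in \mathcal{X}_j} \langle c^j + \lambda^j, x^j \rangle \quad \text{s.t.} \quad \sum_{j\,:\, i \in I_j} \lambda^j_i = 0 \ \ \forall i, \qquad \sum_j c^j = c,$$

where each inner minimization is a shortest-path-like computation over the corresponding BDD. This is what makes the per-subproblem updates cheap, nearly synchronization-free and well suited to GPU parallelism.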
Long-term future prediction under uncertainty and multi-modality
A. Bhattacharyya
PhD Thesis, Universität des Saarlandes, 2021
Optimising for Interpretability: Convolutional Dynamic Alignment Networks
M. D. Böhle, M. Fritz and B. Schiele
Technical Report, 2021 (arXiv: 2109.13004)
Abstract
We introduce a new family of neural network models called Convolutional
Dynamic Alignment Networks (CoDA Nets), which are performant classifiers with a
high degree of inherent interpretability. Their core building blocks are
Dynamic Alignment Units (DAUs), which are optimised to transform their inputs
with dynamically computed weight vectors that align with task-relevant
patterns. As a result, CoDA Nets model the classification prediction through a
series of input-dependent linear transformations, allowing for linear
decomposition of the output into individual input contributions. Given the
alignment of the DAUs, the resulting contribution maps align with
discriminative input patterns. These model-inherent decompositions are of high
visual quality and outperform existing attribution methods under quantitative
metrics. Further, CoDA Nets constitute performant classifiers, achieving on par
results to ResNet and VGG models on e.g. CIFAR-10 and TinyImagenet. Lastly,
CoDA Nets can be combined with conventional neural network models to yield
powerful classifiers that more easily scale to complex datasets such as
Imagenet whilst exhibiting an increased interpretable depth, i.e., the output
can be explained well in terms of contributions from intermediate layers within
the network.
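The linear decomposition mentioned above can be made concrete: since every layer applies an input-dependent linear map, the overall prediction collapses to a single input-dependent linear function of the input. Schematically (ignoring biases, and not in the paper's exact notation):

$$\hat{y}(x) = W_L(x) \cdots W_1(x)\, x = W(x)\, x, \qquad \hat{y}_c(x) = \sum_i \big[W(x)\big]_{c,i}\, x_i,$$

so the contribution of input dimension $i$ to class $c$ is $[W(x)]_{c,i}\, x_i$, which yields the model-inherent contribution maps discussed in the abstract.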
Where and When: Space-Time Attention for Audio-Visual Explanations
Y. Chen, T. Hummel, A. S. Koepke and Z. Akata
Technical Report, 2021 (arXiv: 2105.01517)
Abstract
Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to untangle the dynamics of a complex multi-modal model. In this work,
we take a crucial step forward and explore learnable explanations for
audio-visual recognition. Specifically, we propose a novel space-time attention
network that uncovers the synergistic dynamics of audio and visual data over
both space and time. Our model is capable of predicting the audio-visual video
events, while justifying its decision by localizing where the relevant visual
cues appear, and when the predicted sounds occur in videos. We benchmark our
model on three audio-visual video event datasets, comparing extensively to
multiple recent multi-modal representation learners and intrinsic explanation
models. Experimental results demonstrate the clearly superior performance of our
model over the existing methods on audio-visual video event recognition.
Moreover, we conduct an in-depth study to analyze the explainability of our
model based on robustness analysis via perturbation tests and pointing games
using human annotations.
Text-image synergy for multimodal retrieval and annotation
S. N. Chowdhury
PhD Thesis, Universität des Saarlandes, 2021
Abstract
Text and images are the two most common data modalities found on the Internet. Understanding the synergy between text and images, that is, seamlessly analyzing information from these modalities may be trivial for humans, but is challenging for software systems. In this dissertation we study problems where deciphering text-image synergy is crucial for finding solutions. We propose methods and ideas that establish semantic connections between text and images in multimodal contents, and empirically show their effectiveness in four interconnected problems: Image Retrieval, Image Tag Refinement, Image-Text Alignment, and Image Captioning. Our promising results and observations open up interesting scopes for future research involving text-image data understanding.
TADA: Taxonomy Adaptive Domain Adaptation
R. Gong, M. Danelljan, D. Dai, W. Wang, D. P. Paudel, A. Chhatkuli, F. Yu and L. Van Gool
Technical Report, 2021 (arXiv: 2109.04813)
Abstract
Traditional domain adaptation addresses the task of adapting a model to a
novel target domain under limited or no additional supervision. While tackling
the input domain gap, the standard domain adaptation settings assume no domain
change in the output space. In semantic prediction tasks, different datasets
are often labeled according to different semantic taxonomies. In many
real-world settings, the target domain task requires a different taxonomy than
the one imposed by the source domain. We therefore introduce the more general
taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent
taxonomies between the two domains. We further propose an approach that jointly
addresses the image-level and label-level domain adaptation. On the
label-level, we employ a bilateral mixed sampling strategy to augment the
target domain, and a relabelling method to unify and align the label spaces. We
address the image-level domain gap by proposing an uncertainty-rectified
contrastive learning method, leading to more domain-invariant and class
discriminative features. We extensively evaluate the effectiveness of our
framework under different TADA settings: open taxonomy, coarse-to-fine
taxonomy, and partially-overlapping taxonomy. Our framework outperforms
previous state-of-the-art methods by a large margin, while being capable of adapting to target taxonomies.
Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation
L. Hoyer, D. Dai, Q. Wang, Y. Chen and L. Van Gool
Technical Report, 2021 (arXiv: 2108.12545)
Abstract
Training deep networks for semantic segmentation requires large amounts of
labeled training data, which presents a major challenge in practice, as
labeling segmentation masks is a highly labor-intensive process. To address
this issue, we present a framework for semi-supervised and domain-adaptive
semantic segmentation, which is enhanced by self-supervised monocular depth
estimation (SDE) trained only on unlabeled image sequences.
In particular, we utilize SDE as an auxiliary task comprehensively across the
entire learning framework: First, we automatically select the most useful
samples to be annotated for semantic segmentation based on the correlation of
sample diversity and difficulty between SDE and semantic segmentation. Second,
we implement a strong data augmentation by mixing images and labels using the
geometry of the scene. Third, we transfer knowledge from features learned
during SDE to semantic segmentation by means of transfer and multi-task
learning. And fourth, we exploit additional labeled synthetic data with
Cross-Domain DepthMix and Matching Geometry Sampling to align synthetic and
real data.
We validate the proposed model on the Cityscapes dataset, where all four
contributions demonstrate significant performance gains, and achieve
state-of-the-art results for semi-supervised semantic segmentation as well as
for semi-supervised domain adaptation. In particular, with only 1/30 of the
Cityscapes labels, our method achieves 92% of the fully-supervised baseline
performance and even 97% when exploiting additional data from GTA. The source
code is available at
https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.
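A rough sketch of the geometry-based mixing idea (the second contribution): per pixel, content from one image is pasted into another only where its estimated depth indicates it lies closer to the camera, so pasted objects respect the scene geometry. The NumPy sketch below is an illustrative assumption of that idea; the exact masking and sampling strategy of DepthMix / Cross-Domain DepthMix in the paper may differ.

    import numpy as np

    def depth_aware_mix(img_a, lbl_a, depth_a, img_b, lbl_b, depth_b):
        # Paste pixels of image A into image B wherever A is estimated to be
        # closer to the camera than B (illustrative geometry-based mixing).
        mask = depth_a < depth_b                          # H x W boolean occlusion mask
        mixed_img = np.where(mask[..., None], img_a, img_b)
        mixed_lbl = np.where(mask, lbl_a, lbl_b)
        return mixed_img, mixed_lbl, mask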
Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting
A. Kukleva, H. Kuehne and B. Schiele
Technical Report, 2021 (arXiv: 2108.08165)
Abstract
Both generalized and incremental few-shot learning have to deal with three
major challenges: learning novel classes from only few samples per class,
preventing catastrophic forgetting of base classes, and classifier calibration
across novel and base classes. In this work we propose a three-stage framework
that allows us to explicitly and effectively address these challenges. While the
first phase learns base classes with many samples, the second phase learns a
calibrated classifier for novel classes from few samples while also preventing
catastrophic forgetting. In the final phase, calibration is achieved across all
classes. We evaluate the proposed framework on four challenging benchmark
datasets for image and video few-shot classification and obtain
state-of-the-art results for both generalized and incremental few-shot learning.
Learning Graph Embeddings for Open World Compositional Zero-Shot Learning
M. Mancini, M. F. Naeem, Y. Xian and Z. Akata
Technical Report, 2021 (arXiv: 2105.01017)
Abstract
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions
of state and object visual primitives seen during training. A problem with
standard CZSL is the assumption of knowing which unseen compositions will be
available at test time. In this work, we overcome this assumption by operating in the open-world setting, where no limit is imposed on the compositional space at
test time, and the search space contains a large number of unseen compositions.
To address this problem, we propose a new approach, Compositional Cosine Graph
Embeddings (Co-CGE), based on two principles. First, Co-CGE models the
dependency between states, objects and their compositions through a graph
convolutional neural network. The graph propagates information from seen to
unseen concepts, improving their representations. Second, since not all unseen
compositions are equally feasible, and less feasible ones may damage the
learned representations, Co-CGE estimates a feasibility score for each unseen
composition, using the scores as margins in a cosine similarity-based loss and
as weights in the adjacency matrix of the graphs. Experiments show that our
approach achieves state-of-the-art performance in standard CZSL while outperforming previous methods in the open-world scenario.
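One schematic way a feasibility-derived margin could enter a cosine-similarity loss (an illustrative construction, not the paper's exact objective): with image embedding $\phi(x)$, composition embedding $\psi(c)$, scale $s$ and a per-composition margin $m_c$ that grows as composition $c$ is deemed less feasible,

$$\mathcal{L}(x, c_{gt}) = -\log \frac{\exp\!\big(s \cos(\phi(x), \psi(c_{gt}))\big)}{\exp\!\big(s \cos(\phi(x), \psi(c_{gt}))\big) + \sum_{c \ne c_{gt}} \exp\!\big(s\,(\cos(\phi(x), \psi(c)) + m_c)\big)},$$

so that less feasible compositions act as harder negatives and are pushed further away from the image embedding.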
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking
D. H. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag and P. Swoboda
Technical Report, 2021 (arXiv: 2111.11892)
Abstract
Multi-Camera Multi-Object Tracking is currently drawing attention in the
computer vision field due to its superior performance in real-world
applications such as video surveillance of crowded scenes or large spaces.
In this work, we propose a mathematically elegant multi-camera multiple object
tracking approach based on a spatial-temporal lifted multicut formulation. Our
model utilizes state-of-the-art tracklets produced by single-camera trackers as
proposals. As these tracklets may contain ID-Switch errors, we refine them
through a novel pre-clustering obtained from 3D geometry projections. As a
result, we derive a better tracking graph without ID switches and more precise
affinity costs for the data association phase. Tracklets are then matched to
multi-camera trajectories by solving a global lifted multicut formulation that
incorporates short and long-range temporal interactions on tracklets located in
the same camera as well as inter-camera ones. Experimental results on the
WildTrack dataset yield a near-perfect result, outperforming state-of-the-art
trackers on Campus while being on par on the PETS-09 dataset. We will make our
implementations available upon acceptance of the paper.
Seeking Similarities over Differences: Similarity-based Domain Alignment for Adaptive Object Detection
F. Rezaeianaran, R. Shetty, R. Aljundi, D. O. Reino, S. Zhang and B. Schiele
Technical Report, 2021 (arXiv: 2110.01428)
Abstract
In order to robustly deploy object detectors across a wide range of
scenarios, they should be adaptable to shifts in the input distribution without
the need to constantly annotate new data. This has motivated research in
Unsupervised Domain Adaptation (UDA) algorithms for detection. UDA methods
learn to adapt from labeled source domains to unlabeled target domains, by
inducing alignment between detector features from source and target domains.
Yet, there is no consensus on what features to align and how to do the
alignment. In our work, we propose a framework that generalizes the different
components commonly used by UDA methods laying the ground for an in-depth
analysis of the UDA design space. Specifically, we propose a novel UDA
algorithm, ViSGA, a direct implementation of our framework, that leverages the
best design choices and introduces a simple but effective method to aggregate
features at instance-level based on visual similarity before inducing group
alignment via adversarial training. We show that both similarity-based grouping
and adversarial training allow our model to focus on coarsely aligning feature groups, without being forced to match all instances across loosely aligned domains. Finally, we examine the applicability of ViSGA to the setting where labeled data are gathered from different sources. Experiments show that our method not only outperforms previous single-source approaches on Sim2Real and Adverse Weather, but also generalizes well to the multi-source setting.
Adversarial Content Manipulation for Analyzing and Improving Model Robustness
R. Shetty
PhD Thesis, Universität des Saarlandes, 2021
Relating Adversarially Robust Generalization to Flat Minima
D. Stutz, M. Hein and B. Schiele
Technical Report, 2021 (arXiv: 2104.04448)
Abstract
Adversarial training (AT) has become the de-facto standard to obtain models
robust against adversarial examples. However, AT exhibits severe robust
overfitting: cross-entropy loss on adversarial examples, so-called robust loss,
decreases continuously on training examples, while eventually increasing on
test examples. In practice, this leads to poor robust generalization, i.e.,
adversarial robustness does not generalize well to new examples. In this paper,
we study the relationship between robust generalization and flatness of the
robust loss landscape in weight space, i.e., whether robust loss changes
significantly when perturbing weights. To this end, we propose average- and
worst-case metrics to measure flatness in the robust loss landscape and show a
correlation between good robust generalization and flatness. For example,
throughout training, flatness reduces significantly during overfitting such
that early stopping effectively finds flatter minima in the robust loss
landscape. Similarly, AT variants achieving higher adversarial robustness also
correspond to flatter minima. This holds for many popular choices, e.g.,
AT-AWP, TRADES, MART, AT with self-supervision or additional unlabeled
examples, as well as simple regularization techniques, e.g., AutoAugment,
weight decay or label noise. For fair comparison across these approaches, our
flatness measures are specifically designed to be scale-invariant and we
conduct extensive experiments to validate our findings.
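In generic notation (not necessarily the paper's exact definitions), the two kinds of flatness contrasted above measure how much the robust loss increases under random versus worst-case weight perturbations:

$$\mathcal{F}_{\text{avg}}(w) = \mathbb{E}_{\nu \sim p}\big[\mathcal{L}_{\text{rob}}(w + \nu)\big] - \mathcal{L}_{\text{rob}}(w), \qquad \mathcal{F}_{\text{worst}}(w) = \max_{\|\nu\| \le \xi} \mathcal{L}_{\text{rob}}(w + \nu) - \mathcal{L}_{\text{rob}}(w),$$

where $\mathcal{L}_{\text{rob}}(w) = \mathbb{E}_{(x,y)}\big[\max_{\|\delta\|_\infty \le \epsilon} \ell(f_w(x + \delta), y)\big]$ is the robust (adversarial) loss; small values of either measure indicate a flat robust loss landscape around $w$. The paper additionally designs such measures to be scale-invariant, e.g., by choosing the weight perturbation size relative to the weight magnitudes.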
Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators
D. Stutz, N. Chandramoorthy, M. Hein and B. Schiele
Technical Report, 2021 (arXiv: 2104.08323)
Abstract
Deep neural network (DNN) accelerators received considerable attention in
recent years due to the potential to save energy compared to mainstream
hardware. Low-voltage operation of DNN accelerators allows to further reduce
energy consumption significantly, however, causes bit-level failures in the
memory storing the quantized DNN weights. Furthermore, DNN accelerators have
been shown to be vulnerable to adversarial attacks on voltage controllers or
individual bits. In this paper, we show that a combination of robust
fixed-point quantization, weight clipping, as well as random bit error training
(RandBET) or adversarial bit error training (AdvBET) improves robustness
against random or adversarial bit errors in quantized DNN weights
significantly. This not only leads to high energy savings from low-voltage operation and low-precision quantization, but also improves the security of DNN accelerators. Our approach generalizes across operating voltages and
accelerators, as demonstrated on bit errors from profiled SRAM arrays, and
achieves robustness against both targeted and untargeted bit-level attacks.
Without losing more than 0.8%/2% in test accuracy, we can reduce energy
consumption on CIFAR10 by 20%/30% for 8/4-bit quantization using RandBET.
Allowing up to 320 adversarial bit errors, AdvBET reduces test error from above
90% (chance level) to 26.22% on CIFAR10.
Neural-GIF: Neural Generalized Implicit Functions for Animating People in Clothing
G. Tiwari, N. Sarafianos, T. Tung and G. Pons-Moll
Technical Report, 2021 (arXiv: 2108.08807)
Abstract
We present Neural Generalized Implicit Functions (Neural-GIF) to animate people in clothing as a function of body pose. Given a sequence of scans of a subject in various poses, we learn to animate the character for new poses. Existing methods have relied on template-based representations of the human body (or clothing). However, such models usually have fixed and limited resolutions, require difficult data pre-processing steps and cannot be used
with complex clothing. We draw inspiration from template-based methods, which
factorize motion into articulation and non-rigid deformation, but generalize
this concept for implicit shape learning to obtain a more flexible model. We
learn to map every point in the space to a canonical space, where a learned
deformation field is applied to model non-rigid effects, before evaluating the
signed distance field. Our formulation allows the learning of complex and
non-rigid deformations of clothing and soft tissue, without computing a
template registration as it is common with current approaches. Neural-GIF can
be trained on raw 3D scans and reconstructs detailed complex surface geometry
and deformations. Moreover, the model can generalize to new poses. We evaluate
our method on a variety of characters from different public datasets in diverse
clothing styles and show significant improvements over baseline methods,
quantitatively and qualitatively. We also extend our model to the multiple-shape setting. To stimulate further research, we will make the model, code and data
publicly available at: https://virtualhumans.mpi-inf.mpg.de/neuralgif/
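Schematically, and in simplified notation rather than the paper's exact formulation, the pose-conditioned implicit function described above evaluates a query point $x$ under body pose $\theta$ as

$$f(x, \theta) = \mathrm{SDF}\big(\, w(x, \theta) + \Delta(w(x, \theta), \theta) \,\big),$$

i.e., the point is first mapped by $w$ into a canonical space, a learned pose-dependent deformation field $\Delta$ adds non-rigid effects such as clothing and soft-tissue motion, and only then the canonical signed distance field is queried.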
Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes
K. Zhou, B. L. Bhatnagar, B. Schiele and G. Pons-Moll
Technical Report, 2021 (arXiv: 2102.01161)
Abstract
Most learning methods for 3D data (point clouds, meshes) suffer significant
performance drops when the data is not carefully aligned to a canonical
orientation. Aligning real world 3D data collected from different sources is
non-trivial and requires manual intervention. In this paper, we propose the
Adjoint Rigid Transform (ART) Network, a neural module which can be integrated
with a variety of 3D networks to significantly boost their performance. ART
learns to rotate input shapes to a learned canonical orientation, which is crucial for many tasks such as shape reconstruction, interpolation, non-rigid registration, and latent disentanglement. ART achieves this with self-supervision and a rotation equivariance constraint on predicted rotations. The remarkable result is that with only self-supervision, ART facilitates learning a unique canonical orientation for both rigid and non-rigid shapes, which leads to a notable boost in the performance of the aforementioned tasks. We will
release our code and pre-trained models for further research.
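The rotation equivariance constraint mentioned above can be stated schematically (our notation, not necessarily the paper's): if $A(X) \in SO(3)$ denotes the rotation predicted for an input shape $X$, then for any rotation $R$ the canonicalized shape should be unchanged,

$$A(RX)\,(RX) = A(X)\,X \quad\Longleftrightarrow\quad A(RX) = A(X)\,R^{\top},$$

so that rotating the input only changes the predicted rotation, not the resulting canonical orientation.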
2020
Body Shape Privacy in Images: Understanding Privacy and Preventing Automatic Shape Extraction
H. Sattar, K. Krombholz, G. Pons-Moll and M. Fritz
Computer Vision -- ECCV Workshops 2020, 2020
Abstract
Modern approaches to pose and body shape estimation have recently achieved
strong performance even under challenging real-world conditions. Even from a
single image of a clothed person, a realistic-looking body shape can be inferred that captures a user's weight group and body shape type well. This opens up a whole spectrum of applications -- in particular in fashion -- where virtual try-on and recommendation systems can make use of these new and automatized cues. However, a realistic depiction of the undressed body is regarded as highly private, and most people might therefore not consent to it.
Hence, we ask if the automatic extraction of such information can be
effectively evaded. While adversarial perturbations have been shown to be
effective for manipulating the output of machine learning models -- in
particular, end-to-end deep learning approaches -- state-of-the-art shape estimation methods are composed of multiple stages.
investigation of different strategies that can be used to effectively
manipulate the automatic shape estimation while preserving the overall
appearance of the original image.