2020
LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration
B. L. Bhatnagar, C. Sminchisescu, C. Theobalt and G. Pons-Moll
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
Neural Unsigned Distance Fields for Implicit Function Learning
J. Chibane, A. Mir and G. Pons-Moll
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring
J. Dong, S. Roth and B. Schiele
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
Attribute Prototype Network for Zero-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera
D. Tome, T. Alldieck, P. Peluse, G. Pons-Moll, L. Agapito, H. Badino and F. de la Torre
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020
Learning Robust Representations via Multi-View Information Bottleneck
M. Federici, A. Dutta, P. Forré, N. Kushman and Z. Akata
International Conference on Learning Representations (ICLR 2020), 2020
Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks
T. Orekondy, B. Schiele and M. Fritz
International Conference on Learning Representations (ICLR 2020), 2020
Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval
A. Dutta and Z. Akata
International Journal of Computer Vision, Volume 128, 2020
Diverse and Relevant Visual Storytelling with Scene Graph Embeddings
X. Hong, R. Shetty, A. Sayeed, K. Mehra, V. Demberg and B. Schiele
Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL 2020), 2020
Lifted Disjoint Paths with Application in Multiple Object Tracking
A. Horňáková, R. Henschel, B. Rosenhahn and P. Swoboda
Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020
Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks
D. Stutz, M. Hein and B. Schiele
Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020
A Primal-Dual Solver for Large-Scale Tracking-by-Assignment
S. Haller, M. Prakash, L. Hutschenreiter, T. Pietzsch, C. Rother, F. Jug, P. Swoboda and B. Savchynskyy
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020
Haar Wavelet based Block Autoregressive Flows for Trajectories
A. Bhattacharyya, C.-N. Straehle, M. Fritz and B. Schiele
Technical Report, 2020 (arXiv: 2009.09878)
Abstract
Prediction of trajectories such as that of pedestrians is crucial to the
performance of autonomous agents. While previous works have leveraged
conditional generative models like GANs and VAEs for learning the likely future
trajectories, accurately modeling the dependency structure of these multimodal
distributions, particularly over long time horizons, remains challenging.
Normalizing-flow-based generative models can model complex distributions while
admitting exact inference. These include variants with split coupling
invertible transformations that are easier to parallelize compared to their
autoregressive counterparts. To this end, we introduce a novel Haar wavelet
based block autoregressive model leveraging split couplings, conditioned on
coarse trajectories obtained from Haar wavelet based transformations at
different levels of granularity. This yields an exact inference method that
models trajectories at different spatio-temporal resolutions in a hierarchical
manner. We illustrate the advantages of our approach for generating diverse and
accurate trajectories on two real-world datasets - Stanford Drone and
Intersection Drone.
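The coarse-to-fine conditioning described in the abstract rests on Haar wavelet approximations of a trajectory. As a rough illustration (not the authors' code), the sketch below computes coarse trajectories at several granularity levels by averaging consecutive waypoint pairs; the function name and the power-of-two length assumption are ours.

```python
import numpy as np

def haar_coarsen(trajectory, levels):
    """Return coarse versions of a trajectory at several Haar granularity levels.

    trajectory: (T, 2) array of x/y waypoints; T is assumed to be a power of two.
    The level-k coarse trajectory averages 2**k consecutive waypoints, i.e. the
    Haar scaling (approximation) coefficients up to normalization.
    """
    coarse = []
    current = np.asarray(trajectory, dtype=float)
    for _ in range(levels):
        # Haar approximation step: average each non-overlapping pair of waypoints.
        current = 0.5 * (current[0::2] + current[1::2])
        coarse.append(current)
    return coarse

# Example: an 8-step trajectory coarsened to 4-step and 2-step versions.
traj = np.cumsum(np.random.randn(8, 2), axis=0)
coarse_trajs = haar_coarsen(traj, levels=2)
```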
PoseTrackReID: Dataset Description
A. Doering, D. Chen, S. Zhang, B. Schiele and J. Gall
Technical Report, 2020 (arXiv: 2011.06243)
Abstract
Current datasets for video-based person re-identification (re-ID) do not
include structural knowledge in the form of human pose annotations for the
persons of interest. Nonetheless, pose information is very helpful for
disentangling useful feature information from background or occlusion noise.
Real-world scenarios in particular, such as surveillance, contain many
occlusions caused by human crowds or obstacles. On the other hand, video-based person re-ID can
benefit other tasks such as multi-person pose tracking in terms of robust
feature matching. For that reason, we present PoseTrackReID, a large-scale
dataset for multi-person pose tracking and video-based person re-ID. With
PoseTrackReID, we want to bridge the gap between person re-ID and multi-person
pose tracking. Additionally, this dataset provides a good benchmark for current
state-of-the-art methods on multi-frame person re-ID.
Analyzing the Dependency of ConvNets on Spatial Information
Y. Fan, Y. Xian, M. M. Losch and B. Schiele
Technical Report, 2020 (arXiv: 2002.01827)
Abstract
Intuitively, image classification should profit from using spatial
information. Recent work, however, suggests that this might be overrated in
standard CNNs. In this paper, we are pushing the envelope and aim to further
investigate the reliance on spatial information. We propose spatial shuffling
and GAP+FC to destroy spatial information during both training and testing
phases. Interestingly, we observe that spatial information can be deleted from
later layers with only small performance drops, which indicates that spatial
information at later layers is not necessary for good performance. For example,
the test accuracy of VGG-16 drops by only 0.03% and 2.66% when spatial
information is completely removed from the last 30% and 53% of layers on
CIFAR100, respectively.
Evaluation on several object recognition datasets (CIFAR100, Small-ImageNet,
ImageNet) with a wide range of CNN architectures (VGG16, ResNet50, ResNet152)
shows an overall consistent pattern.
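As a rough sketch of what "spatial shuffling" of an intermediate feature map could look like (our own illustration under stated assumptions, not the paper's implementation), one can permute the H×W positions of a convolutional feature map while keeping channels aligned:

```python
import torch

def spatial_shuffle(features):
    """Randomly permute the spatial positions of a feature map.

    features: tensor of shape (N, C, H, W). Channels stay aligned; only the
    H*W positions are shuffled, destroying the spatial layout information.
    """
    n, c, h, w = features.shape
    flat = features.reshape(n, c, h * w)
    perm = torch.randperm(h * w, device=features.device)
    return flat[:, :, perm].reshape(n, c, h, w)
```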
Improved Methods and Analysis for Semantic Image Segmentation
Y. He
PhD Thesis, Universität des Saarlandes, 2020
Abstract
Modern deep learning has enabled remarkable progress in computer vision in recent years (Hinton and Salakhutdinov, 2006; Krizhevsky et al., 2012). As a fundamental task, semantic segmentation aims to predict class labels for each pixel of an image, which empowers machines' perception of the visual world. In spite of the recent successes of fully convolutional networks (Long et al., 2015), several challenges remain to be addressed. In this thesis, we focus on this topic, under different kinds of input formats and various types of scenes. Specifically, our study contains two aspects: (1) data-driven neural modules for improved performance, and (2) leveraging datasets to train systems with higher performance and better data privacy guarantees.
In the first part of this thesis, we improve semantic segmentation by designing new modules which are compatible with existing architectures. First, we develop a spatio-temporal data-driven pooling, which brings additional information about the data (i.e. superpixels) into neural networks, benefiting both the training of neural networks and the inference on novel data. We investigate our approach on RGB-D videos for segmenting indoor scenes, where depth provides complementary cues to color and our model performs particularly well. Second, we design learnable dilated convolutions, an extension of standard dilated convolutions, whose dilation factors (Yu and Koltun, 2016) need to be carefully determined by hand to obtain decent performance. We present a method to learn the dilation factors together with the filter weights of the convolutions, avoiding a complicated search over dilation factors. We conduct extensive studies on challenging street scenes, across various baselines of different complexity as well as several datasets at varying image resolutions.
In the second part, we investigate how to utilize expensive training data. First, we start from generative modelling and study network architectures and the learning pipeline for generating multiple examples. We aim to improve the diversity of generated examples while preserving comparable quality. Second, we develop a generative model for synthesizing features of a network. With a mixture of real images and synthetic features, we are able to train a segmentation model with better generalization capability. Our approach is evaluated on different scene parsing tasks to demonstrate the effectiveness of the proposed method. Finally, we study membership inference on the semantic segmentation task. We propose the first membership inference attack system against black-box semantic segmentation models, which tries to infer whether a data pair was used as training data or not. From our observations, information about the training data is indeed leaking. To mitigate the leakage, we leverage our synthetic features to perform prediction obfuscation, reducing the posterior distribution gap between the training and testing sets. Consequently, our study provides not only an approach for detecting illegal use of data, but also the foundations for a safer use of semantic segmentation models.
Multicut Optimization Guarantees & Geometry of Lifted Multicuts
J.-H. Lange
PhD Thesis, Universität des Saarlandes, 2020
Meta-Aggregating Networks for Class-Incremental Learning
Y. Liu, B. Schiele and Q. Sun
Technical Report, 2020 (arXiv: 2010.05063)
Abstract
Class-Incremental Learning (CIL) aims to learn a classification model with
the number of classes increasing phase-by-phase. The inherent problem in CIL is
the stability-plasticity dilemma between the learning of old and new classes,
i.e., high-plasticity models easily forget old classes while high-stability
models are weak at learning new classes. We alleviate this issue by proposing a
novel network architecture called Meta-Aggregating Networks (MANets) in which
we explicitly build two residual blocks at each residual level (taking ResNet
as the baseline architecture): a stable block and a plastic block. We aggregate
the output feature maps from these two blocks and then feed the results to the
next-level blocks. We meta-learn the aggregating weights in order to
dynamically optimize and balance between two types of blocks, i.e., between
stability and plasticity. We conduct extensive experiments on three CIL
benchmarks: CIFAR-100, ImageNet-Subset, and ImageNet, and show that many
existing CIL methods can be straightforwardly incorporated into the
architecture of MANets to boost their performance.
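The abstract describes pairing a stable and a plastic residual block at each level and mixing their outputs with meta-learned aggregation weights. The following is a minimal sketch of that aggregation idea as we read it; the class name, the single scalar weight, and the omission of the meta-learning loop are simplifications of ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AggregatedBlock(nn.Module):
    """Toy version of one MANets level: a stable and a plastic residual block
    whose outputs are mixed by a learnable (meta-learned) aggregation weight.
    Block internals and the meta-learning of alpha are omitted here."""

    def __init__(self, stable_block, plastic_block):
        super().__init__()
        self.stable = stable_block      # kept stable (e.g. frozen) for old classes
        self.plastic = plastic_block    # freely updated on new classes
        self.alpha = nn.Parameter(torch.tensor(0.5))  # aggregation weight

    def forward(self, x):
        # Aggregate the two feature maps before feeding the next-level blocks.
        return self.alpha * self.stable(x) + (1 - self.alpha) * self.plastic(x)
```

In practice the two sub-blocks would be copies of a ResNet residual block, and the aggregation weights would be optimized in an outer meta-learning loop to balance stability and plasticity.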
D-NeRF: Neural Radiance Fields for Dynamic Scenes
A. Pumarola, E. Corona, G. Pons-Moll and F. Moreno-Noguer
Technical Report, 2020 (arXiv: 2011.13961)
Abstract
Neural rendering techniques combining machine learning with geometric
reasoning have arisen as one of the most promising approaches for synthesizing
novel views of a scene from a sparse set of images. Among these, Neural
Radiance Fields (NeRF) stands out: it trains a deep network to map 5D input
coordinates (representing spatial location and viewing direction) to a volume
density and view-dependent emitted radiance. However, despite achieving an
unprecedented level of photorealism on the generated images, NeRF is only
applicable to static scenes, where the same spatial location can be queried
from different images. In this paper we introduce D-NeRF, a method that extends
neural radiance fields to a dynamic domain, making it possible to reconstruct
and render novel images of objects under rigid and non-rigid motion from a
single camera moving around the scene. For this purpose we consider time as an
additional input to the system and split the learning process into two main
stages: one that encodes the scene into a canonical space and another that maps
this canonical representation into the deformed scene at a particular time.
Both mappings are simultaneously learned using fully-connected networks. Once
the networks are trained, D-NeRF can render novel images, controlling both the
camera view and the time variable, and thus, the object movement. We
demonstrate the effectiveness of our approach on scenes with objects under
rigid, articulated and non-rigid motions. Code, model weights and the dynamic
scenes dataset will be released.
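A minimal sketch of the two-stage mapping described above, assuming a deformation network that maps position and time to an offset into a canonical space and a canonical network that maps the offset position and viewing direction to density and color. Positional encodings, volume rendering, and all training details are omitted; the class and helper names are ours.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, depth=4):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class TinyDNeRF(nn.Module):
    """Toy two-stage dynamic radiance field in the spirit of the abstract:
    a deformation net maps (x, t) to an offset into a canonical space, and a
    canonical net maps (canonical x, viewing direction d) to density and color."""

    def __init__(self):
        super().__init__()
        self.deform = mlp(3 + 1, 3)      # (x, t) -> delta_x
        self.canonical = mlp(3 + 3, 4)   # (x + delta_x, d) -> (sigma, rgb)

    def forward(self, x, d, t):
        delta = self.deform(torch.cat([x, t], dim=-1))
        out = self.canonical(torch.cat([x + delta, d], dim=-1))
        sigma, rgb = out[..., :1], torch.sigmoid(out[..., 1:])
        return sigma, rgb
```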
Adversarial Training against Location-Optimized Adversarial Patches
S. Rao, D. Stutz and B. Schiele
Technical Report, 2020 (arXiv: 2005.02313)
Abstract
Deep neural networks have been shown to be susceptible to adversarial
examples -- small, imperceptible changes constructed to cause
misclassification in otherwise highly accurate image classifiers. As a
practical alternative, recent work proposed so-called adversarial patches:
clearly visible, but adversarially crafted rectangular patches in images. These
patches can easily be printed and applied in the physical world. While defenses
against imperceptible adversarial examples have been studied extensively,
robustness against adversarial patches is poorly understood. In this work, we
first devise a practical approach to obtain adversarial patches while actively
optimizing their location within the image. Then, we apply adversarial training
on these location-optimized adversarial patches and demonstrate significantly
improved robustness on CIFAR10 and GTSRB. Additionally, in contrast to
adversarial training on imperceptible adversarial examples, our adversarial
patch training does not reduce accuracy.
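To make the idea of location-optimized patches concrete, here is a purely illustrative sketch (not the paper's procedure) of pasting a rectangular patch at a candidate position and keeping the position that most increases the classification loss; the helper names and the simple candidate search are assumptions of ours.

```python
import torch

def apply_patch(images, patch, row, col):
    """Paste a (C, h, w) patch onto a batch of (N, C, H, W) images at (row, col)."""
    out = images.clone()
    h, w = patch.shape[-2:]
    out[:, :, row:row + h, col:col + w] = patch
    return out

def best_patch_location(model, images, labels, patch, candidates, loss_fn):
    """Among candidate (row, col) positions, pick the one that most increases
    the loss -- a crude stand-in for the paper's location optimization."""
    best_loc, best_loss = None, -float("inf")
    for row, col in candidates:
        loss = loss_fn(model(apply_patch(images, patch, row, col)), labels).item()
        if loss > best_loss:
            best_loc, best_loss = (row, col), loss
    return best_loc
```

Adversarial training would then apply the patch at the selected worst-case location before the parameter update on each batch.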
Learning from Limited Labeled Data - Zero-Shot and Few-Shot Learning
Y. Xian
PhD Thesis, Universität des Saarlandes, 2020
Generalized Many-Way Few-Shot Video Classification
Y. Xian, B. Korbar, M. Douze, B. Schiele, Z. Akata and L. Torresani
Technical Report, 2020 (arXiv: 2007.04755)
Abstract
Few-shot learning methods operate in low data regimes. The aim is to learn
with few training examples per class. Although significant progress has been
made in few-shot image classification, few-shot video recognition is relatively
unexplored and methods based on 2D CNNs are unable to learn temporal
information. In this work we thus develop a simple 3D CNN baseline, surpassing
existing methods by a large margin. To circumvent the need for labeled examples,
we propose to leverage weakly-labeled videos from a large dataset using tag
retrieval followed by selecting the best clips with visual similarities,
yielding further improvement. Our results saturate current 5-way benchmarks for
few-shot video classification and therefore we propose a new challenging
benchmark involving more classes and a mixture of classes with varying
supervision.
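The "selecting the best clips with visual similarities" step could, for instance, be realized by ranking tag-retrieved candidate clips against a class prototype embedding. The sketch below is our own illustration of that idea, not the authors' pipeline; the function name and the cosine-similarity choice are assumptions.

```python
import torch
import torch.nn.functional as F

def select_best_clips(class_prototype, clip_embeddings, k=5):
    """Rank tag-retrieved candidate clips by cosine similarity to a class
    prototype and keep the top-k.

    class_prototype: (D,) mean embedding of the few labeled support clips.
    clip_embeddings: (M, D) embeddings of candidate clips found by tag retrieval.
    """
    sims = F.cosine_similarity(clip_embeddings, class_prototype.unsqueeze(0), dim=-1)
    top = torch.topk(sims, k=min(k, clip_embeddings.shape[0]))
    return top.indices, top.values
```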
2019