2022
B-cos Networks: Alignment is All We Need for Interpretability
M. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (Accepted/in press)
Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation
S. Cai, A. Obukhov, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (Accepted/in press)
Decoupling Zero-Shot Semantic Segmentation
J. Ding, N. Xue, G.-S. Xia and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (arXiv: 2112.07910, Accepted/in press)
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that
have not been seen during training. Existing works formulate ZS3 as a
pixel-level zero-shot classification problem and transfer semantic knowledge
from seen classes to unseen ones with the help of language models pre-trained
only on texts. While simple, the pixel-level ZS3 formulation shows limited
capability to integrate vision-language models that are often pre-trained with
image-text pairs and currently demonstrate great potential for vision tasks.
Inspired by the observation that humans often perform segment-level semantic
labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic
grouping task that groups pixels into segments, and 2) a zero-shot
classification task on those segments. The former sub-task does not involve
category information and can be directly transferred to group pixels of unseen
classes. The latter sub-task operates at the segment level and provides a
natural way to leverage large-scale vision-language models pre-trained with
image-text pairs (e.g. CLIP) for ZS3. Based on this decoupled formulation, we
propose a simple and effective zero-shot semantic segmentation model, called
ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by
large margins, e.g., by 35 points on PASCAL VOC and 3 points on COCO-Stuff in
terms of mIoU for unseen classes. Code will be released at
https://github.com/dingjiansw101/ZegFormer.
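As an illustration of the decoupled formulation, the following minimal Python
sketch scores each class-agnostic segment against class-name prompts; it
assumes CLIP-style segment and text embeddings, the names are hypothetical, and
it is not the authors' released code.

import numpy as np

def classify_segments(segment_embeddings, text_embeddings, temperature=0.07):
    # segment_embeddings: (S, D) array, one embedding per class-agnostic segment.
    # text_embeddings:    (C, D) array, one embedding per class-name prompt
    #                     (covering seen and unseen classes alike).
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = seg @ txt.T / temperature           # cosine similarities, temperature-scaled
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability for the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)  # (S, C) per-segment class probabilities

The per-segment prediction can then be projected back onto the pixels of each
segment to obtain the final semantic map.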
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
Y. Fan, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (arXiv: 2112.04564, Accepted/in press)
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with decoupled
representation learning and classifier learning for imbalanced semi-supervised
learning (SSL). To handle the data imbalance, we devise Tail-class Feature
Enhancement (TFE) for classifier learning. Furthermore, the current evaluation
protocol for imbalanced SSL focuses only on balanced test sets, which has
limited practicality in real-world scenarios. We therefore further conduct a
comprehensive evaluation under various shifted test distributions. In
experiments, we show that our approach outperforms other methods over a large
range of shifted distributions, achieving state-of-the-art performance on
benchmark datasets including CIFAR-10, CIFAR-100, ImageNet, and Food-101. Our
code will be made publicly available.
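As a rough, assumed reading of feature-level enhancement (a sketch only, not
the paper's implementation), tail-class features can be diversified by blending
them with features from the abundant unlabeled pool before classifier training:

import numpy as np

rng = np.random.default_rng(0)

def enhance_tail_features(tail_feats, unlabeled_feats, alpha=0.9):
    # tail_feats:      (N, D) features of an under-represented class (labels are kept).
    # unlabeled_feats: (M, D) features drawn from the unlabeled set.
    # alpha: minimum weight kept on the original tail-class feature.
    idx = rng.integers(0, len(unlabeled_feats), size=len(tail_feats))
    lam = rng.uniform(alpha, 1.0, size=(len(tail_feats), 1))
    return lam * tail_feats + (1.0 - lam) * unlabeled_feats[idx]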
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (arXiv: 2111.14887, Accepted/in press)
Abstract
As acquiring pixel-wise annotations of real-world images for semantic
segmentation is a costly process, a model can instead be trained with more
accessible synthetic data and adapted to real images without requiring their
annotations. This process is studied in unsupervised domain adaptation (UDA).
Even though a large number of methods propose new adaptation strategies, they
are mostly based on outdated network architectures. As the influence of recent
network architectures has not been systematically studied, we first benchmark
different network architectures for UDA and then propose a novel UDA method,
DAFormer, based on the benchmark results. The DAFormer network consists of a
Transformer encoder and a multi-level context-aware feature fusion decoder. It
is enabled by three simple but crucial training strategies that stabilize
training and avoid overfitting DAFormer to the source domain: Rare Class
Sampling on the source domain improves the quality of the pseudo-labels by
mitigating the confirmation bias of self-training towards common classes, while
the Thing-Class ImageNet Feature Distance and a learning-rate warmup promote
feature transfer from ImageNet pretraining. DAFormer significantly improves
the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU
for Synthia->Cityscapes and enables even difficult classes such as train, bus,
and truck to be learned well. The implementation is available at
https://github.com/lhoyer/DAFormer.
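The rare-class-sampling idea can be sketched as follows (illustrative only; the
weighting and hyper-parameters below are assumptions, see the linked repository
for the actual implementation): source images containing infrequent classes are
drawn more often, which yields more balanced pseudo-labels.

import numpy as np

rng = np.random.default_rng(0)

def sample_source_image(class_freq, images_per_class, temperature=0.1):
    # class_freq: (C,) relative pixel frequency of each class in the source labels.
    # images_per_class: dict mapping a class id to the source image ids containing it.
    scores = (1.0 - class_freq) / temperature    # rarer class -> higher score
    scores -= scores.max()                       # numerical stability before exponentiation
    probs = np.exp(scores) / np.exp(scores).sum()
    c = rng.choice(len(class_freq), p=probs)     # sample a class, biased towards rare ones
    return rng.choice(images_per_class[c])       # sample a source image containing that class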
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding
X. Ma, Z. Wang, Y. Zhan, Y. Zheng, Z. Wang, D. Dai and C.-W. Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (Accepted/in press)
Abstract
Although considerable progress has been made in semantic scene understanding
under clear weather, it remains a tough problem under adverse weather
conditions, such as dense fog, due to the uncertainty caused by imperfect
observations. Moreover, difficulties in collecting and labeling foggy images
hinder progress in this field. Considering the success of semantic scene
understanding under clear weather, we think it is reasonable to transfer
knowledge learned from clear images to the foggy domain. The problem then
becomes bridging the domain gap between clear and foggy images. Unlike previous
methods that mainly focus on closing the domain gap caused by fog, i.e.,
defogging the foggy images or fogging the clear images, we propose to alleviate
the domain gap by considering fog influence and style variation simultaneously.
The motivation is our finding that the style-related gap and the fog-related
gap can be separated and closed individually by adding an intermediate domain.
Thus, we propose a new pipeline that cumulatively adapts style, fog, and the
dual factor (style and fog). Specifically, we devise a unified framework that
disentangles first the style factor and the fog factor separately, and then the
dual factor, from images in different domains. Furthermore, we couple the
disentanglement of the three factors with a novel cumulative loss to
disentangle them thoroughly. Our method achieves state-of-the-art performance
on three benchmarks and shows generalization ability in rainy and snowy scenes.
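Schematically, the cumulative loss can be read as accumulating one term per
disentangled factor (the weights and the form of the individual terms below are
assumptions, not the paper's formulation):

def cumulative_adaptation_loss(loss_style, loss_fog, loss_dual,
                               w_style=1.0, w_fog=1.0, w_dual=1.0):
    # Each argument is the scalar adaptation loss already computed for one factor
    # (style, fog, or the combined dual factor); the cumulative objective simply
    # accumulates all three so that no factor is left entangled.
    return w_style * loss_style + w_fog * loss_fog + w_dual * loss_dual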
Towards Better Understanding Attribution Methods
S. Rao, M. Böhle and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (Accepted/in press)
Sound and Visual Representation Learning with Multiple Pretraining Tasks
A. B. Vasudevan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (arXiv: 2201.01046, Accepted/in press)
Abstract
Different self-supervised learning (SSL) tasks reveal different features of
the data. The learned feature representations can therefore perform differently
on each downstream task. In this light, this work aims to combine multiple SSL
tasks (Multi-SSL) such that the learned representation generalizes well across
downstream tasks. Specifically, for this study, we investigate binaural sounds
and image data in isolation. For binaural sounds, we propose three SSL tasks,
namely spatial alignment, temporal synchronization of foreground objects and
binaural audio, and temporal gap prediction. We investigate several approaches
to Multi-SSL and give insights into the downstream task performance on video
retrieval, spatial sound super-resolution, and semantic prediction on the
OmniAudio dataset. Our experiments on binaural sound representations
demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks
outperforms single SSL task models and fully supervised models in downstream
task performance. To check applicability to another modality, we also formulate
our Multi-SSL models for image representation learning, using the recently
proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods
such as MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07
classification and by +2.83, +1.56 and +1.61 AP on COCO detection. Code will be
made publicly available.
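One way incremental learning over SSL tasks can be sketched (the
feature-distillation term and all names below are assumptions, not the authors'
implementation) is to learn the tasks one after another while keeping a frozen
snapshot of the encoder from the previous task:

import torch
import torch.nn.functional as F

def incremental_ssl_step(encoder, prev_encoder, batch, task_loss_fn,
                         optimizer, distill_weight=1.0):
    # One update on the current SSL task. prev_encoder is a frozen copy of the
    # encoder after the previous task (None for the first task) and serves as a
    # feature-distillation target that discourages forgetting.
    optimizer.zero_grad()
    loss = task_loss_fn(encoder, batch)          # e.g. a temporal-gap prediction loss
    if prev_encoder is not None:
        with torch.no_grad():
            target = prev_encoder(batch)
        loss = loss + distill_weight * F.mse_loss(encoder(batch), target)
    loss.backward()
    optimizer.step()
    return loss.item()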
Adiabatic Quantum Computing for Multi Object Tracking
J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022 (arXiv: 2202.08837, Accepted/in press)
Abstract
Multi-Object Tracking (MOT) is most often approached in the
tracking-by-detection paradigm, where object detections are associated through
time. The association step naturally leads to discrete optimization problems.
As these optimization problems are often NP-hard, they can only be solved
exactly for small instances on current hardware. Adiabatic quantum computing
(AQC) offers a solution for this, as it has the potential to provide a
considerable speedup on a range of NP-hard optimization problems in the near
future. However, current MOT formulations are unsuitable for quantum computing
due to their scaling properties. In this work, we therefore propose the first
MOT formulation designed to be solved with AQC. We employ an Ising model that
represents the quantum mechanical system implemented on the AQC. We show that
our approach is competitive compared with state-of-the-art optimization-based
approaches, even when using off-the-shelf integer programming solvers. Finally,
we demonstrate that our MOT problem is already solvable on the current
generation of real quantum computers for small examples, and analyze the
properties of the measured solutions.
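A toy sketch of the kind of quadratic binary formulation involved (an
assumption about the general shape, not the paper's exact Ising model):
assignment variables link detections to tracks, pairwise penalties forbid
conflicting assignments, and a tiny instance is solved by brute force; on
quantum hardware the same matrix would be minimized by the annealer instead.

import itertools
import numpy as np

def build_qubo(cost, penalty=10.0):
    # cost: (D, T) matching costs, negative for good matches (e.g. -appearance similarity).
    # Binary variable x[i*T + j] = 1 assigns detection i to track j.
    D, T = cost.shape
    Q = np.zeros((D * T, D * T))
    idx = lambda i, j: i * T + j
    for i in range(D):
        for j in range(T):
            Q[idx(i, j), idx(i, j)] += cost[i, j]
            for j2 in range(j + 1, T):             # detection i assigned to two tracks
                Q[idx(i, j), idx(i, j2)] += penalty
    for j in range(T):
        for i in range(D):
            for i2 in range(i + 1, D):             # two detections assigned to track j
                Q[idx(i, j), idx(i2, j)] += penalty
    return Q

def solve_brute_force(Q):
    # Exhaustive search over all binary vectors; only feasible for tiny instances.
    n = Q.shape[0]
    best = min(itertools.product([0, 1], repeat=n),
               key=lambda x: np.asarray(x) @ Q @ np.asarray(x))
    return np.asarray(best)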
Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation
E. Levinkov, A. Kardoost, B. Andres and M. Keuper
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Abstract
The minimum cost lifted multicut problem is a generalization of the multicut
problem and a means of optimizing a decomposition of a graph w.r.t. both
positive and negative edge costs. Its main advantage is that multicut-based
formulations do not require the number of components to be given a priori;
instead, it is deduced from the solution. However, the standard multicut cost
function is limited to pairwise relationships between nodes, while several
important applications either require or can benefit from a higher-order cost
function, i.e. hyper-edges. In this paper, we propose a pseudo-boolean
formulation for a multiple model fitting problem. It is based on a formulation
of any-order minimum cost lifted multicuts, which allows an undirected graph
with pairwise connectivity to be partitioned so as to minimize costs defined
over any set of hyper-edges. As the proposed formulation is NP-hard and
branch-and-bound is too slow in practice, we propose an efficient local search
algorithm for inference in the resulting problems. We demonstrate the
versatility and effectiveness of our approach in several applications:
geometric multiple model fitting, homography and motion estimation, and motion
segmentation.
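For orientation, one schematic way to write such an objective (the notation and
the product form of the higher-order terms are assumptions; the paper's exact
feasible set and cost functions may differ) is, in LaTeX,

\min_{y \in Y_{G,G'}} \; \sum_{e \in E \cup E'} c_e \, y_e \;+\; \sum_{h \in H} c_h \prod_{e \in h} y_e ,

where y_e \in \{0,1\} indicates whether edge e is cut, E' is the set of lifted
edges, H is a set of hyper-edges over E \cup E', and Y_{G,G'} is the set of
feasible lifted multicut labelings of graph G with lifted edges E'.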
Attribute Prototype Network for Any-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
International Journal of Computer Vision, 2022
2021
(SP)2Net for Generalized Zero-Label Semantic Segmentation
A. Das, Y. Xian, Y. He, B. Schiele and Z. Akata
Pattern Recognition (GCPR 2021), 2021