D2
Computer Vision and Machine Learning
2022
BEHAVE: Dataset and Method for Tracking Human Object Interactions
B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
B-cos Networks: Alignment is All We Need for Interpretability
M. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation
S. Cai, A. Obukhov, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Decoupling Zero-Shot Semantic Segmentation
J. Ding, N. Xue, G.-S. Xia and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2112.07910, Accepted/in press)
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, the pixel-level formulation is poorly suited to integrating vision-language models, which are typically pre-trained on image-text pairs and currently show great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on those segments. The former sub-task involves no category information and transfers directly to grouping pixels of unseen classes. The latter operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained on image-text pairs (e.g., CLIP) for ZS3. Based on this decoupled formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 35 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
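To make the decoupling concrete, here is a minimal sketch of the two-stage inference, assuming a hypothetical set of class-agnostic segment masks as input (in ZegFormer these come from a learned grouping model) and using OpenAI's public CLIP package for the segment-level zero-shot classification; the actual ZegFormer architecture differs, see the linked repository.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

def zero_shot_segment_labels(image, masks, class_names, device="cuda"):
    """Stage 2 of the decoupling: zero-shot classification of segments.

    image: a PIL.Image; masks: (num_segments, H, W) binary torch tensors
    produced by any class-agnostic grouping model (Stage 1, hypothetical here).
    """
    model, preprocess = clip.load("ViT-B/32", device=device)
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        labels = []
        for m in masks:
            # Crop the image to the segment's bounding box and embed it.
            ys, xs = m.nonzero(as_tuple=True)
            crop = image.crop((xs.min().item(), ys.min().item(),
                               xs.max().item() + 1, ys.max().item() + 1))
            img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            labels.append((img_feat @ text_feat.T).argmax(dim=-1).item())
    return labels  # one class index per segment; classes may be unseen
```

Because Stage 1 never sees class labels, the same grouping model applies unchanged to unseen categories; only the text prompts in Stage 2 change.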
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
Y. Fan, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2112.04564, Accepted/in press)
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced semi-supervised learning (SSL). To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which limits its practicality in real-world scenarios; we therefore conduct a comprehensive evaluation under a variety of shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10 and CIFAR-100 to ImageNet and Food-101. Our code will be made publicly available.
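The core decoupling can be illustrated with a generic stop-gradient setup. This is only a sketch of the decoupling idea under assumed module names, not the full CoSSL recipe (which additionally applies Tail-class Feature Enhancement to the classifier branch).

```python
import torch
import torch.nn as nn

# Toy encoder and classifier; CoSSL's actual networks are standard SSL backbones.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(32, 10)
opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.03)
opt_cls = torch.optim.SGD(classifier.parameters(), lr=0.03)

def training_step(x, y, repr_loss_fn):
    feats = encoder(x)
    # Representation branch: any semi-supervised objective (placeholder here).
    loss_repr = repr_loss_fn(feats, y)
    opt_enc.zero_grad(); loss_repr.backward(); opt_enc.step()
    # Classifier branch: trained on detached features, so classifier learning
    # does not interfere with representation learning, and vice versa.
    logits = classifier(feats.detach())
    loss_cls = nn.functional.cross_entropy(logits, y)
    opt_cls.zero_grad(); loss_cls.backward(); opt_cls.step()
```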
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2111.14887, Accepted/in press)
Abstract
As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies that stabilize training and avoid overfitting to the source domain: Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, while the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state of the art by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer.
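Of the three strategies, Rare Class Sampling is the easiest to sketch in isolation: source images are drawn with a probability that is higher for images containing rare classes, which keeps self-training pseudo-labels from collapsing onto common classes. The temperature-softened weighting below is one plausible instantiation under assumed names, not necessarily the exact DAFormer formula.

```python
import numpy as np

def rare_class_sampling_probs(class_pixel_counts, temperature=0.01):
    """Turn per-class pixel counts into a class sampling distribution."""
    freq = class_pixel_counts / class_pixel_counts.sum()  # per-class frequency
    weights = np.exp((1.0 - freq) / temperature)          # rarer => larger weight
    return weights / weights.sum()

def sample_source_image(images_with_class, class_probs, rng=np.random):
    """Pick a class by rarity, then a source image containing that class."""
    c = rng.choice(len(class_probs), p=class_probs)
    return rng.choice(images_with_class[c])
```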
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding
X. Ma, Z. Wang, Y. Zhan, Y. Zheng, Z. Wang, D. Dai and C.-W. Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Abstract
Although considerable progress has been made in semantic scene understanding under clear weather, it remains a hard problem under adverse weather conditions, such as dense fog, due to the uncertainty caused by imperfect observations. Moreover, difficulties in collecting and labeling foggy images hinder progress in this field. Given the success of semantic scene understanding under clear weather, we argue it is reasonable to transfer knowledge learned from clear images to the foggy domain; the problem then becomes bridging the domain gap between clear and foggy images. Unlike previous methods that mainly focus on closing the fog-induced gap -- defogging the foggy images or fogging the clear ones -- we propose to alleviate the domain gap by considering fog influence and style variation simultaneously. The motivation is our finding that the style-related gap and the fog-related gap can be separated and closed individually by adding an intermediate domain. We thus propose a new pipeline that cumulatively adapts style, fog, and the dual factor (style and fog). Specifically, we devise a unified framework that disentangles the style factor and the fog factor separately, and then the dual factor, from images in different domains, and we couple the disentanglement of the three factors with a novel cumulative loss so that they are disentangled thoroughly. Our method achieves state-of-the-art performance on three benchmarks and shows generalization ability in rainy and snowy scenes.
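The staged pipeline can be summarized schematically. The stub below only fixes the order of the three adaptation stages described above; all names are hypothetical placeholders, not the authors' interfaces.

```python
def adapt(model, source, target, factor):
    """Placeholder for one adaptation stage aligning source to target with
    respect to a single factor: 'style', 'fog', or the dual factor."""
    print(f"adapting {factor}: {source} -> {target}")
    return model

def cumulative_adapt(model):
    # Stage 1: close the style-related gap, clear source -> intermediate domain.
    model = adapt(model, "clear_source", "intermediate", factor="style")
    # Stage 2: close the fog-related gap, intermediate -> real foggy target.
    model = adapt(model, "intermediate", "real_fog", factor="fog")
    # Stage 3: adapt the dual factor, source -> target directly.
    model = adapt(model, "clear_source", "real_fog", factor="style+fog")
    return model
```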
Towards Better Understanding Attribution Methods
S. Rao, M. Böhle and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Sound and Visual Representation Learning with Multiple Pretraining Tasks
A. B. Vasudevan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2201.01046, Accepted/in press)
Abstract
Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) so that the resulting representation generalizes well across downstream tasks. Specifically, we investigate binaural sounds and image data separately. For binaural sounds, we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms both single-SSL-task models and fully supervised models in downstream task performance. As a check of applicability to another modality, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL, and DetCo by 2.06%, 3.27%, and 1.19% on VOC07 classification and by +2.83, +1.56, and +1.61 AP on COCO detection. Code will be made publicly available.
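One simple reading of Multi-SSL via incremental learning is a staged schedule in which SSL objectives are switched on one at a time. The sketch below keeps earlier losses active in later stages, which is an assumption of ours rather than the authors' exact IL scheme (which might, e.g., use distillation instead).

```python
import torch

def train_incremental(encoder, ssl_losses, loader, epochs_per_stage, lr=1e-3):
    """ssl_losses: ordered list of callables (encoder, batch) -> scalar loss,
    e.g. [spatial_alignment, temporal_sync, gap_prediction] for binaural audio."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    active = []
    for loss_fn in ssl_losses:              # introduce one SSL task per stage
        active.append(loss_fn)
        for _ in range(epochs_per_stage):
            for batch in loader:
                loss = sum(f(encoder, batch) for f in active)
                opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```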
Adiabatic Quantum Computing for Multi Object Tracking
J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2202.08837, Accepted/in press)
Abstract
Multi-Object Tracking (MOT) is most often approached in the tracking-by-detection paradigm, where object detections are associated through time. The association step naturally leads to discrete optimization problems. As these optimization problems are often NP-hard, they can only be solved exactly for small instances on current hardware. Adiabatic quantum computing (AQC) offers a way forward, as it has the potential to provide a considerable speedup on a range of NP-hard optimization problems in the near future. However, current MOT formulations are unsuitable for quantum computing due to their scaling properties. In this work, we therefore propose the first MOT formulation designed to be solved with AQC. We employ an Ising model that represents the quantum mechanical system implemented on the AQC. We show that our approach is competitive with state-of-the-art optimization-based approaches, even when they use off-the-shelf integer programming solvers. Finally, we demonstrate that our MOT problem is already solvable on the current generation of real quantum computers for small examples, and we analyze the properties of the measured solutions.
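To see why AQC fits the association step, note that frame-to-frame association can be written as a quadratic unconstrained binary optimization (QUBO), the input format of an adiabatic quantum annealer. The construction below is a generic bipartite-matching QUBO of our own, not the paper's scaling-friendly formulation: each binary variable encodes one candidate match, and quadratic penalties suppress pairs of matches that share a detection.

```python
import numpy as np
from itertools import product

def association_qubo(cost):
    """cost[i, j]: association cost of detection i (frame t) with detection j
    (frame t+1); negative values mark likely matches, so matching is rewarded."""
    n, m = cost.shape
    penalty = np.abs(cost).max() * 2 + 1.0          # dominates any single gain
    Q = np.zeros((n * m, n * m))
    var = {(i, j): i * m + j for i, j in product(range(n), range(m))}
    for (i, j), a in var.items():
        Q[a, a] = cost[i, j]
        for (k, l), b in var.items():
            if a < b and (i == k or j == l):        # two matches share a node
                Q[a, b] = penalty
    return Q  # the annealer minimizes x^T Q x over x in {0, 1}^(n*m)
```

A QUBO of this form maps directly onto an Ising model via the change of variables s = 2x - 1.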
Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform
S. Li, X. Chen, Y. Liu, D. Dai, C. Stachniss and J. Gall
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Improving Depth Estimation Using Map-Based Depth Priors
V. Patil, A. Liniger, D. Dai and L. Van Gool
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization
N. Vödisch, O. Unal, K. Li, L. Van Gool and D. Dai
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Learnable Online Graph Representations for 3D Multi-Object Tracking
J.-N. Zaech, D. Dai, A. Liniger, M. Danelljan and L. Van Gool
IEEE Robotics and Automation Letters, 2022
Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
D. Dai, A. B. Vasudevan, J. Matas and L. Van Gool
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation
E. Levinkov, A. Kardoost, B. Andres and M. Keuper
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Abstract
The minimum cost lifted multicut problem is a generalization of the multicut problem and a means of optimizing a decomposition of a graph w.r.t. both positive and negative edge costs. Its main advantage is that multicut-based formulations do not require the number of components to be given a priori; instead, it is deduced from the solution. However, the standard multicut cost function is limited to pairwise relationships between nodes, while several important applications either require or can benefit from a higher-order cost function, i.e. hyper-edges. In this paper, we propose a pseudo-boolean formulation for a multiple-model fitting problem. It is based on a formulation of any-order minimum cost lifted multicuts, which allows partitioning an undirected graph with pairwise connectivity so as to minimize costs defined over any set of hyper-edges. As the proposed formulation is NP-hard and the branch-and-bound algorithm is too slow in practice, we propose an efficient local search algorithm for inference in the resulting problems. We demonstrate the versatility and effectiveness of our approach in several applications: geometric multiple model fitting, homography and motion estimation, and motion segmentation.
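In symbols, and in our own notation (the paper's exact formulation may differ): with binary cut indicators y_e on the lifted edges E' of a graph G' = (V, E') over the original graph G = (V, E), and with a collection of hyper-edges (sets of lifted edges), the any-order objective adds pseudo-boolean monomials to the standard pairwise lifted multicut cost:

```latex
\min_{y \in Y_{G,G'}} \;\; \sum_{e \in E'} c_e \, y_e \;+\; \sum_{S \in \mathcal{H}} c_S \prod_{e \in S} y_e ,
```

where Y_{G,G'} \subseteq \{0,1\}^{E'} is the set of edge labelings induced by decompositions of G, so the number of components is still determined by the solution rather than fixed in advance.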
Meta-Transfer Learning through Hard Tasks
Q. Sun, Y. Liu, Z. Chen, T.-S. Chua and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 3, 2022
ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks
D. H. M. Nguyen, D. M. Nguyen, T. T. N. Mai, T. Nguyen, K. T. Tran, A. T. Nguyen, B. T. Pham and B. T. Nguyen
Information Sciences, Volume 591, 2022
Attribute Prototype Network for Any-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
International Journal of Computer Vision, 2022
DPER: Direct Parameter Estimation for Randomly Missing Data
T. T. Nguyen, K. M. Nguyen-Duy, D. H. M. Nguyen, B. T. Nguyen and B. A. Wade
Knowledge-Based Systems, Volume 240, 2022
TATL: Task Agnostic Transfer Learning for Skin Attributes Detection
D. H. M. Nguyen, T. T. Nguyen, H. Vu, Q. Pham, B. T. Nguyen, D. Sonntag and M.-D. Nguyen
Medical Image Analysis, Volume 78, 2022
2021
(SP)²Net for Generalized Zero-Label Semantic Segmentation
A. Das, Y. Xian, Y. He, B. Schiele and Z. Akata
Pattern Recognition (GCPR 2021), 2021