D2
Computer Vision and Machine Learning
2022
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval
A. Chaudhuri, M. Mancini, Y. Chen, Z. Akata and A. Dutta
33rd British Machine Vision Conference (BMVC 2022), 2022
Distilling Knowledge from Self-Supervised Teacher by Embedding Graph Alignment
Y. Ma, Y. Chen and Z. Akata
33rd British Machine Vision Conference (BMVC 2022), 2022
SP-ViT: Learning 2D Spatial Priors for Vision Transformers
Y. Zhou, W. Xiang, C. Li, B. Wang, X. Wei, L. Zhang, M. Keuper and X. Hua
33rd British Machine Vision Conference (BMVC 2022), 2022
PlanT: Explainable Planning Transformers via Object-Level Representations
K. Renz, K. Chitta, O.-B. Mercea, A. S. Koepke, Z. Akata and A. Geiger
6th Annual Conference on Robot Learning (CoRL 2022), 2022
Abstract
Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.
Relational Proxies: Emergent Relationships as Fine-Grained Discriminators
A. Chaudhuri, M. Mancini, Z. Akata and A. Dutta
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
Robust Models are less Over-Confident
J. Grabinski, P. Gavrikov, J. Keuper and M. Keuper
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
Trading off Image Quality for Robustness is not Necessary with Regularized Deterministic Autoencoders
A. Saseendran, K. Skubch and M. Keuper
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
Motion Transformer with Global Intention Localization and Local Movement Refinement
S. Shi, L. Jiang, D. Dai and B. Schiele
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds
H. Wang, L. Ding, S. Dong, S. Shi, A. Li, J. Li, Z. Li and L. Wang
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
USB: A Unified Semi-supervised Learning Benchmark for Classification
Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo, H. Qi, Z. Wu, Y.-F. Li, S. Nakamura, W. Ye, M. Savvides, B. Raj, T. Shinozaki, B. Schiele, J. Wang, X. Xie and Y. Zhang
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
Towards Efficient 3D Object Detection with Knowledge Distillation
J. Yang, S. Shi, R. Ding, Z. Wang and X. Qi
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
Abstracting Sketches Through Simple Primitives
S. Alaniz, M. Mancini, A. Dutta, D. Marcos and Z. Akata
Computer Vision -- ECCV 2022, 2022
MPPNet: Multi-frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection
X. Chen, S. Shi, B. Zhu, K. C. Cheung, H. Xu and H. Li
Computer Vision -- ECCV 2022, 2022
Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation using Bounding Boxes
J. Chibane, F. Engelmann, A. T. Tran and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
Learned Vertex Descent: A New Direction for 3D Human Model Fitting
E. Corona, G. Pons-Moll, G. Alenyà and F. Moreno-Noguer
Computer Vision -- ECCV 2022, 2022
TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation
R. Gong, M. Danelljan, D. Dai, D. P. Paudel, A. Chhatkuli, F. Yu and L. Van Gool
Computer Vision -- ECCV 2022, 2022
Class-Agnostic Object Counting Robust to Intraclass Diversity
S. Gong, S. Zhang, J. Yang, D. Dai and B. Schiele
Computer Vision -- ECCV 2022, 2022
FrequencyLowCut Pooling - Plug & Play against Catastrophic Overfitting
J. Grabinski, S. Jung, J. Keuper and M. Keuper
Computer Vision -- ECCV 2022, 2022
Improving Robustness by Enhancing Weak Subnets
Y. Guo, D. Stutz and B. Schiele
Computer Vision -- ECCV 2022, 2022
A Comparative Study of Graph Matching Algorithms in Computer Vision
S. Haller, L. Feineis, L. Hutschenreiter, F. Bernard, C. Rother, D. Kainmüller, P. Swoboda and B. Savchynskyy
Computer Vision -- ECCV 2022, 2022
HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
Computer Vision -- ECCV 2022, 2022
A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning
M. Kirchhof, K. Roth, Z. Akata and E. Kasneci
Computer Vision -- ECCV 2022, 2022
Skeleton-Free Pose Transfer for Stylized 3D Characters
Z. Liao, J. Yang, J. Saito, G. Pons-Moll and Y. Zhou
Computer Vision -- ECCV 2022, 2022
CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video
W. Lin, A. Kukleva, K. Sun, H. Possegger, H. Kuehne and H. Bischof
Computer Vision -- ECCV 2022, 2022
Learning Where To Look - Generative NAS is Surprisingly Efficient
J. Lukasik, S. Jung and M. Keuper
Computer Vision -- ECCV 2022, 2022
Temporal and Cross-modal Attention for Audio-Visual Zero-Shot Learning
O.-B. Mercea, T. Hummel, A. S. Koepke and Z. Akata
Computer Vision -- ECCV 2022, 2022
HULC: 3D HUman Motion Capture with Pose Manifold SampLing and Dense Contact Guidance
S. Shimada, V. Golyanik, Z. Li, P. Pérez, W. Xu and C. Theobalt
Computer Vision -- ECCV 2022, 2022
Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields
G. Tiwari, D. Antic, J. E. Lenssen, N. Sarafianos, T. Tung and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
BayesCap: Bayesian Identity Cap for Calibrated Uncertainty in Frozen Neural Networks
U. Upadhyay, S. Karthik, Y. Chen, M. Mancini and Z. Akata
Computer Vision -- ECCV 2022, 2022
CHORE: Contact, Human and Object Reconstruction from a Single RGB Image
X. Xie, B. L. Bhatnagar and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
COUCH: Towards Controllable Human-Chair Interactions
X. Zhang, B. L. Bhatnagar, S. Starke, V. Guzov and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
K. Zhou, B. L. Bhatnagar, J. E. Lenssen and G. Pons-Moll
Computer Vision -- ECCV 2022, 2022
Advancing Translational Research in Neuroscience through Multi-task Learning
H. Cao, X. Hong, H. Tost, A. Meyer-Lindenberg and E. Schwarz
Frontiers in Psychiatry, Volume 13, 2022
Semantic Image Synthesis with Semantically Coupled VQ-Model
S. Alaniz, T. Hummel and Z. Akata
ICLR Workshop on Deep Generative Models for Highly Structured Data (ICLR 2022 DGM4HSD), 2022
RAMA: A Rapid Multicut Algorithm on GPU
A. Abbas and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
FastDOG: Fast Discrete Optimization on GPU
A. Abbas and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
BEHAVE: Dataset and Method for Tracking Human Object Interactions
B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
B-cos Networks: Alignment is All We Need for Interpretability
M. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation
S. Cai, A. Obukhov, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Decoupling Zero-Shot Semantic Segmentation
J. Ding, N. Xue, G.-S. Xia and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment the novel categories that have not been seen in the training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem, and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only with texts. While simple, the pixel-level ZS3 formulation shows the limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple the ZS3 into two sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments. 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter subtask performs at segment-level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by large margins, e.g., 35 points on the PASCAL VOC and 3 points on the COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking
A. Doering, D. Chen, S. Zhang, B. Schiele and J. Gall
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
Y. Fan, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced SSL. To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets ranging from CIFAR-10, CIFAR-100, ImageNet, to Food-101. Our code will be made publicly available.
Bi-level Alignment for Cross-Domain Crowd Counting
S. Gong, S. Zhang, J. Yang, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
LiDAR Snowfall Simulation for Robust 3D Object Detection
M. Hahner, C. Sakaridis, M. Bijelic, F. Heide, F. Yu, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer.
KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning
S. Karthik, M. Mancini and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Large Loss Matters in Weakly Supervised Multi-Label Classification
Y. Kim, J. M. Kim, Z. Akata and J. Lee
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Stratified Transformer for 3D Point Cloud Segmentation
X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi and J. Jia
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding
X. Ma, Z. Wang, Y. Zhan, Y. Zheng, Z. Wang, D. Dai and C.-W. Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
Although considerable progress has been made in semantic scene understanding under clear weather, it is still a tough problem under adverse weather conditions, such as dense fog, due to the uncertainty caused by imperfect observations. Besides, difficulties in collecting and labeling foggy images hinder the progress of this field. Considering the success in semantic scene understanding under clear weather, we think it is reasonable to transfer knowledge learned from clear images to the foggy domain. As such, the problem becomes to bridge the domain gap between clear images and foggy images. Unlike previous methods that mainly focus on closing the domain gap caused by fog -- defogging the foggy images or fogging the clear images, we propose to alleviate the domain gap by considering fog influence and style variation simultaneously. The motivation is based on our finding that the style-related gap and the fog-related gap can be divided and closed respectively, by adding an intermediate domain. Thus, we propose a new pipeline to cumulatively adapt style, fog and the dual-factor (style and fog). Specifically, we devise a unified framework to disentangle the style factor and the fog factor separately, and then the dual-factor from images in different domains. Furthermore, we collaborate the disentanglement of three factors with a novel cumulative loss to thoroughly disentangle these three factors. Our method achieves the state-of-the-art performance on three benchmarks and shows generalization ability in rainy and snowy scenes.
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
O.-B. Mercea, L. Riesch, A. S. Koepke and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking
D. H. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag and P. Swoboda
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Abstract
Multi-Camera Multi-Object Tracking is currently drawing attention in the computer vision field due to its superior performance in real-world applications such as video surveillance with crowded scenes or in vast space. In this work, we propose a mathematically elegant multi-camera multiple object tracking approach based on a spatial-temporal lifted multicut formulation. Our model utilizes state-of-the-art tracklets produced by single-camera trackers as proposals. As these tracklets may contain ID-Switch errors, we refine them through a novel pre-clustering obtained from 3D geometry projections. As a result, we derive a better tracking graph without ID switches and more precise affinity costs for the data association phase. Tracklets are then matched to multi-camera trajectories by solving a global lifted multicut formulation that incorporates short and long-range temporal interactions on tracklets located in the same camera as well as inter-camera ones. Experimental results on the WildTrack dataset yield near-perfect result, outperforming state-of-the-art trackers on Campus while being on par on the PETS-09 dataset. We will make our implementations available upon acceptance of the paper.
Towards Better Understanding Attribution Methods
S. Rao, M. Böhle and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching
P. Roetzer, P. Swoboda, D. Cremers and F. Bernard
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Non-Isotropy Regularization for Proxy-Based Deep Metric Learning
K. Roth, O. Vinyals and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Integrating Language Guidance Into Vision-Based Deep Metric Learning
K. Roth, O. Vinyals and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation
T. Sun, M. Segù, J. Postels, Y. Wang, L. Van Gool, B. Schiele, F. Tombari and F. Yu
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Generalized Few-shot Semantic Segmentation
Z. Tian, X. Lai, L. Jiang, S. Liu, M. Shu, H. Zhao and J. Jia
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Scribble-Supervised LiDAR Semantic Segmentation
O. Unal, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Sound and Visual Representation Learning with Multiple Pretraining Tasks
A. B. Vasudevan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
RBGNet: Ray-based Grouping for 3D Object Detection
H. Wang, S. Shi, Z. Yang, R. Fang, Q. Qian, H. Li, B. Schiele and L. Wang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Continual Test-Time Domain Adaptation
Q. Wang, O. Fink, L. Van Gool and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
A Unified Query-based Paradigm for Point Cloud Understanding
Z. Yang, L. Jiang, Y. Sun, B. Schiele and J. Jia
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Adiabatic Quantum Computing for Multi Object Tracking
J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform
S. Li, X. Chen, Y. Liu, D. Dai, C. Stachniss and J. Gall
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Improving Depth Estimation Using Map-Based Depth Priors
V. Patil, A. Liniger, D. Dai and L. Van Gool
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization
N. Vödisch, O. Unal, K. Li, L. Van Gool and D. Dai
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Learnable Online Graph Representations for 3D Multi-Object Tracking
J.-N. Zaech, D. Dai, A. Liniger, M. Danelljan and L. Van Gool
IEEE Robotics and Automation Letters, 2022
DWDN: Deep Wiener Deconvolution Network for Non-Blind Image Deblurring
J. Dong, S. Roth and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 12, 2022
Meta-Transfer Learning through Hard Tasks
Q. Sun, Y. Liu, Z. Chen, T.-S. Chua and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 3, 2022
Generalized Few-Shot Video Classification With Video Retrieval and Feature Generation
Y. Xian, B. Korbar, M. Douze, L. Torresani, B. Schiele and Z. Akata
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 12, 2022
Hyperspectral Image Super-Resolution with RGB Image Super-Resolution as an Auxiliary Task
K. Li, D. Dai and L. Van Gool
2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022), 2022
ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks
D. H. M. Nguyen, D. M. Nguyen, T. T. N. Mai, T. Nguyen, K. T. Tran, A. T. Nguyen, B. T. Pham and B. T. Nguyen
Information Sciences, Volume 591, 2022
MoCapDeform: Monocular 3D Human Motion Capture in Deformable Scenes
Z. Li, S. Shimada, B. Schiele, C. Theobalt and V. Golyanik
International Conference on 3D Vision, 2022
(arXiv: 2208.08439, Accepted/in press)
Abstract
3D human motion capture from monocular RGB images respecting interactions of a subject with complex and possibly deformable environments is a very challenging, ill-posed and under-explored problem. Existing methods address it only weakly and do not model possible surface deformations often occurring when humans interact with scene surfaces. In contrast, this paper proposes MoCapDeform, i.e., a new framework for monocular 3D human motion capture that is the first to explicitly model non-rigid deformations of a 3D scene for improved 3D human pose estimation and deformable environment reconstruction. MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the camera space. It first localises a subject in the input monocular video along with dense contact labels using a new raycasting based strategy. Next, our human-environment interaction constraints are leveraged to jointly optimise global 3D human poses and non-rigid surface deformations. MoCapDeform achieves superior accuracy than competing methods on several datasets, including our newly recorded one with deforming background scenes.
Revisiting Consistency Regularization for Semi-supervised Learning
Y. Fan, A. Kukleva, D. Dai and B. Schiele
International Journal of Computer Vision, 2022
PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection
S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang and H. Li
International Journal of Computer Vision, 2022
OASIS: Only Adversarial Supervision for Semantic Image Synthesis
V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Journal of Computer Vision, 2022
Attribute Prototype Network for Any-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
International Journal of Computer Vision, 2022
DPER: Direct Parameter Estimation for Randomly Missing Data
T. T. Nguyen, K. M. Nguyen-Duy, D. H. M. Nguyen, B. T. Nguyen and B. A. Wade
Knowledge-Based Systems, Volume 240, 2022
Aliasing and Adversarial Robust Generalization of CNNs
J. Grabinski, J. Keuper and M. Keuper
Machine Learning, Volume 111, 2022
Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks
S. Jung and M. Keuper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022), 2022
(Accepted/in press)
Abstract
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial optimization problem of partitioning a real-valued edge-weighted graph such as to minimize the total cost of the partition. While graph convolutional neural networks (GNN) have proven to be promising in the context of combinatorial optimization, most of them are only tailored to or tested on positive-valued edge weights, i.e. they do not comply to the nature of the multicut problem. We therefore adapt various GNN architectures including Graph Convolutional Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to facilitate the efficient encoding of real-valued edge costs. Moreover, we employ a reformulation of the multicut ILP constraints to a polynomial program as loss function that allows to learn feasible multicut solutions in a scalable way. Thus, we provide the first approach towards end-to-end trainable multicuts. Our findings support that GNN approaches can produce good solutions in practice while providing lower computation times and largely improved scalability compared to LP solvers and optimized heuristics, especially when considering large instances.
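As an illustration of the objective described in this abstract, the following minimal sketch evaluates the cost of a partition and finds the optimum of a toy instance by exhaustive search. It is not the GNN approach of the paper. The function names, the toy graph, and the sign convention (an edge's cost is paid when the edge is cut, so negative costs favor cutting) are assumptions made for the example.

```python
from itertools import product


def multicut_cost(edges, labels):
    """Sum the costs of all edges whose endpoints carry different cluster labels."""
    return sum(cost for (u, v, cost) in edges if labels[u] != labels[v])


def brute_force_multicut(num_nodes, edges):
    """Enumerate every node labeling of a tiny graph and keep the cheapest one.

    Exponential in the number of nodes, so only usable for toy instances.
    """
    best_labels, best_cost = None, float("inf")
    for labels in product(range(num_nodes), repeat=num_nodes):
        cost = multicut_cost(edges, labels)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost


# Hypothetical triangle graph: (node_u, node_v, real-valued cost).
edges = [(0, 1, 2.0), (1, 2, 1.5), (0, 2, -3.0)]
labels, cost = brute_force_multicut(3, edges)
print(labels, cost)  # (0, 0, 1) -1.5: node 2 is separated from nodes 0 and 1
```

The exhaustive search makes the NP-hardness tangible: the number of labelings grows exponentially with the number of nodes, which is why learned or heuristic solvers are of interest for large instances.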
TATL: Task Agnostic Transfer Learning for Skin Attributes Detection
D. H. M. Nguyen, T. T. Nguyen, H. Vu, Q. Pham, B. T. Nguyen, D. Sonntag and M.-D. Nguyen
Medical Image Analysis, Volume 78, 2022
Impact of Realistic Properties of the Point Spread Function on Classification Tasks to Reveal a Possible Distribution Shift
P. Müller, A. Braun and M. Keuper
NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2022 Workshop DistShift), 2022
BDA-SketRet: Bi-Level Domain Adaptation for Zero-Shot SBIR
U. Chaudhuri, R. Chavan, B. Banerjee, A. Dutta and Z. Akata
Neurocomputing, Volume 514, 2022
Optimizing Edge Detection for Image Segmentation with Multicut Penalties
S. Jung, S. Ziegler, A. Kardoost and M. Keuper
Pattern Recognition (DAGM GCPR 2022), 2022
Abstract
The Minimum Cost Multicut Problem (MP) is a popular way for obtaining a graph decomposition by optimizing binary edge labels over edge costs. While the formulation of a MP from independently estimated costs per edge is highly flexible and intuitive, solving the MP is NP-hard and time-expensive. As a remedy, recent work proposed to predict edge probabilities with awareness to potential conflicts by incorporating cycle constraints in the prediction process. We argue that such formulation, while providing a first step towards end-to-end learnable edge weights, is suboptimal, since it is built upon a loose relaxation of the MP. We therefore propose an adaptive CRF that allows to progressively consider more violated constraints and, in consequence, to issue solutions with higher validity. Experiments on the BSDS500 benchmark for natural image segmentation as well as on electron microscopic recordings show that our approach yields more precise edge detection and image segmentation.
Keypoint Message Passing for Video-Based Person Re-identification
D. Chen, A. Doering, S. Zhang, J. Yang, J. Gall and B. Schiele
Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
T. Broedermann, C. Sakaridis, D. Dai and L. Van Gool
Technical Report, 2022
(arXiv: 2206.15157)
Abstract
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning
H. Chen, Y. Fan, Y. Wang, J. Wang, B. Schiele, X. Xie, M. Savvides and B. Raj
Technical Report, 2022
(arXiv: 2211.11086)
Abstract
Semi-supervised learning (SSL) has shown great promise in leveraging unlabeled data to improve model performance. While standard SSL assumes uniform data distribution, we consider a more realistic and challenging setting called imbalanced SSL, where imbalanced class distributions occur in both labeled and unlabeled data. Although there are existing endeavors to tackle this challenge, their performance degenerates when facing severe imbalance since they can not reduce the class imbalance sufficiently and effectively. In this paper, we study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance by simply supplementing labeled data with pseudo-labels, according to the difference in class distribution from the most frequent class. Such a simple baseline turns out to be highly effective in reducing class imbalance. It outperforms existing methods by a significant margin, e.g., 12.8%, 13.6%, and 16.7% over previous SOTA on CIFAR100-LT, FOOD101-LT, and ImageNet127 respectively. The reduced imbalance results in faster convergence and better pseudo-label accuracy of SimiS. The simplicity of our method also makes it possible to be combined with other re-balancing techniques to improve the performance further. Moreover, our method shows great robustness to a wide range of data distributions, which holds enormous potential in practice. Code will be publicly available.
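The supplementing idea stated in this abstract can be sketched as follows: for each class, compute how far its labeled count falls short of the most frequent class, and add at most that many pseudo-labeled samples. This is one plausible reading only. The function name, the tuple layout, and the choice to rank candidates by confidence are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def supplement_with_pseudo_labels(labeled_classes, pseudo_labeled):
    """Select pseudo-labeled samples that move each class toward the head-class count.

    labeled_classes: class ids of the labeled set, one entry per sample.
    pseudo_labeled:  (sample_id, predicted_class, confidence) tuples.
    Returns a list of (sample_id, class_id) pairs to add to the labeled set.
    """
    counts = Counter(labeled_classes)
    max_count = max(counts.values())
    selected = []
    for cls in sorted(counts):
        # Gap between this class and the most frequent class.
        deficit = max_count - counts[cls]
        # Assumption: prefer the most confident predictions for this class.
        candidates = sorted(
            (p for p in pseudo_labeled if p[1] == cls),
            key=lambda p: p[2],
            reverse=True,
        )
        selected.extend((sample_id, cls) for sample_id, _, _ in candidates[:deficit])
    return selected


# Hypothetical toy data: class 0 is the head class, class 2 the rarest.
labeled = [0] * 5 + [1] * 3 + [2] * 1
pseudo = [
    ("u1", 1, 0.90), ("u2", 1, 0.60), ("u3", 1, 0.80),
    ("u4", 2, 0.70), ("u5", 0, 0.95), ("u6", 2, 0.99),
]
print(supplement_with_pseudo_labels(labeled, pseudo))
# [('u1', 1), ('u3', 1), ('u6', 2), ('u4', 2)]
```

In this toy run the head class receives no pseudo-labels, class 1 receives two, and class 2 receives both of its available candidates, which narrows the imbalance without touching the original labels.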
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
Y. Chen, M. Mancini, X. Zhu and Z. Akata
Technical Report, 2022
(arXiv: 2208.11296)
Abstract
State-of-the-art deep learning models are often trained with a large amount of costly labeled training data. However, requiring exhaustive manual annotations may degrade the model's generalizability in the limited-label regime. Semi-supervised learning and unsupervised learning offer promising paradigms to learn from an abundance of unlabeled visual data. Recent progress in these paradigms has indicated the strong benefits of leveraging unlabeled data to improve model generalization and provide better model initialization. In this survey, we review the recent advanced deep learning algorithms on semi-supervised learning (SSL) and unsupervised learning (UL) for visual recognition from a unified perspective. To offer a holistic understanding of the state-of-the-art in these areas, we propose a unified taxonomy. We categorize existing representative SSL and UL with comprehensive and insightful analysis to highlight their design rationales in different learning scenarios and applications in different computer vision tasks. Lastly, we discuss the emerging trends and open challenges in SSL and UL to shed light on future critical research directions.
Leveraging Self-Supervised Training for Unintentional Action Recognition
E. Duka, A. Kukleva and B. Schiele
Technical Report, 2022
(arXiv: 2209.11870)
Abstract
Unintentional actions are rare occurrences that are difficult to define<br>precisely and that are highly dependent on the temporal context of the action.<br>In this work, we explore such actions and seek to identify the points in videos<br>where the actions transition from intentional to unintentional. We propose a<br>multi-stage framework that exploits inherent biases such as motion speed,<br>motion direction, and order to recognize unintentional actions. To enhance<br>representations via self-supervised training for the task of unintentional<br>action recognition we propose temporal transformations, called Temporal<br>Transformations of Inherent Biases of Unintentional Actions (T2IBUA). The<br>multi-stage approach models the temporal information on both the level of<br>individual frames and full clips. These enhanced representations show strong<br>performance for unintentional action recognition tasks. We provide an extensive<br>ablation study of our framework and report results that significantly improve<br>over the state-of-the-art.<br>
Normalization Perturbation: A Simple Domain Generalization Method for Real-World Domain Shifts
Q. Fan, M. Segu, Y.-W. Tai, F. Yu, C.-K. Tang, B. Schiele and D. Dai
Technical Report, 2022
(arXiv: 2211.04393)
Abstract
Improving a model's generalizability against domain shifts is crucial,<br>especially for safety-critical applications such as autonomous driving.<br>Real-world domain styles can vary substantially due to environment changes and<br>sensor noise, but deep models only know the training domain style. Such a domain<br>style gap impedes model generalization on diverse real-world domains. Our<br>proposed Normalization Perturbation (NP) can effectively overcome this domain<br>style overfitting problem. We observe that this problem is mainly caused by the<br>biased distribution of low-level features learned in shallow CNN layers. Thus,<br>we propose to perturb the channel statistics of source domain features to<br>synthesize various latent styles, so that the trained deep model can perceive<br>diverse potential domains and generalize well even without observations of<br>target domain data in training. We further explore the style-sensitive channels<br>for effective style synthesis. Normalization Perturbation only relies on a<br>single source domain and is surprisingly effective and extremely easy to<br>implement. Extensive experiments verify the effectiveness of our method for<br>generalizing models under real-world domain shifts.<br>
Visually Plausible Human-Object Interaction Capture from Wearable Sensors
V. Guzov, T. Sattler and G. Pons-Moll
Technical Report, 2022
(arXiv: 2205.02830)
Abstract
In everyday life, humans naturally modify the surrounding environment<br>through interactions, e.g., moving a chair to sit on it. To reproduce such<br>interactions in virtual spaces (e.g., metaverse), we need to be able to capture<br>and model them, including changes in the scene geometry, ideally from<br>ego-centric input alone (head camera and body-worn inertial sensors). This is<br>an extremely hard problem, especially since the object/scene might not be<br>visible from the head camera (e.g., a human not looking at a chair while<br>sitting down, or not looking at the door handle while opening a door). In this<br>paper, we present HOPS, the first method to capture interactions such as<br>dragging objects and opening doors from ego-centric data alone. Central to our<br>method is reasoning about human-object interactions, allowing us to track objects<br>even when they are not visible from the head camera. HOPS localizes and<br>registers both the human and the dynamic object in a pre-scanned static scene.<br>HOPS is an important first step towards advanced AR/VR applications based on<br>immersive virtual universes, and can provide human-centric training data to<br>teach machines to interact with their surroundings. The supplementary video,<br>data, and code will be available on our project page at<br>http://virtualhumans.mpi-inf.mpg.de/hops/<br>
Lifted Edges as Connectivity Priors for Multicut and Disjoint Paths
A. Horňáková
PhD Thesis, Universität des Saarlandes, 2022
MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
L. Hoyer, D. Dai, H. Wang and L. Van Gool
Technical Report, 2022
(arXiv: 2212.01322)
Abstract
In unsupervised domain adaptation (UDA), a model trained on source data (e.g.,<br>synthetic) is adapted to target data (e.g., real-world) without access to target<br>annotation. Most previous UDA methods struggle with classes that have a similar<br>visual appearance on the target domain as no ground truth is available to learn<br>the slight appearance differences. To address this problem, we propose a Masked<br>Image Consistency (MIC) module to enhance UDA by learning spatial context<br>relations of the target domain as additional clues for robust visual<br>recognition. MIC enforces the consistency between predictions of masked target<br>images, where random patches are withheld, and pseudo-labels that are generated<br>based on the complete image by an exponential moving average teacher. To<br>minimize the consistency loss, the network has to learn to infer the<br>predictions of the masked regions from their context. Due to its simple and<br>universal concept, MIC can be integrated into various UDA methods across<br>different visual recognition tasks such as image classification, semantic<br>segmentation, and object detection. MIC significantly improves the<br>state-of-the-art performance across the different recognition tasks for<br>synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For<br>instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU and 92.8%<br>on GTA-to-Cityscapes and VisDA-2017, respectively, which corresponds to an<br>improvement of +2.1 and +3.0 percentage points over the previous state of the art.<br>The implementation is available at https://github.com/lhoyer/MIC.<br>
Deep Gradient Learning for Efficient Camouflaged Object Detection
G.-P. Ji, D.-P. Fan, Y.-C. Chou, D. Dai, A. Liniger and L. Van Gool
Technical Report, 2022
(arXiv: 2205.12853)
Abstract
This paper introduces DGNet, a novel deep framework that exploits object<br>gradient supervision for camouflaged object detection (COD). It decouples the<br>task into two connected branches, i.e., a context and a texture encoder. The<br>essential connection is the gradient-induced transition, representing a soft<br>grouping between context and texture features. Benefiting from the simple but<br>efficient framework, DGNet outperforms existing state-of-the-art COD models by<br>a large margin. Notably, our efficient version, DGNet-S, runs in real-time (80<br>fps) and achieves comparable results to the cutting-edge model<br>JCSOD-CVPR$_{21}$ with only 6.82% of the parameters. Application results also show<br>that the proposed DGNet performs well in polyp segmentation, defect detection,<br>and transparent object segmentation tasks. Code will be made available at<br>https://github.com/GewelsJI/DGNet.<br>
Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation
V. Lazova, V. Guzov, K. Olszewski, S. Tulyakov and G. Pons-Moll
Technical Report, 2022
(arXiv: 2204.10850)
Abstract
We present a novel method for performing flexible, 3D-aware image content<br>manipulation while enabling high-quality novel view synthesis. While NeRF-based<br>approaches are effective for novel view synthesis, such models memorize the<br>radiance for every point in a scene within a neural network. Since these models<br>are scene-specific and lack a 3D scene representation, classical editing such<br>as shape manipulation, or combining scenes is not possible. Hence, editing and<br>combining NeRF-based scenes has not been demonstrated. With the aim of<br>obtaining interpretable and controllable scene representations, our model<br>couples learnt scene-specific feature volumes with a scene agnostic neural<br>rendering network. With this hybrid representation, we decouple neural<br>rendering from scene-specific geometry and appearance. We can generalize to<br>novel scenes by optimizing only the scene-specific 3D feature representation,<br>while keeping the parameters of the rendering network fixed. The rendering<br>function learnt during the initial training stage can thus be easily applied to<br>new scenes, making our approach more flexible. More importantly, since the<br>feature volumes are independent of the rendering model, we can manipulate and<br>combine scenes by editing their corresponding feature volumes. The edited<br>volume can then be plugged into the rendering model to synthesize high-quality<br>novel views. We demonstrate various scene manipulations, including mixing<br>scenes, deforming objects and inserting objects into scenes, while still<br>producing photo-realistic results.<br>
Discovering Class-Specific GAN Controls for Semantic Image Synthesis
E. Schönfeld, J. Borges, V. Sushko, B. Schiele and A. Khoreva
Technical Report, 2022
(arXiv: 2212.01455)
Abstract
Prior work has extensively studied the latent space structure of GANs for<br>unconditional image synthesis, enabling global editing of generated images by<br>the unsupervised discovery of interpretable latent directions. However, the<br>discovery of latent directions for conditional GANs for semantic image<br>synthesis (SIS) has remained unexplored. In this work, we specifically focus on<br>addressing this gap. We propose a novel optimization method for finding<br>spatially disentangled class-specific directions in the latent space of<br>pretrained SIS models. We show that the latent directions found by our method<br>can effectively control the local appearance of semantic classes, e.g.,<br>changing their internal structure, texture or color independently from each<br>other. Visual inspection and quantitative evaluation of the discovered GAN<br>controls on various datasets demonstrate that our method discovers a diverse<br>set of unique and semantically meaningful latent directions for class-specific<br>edits.<br>
MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge -- Motion Prediction
S. Shi, L. Jiang, D. Dai and B. Schiele
Technical Report, 2022
(arXiv: 2209.10033)
Abstract
In this report, we present the 1st place solution for the motion prediction track<br>of the 2022 Waymo Open Dataset Challenges. We propose a novel Motion Transformer<br>framework for multimodal motion prediction, which introduces a small set of<br>novel motion query pairs for generating better multimodal future trajectories<br>by jointly performing intention localization and iterative motion<br>refinement. A simple model ensemble strategy with non-maximum suppression is<br>adopted to further boost the final performance. Our approach achieves 1st<br>place on the motion prediction leaderboard of the 2022 Waymo Open Dataset<br>Challenges, outperforming other methods by remarkable margins. Code will be<br>available at https://github.com/sshaoshuai/MTR.<br>
Understanding and Improving Robustness and Uncertainty Estimation in Deep Learning
D. Stutz
PhD Thesis, Universität des Saarlandes, 2022
Abstract
Deep learning is becoming increasingly relevant for many high-stakes applications such as autonomous driving or medical diagnosis where wrong decisions can have massive impact on human lives. Unfortunately, deep neural networks are typically assessed solely based on generalization, e.g., accuracy on a fixed test set. However, this is clearly insufficient for safe deployment as potential malicious actors and distribution shifts or the effects of quantization and unreliable hardware are disregarded. Thus, recent work additionally evaluates performance on potentially manipulated or corrupted inputs as well as after quantization and deployment on specialized hardware. In such settings, it is also important to obtain reasonable estimates of the model's confidence alongside its predictions. This thesis studies robustness and uncertainty estimation in deep learning along three main directions: First, we consider so-called adversarial examples, slightly perturbed inputs causing severe drops in accuracy. Second, we study weight perturbations, focusing particularly on bit errors in quantized weights. This is relevant for deploying models on special-purpose hardware for efficient inference, so-called accelerators. Finally, we address uncertainty estimation to improve robustness and provide meaningful statistical performance guarantees for safe deployment. In detail, we study the existence of adversarial examples with respect to the underlying data manifold. In this context, we also investigate adversarial training which improves robustness by augmenting training with adversarial examples at the cost of reduced accuracy. We show that regular adversarial examples leave the data manifold in an almost orthogonal direction. While we find no inherent trade-off between robustness and accuracy, this contributes to a higher sample complexity as well as severe overfitting of adversarial training. 
Using a novel measure of flatness in the robust loss landscape with respect to weight changes, we also show that robust overfitting is caused by converging to particularly sharp minima. In fact, we find a clear correlation between flatness and good robust generalization. Further, we study random and adversarial bit errors in quantized weights. In accelerators, random bit errors occur in the memory when reducing voltage with the goal of improving energy-efficiency. Here, we consider a robust quantization scheme, use weight clipping as regularization and perform random bit error training to improve bit error robustness, allowing considerable energy savings without requiring hardware changes. In contrast, adversarial bit errors are maliciously introduced through hardware- or software-based attacks on the memory, with severe consequences on performance. We propose a novel adversarial bit error attack to study this threat and use adversarial bit error training to improve robustness and thereby also the accelerator's security. Finally, we view robustness in the context of uncertainty estimation. By encouraging low-confidence predictions on adversarial examples, our confidence-calibrated adversarial training successfully rejects adversarial, corrupted as well as out-of-distribution examples at test time. Thereby, we are also able to improve the robustness-accuracy trade-off compared to regular adversarial training. However, even robust models do not provide any guarantee for safe deployment. To address this problem, conformal prediction allows the model to predict confidence sets with user-specified guarantee of including the true label. Unfortunately, as conformal prediction is usually applied after training, the model is trained without taking this calibration step into account. To address this limitation, we propose conformal training which allows training conformal predictors end-to-end with the underlying model. 
This not only improves the obtained uncertainty estimates but also enables optimizing application-specific objectives without losing the provided guarantee. Besides our work on robustness or uncertainty, we also address the problem of 3D shape completion of partially observed point clouds. Specifically, we consider an autonomous driving or robotics setting where vehicles are commonly equipped with LiDAR or depth sensors and obtaining a complete 3D representation of the environment is crucial. However, ground truth shapes that are essential for applying deep learning techniques are extremely difficult to obtain. Thus, we propose a weakly-supervised approach that can be trained on the incomplete point clouds while offering efficient inference. In summary, this thesis contributes to our understanding of robustness against both input and weight perturbations. To this end, we also develop methods to improve robustness alongside uncertainty estimation for safe deployment of deep learning methods in high-stakes applications. In the particular context of autonomous driving, we also address 3D shape completion of sparse point clouds.
Structured Prediction Problem Archive
P. Swoboda, A. Horňáková, P. Rötzer, B. Savchynskyy and A. Abbas
Technical Report, 2022
(arXiv: 2202.03574)
Abstract
Structured prediction problems are one of the fundamental tools in machine<br>learning. In order to facilitate algorithm development for their numerical<br>solution, we collect in one place a large number of datasets in easy to read<br>formats for a diverse set of problem classes. We provide archival links to<br>datasets, description of the considered problems and problem formats, and a<br>short summary of problem characteristics including size, number of instances<br>etc. For reference we also give a non-exhaustive selection of algorithms<br>proposed in the literature for their solution. We hope that this central<br>repository will make benchmarking and comparison to established works easier.<br>We welcome submission of interesting new datasets and algorithms for inclusion<br>in our archive.<br>
On Fragile Features and Batch Normalization in Adversarial Training
N. P. Walter, D. Stutz and B. Schiele
Technical Report, 2022
(arXiv: 2204.12393)
Abstract
Modern deep learning architectures utilize batch normalization (BN) to<br>stabilize training and improve accuracy. It has been shown that the BN layers<br>alone are surprisingly expressive. In the context of robustness against<br>adversarial examples, however, BN is argued to increase vulnerability. That is,<br>BN helps to learn fragile features. Nevertheless, BN is still used in<br>adversarial training, which is the de-facto standard to learn robust features.<br>In order to shed light on the role of BN in adversarial training, we<br>investigate to what extent the expressiveness of BN can be used to robustify<br>fragile features in comparison to random features. On CIFAR10, we find that<br>adversarially fine-tuning just the BN layers can result in non-trivial<br>adversarial robustness. Adversarially training only the BN layers from scratch,<br>in contrast, is not able to convey meaningful adversarial robustness. Our<br>results indicate that fragile features can be used to learn models with<br>moderate adversarial robustness, while random features cannot.<br>
Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes
Y.-H. Wu, D. Zhang, L. Zhang, X. Zhan, D. Dai, Y. Liu and M.-M. Cheng
Technical Report, 2022
(arXiv: 2208.08621)
Abstract
Current efficient LiDAR-based detection frameworks fall short in exploiting<br>object relations, which are naturally present in both spatial and temporal forms.<br>To this end, we introduce a simple, efficient, and effective two-stage<br>detector, termed Ret3D. At the core of Ret3D is the utilization of novel<br>intra-frame and inter-frame relation modules to capture the spatial and<br>temporal relations, respectively. More specifically, the intra-frame relation module<br>(IntraRM) encapsulates the intra-frame objects into a sparse graph and thus<br>allows us to refine the object features through efficient message passing. On<br>the other hand, the inter-frame relation module (InterRM) densely connects each<br>object in its corresponding tracked sequences dynamically, and leverages such<br>temporal information to further enhance its representations efficiently through<br>a lightweight transformer network. We instantiate our novel designs of IntraRM<br>and InterRM with general center-based or anchor-based detectors and evaluate<br>them on the Waymo Open Dataset (WOD). With negligible extra overhead, Ret3D<br>achieves state-of-the-art performance, being 5.5% and 3.2% higher than the<br>recent competitor in terms of the LEVEL 1 and LEVEL 2 mAPH metrics on vehicle<br>detection, respectively.<br>
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
K. Zhou, B. Lal Bhatnagar, J. E. Lenssen and G. Pons-Moll
Technical Report, 2022
(arXiv: 2205.07982)
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction<br>sequences using a data prior. Existing hand trackers, especially those that<br>rely on very few cameras, often produce visually unrealistic results with<br>hand-object intersection or missing contacts. Although correcting such errors<br>requires reasoning about temporal aspects of interaction, most previous works<br>focus on static grasps and contacts. At the core of our method are TOCH fields, a<br>novel spatio-temporal representation for modeling correspondences between hands<br>and objects during interaction. The key component is a point-wise<br>object-centric representation which encodes the hand position relative to the<br>object. Leveraging this novel representation, we learn a latent manifold of<br>plausible TOCH fields with a temporal denoising auto-encoder. Experiments<br>demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object<br>interaction models, which are limited to static grasps and contacts. More<br>importantly, our method produces smooth interactions even before and after<br>contact. Using a single trained TOCH model, we quantitatively and qualitatively<br>demonstrate its usefulness for 1) correcting erroneous reconstruction results<br>from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising,<br>and 3) grasp transfer across objects. We will release our code and trained<br>model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch/<br>
Hypergraph Transformer for Skeleton-based Action Recognition
Y. Zhou, C. Li, Z.-Q. Cheng, Y. Geng, X. Xie and M. Keuper
Technical Report, 2022
(arXiv: 2211.09590)
Abstract
Skeleton-based action recognition aims to predict human actions given human<br>joint coordinates with skeletal interconnections. To model such off-grid data<br>points and their co-occurrences, Transformer-based formulations would be a<br>natural choice. However, Transformers still lag behind state-of-the-art methods<br>using graph convolutional networks (GCNs). Transformers assume that the input<br>is permutation-invariant and homogeneous (partially alleviated by positional<br>encoding), which ignores an important characteristic of skeleton data, i.e.,<br>bone connectivity. Furthermore, each type of body joint has a clear physical<br>meaning in human motion, i.e., motion retains an intrinsic relationship<br>regardless of the joint coordinates, which is not explored in Transformers. In<br>fact, certain re-occurring groups of body joints are often involved in specific<br>actions, such as the subconscious hand movement for keeping balance. Vanilla<br>attention is incapable of describing such underlying relations that are<br>persistent and beyond pair-wise. In this work, we aim to exploit these unique<br>aspects of skeleton data to close the performance gap between Transformers and<br>GCNs. Specifically, we propose a new self-attention (SA) extension, named<br>Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order<br>relations into the model. The K-hop relative positional embeddings are also<br>employed to take bone connectivity into account. We name the resulting model<br>Hyperformer, and it achieves comparable or better performance w.r.t. accuracy<br>and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D<br>120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the<br>significantly improved performance reached by our Hyperformer demonstrates the<br>underestimated potential of Transformer models in this field.<br>
2021
(SP)2Net for Generalized Zero-Label Semantic Segmentation
A. Das, Y. Xian, Y. He, B. Schiele and Z. Akata
Pattern Recognition (GCPR 2021), 2021
Revisiting Consistency Regularization for Semi-supervised Learning
Y. Fan, A. Kukleva and B. Schiele
Pattern Recognition (GCPR 2021), 2021
Compositional Mixture Representations for Vision and Text
S. Alaniz, M. Federici and Z. Akata
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), 2021
Attention Consistency on Visual Corruptions for Single-Source Domain Generalization
I. Cugu, M. Mancini, Y. Chen and Z. Akata
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), 2021
Probabilistic Compositional Embeddings for Multimodal Image Retrieval
A. Neculai, Y. Chen and Z. Akata
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), 2021
Uniform Priors for Data-Efficient Learning
S. Sinha, K. Roth, A. Goyal, M. Ghassemi, Z. Akata, H. Larochelle and A. Garg
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR 2022), 2021
2020
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
L. Salewski, A. S. Koepke, H. P. A. Lensch and Z. Akata
xxAI -- Beyond Explainable AI (xxAI @ICML 2020), 2020