D2
Computer Vision and Machine Learning
2022
SP-ViT: Learning 2D Spatial Priors for Vision Transformers
Y. Zhou, W. Xiang, C. Li, B. Wang, X. Wei, L. Zhang, M. Keuper and X. Hua
33rd British Machine Vision Conference (BMVC 2022), 2022
(Accepted/in press)
Robust Models are less Over-Confident
J. Grabinski, P. Gavrikov, J. Keuper and M. Keuper
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
Trading off Image Quality for Robustness is not Necessary with Regularized Deterministic Autoencoders
A. Saseendran, K. Skubch and M. Keuper
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
(Accepted/in press)
HULC: 3D HUman Motion Capture with Pose Manifold SampLing and Dense Contact Guidance
S. Shimada, V. Golyanik, Z. Li, P. Pérez, W. Xu and C. Theobalt
Computer Vision -- ECCV 2022, 2022
FrequencyLowCut pooling - Plug & Play against Catastrophic Overfitting
J. Grabinski, S. Jung, J. Keuper and M. Keuper
European Conference on Computer Vision (ECCV 2022), 2022
(Accepted/in press)
Improving Robustness by Enhancing Weak Subnets
Y. Guo, D. Stutz and B. Schiele
European Conference on Computer Vision (ECCV 2022), 2022
(Accepted/in press)
Learning Where To Look - Generative NAS is Surprisingly Efficient
J. Lukasik, S. Jung and M. Keuper
European Conference on Computer Vision (ECCV 2022), 2022
(Accepted/in press)
Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields
G. Tiwari, D. Antic, J. E. Lenssen, N. Sarafianos, T. Tung and G. Pons-Moll
European Conference on Computer Vision (ECCV 2022), 2022
(Accepted/in press)
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
K. Zhou, B. L. Bhatnagar, J. E. Lenssen and G. Pons-Moll
European Conference on Computer Vision (ECCV 2022), 2022
(Accepted/in press)
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous work focuses on static grasps and contacts. The core of our method is TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects. We will release our code and trained model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch/
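The abstract's central idea, encoding where the hand is relative to the object rather than in camera coordinates, can be illustrated with a small sketch. The nearest-neighbor correspondence and contact radius below are assumptions chosen for illustration, not the paper's exact field definition:

```python
# Illustrative sketch of a point-wise, object-centric hand encoding in the
# spirit of TOCH fields (the radius threshold and nearest-neighbor
# correspondence here are assumptions, not the paper's definition).
import numpy as np
from scipy.spatial import cKDTree

def object_centric_hand_field(obj_points, hand_points, contact_radius=0.02):
    """For each object point, return (correspondence flag, offset to the
    closest hand point). Offsets are expressed relative to the object
    point, so the encoding moves with the object, not the camera."""
    tree = cKDTree(hand_points)
    dists, idx = tree.query(obj_points)      # closest hand point per object point
    corresponds = (dists < contact_radius).astype(np.float32)
    offsets = hand_points[idx] - obj_points  # hand position relative to the object
    return corresponds, offsets

# A temporal stack of such per-frame fields would then be the input to a
# denoising auto-encoder that learns the manifold of plausible interactions.
```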
BEHAVE: Dataset and Method for Tracking Human Object Interactions
B. L. Bhatnagar, X. Xie, I. Petrov, C. Sminchisescu, C. Theobalt and G. Pons-Moll
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
B-cos Networks: Alignment is All We Need for Interpretability
M. Böhle, M. Fritz and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Pix2NeRF: Unsupervised Conditional Pi-GAN for Single Image to Neural Radiance Fields Translation
S. Cai, A. Obukhov, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Decoupling Zero-Shot Semantic Segmentation
J. Ding, N. Xue, G.-S. Xia and D. Dai
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2112.07910, Accepted/in press)
Abstract
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, the pixel-level ZS3 formulation has limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task to group pixels into segments, and 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels of unseen classes. The latter sub-task operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g., CLIP) for ZS3. Based on this decoupled formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on ZS3 standard benchmarks by large margins, e.g., 35 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
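The second sub-task then reduces to matching segment embeddings against text embeddings of class names. A minimal sketch, assuming pre-computed CLIP-style embeddings (the grouping network and prompt template are placeholders, not ZegFormer's exact design):

```python
# Hedged sketch of zero-shot classification of class-agnostic segments
# with CLIP-style embeddings; how the (S, D) segment embeddings are
# produced is left abstract here.
import torch
import torch.nn.functional as F

def classify_segments(segment_embeddings, text_embeddings):
    """segment_embeddings: (S, D), one embedding per predicted segment.
    text_embeddings: (C, D), one embedding per class-name prompt,
    e.g. encoded from "a photo of a {class}". Returns (S,) class ids."""
    seg = F.normalize(segment_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    logits = seg @ txt.t()            # cosine similarity, shape (S, C)
    return logits.argmax(dim=-1)      # classes never seen during training

# Unseen classes only require new text prompts; the grouping stage is
# class-agnostic and needs no retraining.
```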
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
Y. Fan, D. Dai and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2112.04564, Accepted/in press)
Abstract
In this paper, we propose a novel co-learning framework (CoSSL) with decoupled representation learning and classifier learning for imbalanced semi-supervised learning (SSL). To handle the data imbalance, we devise Tail-class Feature Enhancement (TFE) for classifier learning. Furthermore, the current evaluation protocol for imbalanced SSL focuses only on balanced test sets, which has limited practicality in real-world scenarios. Therefore, we further conduct a comprehensive evaluation under various shifted test distributions. In experiments, we show that our approach outperforms other methods over a large range of shifted distributions, achieving state-of-the-art performance on benchmark datasets including CIFAR-10, CIFAR-100, ImageNet, and Food-101. Our code will be made publicly available.
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
L. Hoyer, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2111.14887, Accepted/in press)
Abstract
As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer.
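Rare Class Sampling is the most self-contained of the three strategies: source images containing rare classes are drawn more often, so pseudo-labels for those classes keep appearing during self-training. A hedged sketch of one plausible instantiation (the softmax-style weighting and temperature are assumptions, not necessarily DAFormer's exact recipe):

```python
# Illustrative Rare Class Sampling: sample a class with probability that
# grows as its pixel frequency shrinks, then sample an image containing it.
import numpy as np

def rare_class_sample(class_freqs, images_per_class, temperature=0.1, rng=None):
    """class_freqs: (C,) pixel frequency of each class in the source set.
    images_per_class: dict class_id -> list of image ids containing it."""
    rng = rng or np.random.default_rng()
    w = np.exp((1.0 - class_freqs) / temperature)  # rarer class -> larger weight
    p = w / w.sum()
    c = rng.choice(len(class_freqs), p=p)          # pick a (likely rare) class
    return rng.choice(images_per_class[c])         # then an image that contains it
```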
Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding
X. Ma, Z. Wang, Y. Zhan, Y. Zheng, Z. Wang, D. Dai and C.-W. Lin
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Abstract
Although considerable progress has been made in semantic scene understanding under clear weather, it remains a tough problem under adverse weather conditions, such as dense fog, due to the uncertainty caused by imperfect observations. Besides, difficulties in collecting and labeling foggy images hinder the progress of this field. Considering the success of semantic scene understanding under clear weather, we think it is reasonable to transfer knowledge learned from clear images to the foggy domain. As such, the problem becomes bridging the domain gap between clear images and foggy images. Unlike previous methods that mainly focus on closing the domain gap caused by fog, e.g., by defogging the foggy images or fogging the clear images, we propose to alleviate the domain gap by considering fog influence and style variation simultaneously. The motivation is based on our finding that the style-related gap and the fog-related gap can be divided and closed separately by adding an intermediate domain. Thus, we propose a new pipeline to cumulatively adapt style, fog, and the dual factor (style and fog). Specifically, we devise a unified framework to disentangle the style factor and the fog factor separately, and then the dual factor from images in different domains. Furthermore, we couple the disentanglement of the three factors with a novel cumulative loss to thoroughly disentangle them. Our method achieves state-of-the-art performance on three benchmarks and shows generalization ability in rainy and snowy scenes.
Towards Better Understanding Attribution Methods
S. Rao, M. Böhle and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(Accepted/in press)
Sound and Visual Representation Learning with Multiple Pretraining Tasks
A. B. Vasudevan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2201.01046, Accepted/in press)
Abstract
Different self-supervised learning (SSL) tasks reveal different features from the data, and the learned feature representations can exhibit different performance on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) such that the resulting representation generalizes well to all downstream tasks. Specifically, for this study, we investigate binaural sound and image data in isolation. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into the downstream task performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models and fully supervised models in downstream task performance. As a check of applicability to other modalities, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification and by +2.83, +1.56 and +1.61 AP on COCO detection. Code will be made publicly available.
Adiabatic Quantum Computing for Multi Object Tracking
J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai and L. Van Gool
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
(arXiv: 2202.08837, Accepted/in press)
Abstract
Multi-Object Tracking (MOT) is most often approached in the tracking-by-detection paradigm, where object detections are associated through time. The association step naturally leads to discrete optimization problems. As these optimization problems are often NP-hard, they can only be solved exactly for small instances on current hardware. Adiabatic quantum computing (AQC) offers a solution for this, as it has the potential to provide a considerable speedup on a range of NP-hard optimization problems in the near future. However, current MOT formulations are unsuitable for quantum computing due to their scaling properties. In this work, we therefore propose the first MOT formulation designed to be solved with AQC. We employ an Ising model that represents the quantum mechanical system implemented on the AQC. We show that our approach is competitive compared with state-of-the-art optimization-based approaches, even when using off-the-shelf integer programming solvers. Finally, we demonstrate that our MOT problem is already solvable on the current generation of real quantum computers for small examples, and analyze the properties of the measured solutions.
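To make the Ising connection concrete: an AQC minimizes quadratic energies over binary variables, so association must be phrased as a QUBO. A hedged sketch of the generic construction (the affinity matrix and penalty weight are placeholders; the paper's actual formulation differs in its scaling-friendly design):

```python
# Illustrative QUBO for detection-to-track association: variables
# x_{it} in {0,1} select matches; quadratic penalties discourage
# assigning one detection to several tracks and vice versa.
import numpy as np

def association_qubo(affinity, penalty=10.0):
    """affinity: (D, T) similarity of D detections to T tracks.
    Variable x_{it} is flattened to index i*T + t; energy is x^T Q x."""
    D, T = affinity.shape
    Q = np.zeros((D * T, D * T))
    for i in range(D):
        for t in range(T):
            Q[i*T + t, i*T + t] = -affinity[i, t]    # reward chosen matches
    for i in range(D):                    # each detection: at most one track
        for t1 in range(T):
            for t2 in range(t1 + 1, T):
                Q[i*T + t1, i*T + t2] += penalty
    for t in range(T):                    # each track: at most one detection
        for i1 in range(D):
            for i2 in range(i1 + 1, D):
                Q[i1*T + t, i2*T + t] += penalty
    return Q
```

A standard change of variables (x = (1 + s) / 2 with spins s in {-1, +1}) turns such a QUBO into the Ising form the hardware expects.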
Multi-Scale Interaction for Real-Time LiDAR Data Segmentation on an Embedded Platform
S. Li, X. Chen, Y. Liu, D. Dai, C. Stachniss and J. Gall
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Improving Depth Estimation Using Map-Based Depth Priors
V. Patil, A. Liniger, D. Dai and L. Van Gool
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
End-to-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization
N. Vödisch, O. Unal, K. Li, L. Van Gool and D. Dai
IEEE Robotics and Automation Letters, Volume 7, Number 2, 2022
Learnable Online Graph Representations for 3D Multi-Object Tracking
J.-N. Zaech, D. Dai, A. Liniger, M. Danelljan and L. Van Gool
IEEE Robotics and Automation Letters, 2022
Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
D. Dai, A. B. Vasudevan, J. Matas and L. Van Gool
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Higher-Order Multicuts for Geometric Model Fitting and Motion Segmentation
E. Levinkov, A. Kardoost, B. Andres and M. Keuper
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Abstract
The minimum cost lifted multicut problem is a generalization of the multicut problem and is a means of optimizing a decomposition of a graph w.r.t. both positive and negative edge costs. Its main advantage is that multicut-based formulations do not require the number of components to be given a priori; instead, it is deduced from the solution. However, the standard multicut cost function is limited to pairwise relationships between nodes, while several important applications either require or can benefit from a higher-order cost function, i.e., hyper-edges. In this paper, we propose a pseudo-boolean formulation for a multiple model fitting problem. It is based on a formulation of any-order minimum cost lifted multicuts, which allows partitioning an undirected graph with pairwise connectivity so as to minimize costs defined over any set of hyper-edges. As the proposed formulation is NP-hard and the branch-and-bound algorithm is too slow in practice, we propose an efficient local search algorithm for inference in the resulting problems. We demonstrate the versatility and effectiveness of our approach in several applications: geometric multiple model fitting, homography and motion estimation, and motion segmentation.
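For reference, the pairwise objective that the paper generalizes to hyper-edges can be stated as follows (a standard formulation; the notation here is chosen for illustration):

```latex
% Pairwise minimum cost multicut over a graph G = (V, E) with edge costs c_e;
% y_e = 1 means edge e is cut.
\[
\min_{y \in \{0,1\}^{E}} \sum_{e \in E} c_e \, y_e
\quad \text{s.t.} \quad
y_e \le \sum_{e' \in C \setminus \{e\}} y_{e'}
\quad \forall \text{ cycles } C \text{ of } G,\; \forall e \in C .
\]
```

The cycle inequalities forbid cutting exactly one edge of any cycle, which guarantees that the cut edges induce a valid decomposition; the lifted variant additionally sums costs over a set of lifted edges between non-neighboring nodes, and the paper extends the cost to arbitrary hyper-edges.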
Meta-Transfer Learning through Hard Tasks
Q. Sun, Y. Liu, Z. Chen, T.-S. Chua and B. Schiele
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 44, Number 3, 2022
Hyperspectral Image Super-Resolution with RGB Image Super-Resolution as an Auxiliary Task
K. Li, D. Dai and L. Van Gool
2022 IEEE Winter Conference on Applications of Computer Vision (WACV 2022), 2022
ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks
D. H. M. Nguyen, D. M. Nguyen, T. T. N. Mai, T. Nguyen, K. T. Tran, A. T. Nguyen, B. T. Pham and B. T. Nguyen
Information Sciences, Volume 591, 2022
MoCapDeform: Monocular 3D Human Motion Capture in Deformable Scenes
Z. Li, S. Shimada, B. Schiele, C. Theobalt and V. Golyanik
International Conference on 3D Vision, 2022
(arXiv: 2208.08439, Accepted/in press)
Abstract
3D human motion capture from monocular RGB images respecting interactions of a subject with complex and possibly deformable environments is a very challenging, ill-posed and under-explored problem. Existing methods address it only weakly and do not model possible surface deformations that often occur when humans interact with scene surfaces. In contrast, this paper proposes MoCapDeform, a new framework for monocular 3D human motion capture that is the first to explicitly model non-rigid deformations of a 3D scene for improved 3D human pose estimation and deformable environment reconstruction. MoCapDeform accepts a monocular RGB video and a 3D scene mesh aligned in the camera space. It first localises the subject in the input monocular video along with dense contact labels using a new raycasting-based strategy. Next, our human-environment interaction constraints are leveraged to jointly optimise global 3D human poses and non-rigid surface deformations. MoCapDeform achieves higher accuracy than competing methods on several datasets, including our newly recorded one with deforming background scenes.
OASIS: Only Adversarial Supervision for Semantic Image Synthesis
V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Journal of Computer Vision, 2022
Attribute Prototype Network for Any-Shot Learning
W. Xu, Y. Xian, J. Wang, B. Schiele and Z. Akata
International Journal of Computer Vision, 2022
DPER: Direct Parameter Estimation for Randomly Missing Data
T. T. Nguyen, K. M. Nguyen-Duy, D. H. M. Nguyen, B. T. Nguyen and B. A. Wade
Knowledge-Based Systems, Volume 240, 2022
Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks
S. Jung and M. Keuper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022), 2022
(Accepted/in press)
Abstract
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial optimization problem of partitioning a real-valued edge-weighted graph such as to minimize the total cost of the partition. While graph convolutional neural networks (GNNs) have proven to be promising in the context of combinatorial optimization, most of them are only tailored to or tested on positive-valued edge weights, i.e., they do not comply with the nature of the multicut problem. We therefore adapt various GNN architectures, including Graph Convolutional Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks, to facilitate the efficient encoding of real-valued edge costs. Moreover, we employ a reformulation of the multicut ILP constraints into a polynomial program as loss function, which allows learning feasible multicut solutions in a scalable way. Thus, we provide the first approach towards end-to-end trainable multicuts. Our findings support that GNN approaches can produce good solutions in practice while providing lower computation times and largely improved scalability compared to LP solvers and optimized heuristics, especially when considering large instances.
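The key trick is replacing the hard ILP cycle constraints with a differentiable surrogate on the predicted edge probabilities. A minimal sketch, relaxed over triangles only (the paper's polynomial program is more general; the hinge form and weighting below are assumptions):

```python
# Illustrative surrogate loss for training a GNN that outputs edge cut
# probabilities: multicut objective on soft labels plus a penalty for
# cycle violations, here restricted to 3-cycles.
import torch

def multicut_surrogate_loss(edge_costs, p, triangles, lam=1.0):
    """edge_costs: (E,) real-valued costs. p: (E,) predicted cut
    probabilities in [0, 1]. triangles: (T, 3) long tensor of indices
    into the edge list, each row the three edges of one 3-cycle."""
    objective = (edge_costs * p).sum()        # soft multicut objective
    a = p[triangles[:, 0]]
    b = p[triangles[:, 1]]
    c = p[triangles[:, 2]]
    # a valid multicut never cuts exactly one edge of a cycle:
    violation = (torch.relu(a - b - c)
                 + torch.relu(b - a - c)
                 + torch.relu(c - a - b)).sum()
    return objective + lam * violation
```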
TATL: Task Agnostic Transfer Learning for Skin Attributes Detection
D. H. M. Nguyen, T. T. Nguyen, H. Vu, Q. Pham, B. T. Nguyen, D. Sonntag and M.-D. Nguyen
Medical Image Analysis, Volume 78, 2022
Impact of Realistic Properties of the Point Spread Function on Classification Tasks to Reveal a Possible Distribution Shift
P. Müller, A. Braun and M. Keuper
NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications (NeurIPS 2022 Workshop DistShift), 2022
Optimizing Edge Detection for Image Segmentation with Multicut Penalties
S. Jung, S. Ziegler, A. Kardoost and M. Keuper
Pattern Recognition (GCPR 2022), 2022
(Accepted/in press)
Abstract
The Minimum Cost Multicut Problem (MP) is a popular way of obtaining a graph decomposition by optimizing binary edge labels over edge costs. While the formulation of an MP from independently estimated costs per edge is highly flexible and intuitive, solving the MP is NP-hard and time-expensive. As a remedy, recent work proposed to predict edge probabilities with awareness of potential conflicts by incorporating cycle constraints in the prediction process. We argue that such a formulation, while providing a first step towards end-to-end learnable edge weights, is suboptimal, since it is built upon a loose relaxation of the MP. We therefore propose an adaptive CRF that allows progressively considering more violated constraints and, in consequence, issuing solutions with higher validity. Experiments on the BSDS500 benchmark for natural image segmentation as well as on electron microscopic recordings show that our approach yields more precise edge detection and image segmentation.
Lifted Edges as Connectivity Priors for Multicut and Disjoint Paths
A. Horňáková
PhD Thesis, Universität des Saarlandes, 2022
Understanding and Improving Robustness and Uncertainty Estimation in Deep Learning
D. Stutz
PhD Thesis, Universität des Saarlandes, 2022
Abstract
Deep learning is becoming increasingly relevant for many high-stakes applications such as autonomous driving or medical diagnosis where wrong decisions can have massive impact on human lives. Unfortunately, deep neural networks are typically assessed solely based on generalization, e.g., accuracy on a fixed test set. However, this is clearly insufficient for safe deployment as potential malicious actors and distribution shifts or the effects of quantization and unreliable hardware are disregarded. Thus, recent work additionally evaluates performance on potentially manipulated or corrupted inputs as well as after quantization and deployment on specialized hardware. In such settings, it is also important to obtain reasonable estimates of the model's confidence alongside its predictions. This thesis studies robustness and uncertainty estimation in deep learning along three main directions: First, we consider so-called adversarial examples, slightly perturbed inputs causing severe drops in accuracy. Second, we study weight perturbations, focusing particularly on bit errors in quantized weights. This is relevant for deploying models on special-purpose hardware for efficient inference, so-called accelerators. Finally, we address uncertainty estimation to improve robustness and provide meaningful statistical performance guarantees for safe deployment.

In detail, we study the existence of adversarial examples with respect to the underlying data manifold. In this context, we also investigate adversarial training which improves robustness by augmenting training with adversarial examples at the cost of reduced accuracy. We show that regular adversarial examples leave the data manifold in an almost orthogonal direction. While we find no inherent trade-off between robustness and accuracy, this contributes to a higher sample complexity as well as severe overfitting of adversarial training. Using a novel measure of flatness in the robust loss landscape with respect to weight changes, we also show that robust overfitting is caused by converging to particularly sharp minima. In fact, we find a clear correlation between flatness and good robust generalization.

Further, we study random and adversarial bit errors in quantized weights. In accelerators, random bit errors occur in the memory when reducing voltage with the goal of improving energy-efficiency. Here, we consider a robust quantization scheme, use weight clipping as regularization and perform random bit error training to improve bit error robustness, allowing considerable energy savings without requiring hardware changes. In contrast, adversarial bit errors are maliciously introduced through hardware- or software-based attacks on the memory, with severe consequences on performance. We propose a novel adversarial bit error attack to study this threat and use adversarial bit error training to improve robustness and thereby also the accelerator's security.

Finally, we view robustness in the context of uncertainty estimation. By encouraging low-confidence predictions on adversarial examples, our confidence-calibrated adversarial training successfully rejects adversarial, corrupted as well as out-of-distribution examples at test time. Thereby, we are also able to improve the robustness-accuracy trade-off compared to regular adversarial training. However, even robust models do not provide any guarantee for safe deployment. To address this problem, conformal prediction allows the model to predict confidence sets with a user-specified guarantee of including the true label. Unfortunately, as conformal prediction is usually applied after training, the model is trained without taking this calibration step into account. To address this limitation, we propose conformal training, which allows training conformal predictors end-to-end with the underlying model. This not only improves the obtained uncertainty estimates but also enables optimizing application-specific objectives without losing the provided guarantee.

Besides our work on robustness and uncertainty, we also address the problem of 3D shape completion of partially observed point clouds. Specifically, we consider an autonomous driving or robotics setting where vehicles are commonly equipped with LiDAR or depth sensors and obtaining a complete 3D representation of the environment is crucial. However, ground truth shapes that are essential for applying deep learning techniques are extremely difficult to obtain. Thus, we propose a weakly-supervised approach that can be trained on the incomplete point clouds while offering efficient inference.

In summary, this thesis contributes to our understanding of robustness against both input and weight perturbations. To this end, we also develop methods to improve robustness alongside uncertainty estimation for safe deployment of deep learning methods in high-stakes applications. In the particular context of autonomous driving, we also address 3D shape completion of sparse point clouds.
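The calibration step that conformal training builds on is the standard split conformal recipe, sketched below for intuition (this is the textbook procedure, not the thesis's end-to-end method; the score 1 - p(true class) is one common choice):

```python
# Split conformal prediction: calibrate a threshold on held-out data so
# that predicted sets contain the true label with probability >= 1 - alpha.
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (n, C) softmax outputs on calibration data.
    cal_labels: (n,) true labels. Returns the score threshold tau."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample correction
    return np.quantile(scores, min(q, 1.0))

def predict_set(probs, tau):
    """All classes whose score 1 - p stays below the threshold."""
    return np.where(1.0 - probs <= tau)[0]
```

Because the threshold is usually fit only after training, the model never sees this step; conformal training differentiates through it so the set size itself can be optimized.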
TOCH: Spatio-Temporal Object Correspondence to Hand for Motion Refinement
K. Zhou, B. L. Bhatnagar, J. E. Lenssen and G. Pons-Moll
Technical Report, 2022
(arXiv: 2205.07982)
Abstract
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous work focuses on static grasps and contacts. The core of our method is TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects. We will release our code and trained model on our project page at http://virtualhumans.mpi-inf.mpg.de/toch/
2021
(SP)2Net for Generalized Zero-Label Semantic Segmentation
A. Das, Y. Xian, Y. He, B. Schiele and Z. Akata
Pattern Recognition (GCPR 2021), 2021