Publications - Current Year
2024
- “CloSe: A 3D Clothing Segmentation Dataset and Model,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Interaction Replica: Tracking Human–Object Interaction and Scene Changes From Human Motion,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “Generating Continual Human Motion in Diverse 3D Scenes,” in 3DV 2024, 11th International Conference on 3D Vision, Davos, Switzerland, 2024.
- “B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
Abstract
B-cos Networks have been shown to be effective for obtaining highly human
interpretable explanations of model decisions by architecturally enforcing
stronger alignment between inputs and weights. B-cos variants of convolutional
networks (CNNs) and vision transformers (ViTs), which primarily replace linear
layers with B-cos transformations, perform competitively to their respective
standard variants while also yielding explanations that are faithful by design.
However, it has so far been necessary to train these models from scratch, which
is increasingly infeasible in the era of large, pre-trained foundation models.
In this work, inspired by the architectural similarities in standard DNNs and
B-cos networks, we propose 'B-cosification', a novel approach to transform
existing pre-trained models to become inherently interpretable. We perform a
thorough study of design choices to perform this conversion, both for
convolutional neural networks and vision transformers. We find that
B-cosification can yield models that are on par with B-cos models trained from
scratch in terms of interpretability, while often outperforming them in terms
of classification performance at a fraction of the training cost. Subsequently,
we apply B-cosification to a pretrained CLIP model, and show that, even with
limited data and compute cost, we obtain a B-cosified version that is highly
interpretable and competitive on zero shot performance across a variety of
datasets. We release our code and pre-trained model weights at
github.com/shrebox/B-cosification.
- “Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Vancouver, Canada, 2024.
- “Recent Trends in 3D Reconstruction of General Non-Rigid Scenes,” Computer Graphics Forum (Proc. EUROGRAPHICS 2024), vol. 43, no. 2, 2024.
- “Improving Feature Stability during Upsampling - Spectral Artifacts and the Importance of Spatial Context,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Good Teachers Explain: Explanation-Enhanced Knowledge Distillation,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “HowToCaption: Prompting LLMs to Transform Video Annotations at Scale,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “GiT: Towards Generalist Vision Transformer through Universal Language Interface,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “LatentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Improving 2D Feature Representations by 3D-Aware Fine-Tuning,” in Computer Vision -- ECCV 2024, Milan, Italy, 2024.
- “Sp2360: Sparse-view 360° Scene Reconstruction using Cascaded 2D Diffusion Priors,” in ECCV 2024 Workshop on Wild 3D (ECCV 2024 Wild3D), Milan, Italy, 2024.
- “Domain-Aware Fine-Tuning of Foundation Models,” in ICML 2024 Workshop on Foundation Models in the Wild (ICML 2024 FM-Wild Workshop), Vienna, Austria, 2024.
- “OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Training Vision Transformers for Semi-Supervised Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Open-Vocabulary 3D Semantic Segmentation with Foundation Models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Point Transformer V3: Simpler, Faster, Stronger,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “GEARS: Local Geometry-aware Hand-object Interaction Synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 2024.
- “Task Driven Sensor Layouts - Joint Optimization of Pixel Layout and Network Parameters,” in IEEE International Conference on Computational Photography (ICCP 2024), Lausanne, Switzerland, 2024.
- “Automated Dominative Subspace Mining for Efficient Neural Architecture Search,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, 2024.
- “Enhanced Long-Tailed Recognition With Contrastive CutMix Augmentation,” IEEE Transactions on Image Processing, vol. 33, 2024.
- “Better Understanding Differences in Attribution Methods via Systematic Evaluations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, 2024.
- “MTR++: Multi-Agent Motion Prediction With Symmetric Scene Modeling and Guided Intention Querying,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, 2024.
- “Intra- & Extra-Source Exemplar-Based Style Synthesis for Improved Domain Generalization,” International Journal of Computer Vision, vol. 132, 2024.
- “Toward a Diffusion-Based Generalist for Dense Vision Tasks,” in MMFM2, The 2nd Workshop on What is Next in Multimodal Foundation Models?, Seattle, WA, USA, 2024.
- “CosPGD: An Efficient White-Box Adversarial Attack for Pixel-Wise Prediction Tasks,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Implicit Representations for Constrained Image Segmentation,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “MultiMax: Sparse and Multi-Modal Attention Learning,” in Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 2024.
- “Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “On Adversarial Training without Perturbing all Examples,” in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 2024.
- “Learning the essential in less than 2k additional weights - a simple approach to improve image classification stability under corruptions,” Transactions on Machine Learning Research, vol. 2024, no. 6, 2024.
- “As large as it gets - Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters,” Transactions on Machine Learning Research, vol. 2024, 2024.
- “Wakening Past Concepts without Past Data: Class-Incremental Learning from Online Placebos,” in WACV 2024, IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2024.
- “Efficient and Differentiable Combinatorial Optimization for Visual Computing,” Universität des Saarlandes, Saarbrücken, 2024.
- “Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing,” Universität des Saarlandes, Saarbrücken, 2024.
- “Increasing Interpretability of Deep Neural Networks via B-cosification,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Understanding the decisions of deep neural networks (DNNs) has been a challenging task due to their ‘black-box’ nature. Methods such as feature attributions that attempt to explain the decisions of such models post-hoc, while popular, have been shown to often yield explanations that are not faithful to the model. Recently, B-cos networks were proposed as a means of instead designing such networks to be inherently interpretable by architecturally enforcing stronger alignment between inputs and weights, yielding highly human interpretable explanations that are model-faithful by design. However, unlike with post-hoc methods, this requires training new models from scratch, which represents a major hurdle for establishing such novel models as an alternative to existing ones, in particular due to the increasing reliance on large, pre-trained foundational models. In this work, inspired by the architectural similarities in standard DNNs and B-cos networks, we propose ‘B-cosification’, a novel approach to transform existing pre-trained models to become inherently interpretable. We perform a thorough study of design choices to perform this conversion, both for convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in terms of classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to CLIP models, and show that, even with limited data and compute cost, we obtain B-cosified CLIP models that are highly interpretable and are competitive on zero shot and linear probe performance across a variety of datasets.
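To make the B-cos idea above concrete, the following is a minimal PyTorch sketch of a B-cos linear layer, in which the response of unit-norm weights is rescaled by the input-weight alignment raised to the power B-1. This is an illustrative sketch only: the class name, the initialization, and the default B=2 are assumptions and do not reproduce the authors' released implementation (github.com/shrebox/B-cosification).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    """Illustrative B-cos linear layer (not the authors' code): the response of
    unit-norm weights is scaled by |cos(x, w)|^(B-1), so large outputs require
    strong alignment between the input and the weight vector."""

    def __init__(self, in_features: int, out_features: int, b: float = 2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)  # assumption; a B-cosified layer would reuse pre-trained weights
        self.b = b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_hat = F.normalize(self.weight, dim=1)               # unit-norm weight rows
        linear = F.linear(x, w_hat)                           # w_hat . x
        cos = linear / (x.norm(dim=-1, keepdim=True) + 1e-6)  # cos(x, w_hat), since ||w_hat|| = 1
        return cos.abs().pow(self.b - 1) * linear             # |cos|^(B-1) * (w_hat . x)
```

B-cosification, as studied in the thesis, would start from the weights of an existing pre-trained layer rather than a fresh initialization before fine-tuning; the released code documents the actual conversion procedure.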
- “Towards Designing Inherently Interpretable Deep Neural Networks for Image Classification,” Universität des Saarlandes, Saarbrücken, 2024.
- “Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12489.
Abstract
In this work, we introduce Scribbles for All, a label and training data
generation algorithm for semantic segmentation trained on scribble labels.
Training or fine-tuning semantic segmentation models with weak supervision has
become an important topic recently and was subject to significant advances in
model quality. In this setting, scribbles are a promising label type to achieve
high quality segmentation results while requiring a much lower annotation
effort than usual pixel-wise dense semantic segmentation annotations. The main
limitation of scribbles as source for weak supervision is the lack of
challenging datasets for scribble segmentation, which hinders the development
of novel methods and conclusive evaluations. To overcome this limitation,
Scribbles for All provides scribble labels for several popular segmentation
datasets and provides an algorithm to automatically generate scribble labels
for any dataset with dense annotations, paving the way for new insights and
model advancements in the field of weakly supervised segmentation. In addition
to providing datasets and algorithm, we evaluate state-of-the-art segmentation
models on our datasets and show that models trained with our synthetic labels
perform competitively with respect to models trained on manual labels. Thus,
our datasets enable state-of-the-art research into methods for scribble-labeled
semantic segmentation. The datasets, scribble generation algorithm, and
baselines are publicly available at github.com/wbkit/Scribbles4All.
- “Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,” Universität des Saarlandes, Saarbrücken, 2024.
- “MINO: MOTR with Improved DeNoising Anchor Boxes for End-to-End Multiple Object Tracking,” Universität des Saarlandes, Saarbrücken, 2024.
- “Sailing in High-dimensional Spaces: Low-dimensional Embeddings through Angle Preservation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.09876.
Abstract
Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in
science and engineering. They allow us to quickly understand the main
properties of the data, identify outliers and processing errors, and inform the
next steps of data analysis. As such, LDEs have to be faithful to the original
high-dimensional data, i.e., they should represent the relationships that are
encoded in the data, both at a local as well as global scale. The current
generation of LDE approaches focuses on reconstructing local distances between
any pair of samples correctly, often outperforming traditional approaches
aiming at all distances. For these approaches, global relationships are,
however, usually strongly distorted, often argued to be an inherent trade-off
between local and global structure learning for embeddings. We suggest a new
perspective on LDE learning, reconstructing angles between data points. We show
that this approach, Mercat, yields good reconstruction across a diverse set of
experiments and metrics, and preserves structures well across all scales.
Compared to existing work, our approach also has a simple formulation,
facilitating future theoretical analysis and algorithmic improvements.
- “Advancing Image and Video Recognition with Less Supervision,” Universität des Saarlandes, Saarbrücken, 2024.
Abstract
Deep learning is increasingly relevant in our daily lives, as it simplifies tedious tasks and enhances quality of life across various domains such as entertainment, learning, automatic assistance, and autonomous driving. However, the demand for more data to train models for emerging tasks is increasing dramatically. Deep learning models heavily depend on the quality and quantity of data, necessitating high-quality labeled datasets. Yet, each task requires different types of annotations for training and evaluation, posing challenges in obtaining comprehensive supervision. The acquisition of annotations is not only resource-intensive in terms of time and cost but also introduces biases, such as granularity in classification, where distinctions like specific breeds versus generic categories may arise. Furthermore, the dynamic nature of the world means that previously annotated data can become irrelevant, while new categories and rare occurrences continually emerge, making it impossible to label every aspect of the world.
Therefore, this thesis aims to explore various supervision scenarios to mitigate the need for full supervision and reduce data acquisition costs. Specifically, we investigate learning without labels, referred to as self-supervised and unsupervised methods, to better understand video and image representations. To learn from data without labels, we leverage injected priors such as motion speed, direction, action order in videos, or semantic information granularity to obtain powerful data representations. Further, we study scenarios involving reduced supervision levels. To reduce annotation costs, first, we propose to omit precise annotations for one modality in multimodal learning, namely in text-video and image-video settings, and transfer available knowledge to large corpora of video data. Second, we study semi-supervised learning scenarios, where only a subset of annotated data alongside unlabeled data is available, and propose to revisit regularization constraints and improve generalization to unlabeled data. Additionally, we address scenarios where parts of the available data are inherently limited due to privacy and security reasons or naturally rare events, which not only restrict annotations but also limit the overall data volume. For these scenarios, we propose methods that carefully balance previously obtained knowledge and incoming limited data by introducing a calibration method or combining a space reservation technique with orthogonality constraints. Finally, we explore multimodal and unimodal open-world scenarios where the model is asked to generalize beyond the given set of object or action classes. Specifically, we propose a new challenging setting on multimodal egocentric videos and propose an adaptation method for vision-language models to generalize to the egocentric domain. Moreover, we study unimodal image recognition in an open-set setting and propose to disentangle the open-set detection and image classification tasks, which effectively improves generalization in different settings.
In summary, this thesis investigates challenges arising when full supervision for training models is not available. We develop methods to understand learning dynamics and the role of biases in data, while also proposing novel setups to advance training with less supervision.
- “Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01505.
Abstract
Reasoning over sports videos for question answering is an important task with
numerous applications, such as player training and information retrieval.
However, this task has not been explored due to the lack of relevant datasets
and the challenging nature it presents. Most datasets for video question
answering (VideoQA) focus mainly on general and coarse-grained understanding of
daily-life videos, which is not applicable to sports scenarios requiring
professional action understanding and fine-grained motion analysis. In this
paper, we introduce the first dataset, named Sports-QA, specifically designed
for the sports VideoQA task. The Sports-QA dataset includes various types of
questions, such as descriptions, chronologies, causalities, and counterfactual
conditions, covering multiple sports. Furthermore, to address the
characteristics of the sports VideoQA task, we propose a new Auto-Focus
Transformer (AFT) capable of automatically focusing on particular scales of
temporal information for question answering. We conduct extensive experiments
on Sports-QA, including baseline studies and the evaluation of different
methods. The results demonstrate that our AFT achieves state-of-the-art
performance.
- “A Good Teacher Explains: Explanation-enhanced Knowledge Distillation,” Universität des Saarlandes, Saarbrücken, 2024.
- “SplatFusion: Sparse-view 360° Scene Reconstruction using 2D Diffusion Priors,” Universität des Saarlandes, Saarbrücken, 2024.
- “Implicit Surface Reconstruction from Noisy Point-clouds using Local Priors,” Universität des Saarlandes, Saarbrücken, 2024.
- “TransEnD: An End-to-End Autonomous Driving System,” Universität des Saarlandes, Saarbrücken, 2024.
- “Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” 2024. [Online]. Available: https://arxiv.org/abs/2410.01806.
Abstract
Multiple object tracking in complex scenarios - such as coordinated dance
performances, team sports, or dynamic animal groups - presents unique
challenges. In these settings, objects frequently move in coordinated patterns,
occlude each other, and exhibit long-term dependencies in their trajectories.
However, how to model long-range dependencies within tracklets,
interdependencies among tracklets, and the associated temporal occlusions
remains a key open research question. To this end, we introduce Samba, a novel
linear-time set-of-sequences model designed to jointly process multiple
tracklets by synchronizing the multiple selective state-spaces used to model
each tracklet. Samba autoregressively predicts the future track query for each
sequence while maintaining synchronized long-term memory representations across
tracklets. By integrating Samba into a tracking-by-propagation framework, we
propose SambaMOTR, the first tracker effectively addressing the aforementioned
issues, including long-range dependencies, tracklet interdependencies, and
temporal occlusions. Additionally, we introduce an effective technique for
dealing with uncertain observations (MaskObs) and an efficient training recipe
to scale SambaMOTR to longer sequences. By modeling long-range dependencies and
interactions among tracked objects, SambaMOTR implicitly learns to track
objects accurately through occlusions without any hand-crafted heuristics. Our
approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT,
and SportsMOT datasets.
- “TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” 2024. [Online]. Available: https://arxiv.org/abs/2410.23168.
Abstract
Transformers have become the predominant architecture in foundation models
due to their excellent performance across various domains. However, the
substantial cost of scaling these models remains a significant concern. This
problem arises primarily from their dependence on a fixed number of parameters
within linear projections. When architectural modifications (e.g., channel
dimensions) are introduced, the entire model typically requires retraining from
scratch. As model sizes continue growing, this strategy results in increasingly
high computational costs and becomes unsustainable. To overcome this problem,
we introduce TokenFormer, a natively scalable architecture that leverages the
attention mechanism not only for computations among input tokens but also for
interactions between tokens and model parameters, thereby enhancing
architectural flexibility. By treating model parameters as tokens, we replace
all the linear projections in Transformers with our token-parameter attention
layer, where input tokens act as queries and model parameters as keys and
values. This reformulation allows for progressive and efficient scaling without
necessitating retraining from scratch. Our model scales from 124M to 1.4B
parameters by incrementally adding new key-value parameter pairs, achieving
performance comparable to Transformers trained from scratch while greatly
reducing training costs. Code and models are available at
https://github.com/Haiyang-W/TokenFormer.
- “FaceGPT: Self-supervised Learning to Chat about 3D Human Faces,” 2024. [Online]. Available: https://arxiv.org/abs/2406.07163.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large
Vision-Language Models (VLMs) to reason about 3D human faces from images and
text. Typical 3D face reconstruction methods are specialized algorithms that
lack semantic reasoning capabilities. FaceGPT overcomes this limitation by
embedding the parameters of a 3D morphable face model (3DMM) into the token
space of a VLM, enabling the generation of 3D faces from both textual and
visual inputs. FaceGPT is trained in a self-supervised manner as a model-based
autoencoder from in-the-wild images. In particular, the hidden state of the LLM
is projected into 3DMM parameters and subsequently rendered as a 2D face image to
guide the self-supervised learning process via image-based reconstruction.
Without relying on expensive 3D annotations of human faces, FaceGPT obtains a
detailed understanding about 3D human faces, while preserving the capacity to
understand general user instructions. Our experiments demonstrate that FaceGPT
not only achieves high-quality 3D face reconstructions but also retains the
ability for general-purpose visual instruction following. Furthermore, FaceGPT
learns fully self-supervised to generate 3D faces based on complex textual
inputs, which opens a new direction in human face analysis.
- “Number it: Temporal Grounding Videos like Flipping Manga,” 2024. [Online]. Available: https://arxiv.org/abs/2411.10332.
Abstract
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend
this visual understanding to tasks requiring precise temporal localization,
known as Video Temporal Grounding (VTG). To address this gap, we introduce
Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual
comprehension with temporal grounding by adding unique numerical identifiers to
each video frame. Treating a video as a sequence of numbered frame images,
NumPro transforms VTG into an intuitive process: flipping through manga panels
in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking
visual content with corresponding temporal information. Our experiments
demonstrate that NumPro significantly boosts VTG performance of top-tier
Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a
NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing
previous top-performing methods by up to 6.9% in mIoU for moment retrieval and
8.5% in mAP for highlight detection. The code will be available at
github.com/yongliang-wu/NumPro.
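As an illustration of the frame-numbering idea described above, the following is a minimal sketch of overlaying a unique numerical identifier on each video frame before the frames are passed to a Vid-LLM. The function name, font, colour, and placement are illustrative assumptions, not the paper's exact configuration; the authors' code at github.com/yongliang-wu/NumPro is the reference.

```python
from PIL import ImageDraw

def add_frame_numbers(frames):
    """Overlay a unique frame index on each frame (NumPro-style sketch) so a
    Vid-LLM can refer to moments by number. `frames` is a list of PIL.Image
    objects; position, colour, and font are assumptions, not the paper's setup."""
    numbered = []
    for idx, frame in enumerate(frames):
        frame = frame.convert("RGB")           # work on a copy in RGB mode
        draw = ImageDraw.Draw(frame)
        draw.text((10, 10), str(idx), fill=(255, 0, 0))  # default PIL font
        numbered.append(frame)
    return numbered
```

In the training-free setting described in the abstract, the identifiers are simply drawn into the pixels, so the Vid-LLM itself is left unchanged and can report the frame numbers that bound an event.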