Publications - Current Year

2025

1
Conference paper
D2
V. Guzov, Y. Jiang, F. Hong, G. Pons-Moll, R. Newcombe, C. K. Liu, Y. Ye, and L. Ma
“HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
2
Conference paper
D2
C. Li, J. Chibane, Y. He, N. Pearl, A. Geiger, and G. Pons-Moll
“Unimotion: Unifying 3D Human Motion Synthesis and Understanding,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
3
Conference paper
D2
K. Raj, C. Wewer, R. Yunus, E. Ilg, and J. E. Lenssen
“Spurfies: Sparse-view Surface Reconstruction using Local Geometry Priors,” in 3DV 2025, International Conference on 3D Vision, Singapore.
4
Conference paper
D2
T. Wimmer, M. Oechsle, M. Niemeyer, and F. Tombari
“Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes,” in 3DV 2025, International Conference on 3D Vision, Singapore.
Abstract
State-of-the-art novel view synthesis methods achieve impressive results for
multi-view captures of static 3D scenes. However, the reconstructed scenes
still lack "liveliness," a key component for creating engaging 3D experiences.
Recently, novel video diffusion models generate realistic videos with complex
motion and enable animations of 2D images; however, they cannot naively be used
to animate 3D scenes as they lack multi-view consistency. To breathe life into
the static world, we propose Gaussians2Life, a method for animating parts of
high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is
to leverage powerful video diffusion models as the generative component of our
model and to combine these with a robust technique to lift 2D videos into
meaningful 3D motion. We find that, in contrast to prior work, this enables
realistic animations of complex, pre-existing 3D scenes and further enables the
animation of a large variety of object classes, while related work is mostly
focused on prior-based character animation, or single 3D objects. Our model
enables the creation of consistent, immersive 3D experiences for arbitrary
scenes.
5
Conference paper
D2
X. Xie, J. E. Lenssen, and G. Pons-Moll
“InterTrack: Tracking Human Object Interaction without Object Templates,” in 3DV 2025, International Conference on 3D Vision, Singapore.
6
Conference paper
D2
X. Zhang, B. L. Bhatnagar, S. Starke, I. A. Petrov, V. Guzov, H. Dhamo, E. Pérez Pellitero, and G. Pons-Moll
“FORCE: Dataset and Method for Intuitive Physics Guided Human-object Interaction,” in 3DV 2025, International Conference on 3D Vision, Singapore.
7
Article
D2
B. Andres, S. Di Gregorio, J. Irmai, and J.-H. Lange
“Corrigendum to ‘A polyhedral study of lifted multicuts’ [Discrete Optim. 47 (2023) 100757],” Discrete Optimization, vol. 55, 2025.
8
Conference paper
D2
M. Asim, C. Wewer, T. Wimmer, B. Schiele, and J. E. Lenssen
“MEt3R: Measuring Multi-View Consistency in Generated Images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
9
Conference paper
D2
P. Flotho, M. Piening, A. Kukleva, and G. Steidl
“T-FAKE: Synthesizing Thermal Images for Facial Landmarking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
10
Conference paper
D2
F. Hong, V. Guzov, H. J. Kim, Y. Ye, R. Newcombe, Z. Liu, and L. Ma
“EgoLM: Multi-Modal Language Model of Egocentric Motions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
11
Conference paper
D2
X. Hu, H. Wang, J. E. Lenssen, and B. Schiele
“PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a
general StableDiffusion model with a personalized face diffusion (PFD) model to
generate identity-consistent human-object interaction (HOI) images. While
existing PFD models have advanced significantly, they often overemphasize
facial features at the expense of full-body coherence. PersonaHOI introduces an
additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By
incorporating cross-attention constraints in the PFD branch and spatial merging
at both latent and residual levels, PersonaHOI preserves personalized facial
details while ensuring interactive non-facial regions. Experiments, validated
by a novel interaction alignment metric, demonstrate the superior realism and
scalability of PersonaHOI, establishing a new standard for practical
personalized face with HOI generation. Our code will be available at
github.com/JoyHuYY1412/PersonaHOI
12
Conference paper
D2
N. Shvetsova, A. Nagrani, B. Schiele, H. Kuehne, and C. Rupprecht
“Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
13
Conference paper
D2
F. Vogel, W. Bousselham, A. Kukleva, N. Shvetsova, and H. Kuehne
“VideoGEM: Training-free Action Grounding in Videos,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
14
Conference paper
D2
Y. Wu, X. Hu, Y. Sun, Y. Zhou, W. Zhu, F. Rao, B. Schiele, and X. Yang
“Number it: Temporal Grounding Videos like Flipping Manga,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
15
Conference paper
D2
J. Xie, A. Tonioni, N. Rauschmayr, F. Tombari, and B. Schiele
“Test-Time Visual In-Context Tuning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA.
16
Conference paper
D2
T. Medi, S. Jung, and M. Keuper
“FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
17
Conference paper
D2
K. Prasse, I. Bravo, S. Walter, and M. Keuper
“I Spy with My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
18
Conference paper
D2
Y. Liu, C. Graf, M. Spies, and M. Keuper
“Segment any Repeated Object,” in IEEE International Conference on Robotics and Automation (ICRA 2025), Hyderabad, India.
19
Article
D2
J. Lukasik, M. Moeller, and M. Keuper
“An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, vol. 133, 2025.
20
Conference paper
D2
R. Hesse, J. Fischer, S. Schaub-Meyer, and S. Roth
“Disentangling Polysemantic Channels in Convolutional Neural Networks,” in The First Workshop on Mechanistic Interpretability for Vision (MIV 2025), Nashville, TN, USA.
21
Conference paper
D2
I. Hossain, J. Fischer, R. Burkholz, and J. Quackenbush
“Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in The Second Conference on Parsimony and Learning Recent Spotlight Track (CPAL 2025), Stanford, CA, USA, 2025.
22
Conference paper
D2
Y. Li, W. Beluch, M. Keuper, D. Zhang, and A. Khoreva
“VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
23
Conference paper
D2
M. Segu, L. Piccinelli, S. Li, Y.-H. Yang, B. Schiele, and L. Van Gool
“Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
24
Conference paper
D2
S. Gairola, M. Böhle, F. Locatello, and B. Schiele
“How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
25
Conference paper
D2
P. Gavrikov, J. Lukasik, S. Jung, R. Geirhos, M. J. Mirza, M. Keuper, and J. Keuper
“Can We Talk Models Into Seeing the World Differently?,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
26
Conference paper
D2
H. Wang, Y. Fan, M. F. Naeem, Y. Xian, J. E. Lenssen, L. Wang, F. Tombari, and B. Schiele
“TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
27
Conference paper
D2
Y. Yuan, Z. Zhang, X. He, A. Nitta, W. Hu, D. Wang, M. Shah, S. Huang, B. Stojanovič, A. Krumholz, J. E. Lenssen, J. Leskovec, and M. Fey
“ContextGNN: Beyond Two-Tower Recommendation Systems,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
28
Paper
D2
J. Belouadi, E. Ilg, M. Keuper, H. Tanaka, M. Utiyama, R. Dabre, S. Eger, and S. P. Ponzetto
“TikZero: Zero-Shot Text-Guided Graphics Program Synthesis,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11509.
Abstract
With the rise of generative AI, synthesizing figures from text captions
becomes a compelling application. However, achieving high geometric precision
and editability requires representing figures as graphics programs in languages
like TikZ, and aligned training data (i.e., graphics programs with captions)
remains scarce. Meanwhile, large amounts of unaligned graphics programs and
captioned raster images are more readily available. We reconcile these
disparate data sources by presenting TikZero, which decouples graphics program
generation from text understanding by using image representations as an
intermediary bridge. It enables independent training on graphics programs and
captioned images and allows for zero-shot text-guided graphics program
synthesis during inference. We show that our method substantially outperforms
baselines that can only operate with caption-aligned graphics programs.
Furthermore, when leveraging caption-aligned graphics programs as a
complementary training signal, TikZero matches or exceeds the performance of
much larger models, including commercial systems like GPT-4o. Our code,
datasets, and select models are publicly available.
29
Paper
D2
C. Chen, E. Saha, J. Fischer, M. B. Guebila, V. Fanfani, K. H. Shutta, M. Padi, K. Glass, D. L. DeMeo, C. M. Lopes-Ramos, and J. Quackenbush
“Identifying Sex Differences in Lung Adenocarcinoma Using Multi-Omics Integrative Protein Signaling Networks.” bioRxiv, 2025.
30
Paper
D2
J. Erbach, D. Narnhofer, A. Dombos, B. Schiele, J. E. Lenssen, and K. Schindler
“Solving Inverse Problems with FLAIR,” 2025. [Online]. Available: https://arxiv.org/abs/2506.02680.
Abstract
Flow-based latent generative models such as Stable Diffusion 3 are able to
generate images with remarkable quality, even enabling photorealistic
text-to-image generation. Their impressive performance suggests that these
models should also constitute powerful priors for inverse imaging problems, but
that approach has not yet led to comparable fidelity. There are several key
obstacles: (i) the encoding into a lower-dimensional latent space makes the
underlying (forward) mapping non-linear; (ii) the data likelihood term is
usually intractable; and (iii) learned generative models struggle to recover
rare, atypical data modes during inference. We present FLAIR, a novel
training-free variational framework that leverages flow-based generative models as a
prior for inverse problems. To that end, we introduce a variational objective
for flow matching that is agnostic to the type of degradation, and combine it
with deterministic trajectory adjustments to recover atypical modes. To enforce
exact consistency with the observed data, we decouple the optimization of the
data fidelity and regularization terms. Moreover, we introduce a time-dependent
calibration scheme in which the strength of the regularization is modulated
according to off-line accuracy estimates. Results on standard imaging
benchmarks demonstrate that FLAIR consistently outperforms existing diffusion-
and flow-based methods in terms of reconstruction quality and sample diversity.
31
Paper
D2
A. Görgün, B. Schiele, and J. Fischer
“VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22399.
Abstract
Neural networks are widely adopted to solve complex and challenging tasks.
Especially in high-stakes decision-making, understanding their reasoning
process is crucial, yet proves challenging for modern deep networks. Feature
visualization (FV) is a powerful tool to decode what information neurons are
responding to and hence to better understand the reasoning behind such
networks. In particular, in FV we generate human-understandable images that
reflect the information detected by neurons of interest. However, current
methods often yield unrecognizable visualizations, exhibiting repetitive
patterns and visual artifacts that are hard to understand for a human. To
address these problems, we propose to guide FV through statistics of real image
features combined with measures of relevant network flow to generate
prototypical images. Our approach yields human-understandable visualizations
that both qualitatively and quantitatively improve over state-of-the-art FVs
across various architectures. As such, it can be used to decode which
information the network uses, complementing mechanistic circuits that identify
where it is encoded. Code is available at: github.com/adagorgun/VITAL
32
Paper
D2
R. Hesse, D. Bağcı, B. Schiele, S. Schaub-Meyer, and S. Roth
“Beyond Accuracy: What Matters in Designing Well-Behaved Models?,” 2025. [Online]. Available: https://arxiv.org/abs/2503.17110.
Abstract
Deep learning has become an essential part of computer vision, with deep
neural networks (DNNs) excelling in predictive performance. However, they often
fall short in other critical quality dimensions, such as robustness,
calibration, or fairness. While existing studies have focused on a subset of
these quality dimensions, none have explored a more general form of
"well-behavedness" of DNNs. With this work, we address this gap by
simultaneously studying nine different quality dimensions for image
classification. Through a large-scale study, we provide a bird's-eye view by
analyzing 326 backbone models and how different training paradigms and model
architectures affect the quality dimensions. We reveal various new insights,
including that (i) vision-language models exhibit high fairness on ImageNet-1k
classification and strong robustness against domain changes; (ii)
self-supervised learning is an effective training paradigm to improve almost
all considered quality dimensions; and (iii) the training dataset size is a
major driver for most of the quality dimensions. We conclude our study by
introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel
metric that ranks models across multiple dimensions of quality, enabling
tailored recommendations based on specific user needs.
33
Paper
D2
P. Müller, A. Braun, and M. Keuper
“Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models,” 2025. [Online]. Available: https://arxiv.org/abs/2504.18510.
Abstract
Deep neural networks (DNNs) have proven to be successful in various computer
vision applications such that models even infer in safety-critical situations.
Therefore, vision models have to behave in a robust way to disturbances such as
noise or blur. While seminal benchmarks exist to evaluate model robustness to
diverse corruptions, blur is often approximated in an overly simplistic way to
model defocus, while ignoring the different blur kernel shapes that result from
optical systems. To study model robustness against realistic optical blur
effects, this paper proposes two datasets of blur corruptions, which we denote
OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such
as coma, defocus, and astigmatism, i.e. aberrations that can be represented by
varying a single parameter of Zernike polynomials. To go beyond the principled
but synthetic setting of primary aberrations, LensCorruptions samples linear
combinations in the vector space spanned by Zernike polynomials, corresponding
to 100 real lenses. Evaluations for image classification and object detection
on ImageNet and MSCOCO show that for a variety of different pre-trained models,
the performance on OpticsBench and LensCorruptions varies significantly,
indicating the need to consider realistic image corruptions to evaluate a
model's robustness against blur.
34
Paper
D2D6
N. Pham, B. Schiele, A. Kortylewski, and J. Fischer
“Escaping Plato’s Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13429.
Abstract
With the rise of neural networks, especially in high-stakes applications,
these networks need two properties, (i) robustness and (ii) interpretability, to
ensure their safety. Recent advances in classifiers with 3D volumetric object
representations have demonstrated a greatly enhanced robustness in
out-of-distribution data. However, these 3D-aware classifiers have not been
studied from the perspective of interpretability. We introduce CAVE - Concept
Aware Volumes for Explanations - a new direction that unifies interpretability
and robustness in image classification. We design an inherently-interpretable
and robust classifier by extending existing 3D-aware classifiers with concepts
extracted from their volumetric representations for classification. In an array
of quantitative metrics for interpretability, we compare against different
concept-based approaches across the explainable AI literature and show that
CAVE discovers well-grounded concepts that are used consistently across images,
while achieving superior robustness.
35
Paper
D2
L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool
“UniK3D: Universal Camera Monocular 3D Estimation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16591.
Abstract
Monocular 3D estimation is crucial for visual perception. However, current
methods fall short by relying on oversimplified assumptions, such as pinhole
camera models or rectified images. These limitations severely restrict their
general applicability, causing poor performance in real-world scenarios with
fisheye or panoramic images and resulting in substantial context loss. To
address this, we present UniK3D, the first generalizable method for monocular
3D estimation able to model any camera. Our method introduces a spherical 3D
representation which allows for better disentanglement of camera and scene
geometry and enables accurate metric 3D reconstruction for unconstrained camera
models. Our camera component features a novel, model-independent representation
of the pencil of rays, achieved through a learned superposition of spherical
harmonics. We also introduce an angular loss, which, together with the camera
module design, prevents the contraction of the 3D outputs for wide-view
cameras. A comprehensive zero-shot evaluation on 13 diverse datasets
demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and
camera metrics, with substantial gains in challenging large-field-of-view and
panoramic settings, while maintaining top accuracy in conventional pinhole
small-field-of-view domains. Code and models are available at
github.com/lpiccinelli-eth/unik3d.
36
Paper
D2
L. Piccinelli, C. Sakaridis, Y.-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool
“UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler,” 2025. [Online]. Available: https://arxiv.org/abs/2502.20110.
Abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving
downstream tasks in 3D perception and modeling. However, the remarkable
accuracy of recent MMDE methods is confined to their training domains. These
methods fail to generalize to unseen domains even in the presence of moderate
domain gaps, which hinders their practical applicability. We propose a new
model, UniDepthV2, capable of reconstructing metric 3D scenes from solely
single images across domains. Departing from the existing MMDE paradigm,
UniDepthV2 directly predicts metric 3D points from the input image at inference
time without any additional information, striving for a universal and flexible
MMDE solution. In particular, UniDepthV2 implements a self-promptable camera
module predicting a dense camera representation to condition depth features.
Our model exploits a pseudo-spherical output representation, which disentangles
the camera and depth representations. In addition, we propose a geometric
invariance loss that promotes the invariance of camera-prompted depth features.
UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss
which enhances the localization and sharpness of edges in the metric depth
outputs, a revisited, simplified and more efficient architectural design, and
an additional uncertainty-level output which enables downstream tasks requiring
confidence. Thorough evaluations on ten depth datasets in a zero-shot regime
consistently demonstrate the superior performance and generalization of
UniDepthV2. Code and models are available at
github.com/lpiccinelli-eth/UniDepth
37
Paper
D2
K. Prasse, M. Kleinmann, I. Adam, K. Beckersjuergen, A. Edte, J. Frroku, T. Gumpp, S. Jung, I. Bravo, S. Walter, and M. Keuper
“Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09361.
Abstract
Climate change is one of the most pressing challenges of the 21st century,
sparking widespread discourse across social media platforms. Activists,
policymakers, and researchers seek to understand public sentiment and
narratives while access to social media data has become increasingly restricted
in the post-API era. In this study, we analyze a dataset of climate
change-related tweets from X (formerly Twitter) shared in 2019, containing 730k
tweets along with the shared images. Our approach integrates statistical
analysis, image classification, object detection, and sentiment analysis to
explore visual narratives in climate discourse. Additionally, we introduce a
graphical user interface (GUI) to facilitate interactive data exploration. Our
findings reveal key themes in climate communication, highlight sentiment
divergence between images and text, and underscore the strengths and
limitations of foundation models in analyzing social media imagery. By
releasing our code and tools, we aim to support future research on the
intersection of climate change, social media, and computer vision.
38
Paper
D2
K. Prasse, P. Knab, S. Marton, C. Bartelt, and M. Keuper
“DCBM: Data-Efficient Visual Concept Bottleneck Models,” 2025. [Online]. Available: https://arxiv.org/abs/2412.11576.
Abstract
Concept Bottleneck Models (CBMs) enhance the interpretability of neural
networks by basing predictions on human-understandable concepts. However,
current CBMs typically rely on concept sets extracted from large language
models or extensive image corpora, limiting their effectiveness in data-sparse
scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for
large sample sizes during concept generation while preserving interpretability.
DCBMs define concepts as image regions detected by segmentation or detection
foundation models, allowing each image to generate multiple concepts across
different granularities. This removes reliance on textual descriptions and
large-scale pre-training, making DCBMs applicable for fine-grained
classification and out-of-distribution tasks. Attribution analysis using
Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized
in test images. By leveraging dataset-specific concepts instead of predefined
ones, DCBMs enhance adaptability to new domains.
39
Paper
D2
R. Qorbani, G. Villani, T. Panagiotakopoulos, M. B. Colomer, L. Härenstam-Nielsen, M. Segu, P. L. Dovesi, J. Karlgren, D. Cremers, F. Tombari, and M. Poggi
“Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.21780.
Abstract
Open-vocabulary semantic segmentation models associate vision and text to
label pixels from an undefined set of classes using textual queries, providing
versatile performance on novel datasets. However, large shifts between training
and test domains degrade their performance, requiring fine-tuning for effective
real-world applications. We introduce Semantic Library Adaptation (SemLA), a
novel framework for training-free, test-time domain adaptation. SemLA leverages
a library of LoRA-based adapters indexed with CLIP embeddings, dynamically
merging the most relevant adapters based on proximity to the target domain in
the embedding space. This approach constructs an ad-hoc model tailored to each
specific input without additional training. Our method scales efficiently,
enhances explainability by tracking adapter contributions, and inherently
protects data privacy, making it ideal for sensitive applications.
Comprehensive experiments on a 20-domain benchmark built over 10 standard
datasets demonstrate SemLA's superior adaptability and performance across
diverse settings, establishing a new standard in domain adaptation for
open-vocabulary semantic segmentation.
40
Paper
D2
F. Sammani, J. Fischer, and N. Deligiannis
“Unlocking Open-Set Language Accessibility in Vision Models,” 2025. [Online]. Available: https://arxiv.org/abs/2503.10981.
41
Paper
D2
N. P. Walter, J. Vreeken, and J. Fischer
“Now You See Me! A Framework for Obtaining Class-relevant Saliency Maps,” 2025. [Online]. Available: https://arxiv.org/abs/2503.07346.
42
Paper
D2RG3
Y. Wang, S. Rao, J.-U. Lee, M. Jobanputra, and V. Demberg
“B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.
Abstract
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human-interpretable explanations than
post-hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.
43
Paper
D2
C. Wewer, B. Pogodzinski, B. Schiele, and J. E. Lenssen
“Spatial Reasoning with Denoising Models,” 2025. [Online]. Available: https://www.arxiv.org/abs/2502.21075.
Abstract
We introduce Spatial Reasoning Models (SRMs), a framework to perform
reasoning over sets of continuous variables via denoising generative models.
SRMs infer continuous representations on a set of unobserved variables, given
observations on observed variables. Current generative models on spatial
domains, such as diffusion and flow matching models, often collapse to
hallucination in case of complex distributions. To measure this, we introduce a
set of benchmark tasks that test the quality of complex reasoning in generative
models and can quantify hallucination. The SRM framework allows us to report key
findings about the importance of sequentialization in generation, the associated
order, as well as the sampling strategies during training. It demonstrates, for
the first time, that the order of generation can successfully be predicted by the
denoising network itself. Using these findings, we can increase the accuracy of
specific reasoning tasks from 1% to >50%.
44
Paper
D2
Y. Wu, Z. Li, X. Hu, X. Ye, X. Zeng, G. Yu, W. Zhu, B. Schiele, M.-H. Yang, and X. Yang
“KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.16707.
Abstract
Recent advances in multi-modal generative models have enabled significant
progress in instruction-based image editing. However, while these models
produce visually plausible outputs, their capacity for knowledge-based
reasoning editing tasks remains under-explored. In this paper, we introduce
KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a
diagnostic benchmark designed to assess models through a cognitively informed
lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks
across three foundational knowledge types: Factual, Conceptual, and Procedural.
Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning
dimensions and release 1,267 high-quality annotated editing instances. To
support fine-grained evaluation, we propose a comprehensive protocol that
incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints
and calibrated through human studies. Empirical results on 10 state-of-the-art
models reveal significant gaps in reasoning performance, highlighting the need
for knowledge-centric benchmarks to advance the development of intelligent
image editing systems.
