Publications - Current Year
2025
- “HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Unimotion: Unifying 3D Human Motion Synthesis and Understanding,” in 3DV 2025, 12th International Conference on 3D Vision, Singapore.
- “Spurfies: Sparse-view Surface Reconstruction using Local Geometry Priors,” in 3DV 2025, International Conference on 3D Vision, Singapore, 2025.
- “Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes,” in 3DV 2025, International Conference on 3D Vision, Singapore.
Abstract
State-of-the-art novel view synthesis methods achieve impressive results for
multi-view captures of static 3D scenes. However, the reconstructed scenes
still lack "liveliness," a key component for creating engaging 3D experiences.
Recently, novel video diffusion models generate realistic videos with complex
motion and enable animations of 2D images, however they cannot naively be used
to animate 3D scenes as they lack multi-view consistency. To breathe life into
the static world, we propose Gaussians2Life, a method for animating parts of
high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is
to leverage powerful video diffusion models as the generative component of our
model and to combine these with a robust technique to lift 2D videos into
meaningful 3D motion. We find that, in contrast to prior work, this enables
realistic animations of complex, pre-existing 3D scenes and further enables the
animation of a large variety of object classes, while related work is mostly
focused on prior-based character animation, or single 3D objects. Our model
enables the creation of consistent, immersive 3D experiences for arbitrary
scenes.
- “InterTrack: Tracking Human Object Interaction without Object Templates,” in 3DV 2025, International Conference on 3D Vision, Singapore, 2025.
- “FORCE: Dataset and Method for Intuitive Physics Guided Human-object Interaction,” in 3DV 2025, International Conference on 3D Vision, Singapore.
- “Multi-Omics Protein Signaling Networks Identify Sex-Specific Therapeutic Candidates in Lung Adenocarcinoma,” Biology of Sex Differences, vol. 16, 2025.
- “γ-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition,” in DAGM German Conference on Pattern Recognition (DAGM GCPR 2025), Freiburg, Germany.
Abstract
Most pattern recognition models are developed on pre-processed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose γ-Quant, i.e., the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4 bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via github.com/Mishalfatima/Gamma-Quant.
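To make the idea above concrete, here is a minimal PyTorch sketch of a learnable gamma-style quantizer with a straight-through estimator; the class name, initialization, and overall structure are illustrative assumptions, not the authors' released implementation (see github.com/Mishalfatima/Gamma-Quant for that).

```python
import torch
import torch.nn as nn

class LearnableGammaQuant(nn.Module):
    """Sketch of a task-specific, learnable non-linear quantizer (illustrative)."""
    def __init__(self, bits: int = 4, gamma_init: float = 0.45):
        super().__init__()
        self.levels = 2 ** bits - 1
        # learnable exponent of the power-law (gamma) compression curve
        self.log_gamma = nn.Parameter(torch.tensor(float(gamma_init)).log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gamma = self.log_gamma.exp()
        x = x.clamp(0.0, 1.0) ** gamma                   # non-linear compression
        q = torch.round(x * self.levels) / self.levels   # uniform grid in compressed space
        return x + (q - x).detach()                      # straight-through estimator

# toy usage: normalized 12-bit raw sensor values quantized to 4 bits
raw = torch.rand(2, 1, 8, 8)
print(LearnableGammaQuant(bits=4)(raw).shape)
```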
- “Corrigendum to ‘A polyhedral study of lifted multicuts’ [Discrete Optim. 47 (2023) 100757],” Discrete Optimization, vol. 55, 2025.
- “MEt3R: Measuring Multi-View Consistency in Generated Images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “T-FAKE: Synthesizing Thermal Images for Facial Landmarking,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “EgoLM: Multi-Modal Language Model of Egocentric Motions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
Abstract
We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. While existing PFD models have advanced significantly, they often overemphasize facial features at the expense of full-body coherence. PersonaHOI introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring interactive non-facial regions. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face generation with HOI. Our code will be available at github.com/JoyHuYY1412/PersonaHOI.
- “Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “VideoGEM: Training-free Action Grounding in Videos,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “Number it: Temporal Grounding Videos like Flipping Manga,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “Test-Time Visual In-Context Tuning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 2025.
- “SmartKC++: Improving Performance of Smartphone-Based Corneal Topographers,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
- “FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
- “I Spy with My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 2025.
- “Segment any Repeated Object,” in IEEE International Conference on Robotics and Automation (ICRA 2025), Atlanta, GA, USA, 2025.
- “Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
Abstract
Deep neural networks (DNNs) have proven to be successful in various computer
vision applications such that models even infer in safety-critical situations.
Therefore, vision models have to behave in a robust way to disturbances such as
noise or blur. While seminal benchmarks exist to evaluate model robustness to
diverse corruptions, blur is often approximated in an overly simplistic way to
model defocus, while ignoring the different blur kernel shapes that result from
optical systems. To study model robustness against realistic optical blur
effects, this paper proposes two datasets of blur corruptions, which we denote
OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such
as coma, defocus, and astigmatism, i.e. aberrations that can be represented by
varying a single parameter of Zernike polynomials. To go beyond the principled
but synthetic setting of primary aberrations, LensCorruptions samples linear
combinations in the vector space spanned by Zernike polynomials, corresponding
to 100 real lenses. Evaluations for image classification and object detection
on ImageNet and MSCOCO show that for a variety of different pre-trained models,
the performance on OpticsBench and LensCorruptions varies significantly,
indicating the need to consider realistic image corruptions to evaluate a
model's robustness against blur.
- “AIM: Amending Inherent Interpretability via Self-Supervised Masking,” in International Conference on Computer Vision (ICCV 2025), Honolulu, HI, USA.
Abstract
It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose "Amending Inherent Interpretability via Self-Supervised Masking" (AIM), a simple yet interestingly effective method that promotes the network's utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, HardImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine and meaningful features that directly contribute to improved generalization and human-aligned interpretability.
- “CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts,” IEEE, 2025.
Abstract
An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: genintel.github.io/CNS.
- “Robust Object Detection with Domain-Invariant Training and Continual Test-Time Adaptation,” International Journal of Computer Vision, vol. 133, 2025.
- “An Evaluation of Zero-Cost Proxies - From Neural Architecture Performance Prediction to Model Robustness,” International Journal of Computer Vision, vol. 133, 2025.
- “Pipeline Olympics: Continuable Benchmarking of Computational Workflows for DNA Methylation Sequencing Data Against an Experimental Gold Standard,” Nucleic Acids Research, vol. 53, no. 19, 2025.
- “Pixel-level Certified Explanations via Randomized Smoothing,” in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), Vancouver, Canada.
- “What’s the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL 2025), Vienna, Austria, 2025.
- “Spatial Reasoners for Continuous Variables in Any Domain,” in Proceedings of the ICML 2025 Workshop on Championing Open-source Development in Machine Learning (CODEML 2025), Vancouver, Canada, 2025.
- “Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025), Nashville, TN, USA.
- “Disentangling Polysemantic Channels in Convolutional Neural Networks,” in The First Workshop on Mechanistic Interpretability for Vision (MIV 2025), Nashville, TN, USA.
- “Pruning Neural Network Models for Gene Regulatory Dynamics Using Data and Domain Knowledge,” in The Second Conference on Parsimony and Learning Recent Spotlight Track (CPAL 2025), Stanford, CA, USA, 2025.
- “VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
- “Samba: Synchronized Set-of-Sequences Modeling for Multiple Object Tracking,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
- “How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
- “Can We Talk Models Into Seeing the World Differently?,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
- “TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore, 2025.
- “ContextGNN: Beyond Two-Tower Recommendation Systems,” in Thirteenth International Conference on Learning Representations (ICLR 2025), Singapore.
- “FlowBench: Benchmarking Optical Flow Estimation Methods for Reliability and Generalization,” Transactions on Machine Learning Research, vol. 2025, 2025.
- “Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models,” Transactions on Machine Learning Research, vol. 2025, no. 8, 2025.
Abstract
Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we also show that current methods that aim to mitigate harmful spurious features, do not take into account these factors, hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change.
- “DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05091.
Abstract
Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field.
To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: github.com/shashankskagnihotri/benchmarking_robustness/tree/disparity_estimation/final/disparity_estimation
- “SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification,” 2025.
Abstract
Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of total 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
- “A Granular Study of Safety Pretraining under Model Abliteration,” 2025. [Online]. Available: https://www.arxiv.org/abs/2510.02768.
Abstract
Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: github.com/shashankskagnihotri/safety_pretraining.
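For readers unfamiliar with model abliteration, the core inference-time edit is a projection that removes a single "refusal direction" from the hidden states. The sketch below is a toy illustration under that assumption; the direction here is random, whereas in practice it is estimated from contrasting harmful and harmless prompts, and this is not the paper's evaluation protocol.

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project hidden states onto the subspace orthogonal to `direction`."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

# toy illustration: random activations and a hypothetical refusal direction
h = torch.randn(4, 16, 768)        # (batch, tokens, hidden_dim)
refusal_dir = torch.randn(768)     # in practice: mean(harmful acts) - mean(harmless acts)
h_edited = ablate_direction(h, refusal_dir)
# the edited activations retain no component along the ablated direction
print((h_edited @ (refusal_dir / refusal_dir.norm())).abs().max())
```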
- “TikZero: Zero-Shot Text-Guided Graphics Program Synthesis,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11509.
Abstract
With the rise of generative AI, synthesizing figures from text captions
becomes a compelling application. However, achieving high geometric precision
and editability requires representing figures as graphics programs in languages
like TikZ, and aligned training data (i.e., graphics programs with captions)
remains scarce. Meanwhile, large amounts of unaligned graphics programs and
captioned raster images are more readily available. We reconcile these
disparate data sources by presenting TikZero, which decouples graphics program
generation from text understanding by using image representations as an
intermediary bridge. It enables independent training on graphics programs and
captioned images and allows for zero-shot text-guided graphics program
synthesis during inference. We show that our method substantially outperforms
baselines that can only operate with caption-aligned graphics programs.
Furthermore, when leveraging caption-aligned graphics programs as a
complementary training signal, TikZero matches or exceeds the performance of
much larger models, including commercial systems like GPT-4o. Our code,
datasets, and select models are publicly available.
- “Identifying Sex Differences in Lung Adenocarcinoma Using Multi-Omics Integrative Protein Signaling Networks,” bioRxiv, 2025.
- “Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels,” 2025. [Online]. Available: https://arxiv.org/abs/2506.05312.
Abstract
Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
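As a rough sketch of the cyclic-consistency filtering mentioned above, simplified to plain nearest-neighbour matching between two images (the paper's feature adapter, 3D-aware chaining, and spherical prototype constraints are omitted, and the pixel budget is an invented parameter):

```python
import torch
import torch.nn.functional as F

def cycle_consistent_matches(feat_a, feat_b, kps_a, max_cycle_dist=8.0):
    """Match A -> B by nearest neighbour in feature space, map back B -> A,
    and keep a pseudo-label only if the round trip lands near its start."""
    sim = feat_a @ feat_b.t()          # (Na, Nb) cosine similarities
    ab = sim.argmax(dim=1)             # A -> B matches
    ba = sim.argmax(dim=0)             # B -> A matches
    roundtrip = kps_a[ba[ab]]          # pixel location reached by A -> B -> A
    keep = (roundtrip - kps_a).norm(dim=1) <= max_cycle_dist
    return ab, keep

# toy usage: random unit descriptors and pixel coordinates for image A
fa = F.normalize(torch.randn(50, 128), dim=1)
fb = F.normalize(torch.randn(60, 128), dim=1)
ka = torch.rand(50, 2) * 256
matches, keep = cycle_consistent_matches(fa, fb, ka)
print(matches.shape, keep.float().mean())
```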
- “Solving Inverse Problems with FLAIR,” 2025. [Online]. Available: https://arxiv.org/abs/2506.02680.
Abstract
Flow-based latent generative models such as Stable Diffusion 3 are able to
generate images with remarkable quality, even enabling photorealistic
text-to-image generation. Their impressive performance suggests that these
models should also constitute powerful priors for inverse imaging problems, but
that approach has not yet led to comparable fidelity. There are several key
obstacles: (i) the encoding into a lower-dimensional latent space makes the
underlying (forward) mapping non-linear; (ii) the data likelihood term is
usually intractable; and (iii) learned generative models struggle to recover
rare, atypical data modes during inference. We present FLAIR, a novel training-free variational framework that leverages flow-based generative models as a
prior for inverse problems. To that end, we introduce a variational objective
for flow matching that is agnostic to the type of degradation, and combine it
with deterministic trajectory adjustments to recover atypical modes. To enforce
exact consistency with the observed data, we decouple the optimization of the
data fidelity and regularization terms. Moreover, we introduce a time-dependent
calibration scheme in which the strength of the regularization is modulated
according to off-line accuracy estimates. Results on standard imaging
benchmarks demonstrate that FLAIR consistently outperforms existing diffusion-
and flow-based methods in terms of reconstruction quality and sample diversity.
- “Improving Representation Learning from Data and Model Perspectives: Semi-supervised Learning and Foundation Models,” Universität des Saarlandes, Saarbrücken, 2025.
- “PyG 2.0: Scalable Learning on Real World Graphs,” 2025. [Online]. Available: https://arxiv.org/abs/2507.16991.
Abstract
PyG (PyTorch Geometric) has evolved significantly since its initial release, establishing itself as a leading framework for Graph Neural Networks. In this paper, we present PyG 2.0 (and its subsequent minor versions), a comprehensive update that introduces substantial improvements in scalability and real-world application capabilities. We detail the framework's enhanced architecture, including support for heterogeneous and temporal graphs, scalable feature/graph stores, and various optimizations, enabling researchers and practitioners to tackle large-scale graph learning problems efficiently. Over the recent years, PyG has been supporting graph learning in a large variety of application areas, which we will summarize, while providing a deep dive into the important areas of relational deep learning and large language modeling.
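As a small illustration of the heterogeneous-graph support described above, the snippet below builds a toy two-node-type graph with PyG's public HeteroData container; the node types, edge type, and feature sizes are made up for the example.

```python
import torch
from torch_geometric.data import HeteroData

# toy heterogeneous graph: users rate movies
data = HeteroData()
data["user"].x = torch.randn(4, 16)      # 4 user nodes, 16-dim features
data["movie"].x = torch.randn(6, 32)     # 6 movie nodes, 32-dim features
data["user", "rates", "movie"].edge_index = torch.tensor(
    [[0, 1, 2, 3],   # source user indices
     [0, 2, 4, 5]]   # target movie indices
)
print(data)
print(data.metadata())  # node and edge types, usable e.g. with to_hetero()
```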
- “VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22399.
Abstract
Neural networks are widely adopted to solve complex and challenging tasks.
Especially in high-stakes decision-making, understanding their reasoning
process is crucial, yet proves challenging for modern deep networks. Feature
visualization (FV) is a powerful tool to decode what information neurons are
responding to and hence to better understand the reasoning behind such
networks. In particular, in FV we generate human-understandable images that
reflect the information detected by neurons of interest. However, current
methods often yield unrecognizable visualizations, exhibiting repetitive
patterns and visual artifacts that are hard to understand for a human. To
address these problems, we propose to guide FV through statistics of real image
features combined with measures of relevant network flow to generate
prototypical images. Our approach yields human-understandable visualizations
that both qualitatively and quantitatively improve over state-of-the-art FVs
across various architectures. As such, it can be used to decode which
information the network uses, complementing mechanistic circuits that identify
where it is encoded. Code is available at: github.com/adagorgun/VITAL
- “Beyond Accuracy: What Matters in Designing Well-Behaved Models?,” 2025. [Online]. Available: https://arxiv.org/abs/2503.17110.
Abstract
Deep learning has become an essential part of computer vision, with deep
neural networks (DNNs) excelling in predictive performance. However, they often
fall short in other critical quality dimensions, such as robustness,
calibration, or fairness. While existing studies have focused on a subset of
these quality dimensions, none have explored a more general form of
"well-behavedness" of DNNs. With this work, we address this gap by
simultaneously studying nine different quality dimensions for image
classification. Through a large-scale study, we provide a bird's-eye view by
analyzing 326 backbone models and how different training paradigms and model
architectures affect the quality dimensions. We reveal various new insights
such that (i) vision-language models exhibit high fairness on ImageNet-1k
classification and strong robustness against domain changes; (ii)
self-supervised learning is an effective training paradigm to improve almost
all considered quality dimensions; and (iii) the training dataset size is a
major driver for most of the quality dimensions. We conclude our study by
introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel
metric that ranks models across multiple dimensions of quality, enabling
tailored recommendations based on specific user needs.
- “Deepfakes: we need to re-think the concept of ‘real’ images,” 2025. [Online]. Available: https://arxiv.org/abs/2509.21864.
Abstract
The wide availability and low usability barrier of modern image generation models has triggered the reasonable fear of criminal misconduct and negative social implications. The machine learning community has been engaging this problem with an extensive series of publications proposing algorithmic solutions for the detection of "fake", e.g. entirely generated or partially manipulated images. While there is undoubtedly some progress towards technical solutions of the problem, we argue that current and prior work is focusing too much on generative algorithms and "fake" data-samples, neglecting a clear definition and data collection of "real" images. The fundamental question "what is a real image?" might appear to be quite philosophical, but our analysis shows that the development and evaluation of basically all current "fake"-detection methods is relying on only a few, quite old low-resolution datasets of "real" images like ImageNet. However, the technology for the acquisition of "real" images, aka taking photos, has drastically evolved over the last decade: Today, over 90% of all photographs are produced by smartphones which typically use algorithms to compute an image from multiple inputs (over time) from multiple sensors. Based on the fact that these image formation algorithms are typically neural network architectures which are closely related to "fake"-image generators, we state the position that today, we need to re-think the concept of "real" images. The purpose of this position paper is to raise the awareness of the current shortcomings in this active field of research and to trigger an open discussion whether the detection of "fake" images is a sound objective at all. At the very least, we need a clear technical definition of "real" images and new benchmark datasets.
- “Faithful, Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks,” 2025. [Online]. Available: https://arxiv.org/abs/2507.16761.
Abstract
Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. In this work, we address these limitations by introducing anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality. Our experiments on chest X-ray datasets demonstrate that the modified $\text{B-cos}_\text{FLC}$ and $\text{B-cos}_\text{BP}$ preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-class and multi-label settings. Code available at: GitHub repository (url: github.com/mkleinma/B-cos-medical-paper).
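The anti-aliasing idea referenced above boils down to low-pass filtering before subsampling. The following is a generic BlurPool-style layer in PyTorch, offered as a sketch rather than the repository's exact FLCPooling or BlurPool code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Depthwise binomial blur followed by strided subsampling (anti-aliased downsampling)."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)
        self.register_buffer("kernel", (k / k.sum()).expand(channels, 1, 3, 3).contiguous())
        self.stride = stride
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

# typical use: densely evaluated max-pool, then anti-aliased subsampling
x = torch.randn(1, 64, 56, 56)
x = F.max_pool2d(x, kernel_size=2, stride=1)
print(BlurPool2d(64)(x).shape)  # torch.Size([1, 64, 28, 28])
```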
- “RefAM: Attention Magnets for Zero-Shot Referral Segmentation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.22650.
Abstract
Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
- “Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs,” 2025. [Online]. Available: https://arxiv.org/abs/2507.00754.
Abstract
The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between text-centric pretraining of LLMs and vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable finetuning. As a result, LLM blocks are kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LUViT significantly improves performance on various downstream vision tasks, showcasing a more effective and efficient pathway to harness LLM knowledge for visual understanding.
- “CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks,” 2025.
Abstract
The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC$^{syn}$) of over one million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROC$^{hum}$) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 25% of cases involving correct identification of body parts.
- “TRIX- Trading Adversarial Fairness via Mixed Adversarial Training,” 2025. [Online]. Available: https://arxiv.org/abs/2507.07768.
Abstract
Adversarial Training (AT) is a widely adopted defense against adversarial examples. However, existing approaches typically apply a uniform training objective across all classes, overlooking disparities in class-wise vulnerability. This results in adversarial unfairness: classes with well distinguishable features (strong classes) tend to become more robust, while classes with overlapping or shared features(weak classes) remain disproportionately susceptible to adversarial attacks. We observe that strong classes do not require strong adversaries during training, as their non-robust features are quickly suppressed. In contrast, weak classes benefit from stronger adversaries to effectively reduce their vulnerabilities. Motivated by this, we introduce TRIX, a feature-aware adversarial training framework that adaptively assigns weaker targeted adversaries to strong classes, promoting feature diversity via uniformly sampled targets, and stronger untargeted adversaries to weak classes, enhancing their focused robustness. TRIX further incorporates per-class loss weighting and perturbation strength adjustments, building on prior work, to emphasize weak classes during the optimization. Comprehensive experiments on standard image classification benchmarks, including evaluations under strong attacks such as PGD and AutoAttack, demonstrate that TRIX significantly improves worst-case class accuracy on both clean and adversarial data, reducing inter-class robustness disparities, and preserves overall accuracy. Our results highlight TRIX as a practical step toward fair and effective adversarial defense.
- “Missing Fine Details in Images: Last Seen in High Frequencies,” 2025. [Online]. Available: https://arxiv.org/abs/2509.05441.
Abstract
Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, generated images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information during optimization, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Moreover, we integrate our frequency-preserving latent embeddings into a SOTA latent diffusion model, resulting in sharper and more realistic image generation. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image synthesis, with broader implications for applications in content creation, neural rendering, and medical imaging.
- “Escaping Plato’s Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13429.
Abstract
With the rise of neural networks, especially in high-stakes applications,
these networks need two properties (i) robustness and (ii) interpretability to
ensure their safety. Recent advances in classifiers with 3D volumetric object
representations have demonstrated a greatly enhanced robustness in
out-of-distribution data. However, these 3D-aware classifiers have not been
studied from the perspective of interpretability. We introduce CAVE - Concept
Aware Volumes for Explanations - a new direction that unifies interpretability
and robustness in image classification. We design an inherently-interpretable
and robust classifier by extending existing 3D-aware classifiers with concepts
extracted from their volumetric representations for classification. In an array
of quantitative metrics for interpretability, we compare against different
concept-based approaches across the explainable AI literature and show that
CAVE discovers well-grounded concepts that are used consistently across images,
while achieving superior robustness.
- “UniK3D: Universal Camera Monocular 3D Estimation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.16591.
Abstract
Monocular 3D estimation is crucial for visual perception. However, current
methods fall short by relying on oversimplified assumptions, such as pinhole
camera models or rectified images. These limitations severely restrict their
general applicability, causing poor performance in real-world scenarios with
fisheye or panoramic images and resulting in substantial context loss. To
address this, we present UniK3D, the first generalizable method for monocular
3D estimation able to model any camera. Our method introduces a spherical 3D
representation which allows for better disentanglement of camera and scene
geometry and enables accurate metric 3D reconstruction for unconstrained camera
models. Our camera component features a novel, model-independent representation
of the pencil of rays, achieved through a learned superposition of spherical
harmonics. We also introduce an angular loss, which, together with the camera
module design, prevents the contraction of the 3D outputs for wide-view
cameras. A comprehensive zero-shot evaluation on 13 diverse datasets
demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and
camera metrics, with substantial gains in challenging large-field-of-view and
panoramic settings, while maintaining top accuracy in conventional pinhole
small-field-of-view domains. Code and models are available at
github.com/lpiccinelli-eth/unik3d.
- “UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler,” 2025. [Online]. Available: https://arxiv.org/abs/2502.20110.
Abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving
downstream tasks in 3D perception and modeling. However, the remarkable
accuracy of recent MMDE methods is confined to their training domains. These
methods fail to generalize to unseen domains even in the presence of moderate
domain gaps, which hinders their practical applicability. We propose a new
model, UniDepthV2, capable of reconstructing metric 3D scenes from solely
single images across domains. Departing from the existing MMDE paradigm,
UniDepthV2 directly predicts metric 3D points from the input image at inference
time without any additional information, striving for a universal and flexible
MMDE solution. In particular, UniDepthV2 implements a self-promptable camera
module predicting a dense camera representation to condition depth features.
Our model exploits a pseudo-spherical output representation, which disentangles
the camera and depth representations. In addition, we propose a geometric
invariance loss that promotes the invariance of camera-prompted depth features.
UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss
which enhances the localization and sharpness of edges in the metric depth
outputs, a revisited, simplified and more efficient architectural design, and
an additional uncertainty-level output which enables downstream tasks requiring
confidence. Thorough evaluations on ten depth datasets in a zero-shot regime
consistently demonstrate the superior performance and generalization of
UniDepthV2. Code and models are available at
github.com/lpiccinelli-eth/UniDepth.
- “Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging,” 2025. [Online]. Available: https://arxiv.org/abs/2507.15576.
Abstract
Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. Code: github.com/Nicolas-Poggi/Project_THz_Classification/tree/main.
- “Deep Learning for Climate Action: Computer Vision Analysis of Visual Narratives on X,” 2025. [Online]. Available: https://arxiv.org/abs/2503.09361.
Abstract
Climate change is one of the most pressing challenges of the 21st century,
sparking widespread discourse across social media platforms. Activists,
policymakers, and researchers seek to understand public sentiment and
narratives while access to social media data has become increasingly restricted
in the post-API era. In this study, we analyze a dataset of climate
change-related tweets from X (formerly Twitter) shared in 2019, containing 730k
tweets along with the shared images. Our approach integrates statistical
analysis, image classification, object detection, and sentiment analysis to
explore visual narratives in climate discourse. Additionally, we introduce a
graphical user interface (GUI) to facilitate interactive data exploration. Our
findings reveal key themes in climate communication, highlight sentiment
divergence between images and text, and underscore the strengths and
limitations of foundation models in analyzing social media imagery. By
releasing our code and tools, we aim to support future research on the
intersection of climate change, social media, and computer vision.
- “DCBM: Data-Efficient Visual Concept Bottleneck Models,” 2025. [Online]. Available: https://arxiv.org/abs/2412.11576.
Abstract
Concept Bottleneck Models (CBMs) enhance the interpretability of neural
networks by basing predictions on human-understandable concepts. However,
current CBMs typically rely on concept sets extracted from large language
models or extensive image corpora, limiting their effectiveness in data-sparse
scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for
large sample sizes during concept generation while preserving interpretability.
DCBMs define concepts as image regions detected by segmentation or detection
foundation models, allowing each image to generate multiple concepts across
different granularities. This removes reliance on textual descriptions and
large-scale pre-training, making DCBMs applicable for fine-grained
classification and out-of-distribution tasks. Attribution analysis using
Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized
in test images. By leveraging dataset-specific concepts instead of predefined
ones, DCBMs enhance adaptability to new domains.
- “Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.21780.
Abstract
Open-vocabulary semantic segmentation models associate vision and text to
label pixels from an undefined set of classes using textual queries, providing
versatile performance on novel datasets. However, large shifts between training
and test domains degrade their performance, requiring fine-tuning for effective
real-world applications. We introduce Semantic Library Adaptation (SemLA), a
novel framework for training-free, test-time domain adaptation. SemLA leverages
a library of LoRA-based adapters indexed with CLIP embeddings, dynamically
merging the most relevant adapters based on proximity to the target domain in
the embedding space. This approach constructs an ad-hoc model tailored to each
specific input without additional training. Our method scales efficiently,
enhances explainability by tracking adapter contributions, and inherently
protects data privacy, making it ideal for sensitive applications.
Comprehensive experiments on a 20-domain benchmark built over 10 standard
datasets demonstrate SemLA's superior adaptability and performance across
diverse settings, establishing a new standard in domain adaptation for
open-vocabulary semantic segmentation.
- “Unlocking Open-Set Language Accessibility in Vision Models,” 2025. [Online]. Available: https://arxiv.org/abs/2503.10981.
- “RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09368.
Abstract
Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables public two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that accurate models are not necessarily robust and that robustness varies widely by corruption type. RobustSpring is a new computer vision benchmark that treats robustness as a first-class citizen to foster models that combine accuracy with resilience. It will be available at spring-benchmark.org.
- “Vision At Night: Exploring Biologically Inspired Preprocessing For Improved Robustness Via Color And Contrast Transformations,” 2025. [Online]. Available: https://arxiv.org/abs/2509.24863.
Abstract
Inspired by the human visual system's mechanisms for contrast enhancement and color-opponency, we explore biologically motivated input preprocessing for robust semantic segmentation. By applying Difference-of-Gaussians (DoG) filtering to RGB, grayscale, and opponent-color channels, we enhance local contrast without modifying model architecture or training. Evaluations on Cityscapes, ACDC, and Dark Zurich show that such preprocessing maintains in-distribution performance while improving robustness to adverse conditions like night, fog, and snow. As this processing is model-agnostic and lightweight, it holds potential for integration into imaging pipelines, enabling imaging systems to deliver task-ready, robust inputs for downstream vision models in safety-critical environments.
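A minimal version of the biologically inspired preprocessing described above is sketched below, assuming simple opponent-colour definitions and Gaussian scales chosen only for illustration; the paper's exact channel and filter settings may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_preprocess(rgb: np.ndarray, sigma_center: float = 1.0, sigma_surround: float = 2.0) -> np.ndarray:
    """Difference-of-Gaussians contrast enhancement on grayscale and opponent-colour channels."""
    rgb = rgb.astype(np.float32) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    channels = [
        rgb.mean(axis=-1),      # luminance
        r - g,                  # red-green opponency
        b - 0.5 * (r + g),      # blue-yellow opponency
    ]
    dog = [gaussian_filter(c, sigma_center) - gaussian_filter(c, sigma_surround)
           for c in channels]
    return np.stack(dog, axis=-1)

# toy usage; real inputs would be Cityscapes / ACDC / Dark Zurich frames
img = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
print(dog_preprocess(img).shape)  # (128, 128, 3)
```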
- “Now You See Me! A Framework for Obtaining Class-relevant Saliency Maps,” 2025. [Online]. Available: https://arxiv.org/abs/2503.07346.
- “B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability,” 2025. [Online]. Available: https://arxiv.org/abs/2502.12992.
Abstract
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human interpretable explanations than post
hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
anonymous.4open.science/r/bcos_lm.
- “Spatial Reasoning with Denoising Models,” 2025. [Online]. Available: https://www.arxiv.org/abs/2502.21075.
Abstract
We introduce Spatial Reasoning Models (SRMs), a framework to perform
reasoning over sets of continuous variables via denoising generative models.
SRMs infer continuous representations on a set of unobserved variables, given
observations on observed variables. Current generative models on spatial
domains, such as diffusion and flow matching models, often collapse to
hallucination in case of complex distributions. To measure this, we introduce a
set of benchmark tasks that test the quality of complex reasoning in generative
models and can quantify hallucination. The SRM framework allows to report key
findings about importance of sequentialization in generation, the associated
order, as well as the sampling strategies during training. It demonstrates, for
the first time, that order of generation can successfully be predicted by the
denoising network itself. Using these findings, we can increase the accuracy of
specific reasoning tasks from 1% to >50%.
- “AnyUp: Universal Feature Upsampling,” 2025. [Online]. Available: https://arxiv.org/abs/2510.12764.
Abstract
We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.
- “KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.16707.
Abstract
Recent advances in multi-modal generative models have enabled significant
progress in instruction-based image editing. However, while these models
produce visually plausible outputs, their capacity for knowledge-based
reasoning editing tasks remains under-explored. In this paper, we introduce
KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a
diagnostic benchmark designed to assess models through a cognitively informed
lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks
across three foundational knowledge types: Factual, Conceptual, and Procedural.
Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning
dimensions and release 1,267 high-quality annotated editing instances. To
support fine-grained evaluation, we propose a comprehensive protocol that
incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints
and calibrated through human studies. Empirical results on 10 state-of-the-art
models reveal significant gaps in reasoning performance, highlighting the need
for knowledge-centric benchmarks to advance the development of intelligent
image editing systems.
- “MVGBench: Comprehensive Benchmark for Multi-view Generation Models,” 2025. [Online]. Available: https://arxiv.org/abs/2507.00006.
Abstract
We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force in 3D object creation. However, existing metrics compare generated images against ground truth target views, which is not suitable for generative tasks where multiple solutions exist while differing from ground truth. Furthermore, different MVGs are trained on different view angles, synthetic data and specific lightings -- robustness to these factors and generalization to real data are rarely evaluated thoroughly. Without a rigorous evaluation protocol, it is also unclear what design choices contribute to the progress of MVGs. MVGBench evaluates three different aspects: best setup performance, generalization to real data and robustness. Instead of comparing against ground truth, we introduce a novel 3D self-consistency metric which compares 3D reconstructions from disjoint generated multi-views. We systematically compare 12 existing MVGs on 4 different curated real and synthetic datasets. With our analysis, we identify important limitations of existing methods specially in terms of robustness and generalization, and we find the most critical design choices. Using the discovered best practices, we propose ViFiGen, a method that outperforms all evaluated MVGs on 3D consistency. Our code, model, and benchmark suite will be publicly released.
- “Informed Mixing -- Improving Open Set Recognition via Attribution-based Augmentation,” 2025.
Abstract
Open set recognition (OSR) is devised to address the problem of detecting novel classes during model inference. Even in recent vision models, this remains an open issue which is receiving increasing attention. Thereby, a crucial challenge is to learn features that are relevant for unseen categories from given data, for which these features might not be discriminative. To facilitate this process and "optimize to learn" more diverse features, we propose GradMix, a data augmentation method that dynamically leverages gradient-based attribution maps of the model during training to mask out already learned concepts. Thus GradMix encourages the model to learn a more complete set of representative features from the same data source. Extensive experiments on open set recognition, close set classification, and out-of-distribution detection reveal that our method can often outperform the state-of-the-art. GradMix can further increase model robustness to corruptions as well as downstream classification performance for self-supervised learning, indicating its benefit for model generalization.
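As a sketch of the attribution-based masking that such an augmentation builds on, here is a simple input-gradient saliency variant; the paper uses gradient-based attribution maps of the model during training, and the patch size and masking fraction below are invented for the example.

```python
import torch
import torch.nn.functional as F

def attribution_mask(model, x, y, mask_frac=0.2, patch=16):
    """Zero out the input patches with the highest input-gradient attribution."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    sal = x.grad.abs().mean(dim=1, keepdim=True)          # (B, 1, H, W) saliency
    scores = F.avg_pool2d(sal, patch)                     # per-patch attribution
    k = max(1, int(mask_frac * scores[0].numel()))
    thresh = scores.flatten(1).topk(k, dim=1).values[:, -1]
    keep = (scores < thresh.view(-1, 1, 1, 1)).float()    # drop the top-k patches
    keep = F.interpolate(keep, scale_factor=patch, mode="nearest")
    return x.detach() * keep

# toy usage with a small CNN on random data (gradients on model weights are ignored here)
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1),
                            torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(),
                            torch.nn.Linear(8, 10))
imgs, labels = torch.randn(2, 3, 64, 64), torch.randint(0, 10, (2,))
print(attribution_mask(model, imgs, labels).shape)
```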
- “LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers,” 2025. [Online]. Available: https://arxiv.org/abs/2507.04404.
Abstract
Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising efficient solution without training, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In this work, we introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Through empirical attention analysis, we identify two key patterns: punctuation tokens receive dominant attention in early layers, while conceptual tokens govern semantic reasoning in intermediate layers. By selectively suppressing attention to these token types at their respective depths, we achieve the induction of controlled factual degradation and derive contrastive signals to guide the final factual decoding. Our method requires no additional training or model modification, and experiments demonstrate that our method consistently improves factuality across multiple LLMs and various benchmarks.
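The decoding step underlying such approaches can be summarized as a contrastive combination of a full forward pass with a deliberately degraded one. The sketch below shows that generic combination with an adaptive plausibility cutoff; it is not the paper's token-aware, layer-localized attention suppression, and all parameter names are illustrative.

```python
import torch

def contrastive_next_token(logits_full: torch.Tensor,
                           logits_degraded: torch.Tensor,
                           alpha: float = 1.0,
                           tau: float = 0.1) -> torch.Tensor:
    """Pick the token the full model prefers most relative to a degraded variant,
    restricted to tokens that are plausible under the full model."""
    log_p = torch.log_softmax(logits_full, dim=-1)
    log_q = torch.log_softmax(logits_degraded, dim=-1)
    # adaptive plausibility: keep tokens within a factor tau of the top probability
    cutoff = log_p.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(tau))
    scores = (log_p - alpha * log_q).masked_fill(log_p < cutoff, float("-inf"))
    return scores.argmax(dim=-1)

# toy usage with random next-token logits over a 32k vocabulary
full, degraded = torch.randn(1, 32000), torch.randn(1, 32000)
print(contrastive_next_token(full, degraded))
```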