Abstract
Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient DEcomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping, separating them from architecture-induced artifacts and other sources of instability.
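As a concrete starting point, the sketch below computes the plain input gradient of a ViT logit, i.e. the pixel-level quantity that a gradient-decomposition method such as DAVE would further decompose. The abstract does not specify DAVE's decomposition, so this is only the baseline computation, assuming a pretrained timm ViT and a placeholder image path ("example.jpg") chosen here for illustration.

# Minimal sketch: raw input-gradient saliency for a ViT (PyTorch + timm).
# This is NOT the DAVE decomposition; it only shows the input gradient
# that such a decomposition would start from.
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
transform = create_transform(**resolve_data_config({}, model=model))

# "example.jpg" is a placeholder input image.
x = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
x.requires_grad_(True)

logits = model(x)
target = logits.argmax(dim=-1).item()

# Gradient of the predicted-class logit with respect to the input pixels.
logits[0, target].backward()

# Pixel-level saliency: aggregate gradient magnitude over color channels.
saliency = x.grad.abs().sum(dim=1).squeeze(0)   # shape (H, W)

In this raw form the map typically shows the patch-grid artifacts the abstract describes; DAVE's contribution is to separate the stable, locally equivariant part of this gradient from such architecture-induced structure.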
BibTeX
@online{Wrobel2602.06613,
TITLE = {{DAVE}: Distribution-aware Attribution via {ViT} Gradient Decomposition},
AUTHOR = {Wr{\'o}bel, Adam and Gairola, Siddhartha and Tabor, Jacek and Schiele, Bernt and Zieli{\'n}ski, Bartosz and Rymarczyk, Dawid},
LANGUAGE = {eng},
URL = {https://arxiv.org/abs/2602.06613},
EPRINT = {2602.06613},
EPRINTTYPE = {arXiv},
YEAR = {2026},
ABSTRACT = {Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE \textit{(\underline{D}istribution-aware \underline{A}ttribution via \underline{V}iT Gradient D\underline{E}composition)}, a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input--output mapping, separating them from architecture-induced artifacts and other sources of instability.},
}
Endnote
%0 Report
%A Wróbel, Adam
%A Gairola, Siddhartha
%A Tabor, Jacek
%A Schiele, Bernt
%A Zieliński, Bartosz
%A Rymarczyk, Dawid
%+ External Organizations
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society
External Organizations
Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society
External Organizations
External Organizations
%T DAVE: Distribution-aware Attribution via ViT Gradient Decomposition
%G eng
%U http://hdl.handle.net/21.11116/0000-0012-9A61-1
%U https://arxiv.org/abs/2602.06613
%D 2026
%X Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient DEcomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping, separating them from architecture-induced artifacts and other sources of instability.
%K Computer Science, Computer Vision and Pattern Recognition, cs.CV; Computer Science, Artificial Intelligence, cs.AI; Computer Science, Human-Computer Interaction, cs.HC; Computer Science, Learning, cs.LG