
We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation—concepts—across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications—such as coordinated activation maximization—that open avenues for deeper insights into multi-model AI systems.
In contrast to standard SAEs, which interpret the internal representations of a single model, USAEs extend this notion across M different models. The key insight of USAEs is to learn a shared sparse code, Z, from which the activations of every model can be reconstructed. Each model has its own encoder and decoder pair to translate to and from this universal space; training aligns the encoders and decoders across models by requiring every decoder to reconstruct its model's activations from the shared sparse code produced by any model's encoder.
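To make this objective concrete, below is a minimal PyTorch sketch of a USAE with per-model linear encoder/decoder pairs around a shared, overcomplete concept space. The class and function names, the TopK sparsity, and the random choice of a single source model per step are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn


class USAE(nn.Module):
    """Minimal illustrative Universal Sparse Autoencoder (not the authors' code).

    Each of the M models gets its own linear encoder/decoder pair mapping its
    activation space (dimension d_m) to and from a shared, overcomplete concept
    space of size n_concepts. Sparsity is enforced here with a simple TopK step.
    """

    def __init__(self, model_dims, n_concepts, k=32):
        super().__init__()
        self.k = k
        self.encoders = nn.ModuleList([nn.Linear(d, n_concepts) for d in model_dims])
        self.decoders = nn.ModuleList([nn.Linear(n_concepts, d) for d in model_dims])

    def encode(self, x, m):
        """Encode model m's activations into a sparse code in the shared space."""
        z = torch.relu(self.encoders[m](x))
        # Keep only the top-k concepts per token; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

    def decode(self, z, m):
        """Reconstruct model m's activations from the shared sparse code."""
        return self.decoders[m](z)


def training_step(usae, activations, optimizer):
    """One cross-reconstruction step.

    `activations` is a list [x_0, ..., x_{M-1}], where x_m holds model m's
    spatial-token activations for the same batch of images. A source model is
    sampled at random; its code must reconstruct every model's activations.
    """
    m_src = torch.randint(len(activations), (1,)).item()
    z = usae.encode(activations[m_src], m_src)
    loss = sum(
        torch.nn.functional.mse_loss(usae.decode(z, m_tgt), x_tgt)
        for m_tgt, x_tgt in enumerate(activations)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The essential property is that the loss couples every decoder to the code produced by a single encoder, which is what forces the concept space to be shared and index-aligned across models.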
We discover a diverse set of visual concepts in the universal space Z, that is, concepts which are shared and index-aligned across all models. These universal concepts range from low-level properties such as colour and texture (e.g., “yellow,” “curves”) to high-level semantic parts and object groupings (e.g., “bolts,” “animal group faces”). Furthermore, because training is performed on spatial tokens, the resulting concepts exhibit spatial precision, activating only on the image regions that correspond to the given concept.
We find that some concepts are not shared across all models. DinoV2, for example, has many unique concepts that reflect aspects of three-dimensional space, such as object geometry, spatial depth, and viewing orientation. SigLIP, on the other hand, has unique concepts that showcase its ability to jointly capture textual-visual correspondences. A prime example is the “star” concept, which activates on star shapes as well as on the word “star”.
We present an immediate application of USAEs: Coordinated Activation Maximization. For a given universal concept, optimizing each model's input image to maximally activate that concept through the model's encoder produces a per-model concept visualization, revealing the different ways in which each model encodes the same concept.
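A minimal sketch of this procedure is shown below, assuming each model returns spatial-token activations and reusing the illustrative `usae.encode` from the sketch above; standard visualization regularizers (jitter, total variation, frequency-space parameterization) are omitted for brevity.

```python
import torch


def coordinated_activation_max(usae, models, concept_idx, steps=256, lr=0.05, img_size=224):
    """Illustrative per-model activation maximization for one universal concept.

    For each model, an input image is optimized so that the model's activations,
    passed through that model's USAE encoder, maximally activate `concept_idx`.
    """
    visualizations = []
    for m, model in enumerate(models):
        x = torch.randn(1, 3, img_size, img_size, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            acts = model(x)                     # assumed shape: (1, tokens, d_m)
            z = usae.encode(acts, m)            # shared sparse code: (1, tokens, n_concepts)
            loss = -z[..., concept_idx].mean()  # maximize the chosen concept's activation
            opt.zero_grad()
            loss.backward()
            opt.step()
        visualizations.append(x.detach())
    return visualizations
```

Because the concept index is shared across models, the resulting images can be compared side by side to see how each model realizes the same concept.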
We define Concept Importance by measuring the impact of a given concept on the reconstruction of the model activations. Universality is quantified by computing how often a concept fires simultaneously in every model's code for the same input tokens. We find a distinct correlation between Concept Importance and Universality. Additionally, we analyse firing entropy to distinguish three modes of concept activation: concepts that are universal across all models, concepts that are shared between pairs of models, and concepts that are unique to a single model.
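The sketch below shows one reasonable way to compute these two quantities from the shared sparse codes; the function names are illustrative and the exact formulas used in the paper may differ.

```python
import torch


def universality(codes, eps=1e-8):
    """Illustrative universality score per concept.

    `codes` is a list of sparse codes [z_0, ..., z_{M-1}] from each model's
    encoder on the same tokens, each of shape (tokens, n_concepts). The score is
    the fraction of tokens on which a concept fires in all models, relative to
    the tokens on which it fires in any model.
    """
    fires = torch.stack([(z > 0).float() for z in codes])  # (M, tokens, n_concepts)
    co_fire = fires.prod(dim=0).sum(dim=0)                  # tokens where it fires in all M models
    any_fire = (fires.sum(dim=0) > 0).float().sum(dim=0)    # tokens where it fires in at least one
    return co_fire / (any_fire + eps)                        # (n_concepts,)


def concept_importance(usae, z, x, m, concept_idx):
    """Illustrative importance of one concept for reconstructing model m.

    Measures the increase in reconstruction error when the concept is ablated
    (set to zero) in the shared code.
    """
    base = torch.nn.functional.mse_loss(usae.decode(z, m), x)
    z_ablated = z.clone()
    z_ablated[..., concept_idx] = 0.0
    ablated = torch.nn.functional.mse_loss(usae.decode(z_ablated, m), x)
    return (ablated - base).item()
```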
Although our USAE is trained solely on ImageNet, we find that its concepts generalise well to other datasets. This suggests that robust representational capacity can be achieved even with relatively limited training data.
@inproceedings{thasarathan2025universal,
  title     = {Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment},
  author    = {Harrish Thasarathan and Julian Forsyth and Thomas Fel and Matthew Kowal and Konstantinos G. Derpanis},
  booktitle = {Forty-second International Conference on Machine Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=UoaxRN88oR}
}