VTCD: Understanding Video Transformers via Universal Concept Discovery

CVPR 2024 (Highlight)
1York University, 2Samsung AI Centre Toronto 3Toyota Research Institute 4Vector Institute

What concepts are important for encoding object permanence in video transformers?

Input and Model Prediction
Layer 3 - Temporally Invariant Spatial Positions
Layer 7 - Collisions Between Objects
Layer 10 - Container Containing Object
Layer 12: Object Tracking Through Occlusions

VTCD discovers spatiotemporal concepts in video transformer models.


This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations - concepts, and ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatio-temporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanism are universal in video transformers. Finally, we demonstrate that VTCD can be used to improve model performance for fine-grained tasks.

Method Overview

Concept Discovery

To discover concepts in video transformers, we first pass a video dataset through a pretrained model and extract the activations of a given layer. We then cluster the activations using SLIC to generate spatiotemporal tubelets. Finally, we cluster the dataset of tubelets and use the resulting centers as concepts.

Concept Importance

To rank the importance of concepts within individual transformer heads, we find that previous methods based on gradients or masking out single concepts are not effective. Instead, we propose Concept Randomized Importance Sampling (CRIS), which randomly samples concepts from the entire model to mask out and measures the change in model performance.

Rosetta Concepts

In order to better understand fundamental computations performed by video transformers, we mine for universal concepts across all models. We filter based on concept importance and on the Intersection over Union (IoU) of the tubelets corresponding to each concept.


We analyze four different models trained for different tasks: (i) TCOW trained for semi-supervised video object segmentation to track objects through occlusions (ii) Supervised VideoMAE trained for action recognition (iii) Self-supervised VideoMAE and (iv) InternVideo trained on video-text pairs via contrastive learning.

Most Important Concepts Discovered

Semi-supervised VOS Concepts - TCOW Concepts

Shown below are the most three important concepts discovered by VTCD for TCOW. We find that the most important concept highlights objects with similar appearance, suggesting the model solving the disambiguation problem by first identifying possible distractors in mid-layers. The second concept highlights the target object throughout the video. The third concept captures temporally invariant spatial position in the top-left region of the video.

Concept 1 - Layer 5 Head 8
Concept 2 - Layer 9 Head 12
Concept 3 - Layer 3 Head 11

Action Recognition Concepts - VideoMAE Concepts

Shown below are the most three important concepts discovered by VTCD for VideoMAE trained on SSv2 targetting the class `dropping something into something'. The first concept highlights the object being dropped until the dropping event, at which point both the object and container are highlighted. The second concept captures the container being dropped into, notably not capturing the object itself and making a ring-like shape. As in the TCOW model, VideoMAE also contains concepts that capture spatial information, this one highlighting the center of the video.

Concept 1 - Layer 5 Head 8
Concept 2 - Layer 9 Head 12
Concept 3 - Layer 3 Head 11

Rosetta Concepts

We now mine for Rosetta concepts across the models. Shown below are a sample of Rosetta concepts found for the target SSv2 class `rolling something on a flat surface'. We find universal concepts across all models capturing diverse information such as spatial position information and object centric concepts.

Spatial Positions

TCOW Layer3
VidMAE Layer3
VidMAESSL Layer3
InternVideo Layer4
TCOW Layer3
VidMAE Layer5
VidMAE-SSL Layer1
InternVideo Layer7

Object Tracking

TCOW Layer9
VidMAE Layer11
VidMAE-SSL Layer12
InternVideo Layer10
TCOW Layer4
VidMAE Layer8
VidMAE-SSL Layer10
InternVideo Layer10

Application: Efficient Inference with Concept-Head Pruning

We propose a novel method for pruning transformer heads based on their importance. We target a subset (6) of the classes in the SSv2 dataset and prune the least important heads based on CRIS. We find that we can prune 30% of the heads to gain >4% in performance while reducing FLOPS, and up to 50% of the heads without a significant drop in performance.

Application: Zero-Shot Semi-VOS with VTCD

We apply VTCD for the task of semi-supervised video object segmentation and evaluate the discovered concepts on the DAVIS’16 video-object segmentation benchmark. To this end, we run VTCD on the training set and select concepts that have the highest overlap with the groundtruth tracks. We then use them to track new objects on the validation set. Not that are evaluating representations that are not directly trained for VOS. While solid performance is achieved by all representations, segmentation accuracy is limited by the low resolution of the concept masks. To mitigate this, we introduce a simple refinement step that samples random points inside the masks to prompt Segment Anything Model (SAM). The resulting approach, shown as ‘VTCD + SAM’, significantly improves performance. We anticipate that future developments in video representation learning will automatically lead to better VOS methods with the help of VTCD.


  author    = {Kowal, Matthew and Dave, Achal and Ambrus, Rares and Gaidon, Adrien and Derpanis, Konstantinos G and Tokmakov, Pavel},
  title     = {Understanding Video Transformers via Universal Concept Discovery},
  journal   = {arXiv preprint arXiv:2401.10831},
  year      = {2024},