Supplemental Results for the CVPR 2024 paper "Understanding Video Transformers via Universal Concept Discovery".

Figure 1 Visualization

In this visualization, we show the video version of Figure 1 from the main paper. The prediction heatmap of the TCOW model is shown on the left. Earlier layers capture positional information, while deeper layers capture events, objects, and containers, or track the target object through occlusions.

Input and Prediction
Layer 3 - Head 5
Layer 7 - Head 8
Layer 10 - Head 8
Layer 12 - Head 1

Most important concepts

We visualize the most important concepts for each of the four models: (1) TCOW, (2) supervised VideoMAE, (3) SSL VideoMAE, and (4) InternVideo. Only the top-1 concept is shown by default; click "Show more" to see the full list.

Most important concepts - TCOW

TCOW Concept 1 - Layer 5 Head 8

This concept highlights objects with similar appearance, suggesting that the model solves the disambiguation problem by first identifying possible distractors in the middle layers.

Input
Input
Concept
Concept

TCOW Concept 2 - Layer 9 Head 12

This concept tracks the target object throughout the video.

Input
Input
Concept
Concept

TCOW Concept 3 - Layer 3 Head 11

This concept captures temporally invariant spatial position in the top-left region of the video.

Input
Input
Concept
Concept

TCOW Concept 4 - Layer 3 Head 9

This concept captures vertical spatial position and highlights a temporally invariant horizontal slice across the video.

Input
Input
Concept
Concept
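
One way to make "temporally invariant" concrete is to measure how much a concept's highlighted region changes from frame to frame. The sketch below is illustrative only and is not taken from the paper: it assumes each concept is available as a binary mask of shape (T, H, W), and the function name `temporal_variation` is hypothetical.

```python
import numpy as np

def temporal_variation(mask: np.ndarray) -> float:
    """Fraction of pixels whose on/off state flips between consecutive frames.

    `mask` is a binary concept support of shape (T, H, W). A value of 0.0
    means the highlighted region is identical in every frame.
    """
    m = np.asarray(mask, dtype=bool)
    if m.ndim != 3 or m.shape[0] < 2:
        raise ValueError("expected a (T, H, W) mask with at least two frames")
    return float(np.mean(m[1:] != m[:-1]))

# Toy positional concept: the same horizontal slice in all 4 frames.
static_band = np.zeros((4, 6, 6), dtype=bool)
static_band[:, 2:4, :] = True

# Toy tracking concept: a single pixel moving along the diagonal.
moving_blob = np.zeros((4, 6, 6), dtype=bool)
for t in range(4):
    moving_blob[t, t, t] = True

band_score = temporal_variation(static_band)
blob_score = temporal_variation(moving_blob)
print(band_score, blob_score)
```

Under this measure, a positional concept such as the horizontal slice above scores zero, while a concept that follows a moving object scores above zero.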

TCOW Concept 5 - Layer 10 Head 10

This concept again captures the target object across the entire video.

Input
Input
Concept
Concept

TCOW Concept 6 - Layer 2 Head 4

This concept captures spatial information in the top left region of the video.

Input
Input
Concept
Concept

TCOW Concept 7 - Layer 3 Head 3

This concept also captures spatial information in the top left region of the video.

Input
Input
Concept
Concept

TCOW Concept 8 - Layer 3 Head 4

This concept also captures spatial information at the top of the video. Interestingly, all concepts encoding a spatiotemporal basis highlight regions in the middle of the video or higher. These regions may be particularly important for tracking objects in Kubric because objects are randomly spawned above containers in each video. Thus, a precise understanding of the top regions of the video is required for tracking the target object.

Input
Input
Concept
Concept

Most important concepts - Supervised VideoMAE (dropping something into something)

Supervised VideoMAE Concept 1 - Layer 11 Head 9

Interestingly, the most important concept highlights the object being dropped until the dropping event, at which point both the object and container are highlighted.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 2 - Layer 8 Head 1

This concept captures the container being dropped into. Notably, it does not capture the object itself, forming a ring-like shape.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 3 - Layer 4 Head 3

Like TCOW, VideoMAE contains concepts that capture spatial information; this one highlights the center of the video.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 4 - Layer 12 Head 2

This concept is in the last layer of the model and captures the container being dropped into. Notably, the video on the left shows an unusual container, an almost full drawer, which the model still successfully highlights until the bag is dropped into it.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 5 - Layer 6 Head 3

This is a positional concept highlighting the top and center region of the video.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 6 - Layer 12 Head 3

Interestingly, this concept, also in the final layer, highlights nothing in the video until the dropping event occurs, at which point the container and the object are highlighted.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 7 - Layer 4 Head 3

This is another positional concept highlighting the top and center region of the video.

Input
Input
Concept
Concept

Supervised VideoMAE Concept 8 - Layer 4 Head 3

This is a positional concept highlighting the bottom and center region of the video.

Input
Input
Concept
Concept

Most important concepts - SSL VideoMAE (dropping something into something)

SSL VideoMAE Concept 1 - Layer 4 Head 11

The most important concept captures the container being dropped into.

Input
Input
Concept
Concept

SSL VideoMAE Concept 2 - Layer 12 Head 10

The second most important concept also captures the container being dropped into.

Input
Input
Concept
Concept

SSL VideoMAE Concept 3 - Layer 7 Head 7

Interestingly, we observe that the third most important concept is a spatially invariant temporal basis that captures the beginning of the video. At the beginning of the video, everything is highlighted; after a few frames, nothing is highlighted.

Input
Input
Concept
Concept

SSL VideoMAE Concept 4 - Layer 3 Head 10

This concept is a spatial position concept capturing the top center region of the video.

Input
Input
Concept
Concept

SSL VideoMAE Concept 5 - Layer 5 Head 8

This concept is a spatial position concept capturing the right region of the video.

Input
Input
Concept
Concept

SSL VideoMAE Concept 6 - Layer 9 Head 12

This is another spatial position concept capturing the top center region of the video.

Input
Input
Concept
Concept

SSL VideoMAE Concept 7 - Layer 4 Head 4

This is an interesting spatiotemporal basis that highlights the right part of the video during the middle temporal segment of the video.

Input
Input
Concept
Concept

SSL VideoMAE Concept 8 - Layer 12 Head 9

This is another spatial position concept capturing the bottom left region of the video.

Input
Input
Concept
Concept

Most important concepts - InternVideo (dropping something into something)

InternVideo Concept 1 - Layer 11 Head 2

Interestingly, the most important concept for InternVideo captures hands dropping the object.

Input
Input
Concept
Concept

InternVideo Concept 2 - Layer 6 Head 8

This concept captures textured patterns in the image. Notably, it highlights background and foreground regions that contain textured patterns and tracks these regions throughout the video.

Input
Input
Concept
Concept

InternVideo Concept 3 - Layer 3 Head 11

This is a spatial position concept capturing the bottom right region of the video.

Input
Input
Concept
Concept

InternVideo Concept 4 - Layer 10 Head 12

This is another spatial position concept capturing the bottom right region of the video. However, unlike Concept 3, it is not completely temporally invariant: the boundary of the concept's support changes non-trivially over the video.

Input
Input
Concept
Concept

InternVideo Concept 5 - Layer 4 Head 1

This is another spatial position concept capturing the bottom left region of the video.

Input
Input
Concept
Concept

InternVideo Concept 6 - Layer 7 Head 1

This is another spatial position concept capturing the right region of the video.

Input
Input
Concept
Concept

InternVideo Concept 7 - Layer 4 Head 3

This is another spatial position concept capturing the top right region of the video.

Input
Input
Concept
Concept

InternVideo Concept 8 - Layer 1 Head 3

This concept, which occurs in the first layer, captures an orange-brown color.

Input
Input
Concept
Concept

Rosetta concepts - SSv2: rolling something on a flat surface

Here we visualize representative Rosetta concepts that are shared across all four models analyzed in our experiments: (1) TCOW, (2) supervised VideoMAE, (3) SSL VideoMAE, and (4) InternVideo. Only one Rosetta concept is shown by default; click "Show more" to see the full list.

Rosetta concept 1

In this visualization, we show the Rosetta concept with the highest score of 22% mIoU when filtering by the most important 7.5% of concepts. This Rosetta concept captures spatial position information and is contained in the early layers of all models.

Input
TCOW Layer3 Head4
VidMAE Layer3 Head2
VidMAESSL Layer3 Head4
InternVideo Layer4 Head1
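
As a rough illustration of how an mIoU-style agreement score between concepts from different models could be computed, the sketch below averages pairwise IoU over one binary (T, H, W) mask per model and selects the top 7.5% of concepts by importance. The exact Rosetta score is defined in the main paper; the pairwise averaging, the handling of empty masks, and the helper names `iou`, `mean_pairwise_iou` and `top_fraction` are assumptions made for this example.

```python
import itertools
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two binary masks of identical shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # assumption: two empty masks count as no overlap
    return float(np.logical_and(a, b).sum() / union)

def mean_pairwise_iou(masks: list) -> float:
    """Average IoU over every pair of masks (one mask per model)."""
    pairs = list(itertools.combinations(masks, 2))
    return sum(iou(a, b) for a, b in pairs) / len(pairs)

def top_fraction(importance: np.ndarray, fraction: float) -> np.ndarray:
    """Indices of the highest-scoring `fraction` of concepts (at least one)."""
    k = max(1, int(round(fraction * len(importance))))
    return np.argsort(importance)[::-1][:k]

# Toy data: four hypothetical models, each with a (T, H, W) = (2, 4, 4) mask.
base = np.zeros((2, 4, 4), dtype=bool)
base[:, :2, :2] = True        # top-left 2x2 block in every frame
shifted = np.zeros_like(base)
shifted[:, :2, 1:3] = True    # same block shifted one column to the right
score = mean_pairwise_iou([base, base.copy(), shifted, shifted.copy()])

# Keep the top 7.5% of 40 toy importance scores (3 concepts).
kept = top_fraction(np.arange(40.0), 0.075)
print(f"{score:.3f}", kept.tolist())
```

In this toy setup, identical masks contribute an IoU of 1 and the shifted pairs contribute 1/3, so the averaged score falls between the two.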

Rosetta concept 2

This visualization shows a Rosetta concept with a score of 18% mIoU. Interestingly, we observe that all models learn to localize and track individual objects over space and time. This is particularly interesting for self-supervised models like SSL VideoMAE and InternVideo, which do not have access to any labels.

Input
TCOW Layer9 Head8
VidMAE Layer11 Head4
VidMAESSL Layer12 Head7
InternVideo Layer10 Head5

Rosetta concept 3

In this visualization, we show a Rosetta concept with a Rosetta score of 15% mIoU. We again observe an object-centric concept in all models, capturing the notion of a hand.

Input
TCOW Layer4 Head5
VidMAE Layer8 Head1
VidMAESSL Layer10 Head5
InternVideo Layer10 Head2

Rosetta concept 4

In this visualization we observe a Rosetta concept with a score of 18% mIoU that highlights the region the object is rolling into. This suggests all models encode a notion of where an object will move to in the future.

Input
TCOW Layer11 Head12
VidMAE Layer11 Head2
VidMAESSL Layer6 Head6
InternVideo Layer9 Head12

Rosetta concept 5

In contrast to Rosetta concept 4, which captures where an object will roll to, this visualization shows a Rosetta concept (16% mIoU) that captures the region the rolling object has rolled from.

Input
TCOW Layer9 Head5
VidMAE Layer7 Head8
VidMAESSL Layer12 Head7
InternVideo Layer10 Head2

Query, Key, and Value Comparison

Finally, we demonstrate that VTCD produces interpretable concepts for units of interest other than keys, which are studied in the main paper. Here, we visualize the most important concepts discovered in the queries, keys, and values of the TCOW model. We note some similarities between the most important concepts discovered in each unit: (i) queries and keys produce concepts that closely track the target object; (ii) all units produce positional concepts. We also note some differences between the three units: (i) queries produce multiple concepts that track the target object at the beginning of the video but switch focus midway through; (ii) values produce the most positional concepts. Overall, keys yield the most diverse and interpretable concepts, validating our design choice.
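
For readers unfamiliar with how these three units arise, the sketch below shows one common way per-head queries, keys, and values are obtained from a fused attention projection. It is a sketch under stated assumptions rather than the code used for these results: the column layout (all queries, then all keys, then all values, each grouped by head) follows a widespread ViT convention but is assumed here, and `split_qkv` is a hypothetical helper.

```python
import numpy as np

def split_qkv(qkv: np.ndarray, num_heads: int) -> dict:
    """Split a fused projection of shape (N, 3 * num_heads * head_dim) into
    per-head queries, keys and values, each of shape (num_heads, N, head_dim).
    """
    n_tokens, fused_dim = qkv.shape
    if fused_dim % (3 * num_heads) != 0:
        raise ValueError("last dimension must be divisible by 3 * num_heads")
    head_dim = fused_dim // (3 * num_heads)
    # (N, 3, heads, head_dim) -> (3, heads, N, head_dim)
    parts = qkv.reshape(n_tokens, 3, num_heads, head_dim).transpose(1, 2, 0, 3)
    return {"queries": parts[0], "keys": parts[1], "values": parts[2]}

# Toy input: 8 spatiotemporal tokens, 12 heads of width 64.
rng = np.random.default_rng(0)
tokens, heads, dim = 8, 12, 64
fused = rng.standard_normal((tokens, 3 * heads * dim))
units = split_qkv(fused, num_heads=heads)

# Features of a single head; index 7 corresponds to "Head 8" in 1-based naming.
head_keys = units["keys"][7]
print(head_keys.shape)
```

Concept discovery can then be run on the per-head features of whichever unit is of interest, which is what the three groups of visualizations in this section compare.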

Most important concepts - TCOW Keys

Keys Concept 1 - Layer 5 Head 8

This concept highlights objects with similar appearance, suggesting that the model solves the disambiguation problem by first identifying possible distractors in the middle layers.

Input
Input
Concept
Concept

Keys Concept 2 - Layer 9 Head 12

This concept tracks the target object throughout the video.

Input
Input
Concept
Concept

Keys Concept 3 - Layer 3 Head 11

This concept captures temporally invariant spatial position in the top-left region of the video.

Input
Input
Concept
Concept

Keys Concept 4 - Layer 3 Head 9

This concept captures vertical spatial position and highlights a temporally invariant horizontal slice across the video.

Input
Input
Concept
Concept

Keys Concept 5 - Layer 10 Head 10

This concept again captures the target object across the entire video.

Input
Input
Concept
Concept

Keys Concept 6 - Layer 2 Head 4

This concept captures spatial information in the top left region of the video.

Input
Input
Concept
Concept

Keys Concept 7 - Layer 3 Head 3

This concept also captures spatial information in the top left region of the video.

Input
Input
Concept
Concept

Keys Concept 8 - Layer 3 Head 4

This concept also captures spatial information at the top of the video. Interestingly, all concepts encoding a spatiotemporal basis highlight regions in the middle of the video or higher. These regions may be particularly important for tracking objects in TCOW Kubric because objects are randomly spawned above containers in each video. Thus, a precise understanding of the top regions of the video is required for tracking the target object.

Input
Input
Concept
Concept

Most important concepts - TCOW Queries

Queries Concept 1 - Layer 10 Head 10

Interestingly, the most important concept tracks the target object through occlusions.

Input
Input
Concept
Concept

Queries Concept 2 - Layer 8 Head 6

The second most important concept highlights the region around the object during the beginning and middle of the video, but then remains in the same position afterwards.

Input
Input
Concept
Concept

Queries Concept 3 - Layer 8 Head 11

Similar to concept 2, this concept tracks the target object until it collides with something, and then ceases to highlight the target object and highlights the same region in space for the rest of the video.

Input
Input
Concept
Concept

Queries Concept 4 - Layer 9 Head 8

This concept closely tracks the target object.

Input
Input
Concept
Concept

Queries Concept 5 - Layer 9 Head 12

This concept highlights the target object as it falls through the top of the video, but then stops tracking it and continues to highlight the top center region of the video.

Input
Input
Concept
Concept

Queries Concept 6 - Layer 7 Head 9

Once again, this concept highlights the target object in the first frame, but then captures spatial position in the center of the video.

Input
Input
Concept
Concept

Queries Concept 7 - Layer 8 Head 8

Interestingly, this concept seems to track the region that the target object is moving into, potentially suggesting the model is anticipating where the target object will move to next.

Input
Input
Concept
Concept

Queries Concept 8 - Layer 6 Head 1

This concept captures a single container in the video.

Input
Input
Concept
Concept

Most important concepts - TCOW Values

Values Concept 1 - Layer 5 Head 9

Interestingly, the most important concept for the values captures the background in the top left region of the video. Notably, it does not highlight any objects, forming a ring-like shape around any object that travels through the top left region.

Input
Input
Concept
Concept

Values Concept 2 - Layer 4 Head 9

This is a temporally invariant spatial position concept highlighting the top left region of the video.

Input
Input
Concept
Concept

Values Concept 3 - Layer 2 Head 11

This concept captures positional information in the middle left of the video.

Input
Input
Concept
Concept

Values Concept 4 - Layer 4 Head 10

This concept highlights large objects in the video. This could be the model identifying possible occluders or containers in the middle layers for later processing.

Input
Input
Concept
Concept

Values Concept 5 - Layer 9 Head 12

Interestingly, this concept captures nothing until several frames into the video, at which point it captures large objects in the left part of the image, again suggesting the model may be identifying possible occluders.

Input
Input
Concept
Concept

Values Concept 6 - Layer 8 Head 11

This concept highlights many objects surrounding the target object, but not the target object itself.

Input
Input
Concept
Concept

Values Concept 7 - Layer 2 Head 3

This concept captures spatial information, highlighting the top portion of the video, but it also approximately follows some object boundaries, so it is not fully temporally invariant.

Input
Input
Concept
Concept

Values Concept 8 - Layer 5 Head 8

This is a spatial position concept highlighting the top center region of the video.

Input
Input
Concept
Concept