Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes.
We introduce EZPC (pronounced "easy-peasy"), a method that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable.
Extensive experiments on five benchmark datasets (CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k) demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models.
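As a rough, non-authoritative sketch of the projection idea described above (the dimensions, variable names, and random weights below are illustrative placeholders, not the released implementation): concept activations are obtained by a single linear map of the CLIP embedding, and a reconstruction term encourages that map to preserve the embedding's semantic content.

```python
import numpy as np

# Hypothetical dimensions: CLIP embedding d=512, concept vocabulary k=1000.
d, k = 512, 1000
rng = np.random.default_rng(0)

# In EZPC this matrix would be learned from language descriptions;
# here it is random, purely for illustration.
W = rng.standard_normal((k, d)) / np.sqrt(d)

img_emb = rng.standard_normal(d)
img_emb /= np.linalg.norm(img_emb)   # CLIP embeddings are L2-normalized

# Concept activations: one linear projection of the embedding.
c = W @ img_emb                      # shape (k,)

# A reconstruction objective pushes W.T @ c back toward the original
# embedding, so the concept space preserves CLIP's semantic structure.
recon = W.T @ c
recon = recon / np.linalg.norm(recon)
recon_loss = float(np.mean((recon - img_emb) ** 2))
```

The alignment objective (matching concept activations to CLIP's image-text similarities) would add a second loss term on top of this reconstruction term.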
We evaluate EZPC on five benchmark datasets (CIFAR-100, ImageNet-100, CUB, ImageNet-1k, and Places365) under the generalized zero-shot setting using the CLIP RN50 backbone. EZPC maintains strong performance close to CLIP while providing concept-level explanations, and outperforms prior explainable zero-shot methods such as Z-CBM and SpLiCE. These results show that projecting CLIP embeddings into a concept space preserves semantic structure without sacrificing accuracy.
| Model | CIFAR-100 Seen | Unseen | H | ImageNet-100 Seen | Unseen | H | CUB Seen | Unseen | H | ImageNet-1k Seen | Unseen | H | Places365 Seen | Unseen | H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | 0.370 | 0.454 | 0.408 | 0.680 | 0.707 | 0.693 | 0.468 | 0.481 | 0.474 | 0.513 | 0.548 | 0.530 | 0.350 | 0.375 | 0.362 |
| Z-CBM | 0.319 | 0.425 | 0.365 | 0.592 | 0.579 | 0.585 | 0.183 | 0.195 | 0.189 | 0.439 | 0.486 | 0.462 | 0.349 | 0.365 | 0.357 |
| SpLiCE | 0.248 | 0.298 | 0.270 | 0.371 | 0.409 | 0.389 | 0.100 | 0.053 | 0.070 | 0.275 | 0.331 | 0.300 | 0.276 | 0.288 | 0.282 |
| EZPC | 0.365 | 0.449 | 0.403 | 0.675 | 0.690 | 0.682 | 0.457 | 0.473 | 0.465 | 0.468 | 0.494 | 0.481 | 0.339 | 0.366 | 0.352 |
Table 1. Generalized zero-shot classification accuracy (Seen, Unseen, Harmonic mean). EZPC achieves performance close to CLIP while remaining fully interpretable; it clearly outperforms SpLiCE on all datasets and Z-CBM on all but Places365, where the two are on par.
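For reference, the H column is the standard harmonic mean of seen- and unseen-class accuracy used in generalized zero-shot evaluation. A minimal helper reproduces the table values:

```python
def harmonic_mean(seen: float, unseen: float) -> float:
    """Harmonic mean of seen/unseen accuracy, as reported in the H columns."""
    return 2 * seen * unseen / (seen + unseen)

# CLIP on CIFAR-100 (Table 1): seen 0.370, unseen 0.454.
print(round(harmonic_mean(0.370, 0.454), 3))  # → 0.408
```

The harmonic mean penalizes imbalance: a model that sacrifices unseen-class accuracy for seen-class accuracy scores lower than one with the same average but balanced performance.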
We train the concept projection on ImageNet-100 and evaluate on CIFAR-100 and CUB without any fine-tuning. EZPC transfers effectively across domains, maintaining performance close to CLIP on both object-centric and fine-grained datasets. This demonstrates that the learned concept space captures general visual semantics that are not dataset-specific.
| Target Dataset | Model | Zero-shot Seen | Zero-shot Unseen | GZS Seen | GZS Unseen | GZS H |
|---|---|---|---|---|---|---|
| CIFAR-100 | CLIP | 0.686 | 0.387 | 0.663 | 0.266 | 0.380 |
| CIFAR-100 | EZPC | 0.684 | 0.363 | 0.659 | 0.296 | 0.409 |
| CUB | CLIP | 0.686 | 0.471 | 0.617 | 0.458 | 0.526 |
| CUB | EZPC | 0.674 | 0.461 | 0.607 | 0.448 | 0.515 |
Table 2. Cross-dataset transfer: projection trained on ImageNet-100, evaluated on CIFAR-100 and CUB. EZPC stays within 1-3% of CLIP without any retraining.
A key advantage of EZPC is its computational efficiency. Unlike optimization-based methods (SpLiCE) or retrieval-based approaches (Z-CBM), EZPC performs a single matrix multiplication at inference time. This makes it suitable for large-scale deployment and interactive analysis.
| Method | Embedding (ms/img) | Full Pipeline (ms/img) | Overhead |
|---|---|---|---|
| CLIP | 0.0001 ± 0.0000 | 5.77 ± 0.55 | 1.0× |
| Z-CBM | 97.55 ± 1.33 | 542.34 ± 6.02 | 94.0× |
| SpLiCE | 4.50 ± 0.54 | 338.51 ± 4.39 | 58.7× |
| EZPC | 0.0006 ± 0.0000 | 5.90 ± 0.73 | ∼1.0× |
Table 3. Inference time on ImageNet-100 (NVIDIA H100 GPU). EZPC adds only ~0.1 ms per image over CLIP, while Z-CBM and SpLiCE are 94× and 59× slower respectively.
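A minimal sketch of why the overhead is negligible (illustrative shapes and random weights, not the actual EZPC code): classification through the concept space reduces to dense matrix multiplications, with no per-image optimization or retrieval.

```python
import numpy as np

# Illustrative dimensions: embedding d=512, concepts k=1000, 5 classes.
d, k, n_classes = 512, 1000, 5
rng = np.random.default_rng(1)
W = rng.standard_normal((k, d)) / np.sqrt(d)   # stand-in for the learned projection

img = rng.standard_normal(d)
img /= np.linalg.norm(img)                     # image embedding (L2-normalized)
txt = rng.standard_normal((n_classes, d))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)  # class text embeddings

# Inference is matrix multiplication, nothing iterative:
img_c = W @ img          # (k,) image concept activations
txt_c = txt @ W.T        # (n_classes, k) class concept profiles (precomputable)
logits = txt_c @ img_c   # similarity scores in concept space
pred = int(np.argmax(logits))
```

Since `txt_c` depends only on the class names, it can be computed once and cached, leaving a single `(k, d) @ (d,)` product per image; this contrasts with SpLiCE's per-image sparse optimization and Z-CBM's per-image retrieval.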
Are EZPC's explanations faithful? We test this by ablating the top-n most influential concepts and measuring the effect on predictions. If the identified concepts are causally responsible, removing them should degrade model confidence.
| Top-n | Logit Drop | Flip Rate |
|---|---|---|
| 1 | 0.0306 | 0.059 |
| 3 | 0.0816 | 0.099 |
| 5 | 0.1263 | 0.132 |
| 10 | 0.2256 | 0.169 |
Table 4a. Logit drop and prediction flip rate increase monotonically with n, confirming causal involvement of top-ranked concepts.
| Removal Type | Flip Count | Flip Rate |
|---|---|---|
| Top-10 concepts | 845 | 0.169 |
| Random-10 concepts | 70 | 0.014 |
Table 4b. Removing the top-10 concepts flips predictions 12× more often than removing 10 random concepts, confirming that explanations reflect true causal structure.
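The ablation protocol behind Tables 4a-4b can be sketched as follows (illustrative, with random stand-in activations; the paper's measurements use real concept activations and class weights): rank concepts by their contribution to the predicted class's logit, zero out the top-n, and check how much the logit drops and whether the prediction flips.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n_classes = 1000, 5
img_c = rng.standard_normal(k)               # concept activations for one image
cls_c = rng.standard_normal((n_classes, k))  # class weights in concept space

logits = cls_c @ img_c
pred = int(np.argmax(logits))

# Influence of each concept on the predicted class = its logit contribution.
contrib = cls_c[pred] * img_c
top_n = np.argsort(contrib)[-10:]            # top-10 most influential concepts

ablated = img_c.copy()
ablated[top_n] = 0.0                         # remove ("ablate") those concepts
new_logits = cls_c @ ablated

logit_drop = float(logits[pred] - new_logits[pred])
flipped = int(np.argmax(new_logits)) != pred
```

Averaging `logit_drop` and `flipped` over a dataset yields the Logit Drop and Flip Rate columns; repeating the removal with random concept indices gives the Random-10 baseline in Table 4b.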
EZPC is backbone-agnostic. We evaluate it across four CLIP/SigLIP architectures of increasing capacity. Larger backbones consistently improve both zero-shot and generalized zero-shot performance, showing that the concept-based decomposition scales naturally with model capacity.
| Backbone | Variant | Zero-shot Seen | Zero-shot Unseen | GZS Seen | GZS Unseen | GZS H |
|---|---|---|---|---|---|---|
| CLIP RN50 | Base | 0.706 | 0.855 | 0.680 | 0.707 | 0.693 |
| CLIP RN50 | EZPC | 0.699 | 0.851 | 0.675 | 0.690 | 0.682 |
| CLIP ViT-B/32 | Base | 0.729 | 0.887 | 0.703 | 0.715 | 0.709 |
| CLIP ViT-B/32 | EZPC | 0.724 | 0.879 | 0.694 | 0.716 | 0.705 |
| CLIP ViT-L/14 | Base | 0.839 | 0.925 | 0.821 | 0.836 | 0.828 |
| CLIP ViT-L/14 | EZPC | 0.832 | 0.924 | 0.812 | 0.831 | 0.822 |
| SigLIP ViT-SO400M/14 | Base | 0.882 | 0.972 | 0.871 | 0.889 | 0.880 |
| SigLIP ViT-SO400M/14 | EZPC | 0.880 | 0.972 | 0.870 | 0.886 | 0.878 |
Table 5. Backbone ablation on ImageNet-100. EZPC consistently retains near-baseline performance across all architectures, with stronger backbones yielding higher absolute accuracy.
We present qualitative results demonstrating EZPC's interpretability at multiple levels: per-image concept explanations, class-level concept distributions, concept-based image retrieval, and spatial grounding of concepts in image regions.
Figure 1. Image-level explanations: for each prediction, EZPC reveals the top contributing concepts and their activation scores, showing which human-understandable attributes drive CLIP's decision.
Figure 2. Class-level concept distributions: aggregated concept activations across all images of a class, showing which concepts consistently characterize each category.
Figure 3. Concept clustering: for each concept, we retrieve the images with the highest activation scores, showing that learned concepts capture coherent visual patterns across diverse images.
Figure 4. Concept-region alignment: spatial heatmaps showing where each concept activates in the image, demonstrating that EZPC grounds its explanations in semantically meaningful regions.
We acknowledge the computational resources provided by METU Center for Robotics and Artificial Intelligence (METU-ROMER) and TUBITAK ULAKBIM TRUBA. Dr. Alaniz is supported by Hi! PARIS and ANR/France 2030 program (ANR-23-IACL-0005). Dr. Akata acknowledges partial funding by the ERC (853489 - DEXIM) and the Alfried Krupp von Bohlen und Halbach Foundation. Dr. Akbas gratefully acknowledges the support of TUBITAK 2219.
@inproceedings{ozdemir2026ezpc,
title = {Explaining CLIP Zero-shot Predictions Through Concepts},
author = {Ozdemir, Onat and Christensen, Anders and Alaniz, Stephan and Akata, Zeynep and Akbas, Emre},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}