DiffCLIP

Differential Attention Meets CLIP


Hasan Abed Al Kader Hammoud and Bernard Ghanem

King Abdullah University of Science and Technology (KAUST)

hasanabedalkader.hammoud@kaust.edu.sa

Abstract

We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency.

The Key Idea: Differential Attention

The Mathematics Behind Differential Attention

Differential Attention addresses attention noise by learning two separate attention distributions and subtracting one from the other, effectively canceling out spurious alignments.

\[ A_{\text{diff}} = A_1 - \lambda \cdot A_2 \]

where \(A_1\) and \(A_2\) are softmax attention maps computed from two separately parameterized query-key projections, and \(\lambda\) is a learnable parameter.

For DiffCLIP, we apply this mechanism to both the image and text encoders. The model learns to use one attention map to highlight important features while the second map identifies and cancels out noise or irrelevant patterns. With minimal additional parameters (roughly 0.003%), DiffCLIP effectively filters out noisy alignments in both vision and text streams.
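To make the mechanism concrete, the following PyTorch sketch implements a minimal single-head differential attention layer. The class name, projection layout, and the `lambda_init` default of 0.8 are illustrative assumptions rather than the released DiffCLIP code; a full implementation would typically be multi-head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    """Minimal single-head differential attention (illustrative sketch)."""

    def __init__(self, dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections yield the two attention maps A1 and A2.
        self.q1 = nn.Linear(dim, dim, bias=False)
        self.k1 = nn.Linear(dim, dim, bias=False)
        self.q2 = nn.Linear(dim, dim, bias=False)
        self.k2 = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Learnable scalar controlling how strongly A2 is subtracted from A1.
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        a1 = F.softmax((self.q1(x) @ self.k1(x).transpose(-2, -1)) * self.scale, dim=-1)
        a2 = F.softmax((self.q2(x) @ self.k2(x).transpose(-2, -1)) * self.scale, dim=-1)
        a_diff = a1 - self.lmbda * a2          # A_diff = A1 - lambda * A2
        return self.proj(a_diff @ self.v(x))   # weighted values, then output projection

x = torch.randn(2, 16, 64)                     # toy input: (batch, tokens, dim)
y = DifferentialAttention(64)(x)               # output has the same shape as x
```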

Visualize Differential Attention in Action

The project page includes an interactive demo: selecting an image (for example, a dog with a flower, or a detective office scene) and a text query shows how DiffCLIP focuses attention compared to standard CLIP.

Key Results

Our experimental evaluation demonstrates that differential attention consistently enhances CLIP performance across diverse benchmarks.

OOD Zero-Shot Performance

Does Differential Attention Improve Out-of-Domain Robustness?

Finding: DiffCLIP outperforms standard CLIP on challenging out-of-domain benchmark variants, with an average improvement of 2.1% when pretrained on CC12M, demonstrating more robust feature generalization under distribution shift.

MMVP-VLM Benchmarking

Does DiffCLIP Improve Fine-Grained Visual Understanding?

Finding: On the MMVP-VLM benchmark, DiffCLIP improves accuracy by 5.7% relative to baseline CLIP, suggesting that differential attention helps the model attend to subtle visual details.

DiffCLIP Variants Comparison

Does Applying Differential Attention to Vision Only Suffice?

Finding: DiffCLIP†, which applies differential attention only in the vision encoder, achieves comparable or better results than applying it to both encoders, suggesting that most of the benefit comes from enhancing visual representations.
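As a rough illustration of what the vision-only variant means operationally, the hedged sketch below recursively swaps standard attention modules for differential ones in a single encoder. The helper name, the assumption that attention blocks are `nn.MultiheadAttention`, and the `clip_model.visual` attribute are all hypothetical and depend on the specific CLIP code base.

```python
import torch.nn as nn

def patch_with_differential_attention(encoder: nn.Module, make_diff_attn) -> None:
    """Recursively replace attention modules in `encoder` with differential ones.

    Illustrative only: assumes attention blocks are nn.MultiheadAttention and that
    `make_diff_attn(embed_dim)` builds a replacement attention module (e.g., the
    DifferentialAttention sketch above). Real CLIP implementations may differ.
    """
    for name, child in encoder.named_children():
        if isinstance(child, nn.MultiheadAttention):
            setattr(encoder, name, make_diff_attn(child.embed_dim))
        else:
            patch_with_differential_attention(child, make_diff_attn)

# DiffCLIP†-style usage (hypothetical attribute names):
# patch_with_differential_attention(clip_model.visual, DifferentialAttention)
# The text encoder is left untouched and keeps its standard attention.
```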

Dynamic vs Static Lambda Parameter

Dynamic or Static \(\lambda_{\text{init}}\)?

Finding: The dynamic approach (DiffCLIP*) shows significant gains on zero-shot ImageNet (+2.8%) and text retrieval compared to fixed initialization, but it reduces performance on other tasks.
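For reference, a hedged sketch of the two initialization choices being compared: a fixed constant versus the depth-dependent schedule proposed in the DIFF Transformer paper. Whether DiffCLIP* uses exactly this schedule is an assumption here.

```python
import math

def lambda_init_static() -> float:
    # Fixed initialization: the same constant for every layer.
    return 0.8

def lambda_init_dynamic(layer_index: int) -> float:
    # Depth-dependent schedule from the DIFF Transformer paper:
    #   lambda_init(l) = 0.8 - 0.6 * exp(-0.3 * (l - 1))
    # Shown as one plausible "dynamic" choice; DiffCLIP* may define it differently.
    return 0.8 - 0.6 * math.exp(-0.3 * (layer_index - 1))
```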

Conclusion

We introduced DiffCLIP, which integrates differential attention into CLIP-based vision-language models to better filter out noisy alignments.

Key Contributions:

  • First integration of differential attention into CLIP-based VLMs, yielding a simple yet effective approach to reducing attention noise.
  • Consistent gains over baseline CLIP across diverse tasks, with a minimal parameter overhead of roughly 0.003%.
  • Detailed ablations showing that dynamic initialization can boost zero-shot performance and that applying differential attention solely in the vision encoder captures most of the benefits.
