Link: http://arxiv.org/abs/2508.20570v1
PDF Link: http://arxiv.org/pdf/2508.20570v1
Summary: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation, and even Vision-Language Model jailbreaks.
In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token.
Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit consisting of these attention heads.
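The head-ablation idea lends itself to a compact illustration. Below is a minimal sketch of zero-ablating selected attention heads in a HuggingFace CLIP vision encoder; the (layer, head) indices are hypothetical placeholders, since the actual typographic circuit comes from the paper's causal analysis, and the paper's ablation scheme may differ from simple zeroing.

```python
import torch
from transformers import CLIPModel

# Hypothetical (layer, head) indices standing in for the typographic
# circuit; the real indices come from the paper's causal analysis.
TYPOGRAPHIC_HEADS = [(7, 3), (9, 11), (10, 5)]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
cfg = model.config.vision_config
d_head = cfg.hidden_size // cfg.num_attention_heads

with torch.no_grad():
    for layer, head in TYPOGRAPHIC_HEADS:
        out_proj = model.vision_model.encoder.layers[layer].self_attn.out_proj
        # Zero the output-projection columns that read this head's
        # activations, removing its contribution to the residual stream.
        out_proj.weight[:, head * d_head:(head + 1) * d_head] = 0.0
```

Because the edit is applied to the weights themselves, an ablated model of this kind slots into any existing CLIP pipeline unchanged, which is consistent with the drop-in replacement framing below.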
Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%.
Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning.
To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks.
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
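For context, a typographic attack of the kind described in the summary can be reproduced with a few lines of Pillow; the file name, text placement, and attack string below are arbitrary examples, not the paper's evaluation protocol.

```python
from PIL import Image, ImageDraw

def typographic_attack(image: Image.Image, text: str) -> Image.Image:
    """Overlay an attack string (e.g., a wrong class name) onto an image."""
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    # The default bitmap font keeps the sketch dependency-free; real
    # attacks typically use larger, high-contrast text.
    draw.text((10, 10), text, fill="white")
    return attacked

img = Image.open("cat.jpg")                   # hypothetical input image
attacked = typographic_attack(img, "taxi")    # caption a cat image "taxi"
```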
Published on arXiv on: 2025-08-28T09:08:30Z