arXiv papers · 1 min read

FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Link: http://arxiv.org/abs/2511.06947v1

PDF Link: http://arxiv.org/pdf/2511.06947v1

Summary: The well-aligned nature of CLIP-based models enables effective applications such as CLIPscore, a widely adopted image quality assessment metric.

However, such a CLIP-based metric is vulnerable because its multimodal alignment is delicate.

In this work, we propose **FoCLIP**, a feature-space misalignment framework for fooling CLIP-based image quality metrics.

Built on stochastic gradient descent, FoCLIP integrates three key components to construct fooling examples: feature alignment as the core module to reduce the image-text modality gap, a score distribution balance module, and pixel-guard regularization, which together optimize the trade-off between CLIPscore performance and image quality.

This design can be engineered to maximize CLIPscore predictions across diverse input prompts, even when the resulting images are visually unrecognizable or semantically incongruent with the corresponding adversarial prompts from a human perceptual perspective.
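The optimization loop described above can be sketched in miniature. This is a toy illustration only: the real FoCLIP attacks a frozen CLIP model, whereas here a random linear map `W` stands in for the image encoder and a fixed vector `t` for the text embedding; the gradient-ascent objective (cosine similarity plus a pixel-guard penalty tying the image to its original) mirrors the structure of the described method, not its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): W is a random linear "image encoder",
# t a fixed unit-norm "text embedding". FoCLIP itself uses CLIP's
# frozen image/text towers instead.
D_PIX, D_EMB = 64, 16
W = rng.normal(size=(D_EMB, D_PIX))
t = rng.normal(size=D_EMB)
t /= np.linalg.norm(t)

def clip_like_score(x):
    """Cosine similarity between the encoded image and the text embedding."""
    z = W @ x
    return float(z @ t / (np.linalg.norm(z) + 1e-8))

x0 = rng.uniform(size=D_PIX)   # original image, flattened
x = x0.copy()
lam, lr = 5.0, 1e-2            # pixel-guard weight, SGD step size

for _ in range(500):
    z = W @ x
    zn = np.linalg.norm(z) + 1e-8
    # Gradient of cos(z, t) w.r.t. x, chained through the linear encoder:
    # d/dz (z.t / ||z||) = t/||z|| - (z.t) z / ||z||^3
    g_score = W.T @ (t / zn - (z @ t) * z / zn**3)
    g_guard = 2.0 * lam * (x - x0)   # pixel-guard: stay close to original
    x += lr * (g_score - g_guard)    # gradient ascent on the fooling objective

print("score before/after:", clip_like_score(x0), clip_like_score(x))
print("max pixel deviation:", np.abs(x - x0).max())
```

The pixel-guard term is what keeps the fooled image visually close to the original: raising `lam` trades CLIPscore gain for fidelity, which corresponds to the equilibrium the paper describes between metric performance and image quality.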

Experiments on ten artistic masterpiece prompts and ImageNet subsets demonstrate that the optimized images achieve significant improvements in CLIPscore while preserving high visual fidelity.

In addition, we found that grayscale conversion induces significant feature degradation in fooling images, producing a noticeable CLIPscore reduction while preserving statistical consistency with the original images.

Inspired by this phenomenon, we propose a color-channel-sensitivity-driven tampering detection mechanism that achieves 91% accuracy on standard benchmarks.
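The detection rule can be illustrated as follows. Everything here is a hypothetical stand-in: the toy `score` function rewards chroma energy (where this sketch's "adversarial" perturbation lives), standing in for CLIPscore, and the decision rule flags an image when grayscale conversion causes a large score drop, mirroring the color-channel-sensitivity idea from the abstract rather than the paper's actual detector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for CLIPscore (assumption): it rewards energy in
# the chroma channels, where this toy's adversarial perturbation lives.
def score(img):
    """img: (H, W, 3) array in [0, 1]."""
    chroma = img - img.mean(axis=2, keepdims=True)
    return float(np.abs(chroma).mean() * 10.0)

def to_grayscale(img):
    # ITU-R BT.601 luma weights, broadcast back to 3 channels
    g = img @ np.array([0.299, 0.587, 0.114])
    return np.repeat(g[..., None], 3, axis=2)

def is_tampered(img, threshold=0.2):
    """Flag the image if grayscale conversion causes a large score drop."""
    drop = score(img) - score(to_grayscale(img))
    return drop > threshold

# A clean near-gray image vs. one carrying a color-channel perturbation
clean = np.repeat(rng.uniform(size=(8, 8, 1)), 3, axis=2)
fooled = np.clip(clean + rng.normal(scale=0.2, size=(8, 8, 3)), 0.0, 1.0)

print("clean flagged:", is_tampered(clean))
print("fooled flagged:", is_tampered(fooled))
```

The key observation being modeled is the one reported above: fooling images lose score under grayscale conversion while clean images do not, so the score drop itself becomes the detection signal.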

In conclusion, this work establishes a practical pathway for feature misalignment in CLIP-based multimodal systems, along with a corresponding defense method.

Published on arXiv on: 2025-11-10T10:54:35Z