
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Link: http://arxiv.org/abs/2508.10404v1

PDF Link: http://arxiv.org/pdf/2508.10404v1

Summary: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness.

In this context, we propose a new black-box attack method that leverages the interpretability of large models.

We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text.

After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations.

These highly activated features are then perturbed to generate new adversarial texts.
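The pipeline described above — encode hidden states with an SAE, find the features most active on successful attacks, then perturb those features and decode — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the SAE weights are random stand-ins for a trained model, the dimensions are toy-sized, and mean-activation ranking is a simple proxy for the paper's feature-clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; a real SAE is trained on actual LLM hidden states.
d_model, d_sae = 16, 64

# Hypothetical SAE weights (random here, purely for illustration).
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def sae_encode(h):
    # ReLU encoder yields sparse, non-negative feature activations.
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(f):
    # Linear decoder reconstructs the hidden-layer representation.
    return f @ W_dec

# Hidden states of several "successfully attacked" texts (toy data).
H_success = rng.normal(size=(10, d_model))
F = sae_encode(H_success)

# Rank features by mean activation across the successful attacks
# (a simple stand-in for the paper's feature-clustering step).
k = 5
top_feats = np.argsort(F.mean(axis=0))[-k:]

# Perturb those high-activation features in a new candidate text's
# representation, then decode back to a hidden state.
h_new = rng.normal(size=(d_model,))
f_new = sae_encode(h_new)
f_new[top_feats] *= 1.5  # amplify the selected features
h_perturbed = sae_decode(f_new)
```

In the actual framework, `h_perturbed` would be mapped back through the model to produce a new adversarial text; how that inversion is performed, and which layer is perturbed, are choices the paper reports as affecting attack success.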

This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing the adversarial texts' potential to evade existing defenses.

Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment.

Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems.

However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.

Published on arXiv on: 2025-08-14T07:12:44Z