
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms

Link: http://arxiv.org/abs/2504.13052v1

PDF Link: http://arxiv.org/pdf/2504.13052v1

Summary: Large Language Models (LLMs) have been equipped with safety mechanisms to prevent harmful outputs, but these guardrails can often be bypassed through "jailbreak" prompts.

This paper introduces a novel graph-based approach to systematically generate jailbreak prompts through semantic transformations.

We represent malicious prompts as nodes in a graph structure with edges denoting different transformations, leveraging Abstract Meaning Representation (AMR) and Resource Description Framework (RDF) to parse user goals into semantic components that can be manipulated to evade safety filters.
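As a rough illustration (not the authors' implementation), the graph representation described above can be sketched as a generic directed graph in which nodes hold prompt variants and labeled edges record which semantic transformation relates them. The node identifiers, placeholder text, edge labels, and the networkx dependency below are all illustrative assumptions; no AMR/RDF parsing or filter-evasion logic is included.

```python
# Hypothetical sketch of a prompt-transformation graph for red-team bookkeeping.
# Nodes are prompt variants; edge attributes name the semantic transformation
# applied between them. Placeholder strings only.
import networkx as nx

graph = nx.DiGraph()
graph.add_node("p0", text="original user goal (placeholder)")
graph.add_node("p1", text="goal restated inside a fictional scenario (placeholder)")
graph.add_node("p2", text="goal abstracted into generic semantic roles (placeholder)")

graph.add_edge("p0", "p1", transformation="contextual_framing")
graph.add_edge("p0", "p2", transformation="abstraction")

# Enumerate which variants were derived from which, and by what transformation.
for source, target, data in graph.edges(data=True):
    print(f"{source} -> {target} via {data['transformation']}")
```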

We demonstrate a particularly effective exploitation vector by instructing LLMs to generate code that realizes the intent described in these semantic graphs, achieving success rates of up to 87% against leading commercial LLMs.

Our analysis reveals that contextual framing and abstraction are particularly effective at circumventing safety measures, highlighting critical gaps in current safety alignment techniques that focus primarily on surface-level patterns.

These findings provide insights for developing more robust safeguards against structured semantic attacks.

Our research contributes both a theoretical framework and a practical methodology for systematically stress-testing LLM safety mechanisms.

Published on arXiv on: 2025-04-17T16:09:12Z