Securing Large Language Models (LLMs) from Prompt Injection Attacks

Link: http://arxiv.org/abs/2512.01326v1

PDF Link: http://arxiv.org/pdf/2512.01326v1

Summary: Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks.

These attacks leverage the model's instruction-following ability to make it perform malicious tasks.

Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions.

In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts.

We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness.

We fine-tuned LLaMA 2-7B, Qwen1.

5-4B, and Qwen1.

5-0.

5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.

5-Turbo baseline.

Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses.

We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility.

Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.

Published on arXiv on: 2025-12-01T06:34:20Z