
Robustness of Large Language Models Against Adversarial Attacks

Link: http://arxiv.org/abs/2412.17011v1

PDF Link: http://arxiv.org/pdf/2412.17011v1

Summary: The increasing deployment of Large Language Models (LLMs) in various applications necessitates a rigorous evaluation of their robustness against adversarial attacks.

In this paper, we present a comprehensive study on the robustness of the GPT LLM family.

We employ two distinct evaluation methods to assess their resilience.

The first method introduces character-level text attacks in input prompts, testing the models on three sentiment classification datasets: StanfordNLP/IMDB, Yelp Reviews, and SST-2.
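
As an illustration, a character-level attack of this kind might look like the sketch below. The paper's exact perturbation operations and rates are not given here, so the choice of swap/delete/insert/substitute edits and the 5% default rate are assumptions for demonstration only.

```python
import random

# Minimal sketch of a character-level perturbation on an input prompt.
# Illustrative only: the specific edit operations and perturbation rate
# are assumptions, not the paper's exact attack procedure.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap, delete, insert, or substitute characters at the given rate."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["swap", "delete", "insert", "substitute"])
            if op == "swap" and i + 1 < len(chars):
                # Transpose the current character with the next one.
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            if op == "delete":
                i += 1
                continue
            if op == "insert":
                # Insert a random character before the current one.
                out.append(rng.choice(ALPHABET))
                out.append(chars[i])
            else:  # substitute
                out.append(rng.choice(ALPHABET))
        else:
            out.append(chars[i])
        i += 1
    return "".join(out)

if __name__ == "__main__":
    review = "The movie was absolutely wonderful and the acting was superb."
    print(perturb(review, rate=0.1))
```

The perturbed review would then be sent to the model in place of the clean one, and robustness measured as the drop in classification accuracy.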

The second method involves using jailbreak prompts to challenge the safety mechanisms of the LLMs.

Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks.

These findings underscore the necessity for improved adversarial training and enhanced safety mechanisms to bolster the robustness of LLMs.

Published on arXiv: 2024-12-22T13:21:15Z