arxiv papers

Jul 15, 2025 • 1 min read

Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks

Jul 11, 2025 • 1 min read

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

Jul 10, 2025 • 1 min read

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

Jul 10, 2025 • 1 min read

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Jul 9, 2025 • 1 min read

The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Jul 9, 2025 • 1 min read

TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Jul 9, 2025 • 1 min read

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Jul 8, 2025 • 1 min read

Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Jul 8, 2025 • 1 min read

Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Jul 4, 2025 • 1 min read

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

Jul 4, 2025 • 1 min read

Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

Jul 4, 2025 • 1 min read

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
