Jul 15, 2025 • 1 min read • Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks • arxiv papers
Jul 11, 2025 • 1 min read • GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing • arxiv papers
Jul 10, 2025 • 1 min read • Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models • arxiv papers
Jul 10, 2025 • 1 min read • On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks • arxiv papers
Jul 9, 2025 • 1 min read • The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation • arxiv papers
Jul 9, 2025 • 1 min read • TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data • arxiv papers
Jul 9, 2025 • 1 min read • CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations • arxiv papers
Jul 8, 2025 • 1 min read • Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message • arxiv papers
Jul 8, 2025 • 1 min read • Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models • arxiv papers
Jul 4, 2025 • 1 min read • PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage • arxiv papers
Jul 4, 2025 • 1 min read • Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models • arxiv papers
Jul 4, 2025 • 1 min read • Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection • arxiv papers