Apr 15, 2025 • 1 min read RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability arxiv papers
Apr 15, 2025 • 1 min read LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks arxiv papers
Apr 11, 2025 • 1 min read Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge arxiv papers
Apr 9, 2025 • 1 min read Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking arxiv papers
Apr 9, 2025 • 1 min read Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking arxiv papers
Apr 8, 2025 • 1 min read Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models arxiv papers
Apr 8, 2025 • 1 min read A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models arxiv papers
Apr 8, 2025 • 1 min read Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models arxiv papers
Apr 4, 2025 • 1 min read More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment arxiv papers
Apr 4, 2025 • 1 min read LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks arxiv papers
Apr 3, 2025 • 1 min read Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning arxiv papers