Feb 19, 2025 • 1 min read • The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 • arxiv papers
Feb 19, 2025 • 1 min read • H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking • arxiv papers
Feb 19, 2025 • 1 min read • Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking • arxiv papers
Feb 19, 2025 • 1 min read • Understanding and Rectifying Safety Perception Distortion in VLMs • arxiv papers
Feb 18, 2025 • 1 min read • CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models • arxiv papers
Feb 18, 2025 • 1 min read • Detecting and Filtering Unsafe Training Data via Data Attribution • arxiv papers
Feb 18, 2025 • 1 min read • Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training • arxiv papers
Feb 18, 2025 • 1 min read • DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing • arxiv papers
Feb 12, 2025 • 1 min read • OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms • arxiv papers
Feb 12, 2025 • 1 min read • JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation • arxiv papers
Feb 11, 2025 • 1 min read • When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs • arxiv papers