Jul 3, 2025 • 1 min read SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism arxiv papers
Jul 1, 2025 • 1 min read Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models arxiv papers
Jul 1, 2025 • 1 min read Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages arxiv papers
Jul 1, 2025 • 1 min read Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models arxiv papers
Jun 25, 2025 • 1 min read PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty arxiv papers
Jun 25, 2025 • 1 min read MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models arxiv papers
Jun 24, 2025 • 1 min read NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation arxiv papers
Jun 24, 2025 • 1 min read Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks arxiv papers
Jun 17, 2025 • 1 min read Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-AI Interactions arxiv papers
Jun 17, 2025 • 1 min read Weakest Link in the Chain: Security Vulnerabilities in Advanced Reasoning Models arxiv papers
Jun 13, 2025 • 1 min read SoK: Evaluating Jailbreak Guardrails for Large Language Models arxiv papers
Jun 13, 2025 • 1 min read How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? arxiv papers