Cybersecurity researchers have uncovered a novel jailbreak technique that significantly increases the likelihood of bypassing large language models' (LLMs) safety guardrails, potentially enabling the generation of harmful or malicious content.
The multi-turn attack strategy, dubbed "Bad Likert Judge" by Palo Alto Networks Unit 42 researchers, turns an AI model's own evaluation capabilities against it. The technique asks the target LLM to act as a judge scoring the harmfulness of responses using the Likert scale, a rating method that measures agreement or disagreement with a statement.
Researchers found that by instructing the LLM to generate responses aligned with different points on the harmfulness scale, attackers can potentially extract content with increasingly malicious characteristics. Tests conducted across multiple categories and six state-of-the-art text-generation LLMs revealed that the technique can increase the attack success rate (ASR) by more than 60% compared to traditional attack methods.
The research examined various potential misuse categories, including hate speech, harassment, self-harm, sexual content, illegal activities, malware generation, and system prompt leakage. By exploiting the models' inherent ability to understand and evaluate harmful content, the technique demonstrates a significant vulnerability in current AI safety mechanisms.
"By leveraging the LLM's understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model's safety guardrails," the researchers explained in their findings.
The attack method works through a multi-step process. Initially, the LLM is asked to act as a judge evaluating responses using specific harmfulness guidelines. Subsequently, the model is prompted to generate responses corresponding to different harm scales, with the highest-scored response potentially containing the most harmful content.
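For readers who find an outline easier to follow, the flow described above can be summarized as a short sketch. The snippet below is purely illustrative: the turn summaries are placeholders paraphrasing the researchers' description, actual prompt wording is deliberately omitted, and no specific model or API is assumed.

```python
# Illustrative outline of the multi-turn flow described above.
# The "purpose" fields are placeholder summaries of each turn,
# not working attack prompts; prompt text is intentionally omitted.
bad_likert_judge_flow = [
    {
        "turn": 1,
        "role": "user",
        "purpose": "Ask the model to act as a judge and give it a "
                   "Likert-style harmfulness rubric with specific guidelines.",
    },
    {
        "turn": 2,
        "role": "user",
        "purpose": "Ask the model to produce example responses matching each "
                   "point on the rubric; the example written for the highest "
                   "harm score is where harmful content may surface.",
    },
]

if __name__ == "__main__":
    for step in bad_likert_judge_flow:
        print(f"Turn {step['turn']} ({step['role']}): {step['purpose']}")
```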
To mitigate such risks, the researchers recommend implementing comprehensive content-filtering systems alongside LLMs. Their tests showed that content filters can reduce the attack success rate by an average of 89.2 percentage points across all tested models.
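As an illustration of that layered approach, the sketch below wraps a model call with separate input and output checks. It is a minimal example under stated assumptions: `generate`, `classify_prompt`, and `classify_output` are hypothetical placeholders standing in for a real model endpoint and a real content-classification service, not any vendor's actual API.

```python
# Minimal sketch of a content-filtering layer around an LLM call.
# All three helper functions below are hypothetical placeholders; a
# production system would call a real model endpoint and a trained
# content-moderation classifier instead.

BLOCK_MESSAGE = "Request blocked by content policy."


def classify_prompt(prompt: str) -> bool:
    """Return True if the incoming prompt looks unsafe (placeholder logic)."""
    suspicious_markers = ["<placeholder unsafe pattern>"]
    return any(marker in prompt.lower() for marker in suspicious_markers)


def classify_output(text: str) -> bool:
    """Return True if the model's output looks unsafe (placeholder logic)."""
    return "<placeholder unsafe marker>" in text.lower()


def generate(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return "model response for: " + prompt


def guarded_generate(prompt: str) -> str:
    """Apply input and output filters around the model call."""
    if classify_prompt(prompt):    # filter before the model sees the prompt
        return BLOCK_MESSAGE
    response = generate(prompt)
    if classify_output(response):  # filter before the response reaches the user
        return BLOCK_MESSAGE
    return response


if __name__ == "__main__":
    print(guarded_generate("Summarize today's security news."))
```

The key design point, consistent with the researchers' recommendation, is that filtering runs alongside the model rather than relying on the model's built-in guardrails alone.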
This discovery highlights the ongoing challenges in securing artificial intelligence systems against potential misuse. As AI technologies continue to advance, researchers and developers must remain vigilant in identifying and addressing potential vulnerabilities that could compromise system safety.
The findings underscore the importance of continuous research into AI security, emphasizing that no current model is entirely immune to sophisticated jailbreak techniques. Organizations deploying AI technologies are advised to implement multiple layers of security and stay informed about emerging threat vectors.
While the research demonstrates the potential for bypassing AI safety measures, the researchers stress that the technique targets edge cases and does not reflect typical AI model behaviors. Most AI models remain safe and secure when operated responsibly and with appropriate caution.
Anthony Denis is a Security News Reporter with a Bachelor's in Business Computer Application. Drawing from a decade of digital media marketing experience and two years of freelance writing, he brings technical expertise to cybersecurity journalism. His background in IT, content creation, and social media management enables him to deliver complex security topics with clarity and insight.