New 'Bad Likert Judge' AI Jailbreak Technique Boosts Malicious Response Success by Over 60%
Summary
Cybersecurity researchers have discovered a new AI jailbreak technique called 'Bad Likert Judge' that can increase the success rate of malicious prompts in bypassing large language models' (LLMs') safety guardrails by more than 60%, enabling the generation of potentially harmful or illegal content.
Key Points
- Researchers have discovered a new jailbreak technique called 'Bad Likert Judge' that can increase the success rate of bypassing large language models' safety guardrails by over 60%.
- The technique involves asking the LLM to act as a judge and score the harmfulness of responses on a Likert scale, then asking it to generate example responses for each rating; the example corresponding to the highest rating can contain the harmful content.
- Tests across various harm categories and LLMs demonstrated the technique's effectiveness, underscoring the need for comprehensive content filtering when deploying LLMs (see the output-filter sketch after this list).
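The content-filtering mitigation mentioned above can be as simple as screening every model response with a moderation classifier before it reaches the user. The sketch below illustrates that pattern; it is not part of the original research, and the model names, threshold behavior, and use of the OpenAI moderation endpoint are assumptions chosen purely for illustration.

```python
"""Minimal sketch of an output content filter for an LLM deployment.

Illustrative only: the model names and the choice of the OpenAI
moderation endpoint are assumptions, not the researchers' implementation.
The idea is simply to screen each response before returning it, the kind
of secondary check that can catch output produced by multi-turn
jailbreaks such as 'Bad Likert Judge'.
"""

from openai import OpenAI  # assumed dependency: openai>=1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_response_safe(text: str) -> bool:
    """Return True if the moderation classifier flags no category."""
    result = client.moderations.create(
        model="omni-moderation-latest",  # assumed moderation model
        input=text,
    )
    return not result.results[0].flagged


def guarded_reply(user_prompt: str) -> str:
    """Generate a reply, then suppress it if the output filter flags it."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": user_prompt}],
    )
    reply = completion.choices[0].message.content or ""
    if not is_response_safe(reply):
        return "Sorry, I can't help with that."
    return reply
```

Filtering the model's output, rather than only the incoming prompt, matters here because multi-turn jailbreaks like this one are designed to make the harmful text appear in the response rather than in any single user message.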