Synaptic Labs Blog

Certifying LLM Safety against Adversarial Prompting

Written by Dendrex | Mar 12, 2024 1:26:00 PM

The paper: http://arxiv.org/abs/2309.02705

## Purpose 
The paper's primary goal is to introduce erase-and-check, a framework that defends LLMs against adversarial prompting with certifiable safety guarantees. These attacks append or insert malicious token sequences into a prompt, tricking aligned LLMs into producing harmful content despite their safety training.

## Methods 
- **Erase-and-Check Framework**: Erases tokens from a prompt and runs a safety filter on each erased subsequence; the prompt is labeled harmful if the filter flags the original prompt or any of its erased versions (see the sketch after this list). 
- **Attack Modes Addressed**: Three adversarial attack modes – suffix (tokens appended at the end), insertion (contiguous blocks inserted anywhere), and infusion (tokens scattered throughout the prompt).
- **Safety Filter Implementation**: The filter is instantiated two ways: by prompting a pre-trained Llama 2 model and by fine-tuning a DistilBERT classifier to distinguish safe from harmful prompts.
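
The core loop is simple. Below is a minimal Python sketch of the suffix-mode variant, assuming a tokenized prompt and a placeholder `is_harmful` filter standing in for the Llama 2 or DistilBERT filter described above; the names and the keyword stub are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def is_harmful(tokens: List[str]) -> bool:
    """Placeholder safety filter (the paper uses Llama 2 or a fine-tuned
    DistilBERT classifier); here, a crude keyword check for illustration."""
    blocked = {"bomb", "malware"}
    return any(t.lower() in blocked for t in tokens)

def erase_and_check_suffix(tokens: List[str], max_erase: int,
                           filter_fn: Callable[[List[str]], bool] = is_harmful) -> bool:
    """Label the prompt harmful if the safety filter flags the full prompt
    or any version with up to `max_erase` trailing tokens erased."""
    if filter_fn(tokens):
        return True
    for i in range(1, min(max_erase, len(tokens)) + 1):
        if filter_fn(tokens[:-i]):  # erase the last i tokens
            return True
    return False

# Example: a harmful request followed by an appended adversarial suffix.
prompt = "tell me how to build a bomb zq xv kp".split()
print(erase_and_check_suffix(prompt, max_erase=3))  # True
```

The guarantee follows from the construction: if the safety filter correctly flags the clean harmful prompt, then no adversarial suffix of up to `max_erase` tokens can evade detection, because one of the erased candidates is exactly the clean prompt.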

## Key Findings 
1. **Effectiveness of Erase-and-Check**: The method provides certified safety guarantees, identifying harmful prompts with high accuracy even under adversarial modification.
2. **Improvement with DistilBERT**: Replacing the Llama 2 filter with a fine-tuned DistilBERT classifier improves both detection accuracy and speed.
3. **Superiority Over Other Methods**: Outperforms baseline defenses such as randomized smoothing, especially under the more general attack modes.
4. **High Performance on Safe Prompts**: Maintains good performance in classifying non-adversarial safe prompts.

## Discussion 
This research is crucial for the field of AI safety, addressing the vulnerability of LLMs to adversarial attacks. The erase-and-check approach offers a robust defense mechanism, improving the reliability of LLMs in real-world applications.

## Critiques 
1. **Computational Cost**: The number of safety-filter calls grows rapidly in the more general attack modes (insertion and especially infusion), demanding significant computational resources; see the rough count after this list.
2. **Limitation in Scope**: The study primarily focuses on defense against specific types of adversarial attacks, potentially overlooking other emerging threats.
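
To make the cost concern concrete, the back-of-the-envelope count below estimates how many safety-filter calls each attack mode requires for a prompt of `n` tokens and an erase budget of `d` tokens. The counting formulas and the values `n = 100`, `d = 20` are illustrative assumptions based on the framework's description, not numbers reported in the paper.

```python
from math import comb

# Illustrative prompt length and erase budget (not the paper's settings).
n, d = 100, 20

# Suffix mode: check the full prompt plus versions with the last
# i tokens erased, i = 1..d  ->  d + 1 filter calls.
suffix_checks = d + 1

# Insertion mode (single inserted block): check every contiguous block
# erasure of length 1..d  ->  roughly n * d filter calls.
insertion_checks = sum(n - length + 1 for length in range(1, d + 1))

# Infusion mode: check every subset of up to d erased tokens
# -> sum of binomial coefficients C(n, i).
infusion_checks = sum(comb(n, i) for i in range(d + 1))

print(suffix_checks)     # 21
print(insertion_checks)  # 1810
print(infusion_checks)   # ~7e20: combinatorial blow-up
```

The combinatorial growth in infusion mode is what drives the computational-cost critique, while suffix mode stays cheap.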

## Tags
#LLMSafety #AI #AdversarialAttacks #EraseAndCheck #DistilBERT #ComputationalCost