SecAlign: Defending Against Prompt Injection with Preference Optimization

1Meta, FAIR 2University of California, Berkeley

SecAlign formulates prompt injection defense as preference optimization, reducing the success rate of the strongest tested attacks to around 0% while preserving model utility.

Abstract

Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications, which perform text-based tasks by utilizing their advanced language capabilities. However, as LLMs have improved, so have the attacks against them. Prompt injection attack is listed as the #1 threat to LLM-integrated applications, where an LLM input contains a trusted prompt (instruction) and an untrusted data (user documents, web retrieval, results from API calls, etc) with potentially injected instructions (Ignore previous instructions and …) to arbitrarily manipulate the LLM.

To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to around 0%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, our defended models are still practical with similar utility to the one before our defensive training.

Background

LLM-Integrated Applications

Design an instruction (prompt) to serve users by processing their data via an LLM.

• Prompt: Trusted (from the developer)

• LLM: Trusted (from the developer or an API provider)

• Data: Untrusted (from a random user)

Prompt Injection Attack

The adversary injects an instruction in data to override the prompted instruction.

Listed as the #1 security threat for LLM-integrated applications by OWASP.

Example: A university wants to evaluate applicants’ CV with an LLM.

Secure Alignment (SecAlign)

(1) Get an SFT model by downloading a public instruct model (recommended) or SFTing a base model.

(2) Save the model’s delimiters.

(3) Find a public instruction tuning dataset.

(4) Construct the preference dataset. For each sample s in the instruction tuning dataset

• Sample another random sample s’ for simulating injection

LLM input: prompt-injected s with the instruction in s’

Desirable LLM output: the labelled output of s

Undesirable LLM output: the labelled output of s’

(5) Preference-optimize the SFT model on the preference dataset. SecAlign uses Direct Preference Optimization loss.

SecAlign trains on simulated injected inputs, labelled with both desirable responses and undesirable responses, leading to much larger probability gap between outputting them, and thus better robustness against prompt injection attacks.

Experiments

SecAlign Llama3-8B-Instruct maintains general-purpose utility (AlpacaEval2 WinRate).

SecAlign Llama3-8B-Instruct enjoys 1% attack success rates under the strongest tested optimization-based prompt injections.

SecAlign significantly surpasses existing prompting-based and fine-tuning-based defenses.

BibTeX

@article{chen2024aligning,
  title={SecAlign: Defending Against Prompt Injection with Preference Optimization},
  author={Chen, Sizhe and Zharmagambetov, Arman and Mahloujifar, Saeed and Chaudhuri, Kamalika and Wagner, David and Guo, Chuan},
  journal={arXiv preprint arXiv:2410.05451},
  year={2024}
}