Sizhe Chen（陈思哲）

Biography

Hi! I am a CS Ph.D. student at UC Berkeley, where I am fortunately advised by Prof. David Wagner in Berkeley AI Research (BAIR). I am working closely with Chuan Guo (Meta FAIR), Nicholas Carlini (Anthropic), and Chawin Sitawari (Google DeepMind), supported by Meta- and Google-BAIR Commons. I got my M.Eng. (National Scholarship) and B.Eng. (Best Bachelor’s Thesis) from Shanghai Jiao Tong University advised by Prof. Xiaolin Huang, and also with Prof. Cihang Xie.

My research focuses on AI security in real-world applications. I am currently working on prompt injection defense (Meta-SecAlign, SecAlign, DefensiveToken, StruQ, Jatmo). Prompt injection is listed as the #1 threat to LLM-integrated application systems (like agents), which access untrusted external data (document, webpage, API return, etc) containing potential injected instructions (Ignore previous instructions and …) to perform complex tasks. Prompt injection has shown real-world threat (e.g., data leakage) to industry products that integrate LLMs as a system component, such as Google Bard, Slack AI, OpenAI Operator, and Anthropic Claude Computer Use. To open up new opportunities for securely using LLMs in systems, I aim to develop principled and general defenses against strong prompt injections, with a minimal impact on the system’s utility.

I am fortunate to have mentored or worked with lots of talented students: Yizhu Wang, Jing Qian, Shutong Wu, Zhixing Ye, Hend Alzahrani, and Zhengbao He. Feel free to drop me an email to connect! I accept approximation on my name’s pronunciation.

Invited Talks

Securing LLMs Against Prompt Injection for Agentic Applications
Google DeepMind: Adversarial Machine Learning Seminar 2025
Duke University: Guest Lecture at Generative AI: Foundations, Applications, and Safety 2025
UC Berkeley: Security Seminar 2024
Hong Kong Baptist University: TMLR Young Scientist Seminar 2024
Shanghai Jiao Tong University: PAMI Group Seminar 2024
On the Learning Preference of Deep Neural Networks
ICLR Oral Track 2023
AI Time Youth Ph.D. Talk 2023
Subspace Adversarial Training
CVPR Oral Track 2022
Adversarial Attacks and Defenses
Northeastern University: Security Seminar 2022

(Prompt Injection) Security of LLMs

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks
Sizhe Chen*, Arman Zharmagambetov, David Wagner, Chuan Guo*

Meta-SecAlign-70B is the first open-source commercial-grade LLM with built-in prompt injection robustness, outperforming gpt-4o and gemini-2.5-flash, the SOTA solutions that are closed-sourced. We show that, surprisingly, training on 19K instruction-following samples leads to substantial security, which generalizes to agentic workflows (tool-calling and web-navigation) with attack success rates <2%. We open-source our training recipe, an improved utility-preserving SecAlign, along with the evaluation code on 8 utility and 6 security benchmarks.
Defending Against Prompt Injection With a Few DefensiveTokens
Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner

DefensiveToken enables a flexible test-time switch between the SOTA utility and almost-SOTA prompt injection robustness. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can append a few (e.g., 5) DefensiveTokens for a security comparable to SOTA training-time defenses. Otherwise, developers can skip DefensiveTokens for a system exactly the same as without the defense.
SecAlign: Defending Against Prompt Injection With Preference Optimization
Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo

SecAlign aims at a prompt-injection-robust LLM that prefers (and thus output) the secure response over the insecure one. For this property, we build a preference dataset, where the “input” contains a prompt injection; the “desirable output” responds to the user instruction; and the “undesirable output” responds to the injected instruction. Preference optimization on this dataset reduces success rates of strong optimization-based prompt injections by a factor of >4 from StruQ.
StruQ: Defending Against Prompt Injection With Structured Queries
Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner

StruQ is a general approach for prompt injection defense by separating the prompt (user instruction) and data into two channels. We design a secure front-end that formats a prompt and data into a special format, and a specially trained LLM that can produce high-quality outputs from these inputs. We augment the instruction tuning dataset with examples that contains a prompt injection, and do SFT on the model to ignore injections in data. StruQ reduces optimization-free attack success rates to <2%.
Jatmo: Prompt Injection Defense by Task-Specific Finetuning
Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner

Jatmo defends against prompt injection by fine-tuning a base LLM on only one task using (data, output) samples. Without seeing any task instruction, the defended LLM has no instruction-following ability that enables it to follow injected instructions. 0% attack success rates are achieved against optimization-free attacks.

Security of Vision Models (Previous SoP)

One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks
Shutong Wu*, Sizhe Chen*, Cihang Xie, Xiaolin Huang

OPS poisons model training by perturbing only one pixel in each image (taking few minutes) and robustly degrades the model accuracy on clean data to almost an untrained counterpart.
Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks
Sizhe Chen, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang

AAA specifically defends against score-based query attacks by maintaining predictions while disrupting gradients, which offers a plug-in solution that preserves accuracy, calibration, and inference speed.
Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet
Sizhe Chen, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang

AoA proposes that transfer attacks should seek for features that are shared across different architectures for common vulnerabilities, which produces DAmageNet, a 50K-sized dataset with >70% error rate on all tested models.
Subspace Adversarial Training
Tao Li, Yingwen Wu, Sizhe Chen, Kun Fang, Xiaolin Huang

Sub-AT approaches catastrophic overfitting and robust overfitting in adversarial training by constraining it in an extracted subspace, which enables 1-step Sub-AT to have a competitive performance vs. standard 10-step AT.

Services

Reviewer: CCS 2024/2025, SaTML 2025/2026, NeurIPS 2023/2025, ICML 2024/2025, ICLR 2023/2024/2025, CVPR 2023/2024/2025, ICCV 2023, ECCV 2022/2024, IEEE TPAMI, Machine Learning, Pattern Recognition
UC Berkeley EECS Student Reviewer: Faculty Hiring Committee 2024, Ph.D. Admission Committee 2024, Equal Access to Application Assistance 2024

Awards

Research Fundings: Meta-BAIR Commons 2024-2026, Google-BAIR Commons 2024-2025, UC Berkeley EECS Departmental Fellowship 2023, NeurIPS 2022 / ICLR 2023 Travel Support
Degree Awards: SJTU Best Bachelor’s Thesis (1%) 2020, SJTU Outstanding Graduate 2022/2023
Scholarship: China National Scholarship (0.2%) 2021/2022, Kwang-Hua Scholarship 2019, Arawana Scholarship 2017

Misc

I practice neatness and minimalism.
I love to sing, attend concerts, play table tennis, play badminton, and ski.
I write blogs (in Chinese yet) about my thoughts and experience.
My Erdös number is 3 due to my collaboration with Chuan Guo.