Sizhe Chen(陈思哲)
Biography
Hi! I am a Computer Science Ph.D. student at UC Berkeley, where I am fortunately advised by Prof. David Wagner in Berkeley AI Research (BAIR). I am working closely with Chuan Guo at Meta FAIR as a visiting researcher, and with Nicholas Carlini at Google DeepMind; the two collaborations are supported by two BAIR Commons from them. I got my M.Eng. (National Scholarship) and B.Eng. (Summa Cum Laude) from Shanghai Jiao Tong University advised by Prof. Xiaolin Huang and also with Prof. Cihang Xie, when I was working on attacks to vision models.
My research focuses on AI security in real-world applications. I am currently working on prompt injection defense (SecAlign, StruQ, Jatmo). Prompt injection attack is listed as the #1 threat to Large Language Model (LLM) Integrated Applications, where a trusted prompt is concatenated to an untrusted data (user documents, web retrieval, results from API calls, etc) as the LLM input. If the data contains malicious instructions (Ignore previous instructions and …), the LLM could be arbitrarily manipulated. To open up new opportunities for safely using LLMs in systems (e.g., as agents), my goal is to design fundamental defenses to secure LLMs against prompt injections.
I am fortunate to have mentored lots of talented students (and some from underrepresentative groups): Jing Qian, Shutong Wu, Yingwen Wu, Zhixing Ye, Hend Alzahrani, and Zhengbao He. Feel free to drop me an email to connect! I accept approximation on my name’s pronunciation.
Invited Talks
- Prompt Injection Defense by Structured Queries and Secure Alignment
UC Berkeley Security Seminar 2024
Hong Kong Baptist University TMLR Young Scientist Seminar 2024
Shanghai Jiao Tong University PAMI Group Seminar 2024 - On the Learning Preference of Deep Neural Networks
ICLR Oral Track 2023
AI Time Youth Ph.D. Talk 2023 - Subspace Adversarial Training
CVPR Oral Track 2022 - Adversarial Attacks and Defenses
Northeastern University Security Seminar 2022
Selected Publications
- Aligning LLMs to Be Robust Against Prompt Injection
Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, Chuan Guo
SecAlign formulates prompt injection defense as the preference optimization. From an SFT dataset, we build our preference dataset, where the “input” contains a benign instruction, a benign data, and an injected instruction; the “desirable output” responds to the benign instruction; and the “undesirable output” responds to the injected instruction. Then, we apply existing alignment techniques to fine-tune an SFT model on our preference dataset. Preserving utility, SecAlign reduces strong optimization-based attack success rate by a factor of >3 from the previous SOTA StruQ. - StruQ: Defending Against Prompt Injection with Structured Queries
Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner
StruQ is a general approach for prompt injection defense by separating the prompt and data into two channels. This system is made of a secure front-end that formats a prompt and data into a special format, and a specially trained LLM that can produce high-quality outputs from these inputs. We augment the SFT dataset with examples that additionally include instructions in data besides in prompt, and do SFT on the model to ignore instructions in data. Preserving utility, StruQ stops all existing (optimization-free) prompt injections to an attack success rate of <2%. - One-Pixel Shortcut: On the Learning Preference of Deep Neural Networks
Shutong Wu*, Sizhe Chen*, Cihang Xie, Xiaolin Huang
OPS perturbs only one pixel in each image to poison model training from the view of shortcut learning. OPS uses a heuristic model-agnostic search to find the pixel: perturbing in-class images at the same position to the same target value that could mostly and stably alter the original images. OPS degrades the model accuracy on clean data to almost an untrained counterpart. The perturbations, for the first time, are crafted within seconds (CIFAR-10) or minutes (ImageNet) and cannot be erased by adversarial training. - Adversarial Attack on Attackers: Post-Process to Mitigate Black-Box Score-Based Attacks
Sizhe Chen, Zhehao Huang, Qinghua Tao, Yingwen Wu, Cihang Xie, Xiaolin Huang
AAA proposes a new direction to especially defend against score-based query attacks by maintaining predictions while disrupting gradients. We note that the efficient and realistic score-based attacks could be easily misled if the model logits are perturbed to create a periodically reverse loss trend. AAA secures WideResNet-28 with 80.59% accuracy under attack, compared to 67.44% from the best prior adversarial training defense. AAA does not hurt the accuracy, calibration, or inference speed, and can be directly plugged into any trained classifiers. - Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet
Sizhe Chen, Zhengbao He, Chengjin Sun, Jie Yang, Xiaolin Huang
AoA follows the proposed principle that transfer attacks should seek for features that are shared across different architectures, which tend to reveal their common vulnerabilities. We note that the attention heatmap (from the model interpretation tool) could be a shared feature, and constrain the attention as our attack loss, which improves the attack transferability by 30%. We apply AoA to generate 50K adversarial samples from the ImageNet validation set to get the DAmageNet, leading to >85% error rate on 13 undefended models and >70% error rate on most defended models. - Subspace Adversarial Training
Tao Li, Yingwen Wu, Sizhe Chen, Kun Fang, Xiaolin Huang
Sub-AT approaches catastrophic overfitting and robust overfitting in adversarial training (AT) by constraining AT in a carefully extracted subspace. Sub-AT saves checkpoints during the regular training and performs SVD on the parameter matrix (each vector is a squeezed checkpoint) to get mutually orthogonal bases of the subspace. Then Sub-AT projects gradients to those bases in the remaining training, i.e., only alters very few independent parameters like the following LoRA. 1-step Sub-AT achieves a competitive performance v.s. standard 10-step AT, with even 40% less computation than standard 1-step AT.
Other Publications
- Jatmo: Prompt Injection Defense by Task-Specific Finetuning
Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner
- Can LLMs Follow Simple Rules?
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner
- Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors
Sizhe Chen, Geng Yuan, Xinwen Cheng, Yifan Gong, Minghai Qin, Yanzhi Wang, Xiaolin Huang
- Query Attack by Multi-Identity Surrogates
Sizhe Chen, Zhehao Huang, Qinghua Tao, Xiaolin Huang
- Measuring the Transferability of $\ell_\infty$ Attacks by the $\ell_2$ Norm
Sizhe Chen, Qinghua Tao, Zhixing Ye, Xiaolin Huang
- Unifying Gradients to Improve Real-World Robustness for Deep Networks
Yingwen Wu, Sizhe Chen, Kun Fang, Xiaolin Huang
- Relevance Attack on Detectors
Sizhe Chen, Fan He, Xiaolin Huang, Kun Zhang
Services
- Reviewer: SaTML 2025, CCS 2024, ICML 2024, NeurIPS 2023, ICLR 2023/2024/2025, CVPR 2023/2024/2025, ICCV 2023, ECCV 2022/2024, IEEE TPAMI, Machine Learning, Pattern Recognition
- UC Berkeley EECS Student Reviewer: Faculty Hiring Committee 2024, Ph.D. Admission Committee 2024, Equal Access to Application Assistance 2024
Awards
- Research Fundings: Meta-BAIR Commons 2024, Google-BAIR Commons 2024, UC Berkeley EECS Departmental Fellowship 2023, NeurIPS 2022 / ICLR 2023 Travel Support
- Degree Awards: SJTU Extraordinary Bachelor’s Thesis (1%, Summa Cum Laude equivalent) 2020, SJTU Outstanding Graduate 2022/2023
- Scholarship: China National Scholarship (0.2%) 2021/2022, Kwang-Hua Scholarship 2019, Arawana Scholarship 2017
Misc
- I love to lift weights, play badminton, play table tennis, write blogs, and attend concerts.
- I directed three 1K-spectator concerts.