A Theoretical Framework for Analyzing (Mis)-Alignment of Large Language Models

Yotam Wolf, Noam Wies, Yoav Levine, Amnon Shashua

21 April, 2023

In our new paper, we propose the framework of large language model (LLM) Behavior Expectation Bounds and with it:

prove that LLMs can always be misaligned via prompting;
show that adversarial prompt lengths required to guarantee misalignment grow with initial LLM alignment;
theoretically explore the dynamics of misaligning conversations between LLMs and adversarial users;
show that RLHF can render an LLM vulnerable to prompting attacks;
prove that the "chatGPT jailbreak" practice of prompting the LLM to mimic toxic personas can be an efficient strategy for misaligning.

Large Language Models (LLMs) pretrain on the internet, where they learn a lot but are also exposed to different types of biases and offensive content. The process of removing the effect of such harmful training examples and ensuring that LLMs are useful and unharmful to human users, is called LLM alignment. While leading alignment methods such as reinforcement learning from human feedback are effective, the alignment of contemporary large language models (LLMs) is still dangerously brittle to adversarial prompting attacks. Our new paper puts forward a theoretical framework that is tailored for analyzing LLM alignment.

Our theoretical framework, called behavior expectation bounds (BEB), is based on viewing the LLM's distribution as a mixture of well-behaved and ill-behaved components that are distinct in their distributions. We quantify this "distinguishability" in statistical terms and find that the distinction between the components allows to insert prompts which are "more typical" of the ill-behaved components than the well-behaved ones, and as a result enhance ill-behaved components' weight in the LLM distribution. We test the efficiency of this adversarial enhancement in various scenarios of prompting and conversing with an LLM, and under assumptions of distinguishability between the ill- and well-behaved components, we make several new theoretical assertions regarding LLM mis-alignment via prompting:

Alignment impossibility - We show that any LLM alignment process which reduces undesired behaviors to a small but nonzero fraction of the probability space is not safe against adversarial prompts.
Conversation length guardrail - We show that by aligning an LLM and limiting the interaction length that users have with it, undesired behaviors may be easier to avoid.
RLHF can make things worse - While tuning methods such as RLHF lower the probability of undesired behaviors, they also sharpen the distinction between desired and undesired behaviors. We show that increased distinction can have the negative effect of rendering the LLM more susceptible to adversarial prompting.
LLMs can resist during a conversation - We show that if a user attempts to misalign an LLM during a conversation, the LLM can restore alignment during its conversation turns.
A misaligned LLM will not realign easily - We show that if an LLM was misaligned, it will remain so for conversation lengths shorter than the misaligning prompt.
Imitating personas can lead to easy alignment "jailbreaking" - We show that it is always possible to prompt a language model into behaving as a certain persona it has captured during pretraining, and further show that this mechanism can be used in order to easily access undesired behaviors.

Example of jailbreaking:

Read more details on our framework, assumptions, and results in our new preprint. We hope that the Behavior Expectation Bounds framework for analyzing LLM alignment may spark a theoretical thrust helping to better understand the important topic of LLM alignment.