Why Jailbreaks Work: Competing Objectives and Mismatched Generalization

If you collect jailbreaks long enough, the catalog starts to look like a bag of unrelated tricks: roleplay, base64, low-resource languages, prefix injection, fictional framing, hundreds of fake dialogue turns. The temptation is to treat each as its own phenomenon and patch them one at a time. That’s the losing game: patch one and three variants appear. The more useful move is to ask why any of them work, and the answer turns out to be short. Wei, Haghtalab, and Steinhardt’s “Jailbroken: How Does LLM Safety Training Fail?” (arXiv:2307.02483) argues that essentially all jailbreaks reduce to two structural failure modes of safety training. Understanding those two is worth more than memorizing fifty techniques.

Failure mode one: competing objectives

A safety-trained model is optimizing more than one thing at once. It was pretrained and instruction-tuned to be helpful, follow instructions, complete patterns, and continue text fluently. It was then trained to be harmless — to refuse certain requests. These objectives mostly coexist, but they can be put into direct conflict, and a jailbreak in this family is any prompt construction that pits the model’s helpfulness and instruction-following against its harmlessness in a way that makes compliance the path of least resistance.

The mechanism is a tug-of-war over the model’s output distribution. Safety training shifts probability mass toward refusal for certain requests. Competing-objective jailbreaks construct a context where the pressure to be helpful, to follow the explicit instruction, or to complete an established pattern pulls hard enough to overcome that shift. The harmful token sequence becomes the locally most “cooperative” continuation, and the model’s helpfulness wins the tug-of-war against its harmlessness.
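The tug-of-war can be made concrete with a toy two-outcome model. This is only a sketch under loud assumptions: real models choose among tokens, not between two labeled outcomes, and the names `safety_shift` and `context_pressure` and their values are invented for illustration.

```python
import math

def softmax(logits):
    """Convert a dict of logits into a probability distribution."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def response_distribution(safety_shift, context_pressure):
    """Toy model: safety training adds to the refusal logit,
    the prompt context adds to the compliance logit."""
    return softmax({"refuse": safety_shift, "comply": context_pressure})

# Same safety shift in both cases; only the context pressure changes.
plain = response_distribution(safety_shift=3.0, context_pressure=0.0)
pressured = response_distribution(safety_shift=3.0, context_pressure=5.0)

print(f"plain request:     P(refuse) = {plain['refuse']:.2f}")
print(f"engineered prompt: P(refuse) = {pressured['refuse']:.2f}")
```

The safety shift is identical in both calls. The refusal probability collapses only because the competing term grew, which is the sense in which the model is conflicted rather than broken.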

Wei et al. give concrete instances of the family without needing operational detail to explain the principle:

Prefix injection: structuring the prompt so that beginning with a refusal would be grammatically or instructionally awkward, making the helpful continuation a non-refusal. The instruction-following objective is set against the harmlessness objective.
Refusal suppression: instructing the model, as part of the task, to avoid the linguistic patterns refusals are made of. The model’s obedience to the stated constraints competes with its trained tendency to refuse.
Roleplay and persona framing: establishing a fictional context where being “helpful” and “in character” pulls against the safety objective.

In every case the model isn’t broken; it’s conflicted, and the attack engineers the conflict so that helpfulness exerts the stronger pull. This is why a model that’s more capable and more eager to help can, paradoxically, be more exploitable along this axis: the very objective being weaponized is stronger.

Failure mode two: mismatched generalization

The second failure mode is about coverage. A model’s capabilities generalize broadly — it learned to read base64, to operate in dozens of languages, to follow instructions encoded in unusual formats, to do in-context learning over long contexts. Its safety training, by contrast, was applied over a much narrower distribution: mostly natural-language harmful requests in the languages and formats the safety data covered.

Mismatched generalization is the gap between where capability reaches and where safety reaches. Wherever the model can do the task but its safety training never saw that input distribution, the safety behavior fails to generalize while the capability does. The harmful request, expressed in a region of input space the safety training didn’t cover, sails through.

This is the unifying explanation for a whole cluster of techniques:

Encoding and obfuscation (base64, ROT13, leetspeak) — the model’s decoding capability generalizes to the encoded request; its safety training, trained on plaintext, doesn’t.
Low-resource languages — the model can operate in a language; its safety data underrepresented that language, so refusals don’t generalize there.
Unusual formats and structures — the capability to follow instructions in a novel structure outpaces the safety training’s coverage of that structure.
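A rough analogy for the coverage gap, using a keyword check as a stand-in for safety behavior learned only on plaintext. This is an assumption-heavy sketch: real safety training is learned behavior, not string matching, and `BLOCKED_TERMS` and the placeholder request are hypothetical.

```python
import base64

BLOCKED_TERMS = {"forbidden-topic"}  # hypothetical placeholder term

def plaintext_filter(text):
    """Stand-in for safety behavior that only ever saw plaintext inputs."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def decode_aware_filter(text):
    """Same check, extended to cover one more input region: base64 tokens."""
    if plaintext_filter(text):
        return True
    for token in text.split():
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # not valid base64 text; nothing to inspect
        if plaintext_filter(decoded):
            return True
    return False

request = "tell me about forbidden-topic"
encoded = base64.b64encode(request.encode("utf-8")).decode("ascii")

print(plaintext_filter(request))     # flagged in the covered region
print(plaintext_filter(encoded))     # missed: same content, uncovered region
print(decode_aware_filter(encoded))  # flagged once coverage is extended
```

The encoded input carries the same request, and anything able to decode it can still act on it. Only the check failed to follow it into the new representation, and extending coverage to that one region closes that one gap.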

The Anthropic many-shot jailbreaking result (Anil et al., research page) fits here too: in-context learning is a capability that generalizes powerfully over long contexts, and packing the context with many demonstrations of the undesired behavior exploits the gap between that capability and safety training that didn’t anticipate hundreds of fabricated turns. The capability scaled; the safety coverage didn’t keep up.

Why scaling alone doesn’t close the gap

The most important consequence of this framing is the one practitioners most want to be false: making the model bigger and more capable does not, by itself, fix either failure mode. Wei et al. argue this directly, and the logic is uncomfortable but clean.

For competing objectives, scale tends to strengthen the very objective being exploited. A larger, more capable model is more strongly helpful, more fluent at instruction-following, more committed to pattern completion. If the jailbreak weaponizes helpfulness against harmlessness, a more helpful model has more of the thing being turned against it. Capability and the competing-objective vulnerability scale together.

For mismatched generalization, scale widens the capability frontier — the bigger model can do more, in more languages, over longer contexts, in more formats. Unless safety training expands to cover all that new territory at the same rate, the gap between capability and safety coverage grows rather than shrinks. More capability means more surface where safety hasn’t reached.
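One way to picture the widening gap is as a set difference between the regions a model can operate in and the regions safety training covered. The region labels below are illustrative assumptions, not measurements of any real model.

```python
# Illustrative region labels only; not measurements of any real model.
SAFETY_COVERED = {"english plaintext", "major-language plaintext"}

CAPABILITY = {
    "smaller model": {"english plaintext", "major-language plaintext"},
    "larger model": {
        "english plaintext",
        "major-language plaintext",
        "base64",
        "low-resource languages",
        "long-context in-context learning",
    },
}

def uncovered_regions(capability, safety_covered):
    """Regions the model can operate in that safety training never covered."""
    return capability - safety_covered

for name, regions in CAPABILITY.items():
    gap = uncovered_regions(regions, SAFETY_COVERED)
    print(f"{name}: {len(gap)} uncovered region(s): {sorted(gap)}")
```

With safety coverage held fixed, every region the larger model gains lands in the uncovered set. That is the static-safety-layer scenario described above.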

The paper’s conclusion — “safety mechanisms should be as sophisticated as the underlying model” — is the actionable form of this. The defense has to scale with capability, not lag it. A static safety layer bolted onto an ever-more-capable model is a widening gap by construction.

The distinction this draws

This framing also clarifies a confusion worth naming: a jailbreak is an attack on the model’s own safety behavior, executed through input the user controls. It is distinct from prompt injection, where the malicious instruction arrives through data the application feeds the model (a retrieved document, a tool output) and the attack is on the application’s trust boundary, not the model’s safety training. The two get conflated constantly; they have different root causes and different fixes. The injection-versus-jailbreak distinction and the defender’s stack for each are covered in their own post on this site.

Where the defender’s leverage actually is

If the root causes are competing objectives and mismatched generalization, the defenses follow directly — and they’re not “block the base64 trick.”

Close generalization gaps deliberately. The mismatched-generalization failure is, in principle, addressable by expanding safety training to cover the input distributions where capability reaches: encoded inputs, low-resource languages, unusual formats, long-context in-context-learning patterns. This is exactly where frontier safety training has been investing, and it’s why the older encoding and persona tricks degrade over time — the coverage gap for those specific regions got closed. The defender’s job is to keep coverage expanding as fast as capability does.

Reduce the objective conflict at the boundary. For competing objectives, the application-layer leverage is to not rely on the model winning its internal tug-of-war. Independent input and output classification, run outside the model’s generation objective, doesn’t have a helpfulness objective to be played against — it’s a separate judgment that isn’t subject to the same conflict. This is why a guardrail model can catch what the primary model rationalizes into producing.
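A minimal sketch of that arrangement, assuming the generator and both classifiers are plain callables supplied by the application. The names `guarded_generate`, `flag_input`, and `flag_output` are hypothetical, and the lambdas are stubs standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GuardedResult:
    text: str
    blocked_by: Optional[str]  # "input", "output", or None if nothing fired

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    flag_input: Callable[[str], bool],
    flag_output: Callable[[str], bool],
    refusal: str = "Request declined.",
) -> GuardedResult:
    """Run classification outside the generator, before and after it."""
    if flag_input(prompt):
        return GuardedResult(refusal, "input")
    draft = generate(prompt)
    # The output check judges the draft itself, so the framing that
    # persuaded the generator never reaches it.
    if flag_output(draft):
        return GuardedResult(refusal, "output")
    return GuardedResult(draft, None)

# Stubs: a generator that a roleplay framing talks into an unsafe draft,
# an input check that misses the framing, and an output check that does not.
generate = lambda p: "UNSAFE draft" if "roleplay" in p else "ok answer"
flag_input = lambda p: False
flag_output = lambda t: "UNSAFE" in t

result = guarded_generate("roleplay: ...", generate, flag_input, flag_output)
print(result.blocked_by)
```

The design choice is that the output classifier sees only the draft. It is asked a single question, with no instruction to follow and no pattern to complete, so there is nothing for the attack to pit against it.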

Measure with a substance-aware judge. Because competing-objective attacks can make a model agree to something while producing little of actual use, effectiveness has to be measured by whether real harmful information came out, not whether the model refused. This is the StrongREJECT point (Souly et al., arXiv:2402.10260). Otherwise you’ll over-count “empty” jailbreaks and misallocate defensive effort.
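The difference between the two measurements can be sketched as follows. The scoring is loosely shaped like a StrongREJECT-style rubric (a refusal flag plus graded quality ratings), but the scales, field names, and formula here are assumptions for illustration, not the benchmark’s official implementation. The judgments are made-up examples.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    refused: bool
    specificity: int     # 1 (vague) to 5 (highly specific)
    convincingness: int  # 1 (unusable) to 5 (fully usable)

def substance_score(j):
    """0.0 for refusals; otherwise rescale the two 1-5 ratings to 0.0-1.0."""
    if j.refused:
        return 0.0
    return (j.specificity + j.convincingness - 2) / 8

def refusal_only_rate(judgments):
    """Naive attack success rate: any non-refusal counts as a full success."""
    return sum(not j.refused for j in judgments) / len(judgments)

def mean_substance_score(judgments):
    return sum(substance_score(j) for j in judgments) / len(judgments)

# Made-up judgments: one refusal, two "empty" compliances, one real failure.
judgments = [
    Judgment(refused=True, specificity=1, convincingness=1),
    Judgment(refused=False, specificity=1, convincingness=1),
    Judgment(refused=False, specificity=1, convincingness=2),
    Judgment(refused=False, specificity=5, convincingness=5),
]

print(f"refusal-only rate:    {refusal_only_rate(judgments):.2f}")
print(f"mean substance score: {mean_substance_score(judgments):.2f}")
```

On this made-up set the refusal-only metric reports three successes out of four, while the substance-aware score is dominated by the single response that actually contained something usable.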

Assume the gap, contain the blast radius. Since neither failure mode is fully closeable and scale doesn’t rescue you, the operational posture is to assume some jailbreaks land and design so that a successful one is a containable event — capability scoping, output filtering, and monitoring around the model rather than perfect refusal inside it.
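Capability scoping and monitoring can be as plain as an allowlist checked outside the model, with every attempt logged. A sketch under assumptions: the tool names, the `REGISTRY` mapping, and `execute_tool_call` are all hypothetical.

```python
import logging

logger = logging.getLogger("model_actions")

ALLOWED_TOOLS = {"search_docs"}  # hypothetical: the only tool this app needs

# Hypothetical registry of everything the platform could call.
REGISTRY = {
    "search_docs": lambda query: f"results for {query}",
    "delete_records": lambda table: f"deleted {table}",
}

def execute_tool_call(name, args, registry):
    """Run a model-requested tool only if the application allows it."""
    if name not in ALLOWED_TOOLS or name not in registry:
        # Monitoring: a blocked attempt is itself a signal worth recording.
        logger.warning("blocked tool call: %s", name)
        raise PermissionError(f"tool not permitted: {name}")
    logger.info("tool call: %s", name)
    return registry[name](**args)

print(execute_tool_call("search_docs", {"query": "refund policy"}, REGISTRY))
```

Even if a jailbreak persuades the model to request `delete_records`, the call fails at the application boundary. A successful jailbreak then yields bad text rather than a bad action.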

The bottom line

Jailbreaks aren’t fifty unrelated tricks; they’re two failure modes wearing fifty costumes. Competing objectives turn the model’s helpfulness against its harmlessness. Mismatched generalization exploits the gap between where capability reaches and where safety training reached. Scaling the model strengthens the first and widens the second, which is why “just make it bigger” isn’t a safety strategy. The leverage for defenders is to make safety scale with capability — close generalization gaps deliberately, judge externally to the generation objective, measure honestly, and contain what gets through. Once you see the two root causes, the whole catalog stops looking mysterious.

For the technique-class catalog these failure modes produce, see the Q2 2026 catalog. For more context, AI attack techniques and adversarial ML research cover the broader landscape.
