Tag
#jailbreaks
6 posts tagged jailbreaks.
- analysis
Why Jailbreaks Work: Competing Objectives and Mismatched Generalization
Jailbreaks aren't a grab-bag of tricks — they exploit two structural failure modes of safety training. Understanding competing objectives and mismatched generalization explains why scaling alone won't fix them, and where the defender's leverage actually is.
- red-team
ArtPrompt Post-Mortem: Why ASCII-Art Bypasses Worked
A defender-vs-attacker walkthrough of the ArtPrompt ASCII-art jailbreak. Where it slipped past safety training, which model families patched it and how, and the encoding-class variants still landing in 2026.
- red-team
Multi-Turn Role-Play Attacks: Why One Safe Turn Gets Unsafe
Crescendo, Many-Shot, and gradual context manipulation. How multi-turn jailbreaks evade single-turn classifiers, what's still landing in 2026, and where the defenses are honestly weak.
- red-team
Multimodal jailbreaks: image and audio attack surfaces in 2026
Vision and audio inputs are a separate attack channel from text. A practitioner survey of multimodal jailbreaks that still land in 2026 — typographic prompts, perturbed images, audio steganography — and what defenders are actually doing about them.
- red-team
System prompt extraction: the techniques that still leak in 2026
A red-team walkthrough of how system prompts get exfiltrated from production LLM apps — direct extraction, indirect inference, behavioral fingerprinting — and what actually keeps them hidden.
- red-team
Jailbreak Technique Catalog: Working as of 2026 Q2
Which jailbreak technique classes still work against current production LLMs, what's been hardened, and the cost-of-attack trend. Indexed for practitioners.