

AI x Cyber Reading Group
Automated Jailbreaking and the Cat-and-Mouse Game of AI Security
Modern AI systems rely on safety filters to block harmful outputs. But how robust are those filters when faced with a determined attacker?
In this session, we’ll explore a recent paper from UK AISI introducing Boundary Point Jailbreaking (BPJ), a method for automatically discovering prompts that bypass safety guardrails even when the attacker has only black-box access (i.e., they can’t see anything about the model’s internals, only whether a response is blocked or allowed).
We’ll cover:
What “black-box” AI access means
How the proposed automated jailbreak discovery works at a high level
Why boundary-probing strategies are effective
What this might imply for AI safety, governance, and adversarial dynamics
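As a rough intuition for the black-box setting and boundary probing ahead of the session, here is a toy sketch. Everything in it (the stand-in filter, the prefix-mixing trick, the function names) is made up for illustration and is not the paper’s actual method:

```python
# Toy illustration of black-box boundary probing.
# The attacker sees only a blocked/allowed signal, never model internals.

def safety_filter(prompt: str) -> bool:
    """Stand-in for a deployed safety filter (hypothetical):
    blocks any prompt containing the word 'forbidden'."""
    return "forbidden" in prompt.lower()

def is_blocked(prompt: str) -> bool:
    # The only observation available to a black-box attacker.
    return safety_filter(prompt)

def probe_boundary(blocked_prompt: str, allowed_prompt: str, steps: int = 20):
    """Bisect between a known-blocked and a known-allowed prompt
    (by mixing character prefixes) to locate where the filter's
    decision flips -- a toy version of probing the decision boundary."""
    lo, hi = 0.0, 1.0  # mixing weight: 0 -> fully allowed, 1 -> fully blocked
    for _ in range(steps):
        mid = (lo + hi) / 2
        cut = int(len(blocked_prompt) * mid)
        candidate = blocked_prompt[:cut] + allowed_prompt[cut:]
        if is_blocked(candidate):
            hi = mid  # still blocked: boundary lies below
        else:
            lo = mid  # allowed: boundary lies above
    return lo, hi  # tight bracket around the filter's decision boundary
```

Real attacks are far more sophisticated, but the core dynamic is the same: by repeatedly querying a yes/no oracle, an attacker can map out where the filter flips and search for inputs that sit just on the allowed side.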
No deep technical AI background or cybersecurity experience is required. We’ll focus on building intuition, drawing parallels to traditional cybersecurity, and then discussing the broader implications.
Expect a short (~15 min) walkthrough followed by open conversation, though we might split into breakout rooms depending on the number of attendees.
Link to paper: https://arxiv.org/abs/2602.15001