Presented by
BlueDot Impact
We’re building the workforce needed to safely navigate AGI.
Contact: [email protected]

AI x Cyber Reading Group

Zoom
About Event

We will hear a presentation from @Prashant Kulkarni on his recent paper, “Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection”.

The paper uses mechanistic interpretability to detect early signals of multi-turn prompt injection attacks on LLMs. Rather than judging each message in isolation, it proposes monitoring how an LLM’s internal activations shift over the course of a conversation. Its core idea is “adversarial restlessness”: attacker conversations tend to produce distinctive drift patterns as they move from benign setup to pivoting and escalation. It is particularly relevant for this reading group because it is a direct follow-up to our recent conversations on jailbreaking, prompt injection, and defenses against them.
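For attendees who want a concrete picture before the session, here is a minimal sketch of the general idea of tracking activation drift across turns. It is not the paper's method: it assumes each turn has already been summarized as one activation vector (for example, a mean-pooled hidden state), and the function names, the cosine-distance measure, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return 1.0 - dot / (norm_u * norm_v)

def turn_drifts(turn_vectors):
    """Drift between each pair of consecutive turns (hypothetical measure)."""
    return [cosine_distance(a, b) for a, b in zip(turn_vectors, turn_vectors[1:])]

def flag_conversation(turn_vectors, threshold=0.5):
    """Flag a conversation if any single turn-to-turn drift exceeds the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    drifts = turn_drifts(turn_vectors)
    return bool(drifts) and max(drifts) > threshold

# Toy 2-D "activations": one conversation stays put, the other pivots sharply.
steady = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
shifting = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

steady_flag = flag_conversation(steady)      # False: consecutive turns stay aligned
shifting_flag = flag_conversation(shifting)  # True: the last turn pivots away
```

In this toy setup only the second conversation is flagged, because its final turn points in a very different direction from the one before it. A real detector would work on high-dimensional activations and, as the paper's title suggests, would rely on learned probes rather than a fixed distance threshold.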

Link to paper: https://arxiv.org/abs/2604.28129
