

AI x Cyber Reading Group
We will hear a presentation from @Prashant Kulkarni on his very recent paper: Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
This paper uses mechanistic interpretability as a vehicle to detect early signals of multi-turn prompt injection attacks on LLMs. In particular, it proposes detecting these attacks by monitoring how an LLM’s internal activations shift over the course of a conversation, rather than judging each message in isolation. Its core idea is “adversarial restlessness”: attacker conversations tend to produce distinctive drift patterns in the activations as they move from benign setup to pivoting and escalation. For this reading group, it’s particularly interesting because it's a direct follow-up to our recent conversations on jailbreaking, prompt injection, and defenses against them.
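To make the core idea concrete before the talk: a toy version of "drift monitoring" can be sketched as measuring the cosine distance between pooled hidden-state vectors from consecutive turns and flagging conversations whose cumulative drift is unusually high. This is an illustrative sketch on simulated activation vectors, not the paper's actual probing method; the function names, the pooling assumption, and the threshold are all made up for the example.

```python
import numpy as np

def turn_drift(activations):
    """Cosine distance between consecutive per-turn activation vectors.

    `activations` is a (num_turns, hidden_dim) array, e.g. a pooled
    hidden state from one transformer layer per conversation turn.
    """
    a = np.asarray(activations, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    # Drift at turn t is 1 - cos(h_t, h_{t-1}); stable conversations stay low.
    return 1.0 - np.sum(a[1:] * a[:-1], axis=1)

def flag_conversation(activations, threshold=0.1):
    """Flag a conversation whose cumulative drift exceeds a (made-up) threshold."""
    return float(np.sum(turn_drift(activations))) > threshold

rng = np.random.default_rng(0)
base = rng.normal(size=64)

# Simulated benign conversation: activations hover near one point.
benign = np.stack([base + 0.01 * rng.normal(size=64) for _ in range(5)])

# Simulated attacker trajectory: activations pivot steadily toward a new direction.
target = rng.normal(size=64)
attack = np.stack([(1 - t / 4) * base + (t / 4) * target for t in range(5)])

print(flag_conversation(benign))  # low cumulative drift, should not flag
print(flag_conversation(attack))  # large cumulative drift, should flag
```

The per-message view would score each of the attacker's turns as harmless; only the trajectory across turns separates the two conversations, which is the intuition behind the paper's approach.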
Link to paper: https://arxiv.org/abs/2604.28129