

AI x Cyber Reading Group
We will hear a presentation from @Prashant Kulkarni on his very recent paper: Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
This paper uses mechanistic interpretability as a vehicle to detect early signals of multi-turn prompt injection attacks on LLMs. In particular, it proposes detecting these attacks by monitoring how an LLM’s internal activations shift over the course of a conversation, rather than judging each message in isolation. Its core idea is “adversarial restlessness”: attacker conversations tend to produce distinctive drift patterns in the activations as they move from benign setup to pivoting and escalation. For this reading group, it’s particularly interesting because it's a direct follow-up to our recent conversations on jailbreaking, prompt injection, and defenses against them.
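To make the core idea concrete before the talk: a toy version of "drift monitoring" can be sketched as measuring the cosine distance between pooled hidden-state vectors from consecutive turns and flagging conversations whose cumulative drift is unusually high. This is an illustrative sketch on simulated activation vectors, not the paper's actual probing method; the function names, the pooling assumption, and the threshold are all made up for the example.

```python
import numpy as np

def turn_drift(activations):
    """Cosine distance between consecutive per-turn activation vectors.

    `activations` is a (num_turns, hidden_dim) array, e.g. a pooled
    hidden state from one transformer layer per conversation turn.
    """
    a = np.asarray(activations, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    # Drift at turn t is 1 - cos(h_t, h_{t-1}); stable conversations stay low.
    return 1.0 - np.sum(a[1:] * a[:-1], axis=1)

def flag_conversation(activations, threshold=0.1):
    """Flag a conversation whose cumulative drift exceeds a (made-up) threshold."""
    return float(np.sum(turn_drift(activations))) > threshold

rng = np.random.default_rng(0)
base = rng.normal(size=64)

# Simulated benign conversation: activations hover near one point.
benign = np.stack([base + 0.01 * rng.normal(size=64) for _ in range(5)])

# Simulated attacker trajectory: activations pivot steadily toward a new direction.
target = rng.normal(size=64)
attack = np.stack([(1 - t / 4) * base + (t / 4) * target for t in range(5)])

print(flag_conversation(benign))  # low cumulative drift, should not flag
print(flag_conversation(attack))  # large cumulative drift, should flag
```

The per-message view would score each of the attacker's turns as harmless; only the trajectory across turns separates the two conversations, which is the intuition behind the paper's approach.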
Link to paper: https://arxiv.org/abs/2604.28129