

AI x Cyber Reading Group
This week we are discussing a very recent paper from UK AISI on measuring how far AI agents can go in realistic, multi-step cyber attack scenarios. Instead of toy tasks, the paper drops models into simulated enterprise and ICS environments and evaluates how well they can execute long attack chains end-to-end. The results show clear progress with scaling and newer models, but also highlight major limitations: performance is still partial, highly dependent on token budgets, and measured in environments without active defenses or defenders. In other words, it’s a great step toward realism, but still far from representing real-world operations. As you read, it’s worth thinking about which gaps would need to close before these capabilities become operationally meaningful. Are we "there" yet? If not, what would it take, and how would we know?
Link to paper: https://www.aisi.gov.uk/research/measuring-ai-agents-progress-on-multi-step-cyber-attack-scenarios