

What evaluation frameworks exist for AI, and what's their rationale?
This session is part of the How to Think about Tech? The Case of 'AI Safety' study group, initiated by some of the fellowship candidates of the 2025/2026 Introduction to Political Technology course. It is open to faculty and fellowship candidates only.
Evaluations (or "evals") have become central to AI governance: companies use them to justify model releases, regulators require third-party assessments, and researchers design benchmarks for dangerous capabilities.
But who decides what gets measured? Whose values are embedded in evaluation design? And how do evals function in practice?
This study group session examines AI evaluation as both a technical practice and a political process, analyzing how "safety" gets operationalized through benchmarks, who holds power in defining risk, and what systematically gets excluded from evaluation frameworks.
---
Key Discussion Questions:
Who decides what capabilities or risks to evaluate?
How do evaluation frameworks shape what gets built and deployed?
What's the relationship between evals and actual safety?
Can we evaluate "societal impact"? What would that require?
How do evals function in governance?
What gets optimized when benchmarks become targets?
What's the gap between evaluation results and deployment decisions?
Recommended Readings:
Readings are also linked in the study group doc
Introduction & Context
Technical Approaches (for reference)
Anthropic's Responsible Scaling Policy v1.1 (September 2024)
METR (formerly ARC Evals) - Model Evaluation and Threat Research
Critical Perspectives & Limitations
The AI Safety Washing Problem - IEEE Spectrum (January 2025)
AI companies' safety practices fail to meet global standards, study shows - Reuters (November 2024)
International AI Safety Report (arXiv:2501.17805, January 2025) - Skim the executive summary on the current state of evals
Listen/Watch
Alternative Approaches & Sociotechnical Evaluation
For Technical Deep Dive (Optional)
Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems (November 2025) - Proposes the CLEAR framework: Cost, Latency, Efficacy, Assurance, Reliability
OpenAI's GDPval: Measuring model performance on real-world tasks (June 2025) - Evaluation across 44 occupations on economically valuable tasks
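
As a purely illustrative sketch for discussion (not the CLEAR paper's actual scoring method), the snippet below shows one way a multi-dimensional evaluation like the one named above could be collapsed into a single score. The five dimensions follow the CLEAR entry in the reading list; the normalization, weights, and weighted-average aggregation rule are assumptions introduced here, and they are exactly the kind of design choices the discussion questions ask about (who sets the weights, and what trade-offs a single number hides).

```python
# Hypothetical illustration only: a toy aggregation over the five CLEAR-style
# dimensions (Cost, Latency, Efficacy, Assurance, Reliability). The scales,
# weights, and aggregation rule are assumptions for discussion, not the
# framework's published scoring method.

from dataclasses import dataclass


@dataclass
class EvalResult:
    cost: float         # normalized 0-1; lower raw cost -> higher score
    latency: float      # normalized 0-1; lower raw latency -> higher score
    efficacy: float     # task success rate, 0-1
    assurance: float    # e.g. compliance rate on a red-team suite, 0-1
    reliability: float  # e.g. consistency across repeated runs, 0-1


def aggregate(result: EvalResult, weights: dict[str, float]) -> float:
    """Weighted average across dimensions; collapsing to one number
    hides trade-offs between, say, efficacy and assurance."""
    total_weight = sum(weights.values())
    score = sum(getattr(result, dim) * w for dim, w in weights.items())
    return score / total_weight


if __name__ == "__main__":
    example = EvalResult(cost=0.7, latency=0.8, efficacy=0.9,
                         assurance=0.6, reliability=0.85)
    # Who chooses these weights is itself a governance question.
    weights = {"cost": 1, "latency": 1, "efficacy": 2,
               "assurance": 2, "reliability": 1}
    print(f"Aggregate score: {aggregate(example, weights):.3f}")
```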