NeurIPS Social: Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact
Agentic AI systems – LLM-driven agents that plan, use tools, and act autonomously – are evolving rapidly, but evaluating them remains one of the hardest open problems in AI. Traditional benchmarks fail to capture long-horizon reasoning, shifting goals, or real-world impact. This event brings together researchers and practitioners to discuss how we can meaningfully assess success, alignment, and reliability in autonomous agents.
Activities
Lightning Talks – Fast, high-signal talks on emerging evaluation methods, from goal success metrics to chain-of-thought coherence and agent reliability.
Expert Panel – A debate among leading researchers and founders on how to measure alignment, simulate real-world settings, and balance research rigor with practical metrics.
Networking Mixer – Informal group discussions for participants to connect around shared interests in safety, benchmarks, and deployment evaluation.