
NeurIPS Social: Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact

Hosted by Tatia Tsmindashvili & Rapael Kalandadze
About Event

Agentic AI systems (LLM-driven agents that plan, use tools, and act autonomously) are rapidly evolving, but evaluating them remains one of the hardest problems in AI. Traditional benchmarks fail to capture long-horizon reasoning, shifting goals, or real-world impact. This event brings together researchers and practitioners to discuss how we can meaningfully assess success, alignment, and reliability in autonomous agents.

Activities

  • Lightning Talks – Fast, high-signal talks on emerging evaluation methods, from goal success metrics to chain-of-thought coherence and agent reliability.

  • Expert Panel – A discussion among leading researchers and founders on how to measure alignment, simulate real-world settings, and balance research rigor with practical metrics.

  • Networking Mixer – Informal group discussions for participants to connect around shared interests in safety, benchmarks, and deployment evaluation.

Location
San Diego Convention Center
111 Harbor Dr, San Diego, CA 92101, USA