

Daytona AI Researchers - Stanford, April 2026
Beyond Episodes: Infrastructure, Evaluation, and Benchmarking for Long-Running Agents
For thirty years, RL has been built on a simple premise: episodes are brief, state is cheap, and you can always start over. Today's long-running agents violate all three – they run for days, accumulate irreplaceable environment state, and branch across speculative decision trees. The tooling we inherited wasn't built for this.
On Wednesday, April 29, Daytona and FounderCoHo are again co-hosting an exclusive, high-signal evening for researchers at Stanford University, exploring what changes when we take long-horizon, stateful agents seriously: from the infrastructure that makes them possible to the evaluation frameworks that make them trustworthy.
Agenda
🕒 5:30 pm – 5:35 pm
Welcome and Opening Remarks
🎤 Pramanya Guda, Community Ambassador - Pacer at Daytona
🕒 5:35 pm – 5:50 pm
Talk "Today's Agents Don't Live In Episodes"
🎤 Muhammad Annas Hashmi, DevRel at Daytona
Outline:
The 'episode' (short, stateless, resettable) has been RL's foundational abstraction since Atari. It underpins the Gym API, GRPO, PPO, and the conventional sandbox lifecycle. Today's agents no longer fit it. Tasks span days; the environment state at hour 18 of an agent session (warm caches, installed dependencies, live processes, open sockets, a dirty git tree) is worth hours of wall clock to reproduce.
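As a minimal sketch of the episodic abstraction the outline describes, here is the classic reset/step loop. The environment class is a toy stand-in (not a real Gym environment): every trajectory begins with reset(), and whatever state accumulated during the run is thrown away.

```python
# Toy illustration of the classic episodic RL loop: every trajectory
# begins with reset() and the environment's accumulated state is
# discarded afterwards. ToyEnv is a hypothetical stand-in.

class ToyEnv:
    """A trivial environment: the episode ends after max_steps steps."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0            # all accumulated state is thrown away
        return {"t": self.t}  # initial observation

    def step(self, action):
        self.t += 1
        obs = {"t": self.t}
        reward = 1.0 if action == "good" else 0.0
        done = self.t >= self.max_steps
        return obs, reward, done

def run_episode(env, policy):
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

env = ToyEnv()
returns = [run_episode(env, lambda obs: "good") for _ in range(3)]
print(returns)  # each episode starts from scratch: [5.0, 5.0, 5.0]
```

The key property, and the one long-running agents break, is that reset() is assumed to be cheap: nothing of value survives between episodes.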
Three things are scaling simultaneously. Rollout horizon: seconds -> days. Env state: disposable between episodes -> first-class learning substrate. Branching: absent in modern LLM-RL -> speculative fork trees. Each stresses the inherited toolkit in a different way, and all three are gated on the same missing primitives: VMs you can fork cheaply, pause without killing processes, snapshot mid-run, and resume hours later.
This talk walks through what opens up when those primitives become available, with a live demo of long-horizon, sessionful rollouts, mid-trajectory forking, and cross-calendar-time training. The research questions that follow (long-horizon benchmarks, speculative RL algorithms, event-driven training, to name a few) are where the next wave of agent RL gets built.
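The fork/snapshot/resume primitives named above can be sketched with a hypothetical in-memory sandbox. The class and method names here (Sandbox, snapshot, fork, resume) are illustrative assumptions, not Daytona's actual SDK, and state is modeled as a plain dict where a real system would checkpoint a full VM (filesystem, memory, processes):

```python
# Hypothetical sketch of fork/snapshot/resume primitives for a
# long-running sandbox. Not a real SDK: state is a plain dict.

import copy

class Sandbox:
    def __init__(self, state=None):
        self.state = state if state is not None else {"hours_run": 0}

    def run(self, hours):
        self.state["hours_run"] += hours  # accumulate expensive state

    def snapshot(self):
        # Capture the sandbox mid-run without killing it.
        return copy.deepcopy(self.state)

    @classmethod
    def resume(cls, snap):
        # Rebuild a sandbox from a snapshot taken hours (or days) earlier.
        return cls(state=copy.deepcopy(snap))

    def fork(self, n):
        # Cheap speculative branching: n children share the parent's history.
        return [Sandbox.resume(self.snapshot()) for _ in range(n)]

parent = Sandbox()
parent.run(hours=18)       # 18 hours of accumulated environment state
branches = parent.fork(3)  # explore three speculative continuations
branches[0].run(hours=2)   # one branch diverges from its siblings
print(parent.state["hours_run"], [b.state["hours_run"] for b in branches])
# 18 [20, 18, 18]
```

The point of the sketch is the cost model: forking copies a snapshot instead of replaying 18 hours of wall clock, which is what makes speculative fork trees feasible at all.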
🕒 5:50 pm – 6:05 pm
Talk "Closing the Visibility Gap: Lessons from Safety Critical Agentic Systems"
🎤 Vivek Pandit, Frontier AI Lead at Turing
Outline:
AI agents are moving from demos to production, but their success depends on how well we can evaluate, benchmark, and trust them in high-stakes workflows. This talk explores why traditional software metrics and static benchmarks fall short for agentic systems, especially when agents must reason, plan, call tools, recover from failure, and operate over long horizons. I'll argue for evaluation frameworks that treat execution traces, reasoning trajectories, and tool interactions as first-class signals, alongside outcome-based metrics such as task success, pass rates, coverage, and behavioral robustness.
To ground these ideas, the talk draws from chip design verification, where over 60% of development time is spent validating design intent against complex specifications. Verification is not just a tooling problem but a reasoning problem, making it a strong testbed for agent evaluation. I'll share lessons from building agents that interoperate with EDA toolchains, coordinate across stages like mental-model formation, test planning, testbench generation, and run-and-debug, and use auto-correction loops to safely adapt from tool feedback. The broader lesson is that better observability and domain-aware benchmarking are essential for deploying reliable agents in production.
🕒 6:05 pm – 6:20 pm
Talk "Economics of Post Training"
🎤 Jay Ram, Co-founder & CEO at HUD (YC W25)
Outline:
This talk covers the economics and supply chain of RL environments, as well as the future data collection needed to scale post-training.
🕒 6:20 pm – 6:35 pm
Talk "Building Production RL Training Pipelines with Scalable Sandboxes for Agent Execution"
🎤 Andy Lyu, Co-Founder & CTO at Osmosis (YC W25)
Outline:
Osmosis is building an RL platform that enables developers to easily fine-tune open-source models that can outperform foundation models. A core RL infrastructure challenge is container orchestration: spinning up and terminating thousands of containers rapidly. We discuss how we designed a production-grade RL pipeline to train Qwen3.5 MoE models, specifically covering our usage of FP8 quantization, LoRA RL, and Daytona sandboxes.
🕒 6:35 pm – 6:50 pm
Talk "Automating Benchmark Design"
🎤 Zhengyang (Jason) Qi, Research Scientist at Snorkel AI
Outline:
The talk will cover how we actively design evaluations through iterative rollouts, and how Daytona integrates with this workflow to support our Terminal Bench and Harbor rollouts.
🕒 6:50 pm – 8:30 pm
Networking
With food and beverages
________________________
About the event
An engaging meetup designed for AI researchers to connect, share ideas, and explore the latest advancements in artificial intelligence. The event features informal networking, short talks, and discussions on current research trends, fostering collaboration and knowledge exchange within the AI community.