

Daytona & Sentry AI Builders - SF, December 2025
An event dedicated to exploring all things AI Engineering!
Agenda
🕒 5:30 pm - 5:35 pm
Welcome and Opening Remarks
🎤 Mert Gulsun, Pacer - Community Ambassador at Daytona
🕒 5:35 pm - 5:50 pm
"Infrastructure for Autonomous Coding: Sandboxes, Speed, and Safety"
🎤 Mislav Ivanda, Developer Experience Engineer at Daytona
Outline:
As AI agents generate and run code continuously, developers need an environment that is fast, safe, and infinitely scalable. Daytona meets this with sub-90ms sandbox creation, a secure isolated runtime, and massive parallelization across thousands or millions of workloads. Its stateful, customizable sandboxes powered by Snapshots, along with globally deployed low-latency regions, provide both speed and reliability. With an agent-first API and a comprehensive SDK for process execution, file systems, Git, and PTY access, Daytona forms the core infrastructure for autonomous software development. In this talk, we show how these features make running AI-generated code effortless, controlled, and secure.
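For a concrete picture of the agent-first API, here is a minimal sketch of creating a sandbox and executing a snippet of AI-generated code. It assumes the Daytona Python SDK with an API key in the environment; exact package and method names vary between SDK versions, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch, assuming the Daytona Python SDK ("daytona" on PyPI)
# with an API key exported as DAYTONA_API_KEY. Method names may differ
# across SDK versions; check the official docs before relying on this.
from daytona import Daytona

daytona = Daytona()  # reads the API key and target region from the env

sandbox = daytona.create()  # fresh, isolated, stateful sandbox
try:
    # Run untrusted, AI-generated code inside the sandbox, not locally
    response = sandbox.process.code_run("print(sum(range(10)))")
    print(response.result)  # -> 45
finally:
    daytona.delete(sandbox)  # tear down so runs stay cheap to parallelize
```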
🕒 5:50 pm - 6:10 pm
"Bad DX Becomes Worse AX: Building Toward Self-Optimizing Software"
🎤 Indragie Karunaratne, Director of Engineering at Sentry
Outline:
Self-optimizing software has been a fantasy in performance engineering for decades, but the missing piece wasn't data: it was a user that could actually act on it. We already know how to measure what's slow: profiles, traces, and profile-guided optimization have existed forever, wrapped in tools with notoriously bad DX. In an agentic world, that bad DX turns into even worse AX: LLMs struggle with complex tooling setup and data firehoses that eat tokens. This talk is about building simplified abstractions that present runtime context to agents as clear statistical guidance instead of raw telemetry. If we treat agents as first-class users of debugging tools, we get a realistic path to software that continuously measures, reasons, and rewrites itself.
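To make "statistical guidance instead of raw telemetry" concrete, here is a hypothetical sketch of the kind of abstraction the talk argues for: collapsing raw stack-sample telemetry into a short, token-cheap summary an agent can act on. The function and data shapes are illustrative only, not a Sentry API.

```python
# Hypothetical sketch: condense raw profiler stack samples into a few
# lines of statistical guidance for an agent, instead of handing it the
# telemetry firehose. Names and shapes are illustrative, not Sentry APIs.
from collections import Counter

def summarize_profile(samples: list[list[str]], top_n: int = 3) -> str:
    """Turn stack-trace samples into a short, token-cheap summary."""
    leaf_counts = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaf_counts.values())
    lines = [
        f"{func}: {count / total:.0%} of sampled time"
        for func, count in leaf_counts.most_common(top_n)
    ]
    return "Hottest functions:\n" + "\n".join(lines)

samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "render"],
]
print(summarize_profile(samples))
# Hottest functions:
# parse_json: 67% of sampled time
# render: 33% of sampled time
```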
🕒 6:10 pm - 6:30 pm
"OpenThoughts: Data Recipes for Reasoning Models"
🎤 Etash Guha, PhD at Stanford University
Outline:
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improved the dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are publicly available.
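For reference, the stated gains imply the following baseline scores (derived here simply by subtracting the percentage-point improvements from the reported OpenThoughts3-7B numbers):

Benchmark                   OpenThoughts3-7B   Gain       Implied DeepSeek-R1-Distill-Qwen-7B
AIME 2025                   53%                +15.3 pp   ~37.7%
LiveCodeBench 06/24-01/25   51%                +17.2 pp   ~33.8%
GPQA Diamond                54%                +20.5 pp   ~33.5%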
🕒 6:30 pm - 6:50 pm
"Terminal-Bench and Harbor: A benchmark for CLI-agents and a framework for performing agent rollouts at scale"
🎤 Alex Shaw, Founding Member of the Technical Staff at Laude Institute
Outline:
2025 has been the year of coding agents. From Claude Code to Cursor to Manus, agents are being equipped with terminals. But how well can agents actually accomplish tasks using a terminal? Terminal-Bench is a benchmark created by Laude Institute and Stanford to measure agents' abilities to complete tasks by writing and executing code in a terminal. Harbor is a framework for defining and performing rollouts on agent tasks.
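As a rough mental model of what a rollout harness does, here is a hypothetical sketch: the agent issues shell commands against a terminal, the output is fed back, and a task-specific check grades the end state. All names here are illustrative, not the actual Terminal-Bench or Harbor API, and a real harness isolates each rollout in a container.

```python
# Hypothetical rollout loop: present a task, let the agent run shell
# commands, record output, then grade the resulting state. Illustrative
# only; not the Terminal-Bench/Harbor API.
import subprocess

class ScriptedAgent:
    """Stand-in for an LLM agent: replays a fixed command list."""
    def __init__(self, commands):
        self.commands = list(commands)

    def next_command(self, instruction, transcript):
        return self.commands.pop(0) if self.commands else None

def rollout(instruction, agent, check_cmd, max_steps=20):
    transcript = []
    for _ in range(max_steps):
        cmd = agent.next_command(instruction, transcript)
        if cmd is None:  # agent signals completion
            break
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=60)
        transcript.append((cmd, proc.stdout + proc.stderr))
    # Grade by inspecting the end state, as a task's test script would
    return subprocess.run(check_cmd, shell=True).returncode == 0

agent = ScriptedAgent(["echo done > /tmp/result.txt"])
print(rollout("Write 'done' to /tmp/result.txt", agent,
              check_cmd="grep -q done /tmp/result.txt"))  # True
```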
🕒 6:50 pm - 7:10 pm
"Automating Benchmark Design"
🎤 Zhengyang (Jason) Qi, Research Scientist @ Snorkel AI
Outline:
This talk covers how we actively design evaluations through iterative rollouts. I'll also discuss how Daytona integrates with this workflow and supports our Terminal-Bench and Harbor rollouts.
🕒 7:10 pm - 8:30 pm
Networking
With food and beverages
____________________________
About the event
This is a dynamic gathering for AI enthusiasts, innovators, and professionals to collaborate, share ideas, and explore the latest advancements in artificial intelligence. Whether you're building AI products, researching cutting-edge algorithms, or simply passionate about the field, join us to connect, learn, and drive the future of AI forward.