

Serving and Programming LLMs: Agents, Prefill, and Distributed Systems
Talk 1: Serving LLMs at Scale: Prefill, Decode, and the New Distributed Systems Problem
LLM inference is no longer just a GPU problem. At scale, it becomes a distributed systems problem: routing, scheduling, placement, memory movement, backpressure, tail latency, and failure domains.
This talk breaks down why prefill and decode are different workloads, and why separating them changes the architecture of inference fleets. Prefill is bursty and compute-heavy. Decode is stateful and latency-sensitive. KV cache becomes the shared state that makes or breaks utilization, cost, and Time to First Token.
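To make the split concrete, here is a minimal Python sketch that measures Time to First Token (dominated by prefill) separately from inter-token latency (dominated by decode). It assumes a vLLM server exposing its OpenAI-compatible API on localhost:8000; the model name is a placeholder.

```python
import time
from openai import OpenAI

# Assumes a vLLM server started with e.g. `vllm serve <model>`,
# listening on localhost:8000 with the OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.chat.completions.create(
    model="my-model",  # placeholder; match your deployment
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    stream=True,
)

for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = now  # end of prefill: Time to First Token
        token_times.append(now)

ttft = first_token_at - start
# Inter-token latency reflects the decode loop, not prefill.
itl = (token_times[-1] - first_token_at) / max(len(token_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, mean inter-token latency: {itl * 1000:.1f} ms")
```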
We will walk through vLLM, LMCache, cache-aware routing, and disaggregated prefill, then do live tuning to show how small systems choices change real performance.
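As a taste of the cache-aware routing idea, here is a toy sketch, not any particular router's implementation: requests that share a long prompt prefix hash to the same replica, so that replica's KV cache gets reused instead of recomputed. The replica names and the character-level prefix proxy are stand-ins.

```python
import hashlib

# Hypothetical replica pool; in practice this comes from service discovery.
REPLICAS = ["decode-0:8000", "decode-1:8000", "decode-2:8000"]

def route(prompt: str, prefix_chars: int = 512) -> str:
    """Pin requests with a shared prefix (system prompt, few-shot
    examples, document context) to one replica for KV cache reuse."""
    prefix = prompt[:prefix_chars]  # crude proxy for a token-level prefix
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

shared = "SYSTEM: You are a support agent for Acme...\n"
print(route(shared + "USER: reset my password"))
print(route(shared + "USER: cancel my order"))  # same replica, warm cache
```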
You will leave with a concrete mental model for why the world’s best prefill service is really a distributed systems service.
About the Speaker
Khawaja Shams is CEO and co-founder of Momento. Prior to Momento, he was a VP of Engineering at AWS. Before that, he served as manager of data services at the NASA Jet Propulsion Laboratory, where he was responsible for the team driving image processing for Mars Rovers.
Talk 2: Mom, I Wanna Make a Programming Language
Everyone is trying to make agents better at coding. We started from the opposite question: what would it look like to make coding better for agents? Assembly was built for the machine world, where every cycle mattered. Python and TypeScript were built for the human world, where readability mattered most. Now the author is an agent — generation is cheap, ambiguity is expensive — and we need something built for that.
The rubric we landed on is three consumers instead of one:
Agents that write the code.
Agents that read it to extend or debug it.
Humans who have to understand what shipped.
Some features serve all three cheaply, some force a real trade-off, and some look obvious for humans and turn out to be hostile to agents. Almost every design call (syntax, type system, error model, tooling) gets argued against those three.
BAML is an open-source programming language, built in Rust, used in production by companies shipping AI to real users. It has a compiler, a VM, an LSP, and a formatter, and drops into Python, TypeScript, Go, Rust, and the browser so teams can adopt it incrementally without rewriting their stack.
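For a feel of that incremental adoption from the Python side, here is a sketch. It assumes a BAML function ExtractResume returning a Resume type has been defined and the client code generated; the function, type, and field names are hypothetical.

```python
# Hypothetical: assumes .baml files define ExtractResume(resume_text: string) -> Resume
# and that the baml_client package has been generated (e.g. via `baml-cli generate`).
from baml_client import b  # generated client; sync vs. async depends on generator config

def handle_upload(resume_text: str) -> None:
    resume = b.ExtractResume(resume_text)  # returns a typed, validated Resume
    print(resume.name, resume.skills)
```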
Because the human consumer's job is to understand, we'll start from the tooling side and work backward: what tooling has to be possible, and what the language has to look like to make it possible.
A semantic control-flow graph view that lets humans (and agents) understand what code does at a glance.
baml describe: grep with superpowers.
Watch-driven reactivity: imagine debugging without dozens of println!(..) calls.
Then the language-level decisions that make that tooling possible, and what they cost us elsewhere.
Inferred error types as type contracts (Rust-level guarantees with TypeScript-like syntax).
Exhaustive match over union types (see the Python analogue sketched after this list).
Streaming built into the type system.
async/await without function coloring.
Rethinking testing for non-deterministic systems.
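To ground "exhaustive match over union types", here is a rough Python analogue using typing.assert_never; the variants are invented for illustration. BAML enforces exhaustiveness in its compiler, whereas in Python the equivalent guarantee comes from an external checker such as mypy or pyright.

```python
from dataclasses import dataclass
from typing import Union, assert_never  # assert_never requires Python 3.11+

@dataclass
class Ok:
    value: str

@dataclass
class Retryable:
    after_ms: int

@dataclass
class Fatal:
    reason: str

Result = Union[Ok, Retryable, Fatal]

def handle(r: Result) -> str:
    # A type checker flags this match as non-exhaustive if a new
    # variant is added to Result but not handled here.
    match r:
        case Ok(value):
            return value
        case Retryable(after_ms):
            return f"retry in {after_ms} ms"
        case Fatal(reason):
            return f"giving up: {reason}"
        case _:
            assert_never(r)
```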
About the Speaker
Vaibhav Gupta is the founder and CEO of Boundary (YC), where he and his team build BAML. Before Boundary, he spent most of a decade doing weird things in assembly: predictive pipelines at D. E. Shaw, augmented reality at Google, and real-time 3D reconstruction on HoloLens at Microsoft. Fun fact: Boundary went through 12 pivots before landing on BAML. In his spare time he studies compilers and plays competitive table tennis.
Thank you, Vaibhav and Boundary, for sponsoring this event.