

From Prototype to Production: The Hidden Engineering of AI Inference
Every developer can spin up an LLM app in an afternoon. Almost no one can run it efficiently in production. This talk unpacks the gap between "it works on my laptop" and "it handles 10,000 requests per day without exploding your GPU bill." We go under the hood on the actual mechanics of modern AI inference (KV cache pressure, continuous batching, quantization tradeoffs, and GPU utilization math) and show how each translates directly into cost and latency decisions. Every concept is grounded in real benchmarks and code that attendees can run against a live API the same day.
Key Highlights
Why KV cache memory, not compute, is the real bottleneck in LLM serving (back-of-envelope sizing sketch after this list)
Continuous batching vs. static batching: the optimization that changed production inference (toy scheduler sketch below)
Quantization tradeoffs: FP16, INT8, AWQ, GPTQ, and when each makes sense
Reading model FLOPs utilization (MFU) and what it actually means for your cloud bill (worked calculation below)
Live benchmark walkthrough: 3.7x throughput, 5.1x faster inference, 30% lower cost
Drop-in OpenAI-compatible code patterns for serverless and dedicated endpoints (minimal client sketch below)
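To ground the first highlight, here is a back-of-envelope KV cache sizing sketch in Python. It assumes a Llama-2-7B-style configuration (32 layers, 32 attention heads, head dim 128, FP16); the GPU size and weight footprint are illustrative assumptions, not figures from the talk's benchmarks.

```python
# Back-of-envelope KV cache sizing, assuming a Llama-2-7B-style config.
# Illustrative numbers only, not the talk's measured results.

N_LAYERS = 32
N_KV_HEADS = 32        # Llama-2-7B uses full multi-head attention (no GQA)
HEAD_DIM = 128
BYTES_FP16 = 2

# Each token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")        # ~512 KiB

seq_len = 4096
kv_bytes_per_seq = kv_bytes_per_token * seq_len
print(f"KV cache per 4k sequence: {kv_bytes_per_seq / 2**30:.1f} GiB")   # ~2 GiB

# On a hypothetical 80 GB GPU, ~14 GB of FP16 weights leaves ~66 GB for cache:
gpu_mem_gib = 80
weights_gib = 14
max_concurrent = (gpu_mem_gib - weights_gib) * 2**30 // kv_bytes_per_seq
print(f"Max concurrent 4k sequences: {max_concurrent}")                  # ~33
```

Half a megabyte per token means one 4k-token conversation pins roughly 2 GiB of GPU memory, which is why the cache, not the matmuls, caps concurrency.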
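The second highlight is easiest to see as a toy simulation: with continuous batching, the scheduler admits and retires requests at every decode step instead of waiting for an entire static batch to drain. This sketch is our own illustration, not any serving engine's actual scheduler.

```python
import random
from collections import deque

# Toy continuous-batching scheduler: requests join and leave the running
# batch at token granularity, so short requests never wait on long ones.
# Purely illustrative; real engines add paged KV memory, preemption, etc.

MAX_BATCH = 4
random.seed(0)

# (request_id, tokens_left_to_generate)
waiting = deque((i, random.randint(2, 8)) for i in range(8))
running = {}
step = 0

while waiting or running:
    # Admit new requests whenever a slot frees up (the key difference from
    # static batching, which only refills once the whole batch drains).
    while waiting and len(running) < MAX_BATCH:
        req_id, tokens = waiting.popleft()
        running[req_id] = tokens

    # One decode step: every running request emits one token.
    for req_id in list(running):
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]          # evict immediately, freeing a slot
            print(f"step {step:2d}: request {req_id} finished")
    step += 1

print(f"All requests done in {step} decode steps")
```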
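For the MFU highlight, the standard back-of-envelope uses roughly 2 × parameter-count FLOPs per generated token (matmul work only, attention ignored). Every input below (model size, throughput, GPU peak) is a placeholder assumption, not a measured result.

```python
# Rough MFU (model FLOPs utilization) estimate for LLM decoding, using the
# common ~2 * n_params FLOPs-per-token approximation. Hypothetical inputs.

n_params = 7e9                 # 7B-parameter model
tokens_per_second = 2500       # assumed aggregate decode throughput
peak_tflops = 312              # A100 FP16 dense peak, TFLOP/s

achieved_tflops = 2 * n_params * tokens_per_second / 1e12
mfu = achieved_tflops / peak_tflops
print(f"Achieved: {achieved_tflops:.1f} TFLOP/s, MFU: {mfu:.1%}")
# -> Achieved: 35.0 TFLOP/s, MFU: 11.2%
# Low MFU is common for memory-bound decoding, which is why reading
# utilization correctly matters for the cloud bill.
```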
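Finally, the "drop-in" pattern from the last highlight, sketched with the official openai Python SDK (v1+) pointed at a custom endpoint. The base URL, API key, and model name are placeholders to replace with your provider's values.

```python
# Minimal OpenAI-compatible client pattern. The SDK calls are standard
# (openai>=1.0); the endpoint and model id below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

# Streaming keeps time-to-first-token low on both serverless and
# dedicated endpoints.
stream = client.chat.completions.create(
    model="your-model-name",   # placeholder model id
    messages=[{"role": "user", "content": "Explain KV cache in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

Because the endpoint speaks the OpenAI wire protocol, swapping providers is a one-line base_url change rather than a rewrite.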
Speaker
Roan Weigert is a Developer Relations engineer at GMI Cloud, where he works at the intersection of AI infrastructure and the developer community. He helps engineers navigate the gap between model development and production deployment, with a focus on LLM inference performance, GPU cloud economics, and hands-on technical enablement. At GMI Cloud, he builds the tools, content, and community programs that help AI teams ship faster on NVIDIA-powered infrastructure.
Please join DataPhoenix Slack and follow us on LinkedIn and YouTube to stay updated on our community events and the latest AI and data news.