

Owning the Inference Stack / 01 / with Vidya - Open Registration
Your inference stack has three independent picks now: architecture (Transformer vs hybrid), training (PPO vs GRPO), silicon (NVIDIA vs four serious alternatives). Each wrong pick costs you 2x to 20x. The three no longer constrain each other the way they did in 2024, so your old defaults are no longer free.
We walk through the three picks that decide what you ship next.
Architecture. At 128K context, hybrid Mamba-Transformer throughput beats pure attention by ~20x. At 16K it is ~2.5x. Nemotron-3 Super and IBM Granite 4.0 already ship the pattern. The lever is per-layer: how many attention blocks you keep for state tracking, how many SSM blocks you swap in for cheap decode at long context.
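One way to picture the per-layer lever is as a layer schedule: a list saying which blocks stay attention and which become SSM. The sketch below is illustrative only. The function name, the `attention_every` parameter, and the evenly spaced placement are assumptions made for this example, not the published layout of Nemotron-3 Super or Granite 4.0.

```python
def hybrid_layer_pattern(n_layers: int, attention_every: int) -> list[str]:
    """Return a block type per layer, keeping one attention block
    every `attention_every` layers and using SSM blocks elsewhere.

    Hypothetical helper: real hybrids choose placement empirically.
    """
    if n_layers <= 0 or attention_every <= 0:
        raise ValueError("n_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "ssm"
        for i in range(n_layers)
    ]


# Example: a 12-layer stack that keeps every 4th block as attention.
pattern = hybrid_layer_pattern(n_layers=12, attention_every=4)
attention_share = pattern.count("attention") / len(pattern)
print(pattern)
print(f"attention share: {attention_share:.0%}")
```

Raising `attention_every` trades state-tracking capacity for cheaper long-context decode; lowering it moves the stack back toward a pure Transformer.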
Training. GRPO (the critic-free RL algorithm DeepSeek-R1 was trained with) removes the critic from PPO's four-model RLHF setup (policy, reference, reward model, critic) and lowers the hardware floor with it. A serious 70B run drops from two 8x B200 nodes to one. The bottleneck moves from gradient compute to rollout throughput, and the open frontier becomes operational, not algorithmic.
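The critic-free part can be sketched in a few lines. Instead of a learned value model, GRPO samples a group of completions per prompt and scores each one against the group's own mean reward. This is a minimal sketch of that advantage step only. The function name, the `eps` term, and the use of population standard deviation are choices made for this example, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    The group mean plays the role a critic's value estimate plays in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled completions for one prompt, scored pass/fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Every advantage requires a full group of rollouts for the same prompt, which is why rollout throughput rather than gradient compute becomes the constraint.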
Silicon. The $/token spread across vendors on the same Llama 3.3 70B is now wider than the spread between Llama 3.3 and Claude Haiku. Five families sell tokens at scale (NVIDIA B200/B300, Cerebras CS-3, Groq LPU, TPU Ironwood, Trainium). They price Llama 3.3 70B anywhere from roughly $0.60 to $0.88 per million tokens, and across the wider range of inference SKUs the spread for the same workload exceeds an order of magnitude. The $20B NVIDIA-Groq deal in December 2025 was the public admission that GPUs have a TTFT floor (TTFT is time-to-first-token, the latency a streaming app feels).
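To turn a per-million-token price into a line item, the arithmetic looks roughly like this. The two prices are the $0.60 and $0.88 endpoints quoted above; the monthly token volume is a made-up figure for illustration, and the function name is hypothetical.

```python
def monthly_cost_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Convert a token volume and a $/M-token price into a monthly bill."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


LOW_PRICE = 0.60   # $/M tokens, low end quoted above
HIGH_PRICE = 0.88  # $/M tokens, high end quoted above
TOKENS_PER_MONTH = 5_000_000_000  # assumed workload, not from any vendor

low_bill = monthly_cost_usd(TOKENS_PER_MONTH, LOW_PRICE)
high_bill = monthly_cost_usd(TOKENS_PER_MONTH, HIGH_PRICE)
spread = HIGH_PRICE / LOW_PRICE

print(f"low: ${low_bill:,.0f}  high: ${high_bill:,.0f}  spread: {spread:.2f}x")
```

At this assumed volume the same model costs about $3,000 or about $4,400 a month depending on the vendor, a spread of roughly 1.5x before any latency differences are counted.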
Format: three 15-minute beats from Vidya, one open question per beat with room discussion, then social hour. Doors 5:45pm, talks 6:00 to 8:00, social 8:00 to 9:00.
This is episode 01 of Owning the Inference Stack. Future episodes target serving economics, eval at scale, and distillation as a cost lever. Themes lock as speakers commit.
Come with a model you are paying to serve. Bring the $/M-token line item, or the latency complaint, or the GPU bill you are trying to halve. This is for founders past MVP making serving decisions, not first-time builders shopping for frameworks.
This event is hosted at the Frontier Tower:
We are transforming a 16-floor tower in San Francisco into a self-governed vertical village, a hub for frontier technologies and creative arts. Tier-one labs will present work across AI, Ethereum, biotech, neuroscience, longevity, robotics, makerspace, human flourishing, and arts & music. These floors will house innovators and creators pushing the boundaries of human potential in a post-AI-singularity world.
Apply here for founding citizenship: https://frontiertower.io/apply
Why should I become a citizen?
Be part of creating the first self-governed vertical village
Connect with the most creative people in the city
Get access to all floors, free event space & movement floor
Website: https://frontiertower.io/
Need more reading? Visit https://frontiertower.notion.site/