

90/30 Club (ML reading) #47: TurboQuant: Near-Optimal Vector Quantization for LLM Memory
Week 47: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
TurboQuant proposes a simple but surprisingly powerful idea: if you randomly rotate high-dimensional vectors, their coordinates become nearly independent and tightly concentrated, so you can quantize each coordinate separately with an optimal scalar quantizer and still get near-optimal global performance. The result is a data-oblivious, online quantization scheme that achieves distortion rates within a small constant factor of the information-theoretic optimum.
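For anyone who wants to poke at the idea before the session, here is a minimal sketch of "rotate, then quantize each coordinate". It is an illustration under stated assumptions, not the paper's algorithm: a plain uniform quantizer stands in for the optimal scalar quantizer, the rotation is built with Gram-Schmidt using only the Python standard library, and every function name is made up for this example.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def random_rotation(d, rng):
    """Random orthogonal d x d matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        n = norm(v)
        if n > 1e-9:  # skip the vanishingly unlikely degenerate draw
            rows.append([a / n for a in v])
    return rows


def rotate(Q, x):
    """Compute Q @ x."""
    return [sum(q * a for q, a in zip(row, x)) for row in Q]


def rotate_back(Q, y):
    """Compute Q^T @ y, the inverse of rotate() for orthogonal Q."""
    d = len(y)
    return [sum(Q[i][j] * y[i] for i in range(d)) for j in range(d)]


def quantize(y, bits):
    """Uniform scalar quantizer on [-c, c] (a stand-in, NOT the paper's quantizer)."""
    c = max(abs(v) for v in y) or 1.0
    step = 2 * c / (2 ** bits - 1)
    codes = [round((v + c) / step) for v in y]
    return codes, c


def dequantize(codes, c, bits):
    step = 2 * c / (2 ** bits - 1)
    return [k * step - c for k in codes]


rng = random.Random(0)
d, bits = 16, 4
Q = random_rotation(d, rng)

# A "spiky" vector: all of its energy sits in one coordinate.
x = [0.0] * d
x[0] = 10.0

y = rotate(Q, x)                       # energy is now spread over all coordinates
codes, c = quantize(y, bits)           # d small integers plus one scale
x_hat = rotate_back(Q, dequantize(codes, c, bits))

rel_err = norm([a - b for a, b in zip(x, x_hat)]) / norm(x)
print(f"relative reconstruction error at {bits} bits: {rel_err:.4f}")
```

The rotation never looks at the data, which is what "data-oblivious" means here: the same `Q` can be fixed up front and applied to vectors as they arrive. In practice NumPy or a fast structured transform would replace the pure-Python loops.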
What makes this especially relevant is its application to KV cache compression in large language models. The paper shows that you can push KV cache storage down to ~3–3.5 bits per channel with essentially no quality loss, directly attacking one of the biggest bottlenecks in long-context inference.
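To get a feel for what that bit rate means in memory terms, here is a back-of-the-envelope calculation. The model shape below (32 layers, 8 KV heads, head dimension 128, 131,072 tokens of context) is a hypothetical configuration chosen for round numbers, not one taken from the paper, and the estimate ignores any per-vector overhead such as stored scales or norms.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # factor 2 = K and V
    return values * bits_per_value / 8


GIB = 2 ** 30

# Hypothetical model shape, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=131072)

fp16_gib = kv_cache_bytes(**shape, bits_per_value=16) / GIB
low_gib = kv_cache_bytes(**shape, bits_per_value=3.5) / GIB

print(f"16-bit cache:  {fp16_gib} GiB")            # 16.0 GiB
print(f"3.5-bit cache: {low_gib} GiB")             # 3.5 GiB
print(f"reduction:     {fp16_gib / low_gib:.2f}x")  # 4.57x
```

For this assumed shape, a single long-context sequence drops from 16 GiB to 3.5 GiB of cache, which is the kind of gap that decides how many concurrent sequences fit on one accelerator.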
Join us at Mox to explore:
- Is TurboQuant actually a breakthrough, or is it a clever recombination of classical ideas?
- What matters more in practice: provable near-optimality, or engineering simplicity and deployability?
- If the KV cache is the real bottleneck for long-context LLMs, does this shift where we should focus optimization (away from weights and toward runtime state)?
Discussion at 20:00; optional quiet reading from 19:00.