

Discord Deep Dive: Smarter Compression for Trillion-Parameter AI Models
Join us for an exclusive deep dive with the Cerebras team! Sherif Cherfa will talk about a state-of-the-art compression method for large models, straight from Cerebras Research.
REAP: Smarter Compression for Trillion-Parameter AI Models
Frontier AI models like Qwen3-480B and Kimi-K2 pack hundreds of billions of parameters. Join us for a deep dive into REAP (Router-weighted Expert Activation Pruning), a new one-shot technique from Cerebras Research that can cut up to 50% of experts from massive Mixture-of-Experts models while preserving over 96% of their capabilities.
We'll cover:
Why pruning beats merging for generative tasks like code generation, reasoning, and tool use
The concept of functional subspace collapse — what goes wrong when you merge experts instead of removing them
How REAP's saliency scoring identifies which experts to cut by measuring both selection frequency and actual impact
Results across models from 21B to 1 trillion parameters, including SWE-Bench agentic coding benchmarks
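To make the saliency idea concrete ahead of the talk, here is a minimal Python sketch. It is an illustration under stated assumptions, not the REAP implementation: it assumes an expert's score is the average, over the tokens routed to it, of its router gate weight times the norm of its output, and that experts never selected score zero. The function names (expert_saliency, experts_to_prune) and the toy data are hypothetical; see the paper and repository for the actual method.

```python
from math import sqrt

def expert_saliency(gate_weights, expert_outputs, top_k):
    """Assumed scoring rule: mean of (gate weight * output norm) over routed tokens.

    gate_weights[t][e]   -- router weight of expert e for token t
    expert_outputs[t][e] -- output vector of expert e for token t
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, outputs in zip(gate_weights, expert_outputs):
        # Experts the router selects for this token: top-k by gate weight.
        selected = sorted(range(num_experts), key=lambda e: gates[e], reverse=True)[:top_k]
        for e in selected:
            norm = sqrt(sum(v * v for v in outputs[e]))
            totals[e] += gates[e] * norm
            counts[e] += 1
    # Experts that were never selected get a score of 0.0 (an assumption).
    return [totals[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]

def experts_to_prune(saliency, prune_fraction):
    """Return the indices of the lowest-scoring experts, in ascending index order."""
    n_prune = int(len(saliency) * prune_fraction)
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e])
    return sorted(ranked[:n_prune])

# Toy data: 3 tokens, 4 experts, top-2 routing. Each expert emits the same
# vector for every token to keep the arithmetic easy to follow.
per_expert_output = [[3.0, 4.0], [0.3, 0.4], [0.6, 0.8], [1.0, 0.0]]
expert_outputs = [per_expert_output for _ in range(3)]
gate_weights = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.35, 0.05],
    [0.10, 0.60, 0.25, 0.05],
]

saliency = expert_saliency(gate_weights, expert_outputs, top_k=2)
pruned = experts_to_prune(saliency, prune_fraction=0.5)
print(saliency)
print(pruned)
```

In this toy setup, experts 0 and 1 are each selected for two of the three tokens, yet expert 1 contributes small outputs and ends up among the pruned experts alongside the never-selected expert 3. Counting selections alone would not separate experts 0 and 1, which is why the score also weighs how much each expert actually contributes.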
Whether you're deploying large models and care about memory footprint, or you're just curious about the internals of MoE architectures, this talk will give you a concrete, research-backed framework for thinking about model compression.
📄 Paper: arxiv.org/abs/2510.13999
💻 Code & model checkpoints: github.com/CerebrasResearch/reap
***
Cerebras delivers the world’s fastest AI inference, up to 15x faster than leading GPUs. Cerebras Inference is powered by our Wafer-Scale Engine (WSE-3), the world’s largest AI chip. Get free compute at cerebras.ai.