Presented by
Cerebras

Discord Deep Dive: Smarter Compression for Trillion-Parameter AI Models

Virtual
About Event

Join us for an exclusive deep dive with the Cerebras team! Sherif Cherfa will talk about a state-of-the-art compression method for large models from Cerebras Research.

REAP: Smarter Compression for Trillion-Parameter AI Models

Frontier AI models like Qwen3-480B and Kimi-K2 pack hundreds of billions, and up to a trillion, parameters. Join us for a deep dive into REAP (Router-weighted Expert Activation Pruning), a new one-shot technique from Cerebras Research that can cut up to 50% of experts from massive Mixture-of-Experts models while preserving over 96% of their capabilities.

We'll cover:

  • Why pruning beats merging for generative tasks like code generation, reasoning, and tool use

  • The concept of functional subspace collapse — what goes wrong when you merge experts instead of removing them

  • How REAP's saliency scoring identifies which experts to cut by measuring both selection frequency and actual impact

  • Results across models from 21B to 1 trillion parameters, including SWE-Bench agentic coding benchmarks
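
To make the saliency idea above concrete, here is a minimal sketch. It assumes an expert's score is the average, over the tokens routed to it, of the router gate weight multiplied by the norm of the expert's output; the paper has the exact definition. The function names `reap_saliency` and `experts_to_prune` and the toy numbers are hypothetical, chosen only for illustration, and are not taken from the REAP codebase.

```python
import numpy as np

def reap_saliency(gates, out_norms, top_k):
    """Score experts by gate weight x output norm, averaged over routed tokens.

    gates:     (tokens, experts) router weights (assumed layout)
    out_norms: (tokens, experts) L2 norms of each expert's output (assumed layout)
    """
    # Mark the top_k experts the router selects for each token.
    topk_idx = np.argsort(-gates, axis=1)[:, :top_k]
    active = np.zeros(gates.shape, dtype=bool)
    np.put_along_axis(active, topk_idx, True, axis=1)

    # Contribution of an expert to a token: gate weight times output norm.
    contrib = np.where(active, gates * out_norms, 0.0)
    counts = active.sum(axis=0)
    # Average over routed tokens only; experts that are never selected score 0.
    saliency = contrib.sum(axis=0) / np.maximum(counts, 1)
    return saliency, counts

def experts_to_prune(saliency, fraction):
    """Return the indices of the lowest-saliency experts."""
    n_prune = int(len(saliency) * fraction)
    return sorted(np.argsort(saliency)[:n_prune].tolist())

# Toy calibration data: 4 tokens, 4 experts, top-2 routing.
gates = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.45, 0.35, 0.15, 0.05],
    [0.10, 0.20, 0.60, 0.10],
    [0.60, 0.05, 0.25, 0.10],
])
# Expert 1 is selected often but produces very small outputs.
out_norms = np.ones_like(gates)
out_norms[:, 1] = 0.1

saliency, counts = reap_saliency(gates, out_norms, top_k=2)
print("times selected:", counts.tolist())
print("saliency:", np.round(saliency, 4).tolist())
print("prune:", experts_to_prune(saliency, fraction=0.5))
```

In this toy setup, expert 1 is selected as often as expert 0 but contributes little to the output, so it is pruned alongside the never-selected expert 3. A score based on selection frequency alone would have kept it.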

Whether you're deploying large models and care about memory footprint, or you're just curious about the internals of MoE architectures, this talk will give you a concrete, research-backed framework for thinking about model compression.

📄 Paper: arxiv.org/abs/2510.13999

💻 Code & model checkpoints: github.com/CerebrasResearch/reap

***

Cerebras offers the world’s fastest AI inference, up to 15x faster than leading GPUs. Cerebras Inference is powered by our Wafer-Scale Engine (WSE-3), the world's largest AI chip. Get free compute at cerebras.ai.
