Discord Deep Dive: Smarter Compression for Trillion-Parameter AI Models

Cerebras

Virtual

About Event

Join us for an exclusive deep dive with the Cerebras team! Sherif Cherfa will present a state-of-the-art compression method for large models, straight from Cerebras Research.

REAP: Smarter Compression for Trillion-Parameter AI Models

Frontier AI models like Qwen3-480B and Kimi-K2 pack hundreds of billions of parameters. Join us for a deep dive into REAP (Router-weighted Expert Activation Pruning), a new one-shot technique from Cerebras Research that can cut up to 50% of experts from massive Mixture-of-Experts models while preserving over 96% of their capabilities.

We'll cover:

Why pruning beats merging for generative tasks like code generation, reasoning, and tool use
The concept of functional subspace collapse — what goes wrong when you merge experts instead of removing them
How REAP's saliency scoring identifies which experts to cut by measuring both selection frequency and actual impact
Results across models from 21B to 1 trillion parameters, including SWE-Bench agentic coding benchmarks
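To make the saliency idea concrete ahead of the talk, here is a minimal sketch, not the Cerebras implementation. It assumes an expert's score is the average of (router gate weight × norm of the expert's output) over the tokens routed to it, which is one way to combine "how strongly is it selected" with "how much does it actually contribute". The function names, the toy numbers, and the exact scoring formula are illustrative assumptions; the paper and repository linked below are the authoritative source.

```python
# Hypothetical sketch of a router-weighted saliency score for one MoE layer.
# Assumption: score_j = mean over tokens routed to expert j of
#             gate_weight[t][j] * output_norm[t][j].

def reap_style_saliency(gate_weights, output_norms, top_k):
    """gate_weights[t][j]: router weight of expert j for token t.
    output_norms[t][j]: L2 norm of expert j's output for token t."""
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, norms in zip(gate_weights, output_norms):
        # Experts the router would select for this token (top-k by gate weight).
        selected = sorted(range(num_experts), key=lambda j: gates[j], reverse=True)[:top_k]
        for j in selected:
            totals[j] += gates[j] * norms[j]
            counts[j] += 1
    # Experts that are never selected get a score of 0.
    return [totals[j] / counts[j] if counts[j] else 0.0 for j in range(num_experts)]


def experts_to_prune(saliency, fraction):
    """Return the indices of the lowest-scoring experts, sorted ascending."""
    n = int(len(saliency) * fraction)
    order = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(order[:n])


# Toy calibration data: 3 tokens, 4 experts, top-2 routing (made-up numbers).
gate_weights = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.30, 0.10],
    [0.10, 0.45, 0.05, 0.40],
]
output_norms = [
    [2.0, 0.5, 1.0, 1.0],
    [2.0, 0.5, 1.0, 1.0],
    [2.0, 0.5, 1.0, 3.0],
]

saliency = reap_style_saliency(gate_weights, output_norms, top_k=2)
pruned = experts_to_prune(saliency, fraction=0.5)
print(saliency)
print(pruned)
```

In this toy data, expert 1 is selected as often as expert 0 but its outputs are small, so it still lands in the pruned set. Expert 3 is selected only once yet has a large output, so it is kept. A score based on selection frequency alone would not make that distinction.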

Whether you're deploying large models and care about memory footprint, or you're just curious about the internals of MoE architectures, this talk will give you a concrete, research-backed framework for thinking about model compression.
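For a rough feel of what expert pruning means for memory footprint, the back-of-the-envelope sketch below estimates the parameter count and weight memory after removing a fraction of experts. All numbers are hypothetical: the 100B total, the 90% expert share, and 2 bytes per parameter are assumptions chosen for illustration, not figures from the paper or from any specific model.

```python
# Hypothetical estimate of model size after pruning a fraction of MoE experts.
# Assumption: only expert weights shrink; attention, embeddings, and routers stay.

def params_after_pruning(total_params, expert_share, pruned_fraction):
    """total_params: parameter count of the full model.
    expert_share: fraction of parameters that live in expert weights.
    pruned_fraction: fraction of experts removed."""
    expert_params = total_params * expert_share
    return total_params - expert_params * pruned_fraction


def weight_memory_gb(num_params, bytes_per_param=2):
    """Weight memory in gigabytes (10**9 bytes), ignoring activations and KV cache."""
    return num_params * bytes_per_param / 1e9


total = 100_000_000_000          # hypothetical 100B-parameter MoE model
remaining = params_after_pruning(total, expert_share=0.9, pruned_fraction=0.5)

print(f"before: {weight_memory_gb(total):.0f} GB")
print(f"after:  {weight_memory_gb(remaining):.0f} GB")
```

Under these assumptions, pruning half the experts removes a little under half of the weights, because the non-expert parameters are untouched. The real savings depend on how much of a given model sits in its experts.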

📄 Paper: arxiv.org/abs/2510.13999

💻 Code & model checkpoints: github.com/CerebrasResearch/reap

***

Cerebras delivers the world's fastest AI inference, up to 15x faster than leading GPUs. Cerebras Inference is powered by our Wafer-Scale Engine (WSE-3), the world's largest AI chip. Get free compute at cerebras.ai.

Presented by

Cerebras
