Ruihang Lai & Hao Kang - PithTrain: A Compact and Agent-Native MoE Training System

ML Systems and Theory - Cohere Labs Open Science Community

Google Meet

About Event

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. AI coding agents, now on the rise, could automate parts of training-framework development and accelerate this evolution. But applying them to existing frameworks carries hidden costs that today's throughput-only evaluations do not capture. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, a benchmark covering real-world training-framework tasks. Our evaluation shows that PithTrain matches the throughput of production frameworks and, on ATE-Bench, enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.

Ruihang Lai is a fourth-year Ph.D. student in the Computer Science Department at Carnegie Mellon University, advised by Tianqi Chen and Todd Mowry. His research interests lie in machine learning compilers, large language model training and inference systems, and machine learning systems more broadly. He is also a Project Management Committee (PMC) member of the Apache TVM project.

Hao Kang is an incoming CS PhD student in the Language Technologies Institute at Carnegie Mellon University, co-advised by Chenyan Xiong and Tianqi Chen. He works on the training of LLMs, with interests in hardware-aligned architectures and the systems that enable such training.

Presented by

ML Systems and Theory - Cohere Labs Open Science Community

Led by Harsha Nelaturu and Andrej Jovanović. Part of the Cohere Labs Open Science initiative: https://cohere.com/research/open-science
