Doğaç Eldenk - Attention Drift – What speculative decoding models learn

ML Systems and Theory - Cohere Labs Open Science Community

Google Meet

About Event

Speculative decoding speeds up LLM inference by drafting tokens with a small model and verifying them with the larger target model, but drafters degrade sharply under template perturbation and long contexts. We identify a new phenomenon, attention drift: as the drafter generates within a speculation chain, its attention shifts away from the prompt and onto its own recent tokens. We trace this to hidden-state magnitude accumulation across drafting steps and fix it with a post-norm architecture, EAGLE 3.1, that improves resilience and performance.
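
As a rough intuition for the magnitude-accumulation claim, the toy sketch below compares a pre-norm residual update with a post-norm one across a chain of drafting steps. It is not the EAGLE 3.1 architecture, which the abstract does not specify in detail. The names `rotate_pairs` and `hidden_norms` are hypothetical, and the stand-in "layer" is a fixed rotation chosen only so that each update is orthogonal to the current state and the growth is easy to verify by hand.

```python
import math

def rms_norm(v):
    """Scale v to unit root-mean-square (no learned gain; assumes v is non-zero)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / rms for x in v]

def l2(v):
    """Euclidean norm of v."""
    return math.sqrt(sum(x * x for x in v))

def rotate_pairs(v):
    """Hypothetical stand-in for a drafter layer: rotate each (even, odd) pair by 90 degrees.

    The output is always orthogonal to the input and has the same length.
    """
    out = []
    for i in range(0, len(v), 2):
        out.extend([-v[i + 1], v[i]])
    return out

def hidden_norms(steps, dim=16, post_norm=False):
    """Return the L2 norm of the hidden state after each drafting step."""
    h = [1.0] * dim  # starts at unit RMS, so its L2 norm is sqrt(dim)
    norms = []
    for _ in range(steps):
        if post_norm:
            # Post-norm: normalise after the residual add, so the state fed to
            # the next step always has the same scale.
            h = rms_norm([a + b for a, b in zip(h, rotate_pairs(h))])
        else:
            # Pre-norm: only the layer input is normalised; the residual stream
            # itself is never rescaled, so its magnitude can accumulate.
            update = rotate_pairs(rms_norm(h))
            h = [a + b for a, b in zip(h, update)]
        norms.append(l2(h))
    return norms

pre = hidden_norms(8, post_norm=False)
post = hidden_norms(8, post_norm=True)
```

In this toy, the pre-norm state's norm grows as `sqrt(dim * (t + 1))` after `t` steps, while the post-norm state stays at `sqrt(dim)`. Real drafters have learned weights and attention, so the actual growth pattern would differ.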

Bio: Doğaç is a Master's student in Computer Science at Northwestern University and is joining Fal as a Machine Learning Engineer. His work focuses on inference acceleration, from speculative decoding to agentic GPU kernel optimization and discovery.

Presented by

ML Systems and Theory - Cohere Labs Open Science Community

Led by Harsha Nelaturu and Andrej Jovanović. Part of the Cohere Labs Open Science initiative: https://cohere.com/research/open-science
