

Led by Harsha Nelaturu and Andrej Jovanović. Part of the Cohere Labs Open Science initiative https://cohere.com/research/open-science
Doğaç Eldenk - Attention Drift – What speculative decoding models learn
About Event
Speculative decoding speeds up LLM inference by drafting tokens with a small model, but drafters degrade sharply under template perturbation and long contexts. We identify a new phenomenon, attention drift: as the drafter generates within a speculation chain, its attention shifts away from the prompt onto its own recent tokens. We trace this to hidden-state magnitude accumulation across drafting steps and fix it with a post-norm architecture—EAGLE 3.1—that improves resilience and performance.
Bio: Doğaç is a Master's student in Northwestern University's Computer Science program and is joining Fal as a Machine Learning Engineer. His work focuses on inference acceleration, from speculative decoding to agentic GPU kernel optimization and discovery.