AI Journal Club for Researchers ft. Yonggan Fu (NVIDIA)
Join us at our AI Research Lab in San Francisco.
Registration
Past Event
Join the waitlist to be notified if additional spots become available.
About Event

Join the Workato AI Research Lab for small, discussion-driven sessions with fellow AI researchers.

These gatherings will bring together researchers to share recent papers, discuss ongoing work, and exchange perspectives on how AI research is shaping real-world systems. The focus is on open dialogue, technical depth, and learning from peers working at the forefront of their field.

Efficient Language Modeling with Hybrid Architectures

Transformers with attention mechanisms have become the dominant choice for language models (LMs) due to their strong performance and the long-term recall enabled by key-value (KV) caches. However, their quadratic computational cost and high memory demands pose significant efficiency challenges. In contrast, state space models (SSMs) such as Mamba offer constant per-token compute and memory and are well suited to hardware-efficient execution, but they struggle with memory-intensive recall tasks.
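To make this trade-off concrete, here is a minimal NumPy sketch (illustrative only, not code from the talk) contrasting one decoding step of attention, whose work grows with the number of cached key-value pairs, with one step of a simplified linear time-invariant SSM, whose fixed-size state gives constant work per token. All names and the toy recurrence are assumptions for illustration.

```python
import numpy as np

def attention_step(q, K, V):
    """One decoding step of attention: the new query attends to ALL
    cached keys/values, so per-step cost grows with context length t
    (O(t^2) total over a sequence)."""
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the whole cache
    return weights @ V                      # weighted sum of all cached values

def ssm_step(h, x, A, B, C):
    """One decoding step of a simplified linear time-invariant SSM: the
    fixed-size state h summarizes the entire history, so each step costs
    O(1) regardless of how long the context is."""
    h = A @ h + B @ x                       # fold the new token into the state
    return h, C @ h                         # read the output from the state

rng = np.random.default_rng(0)
d, t = 16, 128
K, V, q = rng.normal(size=(t, d)), rng.normal(size=(t, d)), rng.normal(size=d)
print(attention_step(q, K, V).shape)        # (16,) -- touched all t cache entries

A, B, C = 0.9 * np.eye(d), np.eye(d), np.eye(d)
h, y = ssm_step(np.zeros(d), rng.normal(size=d), A, B, C)
print(y.shape)                              # (16,) -- constant work per step
```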

In this talk, we present our research on building hybrid architectures that combine the strengths of attention mechanisms and SSMs to achieve both accurate and efficient language modeling. We first introduce Hymba (ICLR’25 Spotlight), a hybrid-head LM architecture that integrates attention heads and SSM heads within the same layer, enabling parallel and complementary processing of the same inputs. This hybrid-head design allows each layer to simultaneously leverage the high-resolution recall of attention and the efficient context summarization of SSMs, increasing the model’s expressiveness in handling diverse information flows.
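As a rough sketch of the hybrid-head idea, the PyTorch snippet below runs an attention head and a toy SSM head in parallel over the same input and fuses their outputs. The diagonal linear recurrence standing in for a Mamba-style head, the module names, and the simple averaged fusion are all assumptions for illustration, not the actual Hymba implementation.

```python
import torch
import torch.nn as nn

class ToySSMHead(nn.Module):
    """Stand-in for a Mamba-style SSM head: a diagonal linear recurrence
    that summarizes the sequence into a fixed-size running state."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.full((dim,), 0.9))
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        h = x.new_zeros(x.shape[0], x.shape[2])
        u, outs = self.in_proj(x), []
        for t in range(x.shape[1]):             # sequential scan, constant state size
            h = self.decay * h + u[:, t]
            outs.append(self.out_proj(h))
        return torch.stack(outs, dim=1)

class HybridHeadLayer(nn.Module):
    """Attention and SSM heads read the SAME input in parallel; their
    outputs are fused here by simple averaging (an assumed fusion rule)."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = ToySSMHead(dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)  # high-resolution recall
        ssm_out = self.ssm(x)                                 # compact context summary
        return x + 0.5 * (attn_out + ssm_out)                 # residual + fused heads

x = torch.randn(2, 32, 64)
print(HybridHeadLayer(64)(x).shape)                           # torch.Size([2, 32, 64])
```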

To further improve real-device efficiency, we introduce Nemotron-Flash (NeurIPS’25), one of the strongest 3B-scale small language models to date. We identify two key architectural factors—depth–width ratios and operator choices—which are critical for small-batch latency and large-batch throughput, respectively, and develop targeted architectural improvements to optimize both. Beyond architectural design, we also enhance LM training with a weight normalization technique that enables more effective weight updates and improves convergence. We hope that the actionable insights and guidelines presented in this work will inform future research on low-latency, high-throughput language models.
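The abstract does not spell out the weight normalization technique used here, so as a general reference point the sketch below implements classic weight normalization (Salimans & Kingma, 2016), which reparameterizes each weight vector into a direction and a learned magnitude so that updates to the two are decoupled. Treat it as background for the idea, not as the Nemotron-Flash method.

```python
import torch
import torch.nn as nn

class WeightNormLinear(nn.Module):
    """Linear layer with classic weight normalization: each output row of
    the weight matrix is reparameterized as w = g * v / ||v||, decoupling
    the direction of the weight vector from its magnitude."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.g = nn.Parameter(torch.ones(out_features))   # learned per-row magnitude
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Normalize each row of v to unit norm, then rescale by the gain g.
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return x @ w.t() + self.bias

layer = WeightNormLinear(64, 32)
print(layer(torch.randn(8, 64)).shape)   # torch.Size([8, 32])
```

PyTorch ships an equivalent reparameterization as torch.nn.utils.parametrizations.weight_norm, if you want to apply it to an existing layer.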

Featured Speaker

Dr. Yonggan Fu, Research Scientist at NVIDIA

Yonggan Fu is a Research Scientist at NVIDIA. He obtained his PhD from the Georgia Institute of Technology, advised by Dr. Yingyan (Celine) Lin. He was a recipient of the IBM PhD Fellowship and was selected as a Machine Learning and Systems Rising Star in 2023.

Yonggan’s research focuses on building efficient foundation models and algorithms that democratize AI on everyday devices. At NVIDIA, he led the development of Nemotron-Flash, Hymba, and Efficient-DLM. His work has been featured as spotlight papers at ICLR (2025, 2021, and 2020) and as an oral paper at ECCV (2024), and was selected as an IEEE Micro Top Pick in 2023.

Who Should Attend

AI researchers and practitioners working at the intersection of AI research and real-world systems

About Workato

Workato is an enterprise automation and integration platform that orchestrates workflows across applications, data, and systems, enabling secure, governed execution of complex processes as organizations adopt AI agents at scale.

Location
Workato
600 Illinois St, San Francisco, CA 94107, USA