Presented by
Turing Events

Efficient Large Language Model Deployment with Quantization: A UPenn Lecture | Sponsored by Turing

Zoom
Past Event
About Event

You are invited to join a special virtual stream of Prof. Mayur Naik’s CIS 7000 course on Large Language Models at the University of Pennsylvania, sponsored by Turing.

Speaker: Guangxuan Xiao, Ph.D. Candidate, MIT

Abstract: Large language models (LLMs) achieve state-of-the-art performance across a wide range of AI applications but demand significant compute and memory resources, limiting their deployment on edge devices and cloud servers alike. Quantization, which reduces model precision to accelerate inference and lower memory usage, offers a promising solution, yet it comes with unique challenges when applied to LLMs.
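To make "reduces model precision" concrete, here is a minimal sketch (not from the talk) of symmetric per-tensor INT8 quantization with NumPy: floats are mapped onto the integer range [-127, 127] with a single scale factor, and dequantization recovers an approximation whose per-element error is bounded by half the scale.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from INT8 values."""
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 2.0, -1.2], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
# round-to-nearest error is at most scale/2 per element
```

Storing `q` instead of `x` halves memory versus FP16 (and quarters it versus FP32), which is the basic lever the techniques below build on.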

In this talk, I will present a series of cutting-edge quantization techniques developed by our lab to address these challenges. We begin with SmoothQuant, a post-training quantization method that enables INT8 weight and activation quantization, preserving accuracy and achieving up to 1.56x speedup and 2x memory reduction across models such as Llama-2, OPT, and Falcon. Then, we explore AWQ, a hardware-efficient, activation-aware weight quantization technique for low-bit, on-device LLMs, which delivers superior accuracy and more than 3x speedup on edge GPUs. Finally, we introduce QServe, an innovative W4A8KV4 quantization and system co-design solution for large-batch LLM serving, which significantly reduces GPU dequantization overhead and improves throughput by up to 3.5x compared to existing frameworks.
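The core idea behind SmoothQuant can be sketched in a few lines; the snippet below is an illustrative simplification (the function names and random data are my own, not the paper's code). A per-input-channel scale migrates activation outliers into the weights, leaving the matrix product mathematically unchanged while making the activations easier to quantize:

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style smoothing sketch: divide activations by a
    per-channel scale and multiply weights by the same scale, so
    X @ W is preserved but activation outliers shrink."""
    act_max = np.max(np.abs(X), axis=0)   # per-channel activation range
    w_max = np.max(np.abs(W), axis=1)     # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha))
    X_s = X / s                           # smoothed activations
    W_s = W * s[:, None]                  # compensated weights
    return X_s, W_s

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 0] *= 50                             # simulate one outlier channel
W = rng.normal(size=(8, 3))
X_s, W_s = smooth(X, W)
# X @ W and X_s @ W_s are equal; max |X_s| is much smaller than max |X|
```

The hyperparameter `alpha` controls how much of the quantization difficulty is moved from activations to weights; 0.5 splits it evenly.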

These advancements collectively offer practical, scalable solutions for efficient LLM deployment in both cloud and edge environments, democratizing access to powerful language models while optimizing hardware costs.

Bio: Guangxuan Xiao is a third-year Ph.D. candidate at MIT EECS, advised by Prof. Song Han. He focuses on creating efficient algorithms for deep learning, especially for large language models (LLMs). His work has earned widespread attention, receiving over 9,000 GitHub stars and making a tangible impact on industry practices. His key contributions, including SmoothQuant and StreamingLLM, have been widely adopted and integrated into platforms such as NVIDIA's TensorRT-LLM, HuggingFace, and Intel's Neural Compressor.

Supplementary Reading: [paper 1] [paper 2] [paper 3]

Learn more about the course at CIS 7000 - Large Language Models.
