
Inference & GPU Optimization: GPTQ

About Event

Large Language Models (LLMs) keep getting bigger, and their performance across many tasks and problems keeps improving. We expect this to continue.

At the same time, they’re getting smaller, and their performance is improving across specific tasks and problems. Similarly, expect more of this in the future.

The performance vs. cost tradeoff for LLMs and Small Language Models (SLMs) must always account for compute requirements, for two reasons:

  1. Fine-Tuning & Alignment: Any fine-tuning and alignment work that must be done to commercial off-the-shelf or open-source LLMs

  2. Inference: Right-sizing your model for the performance, task, and scale of the application

During inference, we must ensure the best possible output.

Optimizing our inference means achieving that output with the minimum possible compute.

One way to accomplish this is to leverage techniques like quantization, which compresses our LLM while maintaining a high-quality output.
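
To make this concrete, here is a minimal sketch of the simplest quantization scheme, symmetric round-to-nearest, in PyTorch. (This illustrates quantization in general, not the GPTQ algorithm; the bit width and weight shape are arbitrary choices for the example.)

    import torch

    def quantize_rtn(weights: torch.Tensor, bits: int = 4):
        # Symmetric round-to-nearest: map float weights onto a small
        # integer grid defined by a single per-tensor scale.
        qmax = 2 ** (bits - 1) - 1                 # e.g., 7 for 4-bit
        scale = weights.abs().max() / qmax
        q = torch.clamp(torch.round(weights / scale), -qmax, qmax)
        return q.to(torch.int8), scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor):
        return q.float() * scale

    w = torch.randn(4096, 4096)                    # stand-in weight matrix
    q, scale = quantize_rtn(w, bits=4)
    err = (w - dequantize(q, scale)).abs().mean().item()
    print(f"mean abs error: {err:.5f}")

Methods like AWQ and GPTQ improve on this naive rounding by using calibration data to decide how to scale and round, which is what preserves output quality at low bit widths.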

Specialized versions of quantization have been developed, and we cover these techniques in this series!

In Part I, we covered Activation-aware Weight Quantization (AWQ). In Part II, we'll cover Generative Pretrained Transformer Quantization (GPTQ).

In this event, we’ll leverage our foundational understanding of optimizing inference for GPT models to dig into the components that make GPTQ useful and unique. We will also dig deeper into the details of quantization for LLMs.
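
As a preview of the code portion, here is roughly what applying GPTQ looks like in practice via the Hugging Face transformers integration. (A sketch, not the exact notebook from the event: the model ID is a placeholder, and the GPTQConfig arguments assume a recent transformers version with a GPTQ backend installed.)

    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "facebook/opt-125m"  # placeholder; any causal LM works

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 4-bit GPTQ: the calibration dataset ("c4" here) supplies the
    # activations behind the second-order approximation discussed below.
    gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

    # Quantization happens one-shot, post-training, at load time.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=gptq_config,
        device_map="auto",
    )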

According to the authors of the original paper, GPTQ is:

a one-shot weight quantization method based on approximate second-order information

We’ll break down exactly what this language means, from concepts to code!
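
As a taste of that breakdown: the "approximate second-order information" is the Hessian of each layer's reconstruction error on calibration data. For a layer with weights W and calibration inputs X, GPTQ solves the layer-wise objective (in LaTeX, following the paper's setup):

    \arg\min_{\hat{W}} \; \lVert WX - \hat{W}X \rVert_2^2,
    \qquad \text{whose Hessian is} \quad H = 2XX^{\top}

GPTQ quantizes the weights one column at a time (with lazy block updates for efficiency), using the inverse of this Hessian to adjust the not-yet-quantized weights so the layer's output error stays small.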

📚 You’ll learn:

  • How quantization and GPTQ help to speed up inference and reduce latency

  • How GPTQ compares to other leading methods like AWQ

🤓 Who should attend the event:

  • Aspiring AI Engineers who want to optimize inference for LLM applications

  • AI Engineering leaders who want to serve LLM applications at scale in production

Speakers:

  • “Dr. Greg” Loughnane is the Co-Founder & CEO of AI Makerspace, where he is an instructor for The AI Engineering Bootcamp. Since 2021, he has built and led industry-leading Machine Learning education programs. Previously, he worked as an AI product manager, a university professor teaching AI, an AI consultant and startup advisor, and an ML researcher. He loves trail running and is based in Dayton, Ohio.

  • Chris “The Wiz” Alexiuk is the Co-Founder & CTO at AI Makerspace, where he is an instructor for The AI Engineering Bootcamp. During the day, he is also a Developer Advocate at NVIDIA. Previously, he was a Founding Machine Learning Engineer, Data Scientist, and ML curriculum developer and instructor. He’s a YouTube content creator whose motto is “Build, build, build!” He loves Dungeons & Dragons and is based in Toronto, Canada.

Follow AI Makerspace on LinkedIn and YouTube to stay updated about workshops, new courses, and corporate training opportunities.
