Open Lakehouse + AI

GPU accelerated Spark data processing and metadata management for Gen AI workloads

Virtual
About Event

Abstract

This talk outlines the architecture and functionality of NVIDIA's GPU-accelerated Data Science Platform, which is designed to streamline data processing and metadata capture for Generative AI (Gen AI) workloads. The platform provides APIs for ingestion, processing, and retrieval, and uses the RAPIDS Accelerator for Apache Spark™ to run its pipelines on GPUs. We use open source technologies such as Apache Spark™, RAPIDS, Delta Lake, and Kubeflow.

Delta Lake, the open source storage layer of the Open Lakehouse, underpins a reliable, high-quality medallion architecture and provides the ACID properties needed for versioning, reproducibility, and concurrent metadata management of the massive datasets that feed Gen AI model training.
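
For readers unfamiliar with how the RAPIDS Accelerator is switched on for a Spark job, here is a minimal sketch. It is not the platform's actual configuration, which the abstract does not describe. The property names (`spark.plugins`, `spark.rapids.sql.enabled`, and the GPU resource amounts) come from the public RAPIDS Accelerator and Apache Spark documentation. The helper functions `rapids_spark_conf` and `to_submit_args` and the default values are hypothetical and chosen only for illustration. A real job would also need the RAPIDS Accelerator jar on the classpath.

```python
def rapids_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Build Spark settings that load the RAPIDS Accelerator plugin.

    Hypothetical helper: the defaults are illustrative assumptions,
    not values taken from the talk.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("GPU and task counts must be at least 1")
    return {
        # Load the RAPIDS Accelerator as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Let supported SQL/DataFrame operations run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Number of GPUs each executor requests.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU per task, so several tasks can share one GPU.
        "spark.task.resource.gpu.amount": f"{1 / tasks_per_gpu:g}",
    }


def to_submit_args(conf):
    """Flatten a settings dict into spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(rapids_spark_conf())))
```

The same dictionary could instead be passed key by key to `SparkSession.builder.config(...)` when the session is created in code rather than through `spark-submit`.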

Speaker

Niranjan Nataraja is Senior Manager, Accelerated Data Processing and ML Platform, at NVIDIA. In more than 15 years at NVIDIA, he has worked on numerous projects building big data pipelines for data science tasks and creating mathematical models for data center operations and cloud gaming services. Niranjan holds a Master’s degree in Industrial Engineering from Texas A&M University, with a primary focus on production economics.
