Presented by

Open Lakehouse + AI is a global community advancing open lakehouse and AI through adoption, sharing real-world use cases, and collaboration. Check out our upcoming events!

Hosted By

AI

GPU accelerated Spark data processing and metadata management for Gen AI workloads

Open Lakehouse + AI

YouTube

Past Event

About Event

Abstract

This talk outlines the architecture and functionality of NVIDIA's GPU-accelerated Data Science Platform, designed to streamline data processing and metadata capture for Generative AI (Gen AI) workloads. The platform provides APIs for ingestion, processing, and retrieval, and uses the RAPIDS Accelerator for Apache Spark™ to GPU-accelerate its pipelines. We use open source technologies such as Apache Spark™, RAPIDS, Delta Lake, and Kubeflow.

Delta Lake, an open source storage layer for the Open Lakehouse, underpins a reliable, high-quality medallion architecture and provides the ACID properties needed for versioning, reproducibility, and concurrent metadata management of the massive datasets that feed Gen AI model training.
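
As a rough illustration of how the RAPIDS Accelerator and Delta Lake are typically switched on together in a Spark job, the minimal sketch below collects the relevant settings in one place. It is not taken from the talk or from NVIDIA's platform: the helper name `gpu_delta_spark_conf` and the `gpus_per_executor` parameter are hypothetical, and the sketch assumes the RAPIDS Accelerator and Delta Lake jars are already available to the cluster.

```python
from typing import Dict


def gpu_delta_spark_conf(gpus_per_executor: int = 1) -> Dict[str, str]:
    """Return Spark settings that enable the RAPIDS Accelerator and Delta Lake.

    Hypothetical helper for illustration only; the platform described in the
    abstract may configure its clusters differently.
    """
    return {
        # Load the RAPIDS Accelerator plugin so supported SQL/DataFrame
        # operations are planned to run on the GPU.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Request GPUs for each executor from the cluster manager.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Register Delta Lake's SQL extension and catalog so tables get
        # ACID transactions and versioned reads.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": (
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"
        ),
    }


if __name__ == "__main__":
    # Print the settings in the form accepted by spark-submit.
    for key, value in gpu_delta_spark_conf().items():
        print(f"--conf {key}={value}")
```

With PySpark installed, each key/value pair could instead be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.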

Speaker

Niranjan Nataraja is Senior Manager, Accelerated Data Processing and ML Platform, at NVIDIA. In more than 15 years at NVIDIA, he has worked on numerous projects building big data pipelines for data science tasks and creating mathematical models for data center operations and cloud gaming services. Niranjan has a Master’s degree in Industrial Engineering from Texas A&M University, with a primary focus on production economics.