
Open Lakehouse + AI Mini Summit | Mountain View

Mountain View, California
Registration
Approval Required
Your registration is subject to approval by the host.
Welcome! To join the event, please register below.
About Event

Open Lakehouse + AI Mini Summit - Thursday, November 13, 12:00 PM – 4:30 PM PST | Mountain View, CA

​We're bringing together the open source and data engineering community for a half-day focused on the latest in open lakehouse and AI architectures! 🚀 With two dynamic tracks, there’s something for everyone—whether you work on data infrastructure, contribute to open source, or want to dive into the future of AI and interoperable lakehouse systems, you’ll fit right in.

Agenda

12:00PM — Lunch/Registration

1:30PM — Welcome Remarks (Jules Damji, Lisa Cao — Databricks)

1:30PM - 2:05PM — Session #1: “From Data to AI: Leveraging Unity Catalog to Train at Scale” — Aniruth Narayanan, Databricks

1:30PM - 2:05PM — Session #1: “Scaling Multimodal AI Lakehouse with Lance & LanceDB” — Chang She, LanceDB

2:10PM - 2:45PM — Session #2: “Scaling Apache Spark at OpenAI” — Chao Sun, OpenAI

2:10PM - 2:45PM — Session #2: “What's New in Spark-Iceberg Integration via DSV2” — Szehon Ho, Databricks & Huaxin Gao, Snowflake

2:45PM - 3:05PM — BREAK

3:05PM - 3:40PM — Session #3: “Declarative Pipelines: What’s Next for Apache Spark” — Sandy Ryza, Databricks

3:05PM - 3:40PM — Session #3: “How Feature Platforms Fit into the World of AI” — Hao Xu, Apple

3:45PM - 4:20PM — Session #4: “Upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics” — DB Tsai & Xiao Li, Databricks

4:30PM — Event Ends

5:00PM - 6:30PM — Spark Happy Hour 🍻

6:30PM — Goodnight

​​Don't miss this opportunity to accelerate your data journey and contribute to shaping the future of data and AI! ​🌟

Abstracts

From Data to AI: Leveraging Unity Catalog to Train at Scale

Modern AI applications require a complex mix of tabular data, documents, and images. Managing these assets across disjointed systems creates security risks and operational friction, slowing the path from data to production-ready models and making end-to-end lineage nearly impossible.

In this session, we will explore how Unity Catalog provides a single, unified plane to manage the entire AI data lifecycle at scale. We’ll cover:

Unifying Data in Open Formats: Leverage Unity Catalog as a single source of truth to discover, manage, and govern structured data in open table formats like Delta Lake and Apache Iceberg

Incorporating Unstructured Data with Volumes: Use Unity Catalog Volumes to bring unstructured data - such as images, audio, and documents - under the same familiar access control and governance model as your tables

From Governed Data to Production Models: Manage the full model lifecycle with MLflow and register models in Unity Catalog for seamless governance

Leave this session with a comprehensive playbook for building AI on your data to break down silos and simplify governance.
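
As a taste of the workflow this session covers, here is a minimal, hedged sketch of the final step (registering a trained model in Unity Catalog via MLflow). It assumes a configured Databricks workspace and MLflow 2.x; the catalog, schema, and model names are illustrative, not prescribed by the talk.

import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumes Databricks authentication is already configured in the environment.
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")  # use Unity Catalog as the model registry

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=infer_signature(X, model.predict(X)),
        # Three-level Unity Catalog name: catalog.schema.model (illustrative).
        registered_model_name="main.ml.iris_classifier",
    )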

​Scaling Multimodal AI Lakehouse with Lance & LanceDB

LanceDB’s Multimodal Lakehouse (MMLH) is the next-generation lakehouse built from day one to treat documents, video, audio, images, and sensor streams as first-class data. These multimodal workloads—powering innovators like Midjourney, WorldLabs, and Runway—unlock massive value, yet scaling AI-driven multimodal apps remains painful on traditional lakehouses.

​MMLH provides a unified foundation optimized across the multimodal AI lifecycle:
​- AI application serving: low-latency random-access reads and search APIs for vectors, text, and binaries
- Feature engineering + data curation: schema primitives that evolve seamlessly across blobs and metadata for model-driven inference and bulk backfills
- Training & fine-tuning: high-throughput petabyte-scale data loading with efficient vector and full-text search

​We’ll dive into the key capabilities—fast random-access at scale, vector + full-text search, and optimized schema primitives—so you can iterate rapidly without blowing your budget. By the end, you’ll have a concrete blueprint for running production-grade, petabyte-scale multimodal pipelines with LanceDB’s MMLH, freeing your team to focus on innovation instead of data plumbing.
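
For a flavor of the developer experience, here is a minimal, hedged sketch (not from the talk) of LanceDB's Python client storing embeddings alongside metadata and running a vector search; the path, table name, and columns are illustrative.

import lancedb

# Local directory for this sketch; object storage URIs work as well.
db = lancedb.connect("./lance_demo")

rows = [
    {"id": 1, "caption": "a red bicycle", "vector": [0.10, 0.90, 0.30, 0.00]},
    {"id": 2, "caption": "a mountain road at dusk", "vector": [0.80, 0.10, 0.20, 0.50]},
]
table = db.create_table("images", data=rows, mode="overwrite")

# Approximate nearest-neighbor search over the "vector" column.
for hit in table.search([0.10, 0.80, 0.30, 0.10]).limit(2).to_list():
    print(hit["id"], hit["caption"])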

Scaling Apache Spark at OpenAI

In this talk, we’ll share our experiences operating Apache Spark at scale within OpenAI’s data platform. We’ll discuss how we run both Databricks Spark and self-hosted open-source Spark in parallel, with a deeper focus on the latter - covering key areas such as cluster management, job submission architecture, access control, and dynamic scaling. We’ll also highlight how Spark and Delta Lake power some of our most critical data pipelines, the challenges we’ve faced in building and maintaining them, and the approaches that have made Spark a reliable and efficient foundation for large-scale data processing at OpenAI.
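
This abstract is about operations rather than code, but as a rough, hypothetical illustration of two of its themes (dynamic scaling and Delta Lake pipelines on self-hosted Spark), here is a sketch; the package version, paths, and settings are assumptions, not OpenAI's configuration.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-pipeline-sketch")
    # Delta Lake on open-source Spark; the artifact version must match your Spark/Scala build.
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Dynamic scaling: executors grow and shrink with load.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    .getOrCreate()
)

events = spark.read.json("s3a://example-bucket/raw/events/")  # hypothetical input path
daily = events.groupBy("event_date", "event_type").count()
daily.write.format("delta").mode("overwrite").save("s3a://example-bucket/curated/daily_counts")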

What's New in Spark-Iceberg Integration via DSV2

DataSource V2 (DSv2) is one of Apache Spark’s most important yet least visible features. It bridges Spark’s world-class processing engine and a wide range of data backends—from next-generation table formats like Apache Iceberg, to established and proven JDBC sources, to legacy systems such as Hive Metastore and raw Parquet files. While most users never interact with DSv2 directly, its design enables Spark to transparently map its powerful query optimizations to the unique capabilities and performance features of vastly different data sources.

In this talk, we’ll uncover the evolution of DSv2 and explore specific examples of integrating Apache Iceberg in areas like reporting statistics, aggregate pushdown, maintenance procedures, and advanced column metadata like default values, generated columns, and constraints. We’ll also look ahead to what’s next for DSv2—what its ongoing evolution could mean for both the future of Spark and the rapidly evolving world of table formats.
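
To make the DSv2 integration concrete, here is a minimal, hedged sketch (not from the talk) of querying an Iceberg table through Spark's catalog plugin; the package version, catalog name, warehouse path, and table names are illustrative assumptions.

from pyspark.sql import SparkSession

# Illustrative only: the Iceberg runtime version must match your Spark/Scala build.
spark = (
    SparkSession.builder.appName("iceberg-dsv2-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 19.99), (2, 5.00)")

# With DSv2, Spark can push work down to the source; simple aggregates like
# COUNT/MIN/MAX may be answered from Iceberg metadata rather than full scans.
spark.sql("SELECT COUNT(*) AS n, MAX(amount) AS max_amount FROM demo.db.orders").show()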

Declarative Pipelines: What's Next for Apache Spark™

Earlier this year, we announced Spark Declarative Pipelines (SDP), which has made it dramatically easier to build robust Spark pipelines using a framework that abstracts away orchestration and complexity. The SDP declarative framework extends beyond individual queries to enable a mix of batch and streaming pipelines, keeping multiple datasets fresh.

In this session, we'll share a broader vision for the future of Spark Declarative Pipelines — one that opens the door to a new level of openness, standardization, and community momentum. Key takeaways include:
the core concepts behind Spark Declarative Pipelines;
where the architecture is headed; and
what this shift means for both existing users and Spark engineers writing procedural code.
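
As a rough, hedged sketch of what a declarative pipeline definition can look like: SDP is still landing in Apache Spark, so the module path and decorator names below are assumptions based on the public proposal and may differ in the release; dataset names are illustrative.

from pyspark import pipelines as dp  # assumed module path for SDP
from pyspark.sql import SparkSession, functions as F

@dp.table
def raw_orders():
    # Streaming ingest; the pipeline runtime keeps this dataset fresh as files arrive.
    spark = SparkSession.getActiveSession()
    return (
        spark.readStream.schema("order_date DATE, amount DOUBLE")
        .format("json")
        .load("/data/orders/")
    )

@dp.materialized_view
def daily_revenue():
    # Batch-style aggregation; SDP resolves the dependency on raw_orders and
    # orchestrates refresh order, instead of the user wiring it up procedurally.
    spark = SparkSession.getActiveSession()
    return (
        spark.read.table("raw_orders")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )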

How Feature Platforms Fit into the World of AI

As the AI landscape accelerates, traditional machine learning infrastructure may seem less relevant. But is it really? In fact, Feast has continued to grow rapidly, gaining adoption across industries as a critical foundation for AI systems. In this talk, we’ll explore how Feast is evolving beyond a feature store into a broader feature platform for AI. We’ll highlight recent innovations such as the Compute Engine, Feast for Retrieval-Augmented Generation (RAG), and On-Demand Feature Views, showing how Feast serves as the “glue” that connects data, models, and applications in modern AI workflows.
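
To ground the "glue" claim, here is a minimal, hedged sketch of Feast serving features for online inference; the repo path, feature view name, feature names, and entity key are illustrative.

from feast import FeatureStore

# Assumes a Feast repo (feature_store.yaml plus feature definitions) already exists here.
store = FeatureStore(repo_path=".")

response = store.get_online_features(
    features=[
        "user_stats:purchase_count_7d",   # hypothetical feature view and features
        "user_stats:avg_basket_value",
    ],
    entity_rows=[{"user_id": 1001}],
)
print(response.to_dict())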

Upcoming Apache Spark™ 4.1: The Next Chapter in Unified Analytics

Apache Spark™ is the leading open source engine for large-scale data processing. The upcoming Spark 4.1 release improves its capabilities for both large deployments and individual developers. This release includes:

An optimized, Apache Arrow-based unified interface for PySpark UDFs (see the sketch after this list)

Enhanced UDFs and UDTFs with better debugging tools and a smoother developer experience

More flexible data source support through a simplified Python Data Source API

Richer SQL capabilities, including Time Data Types and SQL Scripting support

A more “Pythonic” experience — easier installation, clearer error messages, and modern APIs

Spark Declarative Pipelines (SDP), enabling data engineers to build robust pipelines declaratively

Spark in the Generative AI era, with features tailored for AI and LLM workloads

Real-Time Mode (RTM), enabling streaming and interactive workloads with sub-second latency

And much more…
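
As a small, hedged illustration of the Arrow-based UDF path mentioned above (the useArrow flag already exists in recent Spark releases; the column and function names here are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("arrow-udf-sketch").getOrCreate()

@udf(returnType=StringType(), useArrow=True)  # Arrow-based (de)serialization between JVM and Python
def shout(word: str) -> str:
    return word.upper() + "!"

df = spark.createDataFrame([("spark",), ("lakehouse",)], ["word"])
df.select(shout("word").alias("loud")).show()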

Location
Mountain View, California