
Open Lakehouse Meetup | Amsterdam

Amsterdam, Noord-Holland
Registration Closed
This event is not currently taking registrations. You may contact the host or subscribe to receive updates.
About Event

Open Lakehouse Meetup - Wednesday, August 27, 5:00 PM – 9:00 PM GMT+2 | Amsterdam, Netherlands

We're bringing together the open source and data engineering community for an evening focused on the latest in open lakehouse and AI architectures! 🚀 Whether you work on data infrastructure, contribute to open source, or want to dive into the future of AI and interoperable lakehouse systems, you'll fit right in.

Don't miss this opportunity to accelerate your data journey and contribute to shaping the future of data and AI! 🌟

5:00 - 6:00 PM: Registration & Mingling

6:00 - 6:05 PM: Welcome Remarks

6:05 - 6:40 PM: Session #1: Scaling Multimodal AI Lakehouse with Lance & LanceDB

  • Chang She, Co-founder & CEO of LanceDB, co-author of pandas

6:40 - 6:50 PM: Session #2: Your Lakehouse Has Everything You Need

6:50 - 7:25 PM: Session #3: DuckLake - The SQL-Powered Lakehouse Format

7:25 - 8:00 PM: Session #4: Composable Open Table Formats - Integrating Open Table Formats with the Composable Data Stack

8:00 - 9:00 PM: Reception with bites and beverages

9:00 PM: Goodnight

_________________

Session Abstracts

Scaling Multimodal AI Lakehouse with Lance & LanceDB

LanceDB’s Multimodal Lakehouse (MMLH) is the next-generation lakehouse built from day one to treat documents, video, audio, images, and sensor streams as first-class data. These multimodal workloads—powering innovators like Midjourney, WorldLabs, and Runway—unlock massive value, yet scaling AI-driven multimodal apps remains painful on traditional lakehouses.

MMLH provides a unified foundation optimized across the multimodal AI lifecycle:

  • AI application serving: low-latency random-access reads and search APIs for vectors, text, and binaries

  • Feature engineering + data curation: schema primitives that evolve seamlessly across blobs and metadata for model-driven inference and bulk backfills

  • Training & fine-tuning: high-throughput petabyte-scale data loading with efficient vector and full-text search

We’ll dive into the key capabilities—fast random-access at scale, vector + full-text search, and optimized schema primitives—so you can iterate rapidly without blowing your budget. By the end, you’ll have a concrete blueprint for running production-grade, petabyte-scale multimodal pipelines with LanceDB’s MMLH, freeing your team to focus on innovation instead of data plumbing.
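To give a flavor of the serving APIs described above, here is a minimal sketch using LanceDB's embedded Python client. This is an illustration only, not material from the talk: the local path, table name, captions, and toy two-dimensional vectors are all hypothetical, and it requires `pip install lancedb`.

```python
def demo_vector_search():
    """Sketch of LanceDB's embedded serving API (hypothetical data)."""
    import lancedb  # imported lazily so the sketch can be defined without the package

    db = lancedb.connect("./lance_demo")  # a local directory; object-store URIs also work
    table = db.create_table(
        "clips",
        data=[
            {"vector": [0.1, 0.2], "caption": "a red bicycle"},
            {"vector": [0.9, 0.8], "caption": "a mountain at dusk"},
        ],
        mode="overwrite",
    )
    # Nearest-neighbour search over the 'vector' column
    return table.search([0.1, 0.25]).limit(1).to_list()
```

In a real deployment the table would hold model-generated embeddings alongside blob references, with the same search call serving low-latency lookups.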

Your Lakehouse Has Everything You Need

Your team just got tasked with building an ML pipeline that processes millions of images, documents, and videos stored across your lakehouse. You need to generate embeddings, deduplicate content, and prepare training datasets, all at production scale. The good news? Your existing lakehouse already has everything you need. Iceberg, Delta Lake, and Unity Catalog can store and manage multimodal data: after all, images and videos are just URLs to blobs in S3, and documents are paths to files. The challenge is finding a query engine that can actually process it. Better news: all you need is Python. Simply pip install daft and use Daft's familiar Python DataFrame API to transform your lakehouse into a multimodal processing powerhouse.

  • Image processing for ML training: load millions of images from Iceberg tables, resize and augment them, and prepare training datasets

  • Document processing: extract text from PDFs, generate embeddings, and build search indexes at scale

  • Video analysis: process video files, extract frames, and run computer vision models

  • Large-scale deduplication: find and remove duplicate content across text, images, and documents

  • Batch inference: run foundation models on terabyte datasets stored in your lakehouse

Coming from Spark? Daft has a PySpark-compatible API, so your existing code works with minimal changes. We'll walk through a complete multimodal pipeline: load images from Delta Lake, pre-process them, generate embeddings, and store the results in your vector DB of choice. Pure Python, familiar APIs, production scale.
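The image-loading steps of such a pipeline can be sketched with Daft's DataFrame API. This is a rough illustration rather than the talk's actual code: it requires `pip install daft`, and both the table URI and the `path` column name are assumptions about the table's layout.

```python
def build_image_pipeline(table_uri: str):
    """Sketch of a Daft pipeline: Delta Lake table -> decoded, resized images.

    Assumes the table has a 'path' column of image URLs (hypothetical).
    """
    import daft  # imported lazily so the sketch can be defined without Daft installed

    df = daft.read_deltalake(table_uri)
    df = df.with_column("image", daft.col("path").url.download())   # fetch the blobs
    df = df.with_column("image", daft.col("image").image.decode())  # bytes -> image
    df = df.with_column("image", daft.col("image").image.resize(256, 256))
    # An embedding step would typically follow as a user-defined function.
    return df
```

Because Daft is lazy, the returned DataFrame describes the pipeline; execution happens when results are collected or written out.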

DuckLake - The SQL-Powered Lakehouse Format

Managing changes to tables in data lakes has historically been very challenging. The formats and systems involved did not exactly cooperate, and as a result sketchy workarounds were all too common. This is ostensibly solved by the advent of lakehouse formats, which attempt to sanitize changes by specifying formats, processes, and conventions for modifying tables.

However, common lakehouse formats like Iceberg only appear majestic until one starts looking under the surface. There lurks a huge amount of complexity, along with engineering decisions whose trade-offs no longer hold. And even after all that, hard problems like transactional consistency are delegated to an opaque catalog server, e.g. Polaris or Unity Catalog.

DuckLake re-imagines the Lakehouse design by putting a SQL database in charge of managing metadata. This allows a very elegant design that still scales arbitrarily and greatly reduces complexity, with the actual table data still being on object stores in open format. For the first time, DuckLake allows a “multi-player” experience with DuckDB, where computation can happen anywhere and in parallel, but with centralized transactional safety.
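The core idea, metadata in a SQL database with data files on object storage, can be illustrated with a deliberately tiny stdlib toy. This is not DuckLake's actual schema or API, just a sketch of the design principle: committing a new table version becomes an ordinary SQL transaction.

```python
import sqlite3

# Toy metadata catalog in the spirit of DuckLake (NOT its real schema):
# the SQL database tracks snapshots and the data files belonging to each,
# while the Parquet files themselves would live on object storage.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, table_name TEXT);
    CREATE TABLE data_files (snapshot_id INTEGER, file_uri TEXT);
""")

# Publishing a new table version is a single atomic SQL transaction.
with con:
    con.execute("INSERT INTO snapshots VALUES (1, 'events')")
    con.executemany(
        "INSERT INTO data_files VALUES (1, ?)",
        [("s3://bucket/events/part-0.parquet",),
         ("s3://bucket/events/part-1.parquet",)],
    )

# A reader resolves a snapshot to its file list with one query.
files = [r[0] for r in con.execute(
    "SELECT file_uri FROM data_files WHERE snapshot_id = 1 ORDER BY file_uri")]
print(files)
```

Because every writer and reader goes through the same transactional database, many engines can operate on the same tables concurrently, which is the "multi-player" property the abstract describes.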

Composable Open Table Formats - Integrating Open Table Formats with the Composable Data Stack

Lakehouse architecture and composable data systems are shaping the modern data landscape, driven by the need for interoperability between an increasing number of compute engines and formats. Thanks to the fast-paced adoption of these technologies and standards, infrastructure has grown to support the seamless exchange of data and logic, but key gaps remain. By making open table formats fully composable, we can build more extensible and reliable systems, repeating the success of Apache Arrow and similar projects. In this talk we will explore a novel set of APIs implementing open table formats like Apache Iceberg and Delta Lake, with a strong focus on composability and interoperability across query engines.

_________________

Speaker Bios

Chang She is the CEO and cofounder of LanceDB, the developer-friendly, open-source database for multi-modal AI. A serial entrepreneur, Chang has been building DS/ML tooling for nearly two decades and is one of the original contributors to the pandas library. Prior to founding LanceDB, Chang was VP of Engineering at TubiTV, where he focused on personalized recommendations and ML experimentation.

Sammy Sidhu is a Deep Learning and systems veteran, holding over a dozen publications and patents in the space. Sammy graduated from the University of California, Berkeley where he did research in Deep Learning and High Performance Computing. He then joined DeepScale as the Chief Architect and led the development of perception technologies for autonomous vehicles. During this time, DeepScale grew rapidly and was subsequently acquired by Tesla in 2019. Staying in Autonomous Vehicles, Sammy joined Lyft Level 5 as a Senior Staff Software Engineer, building out core perception algorithms as well as infrastructure for machine learning and embedded systems. Level 5 was then acquired by Toyota in 2021, adopting much of his work.

Hannes Mühleisen is a creator of the DuckDB database management system and Co-founder and CEO of DuckDB Labs. He is a senior researcher at the Centrum Wiskunde & Informatica (CWI) in Amsterdam. He is also Professor of Data Engineering at Radboud University Nijmegen.

Robert Pack has extensive experience in designing and implementing Data & AI platforms within large multinational organizations. Through this work he has been an avid contributor to the open lakehouse ecosystem, specifically Delta Lake. Now at Databricks, his focus is entirely on facilitating and contributing to the open source ecosystem for building lakehouse architectures.

Ion Koutsouris is a maintainer of the delta-rs project, with a strong background in business IT and data science. A “recovering data scientist,” Ion has shifted his focus from pure data science to engineering roles in the data and machine learning space.

Location
Please register to see the exact location of this event.
Amsterdam, Noord-Holland