172 Going

The Bangalore Data Lakehouse Meetup

Hosted by Fredson & 3 others
Registration
Approval Required
Your registration is subject to host approval.
Welcome! To join the event, please register below.
About Event

Join us on Saturday, June 27, for a half-day meetup on all things Iceberg, open data, and modern analytics, cohosted by Cloudera, e6data, and OLake.

Expect deep technical talks, live demos, great food, and plenty of time to connect with fellow data practitioners.

Venue: Cloudera Bangalore Office
Time: 10:00 AM to 2:00 PM IST


Speakers

  1. 10:30 - 11:00 AM [Keynote]
    Andrew Madson [Head of DevRel, Fivetran]: Iceberg for Agents - Turning Lakehouse Data Into AI-Ready Context

    AI agents fail in production because they're overwhelmed with data but starved for context. The LLMs aren’t the problem.
    The bottleneck is the data stack: fragmented silos, inconsistent definitions, and logic hidden in tribal knowledge.
    Agents need structured, reliable, and interpretable context, not just data access.

    In this session, we'll show how Apache Iceberg becomes the backbone of AI-ready pipelines. You’ll learn how to elevate your Iceberg implementation from a storage format to a live context layer that powers structured retrieval-augmented generation (RAG), schema-aware agents, and autonomous reasoning grounded in truth.

    What we’ll cover:
      1. Iceberg Foundations for AI - from ACID to Time Travel
      2. From Rows to Relationships - The role of the semantic layer
      3. Structured RAG in Practice - Fully open source

    The session includes a live demo of a fully open-source Structured RAG stack built on Apache Iceberg, featuring semantic query translation, hybrid retrieval, and governed agent reasoning. Expect architecture diagrams, real code, and practical guidance.

  2. 11:00 – 11:30 AM
    Amit Prabhu, Vishwajeet Kumar [Razorpay]: Taming Apache Iceberg at Scale: Streaming Ingestion and Incremental Denormalization at Razorpay

    At Razorpay, our lakehouse platform ingests over 6 billion events daily while generating nearly a million reports every month. As scale grew, full-refresh denormalization became unsustainable — joins across 10–30 entities consumed heavy compute, report freshness lagged by up to 48 hours, and streaming workloads introduced compaction pressure, merge amplification, and out-of-order update challenges on Apache Iceberg.

    In this talk, we’ll share how we re-architected our platform around Apache Iceberg to support both high-throughput streaming and efficient incremental serving. Using Iceberg features such as hidden partitions, bucketing, metadata pruning, and optimized file layouts, we improved streaming merges and enabled scalable runtime joins. We’ll also cover our incremental denormalization framework for highly mutating datasets. Instead of rebuilding tables, we process only impacted records using secondary indexes and graph traversal across entity relationships. Unlike conventional CDC approaches, our framework propagates updates from both fact and dimension entities, including backdated foreign-key updates. These approaches reduced denormalization compute cost and generation time by over 85% while significantly improving freshness and scalability.

  3. 11:30 - 12:00 PM
    Akshat Mathur [Cloudera]: From Data Chaos to Control: How a Global Telco Tamed Petabyte-Scale Challenges with Apache Iceberg

    When a leading telecommunications operator hit the scaling wall with their legacy Hive infrastructure, managing petabyte-scale customer data across billions of records became untenable. Their IDPR workloads suffered from slow queries, rising storage costs, partition explosion, and schema changes that broke downstream systems.

    This session explains why they chose Apache Iceberg and how it transformed their architecture, including the business and technical decision criteria used to select Iceberg over other open table formats.

  4. 12:00 - 12:30 PM 
    Harini Anand [IBM]: Talk to Your Lakehouse: Building an MCP Server for Apache Iceberg

    AI agents are becoming first-class citizens in data engineering, yet most lakehouses still expect humans at the keyboard.

    This talk explores building a Model Context Protocol (MCP) server that exposes an Apache Iceberg catalog, tables, snapshots, schema history, partition specs, and manifests as typed tools that LLMs can reason over and act on. We walk through the Iceberg REST Catalog spec, map its endpoints to MCP tool definitions, and demo an agent that can discover tables by intent, inspect snapshot lineage, explain schema evolution across versions, and construct time-travel queries, all through natural language. Along the way we cover the real engineering challenges: catalog auth delegation, metadata payload size vs. LLM context windows, distinguishing read-safe vs. write-dangerous tool surfaces, and how to scope tool descriptions so the model doesn't hallucinate partition filters. Drawing from experience building AI infra on IBM watsonx.data, we'll close with a reference architecture and lessons on where "agentic data access" works today and where it still falls over.

    Audience takeaways: a mental model for MCP-over-Iceberg, a reference tool schema, and a practical checklist for safely exposing catalog operations to an LLM agent.

  5. 12:30 - 1:00 PM
    Shreyansh Roy [American Express]: From Express APIs to LLMs: Building GenAI-Ready Highly Scalable Web Apps with Apache Iceberg

    As distributed web applications scale, they generate a massive, non-stop influx of user interactions, clickstreams, and operational logs. While traditional transactional databases (SQL/NoSQL) excel at serving live application traffic, they hit a dramatic performance and cost wall when tasked with storing and querying the petabyte-scale historical datasets required to feed modern Generative AI models and Retrieval-Augmented Generation (RAG) pipelines. Enter Apache Iceberg.

    This session explores how application engineers can bridge the gap between high-scale distributed web backends and an open data lakehouse architecture. We will dive into the patterns of using Change Data Capture (CDC) and message streams to offload heavy operational logs from app layers into cheap cloud object storage, utilizing Iceberg to enforce database-grade ACID transactions. Moving beyond basic storage, we will address the critical data infrastructure challenges unique to both distributed web backends and GenAI, and how Iceberg solves those challenges in a cost-effective manner.

  6. 1:00 - 1:30 PM
    Merlyn Mathew, Ankit Kumar [OLake]: Scaling Apache Iceberg Without the Maintenance Headaches

    Every CDC sync quietly adds small files and equality deletes. Over time, scans slow down, and most teams do not notice until queries that finished in seconds start taking minutes. Existing compaction approaches fall short on continuous ingestion. A single rewrite strategy treats small files and equality deletes the same way. Running it frequently causes write amplification. Running it infrequently lets degradation compound. Neither is sustainable under continuous ingestion.

    This talk covers why a tiered compaction model changes the game by making compaction more configurable and efficient depending on the volume, frequency, and nature of updates. It walks through three independent tiers, each targeting a different stage of table degradation, and the tradeoffs behind scheduling each one. Attendees leave with a practical framework for designing compaction schedules for their own CDC pipelines.


About Cloudera

Cloudera is the only data and AI platform company that brings AI to data anywhere: in clouds, data centers, and at the edge. Cloudera delivers 100% of data in all forms, whether it is in Cloudera or anywhere else in the data estate. The world’s largest organizations rely on Cloudera to fuel insights that boost bottom lines, safeguard against threats, and save lives.

Follow us on LinkedIn
Follow us on X
Subscribe to our YouTube Channel


About OLake by Datazip

OLake is an Iceberg-native ingestion tool built for speed and reliability. It is the fastest and most reliable way to bring operational database data into your data lakehouse with Apache Iceberg. From high-volume CDC pipelines to handling schema evolution and large documents, OLake is designed to make ingestion simple, cost-efficient, and production-ready.

As an Iceberg-native, open-source-first project, we’re proud to contribute back to the community and to make it easier for data teams everywhere to adopt Iceberg in production.

Get started today:

📲 OLake Quickstart Guide
🖥️ OLake on GitHub
📚 OLake Documentation
💬 OLake Community Slack

Follow OLake on LinkedIn


About e6data

e6data is a lakehouse compute engine built for SQL analytics and AI workloads at 60% lower cost without migration. It enables enterprises to query data directly from their data lakehouse, autoscales to handle over 1,000 QPS during traffic spikes, and performs vector search, all without data movement, query rewrites, or changes to existing data infrastructure.

Highly performant on Iceberg, e6data is trusted by industry leaders like Freshworks (NASDAQ: FRSH) and Chargebee for its high concurrency and complex workloads.

Location
Cloudera
3rd Floor, No. 6/B, Summit, 80 Feet Rd, Koramangala 1A Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, India