Palo Alto Data Analytics Platforms - Meetup
Network with fellow users and engineers of Data Analytics Platforms to learn about current issues, methods, and best practices. Collectively, we integrate many open source projects like Iceberg, Flink, Spark, and Polaris. Gather with the community to share where we are and where we’ll go in the future.
We’ve lined up a few short talks this session:
Vishal Jhala & Madhukar Mulpuri – PayPal - Building a Global Data Lake from Zero to Production in 90 Days
PayPal is building a global Network of Wallets, and we were tasked with building a data lake for the network from scratch in just one quarter. The challenge wasn't just technical: we had to build the team while building the platform. How do you make the right architectural bets and deliver production-grade infrastructure when every decision has downstream implications for security, compliance, and scale? This talk shares how we transformed an ambitious mandate into a strategic platform serving three critical stakeholder needs: real-time network intelligence for operations, governance visibility for risk management, and trust through transparency for participants. We'll explore the architectural principles that enabled rapid delivery, schema enforcement with evolution flexibility, and cross-cloud resiliency, all while navigating constraints like PII security, schemaless-to-schema data transformation, and BigQuery's eventual consistency at scale. You'll walk away with practical examples of navigating the classic trade-off between perfect architecture and shipped product.
Venkata krishnan Sowrirajan – LinkedIn - Charting New Territory: LinkedIn’s Early Bet on Flink Batch for Large-Scale Workloads
As one of the earliest adopters of Flink Batch, LinkedIn has taken a bold step toward redefining large-scale batch processing. This talk shares how we built a production-grade Flink Batch platform from the ground up—covering architectural decisions, platform engineering challenges, and lessons learned while scaling it across mission-critical workflows. If you're considering Flink beyond streaming, this is your inside look at what it takes to run Flink Batch reliably at scale.
Chao Sun – OpenAI - Scaling Apache Spark at OpenAI
In this talk, Chao will share lessons from running Apache Spark at massive scale within OpenAI’s data platform. He will cover how OpenAI operates both Databricks Spark and self-hosted open-source Spark in parallel, with a deeper dive into the self-hosted stack - including cluster management, job-submission architecture, access control, and dynamic scaling. Chao will also highlight the key open-source projects powering OpenAI’s Spark infrastructure and offer a look at the roadmap ahead.
Cliff Lau & Prasad Karkera – GEICO - Developing an AI Solution to a Manual Maintenance Problem
Maintaining healthy data ecosystems requires regular table maintenance, but most data owners neither need nor want to manage these details. To address this, we developed an automated system that recommends and executes table maintenance based on table activity metrics. This presentation goes into some detail on our approach, starting with a self-service tool and then leveraging Iceberg event metrics to automatically enroll and schedule maintenance for tables as they enter our data ecosystem.