
South Bay Systems: Apache Pinot on Object Storage / Variants in Apache Doris
Welcome to another edition of South Bay Systems! This time we have a double feature: first, Songqiao Su and Raghav Yadav will talk about optimizing Apache Pinot for real-time analytics; then Owen Xiao will talk about variants and semi-structured data in Apache Doris.
Agenda
6:00 PM: Doors open, food and socializing
6:30 PM – 7:00 PM: Apache Pinot Talk
7:00 PM – 7:30 PM: Apache Doris Talk
7:30 PM onward: Community socializing!
Food and beverages will be provided, courtesy of our hosts, Adobe.
Low-Latency Serving on Cloud Object Stores with Apache Pinot
In this talk, we present the evolution of Apache Pinot’s architecture: from tightly coupled storage and compute, to decoupled cloud storage, and now toward native support for Parquet as a first-class segment format. We will discuss key technical innovations such as a Parquet-compatible forward index reader, which enables all of Pinot’s indexing strategies to operate directly on Parquet files. Additional optimizations include index pinning, Parquet page-level selective reads, page prefetching for efficient I/O parallelism, and page caching. Together, these enhancements allow Pinot’s indexing and query execution framework to deliver sub-second performance directly on Parquet data, going far beyond conventional metadata-based pruning approaches.
Speaker Bios
Songqiao Su is a Staff Software Engineer at StarTree.AI, working on building tiered storage and improving compute–storage decoupling in Apache Pinot and StarTree Cloud. His work focuses on large-scale, high-performance distributed systems. Before joining StarTree, he worked on network and RPC infrastructure at Facebook and Databricks.
Raghav Yadav is a Staff Software Engineer at StarTree.AI, working on building a low-latency serving layer on Iceberg in Apache Pinot and StarTree Cloud. His expertise spans distributed databases and large-scale systems, with experience in cloud-scale data infrastructure at Microsoft Azure, real-time streaming databases as a founding engineer at Grainite, and now real-time OLAP analytics at StarTree.
The Evolution of Semi-Structured Data Analytics: From Text, JSON to VARIANT
Abstract
Semi-structured data, such as JSON, is gaining widespread adoption due to its flexibility. However, traditional databases and data warehouses are built for structured schemas, creating new challenges in storing and analyzing semi-structured formats. In this session, we’ll explore:
Characteristics and challenges of semi-structured data
Limitations of traditional approaches
Apache Doris’ native solution for semi-structured analytics
Comparison with Snowflake, Iceberg (VARIANT type), and Elasticsearch
Real-world applications in log analytics, distributed tracing, and IoT
Speaker Bio
Owen Xiao is a co-founder of VeloDB and a PMC member of Apache Doris, where he leads product strategy, observability, and AI-driven R&D for both open-source and enterprise data platforms. With over 10 years of experience in database kernel development and distributed systems architecture, he has helped scale analytical databases for global enterprises.
