

Computing x Biology Systems Reading Group | LatchBio + Modal
The intersection of computing and engineering biology is a playground for systems: operating systems, file systems, virtualization, programming languages, databases, compilers, fuzzers, distributed systems, etc.
The data generated from molecular measurement kits (spatial, single-cell, etc.) is doubling every few years and things are starting to break.
We'll hear from five awesome speakers who will walk through design decisions, paper highlights + snippets of source code:
Max Smolin | LatchBio: Building "Forch", a Utilitarian Cloud Container Orchestrator
Noam Teyssier | Arc Institute: cyto: ultra high-throughput processing of 10x-flex single cell sequencing
Pavan Ramkumar | SLAF Project: SLAF: A single-cell omics storage format for the virtual cell era
Dhruv Gautam | Arc Institute & UC Berkeley: Lessons in Perturbation Modeling: STATE, STACK, and Beyond
Ben Shababo | Modal: Leveraging Serverless Distributed Computing to Scale Computational Biology
Event space is provided by LatchBio, and Modal is generously sponsoring food and refreshments.
Important: Our office (Lobby 5) is on the 4th Street side of the building. Come in on the river side through the sliding doors, or through the lobby on the Berry St. side.
Agenda
5:00 - 6:00 Meet others. Eat + drink.
6:00 - 8:30 Talks + Q&A
8:30 - TBD Socialize
Abstracts
Building "Forch", a Utilitarian Cloud Container Orchestrator
Max Smolin | LatchBio
> How hard can it be to spin up an EC2 instance and feed it a container? In Kubernetes, why does this take a distributed key-value store, Promise Theory, record-of-intent, and a dozen layers of abstraction? Is a better world possible? This talk covers the ground-up design and development of the container orchestrator that has been replacing K8s for our internal needs (currently at 50% of workloads and growing). It includes a brief discussion of our experiences operating a large Kubernetes cluster, a set of motivations for a new orchestrator, the principles of Forch's design, as well as the most interesting implementation details. It is intended as a systems design case study for a software engineer audience.
cyto: ultra high-throughput processing of 10x-flex single cell sequencing
Noam Teyssier | Arc Institute
> Single-cell genomics is scaling toward billion-cell atlases, but computational analysis remains a bottleneck. Here we present cyto, an ultra-high-throughput processor for 10x Genomics Flex single-cell sequencing optimized for production-scale analysis. cyto exploits the fixed sequence geometry of Flex libraries through O(1) hash lookups rather than alignment, and leverages BINSEQ, a binary sequencing format that enables highly parallel processing. On a 320,000-cell multiplexed dataset, cyto completes processing in 13 minutes versus CellRanger's 3.7 hours, a 17× speedup with 32× fewer CPU-hours, while maintaining 99.85% concordance with CellRanger outputs and identical cell type clustering. cyto is open-source and provides the computational foundation for atlas-scale single-cell projects and genome-wide perturbation screens.
SLAF: A single-cell omics storage format for the virtual cell era
Pavan Ramkumar | SLAF Project
> Single-cell transcriptomics datasets have scaled 2,000-fold in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks in two ways. SLAF is a high-performance format for single-cell transcriptomics data built on top of the Lance table format and Polars. For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch, but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.
Lessons in Perturbation Modeling: STATE, STACK, and Beyond
Dhruv Gautam | Arc Institute & UC Berkeley
> Perturbation modeling tackles the problem of causal effect estimation under significant biological noise. In contrast to sequence modeling, and although data generation is growing exponentially, perturbation models will continue to operate in the data-bound regime. In this talk, we will discuss the implications of this, the engineering decisions behind SOTA models like STATE and STACK, and how to design the right inductive biases that enable unique downstream biological capabilities. We will also discuss how formalizing the underlying training objectives of perturbation biology enables rapid future experimentation.
Leveraging Serverless Distributed Computing to Scale Computational Biology
Ben Shababo | Modal
> Many patterns and workflows in computational biology involve iterating between highly parallelized computations and aggregating those results. In this talk, we'll show how to use serverless computing to implement and optimize these workflows such that compute resources can be provisioned dynamically at runtime depending on the specifics of a job.
Excited to see you guys here and learn a bit more about computers.