


Systems Reading Group with Arc Institute, LatchBio + FutureHouse
The intersection of computing and engineering biology is a playground for systems: operating systems, file systems, virtualization, programming languages, databases, compilers, fuzzers, distributed systems, etc.
In this biotech-flavored version of the SF systems reading group we'll hear from four awesome speakers who will walk through design decisions, paper highlights + snippets of source code:
Aidan Abdulali | LatchBio: A Distributed Filesystem Built on Postgres and S3
Noam Teyssier | Arc Institute: BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences
James Braza | FutureHouse: Edge of Tomorrow Algorithms
Abhinav Adduri | Arc Institute: Scaling Deep Learning to 1B+ Single Cells
Event space is provided by LatchBio, and Greylock is generously sponsoring food / refreshments.
Important: Our office (Lobby 5) is on the 4th Street side of the building. Come in on the river side through the sliding doors or through the lobby on the Berry St. side.
Agenda
5:30 - 6:30 Meet others. Eat + drink.
6:30 - 8:00 Talks + Q&A
8:00 - TBD Socialize
Abstracts
LData: A Distributed Filesystem Built on Postgres and S3
Aidan Abdulali | LatchBio
> LatchBio builds data infrastructure to store, analyze and visualize large volumes of molecular data. A core component of this platform is a distributed file system called LData. This talk walks through its architecture and illustrates how to build a complex distributed system with little more than a database.
BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences
Noam Teyssier | Arc Institute
> Modern genomics produces billions of sequencing records per run, which are typically stored as gzip-compressed FASTQ files. While this format is widely used, it is not optimal for high-throughput processing due to its reliance on single-threaded decompression and sequential parsing of irregularly sized records. Here, we present BINSEQ, a family of simple binary formats that enable high-throughput parallel processing of sequencing data. We demonstrate that BINSEQ files are up to 32x faster than compressed FASTQ for parallel processing and can reduce analysis time from hours to minutes for large-scale genome and transcriptome analyses, particularly for resource-intensive applications like alignment, mapping, and de novo assembly.
Edge of Tomorrow Algorithms
James Braza | FutureHouse
> Imagine you're given a model, a benchmark, and just one day to saturate the benchmark. Normally training the model takes a week, but if you do not succeed in one day, the day resets. This talk is on a progression of algorithms from FutureHouse's aviary and ether0 papers that solve this exact problem, bringing us to the edge of tomorrow.
Scaling Deep Learning to 1B+ Single Cells
Abhinav Adduri | Arc Institute
> Single cell transcriptomics data repositories have experienced dramatic growth in recent years. Similar to how internet-scale data enabled a new intelligence frontier for language models, the wealth of observational and perturbational data being generated will enable cellular models that reveal new biological insights. However, computational tools have not kept pace with the rapid development of single cell assays, presenting challenges in training and evaluating models on these datasets. In this short talk, I’ll describe how we scaled STATE to 300M cells, what avoidable mistakes we made, and what advancements are needed to efficiently scale to 1B+ cells.
Excited to see you guys here and learn a bit more about computers.