

AI Reading Club — Croissant: A Metadata Format for ML-Ready Datasets
Everyone wants to talk about models.
Far fewer people want to talk about datasets.
And yet, in real ML systems, datasets are often where everything quietly breaks: unclear licenses, missing provenance, undocumented splits, hidden preprocessing choices, fragile loaders, weak reproducibility, and metadata that only exists in a half-forgotten README.
This week at AI Reading Club, hosted by Schoolab, we will read and discuss:
Croissant: A Metadata Format for ML-Ready Datasets
NeurIPS 2024 — Datasets and Benchmarks Track
And we are especially excited to welcome Slava Tykhonov, co-author of the CroissantML standard, Head of AI and Interoperability at CODATA, and Harvard Dataverse Ambassador.
Croissant proposes a shared metadata format for ML-ready datasets. Its goal is practical and ambitious: make datasets easier to discover, understand, load, reuse, govern, and connect across repositories, tools, and frameworks.
The central question of the session:
What happens when datasets become first-class, machine-readable artifacts in the ML ecosystem?
We will explore how Croissant tries to solve one of the most underrated problems in machine learning infrastructure: making datasets portable across ML workflows without forcing everyone to change the underlying data format.
We will discuss:
why dataset metadata is still painful in modern ML
how Croissant builds on Schema.org and JSON-LD
the four layers of the format: dataset metadata, resources, structure, and semantics
how Croissant can describe image, text, audio, tabular, and multimodal datasets
how it connects to Hugging Face, Kaggle, OpenML, TensorFlow Datasets, PyTorch, and JAX workflows
why Responsible AI metadata matters for licensing, privacy, fairness, safety, reproducibility, and governance
what this means for serious ML engineering beyond notebooks and leaderboard culture
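To make the four layers concrete before the session, here is a minimal sketch of what a Croissant-style JSON-LD description might look like, built with nothing but Python's standard library. The dataset name, URLs, file name, and column names are all hypothetical, the context is heavily abridged, and the result has not been validated against the Croissant specification. Treat it as an illustration of the shape, not as a reference document.

```python
import json


def build_croissant_sketch() -> dict:
    """Return a simplified, Croissant-style JSON-LD description.

    Everything about the dataset itself (name, URLs, columns) is made up
    for illustration. The context is abridged compared to the official one.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            # Abridged mappings for the Croissant-specific keys used below.
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "source": "cr:source",
            "fileObject": "cr:fileObject",
            "extract": "cr:extract",
            "column": "cr:column",
            "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
        },
        # Layer 1, dataset metadata: plain Schema.org properties.
        "@type": "sc:Dataset",
        "name": "toy_reviews",
        "description": "Hypothetical CSV of short product reviews.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "url": "https://example.org/toy_reviews",
        # Layer 2, resources: the files that hold the data.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "contentUrl": "https://example.org/toy_reviews/reviews.csv",
                "encodingFormat": "text/csv",
            }
        ],
        # Layer 3, structure: how records and fields map onto the files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        # Layer 4, semantics: a typed meaning for each field.
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/rating",
                        "dataType": "sc:Integer",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "rating"},
                        },
                    },
                ],
            }
        ],
    }


doc = build_croissant_sketch()
print(json.dumps(doc, indent=2))
```

Note that the CSV itself is untouched: the description sits next to the data and points at it, which is the sense in which the underlying data format does not need to change.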
This is not just a paper about metadata.
It is a paper about infrastructure.
The kind of infrastructure that determines whether AI systems can be inspected, reused, audited, and trusted — or whether every dataset remains a mysterious zip file with vibes.
As usual, the session will be discussion-oriented. No need to be an expert in semantic web standards, dataset governance, or ML infrastructure. Curiosity is enough.
Come if you care about the less glamorous parts of AI that quietly decide whether systems work in the real world.
Format
Short paper introduction, guest perspective from a Croissant co-author, guided discussion, practical examples, and open conversation.
Who should join
People building, researching, evaluating, deploying, or seriously trying to understand AI systems below the demo layer.
Recommended preparation
If you are short on time: read the abstract, introduction, and Section 3.
If you enjoy suffering beautifully in the name of reproducibility: read the full paper.