Presented by
AI Reading Club

AI Reading Club — Croissant: A Metadata Format for ML-Ready Datasets

Paris, France
Registration
Approval Required
Your registration is subject to host approval.
About Event

Everyone wants to talk about models.

Far fewer people want to talk about datasets.

And yet, in real ML systems, datasets are often where everything quietly breaks: unclear licenses, missing provenance, undocumented splits, hidden preprocessing choices, fragile loaders, weak reproducibility, and metadata that only exists in a half-forgotten README.

This week at AI Reading Club, hosted by Schoolab, we will read and discuss:

Croissant: A Metadata Format for ML-Ready Datasets
NeurIPS 2024 — Datasets and Benchmarks Track

And we are especially excited to welcome Slava Tykhonov, co-author of the CroissantML standard, Head of AI and Interoperability at CODATA, and Harvard Dataverse Ambassador.

Croissant proposes a shared metadata format for ML-ready datasets. Its goal is practical and ambitious: make datasets easier to discover, understand, load, reuse, govern, and connect across repositories, tools, and frameworks.

The central question of the session:

What happens when datasets become first-class, machine-readable artifacts in the ML ecosystem?

We will explore how Croissant tries to solve one of the most underrated problems in machine learning infrastructure: making datasets portable across ML workflows without forcing everyone to change the underlying data format.

We will discuss:

  • why dataset metadata is still painful in modern ML

  • how Croissant builds on Schema.org and JSON-LD

  • the four layers of the format: dataset metadata, resources, structure, and semantics

  • how Croissant can describe images, text, audio, tabular, and multimodal datasets

  • how it connects to Hugging Face, Kaggle, OpenML, TensorFlow Datasets, PyTorch, and JAX workflows

  • why Responsible AI metadata matters for licensing, privacy, fairness, safety, reproducibility, and governance

  • what this means for serious ML engineering beyond notebooks and leaderboard culture
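For a rough feel of the four layers before the session, here is a minimal sketch of what a Croissant-style JSON-LD description might look like, built as a plain Python dictionary. Everything about the dataset is hypothetical (the `toy_reviews` name, the `example.org` URL, the column names), the `@context` is a heavily trimmed stand-in for the full one in the specification, and the result has not been validated against the official tooling. The `field` and `summarize` helpers are illustrative only and are not part of any Croissant library.

```python
import json


def field(field_id, data_type, column):
    """Build one field: a semantic data type plus the CSV column it is read from."""
    return {
        "@type": "cr:Field",
        "@id": field_id,
        "dataType": data_type,  # semantics: what the values mean
        "source": {"fileObject": {"@id": "reviews.csv"}, "extract": {"column": column}},
    }


# Hypothetical toy dataset; name, URL, and columns are placeholders.
croissant = {
    # Trimmed context (assumption): the real specification defines a longer one.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "dct": "http://purl.org/dc/terms/",
        "conformsTo": "dct:conformsTo",
        "recordSet": "cr:recordSet",
        "field": "cr:field",
        "dataType": "cr:dataType",
        "source": "cr:source",
        "fileObject": "cr:fileObject",
        "extract": "cr:extract",
        "column": "cr:column",
    },
    # Layer 1: dataset metadata (Schema.org vocabulary)
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "A made-up table of product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    # Layer 2: resources (the files that make up the dataset)
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Layers 3 and 4: structure (record sets and fields) and semantics (data types)
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                field("reviews/text", "sc:Text", "text"),
                field("reviews/rating", "sc:Integer", "rating"),
            ],
        }
    ],
}


def summarize(doc):
    """Collect the identifiers found in each layer of the description."""
    return {
        "dataset": doc["name"],
        "resources": [r["@id"] for r in doc["distribution"]],
        "record_sets": [rs["@id"] for rs in doc["recordSet"]],
        "fields": {f["@id"]: f["dataType"] for rs in doc["recordSet"] for f in rs["field"]},
    }


print(json.dumps(summarize(croissant), indent=2))
```

Note that the CSV file itself is untouched: the description only points at it and explains how to read it, which is the sense in which the format avoids changing the underlying data.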

This is not just a paper about metadata.

It is a paper about infrastructure.

The kind of infrastructure that determines whether AI systems can be inspected, reused, audited, and trusted — or whether every dataset remains a mysterious zip file with vibes.

As usual, the session will be discussion-oriented. No need to be an expert in semantic web standards, dataset governance, or ML infrastructure. Curiosity is enough.

Come if you care about the less glamorous parts of AI that quietly decide whether systems work in the real world.

Format

Short paper introduction, guest perspective from a Croissant co-author, guided discussion, practical examples, and open conversation.

Who should join

People building, researching, evaluating, deploying, or seriously trying to understand AI systems below the demo layer.

Recommended preparation

If you are short on time: read the abstract, introduction, and Section 3.

If you enjoy suffering beautifully in the name of reproducibility: read the full paper.

Location
Please register to see the exact location of this event.
Paris, France