A biweekly AI Reading Club to read and discuss the papers that built modern LLMs. The goal is to understand the actual engineering breakthroughs (tokenization, self-attention, decoding, efficiency,

AI Reading Club — Croissant: A Metadata Format for ML-Ready Datasets

Name: AI Reading Club — Croissant: A Metadata Format for ML-Ready Datasets
Start: 2026-07-02T18:00:00.000+02:00
End: 2026-07-02T19:00:00.000+02:00
Location: Paris, France

About Event

Everyone wants to talk about models.

Far fewer people want to talk about datasets.

And yet, in real ML systems, datasets are often where everything quietly breaks: unclear licenses, missing provenance, undocumented splits, hidden preprocessing choices, fragile loaders, weak reproducibility, and metadata that only exists in a half-forgotten README.

This week at AI Reading Club, hosted by Schoolab, we will read and discuss:

Croissant: A Metadata Format for ML-Ready Datasets
NeurIPS 2024 — Datasets and Benchmarks Track

And we are especially excited to welcome Slava Tykhonov, co-author of the CroissantML standard, Head of AI and Interoperability at CODATA, and Harvard Dataverse Ambassador.

Croissant proposes a shared metadata format for ML-ready datasets. Its goal is practical and ambitious: make datasets easier to discover, understand, load, reuse, govern, and connect across repositories, tools, and frameworks.

The central question of the session:

What happens when datasets become first-class, machine-readable artifacts in the ML ecosystem?

We will explore how Croissant tries to solve one of the most underrated problems in machine learning infrastructure: making datasets portable across ML workflows without forcing everyone to change the underlying data format.

We will discuss:

why dataset metadata is still painful in modern ML
how Croissant builds on Schema.org and JSON-LD
the four layers of the format: dataset metadata, resources, structure, and semantics
how Croissant can describe images, text, audio, tabular, and multimodal datasets
how it connects to Hugging Face, Kaggle, OpenML, TensorFlow Datasets, PyTorch, and JAX workflows
why Responsible AI metadata matters for licensing, privacy, fairness, safety, reproducibility, and governance
what this means for serious ML engineering beyond notebooks and leaderboard culture
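
To make the JSON-LD and four-layer points above a little more concrete, here is a minimal sketch in Python of what a Croissant-style description might look like for a toy dataset. It is an illustration under assumptions, not an excerpt from the paper: the dataset name, file name, and example.org URLs are made up, the `@context` is heavily abbreviated compared with the one in the specification, and the helper `split_into_layers` is hypothetical. Treating `dataType` annotations as "the" semantic layer is also a simplification.

```python
import json

# Hypothetical toy description in the style of Croissant JSON-LD.
# All names and URLs below are invented for illustration.
CROISSANT_LIKE = {
    # Abbreviated context: the real specification defines a longer one.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    # Layer 1: dataset-level metadata, mostly Schema.org properties.
    "name": "toy-reviews",
    "description": "A made-up review dataset used only for illustration.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy-reviews",
    # Layer 2: resources, i.e. the files that hold the data.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/toy-reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Layer 3: structure (record sets and fields mapped onto the files).
    # Layer 4: semantics (here reduced to the dataType of each field).
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/rating",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "rating"},
                    },
                },
            ],
        }
    ],
}


def split_into_layers(doc):
    """Hypothetical helper: group a Croissant-style dict by the four layers."""
    record_sets = doc.get("recordSet", [])
    return {
        # Everything that is neither a JSON-LD keyword nor a files/records block.
        "metadata": {
            key: value
            for key, value in doc.items()
            if not key.startswith("@") and key not in ("distribution", "recordSet")
        },
        "resources": [res["@id"] for res in doc.get("distribution", [])],
        "structure": {
            rs["@id"]: [field["@id"] for field in rs.get("field", [])]
            for rs in record_sets
        },
        "semantics": {
            field["@id"]: field["dataType"]
            for rs in record_sets
            for field in rs.get("field", [])
            if "dataType" in field
        },
    }


layers = split_into_layers(CROISSANT_LIKE)
print(json.dumps(layers, indent=2))
```

The point of the sketch is that the CSV file itself is never rewritten: the description sits next to the data and tells a loader where the files are, how columns map to fields, and what each field means.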

This is not just a paper about metadata.

It is a paper about infrastructure.

The kind of infrastructure that determines whether AI systems can be inspected, reused, audited, and trusted — or whether every dataset remains a mysterious zip file with vibes.

As usual, the session will be discussion-oriented. No need to be an expert in semantic web standards, dataset governance, or ML infrastructure. Curiosity is enough.

Come if you care about the less glamorous parts of AI that quietly decide whether systems work in the real world.

Format

Short paper introduction, guest perspective from a Croissant co-author, guided discussion, practical examples, and open conversation.

Who should join

People building, researching, evaluating, deploying, or seriously trying to understand AI systems below the demo layer.

Recommended preparation

If you are short on time: read the abstract, introduction, and Section 3.

If you enjoy suffering beautifully in the name of reproducibility: read the full paper.
