Presented by
AI Reading Club

Datasheets for Datasets — The Hidden Contract Behind Every AI System

Paris, France
Registration
Approval Required
Your registration is subject to host approval.
About Event

Before we ask whether an AI model is powerful, fair, safe, or intelligent, we should ask a more uncomfortable question:

What did its dataset teach it to believe?

Most AI failures do not begin in the model architecture.

They begin earlier.

In the data.

The dataset decides what the model sees, what it ignores, what it repeats, what it amplifies, and what it mistakes for truth. A model can pass benchmarks, produce beautiful demos, and still fail badly in the real world if its training or evaluation data does not match the world where the system is deployed.

And yet, in many AI projects, datasets are still treated like mysterious ZIP files from the internet.

Downloaded.
Cleaned “a bit.”
Used for training.
Forgotten forever.

For this AI Reading Club session, we will discuss one of the most important papers in responsible and production-grade AI:

Datasheets for Datasets
by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford.

Paper:
https://arxiv.org/pdf/1803.09010

The idea is simple, almost obvious — which is usually where the dangerous ideas live.

In electronics, components come with datasheets. A datasheet tells you what the component is, how it behaves, how it was tested, what its limits are, and how it should or should not be used.

So why do we build machine learning systems on datasets that often come with less documentation than a €3 sensor?

This paper proposes that datasets should also come with datasheets: structured documentation explaining why the dataset was created, who created it, what it contains, how the data was collected, what preprocessing was done, what risks exist, what uses are appropriate, what uses should be avoided, how the dataset is distributed, and how it will be maintained.
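
For builders who think in code, here is a minimal sketch of what a datasheet could look like as a data structure. The paper itself proposes a set of questions answered in prose, not a schema, so the `Datasheet` class, its field names, and the `missing_sections` helper are all hypothetical. The fields simply mirror the paper's seven sections (motivation, composition, collection process, preprocessing, uses, distribution, maintenance), and the example entries are made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical container: one free-text field per section of the paper."""

    motivation: str = ""          # why the dataset was created, and by whom
    composition: str = ""         # what the instances are and what they contain
    collection_process: str = ""  # how the data was gathered, consent included
    preprocessing: str = ""       # cleaning, filtering, and labeling applied
    uses: str = ""                # appropriate uses and uses to avoid
    distribution: str = ""        # how the dataset is shared, and on what terms
    maintenance: str = ""         # who maintains it and how it gets updated


def missing_sections(sheet: Datasheet) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Illustrative entries only; they do not describe any real dataset.
sheet = Datasheet(
    motivation="Built to evaluate a customer-support chatbot.",
    composition="10,000 anonymized support tickets.",
)
print(missing_sections(sheet))
```

A team could run a check like this before a dataset enters a training or evaluation pipeline, so that "forgotten forever" becomes a failing check instead of a habit. It only tells you a section is non-empty, not that the answer is any good.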

This is not only an ethics paper.

It is an engineering paper.

It is a product-risk paper.

It is a deployment paper.

It is a “your beautiful AI system is secretly standing on undocumented assumptions” paper.

In 2026, this matters more than ever.

Foundation models, RAG systems, agents, fine-tuning pipelines, benchmarks, synthetic data, evaluation sets, compliance workflows — all of them depend on data. But without documentation, we often cannot answer basic questions:

Where did this data come from?

Who is represented?

Who is missing?

Was consent involved?

What was filtered out?

What labels were added?

What bias may be hidden inside?

Can this dataset be used for this task?

Should it be used at all?

And the most brutal question:

Would you still trust your model if you had to explain its dataset in public?

During this session, we will unpack the paper through a practical builder’s lens.

We will discuss:

  • why dataset documentation is still one of the weakest parts of modern AI engineering

  • how undocumented datasets create technical, ethical, legal, and product risks

  • why “good benchmark performance” can hide bad dataset assumptions

  • how datasheets can help teams build more reliable AI systems

  • what this paper means for RAG, agents, fine-tuning, evaluation, and regulated AI

  • how founders, engineers, researchers, and product teams can use these ideas in real projects

This session is for AI engineers, researchers, founders, product people, data scientists, policy-curious builders, and anyone who suspects that “we trained it on some data” is not exactly a production-grade explanation.

No prior expertise is required.

You do not need to be an ML researcher.
You do not need to love compliance.
You do not need to pretend dataset documentation is your favorite weekend activity.

You only need curiosity — and maybe a healthy distrust of magical AI demos.

AI Reading Club is not academic theater. We read important papers because they change how we build.

You can see the broader reading curriculum here:

https://github.com/hghalebi/ai-reading-club/tree/main/curriculum

You can also join the community on Discord:

https://discord.gg/5rAMsuVXXp

This paper changes how we look at data.

And once you see datasets as engineering artifacts with assumptions, limits, risks, and intended uses, you cannot unsee it.

Come if you care about AI that survives contact with reality.

Come if you build systems where mistakes matter.

Come if you want to understand why the future of AI will not only be shaped by bigger models — but by better documentation, better accountability, and better engineering discipline.

And come if you have ever trusted a dataset because the filename looked official.

Yes, unfortunately, that is all of us.

Location
Please register to see the exact location of this event.
Paris, France