

Datasheets for Datasets — The Hidden Contract Behind Every AI System
Before we ask whether an AI model is powerful, fair, safe, or intelligent, we should ask a more uncomfortable question:
What did its dataset teach it to believe?
Many AI failures do not begin in the model architecture.
They begin earlier.
In the data.
The dataset decides what the model sees, what it ignores, what it repeats, what it amplifies, and what it mistakes for truth. A model can pass benchmarks, produce beautiful demos, and still fail badly in the real world if its training or evaluation data does not match the world where the system is deployed.
And yet, in many AI projects, datasets are still treated like mysterious ZIP files from the internet.
Downloaded.
Cleaned “a bit.”
Used for training.
Forgotten forever.
For this AI Reading Club session, we will discuss one of the most important papers in responsible and production-grade AI:
Datasheets for Datasets
by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford.
Paper:
https://arxiv.org/pdf/1803.09010
The idea is simple, almost obvious — which is usually where the dangerous ideas live.
In electronics, components come with datasheets. A datasheet tells you what the component is, how it behaves, how it was tested, what its limits are, and how it should or should not be used.
So why do we build machine learning systems on datasets that often come with less documentation than a €3 sensor?
This paper proposes that datasets should also come with datasheets: structured documentation explaining why the dataset was created, who created it, what it contains, how the data was collected, what preprocessing, cleaning, and labeling was done, what risks exist, what uses are appropriate, what uses should be avoided, how the dataset is distributed, and how it will be maintained.
This is not only an ethics paper.
It is an engineering paper.
It is a product-risk paper.
It is a deployment paper.
It is a “your beautiful AI system is secretly standing on undocumented assumptions” paper.
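To make the engineering angle concrete, here is a minimal sketch of a datasheet as a structured record that a pipeline could check. The field names loosely follow the paper's section headings (motivation, composition, collection process, preprocessing, uses, distribution, maintenance). The class name, the helper function, and the example values are hypothetical, invented for illustration; the paper itself proposes free-text questions, not a schema.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical minimal datasheet; one free-text answer per paper section."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""


def unanswered_sections(sheet: Datasheet) -> list[str]:
    """Return the names of sections that are still empty or whitespace-only."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Illustrative values only, not taken from any real dataset.
sheet = Datasheet(
    motivation="Created to evaluate a support-ticket classifier.",
    composition="English-language tickets with one category label each.",
)

print(unanswered_sections(sheet))
# ['collection_process', 'preprocessing', 'uses', 'distribution', 'maintenance']
```

A team could run a check like this before training or evaluation and refuse to proceed while any section is blank. That is a design choice for this sketch, not something the paper prescribes.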
In 2026, this matters more than ever.
Foundation models, RAG systems, agents, fine-tuning pipelines, benchmarks, synthetic data, evaluation sets, compliance workflows — all of them depend on data. But without documentation, we often cannot answer basic questions:
Where did this data come from?
Who is represented?
Who is missing?
Was consent involved?
What was filtered out?
What labels were added?
What bias may be hidden inside?
Can this dataset be used for this task?
Should it be used at all?
And the most brutal question:
Would you still trust your model if you had to explain its dataset in public?
During this session, we will unpack the paper through a practical builder’s lens.
We will discuss:
- why dataset documentation is still one of the weakest parts of modern AI engineering
- how undocumented datasets create technical, ethical, legal, and product risks
- why “good benchmark performance” can hide bad dataset assumptions
- how datasheets can help teams build more reliable AI systems
- what this paper means for RAG, agents, fine-tuning, evaluation, and regulated AI
- how founders, engineers, researchers, and product teams can use these ideas in real projects
This session is for AI engineers, researchers, founders, product people, data scientists, policy-curious builders, and anyone who suspects that “we trained it on some data” is not exactly a production-grade explanation.
No prior expertise is required.
You do not need to be an ML researcher.
You do not need to love compliance.
You do not need to pretend dataset documentation is your favorite weekend activity.
You only need curiosity — and maybe a healthy distrust of magical AI demos.
AI Reading Club is not academic theater. We read important papers because they change how we build.
You can see the broader reading curriculum here:
https://github.com/hghalebi/ai-reading-club/tree/main/curriculum
You can also join the community on Discord:
https://discord.gg/5rAMsuVXXp
This paper changes how we look at data.
And once you see datasets as engineering artifacts with assumptions, limits, risks, and intended uses, you cannot unsee it.
Come if you care about AI that survives contact with reality.
Come if you build systems where mistakes matter.
Come if you want to understand why the future of AI will not only be shaped by bigger models — but by better documentation, better accountability, and better engineering discipline.
And come if you have ever trusted a dataset because the filename looked official.
Yes, unfortunately, that is all of us.