LAION, The Pile, and more datasets
What's actually used to train these LLMs? A brief look at some of the datasets involved.
LAION-5B
Stable Diffusion was trained on subsets of a dataset called LAION-5B (from "Large-scale Artificial Intelligence Open Network"), which comprises 5.85 billion image-text pairs crawled from the web. The underlying crawl data comes from Common Crawl.
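For a sense of what those pairs look like, here is a minimal sketch of streaming the LAION metadata through the Hugging Face datasets library. The repository name laion/laion2B-en (the English subset) and the uppercase URL/TEXT column names are assumptions about how the metadata is published on the Hub, and availability has changed over time.

```python
# Minimal sketch: stream LAION image-text metadata without downloading it all.
# The dataset name and column names below are assumptions about the Hub layout.
from datasets import load_dataset

laion = load_dataset("laion/laion2B-en", split="train", streaming=True)

for i, row in enumerate(laion):
    # Each record is an image URL plus its alt-text caption; the images
    # themselves are fetched separately (e.g. with img2dataset).
    print(row["URL"], "->", row["TEXT"])
    if i >= 4:
        break
```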
Common Crawl
About 3.15 billion pages, totaling roughly 380 TB. OpenAI's GPT-3 was trained, in part, on Common Crawl data. Common Crawl is a non-profit founded by Gil Elbaz in 2007 (Elbaz previously founded Applied Semantics, which Google acquired in 2003 for $102 million and which later became AdSense).
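Common Crawl publishes each crawl as WARC/WET/WAT files served over HTTP. Below is a hedged sketch of reading the extracted plain text from one WET file with the warcio library; the crawl label and segment path are placeholders, not a real file (real paths come from each crawl's wet.paths.gz listing).

```python
# Sketch: iterate plain-text records from one Common Crawl WET file.
# The URL below is a placeholder path, not a real segment.
import requests
from warcio.archiveiterator import ArchiveIterator

WET_URL = (
    "https://data.commoncrawl.org/crawl-data/"
    "CC-MAIN-2023-06/segments/EXAMPLE/wet/EXAMPLE.warc.wet.gz"  # placeholder
)

resp = requests.get(WET_URL, stream=True)
resp.raise_for_status()

for record in ArchiveIterator(resp.raw):
    if record.rec_type == "conversion":  # extracted page text, not crawl metadata
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, text[:80])
        break
```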
The Pile
A set of 22 smaller datasets that was used to train GPT-J (a sketch of reading one Pile shard follows the list):
A filtered subset of Common Crawl
PubMed Central
"Books3" a collection of ebooks downloaded from Bibliotik
OpenWebText2 – text extracted from URLs submitted to Reddit with a score of 3 or higher
ArXiv
GitHub
FreeLaw
Stack Exchange
USPTO Backgrounds
PubMed Abstracts
Gutenberg
OpenSubtitles
Wikipedia
DM Mathematics
Ubuntu IRC
BookCorpus2 – an expanded version of the original BookCorpus, roughly 18k books by otherwise-unpublished authors
EuroParl
Hacker News
YouTube Subtitles
PhilPapers
NIH ExPorter
Enron Emails
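The Pile is distributed as zstandard-compressed JSON Lines shards, where each document carries a "text" field and a "meta.pile_set_name" label naming which of the 22 components it came from. The sketch below reads one shard and tallies those labels; the local file name is a placeholder and the long-window setting is a defensive assumption.

```python
# Sketch: read one Pile shard and count documents per component dataset.
# Assumes the original jsonl.zst distribution; "00.jsonl.zst" is a placeholder name.
import io
import json
from collections import Counter

import zstandard  # pip install zstandard

counts = Counter()
with open("00.jsonl.zst", "rb") as fh:
    # Large window size in case the shard was compressed with zstd long mode.
    reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        doc = json.loads(line)
        # e.g. "Pile-CC", "PubMed Central", "Books3", "Github", ...
        counts[doc["meta"]["pile_set_name"]] += 1

print(counts.most_common())
```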
GPT-3 dataset
The book corpora used aren't specified in the GPT-3 paper, most likely because they come from legally gray sources such as Bibliotik. The reported components are (a weighted-sampling sketch follows the list):
Common Crawl
OpenWebText2
Books1 (most likely Project Gutenberg)
Books2 (possibly BookCorpus)
Wikipedia
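The GPT-3 paper does report how heavily each source is sampled during training: roughly 60% filtered Common Crawl, 22% WebText2, 8% Books1, 8% Books2, and 3% Wikipedia. Below is a sketch of that kind of weighted mixture sampling; the corpus names are stand-ins and the weights are the paper's rounded figures.

```python
# Sketch: draw training documents from several corpora with fixed mixture weights,
# using the rounded weights reported in the GPT-3 paper.
import random

MIXTURE = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,  # weights sum to ~1; the paper rounds each figure
}

def sample_corpus(rng: random.Random) -> str:
    """Pick which corpus the next training document comes from."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_corpus(rng) for _ in range(10)])  # mostly Common Crawl
```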