I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.
A 103B-token pre-LLM Usenet corpus offers a large human-written dataset for training and fine-tuning experiments.
Excerpt
Posted this to r/MachineLearning a couple weeks ago (30K views, 100+ upvotes) and have been meaning to share it here where the fine-tuning angle is more directly relevant.
I spent years building and processing a complete Usenet corpus from 1980 to 2013. Here’s why it might matter for local model work specifically:
Zero AI contamination. Every post predates modern LLMs, the earliest by decades. Training on this won’t bake in GPT mannerisms, refusal patterns, or RLHF artifacts. It’s raw human writing: argumentative, unfiltered, and stylistically diverse across 33 years.
Pre-SEO, pre-algorithm internet. People wrote longer, more substantive posts without optimizing for engagement. The writing character is noticeably different from anything scraped from the modern web.
Good hierarchies for domain fine-tuning:
• comp.*: 10.3B tokens of computing discussion from people literally building the internet
• sci.*: 3.3B tokens of scientific back-and-forth
• rec.*: 16.5B tokens of hobbies, sports, arts, games
• humanities.*: philosophy, literature, classic texts
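If you want to carve out one of these hierarchies for a domain fine-tune, the selection is just pattern matching on the newsgroup name. A minimal sketch, assuming you already have the group name as a string; the helper names and the example group list are hypothetical and not part of the corpus tooling:

```python
from collections import Counter
from fnmatch import fnmatchcase


def in_hierarchy(newsgroup: str, pattern: str) -> bool:
    """Return True if a newsgroup name matches a pattern such as 'comp.*'."""
    return fnmatchcase(newsgroup.strip().lower(), pattern.lower())


def top_level(newsgroup: str) -> str:
    """Return the top-level hierarchy, e.g. 'comp' for 'comp.lang.c'."""
    return newsgroup.strip().lower().split(".", 1)[0]


# Hypothetical group names, used only to exercise the helpers.
groups = ["comp.lang.c", "sci.physics", "rec.games.chess", "comp.os.linux.misc"]

# Keep only the computing hierarchy.
comp_only = [g for g in groups if in_hierarchy(g, "comp.*")]

# Count how many groups fall under each top-level hierarchy.
per_hierarchy = Counter(top_level(g) for g in groups)
```

Matching on `comp.*` rather than a bare `comp` prefix keeps unrelated names that merely start with the same letters out of the selection.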
The numbers:
• 103.1B tokens (cl100k_base)
• 408M posts across 18,347 newsgroups
• 1980–2013, 96.6% English
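For a rough sense of scale, the figures above can be combined with the per-hierarchy token counts. This is back-of-the-envelope arithmetic using only numbers quoted in this post; the variable names are mine, and humanities.* is left out because no token count is given for it:

```python
# Figures quoted in the post.
total_tokens = 103.1e9
total_posts = 408e6
hierarchy_tokens = {"comp.*": 10.3e9, "sci.*": 3.3e9, "rec.*": 16.5e9}

# Mean post length in cl100k_base tokens.
avg_tokens_per_post = total_tokens / total_posts

# Fraction of the corpus taken up by each listed hierarchy.
shares = {name: tokens / total_tokens for name, tokens in hierarchy_tokens.items()}

# Fraction covered by the three listed hierarchies together.
listed_share = sum(hierarchy_tokens.values()) / total_tokens

print(f"average tokens per post: {avg_tokens_per_post:.0f}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of all tokens")
print(f"three listed hierarchies combined: {listed_share:.1%}")
```

That works out to roughly 253 tokens per post on average, with comp.*, sci.* and rec.* together covering about 29% of the corpus. The rest sits in hierarchies not broken out here.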
Processing: deduplicated, alt.binaries.* excluded, binaries removed, email addresses redacted, MBOX → gzip JSONL.
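Since the output is gzip JSONL, it can be streamed one post at a time without decompressing to disk. A minimal sketch using only the standard library; the `newsgroup` and `body` keys and the file name are assumptions for illustration, so check the actual schema in the samples before relying on them:

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def iter_posts(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a gzip JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in file so the sketch runs without the real corpus.
# The "newsgroup" and "body" keys are assumptions, not the published schema.
sample = [
    {"newsgroup": "comp.lang.c", "body": "Pointers are not arrays."},
    {"newsgroup": "rec.games.chess", "body": "1. e4 is best by test."},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fh:
    for record in sample:
        fh.write(json.dumps(record) + "\n")

# Stream the file and keep only posts from the comp.* hierarchy.
comp_posts = [p for p in iter_posts(tmp_path) if p["newsgroup"].startswith("comp.")]
```

Because `iter_posts` is a generator, memory use stays flat regardless of file size, which matters when a single hierarchy runs to billions of tokens.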
Someone in the community already fine-tuned Gemma 4 on the sample data (wyan/usenet-gemma-4-E2B-lora on HF) — works as a proof of concept even if it’s early days.
Samples (5K posts per hierarchy +
Read at source: https://www.reddit.com/r/LocalLLaMA/comments/1tphhqk/i_built_a_103btoken_usenet_corpus_19802013_preweb/