A new dataset with more than 100M high-quality, curated images, with captions and metadata! [P]

By dh7net

· r/MachineLearning · May 28, 2026

Jasper released MONET, an Apache-licensed 104.9 million image-text dataset with tooling and a paper for text-to-image training.

Categories: OSS & Tools, Research

Excerpt

Hello everyone.

The new dataset is named MONET. It is Apache 2.0 licensed and available on HF: [https://huggingface.co/datasets/jasperai/monet](https://huggingface.co/datasets/jasperai/monet)

**MONET is an open, Apache 2.0-licensed image–text dataset. It was built from 2.9 billion images and refined to 104.9 million high-quality samples.**

We are also publishing [a paper](https://arxiv.org/abs/2605.21272) that explains how the dataset was created, if you are curious, and 3 companion projects:

* [A UMAP to visualize the distribution](https://huggingface.co/spaces/jasperai/monet-umap)
* [A retrieval tool to do text or image search](https://huggingface.co/spaces/jasperai/monet-retrieval)
* [A codebase to train a T2I model based on MONET](https://github.com/gojasper/nano-t2i/tree/main)

Hope this will be useful!
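For anyone who wants to poke at the data without downloading all of it, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The repo id comes from the link above. The split name `"train"` is an assumption on my part, and the column names are not listed in the post, so the sketch only reports whatever keys each sample has. The helper names `retention_rate` and `peek_sample_keys` are made up for illustration. It also works out what share of the 2.9 billion source images survived curation.

```python
from itertools import islice

REPO_ID = "jasperai/monet"  # repo id taken from the Hugging Face link in the post


def retention_rate(kept: float, pool: float) -> float:
    """Fraction of the source pool that survived curation."""
    return kept / pool


def peek_sample_keys(n: int = 3, split: str = "train") -> list:
    """Stream the first n samples and return the sorted field names of each.

    Assumption: the dataset exposes a split called "train". Check the dataset
    card if this raises an error. Needs network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily so the helper above runs without it

    # streaming=True iterates over the remote files instead of downloading them first
    ds = load_dataset(REPO_ID, split=split, streaming=True)
    return [sorted(sample.keys()) for sample in islice(ds, n)]
```

Calling `retention_rate(104.9e6, 2.9e9)` gives roughly 0.036, so about 3.6% of the original pool made it into the final set. `peek_sample_keys()` is the quickest way to see which caption and metadata fields are actually present before writing a training loader.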

Read at source: https://www.reddit.com/r/MachineLearning/comments/1tq2vxa/a_new_dataset_with_more_that_100m_hiquality/

Discussions

reddit · 52 points · 12 comments
reddit · 55 points · 13 comments
reddit · 59 points · 14 comments