AI Dataset, 2026

i1-captions (zlab-princeton)

Provides the full caption corpus used to train and ablate the i1 text-to-image model: 12 curated subsets with multiple caption variants (long/short, VLM-generated, rendered text) to enable reproducible training and captioning experiments.


Introduction

Most public text-to-image (T2I) recipes do not release a transparent, reproducible caption corpus. This dataset supplies the exact caption sets used in the controlled experiments and the final training run of the i1 3B text-to-image diffusion model, making the data side of the recipe auditable and reusable.

What Sets It Apart

Consolidates captions for 12 curated subsets (ImageNet-22K, YFCC, RedCaps, Megalith10m, Pexels, Places365, iNaturalist, Midjourney-v6, GPT-Edit, FLUX-Reason, RenderedText, TextAtlas). Total rows: 166,734,751; total file size: ~153 GB.
Multiple caption variants per image: caption1..caption5 (long Qwen3‑VL‑30B‑A3B style), short variants, no_center_crop variants, plus VLM-generated captions from Qwen2/2.5/3 families.
Designed to support the i1 controlled experiments: random sampling among caption variants during training and ablations on prompt length, synthetic captioners, and image preprocessing.
The dataset contains captions only. Images must be obtained separately; the i1 data_processing pipelines handle image download and image–caption pairing.
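The random sampling among caption variants mentioned above can be sketched as follows. This is a minimal illustration, not the i1 training code: the column names caption1 through caption5 come from the description above, while uniform sampling, the skipping of empty variants, and the helper name sample_caption are assumptions made for the example.

```python
import random

# Column names taken from the listing (caption1..caption5).
# Other variant columns (short, no_center_crop, ...) could be appended,
# but their exact names are not stated here, so they are left out.
CAPTION_FIELDS = ["caption1", "caption2", "caption3", "caption4", "caption5"]


def sample_caption(row, fields=CAPTION_FIELDS, rng=random):
    """Return one non-empty caption variant from a row, chosen uniformly.

    Hypothetical helper: uniform choice is an assumption, the source only
    says that variants are sampled at random during training.
    """
    candidates = [row[name] for name in fields if row.get(name)]
    if not candidates:
        raise ValueError("row has no usable caption variant")
    return rng.choice(candidates)


# Toy row with made-up captions; caption4 is empty and caption5 is missing.
example_row = {
    "caption1": "A red fox standing in tall grass at dusk.",
    "caption2": "A fox in a meadow, lit by low evening sun.",
    "caption3": "Wild fox looking toward the camera in a field.",
    "caption4": "",
}

# Passing a seeded generator makes the draw repeatable across runs.
chosen = sample_caption(example_row, rng=random.Random(0))
```

Calling the helper once per training step, rather than once per dataset pass, means the same image is paired with different captions over time, which is the point of keeping several variants per image.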

Who It's For and Trade-offs

Great fit if you need a reproducible caption corpus for training or studying text-to-image models, comparing synthetic captioners, or replicating the i1 experiments. Expect heavy storage and I/O requirements (about 153 GB for the captions alone, and more once images are downloaded) and extra work to fetch and align the image files. License metadata is not set in the dataset card, so verify licensing before large-scale use.
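For planning storage, the totals quoted above (166,734,751 rows, about 153 GB) give a rough per-row figure. The sketch below assumes decimal gigabytes and treats every row as the same size, which will not hold exactly across subsets or file formats; the function name estimated_size_gb is made up for the example.

```python
TOTAL_ROWS = 166_734_751      # total rows, from the listing
TOTAL_BYTES = 153e9           # ~153 GB, assuming 1 GB = 10**9 bytes

# Average on-disk footprint of one row, all caption variants included.
AVG_BYTES_PER_ROW = TOTAL_BYTES / TOTAL_ROWS


def estimated_size_gb(n_rows):
    """Rough on-disk size in GB for a subset of n_rows rows.

    Hypothetical helper: assumes rows are uniformly sized, so treat the
    result as an order-of-magnitude estimate only.
    """
    return n_rows * AVG_BYTES_PER_ROW / 1e9


print(f"average bytes per row: {AVG_BYTES_PER_ROW:.0f}")
print(f"estimated size of 1M rows: {estimated_size_gb(1_000_000):.2f} GB")
```

Under these assumptions a row averages a little over 900 bytes, so a one-million-row sample is a bit under 1 GB of caption data.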


Information

Website: huggingface.co
Organizations: Princeton University
Authors: Boya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu
Published date: 2026/05/13

