
i1-captions (zlab-princeton)

Provides the full caption corpus used to train and ablate the i1 text-to-image model: 12 curated subsets with multiple caption variants (long/short, VLM-generated, rendered text) to enable reproducible training and captioning experiments.

Introduction

Most public T2I recipes omit a transparent, reproducible caption corpus. This dataset supplies the exact caption sets used across controlled experiments and the final training run for the i1 3B text-to-image diffusion model, making the data side of the recipe auditable and reusable.

What Sets It Apart
  • Consolidates captions for 12 curated subsets (ImageNet-22K, YFCC, RedCaps, Megalith10m, Pexels, Places365, iNaturalist, Midjourney-v6, GPT-Edit, FLUX-Reason, RenderedText, TextAtlas). Total rows: 166,734,751; total file size: ~153 GB.
  • Multiple caption variants per image: caption1..caption5 (long Qwen3‑VL‑30B‑A3B style), short variants, no_center_crop variants, plus VLM-generated captions from Qwen2/2.5/3 families.
  • Designed to support the i1 controlled experiments: random sampling among caption variants during training and ablations on prompt length, synthetic captioners, and image preprocessing.
  • Dataset contains captions only; corresponding image downloads and image–caption pairing are provided via the i1 data_processing pipelines (images must be obtained separately).
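
The random sampling among caption variants mentioned above can be sketched as follows. This is a minimal illustration, not the i1 training code: the column names caption1..caption5 come from the list above, while the per-row dict layout, the `sample_caption` helper, and the example captions are assumptions made for this sketch.

```python
import random

# Column names caption1..caption5 follow the variant naming listed above.
# The row layout (one plain dict per image) is an assumption for illustration.
LONG_CAPTION_KEYS = tuple(f"caption{i}" for i in range(1, 6))


def sample_caption(row, keys=LONG_CAPTION_KEYS, rng=random):
    """Pick one non-empty caption variant uniformly at random (hypothetical helper)."""
    candidates = [row[k] for k in keys if row.get(k)]
    if not candidates:
        raise ValueError("row has no non-empty caption variant")
    return rng.choice(candidates)


# Made-up example row; missing or empty variants are skipped.
row = {
    "caption1": "A red fox standing in fresh snow at dawn.",
    "caption2": "",
    "caption3": "A fox in snow.",
}
rng = random.Random(0)  # seeded generator so a run can be repeated
picked = sample_caption(row, rng=rng)
print(picked)
```

Passing a seeded `random.Random` instance keeps the choice repeatable, which matters when an ablation compares runs that should differ only in the caption source.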

Who It's For and Trade-offs

A good fit if you need a reproducible caption corpus for training or studying text-to-image models, comparing synthetic captioners, or replicating the i1 experiments. Expect heavy storage and I/O requirements (about 153 GB for the captions alone) and extra work to fetch and align the image files. License metadata is not set in the dataset card, so verify licensing before large-scale use.
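
For planning storage, a rough back-of-the-envelope estimate can be derived from the totals quoted in this listing (166,734,751 rows, ~153 GB). The sketch below assumes decimal gigabytes and a uniform average row size, which will not hold exactly across subsets or caption variants; `estimate_sample_gb` is a hypothetical helper.

```python
# Totals as quoted in this listing; the file size is approximate.
TOTAL_ROWS = 166_734_751
TOTAL_BYTES = 153e9  # assumption: "GB" means 10**9 bytes


def estimate_sample_gb(n_rows, total_rows=TOTAL_ROWS, total_bytes=TOTAL_BYTES):
    """Estimate the size in GB of n_rows rows, assuming a uniform average row size."""
    if not 0 <= n_rows <= total_rows:
        raise ValueError("n_rows must be between 0 and total_rows")
    return n_rows * (total_bytes / total_rows) / 1e9


avg_bytes = TOTAL_BYTES / TOTAL_ROWS  # average bytes per row
one_percent_gb = estimate_sample_gb(TOTAL_ROWS // 100)  # a 1% sample
print(f"average row: {avg_bytes:.0f} bytes, 1% sample: {one_percent_gb:.2f} GB")
```

The estimate covers caption files only; the images, which are obtained separately, are not included.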

Information

  • Website: huggingface.co
  • Organizations: Princeton University
  • Authors: Boya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu
  • Published date: 2026/05/13
