Customized video generation requires large numbers of paired examples that preserve a subject's visual identity while mapping text prompts to novel video outputs. CustoMDiT (PexelsCustom-1M) supplies over 1.03M curated (identity, text, video) triplets spanning 8,373 identity categories, drawn from roughly 320K Pexels HD clips. That scale is intended to make identity-conditional, open-domain fine-tuning and benchmarking feasible without proprietary datasets.
What Sets It Apart
- Scale with identity labels: 1,036,431 training triplets across 8,373 identity categories let models learn both instance-level appearance and broader category variation, so you can fine-tune for personalization rather than only class-level generation.
- Per-video structured annotations: each per-video JSON includes keyframe indices, the original caption, generated alternative captions, bounding boxes, and segmentation masks, so conditioning signals (a reference image plus localized object masks) are ready for training conditional diffusion/transformer pipelines.
- Practical dataset packaging: metadata CSVs, an annotations archive, and an extract_frames script streamline dataset assembly for common training stacks, so researchers can reproduce the training setup used by CustomDiT while sourcing videos separately.
- Open licensing for the provided assets: the dataset files are licensed CC-BY-4.0, which permits reuse of the metadata and annotations. The source videos come from Pexels and must be obtained under Pexels' terms.
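To make the per-video annotation bullet concrete, here is a minimal sketch of turning one annotation record into training samples. The schema is an assumption: the field names (`identity`, `original_caption`, `generated_captions`, `keyframes`, `index`, `bbox`) and the `[x, y, width, height]` box layout are hypothetical stand-ins, so check a real annotation file before relying on any of them. Segmentation masks are left out for brevity.

```python
import json
from dataclasses import dataclass

# NOTE: all field names and the [x, y, width, height] box layout are
# hypothetical; the real annotation schema may differ.


@dataclass
class KeyframeSample:
    identity: str
    caption: str
    frame_index: int
    bbox_xyxy: tuple  # (x_min, y_min, x_max, y_max) in pixels


def xywh_to_xyxy(box):
    """Convert an assumed [x, y, width, height] box to corner format."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def samples_from_annotation(record):
    """Expand one per-video record into one sample per (keyframe, caption)."""
    captions = [record["original_caption"], *record.get("generated_captions", [])]
    samples = []
    for keyframe in record["keyframes"]:
        for caption in captions:
            samples.append(
                KeyframeSample(
                    identity=record["identity"],
                    caption=caption,
                    frame_index=keyframe["index"],
                    bbox_xyxy=xywh_to_xyxy(keyframe["bbox"]),
                )
            )
    return samples


# Toy record standing in for one per-video JSON file (invented values).
example = json.loads("""
{
  "identity": "golden retriever",
  "original_caption": "A dog runs on a beach.",
  "generated_captions": ["A golden retriever sprints along the shoreline."],
  "keyframes": [
    {"index": 0, "bbox": [10, 20, 100, 50]},
    {"index": 48, "bbox": [30, 25, 90, 60]}
  ]
}
""")

samples = samples_from_annotation(example)
print(len(samples), samples[0].bbox_xyxy)
```

Pairing every keyframe with every caption is only one possible expansion; sampling a single caption per keyframe at load time would keep the index smaller.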
Who It's For and Trade-offs
Great fit if you are developing or evaluating identity-preserving video generation models (e.g., personalized diffusion/transformer approaches), training adapters or fine-tuning open weights, or benchmarking customization workflows at scale. It lowers the barrier to experiments that need paired reference frames plus object-level annotations.
Look elsewhere if you need fully prepackaged video files (CustoMDiT provides metadata/annotations only; videos must be downloaded from Pexels), require consent/usage guarantees beyond Pexels' licensing, or need domain-specific footage not well represented on Pexels (e.g., medical, surveillance). Also expect substantial storage and I/O demands when assembling HD video assets and extracting frames.
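For planning the storage side, a back-of-envelope estimate can help. The sketch below takes the clip count of roughly 320K from the description above; the average clip size, the number of extracted frames per clip, and the per-frame size are placeholder assumptions and not measured values, so substitute numbers from a sample of your own downloads.

```python
def estimate_storage_tb(num_clips, avg_clip_mb, frames_per_clip, avg_frame_kb):
    """Return (video_tb, frames_tb) in decimal units (1 TB = 1e6 MB = 1e9 KB)."""
    video_tb = num_clips * avg_clip_mb / 1e6
    frames_tb = num_clips * frames_per_clip * avg_frame_kb / 1e9
    return video_tb, frames_tb


# Placeholder inputs: only num_clips comes from the dataset description;
# the other three values are assumptions chosen for illustration.
video_tb, frames_tb = estimate_storage_tb(
    num_clips=320_000,
    avg_clip_mb=20.0,
    frames_per_clip=8,
    avg_frame_kb=300.0,
)
print(f"videos ~{video_tb:.1f} TB, extracted frames ~{frames_tb:.2f} TB")
```

Under these assumed inputs the videos dominate the footprint, so measuring the real average clip size early is the most useful refinement.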
Where It Fits
This dataset fills a practical gap between small, identity-focused personalization sets and very large uncurated video corpora by combining identity-aware captions, segmentation/bbox metadata, and explicit reference-frame extraction guidance. For teams building personalized video pipelines that condition on a reference image plus text, CustoMDiT offers a ready metadata backbone to assemble large-scale training data while respecting third-party video hosting rules.
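Because the videos are sourced separately, one useful first step is to check which clips listed in the metadata are already on disk. The following is a rough sketch under stated assumptions: the `video_id` column name and the `<video_id>.mp4` file-naming scheme are hypothetical, and the real metadata CSVs may use different headers.

```python
import csv
import io
import tempfile
from pathlib import Path


def find_missing_videos(metadata_file, video_dir, id_column="video_id"):
    """Return IDs listed in the metadata that have no local .mp4 file.

    `id_column` and the '<id>.mp4' naming are assumptions for illustration.
    """
    video_dir = Path(video_dir)
    reader = csv.DictReader(metadata_file)
    return [
        row[id_column]
        for row in reader
        if not (video_dir / f"{row[id_column]}.mp4").exists()
    ]


# Self-contained demo: a temporary folder with one "downloaded" clip and
# an in-memory metadata CSV listing two clips (invented IDs and captions).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "101.mp4").touch()
    metadata = io.StringIO(
        "video_id,caption\n101,A dog runs.\n202,A cat sleeps.\n"
    )
    missing = find_missing_videos(metadata, tmp)

print(missing)
```

In practice you would pass an open file handle for one of the metadata CSVs instead of the `io.StringIO` object, and feed the returned IDs to whatever download step you use to obtain clips under Pexels' terms.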
