Generative Models by Stability AI
Generative Models by Stability AI is a public GitHub repository that organizes code, configurations, and demos for a family of Stability AI's generative models (text-to-image, image-to-video, novel-view/4D video synthesis, and more). The project emphasizes a modular, config-driven design so researchers and engineers can compose encoders, conditioners, and samplers and run both training and inference workflows.
Key highlights:
Models & releases (selected):
- May 20, 2025 — Stable Video 4D 2.0 (SV4D 2.0): enhanced video-to-4D diffusion for novel-view video synthesis (trained to generate multi-view frames, better spatio-temporal consistency).
- July 24, 2024 — Stable Video 4D (SV4D): video-to-4D diffusion used for novel-view video synthesis (5 frames x 8 views, sampling strategies for longer sequences).
- March 18, 2024 — SV3D: image-to-video / multi-view synthesis (variants SV3D_u and SV3D_p).
- November 2023 — SDXL-Turbo and Stable Video Diffusion releases and related technical reports.
- July 2023 — SDXL family (base/refiner) initial releases and licensing notes.
Repository design & components:
- Config-driven instantiation (YAML configs plus the instantiate_from_config pattern) to combine embedders, networks, samplers, and guiders; see the sketch after this list.
- GeneralConditioner abstraction for handling diverse conditioning (text, classes, spatial conditionings).
- Separate samplers (numerical solvers) and guidance wrappers; denoiser framework for continuous & discrete-time models.
- Training examples (configs/example_training), support for PyTorch Lightning, and notes on dataset format (webdataset).
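To illustrate the pattern, here is a minimal sketch of config-driven instantiation, assuming OmegaConf-style configs with target/params keys; the helper is simplified and the torch.nn.Linear target is a stand-in for the repository's own module paths, not an actual config from the repo.

```python
# Minimal sketch of the instantiate_from_config pattern: a config names a class
# by import path ("target") and its constructor arguments ("params").
import importlib

from omegaconf import OmegaConf


def instantiate_from_config(config):
    """Build an object from a mapping with 'target' (import path) and 'params'."""
    module_path, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**config.get("params", {}))


# Illustrative config fragment; real configs point 'target' at sgm modules
# (embedders, denoisers, samplers) and nest further instantiable sub-configs.
cfg = OmegaConf.create(
    {
        "network": {
            "target": "torch.nn.Linear",  # stand-in target for demonstration
            "params": {"in_features": 16, "out_features": 8},
        }
    }
)

network = instantiate_from_config(OmegaConf.to_container(cfg.network))
print(network)  # Linear(in_features=16, out_features=8, bias=True)
```

Because every component is named by import path, swapping a sampler or conditioner becomes a one-line config change rather than a code change, which is what keeps the training and demo configs composable.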
Demos & inference:
- Streamlit and Gradio demo apps for image sampling and video generation.
- Quickstart sampling scripts and example commands for SV3D, SV4D, and SV4D 2.0.
- Instructions to obtain model weights from Hugging Face and place them under checkpoints/; see the sketch after this list.
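A minimal sketch of fetching weights into checkpoints/ with huggingface_hub follows; the repo_id and filename are assumptions for illustration, so check the README for the exact Hugging Face repository and checkpoint name of each release (some require accepting a license first).

```python
# Sketch: download a checkpoint into the checkpoints/ directory the demo
# scripts expect. repo_id and filename below are assumed placeholders.
from pathlib import Path

from huggingface_hub import hf_hub_download

ckpt_dir = Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)

ckpt_path = hf_hub_download(
    repo_id="stabilityai/sv3d",        # assumed model repo; may be gated
    filename="sv3d_u.safetensors",     # assumed checkpoint filename
    local_dir=ckpt_dir,
)
print(f"Checkpoint available at {ckpt_path}")
```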
Practical notes for users:
- Installation steps (virtualenv, PyTorch wheel index, requirements files) and packaging with Hatch.
- Guidance for low-VRAM inference (encoding_t/decoding_t flags, lower resolution) and background-removal suggestions for better results (rembg/SAM2/Clipdrop); see the sketch after this list.
- Invisible watermark embedding/detection utilities and instructions to run detection scripts.
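As a concrete example of the background-removal step, here is a short sketch using rembg; the file names and the white-background compositing are illustrative assumptions rather than a script shipped with the repository.

```python
# Sketch: strip the background from a conditioning image before SV3D/SV4D
# sampling and composite the cutout onto a plain white canvas.
from PIL import Image
from rembg import remove

img = Image.open("input.jpg").convert("RGB")       # placeholder input path
cutout = remove(img)                               # RGBA image with transparent background

# Assumed preprocessing choice: place the subject on a white background.
white = Image.new("RGBA", cutout.size, (255, 255, 255, 255))
Image.alpha_composite(white, cutout).convert("RGB").save("input_clean.png")
```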
Use cases:
- Research reproducibility of advanced generative models, rapid prototyping of novel-view and video synthesis, base code for training new diffusion-based models, and demonstration apps for sampling/visualization.
This repository is best suited for researchers and engineers familiar with PyTorch, diffusion models, and Hugging Face model distribution. It contains both high-level demos and low-level training/configuration examples to support experimentation and production prototyping.
