LLaVA is an open-source large language and vision assistant project that introduces visual instruction tuning to teach large language models to understand and follow multimodal (image + text) instructions. The repository provides the papers, model checkpoints (Model Zoo), training and evaluation scripts, Gradio demos, and tooling for fine-tuning, quantized inference, and deployment. LLaVA aims to bring LLM-level conversational capability to vision tasks and has continued to evolve through LLaVA-1.5, LLaVA-NeXT, and video/interactive variants.
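
To give a concrete sense of how the released checkpoints are consumed, the sketch below runs 4-bit quantized inference through the Hugging Face Transformers port of LLaVA-1.5 (the `llava-hf/llava-1.5-7b-hf` checkpoint) rather than the repository's own CLI or Gradio demo; the model ID, prompt template, and image path are illustrative assumptions, not prescribed by the project.

```python
# Minimal sketch: 4-bit quantized LLaVA-1.5 inference via the Hugging Face
# Transformers port (an assumption; the repo also ships its own CLI and serving stack).
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed Transformers port of LLaVA-1.5-7B

# Quantize weights to 4-bit at load time so the 7B model fits on a consumer GPU.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# LLaVA-1.5 conversation format: the <image> token marks where visual features are inserted.
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"
image = Image.open("example.jpg")  # placeholder path

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The same checkpoint can also be served through the repository's own demo and evaluation scripts; the snippet above is only one convenient path to quantized inference.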