
MiniMind-V

MiniMind-V is an open-source tiny vision-language model (VLM) project that demonstrates how to train a 26M-parameter multimodal VLM from scratch quickly and cheaply (for example, in roughly 1 hour on a single NVIDIA 3090 GPU, at very low rental cost). The repo provides end-to-end code for data cleaning, pretraining, supervised fine-tuning (SFT), evaluation, and a demo, using CLIP as the visual encoder and MiniMind as the base LLM.
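
The core idea behind this kind of CLIP-plus-small-LLM setup is to run images through a frozen CLIP vision tower and map the resulting patch features into the language model's embedding space, typically with a small projection layer, so image tokens can be interleaved with text tokens. The sketch below illustrates that idea in PyTorch; it is a minimal illustration under those assumptions, and the names (`VisionProjector`, `encode_image`, the CLIP checkpoint, and the embedding sizes) are placeholders for explanation, not the repo's actual identifiers.

```python
# Minimal sketch of the "frozen CLIP features -> project into LLM space" idea.
# Names and dimensions here are illustrative, not MiniMind-V's actual code.
import torch
import torch.nn as nn
from transformers import CLIPImageProcessor, CLIPVisionModel


class VisionProjector(nn.Module):
    """Maps CLIP patch features into the language model's hidden dimension."""

    def __init__(self, clip_dim: int = 768, llm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, clip_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)


@torch.no_grad()
def encode_image(image, clip_name: str = "openai/clip-vit-base-patch16") -> torch.Tensor:
    """Run an image through a frozen CLIP vision tower and return patch features."""
    processor = CLIPImageProcessor.from_pretrained(clip_name)
    vision_tower = CLIPVisionModel.from_pretrained(clip_name).eval()
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    outputs = vision_tower(pixel_values=pixel_values)
    # Drop the [CLS] token; keep per-patch features for the projector.
    return outputs.last_hidden_state[:, 1:, :]
```

In training, the projected patch embeddings would be spliced into the token embedding sequence in place of image placeholder tokens before the LLM forward pass; the projector (and optionally the LLM) is what gets trained, while CLIP stays frozen.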


Information

  • Website: github.com
  • Authors: Jingyao Gong (jingyaogong)
  • Published date: 2024/09/11
