AIAny - AI Model

AI Model · 2026

MiniCPM5-1B

A 1.08B-parameter causal LLM engineered for on-device text generation with native long-context (131k tokens) and built-in Think/No-Think modes. It emphasizes tool-calling support, lightweight deployment formats (BF16, GGUF, MLX), and RL+OPD post-training for stronger reasoning and code generation.

llm transformers huggingface vllm ollama

AI Model · 2026

Bonsai Image · Ternary 4B (gemlite 2-bit)

Prism ML (prism-ml)

A ternary-weight (~1.58-bit) 4B text-to-image diffusion transformer optimized for NVIDIA GPUs using Gemlite INT2 and HQQ; it reduces the transformer to ~1.21 GB (4.55 GB CUDA payload) and targets 1024×1024 generation with a 4-step FlowMatch-Euler sampler.

huggingface ai-image image nvidia ai-inference
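The "~1.58-bit" figure quoted for ternary weights is log2(3) ≈ 1.585, the information content of a weight that takes one of three values, whereas INT2 storage spends a full 2 bits per weight. The following is a minimal sketch of that arithmetic and of packing four ternary weights into one byte. The function names and the byte layout are illustrative assumptions and are not taken from Gemlite or HQQ.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}.
BITS_PER_TERNARY = math.log2(3)

def pack_ternary(weights):
    """Pack ternary weights into bytes, four 2-bit codes per byte.

    Hypothetical layout for illustration only (not Gemlite's format).
    """
    codes = [w + 1 for w in weights]  # map -1, 0, +1 to codes 0, 1, 2
    if any(c not in (0, 1, 2) for c in codes):
        raise ValueError("weights must be -1, 0, or +1")
    codes += [1] * (-len(codes) % 4)  # pad with code 1 (weight 0) to a multiple of 4
    packed = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        packed.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(packed)

def unpack_ternary(packed, count):
    """Invert pack_ternary, returning the first `count` weights."""
    weights = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            weights.append(((byte >> shift) & 0b11) - 1)
    return weights[:count]
```

Under this layout each weight costs 2 bits on disk even though it carries only about 1.58 bits of information; the gap is the price of simple, aligned unpacking.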

AI Model · 2026

google/gemma-4-12B-it

Google DeepMind

Instruction-tuned, unified Gemma 4 12B multimodal model that accepts text, image and audio inputs and generates text outputs locally. Encoder-free design reduces multimodal latency and fits on consumer devices while offering long-context support and native thinking/system-prompt features.

gemma google deepmind multimodal transformers

AI Model · 2026

Gemma 4 12B Unified

Google DeepMind

A 12B unified, encoder-free multimodal model that directly ingests text, images and audio and returns text; supports very long contexts (up to 256K tokens), native function-calling/thinking modes, and small-model deployment for local or on-device use.

gemma multimodal transformers google deepmind

AI Model · 2026

Step 3.7 Flash

stepfun-ai

Processes images and text to produce structured, reasoning-rich text outputs for high-throughput agentic workflows. Sparse MoE design (198B total, ~11B active per token), 256k context window and selectable reasoning levels—optimized for single-pass parsing, verification, and multi-step automation.

multimodal llm transformers vllm ai-inference
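For a sparse MoE model, the share of weights exercised per token follows directly from the two parameter counts quoted above. A small sketch of that arithmetic, using the listing's 198B total and ~11B active figures (the helper name is hypothetical):

```python
def active_fraction(total_params_b, active_params_b):
    """Fraction of parameters used per token in a sparse MoE model."""
    return active_params_b / total_params_b

# Figures from the listing: 198B total, ~11B active per token.
frac = active_fraction(198, 11)
print(f"{frac:.1%} of weights active per token")  # prints "5.6% of weights active per token"
```

Per-token compute scales roughly with the active count, while all of the weights generally still have to be stored to serve the model.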

AI Audio · 2026

MOSS-TTS-v1.5

OpenMOSS-Team

Generates multilingual text-to-speech with zero-shot voice cloning, token-level duration control, and inline pause markers. v1.5 improves multilingual fidelity (with language tags), cloning stability, and long-reference handling—suitable for research and production TTS pipelines.

speech audio voice multilingual huggingface

AI Model · 2026

Keye-VL-2.0-30B-A3B

Kwai-Keye

Performs hour-scale video understanding and fine-grained temporal localization while exposing agent-style multimodal tool/code/search abilities. Built on a sparse-attention long-context architecture (DSA) and a specialized inference stack—best used in GPU-backed research or production evaluation.

multimodal video deepseek transformers huggingface

AI Model · 2026

Mellum2 Thinking

JetBrains

Generates text with explicit chain-of-thought traces for multi-step reasoning and math-heavy tasks, emitting reasoning inside <think>...</think> blocks. Uses a Mixture-of-Experts design and 131k token context for long, verifiable workflows—best when you need inspectable reasoning.

huggingface transformers llm vllm foundation-model
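Because the reasoning is emitted inside <think>...</think> blocks, downstream code can separate the trace from the final answer with plain string handling. A minimal sketch, assuming the raw model output is already available as a string; the helper name and the sample text are hypothetical, and no model-specific API is used.

```python
import re

# Non-greedy match so that several <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (list of reasoning traces, final answer with traces removed)."""
    traces = [m.strip() for m in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return traces, answer

sample = "<think>2 + 2 is 4.</think>\nThe answer is 4."
traces, answer = split_reasoning(sample)
```

Here `traces` holds the inspectable reasoning and `answer` holds the text that would be shown to a user.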

AI Model · 2026

LocateAnything-3B

NVIDIA

Performs fast, high-quality vision–language grounding: given an image plus a natural-language prompt it returns bounding boxes or points for referred objects. Uses Parallel Box Decoding for parallel coordinate prediction (higher throughput) and targets research/non-commercial use.

nvidia vision multimodal transformers huggingface

AI Model · 2026

PaddleOCR-VL-1.6

PaddlePaddle

Performs image-to-text document parsing and OCR for complex elements (tables, formulas, charts, seals), with multilingual support (en/zh). It uses region-aware data optimization and progressive post-training to improve weak-region supervision and is plug-and-play compatible with PaddleOCR-VL-1.5.

ocr multimodal vision image multilingual

AI Model · 2026

nvidia/Qwen3.6-35B-A3B-NVFP4

nvidia

Quantized NVFP4 build of the Qwen3.6-35B MoE language model, optimized with NVIDIA Model Optimizer to cut model size and GPU memory by ~3.06× for inference. Designed for vLLM and NVIDIA GPU deployments (Hopper/Blackwell).

nvidia huggingface vllm llm ai-inference
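To get a feel for what a ~3.06× reduction means, here is a rough weights-only estimate. It assumes the baseline is a 16-bit checkpoint at 2 bytes per parameter and that the model has 35B parameters; the listing does not state the baseline, so both are assumptions, and KV cache and activations are ignored.

```python
def quantized_size_gb(params_billion, bytes_per_param, reduction):
    """Weights-only size in GB (1 GB = 1e9 bytes) after a given reduction factor."""
    baseline_gb = params_billion * bytes_per_param  # billions of params * bytes each
    return baseline_gb / reduction

# Assumed baseline: 35B parameters at 2 bytes each (16-bit), reduced by ~3.06x.
estimate = quantized_size_gb(35, 2.0, 3.06)
print(f"~{estimate:.1f} GB of weights")  # prints "~22.9 GB of weights"
```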

AI Model · 2026

Cosmos3-Super-Text2Image

NVIDIA

Generates high-fidelity images from text prompts using NVIDIA's 64B Cosmos3-Super multimodal foundation model. Integrates with Hugging Face Diffusers and vLLM‑Omni, is released under OpenMDW1.1 for commercial use, and is optimized for Physical AI workflows (robotics, AV, simulation).

nvidia huggingface diffusers vllm ai-image