Tag

Explore by tags

All

30u30

ASR

ChatGPT

GNN

IDE

RAG

agent-skills

ai

ai-agent

ai-api

ai-api-management

ai-client

ai-coding

ai-demos

ai-deploy

ai-development

ai-framework

ai-image

ai-image-demos

ai-inference

ai-leaderboard

ai-library

ai-rank

ai-serving

ai-tools

ai-train

ai-video

ai-workflow

AIGC

algorithms

alibaba

amazon

android

anthropic

audio

aws

benchmark

biology

blog

book

bytedance

chatbot

chatgpt

chemistry

claude

claude-code

cli

code

codex

coding

copilot

course

cuda

cursor

deepmind

deepseek

depth

devops

diffusers

distillation

docker

drug-discovery

electron

embeddings

engineering

evaluation

facebook

finance

flow-matching

foundation

foundation-model

gcode

gemini

gemini-cli

gemma

genomics

gitHub

github

go

google

gradient-booting

grok

groq

huggingface

image

ios

java

javascript

json

kimi

llama.cpp

LLM

llm

lora

mLOps

math

mcp

mcp-client

mcp-server

meta-ai

meta-pytorch

metal

microsoft

mlops

mobile

multilingual

multimodal

mysql

NLP

nlp

nodejs

numpy

nvidia

ocr

ollama

openai

opencode

pandas

paper

physics

pi

plugin

polars

postgres

privacy

programming

prompt-engineering

pwa

python

pytorch

qwen

react

reasoning

redis

retrieval

RL

robotics

rust

science

security

segmentation

shodan

skillkit

sora

speech

sqlite

ssh

stt

swe

swift

tensorrt

terminal

transformers

translation

tts

tutorial

typescript

vibe-coding

video

vision

vllm

voice

vulkan

web-search

windsurf

xAI

xai

AI Dataset2026

WBench

meituan-longcat

Provides a 289-case (1,058-turn) multi-turn benchmark that evaluates interactive video world models across 22 metrics and five dimensions (quality, setting, interaction, consistency, physics). Includes first-/third-person and navigation splits plus a 20-model leaderboard for head-to-head comparisons.

video ai-video vision physics huggingface+4

AI Dataset2026

tran-vi-teacher

ngocdang83

Parallel Chinese→Vietnamese dataset of webnovel (xianxia) text provided in JSON for NMT training and teacher-student distillation. In-domain, ~100K–1M examples with CC-BY-4.0 license — useful for fine-tuning or distillation experiments but limited by narrow genre and small download footprint.

translation huggingface nlp multilingual pandas+1

AI Model2026

google/gemma-4-12B-it

Google DeepMind

Instruction-tuned, unified Gemma 4 12B multimodal model that accepts text, image and audio inputs and generates text outputs locally. Encoder-free design reduces multimodal latency and fits on consumer devices while offering long-context support and native thinking/system-prompt features.

gemma google deepmind multimodal transformers+5

AI Dataset2026

Qwen3.7 Max Pi Traces

armand0e, TeichAI

Provides raw newline-delimited JSON agent traces where assistant responses were generated by qwen/qwen3.7-max, captured with Teich; includes 47 JSONL files, an embedded tools schema snapshot, and conversion guidance for supervised fine‑tuning and distillation.

huggingface llm ai-agent agent-skills ai-train+2

AI Model2026

Gemma 4 12B Unified

Google DeepMind

A 12B unified, encoder-free multimodal model that directly ingests text, images and audio and returns text; supports very long contexts (up to 256K tokens), native function-calling/thinking modes, and small-model deployment for local or on-device use.

gemma multimodal transformers google deepmind+8

AI Dataset2026

StreamAudio-2M

zhifeixie

Large streaming-audio dataset for training and evaluating audio-LLMs and audio agents. About 2.28M clips grouped into multi-turn “streams” across six task subsets (ASR, speech translation, audio understanding, voice chat, proactive response, environment-aware); audio shipped as tar shards.

audio ASR translation speech voice+2

AI Audio2026

MOSS-TTS-v1.5

OpenMOSS-Team

Generates multilingual text-to-speech with zero-shot voice cloning, token-level duration control, and inline pause markers. v1.5 improves multilingual fidelity (with language tags), cloning stability, and long-reference handling—suitable for research and production TTS pipelines.

speech audio voice multilingual huggingface+2

AI Model2026

Keye-VL-2.0-30B-A3B

Kwai-Keye

Performs hour-scale video understanding and fine-grained temporal localization while exposing agent-style multimodal tool/code/search abilities. Built on a sparse-attention long-context architecture (DSA) and a specialized inference stack—best used in GPU-backed research or production evaluation.

multimodal video deepseek transformers huggingface+5

AI Model2026

Mellum2 Thinking

JetBrains

Generates text with explicit chain-of-thought traces for multi-step reasoning and math-heavy tasks, emitting reasoning inside <think>...</think> blocks. Uses a Mixture-of-Experts design and 131k token context for long, verifiable workflows—best when you need inspectable reasoning.

huggingface transformers llm vllm foundation-model+1

AI Model2026

LocateAnything-3B

NVIDIA

Performs fast, high-quality vision–language grounding: given an image plus a natural-language prompt it returns bounding boxes or points for referred objects. Uses Parallel Box Decoding for parallel coordinate prediction (higher throughput) and targets research/non-commercial use.

nvidia vision multimodal transformers huggingface+5

AI Dataset2026

MeasL-Bench

kepeng

Benchmark for evaluating vision–language models on measurement-grounded inputs vs. RGB, emphasizing low-light, HDR, and visibility-sensitive evidence recovery. Contains 2,183 paired test examples with local image assets for controlled RAW↔RGB comparisons.

vision multimodal image huggingface ai-image

AI Model2026

PaddleOCR-VL-1.6

PaddlePaddle

Performs image-to-text document parsing and OCR for complex elements (tables, formulas, charts, seals), with multilingual support (en/zh). It uses region-aware data optimization and progressive post-training to improve weak-region supervision and is plug-and-play compatible with PaddleOCR-VL-1.5.

ocr multimodal vision image multilingual+5