DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF

A 40B GGUF-quantized Qwen3.6 variant fine-tuned with Claude 4.6 Opus and Deckard/Heretic datasets for multimodal image-text-to-text tasks. Offers 256K context and custom NEO-CODE Di-IMatrix quants for long conversations and coding. Optimized for local inference and creative/coding use cases; safety alignment has been removed.

huggingface llm multimodal vision ai-image+8

L2P: Unlocking Latent Potential for Pixel Generation

Transforms pretrained latent-diffusion priors into pixel-space diffusion models by removing the VAE and training shallow pixel layers on LDM-generated synthetic images, enabling fast convergence, native 4K output, and low-data training on 8 GPUs.

vision image ai-image foundation-model paper+3

OpenCS2 - POV Renders

Julien Blanchon

Provides tick-aligned Counter-Strike 2 player POV video clips with per-tick inputs and world-state sidecars: near-lossless 1280×720@32fps video, per-player stereo audio, and parquet indexes for event/kill/round filtering. Suited for RL, video classification, and clip mining.

video ai-video RL audio pandas+5
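
As a rough illustration of the event/kill/round filtering that the OpenCS2 parquet indexes are described as supporting, the sketch below filters an in-memory stand-in for a kill index and maps ticks to video frames. The column names (`round`, `tick`, `attacker`, `victim`), the 64 ticks-per-second rate, the assumption that tick 0 lines up with frame 0, and the helper names are all hypothetical and not the dataset's actual schema; only the 32 fps frame rate comes from the description.

```python
from typing import Dict, List, Optional

VIDEO_FPS = 32   # frame rate stated in the dataset description
TICK_RATE = 64   # assumption: game ticks per second

# Hypothetical rows standing in for a kill-event index.
KILLS: List[Dict] = [
    {"round": 1, "tick": 640, "attacker": "p1", "victim": "p7"},
    {"round": 1, "tick": 1280, "attacker": "p3", "victim": "p1"},
    {"round": 2, "tick": 4096, "attacker": "p1", "victim": "p9"},
]


def kills_by(rows: List[Dict], attacker: str, round_no: Optional[int] = None) -> List[Dict]:
    """Return kill rows for one attacker, optionally limited to one round."""
    return [
        r for r in rows
        if r["attacker"] == attacker and (round_no is None or r["round"] == round_no)
    ]


def tick_to_frame(tick: int, tick_rate: int = TICK_RATE, fps: int = VIDEO_FPS) -> int:
    """Map a tick to the index of the video frame that covers it (floor)."""
    return tick * fps // tick_rate


for row in kills_by(KILLS, "p1"):
    print(row["round"], row["tick"], tick_to_frame(row["tick"]))
```

With a real index, the list of dicts would be replaced by rows read from the parquet files, and the filter would use whatever columns the dataset actually defines.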

Hy-MT2-30B-A3B

Tencent Hunyuan

A 30B mixture-of-experts multilingual translation model supporting 33 languages and instruction-following translation. Offers a fast-thinking mode and quantized/deployment-ready variants for production translation and subtitle tasks.

multilingual transformers huggingface vllm llm+3

Irodori-TTS-500M-v3

Aratako (Chihiro Arata)

Generates high-quality Japanese speech from text with zero-shot voice cloning and emoji-based style controls; uses a flow-matching diffusion transformer over DACVAE continuous latents, includes a duration predictor and integrated SilentCipher watermarking. Japanese-only.

speech voice audio transformers pytorch+1

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Research-focused text-to-image foundation model that prioritizes training efficiency: a 3.8B-parameter architecture trained on an 800M image-text corpus with mixed-resolution learning, FLUX.2 VAE, RL tuning, and a distilled 4-step Lens-Turbo for fast high-resolution generation.

microsoft huggingface ai-image image transformers+3

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu, Mengqi Huang +12

Delivers image and video generation, editing, and understanding inside a single 3B-parameter multimodal model trained from scratch with a multi-task recipe. Notable for strong unified benchmarks at 3B scale; inference requires large GPU memory (≈40GB+ VRAM).

bytedance multimodal video ai-video ai-image+5

Lens-Turbo (microsoft/Lens-Turbo)

A 4-step distilled variant of Microsoft's Lens foundational text-to-image model for fast, high-resolution image synthesis. Optimized for mixed-resolution inference up to 1440×1440 with GPT-OSS text features and FLUX.2 latents; intended for low-latency prototyping and research, and released under an MIT license.

microsoft huggingface foundation-model ai-image image+2

Nemotron 3.5 ASR

Multilingual streaming ASR that transcribes 40 language-locales using a cache-aware FastConformer‑RNNT architecture. Supports language-ID prompting (or auto-detect), punctuation/capitalization, and configurable chunk sizes that trade latency against accuracy for production transcription and streaming voice agents.

nvidia huggingface ASR speech multilingual+3
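
The chunk-size trade-off can be made concrete with simple arithmetic: a streaming recognizer cannot emit text for a chunk before that chunk's audio has arrived, so the chunk duration is a lower bound on added latency. The sketch below assumes 16 kHz mono audio and uses hypothetical helper names and chunk sizes; it does not call the model or any NVIDIA API.

```python
from typing import Iterator, List

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono input


def chunk_samples(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of audio samples in one chunk of chunk_ms milliseconds."""
    return sample_rate * chunk_ms // 1000


def iter_chunks(
    audio: List[float], chunk_ms: int, sample_rate: int = SAMPLE_RATE
) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    size = chunk_samples(chunk_ms, sample_rate)
    if size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


audio = [0.0] * (SAMPLE_RATE * 2)  # two seconds of silence as stand-in input
for chunk_ms in (80, 480, 1120):   # illustrative chunk sizes
    n_chunks = sum(1 for _ in iter_chunks(audio, chunk_ms))
    print(f"{chunk_ms} ms chunks: {chunk_samples(chunk_ms)} samples each, {n_chunks} chunks")
```

Smaller chunks lower the latency floor but hand the recognizer less audio per step and require more steps for the same recording.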

SANA-WM (Bidirectional)

Efficient-Large-Model

Generates minute-scale, 720p videos from a single image using a 2.6B image-to-video diffusion transformer with precise 6‑DoF camera control and an optional LTX‑2 refiner; designed for long-context, memory-efficient modeling but requires large refiner checkpoints (~41 GB).

video ai-video ai-image huggingface gemma+2

MOSS-Transcribe-Diarize

OpenMOSS-Team, MOSI.AI +1

Converts long-form multi-speaker audio/video into a compact, speaker-aware transcript with timestamps and anonymous speaker labels in one pass. Combines ASR and diarization in a single model, supports custom prompts/hotwords, and targets meetings, podcasts, interviews and long recordings.

ASR audio speech stt transformers+5

Cosmos-Framework

End-to-end Python framework for training and serving NVIDIA's Cosmos world models (Cosmos3), integrating distributed training (FSDP/TP/CP/PP), DCP/safetensors checkpoints, dataset adapters, multiple inference backends, online serving, and agent skills.

nvidia ai-train ai-serving pytorch cuda+8