Tag

Explore by tags

All

30u30

ASR

ChatGPT

GNN

IDE

RAG

agent-skills

ai

ai-agent

ai-api

ai-api-management

ai-client

ai-coding

ai-demos

ai-deploy

ai-development

ai-framework

ai-image

ai-image-demos

ai-inference

ai-leaderboard

ai-library

ai-rank

ai-serving

ai-tools

ai-train

ai-video

ai-workflow

AIGC

algorithms

alibaba

amazon

android

anthropic

audio

aws

benchmark

benchmarks

biology

blog

book

bytedance

chatbot

chatgpt

chemistry

claude

claude-code

cli

code

codex

coding

coding-agents

copilot

course

cpu

cuda

cursor

deepmind

deepseek

depth

devops

diffusers

distillation

docker

drug-discovery

electron

embeddings

engineering

evaluation

facebook

finance

flow-matching

foundation

foundation-model

gcode

gemini

gemini-cli

gemma

genomics

gitHub

github

go

google

gradient-boosting

grok

groq

huggingface

image

ios

java

javascript

json

kimi

llama.cpp

LLM

llm

long-horizon

lora

mLOps

math

mcp

mcp-client

mcp-server

meta-ai

meta-pytorch

metal

microsoft

mlops

mobile

multilingual

multimodal

mysql

NLP

nlp

nodejs

numpy

nvidia

ocr

ollama

openai

opencode

pandas

paper

parquet

physics

pi

plugin

polars

postgres

privacy

programming

prompt-engineering

pwa

python

pytorch

qwen

react

reasoning

redis

retrieval

RL

robotics

rust

science

security

segmentation

shodan

skillkit

software-engineering

sora

speech

sqlite

ssh

stt

swe

swift

tensorrt

terminal

transformers

translation

tts

tutorial

typescript

vibe-coding

video

vision

vllm

voice

vulkan

web-search

windsurf

xAI

xai

Cosmos3-Super-Text2Image

Generates high-fidelity images from text prompts using NVIDIA's 64B Cosmos3-Super multimodal foundation model. Integrates with Hugging Face Diffusers and vLLM‑Omni, is released under OpenMDW1.1 for commercial use, and is optimized for Physical AI workflows (robotics, autonomous vehicles, simulation).

nvidia huggingface diffusers vllm ai-image +5

LFM2.5-8B-A1B

Hybrid LFM2.5 text-generation model optimized for on-device assistants and agentic workflows, with 8.3B total / 1.5B active parameters and a 131,072-token context window. Prioritizes low-latency, high-throughput inference and multilingual instruction-following; it is not optimized for programming-heavy tasks or for knowledge-heavy QA without retrieval.

llm transformers huggingface multilingual vllm +5

AI Video Papers 2026

EarlyTom: Early Token Compression Completes Fast Video Understanding

Hesong Wang, Xin Jin +5

Performs training-free early-stage visual token compression inside the vision encoder to cut time-to-first-token (TTFT) and FLOPs for Video-LLMs. Introduces a decoupled spatial token selection strategy and reports up to 2.65× TTFT reduction and 61% FLOPs savings on LLaVA-OneVision-7B (NVIDIA A100) while preserving full-token accuracy — aimed at latency-sensitive video understanding.

video vision ai-video multimodal llm +3

Step-3.7-Flash (GGUF quantizations)

GGUF quantizations of Step-3.7-Flash: a sparse multimodal Mixture-of-Experts LLM with native image understanding, selectable reasoning levels, and a 256K context window. Ships multiple calibrated Q3/Q4/IQ quant files plus an mmproj vision projector for local llama.cpp inference on high-memory hosts.

huggingface llm vision multilingual ai-inference +4

AI Video Papers 2026

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Yuyang Zhao, Yicheng Pan +7

Enables real-time streaming video-to-video editing (1280×704 @24 FPS) on a single RTX 5090 GPU. Uses a Hybrid Diffusion Transformer for balanced local/global modeling, Cycle‑Reverse Regularization for temporal consistency, and system-level mixed-precision and fused kernels to maximize throughput.

video ai-video vision transformers nvidia +2

Large Language Model Papers 2026

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Haodi Lei, Yafu Li +9

Introduces Draft-OPD, an on-policy distillation method for training lightweight draft models used in speculative decoding — it focuses learning on draft-induced errors via target-assisted rollouts and replay, improving acceptance length and enabling >5× lossless LLM inference acceleration.

paper NLP llm ai-inference ai-serving +2

Computer Vision Papers 2026

Colored Noise Diffusion Sampling

Hadar Davidson, Noam Issachar +1

Reallocates injected noise energy across frequency bands to match a diffusion model's spectral bias, improving sampling fidelity without retraining. Uses a timestep- and frequency-dependent colored-noise schedule as a plug-and-play inference-time SDE solver; shows sizable FID drops on ImageNet-256.

paper vision image ai-image ai-inference

unsloth/gemma-4-12b-it-GGUF

A GGUF-quantized, locally runnable build of Gemma 4 12B Unified (image-text-to-text) packaged by unsloth; preserves multimodal (image/audio) input support under an Apache-2.0 license and is compatible with common GGUF runtimes and Unsloth Studio.

gemma google deepmind huggingface multimodal +7

Cosmos3-Super

Generates and reasons about multimodal physical-world content—text, images, video, audio, and robot/action trajectories—conditioned on combinations of text, image, video and action inputs. The 64B “Super” variant targets Physical AI use cases and supports vLLM‑Omni, Diffusers, and action prediction.

nvidia huggingface multimodal robotics ai-video +5

ByteDance/Bernini-R

Provides the renderer weights and inference code for Bernini’s video renderer, enabling text→video, image→video, and video-editing inference. Offers a ready-to-use Diffusers-format bundle or safetensors checkpoints under Apache‑2.0; intended for multi‑GPU/Hopper inference and reproducible research.

bytedance huggingface diffusers video ai-video +3

Echo-LongVideo (JoyAI-Echo)

Echo Team @ Joy Future Academy, JD, jdopensource

Generates minute-level, multi-shot synchronized audio+video from a single text prompt, using a paired cross-modal memory to preserve character appearance and voice across shots. Uses DMD-distilled few-step inference for a ~7.5× speedup; requires substantial GPU memory and is released under the LTX-2 community license.

ai-video video audio multimodal huggingface +3

Nex-N2-Pro

Agentic LLM for long-horizon, environment-driven workflows: decomposes goals, generates and executes code/tool calls, evaluates outputs, and iterates. The Pro variant emphasizes coding and terminal execution and is published for use with SGLang and multi-node H100 deployment.

transformers llm ai-agent ai-coding huggingface +5