Tag

Explore by tags

AIAny

Curated AI Resources for Everyone

All

30u30

ASR

ChatGPT

GNN

IDE

RAG

agent-skills

ai

ai-agent

ai-api

ai-api-management

ai-client

ai-coding

ai-demos

ai-deploy

ai-development

ai-framework

ai-image

ai-image-demos

ai-inference

ai-leaderboard

ai-library

ai-rank

ai-serving

ai-tools

ai-train

ai-video

ai-workflow

AIGC

algorithms

alibaba

amazon

android

anthropic

audio

aws

benchmark

benchmarks

biology

blog

book

bytedance

chatbot

chatgpt

chemistry

claude

claude-code

cli

code

codex

coding

coding-agents

copilot

course

cpu

cuda

cursor

deepmind

deepseek

depth

devops

diffusers

distillation

docker

drug-discovery

electron

embeddings

engineering

evaluation

facebook

finance

flow-matching

foundation

foundation-model

gcode

gemini

gemini-cli

gemma

genomics

gitHub

github

go

google

gradient-booting

grok

groq

huggingface

image

ios

java

javascript

json

kimi

llama.cpp

LLM

llm

long-horizon

lora

mLOps

math

mcp

mcp-client

mcp-server

meta-ai

meta-pytorch

metal

microsoft

mlops

mobile

multilingual

multimodal

mysql

NLP

nlp

nodejs

numpy

nvidia

ocr

ollama

openai

opencode

pandas

paper

parquet

physics

pi

plugin

polars

postgres

privacy

programming

prompt-engineering

pwa

python

pytorch

qwen

react

reasoning

redis

retrieval

RL

rl

robotics

rust

science

security

segmentation

shodan

skillkit

software-engineering

sora

speech

sqlite

ssh

stt

swe

swift

tensorrt

terminal

transformers

translation

tts

tutorial

typescript

vibe-coding

video

vision

vllm

voice

vulkan

web-search

windsurf

xAI

xai

AI Video Papers·2026

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan +8

Decouples perception and reasoning for hours-long videos by streaming inputs into a three-tier Hierarchical Graph Memory and using an agentic Observation–Reason–Action retrieval loop; reduces the reasoning context to ~2% of the full video while improving benchmark accuracy.

#paper #ai-video #multimodal #GNN #agent-skills +3

AI Model·2026

unsloth/gemma-4-12B-it-qat-GGUF

unsloth, Google DeepMind

GGUF-format QAT (quantization-aware training) build of Gemma 4 12B that reduces memory needs for local or lightweight inference while preserving near-bfloat16 quality. Ready for any-to-any conversational pipelines and ecosystem deployment.

#gemma #huggingface #google #deepmind #transformers +5

Speech Technology Papers·2026

MMAE: A Massive Multitask Audio Editing Benchmark

Ziyang Ma, Ruiqi Yan +36

Provides a comprehensive benchmark for instruction-based audio editing across seven audio modalities and eight operation types, with 2,000 high-fidelity samples and a rubric that decomposes tasks into 17,741 verifiable criteria for multi-dimensional evaluation.

#audio #multimodal #paper #speech #ai-leaderboard

Computer Vision Papers·2026

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Yu Li, Menghan Xia +9

Simulates egocentric, embodied human–world interactions and enables customizable, self-evolving local scenes by defining anchor views and text-driven evolution. Uses exogenous viewpoints and full-body motion supervision to improve spatial grounding and interaction consistency.

#vision #robotics #paper #multimodal #ai

AI Agent Papers·2026

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Wanli Li, Bowen Zhou +5

Benchmark for long-horizon computer-use agents that must orchestrate GUI, CLI, and code operations within single trajectories across 114 real-world tasks. Evaluated on a real Ubuntu desktop and paired with a trajectory-aware judge that inspects deliverables, artifacts, and action traces; the top PassRate it reports is ~41.2%.

#paper #ai-agent #agent-skills #cli #terminal +3

Computer Vision Papers·2026

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Xin Jin, Huanqia Cai +9

Models visual preference as distributions over rubric scores and introduces Z-Reward, a teacher–student framework that decouples reasoning-heavy judgment (teacher trained with GDSO) from efficient deployment (student via RISD). Demonstrates higher human-preference accuracy and works as a differentiable reward for text-to-image optimization.

#paper #vision #multimodal #ai-image #RL +1

AI Model·2026

Nemotron-Labs-Audex-30B-A3B

Zhifeng Kong, Sang-gil Lee +18·NVIDIA

Adds discrete audio tokens and an audio encoder to a 30B MoE text LLM so a single model can perform ASR, speech translation, TTS, text-to-audio and speech-to-speech while preserving text reasoning and long-context capabilities; supports thinking/instruct modes and up to 1M-token context.

#nvidia #huggingface #transformers #vllm #llm +6

AI Agent Papers·2026

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongcheng Gao, Hailong Qu +19

A benchmark that evaluates interactive spatial reasoning for multimodal agents in realistic tasks. It unifies eight heterogeneous simulators under a simulator-agnostic protocol, provides 760 human-annotated tasks with vision-only partial observability, and uses text-based actions plus terminal-state verification to measure task success.

#paper #multimodal #vision #agent-skills #ai-agent +2

AI Agent Papers·2026

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Kevin Qinghong Lin, Batu EI +4·University of Oxford, Stanford University

Turns raw datasets into verifiable multimodal news features via a multi-agent newsroom pipeline. Key innovations: (1) an Inspector that links each claim to data/code/external references for re-execution and audit; (2) multimodal asset generation (interactive maps, audio, visuals) tailored to the story.

#agent-skills #multimodal #ai-agent #paper #code +3

Computer Vision Papers·2026

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Wenhao Yan, Fengjia Guo +2

End-to-end framework for controlled character animation that transfers motion from driving videos to reference characters without intermediate pose or background representations. Introduces the MotionPair-60K end-to-end motion-transfer dataset, in-context mask conditioning and mode-specific RoPE for task unification, plus Bias-Aware DPO to mitigate synthetic-detail errors.

#paper #vision #video #multimodal #ai-video +1

AI Dataset·2026

HIW-500: Humanoids In-the-Wild Dataset

BitRobot, Unitree +1

Provides 500+ hours of human whole-body teleoperation demonstrations for humanoid robot learning in real homes, with synchronized video, joint states, action traces and language annotations. Includes 23K+ episodes, fine-grained subtask labels, and raw ROS/MCAP plus compressed LeRobot formats.

#robotics #multimodal #vision #huggingface #ai-train +1

AI Model·2026

DiffusionGemma 26B A4B

Google DeepMind

Generates text from interleaved text, image, and short-video inputs using discrete diffusion and block-autoregressive multi-canvas sampling; built on a sparse MoE (8/128) Gemma 4 backbone and optimized for low-latency inference and very long contexts (up to 256K tokens).

#gemma #foundation-model #multimodal #vision #transformers +5