stabilityai/stable-audio-3-medium

Generates music, sound effects, and general audio from text prompts using a medium-size Stable Audio 3 diffusion model — a balance of generation quality and inference cost suitable for prototyping, demo assets, and creative sound design workflows.

audio speech AIGC foundation-model ai-tools+1

Voices in the Wild

Provides a large-scale ASR corpus organized into normalized acoustic subsets for robustness training and evaluation. About 645,925 examples span 54 acoustic conditions (noise, echo, far-field, recording distortions), including many distortion, dropout, and noise splits. Distributed as split Parquet files; license not specified on the dataset page.

audio speech ASR multilingual voice+1

Miso TTS 8B

Generates conversational speech and voice continuation from text and optional audio context, outputting Mimi audio codes. Built on a Sesame-style CSM with an 8B Llama-like backbone plus a smaller autoregressive audio decoder. Suited for local TTS inference and voice-cloning workflows.

pytorch audio voice speech huggingface+1

StreamAudio-2M

Large streaming-audio dataset for training and evaluating audio-LLMs and audio agents. About 2.28M clips grouped into multi-turn “streams” across six task subsets (ASR, speech translation, audio understanding, voice chat, proactive response, environment-aware); audio shipped as tar shards.

audio ASR translation speech voice+2

MOSS-TTS-v1.5

Generates multilingual speech from text with zero-shot voice cloning, token-level duration control, and inline pause markers. v1.5 improves multilingual fidelity (with language tags), cloning stability, and long-reference handling, making it suitable for research and production TTS pipelines.

speech audio voice multilingual huggingface+2

SmoothConv

ASLP@NPU, QualiaLabs

Provides ~100 hours of expert-annotated, multi-channel Chinese conversational speech with per-segment timestamps, speaker IDs and paralinguistic labels for turn-taking, overlap/interruption detection and full‑duplex dialogue research. Licensed for academic/non-commercial use (CC BY‑NC 4.0).

speech ASR huggingface voice nlp+1

Speech Technology Papers 2026

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Ruiqi Li, Yu Zhang +4

Zero-shot TTS for expressive long-form monologue and multi-speaker dialogue, designed to preserve acoustic consistency, conversational coherence, and affective continuity. Trained on SwanData-Speech and using a 25 Hz VAE, pause-aware text conditioning, and a flow-matching DiT with DiffusionNFT fine-tuning.

paper speech audio voice foundation-model

Higgs Audio v3 TTS

Converts text into expressive conversational speech across 100+ languages with zero-shot voice cloning and inline control tokens for emotion, style, prosody, pauses, and sound effects. Released under a research/non-commercial license; commercial use requires separate licensing.

huggingface audio speech multilingual transformers+2

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

End-to-end evaluation framework for conversational voice agents that runs bot-to-bot audio simulations and scores agents on task accuracy (EVA-A) and interaction experience (EVA-X). Includes per-scenario backend state, accent/noise perturbations, and 213 scenarios across airline, healthcare HR, and enterprise IT domains.

huggingface voice speech ASR tts+4

ZONOS2

Gabriel Clark, Sofian Mejjoute +3, Zyphra

Multilingual, low-latency text-to-speech model for speech generation and zero-shot voice cloning. Uses an MoE backbone with ECAPA-TDNN speaker embeddings; supports audio prefixes, fine-grained prosody/emotion controls, and 44.1 kHz output; optimized for Linux + NVIDIA GPUs.

tts audio multilingual voice huggingface+4

CohereLabs/cohere-transcribe-arabic-07-2026

Transcribes Arabic speech to text using a CohereLabs-trained ASR model compatible with the Hugging Face Transformers pipeline. Provides safetensors weights, endpoint compatibility and a DOI-tagged release; suitable for Arabic transcription workflows but may require adaptation for diverse dialects or noisy audio.

ASR speech audio transformers huggingface+4

Gepard

Nineninesix, Inc., NVIDIA +1

Generates streaming, low‑latency neural speech for real‑time dialogue by autoregressively producing audio frames as text arrives; joint text–speech training preserves natural prosody. Optimized for vLLM streaming (~50 ms first chunk), supports short‑clip voice cloning and four languages.

tts vllm qwen transformers huggingface+5