LogoAIAny

Tag

Explore by tags

Learn Anything about AI in one site.

support@aiany.app
Copyright © 2026 All Rights Reserved.
  • All
  • 30u30
  • ASR
  • ChatGPT
  • GNN
  • IDE
  • RAG
  • ai-agent
  • ai-api
  • ai-api-management
  • ai-client
  • ai-coding
  • ai-demos
  • ai-development
  • ai-framework
  • ai-image
  • ai-image-demos
  • ai-inference
  • ai-leaderboard
  • ai-library
  • ai-rank
  • ai-serving
  • ai-tools
  • ai-train
  • ai-video
  • ai-workflow
  • AIGC
  • alibaba
  • amazon
  • anthropic
  • audio
  • blog
  • book
  • bytedance
  • chatbot
  • chemistry
  • claude
  • course
  • deepmind
  • deepseek
  • engineering
  • foundation
  • foundation-model
  • gemini
  • github
  • google
  • gradient-boosting
  • grok
  • huggingface
  • LLM
  • llm
  • math
  • mcp
  • mcp-client
  • mcp-server
  • meta-ai
  • microsoft
  • mlops
  • NLP
  • nvidia
  • ocr
  • ollama
  • openai
  • paper
  • physics
  • plugin
  • pytorch
  • RL
  • science
  • sora
  • translation
  • tutorial
  • vibe-coding
  • video
  • vision
  • xAI
  • xai

TEN Framework

2024
TEN-framework

TEN Framework is an open-source framework for building real-time multimodal conversational voice AI agents. It supports low-latency, high-quality interactions with components like STT, LLM, and TTS, and includes extensible agent examples such as voice assistants, lip-sync avatars, and speech diarization. It integrates with services like Agora, OpenAI, and Deepgram, and its ecosystem features VAD, turn detection, and portal tools for real-time communication and hardware integration.

Tags: ai-framework, ai-agent, chatbot, audio, ASR (+3 more)
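The STT → LLM → TTS component chain that voice agents like these are built from can be sketched generically. The callables below are toy stand-ins for pluggable services, not TEN Framework APIs:

```python
def voice_agent_turn(audio, stt, llm, tts):
    """One conversational turn of an STT -> LLM -> TTS pipeline.

    stt/llm/tts are pluggable callables, standing in for whatever
    speech-to-text, language-model, and text-to-speech services
    (cloud or local) the agent is wired to.
    """
    text_in = stt(audio)    # transcribe the user's speech
    text_out = llm(text_in) # generate a reply
    return tts(text_out)    # synthesize the reply as audio

# Toy stand-ins for the three services:
reply = voice_agent_turn(
    b"\x00\x01",                          # raw audio bytes
    stt=lambda audio: "what time is it",  # speech-to-text
    llm=lambda text: "It is noon.",       # language-model response
    tts=lambda text: text.encode(),       # text-to-speech (audio bytes)
)
print(reply)
```

Swapping any stage (e.g. a local ASR model for a cloud one) only means passing a different callable; the turn logic is unchanged.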

CosyVoice (Fun-CosyVoice)

2024
FunAudioLLM

CosyVoice (Fun-CosyVoice) is a multilingual, LLM-based text-to-speech (TTS) system that provides end-to-end capabilities for training, inference, and deployment. It focuses on zero-shot voice cloning, strong content consistency, speaker similarity, and natural prosody, and it supports many languages and Chinese dialects, pronunciation inpainting, text normalization, and low-latency bi-streaming for production use.

Tags: audio, LLM, huggingface, pytorch, ai-inference (+2 more)

NexaSDK

2024
NexaAI

NexaSDK is a cross‑platform developer toolkit and low‑level inference engine (NexaML) for running AI models locally on NPUs, GPUs and CPUs. It supports GGUF, MLX and .nexa model formats, provides Day‑0 support for new architectures, multimodal capabilities (text, vision, audio), mobile SDKs (Android/iOS), OpenAI‑compatible APIs, and optimized NPU support.

Tags: github, ai-inference, ai-serving, ai-client, ai-framework (+5 more)
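An OpenAI-compatible API means existing client code can target a local server simply by changing the base URL. The sketch below only builds the standard chat-completions request body; the endpoint URL and model name are illustrative assumptions, not values from the NexaSDK documentation, and no network call is made:

```python
import json

# Hypothetical local endpoint and model name -- substitute whatever
# your local server actually reports.
BASE_URL = "http://localhost:8080/v1"
MODEL = "local-model"

def chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-compatible /chat/completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }

body = chat_request("Summarize this paragraph.")
print(json.dumps(body, indent=2))
```

The same body could then be POSTed to `BASE_URL + "/chat/completions"` with any HTTP client, or the official `openai` Python package could be pointed at the local base URL.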

VideoCaptioner

2024
WEIFENG2333

VideoCaptioner is an AI-powered video subtitling assistant that combines ASR (local or cloud) with LLM-based subtitle segmentation, correction and translation. It supports offline GPU transcription, concurrent chunk transcription, VAD, speaker-aware processing, batch subtitling and one-click subtitle-to-video synthesis, with both GUI and CLI options.

Tags: video, ai-video, audio, ASR, LLM (+3 more)
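Subtitle pipelines like this ultimately emit timed cues, and SRT is the common target format. A minimal, tool-agnostic sketch of SRT timecode and cue formatting (not VideoCaptioner's own code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timecode: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Render one numbered SRT cue block."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_cue(1, 3.5, 6.25, "Hello, world."))
```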

Project AIRI

2024
moeru-ai

AIRI is an open-source, self-hosted digital companion / AI VTuber platform that recreates and extends the idea of Neuro-sama. It integrates LLMs, speech recognition and synthesis, game-playing agents (Minecraft, Factorio), realtime voice chat, and multi-platform frontends (web, desktop, mobile). AIRI supports many LLM providers, RAG/memory components, Web and native acceleration (WebGPU, CUDA/Metal via native runtimes), and aims to let users own and run persistent AI characters locally or on their servers.

Tags: llm, ai-agent, ai-client, chatbot, audio (+4 more)

Chatterbox TTS

2025
Resemble AI

Chatterbox is an open-source family of state-of-the-art text-to-speech models from Resemble AI. It includes Chatterbox-Turbo (a 350M-parameter efficient model with paralinguistic tags and single-step mel decoding), Chatterbox, and a multilingual model supporting 23+ languages. Designed for low-latency voice agents, narration, and creative workflows; includes built-in PerTh watermarking and demo/Hub integrations.

Tags: audio, github, ai-tools, pytorch, huggingface

WhisperLiveKit

2025
QuentinFuxa

WhisperLiveKit is an ultra-low-latency, self-hosted speech-to-text toolkit with speaker identification. Powered by leading simultaneous speech research like Simul-Whisper and WhisperStreaming, it enables intelligent buffering and incremental processing for real-time transcription, translation across 200 languages, and speaker diarization. Ideal for meeting notes, accessibility tools, and content creation.

Tags: github, ai-tools, ai-inference, ai-serving, ASR (+2 more)
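The "intelligent buffering and incremental processing" idea behind streaming ASR can be illustrated with a toy local-agreement loop: re-transcribe a growing audio buffer and commit only the prefix on which two consecutive hypotheses agree. This is a simplified sketch of the general strategy used in simultaneous-ASR research such as WhisperStreaming, not WhisperLiveKit's implementation:

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest shared word prefix of two hypotheses."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def stream_commit(hypotheses: list[list[str]]) -> list[str]:
    """Commit words once two consecutive hypotheses agree on them
    (a toy local-agreement policy: later, possibly revised words
    stay uncommitted until confirmed)."""
    committed: list[str] = []
    prev: list[str] = []
    for hyp in hypotheses:
        agreed = common_prefix(prev, hyp)
        if len(agreed) > len(committed):
            committed = agreed
        prev = hyp
    return committed

# Successive hypotheses over a growing audio buffer:
hyps = [
    ["the"],
    ["the", "cat"],
    ["the", "cat", "sat"],
    ["the", "cat", "sat", "down"],
]
print(" ".join(stream_commit(hyps)))  # "down" is not yet confirmed
```

The trade-off is latency versus stability: committing only agreed prefixes avoids flickering retractions in the displayed transcript at the cost of a short delay.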

VibeVoice

2025
Microsoft

VibeVoice is Microsoft's open-source frontier voice AI framework designed for generating expressive, long-form, multi-speaker conversational audio (e.g., podcasts) from text. It supports up to 90 minutes of speech with up to 4 distinct speakers. Key innovations include continuous speech tokenizers at a 7.5 Hz frame rate and next-token diffusion using LLMs for context and high-fidelity acoustics. The recently released VibeVoice-Realtime-0.5B adds real-time streaming TTS with ~300 ms latency.

Tags: microsoft, audio, github, huggingface, paper (+2 more)
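The cited 7.5 Hz tokenizer rate is what makes 90-minute contexts tractable: the maximum audio length corresponds to only 90 × 60 × 7.5 = 40,500 acoustic frames per channel, a short sequence by LLM standards. The arithmetic:

```python
frame_rate_hz = 7.5  # continuous speech tokenizer rate cited above
max_minutes = 90     # maximum supported audio length

# Frames needed to cover the full 90-minute context:
frames = int(max_minutes * 60 * frame_rate_hz)
print(frames)  # 40500
```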

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

2015
Dario Amodei, Rishita Anubhai +32

This paper presents Deep Speech 2, an end-to-end deep learning system for automatic speech recognition that works across vastly different languages (English and Mandarin). It replaces traditional hand-engineered ASR pipelines with neural networks, achieving human-competitive transcription accuracy on standard datasets. The system uses HPC techniques for 7x speedup, enabling faster experimentation. Key innovations include Batch Normalization for RNNs, curriculum learning (SortaGrad), and GPU deployment optimization (Batch Dispatch). The approach demonstrates that end-to-end learning can handle diverse speech conditions including noise, accents, and different languages, representing a significant step toward universal speech recognition systems.

Tags: 30u30, paper, audio, ASR
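SortaGrad, the curriculum-learning trick mentioned above, amounts to sorting the first epoch's utterances by length so early minibatches are short and easy, then reverting to random order in later epochs. A minimal sketch under that reading of the paper (illustrative batching code, not the authors' implementation):

```python
import random

def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Yield minibatches of utterances.

    Epoch 0 is sorted shortest-first (SortaGrad), which stabilizes
    early training; later epochs are shuffled as usual.
    """
    if epoch == 0:
        order = sorted(utterances, key=len)
    else:
        rng = random.Random(seed + epoch)
        order = rng.sample(utterances, len(utterances))
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

utts = ["a much longer utterance here", "hi", "medium length one"]
first = list(sortagrad_batches(utts, batch_size=2, epoch=0))
print(first[0])  # shortest utterances first
```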