This paper presents Deep Speech 2, an end-to-end deep learning system for automatic speech recognition that works across two vastly different languages, English and Mandarin. It replaces traditional hand-engineered ASR pipelines with neural networks, achieving transcription accuracy competitive with human workers on several standard benchmarks. The system applies high-performance computing (HPC) techniques to obtain a 7x training speedup, enabling faster experimentation. Key innovations include Batch Normalization for RNNs, a curriculum learning strategy (SortaGrad), and a GPU deployment optimization (Batch Dispatch) that batches user requests for efficient inference. The approach demonstrates that end-to-end learning can handle diverse speech conditions, including noise, accents, and different languages, representing a significant step toward universal speech recognition systems.
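The SortaGrad idea mentioned above can be illustrated with a minimal sketch: during the first epoch, minibatches are drawn in order of increasing utterance length (shorter, easier examples first), and later epochs shuffle as usual. The function name and data below are hypothetical, not from the paper.

```python
import random

def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Yield minibatches; epoch 0 is sorted by length (SortaGrad-style),
    subsequent epochs use a seeded random shuffle."""
    if epoch == 0:
        order = sorted(utterances, key=len)  # curriculum: short first
    else:
        order = random.Random(seed + epoch).sample(utterances, len(utterances))
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

# Hypothetical toy data standing in for training transcripts.
data = ["hello there", "hi", "a somewhat longer utterance", "ok then"]
first_epoch = list(sortagrad_batches(data, batch_size=2, epoch=0))
# first_epoch batches run shortest to longest utterance
```

In the paper's setting, sorting by utterance length in the first epoch stabilizes early training of the CTC loss, since long utterances produce larger, noisier gradients before the network has learned basic alignments.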