Pipecat

pipecat-ai, Daily

Builds real-time voice and multimodal AI agents as composable streaming pipelines. Vendor-neutral: swap among 20+ STT, 20+ LLM and 30+ TTS providers over WebRTC or WebSockets, and compose multi-agent systems with handoff and parallel workers.

chatbot audio ai-framework ai-agent github+4

Call Center AI

Lets AI agents place and answer business phone calls, holding spoken conversations to collect structured data, answer questions, and escalate to humans. Built on Azure Communication Services and Azure OpenAI, with RAG over your own documents.

microsoft github ai-api chatbot audio+5

GPT-SoVITS-WebUI

Clones a voice from a 5-second sample for zero-shot TTS, or fine-tunes on ~1 minute of audio for few-shot synthesis. Covers Chinese, English, Japanese, Korean, and Cantonese, with a WebUI bundling vocal separation, ASR, and dataset labeling.

github audio pytorch ASR huggingface+3

ebook2audiobook

Converts e-books (epub, pdf, mobi, docx, and more) into chapter-aware audiobooks, with optional zero-shot voice cloning. Bundles eight TTS engines including XTTSv2 and Bark, and covers 1,158 languages via Meta's MMS — all runnable on CPU or GPU.

audio gitHub ai-tools python huggingface+1

MiniCPM-o

OpenBMB, ModelBest

Runs GPT-4o-class vision, speech, and full-duplex audio-video conversation on a 9B model small enough to deploy on phones and tablets. The 4.5 release scores 77.6 on OpenCompass and adds real-time bilingual voice with voice cloning.

foundation-model llm pytorch vision audio+5

Gemini-API

Asynchronous, reverse-engineered Python API for programmatic access to the Google Gemini web app — supports persistent cookie auth, streaming text, image/video/audio generation, deep-research workflows, model selection, and a CLI for automation and chatbots.

gemini ai-api python cli image+6

MoneyPrinterV2

Automates online monetization workflows: generating and scheduling YouTube Shorts, posting to X (Twitter), running affiliate campaigns, and sending outreach. Uses a modular, provider-based design (TTS, LLM hooks, CRON scheduler) with configurable pipelines. Carries legal and terms-of-service risks, so use with caution.

github ai-tools ai-video audio AIGC+2

omi

Continuously captures your screen and spoken conversations, transcribes them in real time, generates summaries and action items, and exposes a memory-backed chat that can retrieve what you've seen and heard. Works across desktop, mobile and wearable devices and supports local SDKs and cloud sync.

ai-client chatbot audio python rust+5

MLX-VLM

Provides local inference, fine-tuning, and a server/CLI for vision–language and omni (image/audio/video) models via MLX. Supports multi-image chat, audio/video inputs, activation quantization (CUDA), TurboQuant KV cache, and LoRA/QLoRA fine-tuning for on-device workflows.

vision ai-inference ai-serving python github+5

OmniParse

Ingests documents, images, audio, video, and web pages and converts them into structured, LLM-friendly markdown and parsed data. Runs locally (fits on a T4 GPU), supports ~20 file types, and offers OCR, transcription, table extraction, and a Gradio UI; deployable via Docker or SkyPilot. Licensed under GPL-3.0; some model weights carry CC BY-NC-SA terms that restrict commercial use.

ocr multimodal audio video docker+5

screenpipe

Mediar, Inc., Louis Beaumont

Continuously records your screen and audio 24/7 to a local, searchable timeline you can query in natural language. Stores screenshots with accessibility data in SQLite, and a plugin system runs scheduled AI agents on what it captures.

mcp-server mcp agent-skills audio ocr+5

TEN-framework/ten-framework

TEN Framework (TEN-framework)

Builds real-time multimodal conversational AI agents with voice-assistant examples, VAD, turn detection, RTC/WebSocket transport, avatars, transcription, and edge-device demos.

github ai-framework ai-agent audio ai-tools