Overview
TEN Framework is an open-source toolkit for building real-time multimodal conversational AI agents, with a strong emphasis on voice interaction. It enables developers to create low-latency, high-fidelity AI experiences that go beyond traditional chatbots, incorporating audio processing, real-time streaming, and multimodal elements such as lip-synced avatars. The framework is particularly suited to applications that require seamless human-AI dialogue, such as virtual assistants, interactive games, and hardware-integrated voice systems.
At its core, TEN provides a modular architecture that chains together essential AI pipelines: Speech-to-Text (STT) for accurate audio transcription, Large Language Models (LLM) for intelligent response generation, and Text-to-Speech (TTS) for natural voice output. This setup ensures end-to-end voice interactions happen in near real-time, minimizing delays that are critical for conversational flow. Developers can leverage pre-built extensions for Voice Activity Detection (VAD) and Turn Detection to handle interruptions and multi-speaker scenarios effectively.
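The STT → LLM → TTS chain described above can be sketched as a minimal pipeline. The class and the lambda providers below are illustrative stand-ins, not TEN's actual API; in practice TEN wires real extensions (e.g., Deepgram, OpenAI, ElevenLabs) into an equivalent chain:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    """Toy model of the STT -> LLM -> TTS chain (names are hypothetical)."""
    stt: Callable[[bytes], str]   # audio frames -> transcript
    llm: Callable[[str], str]     # transcript   -> response text
    tts: Callable[[str], bytes]   # response     -> synthesized audio

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        response = self.llm(transcript)
        return self.tts(response)

# Toy providers so the sketch is runnable end to end.
pipeline = VoicePipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)

print(pipeline.run(b"hello"))  # b'You said: hello'
```

In a real deployment each stage streams partial results to the next rather than waiting for complete input, which is where the framework's low-latency claims come from.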
Key Features and Capabilities
Modular Agent Examples
TEN shines through its diverse agent examples, which serve as ready-to-use blueprints for common use cases:
- Multi-Purpose Voice Assistant: A versatile, low-latency assistant supporting both Real-Time Communication (RTC) and WebSocket connections. It can be enhanced with memory modules for context retention, VAD for noise suppression, and turn detection for natural dialogue. This example is ideal for building responsive voice interfaces.
- Lip Sync Avatars: Integrates with avatar providers like HeyGen, Tavus, and Trulience to synchronize AI-generated speech with animated characters, including anime-style Live2D models. This adds visual engagement to voice AI, perfect for virtual meetings or entertainment apps.
- Speech Diarization: Real-time speaker identification and labeling, demonstrated in interactive games like 'Who Likes What?' It uses tools like Speechmatics to track multiple voices, enabling applications in meetings or podcasts.
- SIP Call Extension: Powers phone calls via SIP protocol, bridging AI agents with traditional telephony for broader accessibility.
- Transcription Tool: A standalone audio-to-text converter, useful for logging or analysis without full conversational setup.
- Hardware Integration (ESP32-S3 Korvo V3): Runs lightweight agents on Espressif's development board, allowing LLM-powered communication directly on edge devices for IoT scenarios.
These examples are housed in the ai_agents directory, making it easy to customize and deploy.
Ecosystem Integration
TEN isn't a standalone framework; it's the cornerstone of a broader ecosystem:
- TEN VAD: A lightweight, high-performance Voice Activity Detector for streaming audio, optimized for low latency.
- TEN Turn Detection: Facilitates full-duplex conversations by intelligently managing speaking turns, preventing awkward overlaps.
- Agent Examples Repository: Curated use cases powered by TEN, showcasing practical implementations.
- TEN Portal: Official documentation and blog hub at theten.ai, providing guides, updates, and community resources.
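TEN VAD's internal model isn't detailed in this overview. As a rough illustration of the kind of per-frame decision a streaming VAD makes, here is a naive energy-threshold detector; the threshold, frame format, and function name are assumptions for the sketch, not TEN VAD's algorithm:

```python
from array import array

def is_speech(frame: bytes, threshold: int = 500) -> bool:
    """Naive energy-based VAD: flag a 16-bit native-endian PCM frame as
    speech when its mean absolute amplitude exceeds a fixed threshold."""
    samples = array("h", frame)  # interpret bytes as signed 16-bit samples
    if not samples:
        return False
    energy = sum(abs(s) for s in samples) / len(samples)
    return energy > threshold

# Synthetic frames: pure silence vs. a loud alternating signal.
silence = array("h", [0] * 160).tobytes()
loud = array("h", [4000, -4000] * 80).tobytes()
print(is_speech(silence), is_speech(loud))  # False True
```

Production VADs like TEN VAD use learned models rather than a fixed threshold, but the streaming contract is the same: small audio frames in, a speech/non-speech decision out with minimal latency.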
The framework integrates seamlessly with leading AI services: Agora for real-time audio/video, OpenAI for LLMs, Deepgram for ASR, and ElevenLabs for TTS. This modularity allows developers to swap providers based on needs, ensuring flexibility and cost-efficiency.
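Provider swapping of this kind is usually achieved by programming each pipeline stage against a common interface. The Protocol below is an illustrative sketch of that idea, not TEN's extension API; the provider classes are fakes:

```python
from typing import Protocol

class TTSProvider(Protocol):
    """Anything that turns text into audio bytes can serve as a TTS stage."""
    def synthesize(self, text: str) -> bytes: ...

class FakeElevenLabs:
    def synthesize(self, text: str) -> bytes:
        return b"11L:" + text.encode()

class FakeLocalTTS:
    def synthesize(self, text: str) -> bytes:
        return b"LOC:" + text.encode()

def speak(provider: TTSProvider, text: str) -> bytes:
    # Calling code depends only on the interface, so providers can be
    # swapped for latency or cost reasons without touching the pipeline.
    return provider.synthesize(text)

print(speak(FakeElevenLabs(), "hi"))  # b'11L:hi'
print(speak(FakeLocalTTS(), "hi"))    # b'LOC:hi'
```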
Getting Started and Deployment
Quick Setup
To dive in quickly:
- Prerequisites: Obtain API keys for Agora, OpenAI, Deepgram, and ElevenLabs. Ensure Docker, Docker Compose, and Node.js (v18 LTS) are installed on a system with at least 2 CPU cores and 4GB RAM.
- Local Development: Clone the repo, create `.env` with your API keys, launch the development containers via `docker compose up -d`, then build and run an agent (e.g., `cd agents/examples/voice-assistant` followed by `task install && task run`). Access the UIs at `localhost:49483` (TMAN Designer) and `localhost:3000` (Agent UI).
- Customization: Use the visual TMAN Designer to configure extensions, or edit JSON properties directly.
For faster prototyping, GitHub Codespaces offers a Docker-free environment with pre-configured setups.
Self-Hosting Options
- Docker Deployment: Build custom images from example Dockerfiles and run them with environment variables, exposing ports like 3000 for the web UI.
- Cloud Services: Separate the backend (on Docker-friendly platforms like Fly.io or AWS ECS) from the frontend (deployable to Vercel or Netlify). Configure CORS and environment variables such as `AGENT_SERVER_URL` so the two halves can communicate.
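As one way to picture the backend half of this split, here is a minimal Compose sketch; the service name, variable names other than those mentioned above, and port mapping are placeholders, not the project's actual files:

```yaml
# Hypothetical docker-compose.yml for the agent backend. Values are
# illustrative; consult the example Dockerfiles in the repo for specifics.
services:
  agent-server:
    build: .
    ports:
      - "3000:3000"            # web UI port mentioned above
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - AGORA_APP_ID=${AGORA_APP_ID}
# A separately hosted frontend (e.g., on Vercel) would point at this
# backend via AGENT_SERVER_URL, with CORS allowing the frontend's origin.
```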
Advanced users can run a beta transcriber app directly from TEN Manager without full Docker setup.
Community and Contributions
TEN fosters an active community across platforms: follow updates on X (Twitter), join the Discord for developer discussions, connect on LinkedIn or Hugging Face, and reach Chinese-speaking users via WeChat. With over 9,000 stars on GitHub since its June 2024 launch, the project is gaining traction rapidly.
Contributions are encouraged, from code fixes to documentation improvements. The project uses Apache 2.0 licensing (with conditions), and third-party attributions are documented in the repository. Check the issues and project boards, or reach out to maintainers such as @elliotchen200 on X.
Why Choose TEN?
In a crowded AI landscape, TEN stands out for its focus on real-time voice modalities, bridging the gap between text-based LLMs and immersive audio experiences. Its open-source nature democratizes access to production-grade tools, while the ecosystem ensures scalability from prototypes to deployments. Whether you're building the next smart home device or an engaging virtual companion, TEN provides the foundational blocks for innovative, responsive AI agents.
