Multilingual, low-latency text-to-speech model for speech generation and zero-shot voice cloning. Uses a mixture-of-experts (MoE) backbone with ECAPA-TDNN speaker embeddings; supports audio prefixes, fine-grained prosody and emotion controls, and 44.1 kHz output. Optimized for Linux with NVIDIA GPUs.