VideoCaptioner

VideoCaptioner is an AI-powered video subtitling assistant that combines ASR (local or cloud) with LLM-based subtitle segmentation, correction and translation. It supports offline GPU transcription, concurrent chunk transcription, VAD, speaker-aware processing, batch subtitling and one-click subtitle-to-video synthesis, with both GUI and CLI options.

Introduction

VideoCaptioner (Kaka Subtitle Assistant)

Overview

VideoCaptioner is an open-source tool that automates the full subtitle pipeline for videos by combining speech recognition (local or cloud) with large language models (LLMs) for intelligent segmentation, correction and translation. It targets users who need fast, accurate and readable subtitles without heavy configuration — supporting both offline GPU-powered transcription and online APIs.

Key features
  • Multi-mode ASR: supports online endpoints and local Whisper/faster-whisper models, including GPU acceleration (a minimal local-transcription sketch follows this list).
  • LLM-based processing: uses LLMs for smart sentence segmentation, subtitle optimization and high-quality translation (supports integration with OpenAI-compatible services, DeepSeek, SiliconCloud, etc.).
  • VAD and audio separation: voice activity detection and human-voice separation (MDX-Net) to reduce noise and transcription hallucinations.
  • High-precision timestamps: supports word/character-level timestamps for accurate subtitle alignment.
  • Batch processing and concurrency: chunked concurrent transcription with automatic merging, plus batch subtitle synthesis for many videos.
  • Subtitle export and styling: outputs SRT/ASS/VTT/TXT and supports multiple subtitle style templates and soft/hard subtitle synthesis.
  • Lightweight desktop distribution: small packaged executable for Windows (≈60MB), plus macOS/Linux run scripts and a web-style documentation site.
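
As a rough illustration of the local ASR path, the sketch below calls faster-whisper directly with VAD and word-level timestamps enabled. The model size, device and file names are placeholders, and this is not VideoCaptioner's internal code, only the kind of call it automates behind the GUI/CLI:

```python
# Minimal local-transcription sketch with faster-whisper.
# Assumptions: a CUDA GPU, the "large-v2" model, and an audio file
# already extracted from the video (e.g. with ffmpeg).
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "talk.wav",
    word_timestamps=True,   # per-word timing for precise subtitle alignment
    vad_filter=True,        # skip non-speech regions to reduce hallucinations
)

print(f"Detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
    for w in seg.words or []:
        print(f"    {w.start:6.2f} {w.end:6.2f} {w.word}")
```
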
Supported components & workflow
  1. Download/import video (supports many platforms including Bilibili, YouTube, TikTok, X, Douyin, etc.).
  2. Transcription: choose an online API (B/J endpoints) or a local Whisper/faster-whisper model (Tiny/Small/Medium/Large-v2...).
  3. Post-processing with LLM: intelligent segmentation (semantic or sentence-based), correction (punctuation, capitalization, domain terms), and optional translation (LLM-based or Microsoft/Google translation).
  4. Subtitle synthesis: generate subtitles in the desired format and optionally burn them in (hardcode) or mux them as soft subtitles for players; a minimal ffmpeg sketch follows this list.
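
For the synthesis step, burning subtitles into the video frames is typically an ffmpeg job. The snippet below is a hedged sketch of that step driven from Python; the file names are placeholders, and VideoCaptioner applies its own style templates on top of a call like this:

```python
# Hard-subtitle sketch: render an SRT file into the video with ffmpeg.
# Requires an ffmpeg build with libass; paths are placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "input.mp4",
        "-vf", "subtitles=subs.srt",  # burn the subtitles into the frames
        "-c:a", "copy",               # keep the original audio untouched
        "output_hardsub.mp4",
    ],
    check=True,
)

# For soft subtitles in an MP4 container, mux instead of re-encoding:
#   ffmpeg -i input.mp4 -i subs.srt -c copy -c:s mov_text output_softsub.mp4
```
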
Deployment & quick start
  • Windows: download the packaged executable from the Releases page and run it; the GUI guides you through API and model configuration.
  • macOS/Linux: clone the repo and run the provided run.sh (the script sets up a virtual environment, installs dependencies, and checks for ffmpeg/aria2).
  • Local models: supports faster-whisper (recommended for accuracy and timestamp quality) and WhisperCpp; Large-v2 is suggested for good Chinese results.
  • LLM proxy: the project offers an API proxy (https://api.videocaptioner.cn) to simplify access to diverse LLM providers and allow higher concurrency; a short connectivity-check sketch follows this list.
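
Before pointing the GUI at an OpenAI-compatible endpoint (the project's proxy or any other provider), it can help to verify the key and base URL from a script. This is a hedged sketch: the "/v1" suffix and the environment-variable name are assumptions, so check the documentation for the exact values to paste into the settings dialog:

```python
# Quick connectivity check against an OpenAI-compatible endpoint.
# The base URL suffix and env-var name below are assumptions, not
# values confirmed by the VideoCaptioner docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.videocaptioner.cn/v1",  # assumed base path
    api_key=os.environ["VC_API_KEY"],
)

# Listing models is a cheap way to confirm the key and base URL work
# before configuring them in the VideoCaptioner GUI.
for m in client.models.list().data[:10]:
    print(m.id)
```
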
Typical usage scenarios
  • Content creators who need high-quality, translated subtitles for lectures, talks, and short videos.
  • Teams doing bulk subtitling where speed, accuracy and consistent terminology matter.
  • Users requiring offline/off-network transcription for privacy-sensitive material.
Notes & tips
  • For Chinese ASR, use at least the Medium or Large-v2 Whisper variants; for other languages, smaller models may suffice.
  • Enable VAD and audio separation for noisy videos to reduce hallucinations.
  • When using LLM translation, enabling "reflection" (iterative translation optimization) improves quality but increases token usage and latency; a small sketch of the idea follows.
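
The sketch below illustrates the reflection idea in its simplest form: translate, have the model critique its own draft, then revise. The prompts, model name and client setup are assumptions rather than VideoCaptioner's actual implementation, but the extra round trips show why reflection costs more tokens and time:

```python
# Reflection-style translation sketch: draft -> critique -> revision.
# Model name and prompts are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["LLM_API_KEY"])
MODEL = "gpt-4o-mini"  # any OpenAI-compatible model

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

line = "The gradient vanishes as the network gets deeper."

draft = chat(f"Translate this subtitle line into Chinese:\n{line}")
critique = chat(
    f"Source: {line}\nTranslation: {draft}\n"
    "List any accuracy, fluency or terminology problems."
)
final = chat(
    f"Source: {line}\nDraft translation: {draft}\n"
    f"Critique: {critique}\nProduce only the improved translation."
)

print(final)
```
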
Project info & community
  • Repository owner: WEIFENG2333
  • GitHub metadata: repository created 2024-10-31; roughly 12,218 stars at the time of writing.
  • Documentation: https://weifeng2333.github.io/VideoCaptioner/
  • Releases and packaged executables available on the GitHub Releases page.

Information

  • Website: github.com
  • Author: WEIFENG2333
  • Published: 2024/10/31