VideoCaptioner

VideoCaptioner is an AI-powered video subtitling assistant that combines ASR (local or cloud) with LLM-based subtitle segmentation, correction and translation. It supports offline GPU transcription, concurrent chunk transcription, VAD, speaker-aware processing, batch subtitling and one-click subtitle-to-video synthesis, with both GUI and CLI options.

Introduction

VideoCaptioner (Kaka Subtitle Assistant)

Overview

VideoCaptioner is an open-source tool that automates the full subtitle pipeline for videos by combining speech recognition (local or cloud) with large language models (LLMs) for intelligent segmentation, correction and translation. It targets users who need fast, accurate and readable subtitles without heavy configuration — supporting both offline GPU-powered transcription and online APIs.

Key features
  • Multi-mode ASR: supports online endpoints and local Whisper/faster-whisper models, including GPU acceleration (a minimal local-transcription sketch follows this list).
  • LLM-based processing: uses LLMs for smart sentence segmentation, subtitle optimization and high-quality translation (supports integration with OpenAI-compatible services, DeepSeek, SiliconCloud, etc.).
  • VAD and audio separation: voice activity detection and human-voice separation (MDX-Net) to reduce noise and transcription hallucinations.
  • High-precision timestamps: supports word/character-level timestamps for accurate subtitle alignment.
  • Batch processing and concurrency: chunked concurrent transcription with automatic merging, plus batch subtitle synthesis for many videos.
  • Subtitle export and styling: outputs SRT/ASS/VTT/TXT and supports multiple subtitle style templates and soft/hard subtitle synthesis.
  • Lightweight desktop distribution: small packaged executable for Windows (≈60MB), plus macOS/Linux run scripts and a web-style documentation site.
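
As a rough illustration of the local ASR path, the sketch below calls faster-whisper directly with VAD and word-level timestamps enabled. The model size, device and file names are placeholders, and this is not VideoCaptioner's internal code, only the kind of call it automates behind the GUI/CLI:

```python
# Minimal local-transcription sketch with faster-whisper.
# Assumptions: a CUDA GPU, the "large-v2" model, and an audio file
# already extracted from the video (e.g. with ffmpeg).
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "talk.wav",
    word_timestamps=True,   # per-word timing for precise subtitle alignment
    vad_filter=True,        # skip non-speech regions to reduce hallucinations
)

print(f"Detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
    for w in seg.words or []:
        print(f"    {w.start:6.2f} {w.end:6.2f} {w.word}")
```
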
Supported components & workflow
  1. Download/import video (supports many platforms including Bilibili, YouTube, TikTok, X, Douyin, etc.).
  2. Transcription: choose an online API (B/J endpoints) or a local Whisper/faster-whisper model (Tiny/Small/Medium/Large-v2...).
  3. Post-processing with LLM: intelligent segmentation (semantic or sentence-based), correction (punctuation, capitalization, domain terms), and optional translation (LLM-based or Microsoft/Google translation).
  4. Subtitle synthesis: generate subtitles in the desired format and optionally burn them in (hardcode) or mux them as soft subtitles for players; a minimal ffmpeg sketch follows this list.
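
For the synthesis step, burning subtitles into the video frames is typically an ffmpeg job. The snippet below is a hedged sketch of that step driven from Python; the file names are placeholders, and VideoCaptioner applies its own style templates on top of a call like this:

```python
# Hard-subtitle sketch: render an SRT file into the video with ffmpeg.
# Requires an ffmpeg build with libass; paths are placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "input.mp4",
        "-vf", "subtitles=subs.srt",  # burn the subtitles into the frames
        "-c:a", "copy",               # keep the original audio untouched
        "output_hardsub.mp4",
    ],
    check=True,
)

# For soft subtitles in an MP4 container, mux instead of re-encoding:
#   ffmpeg -i input.mp4 -i subs.srt -c copy -c:s mov_text output_softsub.mp4
```
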
Deployment & quick start
  • Windows: download the packaged executable from the Releases page and run it; the GUI guides you through API and model configuration.
  • macOS/Linux: clone the repo and run the provided run.sh (the script sets up a virtual environment, installs dependencies, and checks for ffmpeg/aria2).
  • Local models: supports faster-whisper (recommended for accuracy and timestamp quality) and WhisperCpp; Large-v2 is suggested for good Chinese results.
  • LLM proxy: the project offers an API proxy (https://api.videocaptioner.cn) to simplify access to diverse LLM providers and allow higher concurrency; a short connectivity-check sketch follows this list.
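
Before pointing the GUI at an OpenAI-compatible endpoint (the project's proxy or any other provider), it can help to verify the key and base URL from a script. This is a hedged sketch: the "/v1" suffix and the environment-variable name are assumptions, so check the documentation for the exact values to paste into the settings dialog:

```python
# Quick connectivity check against an OpenAI-compatible endpoint.
# The base URL suffix and env-var name below are assumptions, not
# values confirmed by the VideoCaptioner docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.videocaptioner.cn/v1",  # assumed base path
    api_key=os.environ["VC_API_KEY"],
)

# Listing models is a cheap way to confirm the key and base URL work
# before configuring them in the VideoCaptioner GUI.
for m in client.models.list().data[:10]:
    print(m.id)
```
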
Typical usage scenarios
  • Content creators who need high-quality, translated subtitles for lectures, talks, and short videos.
  • Teams doing bulk subtitling where speed, accuracy and consistent terminology matter.
  • Users requiring offline/off-network transcription for privacy-sensitive material.
Notes & tips
  • For Chinese ASR, use at least the Medium or Large-v2 Whisper variants; for other languages, smaller models may suffice.
  • Enable VAD and audio separation for noisy videos to reduce hallucinations.
  • When using LLM translation, enabling "reflection" (iterative translation optimization) improves quality but increases token usage and latency; a small sketch of the idea follows.
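
The sketch below illustrates the reflection idea in its simplest form: translate, have the model critique its own draft, then revise. The prompts, model name and client setup are assumptions rather than VideoCaptioner's actual implementation, but the extra round trips show why reflection costs more tokens and time:

```python
# Reflection-style translation sketch: draft -> critique -> revision.
# Model name and prompts are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["LLM_API_KEY"])
MODEL = "gpt-4o-mini"  # any OpenAI-compatible model

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

line = "The gradient vanishes as the network gets deeper."

draft = chat(f"Translate this subtitle line into Chinese:\n{line}")
critique = chat(
    f"Source: {line}\nTranslation: {draft}\n"
    "List any accuracy, fluency or terminology problems."
)
final = chat(
    f"Source: {line}\nDraft translation: {draft}\n"
    f"Critique: {critique}\nProduce only the improved translation."
)

print(final)
```
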
Project info & community
  • Repository owner: WEIFENG2333
  • GitHub metadata: repository created 2024-10-31; roughly 12,218 stars at the time of writing.
  • Documentation: https://weifeng2333.github.io/VideoCaptioner/
  • Releases and packaged executables available on the GitHub Releases page.

Information

  • Website: github.com
  • Author: WEIFENG2333
  • Published: 2024/10/31