The surprising idea behind this project is to treat video editing as a text problem. By default the LLM never "watches" frames: it reads packed, word-timed transcripts, reasons about cuts, and requests on-demand visual composites only for ambiguous decisions. That shift reduces noise, keeps the agent focused on audio-driven edit points, and makes automated, production-conscious edits feasible without feeding thousands of frames into the model.
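To make the "packed, word-timed transcript" idea concrete, here is a minimal sketch of how word-level timestamps might be collapsed into compact phrase lines. The `Word` fields, the `pack_transcript` name, the 0.5 s silence threshold, and the output line format are all assumptions for illustration; the project's actual packed format is not described here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float          # seconds from the start of the source
    end: float            # seconds from the start of the source
    speaker: str = "S0"   # hypothetical diarization label


def pack_transcript(words, max_gap=0.5):
    """Group words into phrase lines, breaking on silence or speaker change.

    max_gap is an assumed silence threshold in seconds, not a project default.
    """
    phrases, current = [], []
    for w in words:
        if current and (
            w.start - current[-1].end > max_gap
            or w.speaker != current[-1].speaker
        ):
            phrases.append(current)
            current = []
        current.append(w)
    if current:
        phrases.append(current)

    # One line per phrase: time range, speaker, then the words themselves.
    return "\n".join(
        f"[{p[0].start:.2f}-{p[-1].end:.2f}] {p[0].speaker}: "
        + " ".join(w.text for w in p)
        for p in phrases
    )


sample = [
    Word("So", 0.0, 0.2),
    Word("um", 0.25, 0.5),
    Word("welcome", 1.4, 1.9),
    Word("back", 1.95, 2.2),
]
packed = pack_transcript(sample)
print(packed)
```

A text form like this keeps every phrase boundary addressable by timestamp while staying a small fraction of the size of the raw per-word data, which is the property the transcript-first approach relies on.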
What Sets It Apart
- Transcript-first pipeline: one ElevenLabs Scribe pass per source produces word-level timestamps and diarization; the agent reasons from a compact ~12 KB packed transcript rather than raw frames. This enables precise, speech-aligned cuts (filler removal, false starts) without overwhelming the LLM with visual input.
- On-demand visuals + self-eval: a lightweight timeline_view PNG (filmstrip + waveform + word labels) is produced only at decision points and at cut boundaries in the rendered output, catching visual jumps, audio pops, and subtitle occlusion. The repo caps auto-fix iterations and surfaces previews only once self-eval passes.
- Production-aware defaults: short audio fades at cuts (30 ms), per-segment auto color grading, subtitle burn-in in readable chunks, and a parallelized animation overlay system (HyperFrames/Remotion/Manim/PIL) let the agent deliver a ready-to-upload final.mp4 rather than a rough draft.
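As a rough illustration of how speech-aligned cuts and the 30 ms fades could fit together, the sketch below drops filler words from a word-timed list, merges the remaining words into keep ranges, and builds a fade-in/fade-out filter string for each range. The filler list, the `join_gap` threshold, and both function names are hypothetical. Using ffmpeg's `afade` filter is also an assumption: the text above does not say how the project applies its fades.

```python
FILLERS = {"um", "uh", "erm"}  # hypothetical filler list
FADE_S = 0.030                 # 30 ms fade, matching the default described above


def keep_segments(words, fillers=FILLERS, join_gap=0.15):
    """words: list of (text, start, end) tuples, times in seconds.

    Returns merged (start, end) ranges to keep. A dropped filler always
    forces a cut, even if the surrounding words are close together.
    """
    segments = []
    force_break = False
    for text, start, end in words:
        if text.lower().strip(".,!?") in fillers:
            force_break = True
            continue
        if segments and not force_break and start - segments[-1][1] <= join_gap:
            segments[-1] = (segments[-1][0], end)  # extend the current range
        else:
            segments.append((start, end))          # start a new range
        force_break = False
    return segments


def fade_filters(segment, fade=FADE_S):
    """Build an ffmpeg-style afade chain for one already-trimmed segment.

    Times are relative to the start of the trimmed segment. The fade is
    shortened for segments too brief to hold two full fades.
    """
    start, end = segment
    length = end - start
    d = min(fade, length / 2)
    return (
        f"afade=t=in:st=0:d={d:.3f},"
        f"afade=t=out:st={length - d:.3f}:d={d:.3f}"
    )


words = [
    ("So", 0.0, 0.2),
    ("um,", 0.25, 0.5),
    ("welcome", 0.55, 1.0),
    ("back", 1.05, 1.4),
]
segments = keep_segments(words)
filters = [fade_filters(s) for s in segments]
```

Fading each kept range in and out is one simple way to avoid the audio pops that hard cuts can produce; a real pipeline would likely also pad cut points slightly so that word onsets are not clipped.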
Who It's For and Trade-offs
It is a great fit if you regularly edit talking-head videos, tutorials, interviews, or other speech-driven content and want to offload repetitive assembly and tidy-up tasks to an agent (filler removal, consistent grades, subtitles, basic overlays). It removes much of the manual timing work and enforces production rules (no hidden subtitles, audio fades, cut sanity checks). Look elsewhere if your workflow demands frame-accurate creative grading, complex manual compositing, or bespoke shot selection driven primarily by visuals. The system prioritizes audio-first decisions and uses visuals sparingly for verification, so fully visual-driven editorial choices may require manual intervention.
Where It Fits
Use this as an automation layer between raw takes and a human pass: it handles noisy takes, consolidates good phrases, and outputs a coherent first-pass final that a human editor can refine. It’s especially useful when you need fast turnarounds on educational or presenter-driven content and want consistent subtitles, safe cuts, and reproducible grading across episodes.
