Streaming video generation shifts the workload model from one-shot batch jobs to long-lived interactive sessions that must preserve per-session state and deliver each generated chunk within a tight latency bound. The paper's central insight is that placement and GPU budgeting cannot be treated independently: a closed-loop controller that continuously rebalances sessions and adapts GPU capacity keeps individual GPUs from becoming bottlenecks during bursts while avoiding wasted capacity during idle periods.
Key Findings
- Joint closed-loop scheduling (migration-aware placement + load-driven autoscaling) reduces worst-case per-chunk latency significantly, because it actively migrates sessions away from transient bottlenecks instead of leaving long-running placements fixed.
- Runtime mechanisms (coalesced chunk processing for intra-GPU batching, GPU–CPU offloading for idle-session suspension/resumption, and NCCL-based GPU–GPU migration) improve GPU utilization and enable fast, low-overhead rebalancing so latency targets are met without excessive capacity.
- On production traces across multiple model sizes and clusters up to 64 NVIDIA B300 GPUs, the approach cuts worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average versus baseline configurations, demonstrating a better latency–cost tradeoff in dynamic workloads.
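To make the coalescing idea in the second finding concrete, here is a minimal sketch of intra-GPU batching. It assumes each ready request is a (session id, GPU id) pair and that a GPU can process at most `max_batch` sessions per forward pass. The function name `coalesce_chunks` and both parameters are hypothetical and do not come from the paper, which may batch quite differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coalesce_chunks(
    ready: Iterable[Tuple[str, int]], max_batch: int
) -> Dict[int, List[List[str]]]:
    """Group ready (session_id, gpu_id) requests into per-GPU batches.

    Each batch holds at most `max_batch` sessions and preserves arrival order.
    Illustrative only: not the paper's implementation.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    per_gpu: Dict[int, List[str]] = defaultdict(list)
    for session_id, gpu_id in ready:
        per_gpu[gpu_id].append(session_id)
    # Split each GPU's queue into consecutive slices of size max_batch.
    return {
        gpu_id: [sids[i:i + max_batch] for i in range(0, len(sids), max_batch)]
        for gpu_id, sids in per_gpu.items()
    }


batches = coalesce_chunks([("s1", 0), ("s2", 0), ("s3", 1), ("s4", 0)], max_batch=2)
```

Running one batched pass per slice, rather than one pass per session, is what lets co-located sessions share a GPU step; the batch cap stands in for whatever memory or latency limit a real deployment would impose.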
Who it's for and trade-offs
A good fit if you operate multi-user, multi-GPU services that generate and stream video in real time and need to balance tight per-chunk latency against cost (e.g., personalized content, interactive media, cloud video features). Look elsewhere, or simplify, if your workload is primarily one-shot offline generation or runs on a single GPU, or if you cannot support NCCL-based GPU–GPU migration or GPU–CPU state offload. The system adds orchestration complexity and depends on fast migration and reliable runtime telemetry.
Where it fits
This paper targets the operational layer of generative-video services, positioned between model implementations and cloud infrastructure autoscaling. It contrasts with the static provisioning or request-level scheduling used in traditional LLM or one-shot image generation by treating long-lived sessions and temporal demand bursts as first-class scheduling constraints.
Methods (brief)
The system formulates an online scheduling problem that jointly controls session placement and GPU provisioning. The placement controller runs event-driven min–max rebalancing to reduce the maximum per-chunk latency; the autoscaler adjusts the GPU budget using runtime load feedback. Together with coalesced chunk execution and state-migration primitives, this enables both latency stability and cost efficiency without changing the generation models themselves.
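The two control loops described above can be illustrated with a small sketch. It is not the paper's algorithm: the greedy heuristic, the use of summed per-session load as a proxy for per-chunk latency, and every name here (`rebalance`, `target_gpu_count`, `per_gpu_capacity`, `target_util`, `max_moves`) are assumptions made for illustration. The default cap of 64 GPUs simply mirrors the largest cluster in the evaluation.

```python
import math
from typing import Dict, List, Tuple


def rebalance(
    placement: Dict[str, int],
    load: Dict[str, float],
    num_gpus: int,
    max_moves: int = 8,
) -> List[Tuple[str, int, int]]:
    """Greedy min-max rebalancing (hypothetical heuristic).

    Moves sessions from the most-loaded GPU to the least-loaded one while the
    move strictly lowers the peak of that pair. Mutates `placement` in place
    and returns the migrations as (session, src_gpu, dst_gpu) tuples.
    """
    gpu_load = [0.0] * num_gpus
    for sid, gpu in placement.items():
        gpu_load[gpu] += load[sid]

    moves: List[Tuple[str, int, int]] = []
    for _ in range(max_moves):
        hot = max(range(num_gpus), key=gpu_load.__getitem__)
        cold = min(range(num_gpus), key=gpu_load.__getitem__)
        if hot == cold:
            break  # all GPUs equally loaded (or only one GPU)
        best = None  # (resulting pair peak, session id)
        for sid, gpu in placement.items():
            if gpu != hot:
                continue
            peak = max(gpu_load[hot] - load[sid], gpu_load[cold] + load[sid])
            if peak < gpu_load[hot] and (best is None or peak < best[0]):
                best = (peak, sid)
        if best is None:
            break  # no single migration improves the hot/cold pair
        sid = best[1]
        placement[sid] = cold
        gpu_load[hot] -= load[sid]
        gpu_load[cold] += load[sid]
        moves.append((sid, hot, cold))
    return moves


def target_gpu_count(
    total_load: float,
    per_gpu_capacity: float,
    target_util: float = 0.7,
    min_gpus: int = 1,
    max_gpus: int = 64,
) -> int:
    """Load-driven autoscaling target: enough GPUs to run at `target_util`."""
    needed = math.ceil(total_load / (per_gpu_capacity * target_util))
    return max(min_gpus, min(max_gpus, needed))


placement = {"a": 0, "b": 0, "c": 0, "d": 1}
load = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0}
moves = rebalance(placement, load, num_gpus=2)
gpus = target_gpu_count(total_load=8.0, per_gpu_capacity=4.0, target_util=0.5)
```

In this example GPU 0 starts with load 7 and GPU 1 with load 1; migrating session `a` equalizes both at 4, after which no further move helps. A real controller would also weigh migration cost and trigger on events such as session arrival or a latency violation, which this sketch leaves out.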
