Tag
Explore by tags
Serves interactive, long-lived streaming video-generation sessions by jointly scheduling session placement and GPU autoscaling to meet tight per-chunk latency. Combines migration-aware placement, load-driven autoscaling, coalesced chunk processing, GPU–CPU offloading and NCCL GPU–GPU migration; reports ~37% reductions in worst-case per-chunk latency and GPU operating cost.
Benchmark for evaluating procedural skill evolution in LLM agents: isolates reusable skill bodies, role-specific work surfaces, and hidden oracle assets to measure whether skill refinements transfer across tasks, roles, and model backbones. Includes 382 workplace tasks, 22 skills, and a controlled evaluation protocol.
NVFP4-quantized variant of Qwen3.6-27B that reduces parameter bits from 16 to 4, cutting disk and GPU memory requirements by ~2.5× while keeping comparable benchmark accuracy; ready for vLLM-based inference on NVIDIA hardware and supports long, multimodal contexts.
