Prefill–decode disaggregation can leave decode workers balanced by request count but imbalanced by expert-weight loading: two workers with the same number of requests may still differ in latency if their batches activate different expert sets. ELDR reduces this hidden skew by predicting, from a request's prefill activations, which experts the request will use, and then routing the request to a decode worker that is both a good signature match and currently lightly loaded. Weight loads stay localized, and repeated expert fetches are reduced.
Key Findings
- Median time per output token (TPOT) falls by 5.9–13.9% across deployments of up to 40 GPUs, evaluated on three MoE models and two workloads. Outputs are unchanged, so the decode-latency gain comes without altering generation results.
- Signature-based locality yields fewer distinct experts per worker batch, cutting per-step weight loads; offline balanced K-means partitions signature space so workers get complementary signature regions, while online locality-band routing picks the least-loaded worker among top matches.
- A signature cache co-indexed with the KV cache at KV-block granularity preserves exact signatures under prefix caching, enabling consistent routing with cached prefixes and avoiding stale locality estimates.
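The signature-cache point above can be illustrated with a minimal sketch. It is not ELDR's implementation; it assumes a signature is a normalized vector of per-expert activation counts, and that the KV cache exposes a stable block id (for example, a prefix hash) that can double as the signature-cache key. The class name `SignatureCache` and its methods are hypothetical.

```python
import numpy as np


class SignatureCache:
    """Per-block expert activation counts, keyed by the same block id as the KV cache.

    Assumption: a request's signature is the normalized sum of its blocks' counts.
    """

    def __init__(self, num_experts: int) -> None:
        self.num_experts = num_experts
        self._counts: dict[str, np.ndarray] = {}

    def put(self, block_id: str, counts) -> None:
        # Store the counts observed while prefilling this KV block.
        self._counts[block_id] = np.asarray(counts, dtype=np.int64)

    def evict(self, block_id: str) -> None:
        # Intended to be called whenever the KV cache evicts the same block,
        # so the two caches never disagree about which prefixes are present.
        self._counts.pop(block_id, None)

    def signature(self, block_ids: list[str]) -> np.ndarray:
        """Sum counts over a request's blocks and normalize to frequencies."""
        total = np.zeros(self.num_experts, dtype=np.int64)
        for block_id in block_ids:
            # A KeyError here means the block is not cached and needs a real prefill.
            total += self._counts[block_id]
        denom = total.sum()
        return total / denom if denom > 0 else total.astype(np.float64)


cache = SignatureCache(num_experts=4)
cache.put("b0", [2, 0, 1, 1])  # shared prefix block
cache.put("b1", [0, 3, 1, 0])  # suffix block of request 1
cache.put("b2", [1, 1, 1, 1])  # suffix block of request 2
sig_req1 = cache.signature(["b0", "b1"])
sig_req2 = cache.signature(["b0", "b2"])  # reuses b0 without recomputing it
```

Because counts are stored per block rather than per request, a request that hits a cached prefix recovers the same signature it would have produced with a full prefill, which is one plausible reading of "exact signatures under prefix caching".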
Who it's for and trade-offs
Great fit if you run PD-disaggregated serving for large MoE LLMs where per-decode-step expert weight transfers materially affect latency: ELDR improves decode latency without changing model outputs. Look elsewhere if your serving setup is not PD-disaggregated, your models are dense (non-MoE), or expert loads are insignificant compared to other bottlenecks (e.g., network or token decoding). ELDR requires collecting prefill activations, maintaining signature state and K-means partitions, and running extra routing logic, so it adds modest bookkeeping and memory overhead in exchange for lower decode weight-transfer costs.
How it works (brief)
From each request's prefill pass, ELDR computes an expert signature that predicts the request's future expert activations. Offline, balanced K-means partitions the signature space across decode workers. Online, the router considers a small locality band of best-matching workers and assigns the request to the least-loaded one in that band. The signature cache is co-indexed with the KV cache at block granularity so cached prefixes preserve exact signature lookups, keeping routing decisions coherent under prefix reuse.
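The online routing step can be sketched as follows. This is a hedged illustration rather than ELDR's actual router: it assumes each decode worker carries a centroid in signature space (as the offline partitioning would produce), uses cosine similarity as the match score, and measures load as in-flight request count. `DecodeWorker`, `route`, and `band_size` are hypothetical names, and the similarity metric and load metric are assumptions not stated in the source.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DecodeWorker:
    name: str
    centroid: np.ndarray  # assumed: signature-space centroid from offline clustering
    load: int = 0         # assumed load metric: in-flight decode requests


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom > 0 else 0.0


def route(signature: np.ndarray, workers: list[DecodeWorker], band_size: int = 2) -> DecodeWorker:
    """Pick the least-loaded worker among the band_size best signature matches."""
    # Rank workers by signature match, best first.
    ranked = sorted(workers, key=lambda w: cosine(signature, w.centroid), reverse=True)
    # The locality band is the top slice of that ranking.
    band = ranked[:band_size]
    # Within the band, load decides; ties keep the better-matching worker.
    chosen = min(band, key=lambda w: w.load)
    chosen.load += 1
    return chosen


workers = [
    DecodeWorker("A", np.array([1.0, 0.0, 0.0, 0.0]), load=5),
    DecodeWorker("B", np.array([0.9, 0.1, 0.0, 0.0]), load=1),
    DecodeWorker("C", np.array([0.0, 0.0, 1.0, 1.0]), load=0),
]
request_signature = np.array([1.0, 0.0, 0.0, 0.0])
picked = route(request_signature, workers, band_size=2)
```

In this example the band is workers A and B; B is picked because it is less loaded, while the idle worker C is ignored because its centroid is a poor match. The band size sets the trade-off: a band of one is pure locality routing, and a band covering every worker degenerates to least-loaded routing.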
