AIAny - ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

Prefill–decode disaggregation can leave decode workers balanced by request count but imbalanced by expert-weight loading: two workers holding the same number of requests may still differ in latency if their batches activate different expert sets. ELDR reduces this hidden skew by predicting the experts a request will use from its prefill activations and routing the request to a decode worker that is both a good signature match and currently lightly loaded, so that weight loads are localized and repeated expert fetches are reduced.
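The hidden skew can be made concrete with a toy count. The sketch below is illustrative only: the expert IDs, the batch contents, and the helper name distinct_experts are assumptions, not taken from ELDR. Each request is modeled as the set of experts it activates in one decode step.

```python
def distinct_experts(batch):
    """Count the distinct experts a decode batch activates in one step."""
    return len(set().union(*batch)) if batch else 0

# Hypothetical batches: both workers hold four requests.
worker_a = [{0, 1}, {0, 1}, {1, 2}, {0, 2}]  # requests share experts
worker_b = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]  # requests share nothing

print(len(worker_a), distinct_experts(worker_a))
print(len(worker_b), distinct_experts(worker_b))
```

Both workers look identical to a request-count balancer, yet the second must load weights for more than twice as many experts per step.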

Key Findings

Median time per output token (TPOT) falls by 5.9–13.9% across deployments of up to 40 GPUs, evaluated on three MoE models and two workloads. Generation outputs are unchanged, so the latency gain comes with no loss of correctness.
Signature-based locality yields fewer distinct experts per worker batch, cutting per-step weight loads; offline balanced K-means partitions signature space so workers get complementary signature regions, while online locality-band routing picks the least-loaded worker among top matches.
A signature cache co-indexed with the KV cache at KV-block granularity preserves exact signatures under prefix caching, enabling consistent routing with cached prefixes and avoiding stale locality estimates.
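One way to picture the block-granular signature cache is sketched below. It assumes, purely for illustration, that a signature is a per-expert activation count that adds up across KV blocks; the class name SignatureCache, the block keys, and the fresh argument are hypothetical and are not ELDR's actual data structures.

```python
from collections import Counter

class SignatureCache:
    """Toy per-block store of expert activation counts (hypothetical design)."""

    def __init__(self):
        self._by_block = {}  # block key -> Counter of expert activations

    def signature(self, block_keys, fresh):
        """Sum per-block counts for a request.

        block_keys: keys of the request's KV blocks, in order.
        fresh: counts for blocks that were actually prefilled this time.
        Blocks served from the prefix cache reuse their stored counts.
        """
        total = Counter()
        for key in block_keys:
            if key not in self._by_block:
                self._by_block[key] = Counter(fresh[key])
            total.update(self._by_block[key])
        return total

cache = SignatureCache()
first = cache.signature(["b0", "b1"], {"b0": {3: 2, 7: 1}, "b1": {3: 1}})
# Second request reuses block b0 from the prefix cache and prefills only b2.
second = cache.signature(["b0", "b2"], {"b2": {5: 4}})
```

Under this assumption, the second request still gets a signature covering its whole prompt even though the shared prefix was never recomputed.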

Who it's for and trade-offs

Great fit if you run PD-disaggregated serving for large MoE LLMs where per-decode-step expert weight loads materially affect latency: ELDR improves decode latency without changing model outputs. Look elsewhere if your serving setup is not PD-disaggregated, your models are dense (non-MoE), or expert weight loads are insignificant compared to other bottlenecks (e.g., network or token decoding). ELDR requires collecting prefill activations, maintaining signature state and K-means partitions, and running extra routing logic, so it adds modest bookkeeping and memory overhead in exchange for lower decode weight-transfer costs.

How it works (brief)

From each request's prefill pass ELDR computes an expert signature predicting future expert activations. Offline, balanced K-means partitions signatures across decode workers. Online, the router considers a small locality band of best-matching workers and assigns the request to the least-loaded one in that band. The signature cache is co-indexed with the KV cache at block granularity so cached prefixes preserve exact signature lookups, keeping routing decisions coherent under prefix reuse.
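The online step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: each worker is represented by one centroid from the offline partition, the match score is cosine similarity, load is a single number per worker, and the names route and band_size are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(signature, centroids, loads, band_size=2):
    """Pick the least-loaded worker among the band_size best matches."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda w: cosine(signature, centroids[w]),
        reverse=True,
    )
    band = ranked[:band_size]  # the locality band
    return min(band, key=lambda w: loads[w])

# Hypothetical state: three workers, signatures over three experts.
centroids = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
loads = [5, 1, 0]
chosen = route([1.0, 0.0, 0.0], centroids, loads, band_size=2)
```

Here worker 2 is the least loaded overall but falls outside the band, so the request goes to worker 1, the lighter of the two good matches. A band of size 1 would ignore load entirely, and a band covering all workers would ignore locality, so the band size sets the trade-off between the two.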
