AIAny - Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Why this matters

Most MoE routers select experts by simple similarity between router rows and tokens, but there has been no principled way to make each router row actually encode the expressive structure of its associated expert. The paper's core insight is that aligning a router row with the expert's principal singular direction gives the most informative representative vector for token–expert affinity, and that this alignment can be driven efficiently during training.

Key Findings

Manifold Power Iteration (MPI): the authors introduce a "power‑then‑retract" update that performs a power iteration step on router weights followed by a retraction enforcing a norm constraint. So what: the step nudges router rows toward principal singular vectors while keeping stability and computational cost manageable.
Theoretical convergence: they provide analysis showing MPI drives router rows to converge to the principal singular directions of their associated expert matrices. So what: gives a principled justification for the redesign rather than a purely heuristic modification.
Empirical pretraining across scales: reported experiments pretrain MoE models from ~1B to ~11B parameters and show that routers trained with MPI yield better token‑expert alignment and improved downstream pretraining behavior. So what: the method scales and can be integrated into large MoE pretraining pipelines.
Practical constraint handling: the retraction enforces norm constraints that stabilize training, reducing instability that can arise from unconstrained power iterations.
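The convergence claim mirrors the textbook behavior of power iteration. The derivation below is a sketch of that standard argument, not the paper's proof. It assumes a fixed expert matrix W, a power step that multiplies the router row by W^T W, and a retraction that rescales to norm ρ. All symbols are illustrative and may differ from the paper's notation.

```latex
% Assumed notation: W = \sum_j \sigma_j u_j v_j^\top with \sigma_1 > \sigma_2 \ge \dots \ge 0,
% and an initial router row r_0 = \sum_j c_j v_j with c_1 \neq 0.
\[
r_{t+1} = \rho\,\frac{W^\top W\, r_t}{\lVert W^\top W\, r_t \rVert}
\quad\Longrightarrow\quad
r_t = \rho\,\frac{\sum_j \sigma_j^{2t}\, c_j\, v_j}{\bigl\lVert \sum_j \sigma_j^{2t}\, c_j\, v_j \bigr\rVert}
\]
\[
\frac{\lvert \langle r_t, v_j \rangle \rvert}{\lvert \langle r_t, v_1 \rangle \rvert}
= \frac{\lvert c_j \rvert}{\lvert c_1 \rvert}\left(\frac{\sigma_j}{\sigma_1}\right)^{2t}
\longrightarrow 0 \quad \text{as } t \to \infty \quad (j \ge 2).
\]
```

Under these assumptions, every component other than the one along v_1 decays geometrically, at a rate set by the gap between the two largest singular values. The retraction only fixes the norm and does not change the direction.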

Who it's for and tradeoffs

Great fit if you: work on large-scale MoE architectures, care about improving routing quality and token‑expert matching, or operate MoE pretraining pipelines at billion‑parameter scales.

Look elsewhere if you: need a plug‑and‑play inference optimization for already trained MoE models (MPI is applied through training updates), or cannot afford extra per‑update computation. Although the paper emphasizes MPI's efficiency, it still adds computation compared to vanilla router updates.

Where it fits

This paper sits at the intersection of model architecture design and training methodology for sparse expert models. It is most relevant to researchers and engineers optimizing MoE routing rules and those building large sparse foundation models where expert specialization and reliable routing matter.

Method sketch

MPI alternates two steps: a power‑iteration‑like step that amplifies the dominant singular direction of the expert-related representation, and a geometric retraction that maps the result back onto the manifold defined by the norm constraint. Conceptually, this converts router rows from arbitrary proxies into vectors that mathematically summarize each expert's most expressive direction, improving dot‑product based affinity between tokens and experts.
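The power-then-retract idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a single fixed expert matrix `W`, a power step of the form `W.T @ W @ r`, and a retraction that rescales to a sphere of radius `radius`. The function names `power_then_retract` and `alignment`, and the synthetic `W`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def power_then_retract(r, W, radius=1.0):
    """One illustrative power-then-retract step (assumed form, not the paper's exact rule).

    r: router row, shape (d,)
    W: expert weight matrix, shape (m, d)
    """
    # Power step: multiplying by W^T W amplifies the dominant right singular direction.
    powered = W.T @ (W @ r)
    # Retraction: rescale back onto the sphere of the given radius (the norm constraint).
    return radius * powered / np.linalg.norm(powered)


def alignment(r, W):
    """Absolute cosine between r and the principal right singular vector of W."""
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    v1 = vt[0]  # unit-norm principal right singular vector
    return abs(r @ v1) / np.linalg.norm(r)


rng = np.random.default_rng(0)
d, m = 16, 32

# Synthetic expert matrix with a known spectrum, so the top singular value is well separated.
U, _ = np.linalg.qr(rng.standard_normal((m, d)))  # orthonormal columns, shape (m, d)
V, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal, shape (d, d)
sigma = np.concatenate([[3.0], np.linspace(1.0, 0.1, d - 1)])
W = U @ np.diag(sigma) @ V.T

r = rng.standard_normal(d)  # arbitrary initial router row
before = alignment(r, W)
for _ in range(30):
    r = power_then_retract(r, W)
after = alignment(r, W)

print(f"alignment before: {before:.3f}, after: {after:.6f}")
print(f"router row norm after retraction: {np.linalg.norm(r):.6f}")
```

In this toy setting the expert matrix is frozen, so the router row simply converges to the principal direction. In actual training the expert weights change between updates, and how the step is interleaved with gradient updates is a detail this sketch does not try to reproduce.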

More Items

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Beyond Euclidean Clipping: Overcoming Exploration Collapse in LLM RL via Riemannian Isometric Policy Optimization

Scaling Laws for Hypernetwork-Based Knowledge Injection in Large Language Models