moe

Mixture-of-Experts MLP model for rsl-rl.

Mirrors lav2/runner/skrl/cfg/LAV2_base_moe.py but adapted to rsl-rl's MLPModel interface.

Uses loss-free load balancing: expert bias terms are adjusted after each PPO update based on expert utilisation, without an auxiliary loss term.
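
For intuition, here is a minimal sketch of such a sign-based update, in the spirit of the Auxiliary-Loss-Free Load Balancing reference below. The function name and signature are hypothetical; the real update is applied by PPOMoE from each layer's accumulated routing statistics.

```python
import torch

def loss_free_bias_update(bias: torch.Tensor, expert_counts: torch.Tensor,
                          bias_update_speed: float = 1e-3) -> torch.Tensor:
    """Sketch: nudge per-expert routing biases toward uniform utilisation.

    ``expert_counts[i]`` counts how often expert ``i`` was selected since
    the last PPO update. Under-used experts get a slightly larger bias
    (more likely to be routed to next time); over-used experts a smaller
    one. No gradient or auxiliary loss term is involved.
    """
    load = expert_counts.float() / expert_counts.sum().clamp(min=1)
    error = load.mean() - load  # > 0 for under-used experts
    return bias + bias_update_speed * torch.sign(error)
```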

References
  • https://kexue.fm/archives/10699 (Math details)
  • https://kexue.fm/archives/10757 (Auxiliary-Loss-Free Load Balancing)
  • https://github.com/ambisinister/lossfreebalance (Loss-Free impl)

Classes:

  • MoELayer: Mixture-of-experts layer with sigmoid top-k routing.
  • MoEMLPModel: MLP model whose trunk uses a mixture-of-experts layer.

MoELayer

MoELayer(input_size: int, output_size: int, num_experts: int = 4, k: int = 2, bias_update_speed: float = 0.001)

Bases: Module

Mixture-of-experts layer with sigmoid top-k routing.

Uses unbiased sigmoid gate values for expert weighting and adds an adaptively adjusted bias to the routing logits for load balancing. Bias corrections are accumulated in bias_updates and applied externally (e.g. by lav2.runner.rsl_rl.algorithms.moe.PPOMoE).

Initialize experts, gate network, and load-balancing bias.

Methods:

  • forward: Compute routed expert outputs.

forward

forward(x: torch.Tensor) -> torch.Tensor

Compute routed expert outputs.
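
To make the routing scheme concrete, the following self-contained sketch shows one plausible shape of sigmoid top-k routing. It is not the actual MoELayer code: the class name, the Linear experts, and the dense expert evaluation are illustrative assumptions; only the routing rule (biased logits for selection, unbiased sigmoid gates for weighting) follows the docstrings above.

```python
import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    """Illustrative sigmoid top-k routing; not the actual MoELayer."""

    def __init__(self, input_size: int, output_size: int,
                 num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(input_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(input_size, output_size) for _ in range(num_experts))
        # Load-balancing bias: used only to pick experts, never to weight them.
        self.register_buffer("bias", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                    # (B, num_experts)
        gates = torch.sigmoid(logits)            # unbiased gate values
        # Select top-k experts from the *biased* routing logits...
        _, top_idx = torch.topk(logits + self.bias, self.k, dim=-1)
        # ...but weight their outputs by the *unbiased* sigmoid gates.
        weights = torch.gather(gates, -1, top_idx)                  # (B, k)
        # Dense evaluation of every expert for clarity; a real implementation
        # dispatches each input only to its selected experts.
        all_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, O)
        picked = torch.gather(
            all_out, 1, top_idx.unsqueeze(-1).expand(-1, -1, all_out.size(-1)))
        return (weights.unsqueeze(-1) * picked).sum(dim=1)          # (B, O)
```

A quick shape check: SigmoidTopKRouter(64, 64)(torch.randn(8, 64)) returns a tensor of shape (8, 64).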

MoEMLPModel

MoEMLPModel(obs, obs_groups, obs_set, output_dim, hidden_dims=(64, 64), activation='elu', obs_normalization=False, distribution_cfg=None, num_experts: int = 4, k: int = 2)

Bases: MLPModel

MLP model whose trunk uses a mixture-of-experts layer.

The architecture mirrors the skrl counterpart: Linear → Act → MoELayer → Act → Linear(head).
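
Assuming the default hidden_dims=(64, 64) and ELU activation, that layout would look roughly like the following; the sizes are hypothetical, and the SigmoidTopKRouter sketch above stands in for MoELayer.

```python
import torch.nn as nn

obs_dim, output_dim = 48, 12   # hypothetical sizes
trunk = nn.Sequential(
    nn.Linear(obs_dim, 64),                          # input projection
    nn.ELU(),
    SigmoidTopKRouter(64, 64, num_experts=4, k=2),   # MoE as the middle layer
    nn.ELU(),
    nn.Linear(64, output_dim),                       # head for the distribution
)
```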

Parameters:

  • num_experts (int): Number of experts in the MoE layer. Default: 4
  • k (int): Number of experts routed to per token (top-k). Default: 2

Build the MoE-enhanced trunk and distribution head.