# moe

Mixture-of-Experts MLP model for rsl-rl.

Mirrors `lav2/runner/skrl/cfg/LAV2_base_moe.py` but adapted to rsl-rl's
`MLPModel` interface.
Uses loss-free load balancing: expert bias terms are adjusted after each PPO update based on expert utilisation, without an auxiliary loss term.
References:
- https://kexue.fm/archives/10699 (Math details)
- https://kexue.fm/archives/10757 (Auxiliary-Loss-Free Load Balancing)
- https://github.com/ambisinister/lossfreebalance (Loss-Free impl)
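The loss-free scheme in the references boils down to a per-expert sign step on the routing bias. A minimal sketch, assuming utilisation is tracked as per-expert token counts; the standalone helper and its name are illustrative, since in this module the corrections are accumulated by `MoELayer` and applied by `PPOMoE`:

```python
import torch

def update_balancing_bias(bias: torch.Tensor, expert_counts: torch.Tensor,
                          bias_update_speed: float = 1e-3) -> torch.Tensor:
    """Sign-step bias update: no gradients, no auxiliary loss term."""
    # Experts that received fewer tokens than average get their routing
    # bias nudged up; over-loaded experts get it nudged down. The gate
    # values used for expert weighting are left untouched.
    error = expert_counts.float().mean() - expert_counts.float()
    return bias + bias_update_speed * torch.sign(error)
```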
Classes:

| Name | Description |
|---|---|
| `MoELayer` | Mixture-of-experts layer with sigmoid top-k routing. |
| `MoEMLPModel` | MLP model whose trunk uses a mixture-of-experts layer. |
## MoELayer

`MoELayer(input_size: int, output_size: int, num_experts: int = 4, k: int = 2, bias_update_speed: float = 0.001)`

Bases: `Module`
Mixture-of-experts layer with sigmoid top-k routing.
Uses unbiased sigmoid gate values for expert weighting and adds a
load-balancing bias to the routing logits when selecting the top-k
experts. Bias corrections are accumulated in the `bias_updates`
attribute and applied externally
(e.g. by `lav2.runner.rsl_rl.algorithms.moe.PPOMoE`).
Initialize experts, gate network, and load-balancing bias.
Methods:

| Name | Description |
|---|---|
| `forward` | Compute routed expert outputs. |
### forward

`forward(x: torch.Tensor) -> torch.Tensor`
Compute routed expert outputs.
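A minimal sketch of what the routing computes, assuming the layer holds a gate `nn.Linear`, an `nn.ModuleList` of experts, and a per-expert `bias` tensor; the attribute names and standalone form here are assumptions, not the actual implementation:

```python
import torch

def moe_forward(layer, x: torch.Tensor) -> torch.Tensor:
    logits = layer.gate(x)                     # (batch, num_experts)
    weights = torch.sigmoid(logits)            # unbiased gate values
    # The load-balancing bias affects only *which* experts are selected;
    # the combination weights stay unbiased.
    _, top_idx = torch.topk(logits + layer.bias, layer.k, dim=-1)
    out = x.new_zeros(x.shape[0], layer.output_size)
    for i, expert in enumerate(layer.experts):
        mask = (top_idx == i).any(dim=-1)      # tokens routed to expert i
        if mask.any():
            out[mask] += weights[mask, i:i + 1] * expert(x[mask])
    return out
```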
## MoEMLPModel

`MoEMLPModel(obs, obs_groups, obs_set, output_dim, hidden_dims=(64, 64), activation='elu', obs_normalization=False, distribution_cfg=None, num_experts: int = 4, k: int = 2)`

Bases: `MLPModel`
MLP model whose trunk uses a mixture-of-experts layer.
The architecture mirrors the skrl counterpart:
Linear → Act → MoELayer → Act → Linear(head).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `num_experts` | `int` | Number of experts in the MoE layer. | `4` |
| `k` | `int` | Number of experts routed to per token (top-k). | `2` |
Build the MoE-enhanced trunk and distribution head.
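A minimal sketch of the trunk layout named above (Linear → Act → MoELayer → Act, with the head attached on top); the standalone builder, its name, and the use of ELU for the default activation are illustrative assumptions:

```python
import torch.nn as nn

def build_moe_trunk(input_dim: int, hidden_dims=(64, 64),
                    num_experts: int = 4, k: int = 2) -> nn.Sequential:
    # MoELayer is the class documented above; the final Linear head /
    # distribution is attached by MLPModel on top of this trunk.
    h0, h1 = hidden_dims
    return nn.Sequential(
        nn.Linear(input_dim, h0),           # Linear
        nn.ELU(),                           # Act
        MoELayer(h0, h1, num_experts, k),   # MoELayer
        nn.ELU(),                           # Act
    )
```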