Sim2Sim

This guide covers the workflow for taking a policy trained in an RL backend and replaying it in MuJoCo to check whether the policy still behaves correctly outside its original training environment.

That workflow is especially useful because training backends such as Isaac Lab often require significantly more compute than a typical personal workstation can provide. In practice, many users train on shared servers or remote GPU machines, then need a separate lightweight environment for local inspection and validation. A MuJoCo-based Sim2Sim path fills that gap: it lets you bring the exported policy back to a local machine, run quick replay tests, and debug behavior before moving on to more complex transfer or deployment steps.

This is slightly different from the usual Sim2Sim motivation in many robot-learning projects. There, the main concern is often that contact dynamics are implemented differently across simulation engines, and MuJoCo becomes the preferred replay target because its contact modeling is accurate and widely trusted. LAV2 also benefits from that property, especially for tasks where contact behavior matters, but it is not the primary reason for this workflow. Here, the central goal is still to provide a lightweight local validation environment after training has happened in a heavier backend.

In LAV2, the practical sim-to-sim path has three parts:

  1. export the trained policy into a deployable model file
  2. run the policy in MuJoCo through the common play runner
  3. iterate on observation, command, and timing alignment until the transferred behavior is trustworthy

sequenceDiagram
  autonumber
  participant Trainer as RL backend
  participant Exporter as Exporter
  participant Runner as MuJoCo play runner
  participant Policy as Exported policy
  participant Sim as MuJoCo

  Trainer->>Exporter: Export trained checkpoint
  Exporter-->>Runner: Load exported model file
  Runner->>Sim: Read state and sensors
  Runner->>Runner: Reconstruct observation
  Runner->>Policy: Run inference
  Policy-->>Runner: Return normalized action
  Runner->>Runner: Apply mapping and command interpretation
  Runner->>Sim: Step rollout with mapped command
  Runner->>Runner: Check timing and behavior alignment

1. Export The Trained Policy

The first step is to export the trained policy into a model format that can be loaded independently of the original training runtime.

In practice this is usually straightforward if the training library already provides an exporter:

  • Isaac Lab workflows using rsl_rl can export trained policies through the corresponding tooling in that stack
  • LAV2's skrl path already integrates export support inside lav2.runner.skrl.isaaclab

The LAV2 skrl runner supports exporting both TorchScript (.pt) and ONNX (.onnx) policies. Those exported files are the artifacts expected by the MuJoCo-side play scripts and by lav2.runner.skrl.eval.

2. Replay In MuJoCo

Once you have an exported policy file, the next step is to test it in MuJoCo using the common play utilities under lav2.runner.common.mujoco.

The repository already provides two entry scripts for this:

  • scripts/sim2sim/LAV2_base_play.py
  • scripts/sim2sim/LAV2_base_vel_play.py

These scripts wrap MuJoCoPlayConfig, MuJoCoPlayRunner, and the policy loader in lav2.runner.skrl.eval.

Typical usage looks like:

uv run python scripts/sim2sim/LAV2_base_play.py --checkpoint /path/to/policy.pt
uv run python scripts/sim2sim/LAV2_base_vel_play.py --checkpoint /path/to/policy.onnx

At this stage, MuJoCo replay is not limited to one fixed command source. You can test the policy with:

  • keyboard commands
  • gamepad commands
  • specified trajectories through the trajectory hook in MuJoCoPlayConfig

That makes MuJoCo replay useful both as a quick deployment sanity check and as a richer qualitative testbed for transferred policies.
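A scripted trajectory is often the most reproducible of the three command sources. The exact hook signature in MuJoCoPlayConfig is not shown here; the sketch below only illustrates the idea of a time-parameterized command function returning a planar velocity command.

```python
# Hypothetical trajectory-style command source: a function of elapsed time
# returning (vx, vy, yaw_rate). The real MuJoCoPlayConfig hook may differ.
import math

def sweep_command(t: float) -> tuple[float, float, float]:
    """Oscillating forward speed with a slow yaw sweep."""
    vx = 0.5 * math.cos(0.2 * t)        # forward speed oscillates around 0
    vy = 0.0                            # no lateral command
    yaw_rate = 0.4 * math.sin(0.4 * t)  # heading sweeps back and forth
    return vx, vy, yaw_rate
```

Because the command sequence is deterministic, two replays of the same policy can be compared step for step.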

What To Check First During Replay

Start by checking that the model loads, the observation size matches the training setup, and the policy produces bounded actions. Then verify timing, command mapping, and qualitative tracking behavior.
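Those first checks can be reduced to a few assertions on the very first inference. A minimal sketch, with illustrative values rather than LAV2 defaults:

```python
# Minimal first-replay sanity check: observation size matches training,
# and the first actions are finite and bounded. Values are illustrative.
import math

def check_first_inference(obs: list[float], actions: list[float],
                          expected_obs_dim: int,
                          action_limit: float = 1.0) -> None:
    assert len(obs) == expected_obs_dim, (
        f"observation size {len(obs)} != training size {expected_obs_dim}")
    assert all(math.isfinite(a) for a in actions), "non-finite action"
    assert all(abs(a) <= action_limit for a in actions), "action out of bounds"
```

Running this once on the first policy step catches most export and wiring mistakes before any qualitative evaluation.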

3. Do Alignment Testing

A policy that merely runs in MuJoCo is not, by itself, evidence of a successful transfer. The more important step is verifying that the deployment-side assumptions still match what the policy saw during training.
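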

The main alignment items to test are:

  • control frequency and decimation
  • normalized action mapping and command interpretation
  • observation construction and ordering

Control Frequency

The deployment loop must respect the trained policy's effective action rate. That includes both the MuJoCo-side control decimation in MuJoCoPlayConfig and the original training-side assumptions about control update frequency.

If the policy was trained at one control rate and replayed at another, the result can look like poor transfer even when the model itself is fine.
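The relationship is simple enough to write down: the effective action rate is the simulator rate divided by the control decimation, so the replay side must pick its decimation to match the trained rate. The numbers below are illustrative, not LAV2 defaults.

```python
# Sketch: effective action rate and the decimation needed to reproduce it.
# sim_dt and rate values are illustrative, not LAV2 defaults.

def policy_rate_hz(sim_dt: float, decimation: int) -> float:
    """Effective policy/action rate given simulator dt and decimation."""
    return 1.0 / (sim_dt * decimation)

def matching_decimation(sim_dt: float, trained_rate_hz: float) -> int:
    """Decimation the replay side needs to match the trained action rate."""
    return round(1.0 / (sim_dt * trained_rate_hz))

# Training: sim at 200 Hz (dt = 0.005) with decimation 4 -> 50 Hz actions.
# Replay: sim at 500 Hz (dt = 0.002) must use decimation 10 to stay at 50 Hz.
```

If the two rates disagree, fix the decimation first; nothing downstream is meaningful until the action rate matches.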

Normalization And Mapping

The action coming out of the exported model is usually still in the normalized space expected by the training environment. It must therefore be mapped back into the controller or actuator space used in MuJoCo.

In LAV2 this is handled through the action and command logic inside lav2.runner.common.mujoco, together with the controller-side mapping utilities such as lav2.controller.mapping.

This is one of the first places to inspect when a policy appears numerically stable but produces obviously wrong motion.
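One common convention, shown here only as a sketch, maps a normalized action into a joint-position target as an offset from a default pose, clipped to joint limits. The actual mapping in lav2.runner.common.mujoco and lav2.controller.mapping may use different scales and clipping conventions.

```python
# Sketch of one common convention for de-normalizing an action:
# target = default pose + scale * action, clipped to joint limits.
# Scale and clipping here are illustrative, not LAV2's actual mapping.

def map_action(action: float, default_q: float, scale: float,
               q_min: float, q_max: float) -> float:
    """Map a normalized action to a clipped joint-position target."""
    target = default_q + scale * action
    return min(max(target, q_min), q_max)
```

A sign flip or wrong scale here is the classic cause of a policy that is numerically stable but moves wrongly.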

Observation Alignment

The MuJoCo replay observation must match the training observation in:

  • channel meaning
  • channel ordering
  • frame convention
  • normalization assumptions

This is why the default observation extraction inside MuJoCoPlayRunner matters so much. If the replay observation differs from the training observation, the policy is not really being tested on the same task anymore.
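One way to keep that alignment auditable is to assemble the replay observation in one place, with the ordering spelled out explicitly. The layout below is a hypothetical example of the pattern, not LAV2's actual observation.

```python
# Sketch: assemble the observation in one explicit, documented order so the
# layout can be audited against the training config. The channel layout
# below is a hypothetical example, not LAV2's actual observation.

def build_observation(base_ang_vel: list[float],
                      projected_gravity: list[float],
                      command: list[float],
                      joint_pos: list[float],
                      joint_vel: list[float],
                      last_action: list[float]) -> list[float]:
    obs: list[float] = []
    obs += base_ang_vel       # 3: body-frame angular velocity
    obs += projected_gravity  # 3: gravity direction in body frame
    obs += command            # 3: (vx, vy, yaw_rate) command
    obs += joint_pos          # N: joint positions (offsets from defaults)
    obs += joint_vel          # N: joint velocities
    obs += last_action        # N: previous normalized action
    return obs
```

Comparing `len(build_observation(...))` against the training observation size is a cheap first guard against silent reordering or missing channels.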

Goal Of Sim2Sim

The goal of this stage is not just to make the policy run. The goal is to make the policy run under a replay stack whose timing, mappings, and observations are close enough to the training setup that the result is meaningful.

That is the minimum standard before moving on to more difficult transfer or deployment work.

API Cross-References