Quadruped and humanoid policies trained with our RSL-RL-SAC, deployed on rough terrain.
Anymal D: Rough Terrain
Unitree H1: Rough Terrain
Our implementation is an extension of RSL-RL, the runner library used by most IsaacLab legged-robot baselines. It uses the same APIs, the same environment wrappers, the same training script, and the same logging; only the agent configuration file changes. If you already train with PPO on RSL-RL, switching to our RSL-RL-SAC is one config swap:
# Train with PPO (as today)
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
--task Isaac-Velocity-Rough-Anymal-D-v0 \
--agent rsl_rl_cfg_entry_point
# Train with our RSL-RL-SAC (same pipeline, different agent)
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
--task Isaac-Velocity-Rough-Anymal-D-v0 \
--agent rsl_rl_sac_cfg_entry_point
Because the integration sits inside RSL-RL, everything else the library offers carries over:
One minor additional dependency: training requires our IsaacLab fork linked in the header, which includes a small set of changes on top of the standard installation. Alternatively, the single relevant commit can be cherry-picked into your existing setup.
Code: leggedrobotics/rsl_rl_sac.
Training legged robots has converged on a single recipe: PPO, massively parallel simulation, done. It works well enough in the lab, but this approach carries a hidden cost that only becomes visible once a robot leaves the simulator. PPO is on-policy: it discards every batch of experience after a few epochs of gradient updates. That's acceptable when spinning up 4096 environments costs nothing, but it's a poor foundation for the workflows that actually matter in deployment: fine-tuning on the real hardware, adapting to a new payload, or learning from the few thousand interactions a robot can safely collect in the field. For those settings, you need an algorithm that squeezes value out of every transition it sees.
SAC is the obvious candidate. It maintains a replay buffer, reuses data across many updates, and has dominated benchmarks in domains where sample efficiency counts. The problem is that it has consistently failed to match PPO when both run inside the same massively parallel setup: slower to converge, less stable, occasionally not converging at all. That gap has been treated as a fundamental incompatibility, and so the sim-to-online story keeps getting deferred. We think the gap is a handful of fixable bugs, not a deep algorithmic mismatch.
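The failures aren't fundamental; they're a handful of implementation details that quietly compound: action bounds that are unscaled or far too wide, episode timeouts treated as terminal states, noisy one-step TD targets, and an initial policy std that is too large. Some have been pointed to before; others we identified along the way. Together, they make SAC look broken when it isn't.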
None of these require rethinking SAC from scratch. Fix each one, and the same algorithm that "didn't work" in this domain matches PPO across the board.
Each of the failure modes above maps to a small, targeted change.
What's wrong: SAC squashes actions through a tanh scaled to \([a_{\min}, a_{\max}]\), so getting those bounds right matters. Depending on the environment wrapper, IsaacLab leaves them either unscaled, which constrains the policy to \((-1, 1)^d\) and produces joint displacements that are too small, or set to arbitrarily large values that far exceed anything a trained policy would actually use. In the latter case the squashed Gaussian rescales to fill the whole declared range, pushing most probability mass toward the saturation limits and making early exploration nearly uniform over that range.
What we do: for each joint \(j\), we compute per-joint bounds directly from the robot's soft joint limits \(q_j^{\min}, q_j^{\max}\) and default position \(q_j^0\), rescaled by the action-manager scale \(s\): \[ a_j^{\min} = -\frac{|q_j^{\min} - q_j^0|}{s}, \qquad a_j^{\max} = \frac{|q_j^{\max} - q_j^0|}{s}. \] We use soft limits rather than hard ones because using hard limits would enlarge the action space to include joint configurations the controller never reaches in practice, adding suboptimal regions that the policy must learn to avoid. Crucially, because \(q_j^0\) is rarely centered between the limits, \(|a_j^{\min}| \neq |a_j^{\max}|\) in general: the bounds are asymmetric, and computing them per-joint captures that asymmetry rather than collapsing it into a single symmetric value. The per-joint squashed action then maps exactly onto that safe range: \[ a_j = \frac{a_j^{\max} + a_j^{\min}}{2} + \frac{a_j^{\max} - a_j^{\min}}{2}\,\tanh(x_j). \]
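The two formulas above can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: the function names `action_bounds` and `squash` and the two-joint limits are hypothetical, and lists stand in for the batched tensors a real implementation would use.

```python
import math


def action_bounds(q_min, q_max, q_default, scale):
    """Per-joint action bounds from soft limits, default pose, and action scale."""
    a_min = [-abs(lo - q0) / scale for lo, q0 in zip(q_min, q_default)]
    a_max = [abs(hi - q0) / scale for hi, q0 in zip(q_max, q_default)]
    return a_min, a_max


def squash(x, a_min, a_max):
    """Map unbounded pre-activations x onto [a_min, a_max] per joint via tanh."""
    return [
        0.5 * (hi + lo) + 0.5 * (hi - lo) * math.tanh(xj)
        for xj, lo, hi in zip(x, a_min, a_max)
    ]


# Hypothetical two-joint example (illustrative values, not a real robot).
q_min, q_max, q_default = [-0.6, -1.0], [0.4, 2.0], [0.0, 0.5]
a_min, a_max = action_bounds(q_min, q_max, q_default, scale=0.5)

# The first joint ends up with asymmetric bounds because its default
# position is not centered between its soft limits.
print(a_min, a_max)
print(squash([50.0, -50.0], a_min, a_max))  # saturates at a_max[0] and a_min[1]
```

Because the bounds are computed per joint, no joint is forced to share the widest range in the robot.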
What's wrong — and why it's worse for SAC than for PPO: simulation episodes end on a fixed step budget, not because the task failed. Treating those cutoffs as terminal states tells the critic "future value is zero", even when the robot was walking fine at the cutoff. PPO can handle this approximately by evaluating the current value function at the timeout state before discarding the rollout, because on-policy transitions are consumed immediately and the value estimate is always fresh. SAC has no such luxury: transitions sit in a replay buffer and are sampled repeatedly across many future updates, long after the policy and critic that collected them have moved on. This creates two compounding sources of staleness: the bootstrap value drifts, and the Q-function must be evaluated on actions from the current policy, not the one that was active when the transition was collected.
What we do: we separate a timeout mask \(b_t\) from a failure mask \(d_t\), and build the corrected next observation before storing the transition: \[ \tilde{s}_{t+1} = b_t \odot s_{t+1}^{\mathrm{p}} + (1 - b_t) \odot s_{t+1}, \] where \(s_{t+1}^{\mathrm{p}}\) is the observation just before the environment resets. The replay buffer stores the tuple \((s_t, a_t, r_t, \tilde{s}_{t+1}, d_t, b_t)\) with this corrected \(\tilde{s}_{t+1}\) already in place. At sampling time, both the value estimate and the action used for bootstrapping are recomputed with the current critic and policy, so the target is always consistent: \[ Q^{\mathrm{target}} = r_t + \gamma\, m_t\, \mathbb{E}\bigl[V_{\bar\theta}(\tilde{s}_{t+1})\bigr], \quad m_t = b_t + 1 - d_t. \]
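A scalar sketch of this bookkeeping is shown below. It rests on one assumption that the text does not spell out: \(d_t\) is read as the environment's done flag, set on failures and on timeouts alike, so that \(m_t = b_t + 1 - d_t\) stays in \(\{0, 1\}\). The helper names are hypothetical, and `next_value` stands for \(V_{\bar\theta}(\tilde{s}_{t+1})\) recomputed at sampling time.

```python
def corrected_next_obs(next_obs, pre_reset_obs, timeout):
    """Return the pre-reset observation on a timeout, else the regular next obs."""
    return pre_reset_obs if timeout else next_obs


def bootstrap_mask(done, timeout):
    """m = b + 1 - d (assumes `done` is also set when the episode times out)."""
    return int(timeout) + 1 - int(done)


def td_target(reward, done, timeout, next_value, gamma=0.99):
    """One-step target r + gamma * m * V(s~_{t+1})."""
    return reward + gamma * bootstrap_mask(done, timeout) * next_value


# Timeout: keep bootstrapping. Failure: cut the return at the reward.
print(td_target(1.0, done=True, timeout=True, next_value=10.0, gamma=0.9))
print(td_target(1.0, done=True, timeout=False, next_value=10.0, gamma=0.9))
```

Storing \(\tilde{s}_{t+1}\) rather than a precomputed value is what lets the target be re-evaluated with whatever critic and policy are current when the transition is sampled.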
What's wrong (empirically): one-step TD targets are noisy in this setting, and training on rough terrain is noticeably less stable when it relies on them. In practice, moving to \(n\)-step returns consistently improves stability and convergence speed here.
What we do: sample short windows from the replay buffer and build an \(n\)-step discounted return on the fly, with masking that respects the timeout vs. failure distinction from the previous trick.
Let \(S_k = \prod_{j=0}^{k-1}(1 - d_{t+j})\) be the survival indicator at step \(k\) within the window. The masked \(n\)-step target is: \[ Q^{\mathrm{target}} = \sum_{k=0}^{n-1} \gamma^k S_k\, r_{t+k} + \gamma^n S_n\, V(\tilde{s}_{t+n}) + \sum_{k=0}^{n-1} \gamma^{k+1} S_k\, b_{t+k}\, V(\tilde{s}_{t+k+1}). \] The first term accumulates discounted rewards, the second bootstraps at the horizon, and the third handles timeout transitions within the window, consistent with the timeout handling above.
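The masked target can be accumulated in a single pass over the window, as in the sketch below for one sampled window. It makes two assumptions: `n_step_target` is a hypothetical name, and \(d\) is read as a done flag that is also set on timeouts, so the second and third terms never bootstrap twice from the same step.

```python
def n_step_target(rewards, dones, timeouts, next_values, gamma=0.99):
    """Masked n-step target for one window of length n.

    rewards[k], dones[k], timeouts[k] refer to step t+k, and
    next_values[k] is V(s~_{t+k+1}) from the current critic and policy.
    """
    n = len(rewards)
    target = 0.0
    survival = 1.0  # S_k: 1 while no earlier step in the window was done
    for k in range(n):
        # First term: discounted reward while the episode is still alive.
        target += gamma**k * survival * rewards[k]
        # Third term: bootstrap right away if this step ended on a timeout.
        target += gamma ** (k + 1) * survival * timeouts[k] * next_values[k]
        survival *= 1.0 - dones[k]
    # Second term: bootstrap at the horizon if the window survived.
    target += gamma**n * survival * next_values[n - 1]
    return target


# Three-step window with gamma = 0.5: no termination, then a timeout at k = 1.
print(n_step_target([1, 1, 1], [0, 0, 0], [0, 0, 0], [0, 0, 8], gamma=0.5))
print(n_step_target([1, 1, 1], [0, 1, 0], [0, 1, 0], [0, 4, 8], gamma=0.5))
```

A real implementation would vectorize this over the batch; the loop form is kept here because it mirrors the three sums term by term.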
What's wrong: even with the correct action bounds, a large initial std \(\sigma_0\) pushes the squashed Gaussian into its saturated regime: sampling becomes almost uniform across the whole range, with very little gradient signal early on. In locomotion tasks this is particularly damaging: near-random actions prevent the agent from collecting long rollouts, and since long rollouts are essential for learning stable gaits, the policy never acquires the signal it needs to improve.
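The saturation effect can be checked in closed form. The sketch below assumes a zero-mean pre-squash Gaussian; `mass_beyond` is a hypothetical helper that returns the probability of the normalized action \(\tanh(x)\) landing in the outer part of the range.

```python
import math


def mass_beyond(y, sigma):
    """P(|tanh(x)| > y) for x ~ N(0, sigma^2), with 0 < y < 1."""
    # |tanh(x)| > y  <=>  |x| > atanh(y); the two-sided Gaussian tail is erfc.
    return math.erfc(math.atanh(y) / (sigma * math.sqrt(2.0)))


# A uniform distribution on (-1, 1) would put exactly half of its mass
# in the outer half of the range, |y| > 0.5.
for sigma in (1.0, 0.15):
    print(sigma, mass_beyond(0.5, sigma))
```

With \(\sigma_0 = 1.0\) slightly more than half of the mass falls in the outer half of the range, close to what a uniform distribution would give, while with \(\sigma_0 = 0.15\) almost none does.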
What we do: start small and centered. We initialize the policy with a small \(\sigma_0\) (≈ 0.15), so initial actions concentrate around the default joint configuration; the mean head is initialized near zero and the log-std head with zero weights and a constant bias, so \(\sigma_0\) acts as a single, interpretable control parameter for early exploration.
The initial policy std \(\sigma_0\) directly shapes exploration through the tanh squashing function. Large values push probability mass into the saturation region, producing near-uniform exploration over the action range; smaller values concentrate mass in the linear region, yielding exploration close to the default joint configuration. The actor's final linear layer maps the last hidden feature \(h \in \mathbb{R}^{d_h}\) to a \(2d\)-dimensional output \(z = Wh + b\), split into a mean head \(z_\mu\) and a log-std head \(z_{\log\sigma}\), each of dimension \(d\). We initialize the two heads as: \[ [W_\mu]_{ij} \sim \mathcal{N}(0,\, \varepsilon),\quad b_\mu = \mathbf{0}_d, \qquad W_{\log\sigma} = \mathbf{0}_{d \times d_h},\quad b_{\log\sigma} = \log(\sigma_0)\,\mathbf{1}_d. \] Zero weights in the log-std head with a constant bias set to \(\log(\sigma_0)\) make \(\sigma_0\) a direct and interpretable control parameter for early exploration, independent of the input. Small random weights in the mean head anchor initial actions near the default joint configuration. Empirically, this leads to faster and more stable convergence compared to standard initializations.
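A dependency-free sketch of this initialization follows. Several things here are assumptions: the names `init_actor_heads` and `linear` are hypothetical, \(\varepsilon\) is treated as the standard deviation of the mean-head weights (the text does not say whether it denotes a variance), and nested lists stand in for framework tensors.

```python
import math
import random


def init_actor_heads(d, d_h, sigma0=0.15, eps=1e-3, seed=0):
    """Mean head: small Gaussian weights, zero bias. Log-std head: zero weights, bias log(sigma0)."""
    rng = random.Random(seed)
    w_mu = [[rng.gauss(0.0, eps) for _ in range(d_h)] for _ in range(d)]
    b_mu = [0.0] * d
    w_log_std = [[0.0] * d_h for _ in range(d)]
    b_log_std = [math.log(sigma0)] * d
    return w_mu, b_mu, w_log_std, b_log_std


def linear(weight, bias, h):
    """Compute weight @ h + bias for list-based weights."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


w_mu, b_mu, w_ls, b_ls = init_actor_heads(d=3, d_h=8)
h = [0.5, -1.0, 0.25, 1.0, -0.75, 0.0, 0.9, -0.3]  # arbitrary hidden feature

mean = linear(w_mu, b_mu, h)
std = [math.exp(z) for z in linear(w_ls, b_ls, h)]
print(mean)  # close to zero for any bounded h
print(std)   # sigma0 for every action dimension, whatever h is
```

Because the log-std weights are zero, the initial std does not depend on the observation until training moves those weights away from zero.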
$\sigma_0 = 1.0$: near-uniform exploration
$\sigma_0 = 0.2$: concentrated exploration
We evaluate on velocity tracking over rough terrain across 7 legged platforms — quadrupeds and humanoids — using a single shared hyperparameter set and PPO's unmodified reward functions. For each robot, the left plot shows sample efficiency (return vs. environment steps) and the right plot shows wall-clock time, both compared against RSL-RL's PPO baseline.
The headline: RSL-RL-SAC matches PPO across all platforms, and on humanoids (H1, G1) it consistently surpasses PPO; we attribute this to the denser reward shaping common in humanoid configs, which lets entropy-driven exploration pay off. A wall-clock gap remains; we discuss why below.
The final return gap with PPO is closed, but that doesn't mean every problem is solved. A few things still need work:
Our claim is narrower than "SAC is now better than PPO": it's that RSL-RL-SAC is now a viable off-policy choice in this domain, which opens the door to the sim-to-online and on-robot fine-tuning workflows that motivated this work in the first place. At the same time, our implementation consists of simple fixes, with the flexibility and modularity of RSL-RL underneath.
We're not the only ones trying to make off-policy RL work at scale on legged robots, and our work builds on a number of others. If you're picking a tool for your own project, it's worth knowing what's out there.
@misc{sabatini2026bridginggapenablingsoft,
title={Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion},
author={Gianluca Sabatini and Chenhao Li and Marco Hutter},
year={2026},
eprint={2605.24975},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.24975},
}