Bridging the Gap: Enabling Soft Actor-Critic for High Performance Legged Locomotion

Quadruped and humanoid policies trained with our RSL-RL-SAC, deployed on rough terrain.

Anymal D: Rough Terrain

Unitree H1: Rough Terrain

TL;DR

A simple yet effective set of fixes enables RSL-RL-SAC to match PPO's final return across several legged platforms (quadrupeds and humanoids) with a single shared hyperparameter set and unmodified reward functions.
The root causes of SAC's failures in this setting come down to four things: miscalibrated action bounds, poor actor initialization, incorrect timeout handling, and noisy reward propagation. Fix these, and the gap closes.
Released as a drop-in extension of RSL-RL: same APIs, same training pipeline, same wrappers; swap the agent config and you're running RSL-RL-SAC. RND, symmetry augmentation, and multi-GPU support are all inherited for free.

Built on RSL-RL

Our implementation is an extension of RSL-RL, the runner library used by most IsaacLab legged-robot baselines. It uses the same APIs, the same environment wrappers, the same training script, and the same logging; only the agent configuration file changes. If you already train with PPO on RSL-RL, switching to our RSL-RL-SAC is one config swap:

# Train with PPO (as today)
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
                --task Isaac-Velocity-Rough-Anymal-D-v0 \
                --agent rsl_rl_cfg_entry_point

# Train with our RSL-RL-SAC (same pipeline, different agent)
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
                --task Isaac-Velocity-Rough-Anymal-D-v0 \
                --agent rsl_rl_sac_cfg_entry_point

Because the integration sits inside RSL-RL, everything else the library offers carries over:

RND intrinsic-motivation exploration bonuses.
Symmetry augmentation for bilateral robots.
Multi-GPU training out of the box.

One minor additional dependency: training requires our IsaacLab fork linked in the header, which includes a small set of changes on top of the standard installation. Alternatively, the single relevant commit can be cherry-picked into your existing setup.

Code: leggedrobotics/rsl_rl_sac.

Motivation

The PPO monoculture, and why it matters

Training legged robots has converged on a single recipe: PPO, massively parallel simulation, done. It works well enough in the lab, but this approach carries a hidden cost that only becomes visible once a robot leaves the simulator. PPO is on-policy: it discards each batch of experience after a handful of gradient updates. That's acceptable when spinning up 4096 environments costs nothing, but it's a poor foundation for the workflows that actually matter in deployment: fine-tuning on the real hardware, adapting to a new payload, or learning from the few thousand interactions a robot can safely collect in the field. For those settings, you need an algorithm that squeezes value out of every transition it sees.

SAC is the obvious candidate. It maintains a replay buffer, reuses data across many updates, and has dominated benchmarks in domains where sample efficiency counts. The problem is that it has consistently failed to match PPO when both run inside the same massively parallel setup: slower to converge, less stable, occasionally not converging at all. That gap has been treated as a fundamental incompatibility, and so the sim-to-online story keeps getting deferred. We think the gap is a handful of fixable bugs, not a deep algorithmic mismatch.

Why SAC has been failing

The failures aren't fundamental; they're a handful of implementation details that quietly compound. Some have been pointed out before; others we identified along the way. Together, they make SAC look broken when it isn't:

The action space is miscalibrated. Depending on the environment wrapper, IsaacLab action bounds are either left unscaled, constraining the policy to tiny joint displacements, or set to arbitrarily large values that far exceed anything a trained policy would ever use. In the latter case, SAC's squashed Gaussian rescales to fill the whole declared range, pushing most probability mass to the saturation limits. From the very first iteration, "exploration" is a near-uniform distribution rather than useful motion.
Exploration needs a warm start, not a cold one. Even with correctly calibrated bounds, a large initial standard deviation $\sigma_0$ in the policy spreads the squashed Gaussian nearly uniformly across the action range, with little gradient signal to guide early learning. Starting with a small $\sigma_0$, concentrating exploration around the default joint configuration, leads to noticeably faster and more stable convergence in practice.
Episode timeouts are silently treated as failures. Episodes cut off on a fixed step budget for computational reasons, not because the task failed. If treated as terminal states, the critic incorrectly learns that future value is zero at the cutoff, even when the robot was walking fine. Unlike PPO, which discards transitions immediately, SAC keeps them in a replay buffer where they get sampled repeatedly. Without a proper fix, those stale transitions keep corrupting updates long after the policy has moved on.
Rewards on rough terrain take too long to propagate. One-step TD targets are noisy when rewards are sparse.

None of these require rethinking SAC from scratch. Fix each one, and the same algorithm that "didn't work" in this domain matches PPO across the board.

Tricks that help

Each of the failure modes above maps to a small, targeted change.

Right-sizing the action space

What's wrong: SAC squashes actions through a tanh scaled to $[a_{\min}, a_{\max}]$, so getting those bounds right matters. Depending on the environment wrapper, IsaacLab leaves them either unscaled, constraining the policy to $(-1, 1)^d$ and producing joint displacements that are too small, or set to arbitrarily large values that far exceed anything a trained policy would actually use. In the latter case the squashed Gaussian rescales to fill the whole declared range, pushing most probability mass to the saturation limits and turning early exploration into a near-uniform distribution.

What we do: for each joint $j$, we compute per-joint bounds directly from the robot's soft joint limits $q_j^{\min}, q_j^{\max}$ and default position $q_j^0$, rescaled by the action-manager scale $s$: \[ a_j^{\min} = -\frac{|q_j^{\min} - q_j^0|}{s}, \qquad a_j^{\max} = \frac{|q_j^{\max} - q_j^0|}{s}. \] We use soft limits rather than hard ones because using hard limits would enlarge the action space to include joint configurations the controller never reaches in practice, adding suboptimal regions that the policy must learn to avoid. Crucially, because $q_j^0$ is rarely centered between the limits, $|a_j^{\min}| \neq |a_j^{\max}|$ in general: the bounds are asymmetric, and computing them per-joint captures that asymmetry rather than collapsing it into a single symmetric value. The per-joint squashed action then maps exactly onto that safe range: \[ a_j = \frac{a_j^{\max} + a_j^{\min}}{2} + \frac{a_j^{\max} - a_j^{\min}}{2}\,\tanh(x_j). \]
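The two formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the RSL-RL-SAC code: the function names, the joint limits, and the scale value are all made up, and it assumes the joint target is recovered as $q_j^0 + s\,a_j$.

```python
import numpy as np


def joint_action_bounds(q_min, q_max, q_default, scale):
    """Per-joint action bounds from soft joint limits, expressed in action units."""
    q_min, q_max, q_default = map(np.asarray, (q_min, q_max, q_default))
    a_min = -np.abs(q_min - q_default) / scale
    a_max = np.abs(q_max - q_default) / scale
    return a_min, a_max


def squash(x, a_min, a_max):
    """Map an unbounded pre-activation x onto [a_min, a_max] with a rescaled tanh."""
    center = 0.5 * (a_max + a_min)
    half_range = 0.5 * (a_max - a_min)
    return center + half_range * np.tanh(x)


# Hypothetical soft limits and default pose (rad) for three joints.
q_min = np.array([-0.6, -1.0, -2.7])
q_max = np.array([0.6, 2.5, -0.9])
q_default = np.array([0.0, 0.8, -1.5])
scale = 0.5  # hypothetical action-manager scale

a_min, a_max = joint_action_bounds(q_min, q_max, q_default, scale)
print("a_min:", a_min)
print("a_max:", a_max)

# Under the assumed mapping q = q_default + scale * a, the extreme actions
# land exactly on the soft limits.
print("q at a_min:", q_default + scale * a_min)
print("q at a_max:", q_default + scale * a_max)
```

In this toy example only the first joint has symmetric bounds; the other two come out asymmetric because their default position is off-center between the limits.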

Treating timeouts as timeouts, not failures

What's wrong — and why it's worse for SAC than for PPO: simulation episodes end on a fixed step budget, not because the task failed. Treating those cutoffs as terminal states tells the critic "future value is zero", even when the robot was walking fine at the cutoff. PPO can handle this approximately by evaluating the current value function at the timeout state before discarding the rollout, because on-policy transitions are consumed immediately and the value estimate is always fresh. SAC has no such luxury: transitions sit in a replay buffer and are sampled repeatedly across many future updates, long after the policy and critic that collected them have moved on. This creates two compounding sources of staleness: the bootstrap value drifts, and the Q-function must be evaluated on actions from the current policy, not the one that was active when the transition was collected.

What we do: we separate a timeout mask $b_t$ from a failure mask $d_t$, and build the corrected next observation before storing the transition: \[ \tilde{s}_{t+1} = b_t \odot s_{t+1}^{\mathrm{p}} + (1 - b_t) \odot s_{t+1}, \] where $s_{t+1}^{\mathrm{p}}$ is the observation just before the environment resets. The replay buffer stores the tuple $(s_t, a_t, r_t, \tilde{s}_{t+1}, d_t, b_t)$ with this corrected $\tilde{s}_{t+1}$ already in place. At sampling time, both the value estimate and the action used for bootstrapping are recomputed with the current critic and policy, so the target is always consistent: \[ Q^{\mathrm{target}} = r_t + \gamma\, m_t\, \mathbb{E}\bigl[V_{\bar\theta}(\tilde{s}_{t+1})\bigr], \quad m_t = b_t + 1 - d_t. \]
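As a rough illustration of the masking logic, here is a minimal NumPy sketch of the corrected next observation and the one-step target. The function names are hypothetical, and `next_value` is a placeholder for $V_{\bar\theta}(\tilde{s}_{t+1})$, which in practice would be recomputed at sampling time from the current critic and policy. The sketch assumes $d_t$ is set to 1 whenever the episode ends, including on a timeout, which is the reading under which $m_t$ stays in $\{0, 1\}$.

```python
import numpy as np


def corrected_next_obs(next_obs, pre_reset_obs, timeout):
    """Use the pre-reset observation wherever the episode timed out."""
    b = timeout[:, None]  # broadcast the per-env mask over observation dims
    return b * pre_reset_obs + (1.0 - b) * next_obs


def one_step_target(reward, done, timeout, next_value, gamma=0.99):
    """Q_target = r + gamma * m * V(s~_{t+1}), with m = b + 1 - d."""
    m = timeout + 1.0 - done
    return reward + gamma * m * next_value


# Three environments: still running, failed, timed out.
reward = np.array([1.0, 1.0, 1.0])
done = np.array([0.0, 1.0, 1.0])     # assumption: also 1 on timeouts
timeout = np.array([0.0, 0.0, 1.0])
next_value = np.array([10.0, 10.0, 10.0])  # placeholder for V(s~_{t+1})

next_obs = np.zeros((3, 2))       # what the env returns (post-reset where done)
pre_reset_obs = np.ones((3, 2))   # observation just before the reset

s_tilde = corrected_next_obs(next_obs, pre_reset_obs, timeout)
target = one_step_target(reward, done, timeout, next_value, gamma=0.9)
print(s_tilde)
print(target)
```

Only the failed environment loses its bootstrap term; the timed-out one is treated like a running episode, but bootstraps from the pre-reset observation.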

Smoother targets via n-step returns

What's wrong (empirically): one-step TD targets are noisy in this setting, and training on rough terrain is noticeably less stable when they are used. In practice, moving to $n$-step returns consistently improves stability and convergence speed here.

What we do: sample short windows from the replay buffer and build an $n$-step discounted return on the fly, with masking that respects the timeout vs. failure distinction from the previous trick.

Let $S_k = \prod_{j=0}^{k-1}(1 - d_{t+j})$ be the survival indicator at step $k$ within the window. The masked $n$-step target is: \[ Q^{\mathrm{target}} = \sum_{k=0}^{n-1} \gamma^k S_k\, r_{t+k} + \gamma^n S_n\, V(\tilde{s}_{t+n}) + \sum_{k=0}^{n-1} \gamma^{k+1} S_k\, b_{t+k}\, V(\tilde{s}_{t+k+1}). \] The first term accumulates discounted rewards, the second bootstraps at the horizon, and the third handles timeout transitions within the window, consistent with the timeout handling above.
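The masked target can be computed with a single pass over the window. The sketch below is a plain NumPy transcription of the formula above, not the library's implementation: `n_step_target` is a hypothetical name, the arrays have shape `(n, num_envs)`, and `next_values[k]` stands in for $V(\tilde{s}_{t+k+1})$. As before, it assumes $d_{t+k}$ is 1 on any episode end, timeouts included.

```python
import numpy as np


def n_step_target(rewards, dones, timeouts, next_values, gamma=0.99):
    """Masked n-step target for windows of length n.

    rewards[k]     = r_{t+k}
    dones[k]       = d_{t+k}
    timeouts[k]    = b_{t+k}
    next_values[k] = V(s~_{t+k+1})
    """
    n = len(rewards)
    target = 0.0
    survival = 1.0  # S_k, starting from S_0 = 1 (empty product)
    for k in range(n):
        # Discounted reward, counted only while the episode is still alive.
        target = target + gamma**k * survival * rewards[k]
        # Timeout inside the window: bootstrap from the pre-reset state.
        target = target + gamma ** (k + 1) * survival * timeouts[k] * next_values[k]
        survival = survival * (1.0 - dones[k])  # now S_{k+1}
    # Horizon bootstrap, active only if no episode ended inside the window.
    return target + gamma**n * survival * next_values[n - 1]


# n = 3 steps (rows), three environments (columns):
# env 0 never ends, env 1 fails at k = 1, env 2 times out at k = 1.
rewards = np.ones((3, 3))
dones = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
timeouts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
next_values = np.full((3, 3), 8.0)

targets = n_step_target(rewards, dones, timeouts, next_values, gamma=0.5)
print(targets)
```

With $\gamma = 0.5$, the uninterrupted window gives $1 + 0.5 + 0.25 + 0.125 \cdot 8 = 2.75$, the failure gives $1 + 0.5 = 1.5$, and the timeout gives $1 + 0.5 + 0.25 \cdot 8 = 3.5$.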

Starting exploration where it should

What's wrong: even with the correct action bounds, a large initial std $\sigma_0$ pushes the squashed Gaussian into its saturated regime: sampling is almost uniform across the whole range, with very little gradient signal early on. In locomotion tasks this is particularly damaging: near-random actions prevent the agent from collecting long rollouts, and since long rollouts are essential for learning stable gaits, the policy never acquires the signal it needs to improve.

What we do: start small and centered. We initialize the policy with a small $\sigma_0$ (≈ 0.15), so initial actions concentrate around the default joint configuration; the mean head is initialized near zero and the log-std head with zero weights and a constant bias, so $\sigma_0$ acts as a single, interpretable control parameter for early exploration.

The initial policy std $\sigma_0$ directly shapes exploration through the tanh squashing function. Large values push probability mass into the saturation region, producing near-uniform exploration over the action range; smaller values concentrate mass in the linear region, yielding exploration close to the default joint configuration. The actor's final linear layer maps the last hidden feature $h \in \mathbb{R}^{d_h}$ to a $2d$-dimensional output $z = Wh + b$, split into a mean head $z_\mu$ and a log-std head $z_{\log\sigma}$, each of dimension $d$. We initialize the two heads as: \[ [W_\mu]_{ij} \sim \mathcal{N}(0,\, \varepsilon),\quad b_\mu = \mathbf{0}_d, \qquad W_{\log\sigma} = \mathbf{0}_{d \times d_h},\quad b_{\log\sigma} = \log(\sigma_0)\,\mathbf{1}_d. \] Zero weights in the log-std head with a constant bias set to $\log(\sigma_0)$ make $\sigma_0$ a direct and interpretable control parameter for early exploration, independent of the input. Small random weights in the mean head anchor initial actions near the default joint configuration. Empirically, this leads to faster and more stable convergence compared to standard initializations.
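A minimal NumPy sketch of this initialization is given below. It is illustrative only: `init_actor_head` and `actor_head` are hypothetical names, the layer sizes are arbitrary, and $\varepsilon$ is treated as the standard deviation of the mean-head weights with a made-up value of $10^{-3}$ (the text does not say whether it denotes a variance or a standard deviation).

```python
import numpy as np


def init_actor_head(d_h, d, sigma_0=0.15, eps=1e-3, seed=0):
    """Build W (2d x d_h) and b (2d) for the final actor layer z = W h + b."""
    rng = np.random.default_rng(seed)
    W_mu = rng.normal(0.0, eps, size=(d, d_h))  # small random mean-head weights
    W_log_std = np.zeros((d, d_h))              # log-std head ignores the input
    b_mu = np.zeros(d)
    b_log_std = np.full(d, np.log(sigma_0))     # constant bias sets the std
    W = np.concatenate([W_mu, W_log_std], axis=0)
    b = np.concatenate([b_mu, b_log_std])
    return W, b


def actor_head(h, W, b):
    """Split the 2d-dimensional output into mean and std."""
    z = W @ h + b
    d = z.shape[0] // 2
    return z[:d], np.exp(z[d:])


d_h, d, sigma_0 = 128, 12, 0.15  # hypothetical sizes
W, b = init_actor_head(d_h, d, sigma_0=sigma_0)

# Two unrelated hidden features give the same std and near-zero means.
rng = np.random.default_rng(1)
mu_1, std_1 = actor_head(rng.normal(size=d_h), W, b)
mu_2, std_2 = actor_head(rng.normal(size=d_h), W, b)
print(std_1[:3], std_2[:3])
print(np.abs(mu_1).max(), np.abs(mu_2).max())
```

Whatever the hidden feature, the std at initialization equals $\sigma_0$, while the pre-squash mean stays close to zero.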

Squashed Gaussian PDF with $\sigma_0 = 1.0$: near-uniform exploration

Squashed Gaussian PDF with $\sigma_0 = 0.2$: concentrated exploration
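One way to put a number on the difference between these two settings is to compute how much probability mass the squashed Gaussian places near the edges of the normalized range. The sketch below does this in closed form for a zero-mean Gaussian passed through tanh; the function name and the 0.9 threshold are arbitrary choices made for illustration.

```python
import math


def saturated_mass(sigma_0, threshold=0.9):
    """P(|tanh(x)| > threshold) for x ~ N(0, sigma_0^2)."""
    # |tanh(x)| > threshold  <=>  |x| > atanh(threshold)
    x_thr = math.atanh(threshold)
    # Two-sided Gaussian tail: P(|x| > c) = 1 - erf(c / (sigma * sqrt(2)))
    return 1.0 - math.erf(x_thr / (sigma_0 * math.sqrt(2.0)))


for sigma_0 in (1.0, 0.2):
    print(f"sigma_0 = {sigma_0}: mass beyond |a| > 0.9 = {saturated_mass(sigma_0):.3g}")
```

For reference, a uniform distribution on $(-1, 1)$ would place 0.1 of its mass beyond $|a| > 0.9$. With $\sigma_0 = 1.0$ the squashed Gaussian places more than that in the same region, while with $\sigma_0 = 0.2$ the mass there is negligible.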

Results

We evaluate on velocity tracking over rough terrain across 7 legged platforms — quadrupeds and humanoids — using a single shared hyperparameter set and PPO's unmodified reward functions. For each robot, the left plot shows sample efficiency (return vs. environment steps) and the right plot shows wall-clock time, both compared against RSL-RL's PPO baseline.

The headline: RSL-RL-SAC matches PPO across all platforms, and on humanoids (H1, G1) it consistently surpasses PPO; we attribute this to the denser reward shaping common in humanoid configs, which lets entropy-driven exploration pay off. A wall-clock gap remains; we discuss why below.

Anymal D

Unitree A1

A1 sample efficiency and wall-clock time

Unitree H1

H1 sample efficiency and wall-clock time

Unitree G1

G1 sample efficiency and wall-clock time

Unitree Go1

Go1 sample efficiency and wall-clock time

Unitree Go2

Go2 sample efficiency and wall-clock time

Anymal B

What still doesn't work

The final return gap with PPO is closed, but that doesn't mean every problem is solved. A few things still need work:

Wall-clock time still favors PPO. SAC's larger network architectures, higher update-to-data ratio, and multiple optimization objectives all add up. Closing this gap is an engineering and algorithms challenge, not a tuning one.
Entropy-driven exploration is not uniformly beneficial. Whether the entropy bonus helps or hurts, and by how much, depends on the task, and understanding this better remains an open question.
Seed-to-seed variance is higher than PPO's. Most seeds train cleanly, but variability across runs is more noticeable than with PPO.
High returns don't always mean natural motion, at least on the G1. On this platform, some policies achieve good velocity-tracking scores but exhibit unnatural movement patterns, suggesting the reward function may need revisiting for more natural behavior. We didn't observe this consistently on other platforms.

Our claim is narrower than "SAC is now better than PPO": it's that RSL-RL-SAC is now a viable off-policy choice in this domain, which opens the door to the sim-to-online and on-robot fine-tuning workflows that motivated this work in the first place. At the same time, our implementation consists of simple fixes, with the flexibility and modularity of RSL-RL underneath.

Related work & acknowledgments

We're not the only ones trying to make off-policy RL work at scale on legged robots, and our work builds on a number of others. If you're picking a tool for your own project, it's worth knowing what's out there.

Antonin Raffin's blog post on SAC in massively parallel simulation: the original "the action bounds are wrong" diagnosis, and the post that helped seed this line of work. Recommended reading. His SAC implementation in SB3 and SBX also features n-step returns and proper timeout handling.
FastSAC and FastTD3: these works showed that off-policy methods can scale to massively parallel sim. They also introduced n-step returns in this setting and suggested deriving action bounds from joint limits. Strong humanoid results, with task-specific reward shaping in some configurations.

In addition, we want to highlight FlashSAC. Developed concurrently with our work, this SAC variant is built around distributional critics and norm-bounding, with strong humanoid sim-to-real results.

BibTeX

@misc{sabatini2026bridginggapenablingsoft,
      title={Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion},
      author={Gianluca Sabatini and Chenhao Li and Marco Hutter},
      year={2026},
      eprint={2605.24975},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.24975},
}