Lab 8: Reinforcement Learning for Robotics

1. Lab idea

In this lab, you will train a quadruped robot called Pupper in simulation using Reinforcement Learning.

The robot will be trained inside NVIDIA Isaac Gym, a GPU-based physics simulator that can run many environments in parallel. Instead of training one robot at a time, we can train hundreds or thousands of simulated robots at the same time.

The learning algorithm used in this lab is Proximal Policy Optimization (PPO).

The main goal of the lab is not to install Isaac Gym. The simulator and training pipeline are already prepared. Your task is to complete the reward functions used by the Pupper robot.

Initially, the reward functions return zero, so the robot has no useful learning signal. You will implement reward terms that encourage the robot to:

  • move forward;
  • keep a stable body height;
  • avoid using unnecessarily large motor torques.

At the end of the lab, you will compare the training results before and after modifying the reward function.

2. Learning objectives

After this lab, you should be able to:

  • explain why reinforcement learning needs a reward function;
  • understand why many simulated environments are used in parallel;
  • run a training job on the HPC cluster using SLURM;
  • identify and modify reward functions in a Legged Gym environment;
  • compare multiple training runs using logs;
  • explain how reward shaping influences the behavior learned by a robot;
  • understand the basic idea of transferring a trained policy from simulation to the real Pupper robot.

3. Background

A reinforcement learning agent learns by interacting with an environment.

For a robot, the environment contains:

  • the robot body;
  • the physics simulation;
  • gravity;
  • contacts with the ground;
  • joint positions and velocities;
  • actions applied to the motors.

At every step, the agent receives an observation and outputs an action. The simulator applies the action and returns a reward.

The reward tells the agent whether its behavior is good or bad.

For example:

  • if the robot moves forward, it should receive a positive reward;
  • if it falls, it should receive a penalty;
  • if it uses too much torque, it should receive a penalty;
  • if it keeps a stable body height, it should receive a higher reward.

A bad reward function can make the robot learn nothing. A good reward function can make the robot learn useful locomotion.
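The observation, action and reward cycle described above can be sketched with a toy example. This is only an illustration under strong simplifications: `ToyEnv`, `random_policy` and `run_episode` are hypothetical names, the "robot" is a single number (its x position), and the policy is random instead of learned. It is not how Isaac Gym or Legged Gym is implemented.

```python
import random

class ToyEnv:
    """Hypothetical 1-D stand-in for the simulator: the state is the x position."""

    def __init__(self):
        self.x = 0.0

    def observe(self):
        return self.x

    def step(self, action):
        # The action is a forward displacement; the reward is the distance gained.
        self.x += action
        reward = action
        return self.observe(), reward

def random_policy(observation, rng):
    # An untrained agent: ignores the observation and picks a random step.
    return rng.uniform(-1.0, 1.0)

def run_episode(num_steps=100, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    obs = env.observe()
    total_reward = 0.0
    for _ in range(num_steps):
        action = random_policy(obs, rng)   # agent: observation -> action
        obs, reward = env.step(action)     # environment: action -> observation, reward
        total_reward += reward
    return total_reward, obs

total_reward, final_x = run_episode()
print(total_reward, final_x)
```

In this toy setting the accumulated reward equals the final x position, so a learning algorithm that maximizes the reward would be pushed towards moving forward.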

4. Important note about this lab

The HPC environment, container, Isaac Gym, PyTorch and Legged Gym are already prepared for you.

You should not try to reinstall Isaac Gym manually during the lab.

The important part of this lab is inside the file:

~/pupper_lab8/leggedgym/legged_gym/envs/pupper/pupper.py

The initial version contains TODO functions similar to this:

def _reward_base_height(self):
    return 0.0

def _reward_forward_velocity(self):
    return 0

def _reward_torques(self):
    return 0

As long as these functions return zero, the robot has no meaningful learning signal.

5. Files used in this lab

Download the starter archive from OCW:

Download Lab 8 HPC starter pack

The starter pack contains the SLURM scripts needed for running the training jobs.

Expected working directory:

~/pupper_lab8

Expected repository structure:

~/pupper_lab8/
├── isaacgym/
├── leggedgym/
├── rsl_rl/
├── pytorch_isaacgym.sif
├── pyuser_isaac/
├── conda_tools/
├── local_include/
├── torch_extensions/
├── logs/
├── lab8_gpu_check.slurm
├── lab8_import_check.slurm
├── lab8_train_isaacgym.slurm
└── lab8_reward_check.sh

The archive contains only the small helper scripts. It does not contain the large simulator files, the container image or the full repositories.

6. Connect to the HPC cluster

Connect to the frontend node:

ssh your_username@fep.grid.pub.ro

Go to the lab directory:

cd ~/pupper_lab8

Create the logs directory if it does not already exist:

mkdir -p logs

7. Check that the GPU is available

Before running Isaac Gym, check that your SLURM job can access the GPU.

Submit the GPU check job:

cd ~/pupper_lab8
sbatch lab8_gpu_check.slurm

Check the queue:

squeue -u $USER

After the job finishes, inspect the output:

ls logs
cat logs/gpu_check_<JOB_ID>.out
cat logs/gpu_check_<JOB_ID>.err

Replace `<JOB_ID>` with the job id printed by `sbatch`.

A successful run should show an NVIDIA GPU through `nvidia-smi`.

8. Check that Isaac Gym and Legged Gym work

Before changing the reward function, check that the simulator imports correctly.

Submit the import check job:

cd ~/pupper_lab8
sbatch lab8_import_check.slurm

After the job finishes:

cat logs/import_check_<JOB_ID>.out
cat logs/import_check_<JOB_ID>.err

A successful import check should contain:

isaacgym import: OK
gymtorch import: OK
rsl_rl import: OK
legged_gym import: OK
CUDA available: True

If this step fails, do not continue to the reward task. Ask the instructor or lab assistant for help.

9. Run a small debug training job

Now check that the Pupper training task can start.

Submit a small debug job:

cd ~/pupper_lab8
RUN_NAME="debug_128_${USER}" NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm

Check the queue:

squeue -u $USER

After the job finishes, check its output:

cat logs/pupper_train_<JOB_ID>.out
cat logs/pupper_train_<JOB_ID>.err

A successful run should contain messages similar to:

Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
Learning iteration 0/10
Learning iteration 9/10
Training finished.

This means that Isaac Gym, PyTorch, CUDA and the Pupper environment are working.

10. Run the initial baseline

Now run a slightly larger baseline:

cd ~/pupper_lab8
RUN_NAME="baseline_zero_reward_${USER}" NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm

After the job finishes, inspect the reward values:

cat logs/pupper_train_<JOB_ID>.out | grep -E "Learning iteration|Mean reward|rew_forward_velocity|episode length" | tail -80

You should observe that the reward is not useful. In the initial version, the relevant reward functions return zero.

Example:

Mean reward: 0.00
Mean episode rew_forward_velocity: 0.0000

This is expected before solving the lab.

11. Inspect the reward functions

Open the Pupper environment file:

cd ~/pupper_lab8/leggedgym
nano legged_gym/envs/pupper/pupper.py

Find the following functions:

def _reward_base_height(self):
    return 0.0

def _reward_forward_velocity(self):
    return 0

def _reward_torques(self):
    return 0

These functions are the main TODOs of this lab.

You can also use the helper script:

cd ~/pupper_lab8
./lab8_reward_check.sh

This script prints the reward functions and checks whether there are still `return 0` statements inside `pupper.py`.

12. Inspect the reward configuration

Open the configuration file:

cd ~/pupper_lab8/leggedgym
nano legged_gym/envs/pupper/pupper_config.py

Look for the reward section. It should contain values similar to:

class rewards:
    forward_velocity_clip = 1.0

    class scales:
        forward_velocity = 3.0

The exact values may differ depending on the starter code version.

The important idea is that the config file contains the coefficients used to scale the reward terms.

For example:

  • a positive scale means the term increases the reward;
  • a negative scale means the term becomes a penalty;
  • a zero scale disables the term.
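As a rough sketch of how the scales act on the reward terms, the total reward can be thought of as a weighted sum. The function `total_reward` and the raw term values are hypothetical. The scale `forward_velocity = 3.0` comes from the excerpt above and `-0.0002` is the example torque scale used later in this lab; the `base_height` scale of zero is an assumption chosen to show a disabled term. The real Legged Gym code may apply extra factors (for example the simulation time step) and clipping, so treat this only as the basic idea.

```python
def total_reward(terms, scales):
    """Sum scale * term over all reward terms with a non-zero scale."""
    total = 0.0
    for name, value in terms.items():
        scale = scales.get(name, 0.0)
        if scale == 0.0:
            continue  # a zero scale disables the term
        total += scale * value
    return total

# Hypothetical raw term values for one robot at one simulation step
terms = {"forward_velocity": 0.5, "torques": 100.0, "base_height": 0.01}

# Positive scale -> reward, negative scale -> penalty, zero -> disabled
scales = {"forward_velocity": 3.0, "torques": -0.0002, "base_height": 0.0}

reward = total_reward(terms, scales)
print(round(reward, 4))  # 3.0 * 0.5 - 0.0002 * 100.0 = 1.48
```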

13. Task 1 - Implement forward velocity reward

The robot should receive a positive reward when it moves forward.

In Legged Gym, the forward velocity of the robot base is usually stored in:

self.base_lin_vel[:, 0]

This is the x-axis linear velocity of the robot base.

Replace the initial function:

def _reward_forward_velocity(self):
    return 0

with:

def _reward_forward_velocity(self):
    return torch.clip(
        self.base_lin_vel[:, 0],
        min=0.0,
        max=self.cfg.rewards.forward_velocity_clip
    )

This reward gives positive values only when the robot moves forward. Negative velocity is clipped to zero, so moving backwards is not rewarded.
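To see what the clipping does to individual values, here is a scalar version in plain Python. The helper `clipped_forward_reward` is hypothetical and works on one velocity at a time, while the lab code applies `torch.clip` to a whole tensor with one entry per environment. The upper limit of 1.0 is the example `forward_velocity_clip` value and may differ in your config.

```python
def clipped_forward_reward(vx, clip_max=1.0):
    """Scalar equivalent of torch.clip(vx, min=0.0, max=clip_max)."""
    return max(0.0, min(vx, clip_max))

for vx in (-0.4, 0.0, 0.3, 1.0, 2.5):
    print(vx, "->", clipped_forward_reward(vx))
# -0.4 -> 0.0   (moving backwards earns nothing)
# 0.0 -> 0.0
# 0.3 -> 0.3
# 1.0 -> 1.0
# 2.5 -> 1.0    (speeds above the clip earn no extra reward)
```

The upper clip also means that the robot gains nothing from running faster than the limit, which discourages very aggressive gaits.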

If `torch` is not imported at the top of the file, add:

import torch

14. Task 2 - Implement torque penalty

A robot should not learn to move by using extremely large motor torques. Large torques are inefficient and can produce unstable behavior.

The torque values are stored in:

self.torques

Replace:

def _reward_torques(self):
    return 0

with:

def _reward_torques(self):
    return torch.sum(torch.square(self.torques), dim=1)

This function returns a positive value representing how much torque the robot uses.

Important: this function returns a positive value, but it becomes a penalty if the scale in the config file is negative.

For example:

torques = -0.0002

means that large torques reduce the final reward.
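A small worked example of the arithmetic, in plain Python for a single robot. The helper `torque_term`, the assumption of 12 joints and the torque values are hypothetical and only illustrate the formula; the scale `-0.0002` is the example value given above.

```python
def torque_term(torques):
    """Sum of squared joint torques for one robot."""
    return sum(t * t for t in torques)

scale = -0.0002                      # example scale from the config

small = torque_term([2.0] * 12)      # 12 joints at 2.0 each -> 48.0
large = torque_term([4.0] * 12)      # 12 joints at 4.0 each -> 192.0

print(small, scale * small)          # raw term 48.0, small negative contribution
print(large, scale * large)          # raw term 192.0, four times larger penalty
```

Because the torques are squared, doubling every torque makes the penalty four times larger, so the term punishes large torques much more strongly than small ones.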

15. Task 3 - Implement base height penalty

The robot should keep its body at a reasonable height. If the body is too low or too high, the behavior is probably unstable.

The base height is stored in:

self.root_states[:, 2]

Replace:

def _reward_base_height(self):
    return 0.0

with:

def _reward_base_height(self):
    base_height = self.root_states[:, 2]
    return torch.square(base_height - self.cfg.rewards.base_height_target)

This function returns the squared error between the current base height and the target base height.

Again, this becomes a penalty if its scale in the config file is negative.
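The squared error can be checked by hand for a few heights. In this sketch the helper `base_height_term` and the target of 0.20 m are assumptions for illustration; use the `base_height_target` value from your own config file.

```python
def base_height_term(height, target):
    """Squared error between the current and the target base height."""
    return (height - target) ** 2

target = 0.20  # hypothetical target height in metres

for height in (0.10, 0.15, 0.20, 0.25):
    print(height, "->", round(base_height_term(height, target), 4))
# 0.1 -> 0.01
# 0.15 -> 0.0025
# 0.2 -> 0.0
# 0.25 -> 0.0025
```

The term is zero exactly at the target and grows symmetrically when the body is too low or too high.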

16. Check your code

After editing `pupper.py`, check the relevant lines:

cd ~/pupper_lab8/leggedgym
nl -ba legged_gym/envs/pupper/pupper.py | sed -n '80,120p'

You should see the implemented reward functions, not `return 0`.

You can also run:

cd ~/pupper_lab8
./lab8_reward_check.sh

If the script still shows `return 0` inside the reward functions, your implementation is not complete.
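If you prefer to check this from Python, the following sketch finds reward functions whose whole body is still `return 0` or `return 0.0`. It is not the implementation of `lab8_reward_check.sh`, whose contents are not shown here; the function name `zero_reward_functions` and the sample source are hypothetical.

```python
import ast

def zero_reward_functions(source):
    """Names of _reward_* functions whose body is only `return 0` or `return 0.0`."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        if not node.name.startswith("_reward_"):
            continue
        if len(node.body) != 1 or not isinstance(node.body[0], ast.Return):
            continue
        returned = node.body[0].value
        if returned is None:
            continue
        try:
            value = ast.literal_eval(returned)
        except (ValueError, TypeError):
            continue  # not a plain literal, so the function computes something
        if value == 0:
            names.append(node.name)
    return sorted(names)

# Sample source: the code is only parsed, never executed, so torch is not needed
sample = '''
class Pupper:
    def _reward_base_height(self):
        return 0.0

    def _reward_forward_velocity(self):
        return torch.clip(self.base_lin_vel[:, 0], min=0.0, max=1.0)

    def _reward_torques(self):
        return 0
'''
print(zero_reward_functions(sample))
# ['_reward_base_height', '_reward_torques']
```

To check your real file, pass the text of `pupper.py` to the function instead of `sample`. An empty list means no reward function consists only of a zero return.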

17. Train again after implementing the rewards

Run the training job again:

cd ~/pupper_lab8
RUN_NAME="reward_fixed_512_${USER}" NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm

Check the job:

squeue -u $USER

After it finishes:

cat logs/pupper_train_<JOB_ID>.out | grep -E "Learning iteration|Mean reward|rew_forward_velocity|episode length" | tail -100
cat logs/pupper_train_<JOB_ID>.err

Compare the new output with the initial baseline.

You should focus on:

  • `Mean reward`;
  • `Mean episode rew_forward_velocity`;
  • `Mean episode length`;
  • total timesteps;
  • whether the job completed successfully.
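Comparing runs by eye becomes tedious with several logs. The sketch below extracts the last reported value of each metric from the log text. The names `METRIC_LINE`, `last_metrics` and `metric_change` are hypothetical, and the pattern assumes lines of the form `Mean reward: 0.00` as in the examples in this lab; adjust it if your log format differs. The sample text reuses the baseline values shown earlier.

```python
import re

# Matches lines such as "Mean reward: 0.00" (leading padding is allowed)
METRIC_LINE = re.compile(
    r"^\s*(Mean reward|Mean episode rew_forward_velocity|Mean episode length):"
    r"\s*(-?\d+(?:\.\d+)?)"
)

def last_metrics(log_text):
    """Return the last reported value of each metric found in the log text."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))  # later lines overwrite
    return metrics

def metric_change(before, after):
    """Difference (after - before) for metrics present in both runs."""
    return {name: after[name] - before[name] for name in before if name in after}

baseline_log = """
Learning iteration 49/50
    Mean reward: 0.00
    Mean episode rew_forward_velocity: 0.0000
"""
baseline = last_metrics(baseline_log)
print(baseline)
# {'Mean reward': 0.0, 'Mean episode rew_forward_velocity': 0.0}
```

For real runs, read each `logs/pupper_train_<JOB_ID>.out` file into a string, call `last_metrics` on it, and pass the baseline and improved results to `metric_change`.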

18. Larger training runs

After the small run works, you can try larger experiments.

Run with 1000 environments:

cd ~/pupper_lab8
RUN_NAME="reward_fixed_1000_${USER}" NUM_ENVS=1000 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm

If that works, try 2000 environments:

cd ~/pupper_lab8
RUN_NAME="reward_fixed_2000_${USER}" NUM_ENVS=2000 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm

For a longer training run:

cd ~/pupper_lab8
RUN_NAME="reward_fixed_long_${USER}" NUM_ENVS=2000 MAX_ITERATIONS=300 sbatch lab8_train_isaacgym.slurm

Do not start with the largest experiment. First check that the short run works.
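To get a feeling for how much experience each run collects, you can estimate the total number of environment steps. This sketch assumes that every environment contributes 24 steps per learning iteration, a common default in Legged Gym training configs; the value used in this lab's config may differ, so compare the result with the total timesteps reported in your log. The function name `samples_collected` is hypothetical.

```python
def samples_collected(num_envs, max_iterations, steps_per_env=24):
    """Total environment steps gathered over a run (all environments combined)."""
    return num_envs * steps_per_env * max_iterations

runs = {
    "debug_128": (128, 10),
    "reward_fixed_512": (512, 50),
    "reward_fixed_1000": (1000, 50),
    "reward_fixed_2000": (2000, 50),
    "reward_fixed_long": (2000, 300),
}

for name, (num_envs, iterations) in runs.items():
    print(f"{name}: {samples_collected(num_envs, iterations):,} steps")
# debug_128: 30,720 steps
# reward_fixed_512: 614,400 steps
# reward_fixed_1000: 1,200,000 steps
# reward_fixed_2000: 2,400,000 steps
# reward_fixed_long: 14,400,000 steps
```

Under this assumption, the long run gathers roughly 470 times more experience than the debug run, which is why it also needs more time and memory.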

19. Useful SLURM commands

What you want to do          Command
Submit GPU check             sbatch lab8_gpu_check.slurm
Submit import check          sbatch lab8_import_check.slurm
Submit training              sbatch lab8_train_isaacgym.slurm
Submit debug training        RUN_NAME="debug_128_${USER}" NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm
Check your jobs              squeue -u $USER
Check finished job status    sacct -j <JOB_ID> --format=JobID,JobName,Partition,State,Elapsed,ExitCode,MaxRSS,ReqMem
Show output log              cat logs/pupper_train_<JOB_ID>.out
Show error log               cat logs/pupper_train_<JOB_ID>.err
Show only reward lines       cat logs/pupper_train_<JOB_ID>.out | grep -E "Mean reward|rew_forward_velocity|episode length"
Inspect reward functions     ./lab8_reward_check.sh

20. Common problems

Problem 1 - The reward stays zero

If the output still shows:

Mean reward: 0.00
Mean episode rew_forward_velocity: 0.0000

check that you actually modified:

~/pupper_lab8/leggedgym/legged_gym/envs/pupper/pupper.py

and that the reward functions no longer return zero.

Use:

cd ~/pupper_lab8
./lab8_reward_check.sh

or:

grep -R "return 0" ~/pupper_lab8/leggedgym/legged_gym/envs/pupper/pupper.py

Problem 2 - The job is killed with OOM

If the job fails with:

Detected 1 oom_kill event
Some of the step tasks have been OOM Killed

then the job used too much RAM.

Use fewer environments:

NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm

or:

NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm

If needed, the SLURM memory limit can be increased by the instructor in the training script:

#SBATCH --mem=128G

Problem 3 - Task not registered

If you see:

ValueError: Task with name: ... was not registered

check that the task name is:

pupper_flat

You can inspect registered tasks with:

cd ~/pupper_lab8/leggedgym
grep -R "task_registry.register" -n legged_gym/envs

Problem 4 - You edited the wrong task

This lab uses:

pupper_flat

Do not confuse it with:

pupper_standup

The standup task may have different reward scales, including zero forward velocity reward.

Problem 5 - The output stops after gymtorch

If the output stops around:

Building extension module gymtorch...
ninja: no work to do.

check the error file:

cat logs/pupper_train_<JOB_ID>.err

To check the final state of the job (for example COMPLETED, FAILED or OUT_OF_MEMORY), use:

sacct -j <JOB_ID> --format=JobID,JobName,Partition,State,Elapsed,ExitCode,MaxRSS,ReqMem

21. Questions

Answer the following questions in your report:

  • Why did the initial training run produce zero reward?
  • Why is forward velocity a useful reward term for locomotion?
  • Why should large torques be penalized?
  • Why can body height be used as a stability-related reward term?
  • What is the purpose of training many environments in parallel?
  • What changed after you implemented the reward functions?
  • Did the robot learn better behavior after more iterations? Explain using the log values.
  • Why might a policy that works in simulation behave differently on the real robot?

22. Deliverables

Submit a short report containing:

  • your implemented reward functions;
  • a screenshot or copied log section from the baseline run;
  • a screenshot or copied log section from the improved reward run;
  • a small comparison table;
  • short answers to the lab questions.

Example comparison table:

Run name              NUM_ENVS  MAX_ITERATIONS  Mean reward  rew_forward_velocity  Observation
baseline_zero_reward  512       50              0.00         0.0000                No useful learning signal
reward_fixed_512      512       50                                                 Reward functions implemented
reward_fixed_1000     1000      50                                                 More parallel environments

23. What to remember

The most important idea in this lab is that reinforcement learning does not magically learn the behavior we want.

The agent learns what the reward function encourages.

If the reward function is zero, the robot has no reason to improve.

If the reward function encourages forward movement but also penalizes unstable or inefficient behavior, the robot has a better chance of learning useful locomotion.

24. Optional final step - Upload the trained policy to the real Pupper robot

After training a policy in simulation, the next step is to test it on the real Pupper robot.

This step should only be done under instructor supervision.

Before uploading anything to the real robot, make sure that:

  • the robot battery is charged;
  • the robot is placed on the floor in a safe open area;
  • the emergency stop is available;
  • the policy was tested in simulation;
  • the correct configuration file is used on the robot;
  • the instructor or lab assistant is present.

24.1 Find the trained policy

After training, the policy is saved inside the Legged Gym logs directory.

Use:

cd ~/pupper_lab8/leggedgym
find logs -name "*.pt" | tail -20

Look for a file similar to:

model_300.pt
model_1500.pt

The exact name depends on the number of training iterations.
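If the logs directory contains many checkpoints, a short script can pick the one with the highest iteration number. This is a sketch that assumes checkpoints are named `model_<iteration>.pt` as in the examples above; the function name `latest_checkpoint` is hypothetical. Note that it compares iteration numbers across all runs under the given directory, so point it at one experiment folder if you have several runs.

```python
import re
from pathlib import Path

def latest_checkpoint(log_root):
    """Return the model_<iteration>.pt path with the highest iteration, or None."""
    best_path = None
    best_iteration = -1
    for path in Path(log_root).rglob("model_*.pt"):
        match = re.fullmatch(r"model_(\d+)\.pt", path.name)
        if match is None:
            continue  # skip files that do not follow the numeric naming scheme
        iteration = int(match.group(1))
        if iteration > best_iteration:
            best_iteration = iteration
            best_path = path
    return best_path

if __name__ == "__main__":
    # Run from ~/pupper_lab8/leggedgym so that "logs" is the Legged Gym logs directory
    print(latest_checkpoint("logs"))
```

Comparing the iteration as an integer matters: a plain alphabetical sort would place `model_1500.pt` before `model_300.pt`.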

24.2 Copy the policy to a deployment folder

Go to the deployment repository or folder provided by the instructor.

Example:

cd ~/pupper_lab8
mkdir -p deploy_policy

Copy the trained model:

cp ~/pupper_lab8/leggedgym/logs/<experiment_folder>/model_<iteration>.pt ~/pupper_lab8/deploy_policy/

Replace `<experiment_folder>` and `<iteration>` with the real names from your training output.

24.3 Convert or rebuild the neural controller

Some Pupper deployment code does not use the raw `.pt` file directly. It may require rebuilding or exporting the neural controller.

If the deployment folder contains a script such as:

rebuild_neural_controller.py

run it according to the instructor’s instructions.

Example:

cd ~/pupper_lab8/deploy
python rebuild_neural_controller.py

The exact command may differ depending on the deployment package used in the lab.

24.4 Upload the controller to Pupper

Connect to the Pupper robot using SSH.

Example:

ssh pi@pupper.local

or, if the robot has a fixed IP address:

ssh pi@<PUPPER_IP_ADDRESS>

From your local or HPC environment, copy the generated controller or policy files to the robot:

scp -r ~/pupper_lab8/deploy_policy/* pi@<PUPPER_IP_ADDRESS>:~/pupper_deploy/policies/

Replace `<PUPPER_IP_ADDRESS>` with the real IP address of the robot.

24.5 Run the policy on the robot

On the Pupper robot:

cd ~/pupper_deploy
python launch.py

or use the command provided by the instructor for the specific robot setup.

Observe the robot carefully.

Stop the program immediately if:

  • the robot moves violently;
  • the joints oscillate strongly;
  • the robot falls repeatedly;
  • the motors overheat;
  • the emergency stop is needed.

24.6 Reflection question

Compare the behavior in simulation with the behavior on the real robot.

Answer:

  • Did the robot behave the same in simulation and reality?
  • What differences did you observe?
  • Why can a policy trained in simulation fail on a real robot?
  • What is the sim-to-real gap?
  • How could domain randomization help?

25. Instructor notes

This section is for the instructor or lab assistant.

The tested working stack on the HPC cluster was:

  • SLURM job on the `dgxa100` partition;
  • NVIDIA A100-SXM4-80GB GPU;
  • Apptainer with `--nv`;
  • PyTorch 1.10.0 CUDA 11.3 container;
  • Isaac Gym Preview 4;
  • Python 3.7;
  • Legged Gym;
  • `pupper_flat` task.

The final import check must confirm:

isaacgym import: OK
gymtorch import: OK
rsl_rl import: OK
legged_gym import: OK
CUDA available: True

The final training script must export the same environment variables used during the successful import check, especially:

export PYTHONUSERBASE=$LAB_DIR/pyuser_isaac
export TOOL_PREFIX=$LAB_DIR/conda_tools
export PATH=$TOOL_PREFIX/bin:$PYTHONUSERBASE/bin:$PATH
export PYTHONPATH=$PYTHONUSERBASE/lib/python3.7/site-packages:${PYTHONPATH:-}
export LD_LIBRARY_PATH=$TOOL_PREFIX/lib:/opt/conda/lib:${LD_LIBRARY_PATH:-}
export LD_PRELOAD=$TOOL_PREFIX/lib/libstdc++.so.6:$TOOL_PREFIX/lib/libgcc_s.so.1
export CPATH=$LAB_DIR/local_include:$TOOL_PREFIX/include:${CPATH:-}
export CC=$TOOL_PREFIX/bin/x86_64-conda-linux-gnu-gcc
export CXX=$TOOL_PREFIX/bin/x86_64-conda-linux-gnu-c++
export TORCH_EXTENSIONS_DIR=$LAB_DIR/torch_extensions
export MAX_JOBS=1

The final `lab8_train_isaacgym.slurm` should pass the task variables into Apptainer using:

export APPTAINERENV_LAB_DIR="$LAB_DIR"
export APPTAINERENV_TASK="$TASK"
export APPTAINERENV_NUM_ENVS="$NUM_ENVS"
export APPTAINERENV_MAX_ITERATIONS="$MAX_ITERATIONS"
export APPTAINERENV_RUN_NAME="$RUN_NAME"

A safe starting point for students is:

RUN_NAME="debug_128_${USER}" NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm

A tested larger run is:

RUN_NAME="baseline_zero_reward_${USER}" NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm

If `NUM_ENVS=2000` causes an OOM kill, reduce the number of environments or increase the requested memory in the SLURM script.

rasb/lab/08.txt · Last modified: 2026/06/19 10:44 by vlad.radulescu2901
CC Attribution-Share Alike 3.0 Unported