In this lab, you will train a quadruped robot called Pupper in simulation using Reinforcement Learning.
The robot will be trained inside NVIDIA Isaac Gym, a GPU-based physics simulator that can run many environments in parallel. Instead of training one robot at a time, we can train hundreds or thousands of simulated robots at the same time.
The learning algorithm used in this lab is Proximal Policy Optimization (PPO).
The main goal of the lab is not to install Isaac Gym. The simulator and training pipeline are already prepared. Your task is to complete the reward functions used by the Pupper robot.
Initially, the reward functions return zero, so the robot has no useful learning signal. You will implement reward terms that encourage the robot to move forward, avoid excessive motor torques, and keep its base at a reasonable height.
At the end of the lab, you will compare the training results before and after modifying the reward function.
After this lab, you should be able to:
A reinforcement learning agent learns by interacting with an environment.
For a robot, the environment contains:
At every step, the agent receives an observation and outputs an action. The simulator applies the action and returns a reward.
The reward tells the agent whether its behavior is good or bad.
For example:
A bad reward function can make the robot learn nothing. A good reward function can make the robot learn useful locomotion.
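The interaction loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not Isaac Gym code: `ToyEnv`, `run_episode` and the 0.1 s time step are hypothetical, and the "robot" is just a point moving along one axis.

```python
# Toy illustration only: a 1-D "robot" whose action is its forward speed.
# ToyEnv and run_episode are hypothetical names, not part of Isaac Gym or Legged Gym.

class ToyEnv:
    def __init__(self):
        self.position = 0.0

    def step(self, action):
        """Apply the action, then return (observation, reward)."""
        self.position += action * 0.1  # integrate the speed over an assumed 0.1 s step
        reward = max(action, 0.0)      # only forward motion is rewarded
        return self.position, reward


def run_episode(policy, num_steps=10):
    """Run one episode and return the sum of rewards collected by the policy."""
    env = ToyEnv()
    observation, total_reward = env.position, 0.0
    for _ in range(num_steps):
        action = policy(observation)            # agent: observation -> action
        observation, reward = env.step(action)  # environment: action -> observation, reward
        total_reward += reward
    return total_reward


forward_return = run_episode(lambda obs: 1.0)  # always move forward
idle_return = run_episode(lambda obs: 0.0)     # never move
print(forward_return, idle_return)
```

With this reward, the policy that moves forward collects a higher return than the idle one, which is the signal a learning algorithm needs. If `step` always returned a reward of zero, both policies would look equally good.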
The HPC environment, container, Isaac Gym, PyTorch and Legged Gym are already prepared for you.
You should not try to reinstall Isaac Gym manually during the lab.
The important part of this lab is inside the file:
~/pupper_lab8/leggedgym/legged_gym/envs/pupper/pupper.py
The initial version contains TODO functions similar to this:
def _reward_base_height(self):
    return 0.0

def _reward_forward_velocity(self):
    return 0

def _reward_torques(self):
    return 0
As long as these functions return zero, the robot has no meaningful learning signal.
Download the starter archive from OCW:
Download Lab 8 HPC starter pack
The starter pack contains the SLURM scripts needed for running the training jobs.
Expected working directory:
~/pupper_lab8
Expected repository structure:
~/pupper_lab8/
├── isaacgym/
├── leggedgym/
├── rsl_rl/
├── pytorch_isaacgym.sif
├── pyuser_isaac/
├── conda_tools/
├── local_include/
├── torch_extensions/
├── logs/
├── lab8_gpu_check.slurm
├── lab8_import_check.slurm
├── lab8_train_isaacgym.slurm
└── lab8_reward_check.sh
The archive contains only the small helper scripts. It does not contain the large simulator files, the container image or the full repositories.
Connect to the frontend node:
ssh your_username@fep.grid.pub.ro
Go to the lab directory:
cd ~/pupper_lab8
Create the logs directory if it does not already exist:
mkdir -p logs
Before running Isaac Gym, check that your SLURM job can access the GPU.
Submit the GPU check job:
cd ~/pupper_lab8
sbatch lab8_gpu_check.slurm
Check the queue:
squeue -u $USER
After the job finishes, inspect the output:
ls logs
cat logs/gpu_check_<JOB_ID>.out
cat logs/gpu_check_<JOB_ID>.err
Replace `<JOB_ID>` with the job id printed by `sbatch`.
A successful run should show an NVIDIA GPU through `nvidia-smi`.
Before changing the reward function, check that the simulator imports correctly.
Submit the import check job:
cd ~/pupper_lab8
sbatch lab8_import_check.slurm
After the job finishes:
cat logs/import_check_<JOB_ID>.out
cat logs/import_check_<JOB_ID>.err
A successful import check should contain:
isaacgym import: OK
gymtorch import: OK
rsl_rl import: OK
legged_gym import: OK
CUDA available: True
If this step fails, do not continue to the reward task. Ask the instructor or lab assistant for help.
Now check that the Pupper training task can start.
Submit a small debug job:
cd ~/pupper_lab8
RUN_NAME="debug_128_${USER}" NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm
Check the queue:
squeue -u $USER
After the job finishes, check its output:
cat logs/pupper_train_<JOB_ID>.out
cat logs/pupper_train_<JOB_ID>.err
A successful run should contain messages similar to:
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
Learning iteration 0/10
Learning iteration 9/10
Training finished.
This means that Isaac Gym, PyTorch, CUDA and the Pupper environment are working.
Now run a slightly larger baseline:
cd ~/pupper_lab8
RUN_NAME="baseline_zero_reward_${USER}" NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm
After the job finishes, inspect the reward values:
cat logs/pupper_train_<JOB_ID>.out | grep -E "Learning iteration|Mean reward|rew_forward_velocity|episode length" | tail -80
You should observe that the reward is not useful. In the initial version, the relevant reward functions return zero.
Example:
Mean reward: 0.00
Mean episode rew_forward_velocity: 0.0000
This is expected before solving the lab.
Open the Pupper environment file:
cd ~/pupper_lab8/leggedgym
nano legged_gym/envs/pupper/pupper.py
Find the following functions:
def _reward_base_height(self):
    return 0.0

def _reward_forward_velocity(self):
    return 0

def _reward_torques(self):
    return 0
These functions are the main TODOs of this lab.
You can also use the helper script:
cd ~/pupper_lab8
./lab8_reward_check.sh
This script prints the reward functions and checks whether there are still `return 0` statements inside `pupper.py`.
Open the configuration file:
cd ~/pupper_lab8/leggedgym
nano legged_gym/envs/pupper/pupper_config.py
Look for the reward section. It should contain values similar to:
class rewards:
    forward_velocity_clip = 1.0

    class scales:
        forward_velocity = 3.0
The exact values may differ depending on the starter code version.
The important idea is that the config file contains the coefficients used to scale the reward terms.
For example:
The robot should receive a positive reward when it moves forward.
In Legged Gym, the forward velocity of the robot base is usually stored in:
self.base_lin_vel[:, 0]
This is the x-axis linear velocity of the robot base.
Replace the initial function:
def _reward_forward_velocity(self): return 0
with:
def _reward_forward_velocity(self):
    # Reward forward (x-axis) base velocity only, capped at the configured clip value.
    return torch.clip(
        self.base_lin_vel[:, 0],
        min=0.0,
        max=self.cfg.rewards.forward_velocity_clip,
    )
This reward gives positive values only when the robot moves forward. Negative velocity is clipped to zero, so moving backwards is not rewarded.
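The effect of the clipping can be checked by hand on a few sample velocities. The sketch below mirrors the computation in plain Python, one value per simulated robot, so it runs without PyTorch. The helper `clip`, the sample velocities and the clip value of 1.0 are assumptions for illustration.

```python
def clip(value, low, high):
    """Plain-Python equivalent of clamping one value to the range [low, high]."""
    return max(low, min(value, high))


forward_velocity_clip = 1.0              # assumed value; check your pupper_config.py
base_lin_vel_x = [-0.5, 0.0, 0.4, 2.0]   # made-up x velocities for four robots, in m/s

rewards = [clip(v, 0.0, forward_velocity_clip) for v in base_lin_vel_x]
print(rewards)  # [0.0, 0.0, 0.4, 1.0]
```

The robot moving backwards and the robot standing still both receive zero, and the robot moving at 2.0 m/s receives no more than the clip value, so the policy is not rewarded for exceeding that speed.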
If `torch` is not imported at the top of the file, add:
import torch
A robot should not learn to move by using extremely large motor torques. Large torques are inefficient and can produce unstable behavior.
The torque values are stored in:
self.torques
Replace:
def _reward_torques(self): return 0
with:
def _reward_torques(self):
    # Sum of squared joint torques, one value per environment.
    return torch.sum(torch.square(self.torques), dim=1)
This function returns a positive value representing how much torque the robot uses.
Important: this function returns a positive value, but it becomes a penalty if the scale in the config file is negative.
For example:
torques = -0.0002
means that large torques reduce the final reward.
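A small numeric check makes the sign convention concrete. The sketch below repeats the sum-of-squares computation in plain Python for two robots. The torque values and the three-joint layout are made up for brevity; the scale is the example value quoted above.

```python
torque_scale = -0.0002  # example scale from this handout; your config may differ

# Made-up joint torques for two robots (three joints each, for brevity).
torques = [
    [0.5, -0.5, 1.0],  # robot 0: gentle torques
    [4.0, -4.0, 8.0],  # robot 1: the same pattern, eight times larger
]

# Raw term: sum of squared torques per robot (always >= 0).
raw_terms = [sum(t * t for t in row) for row in torques]

# Contribution to the reward after applying the negative scale (always <= 0).
scaled_terms = [torque_scale * raw for raw in raw_terms]

print(raw_terms)
print(scaled_terms)
```

Because the term is quadratic, torques that are eight times larger produce a raw value 64 times larger, so the penalty grows much faster than the torque itself.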
The robot should keep its body at a reasonable height. If the body is too low or too high, the behavior is probably unstable.
The base height is stored in:
self.root_states[:, 2]
Replace:
def _reward_base_height(self): return 0.0
with:
def _reward_base_height(self):
    # Squared deviation of the base height (z coordinate) from the target height.
    base_height = self.root_states[:, 2]
    return torch.square(base_height - self.cfg.rewards.base_height_target)
This function returns the squared error between the current base height and the target base height.
Again, this becomes a penalty if its scale in the config file is negative.
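To see how the three terms interact, the sketch below combines them as a weighted sum for a single robot. It assumes the total reward is simply the sum of each term multiplied by its scale; Legged Gym may apply further processing on top of that. The `base_height` scale, the 0.20 m target and the sample term values are made up for illustration, while the other two scales are the example values quoted in this handout.

```python
# Scales: forward_velocity and torques are the example values from this handout;
# the base_height scale is a hypothetical value chosen for illustration.
scales = {"forward_velocity": 3.0, "torques": -0.0002, "base_height": -1.0}

base_height_target = 0.20  # assumed target height in metres


def base_height_term(height, target):
    """Squared error between the current and the target base height."""
    return (height - target) ** 2


def total_reward(terms, scales):
    """Weighted sum of reward terms (simplifying assumption, see the text above)."""
    return sum(scales[name] * value for name, value in terms.items())


# Two robots with the same speed and torque usage, but different body heights.
standing = {
    "forward_velocity": 0.5,
    "torques": 10.0,
    "base_height": base_height_term(0.20, base_height_target),
}
crouched = {
    "forward_velocity": 0.5,
    "torques": 10.0,
    "base_height": base_height_term(0.10, base_height_target),
}

print(total_reward(standing, scales), total_reward(crouched, scales))
```

The robot at the target height receives the higher total. Changing a scale changes how strongly that term competes with the others, which is why the config file matters as much as the reward functions.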
After editing `pupper.py`, check the relevant lines:
cd ~/pupper_lab8/leggedgym
nl -ba legged_gym/envs/pupper/pupper.py | sed -n '80,120p'
You should see the implemented reward functions, not `return 0`.
You can also run:
cd ~/pupper_lab8
./lab8_reward_check.sh
If the script still shows `return 0` inside the reward functions, your implementation is not complete.
Run the training job again:
cd ~/pupper_lab8
RUN_NAME="reward_fixed_512_${USER}" NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm
Check the job:
squeue -u $USER
After it finishes:
cat logs/pupper_train_<JOB_ID>.out | grep -E "Learning iteration|Mean reward|rew_forward_velocity|episode length" | tail -100
cat logs/pupper_train_<JOB_ID>.err
Compare the new output with the initial baseline.
You should focus on:
After the small run works, you can try larger experiments.
Run with 1000 environments:
cd ~/pupper_lab8
RUN_NAME="reward_fixed_1000_${USER}" NUM_ENVS=1000 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm
If that works, try 2000 environments:
cd ~/pupper_lab8
RUN_NAME="reward_fixed_2000_${USER}" NUM_ENVS=2000 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm
For a longer training run:
cd ~/pupper_lab8
RUN_NAME="reward_fixed_long_${USER}" NUM_ENVS=2000 MAX_ITERATIONS=300 sbatch lab8_train_isaacgym.slurm
Do not start with the largest experiment. First check that the short run works.
| What you want to do | Command |
|---|---|
| Submit GPU check | sbatch lab8_gpu_check.slurm |
| Submit import check | sbatch lab8_import_check.slurm |
| Submit training | sbatch lab8_train_isaacgym.slurm |
| Submit debug training | RUN_NAME="debug_128_${USER}" NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm |
| Check your jobs | squeue -u $USER |
| Check finished job status | sacct -j <JOB_ID> --format=JobID,JobName,Partition,State,Elapsed,ExitCode,MaxRSS,ReqMem |
| Show output log | cat logs/pupper_train_<JOB_ID>.out |
| Show error log | cat logs/pupper_train_<JOB_ID>.err |
| Show only reward lines | cat logs/pupper_train_<JOB_ID>.out \| grep -E "Mean reward\|rew_forward_velocity\|episode length" |
| Inspect reward functions | ./lab8_reward_check.sh |
If the output still shows:
Mean reward: 0.00
Mean episode rew_forward_velocity: 0.0000
check that you actually modified:
~/pupper_lab8/leggedgym/legged_gym/envs/pupper/pupper.py
and that the reward functions no longer return zero.
Use:
cd ~/pupper_lab8
./lab8_reward_check.sh
or:
grep -R "return 0" ~/pupper_lab8/leggedgym/legged_gym/envs/pupper/pupper.py
If the job fails with:
Detected 1 oom_kill event
Some of the step tasks have been OOM Killed
then the job exceeded the memory (RAM) limit requested from SLURM.
Use fewer environments:
NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm
or:
NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm
If needed, the SLURM memory limit can be increased by the instructor in the training script:
#SBATCH --mem=128G
If you see:
ValueError: Task with name: ... was not registered
check that the task name is:
pupper_flat
You can inspect registered tasks with:
cd ~/pupper_lab8/leggedgym
grep -R "task_registry.register" -n legged_gym/envs
This lab uses:
pupper_flat
Do not confuse it with:
pupper_standup
The standup task may have different reward scales, including zero forward velocity reward.
If the output stops around:
Building extension module gymtorch...
ninja: no work to do.
check the error file:
cat logs/pupper_train_<JOB_ID>.err
To check the final state and exit code of a finished job, use:
sacct -j <JOB_ID> --format=JobID,JobName,Partition,State,Elapsed,ExitCode,MaxRSS,ReqMem
Answer the following questions in your report:
Submit a short report containing:
Example comparison table:
| Run name | NUM_ENVS | MAX_ITERATIONS | Mean reward | rew_forward_velocity | Observation |
|---|---|---|---|---|---|
| baseline_zero_reward | 512 | 50 | 0.00 | 0.0000 | No useful learning signal |
| reward_fixed_512 | 512 | 50 | … | … | Reward functions implemented |
| reward_fixed_1000 | 1000 | 50 | … | … | More parallel environments |
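The values for the table can be read from the training log by hand, or extracted with a short script. The sketch below is one possible approach, assuming the log contains lines shaped like the ones quoted earlier in this handout; the real log may use different spacing or extra fields. The helper `last_value` is a hypothetical name, and the sample text only reuses the baseline lines shown above.

```python
import re

# Sample text assembled from the log lines quoted in this handout (baseline run).
sample_log = """\
Learning iteration 0/10
Mean reward: 0.00
Mean episode rew_forward_velocity: 0.0000
Learning iteration 9/10
Mean reward: 0.00
Mean episode rew_forward_velocity: 0.0000
"""

NUMBER = r"(-?\d+(?:\.\d+)?)"  # plain decimal numbers only, no scientific notation


def last_value(log_text, label):
    """Return the last number reported after 'label:', or None if it never appears."""
    matches = re.findall(re.escape(label) + r":\s*" + NUMBER, log_text)
    return float(matches[-1]) if matches else None


print(last_value(sample_log, "Mean reward"))           # 0.0
print(last_value(sample_log, "rew_forward_velocity"))  # 0.0
```

To use it on a real run, replace `sample_log` with the contents of `logs/pupper_train_<JOB_ID>.out`, for example by reading the file with `open(...).read()`.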
The most important idea in this lab is that reinforcement learning does not magically learn the behavior we want.
The agent learns what the reward function encourages.
If the reward function is zero, the robot has no reason to improve.
If the reward function encourages forward movement but also penalizes unstable or inefficient behavior, the robot has a better chance of learning useful locomotion.
After training a policy in simulation, the next step is to test it on the real Pupper robot.
This step should only be done under instructor supervision.
Before uploading anything to the real robot, make sure that:
After training, the policy is saved inside the Legged Gym logs directory.
Use:
cd ~/pupper_lab8/leggedgym
find logs -name "*.pt" | tail -20
Look for a file similar to:
model_300.pt
model_1500.pt
The exact name depends on the number of training iterations.
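When a run has saved several checkpoints, the one with the highest iteration number is usually the one to deploy. Sorting the names alphabetically can be misleading (for example, `model_300.pt` sorts after `model_1500.pt`), so the sketch below compares the numbers instead. It assumes the `model_<iteration>.pt` naming shown above; `latest_checkpoint` is a hypothetical helper and the paths are placeholders.

```python
import os
import re


def latest_checkpoint(paths):
    """Return the path whose file name is model_<N>.pt with the largest N, or None."""
    best_path, best_iteration = None, -1
    for path in paths:
        match = re.fullmatch(r"model_(\d+)\.pt", os.path.basename(path))
        if match and int(match.group(1)) > best_iteration:
            best_path, best_iteration = path, int(match.group(1))
    return best_path


# Placeholder paths; use the output of the find command above instead.
candidates = ["logs/run_a/model_300.pt", "logs/run_a/model_1500.pt", "logs/run_a/notes.txt"]
print(latest_checkpoint(candidates))  # logs/run_a/model_1500.pt
```

Files that do not follow the naming pattern are ignored, and an empty list returns `None`.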
Go to the deployment repository or folder provided by the instructor.
Example:
cd ~/pupper_lab8
mkdir -p deploy_policy
Copy the trained model:
cp ~/pupper_lab8/leggedgym/logs/<experiment_folder>/model_<iteration>.pt ~/pupper_lab8/deploy_policy/
Replace `<experiment_folder>` and `<iteration>` with the real names from your training output.
Some Pupper deployment code does not use the raw `.pt` file directly. It may require rebuilding or exporting the neural controller.
If the deployment folder contains a script such as:
rebuild_neural_controller.py
run it according to the instructor’s instructions.
Example:
cd ~/pupper_lab8/deploy
python rebuild_neural_controller.py
The exact command may differ depending on the deployment package used in the lab.
Connect to the Pupper robot using SSH.
Example:
ssh pi@pupper.local
or, if the robot has a fixed IP address:
ssh pi@<PUPPER_IP_ADDRESS>
From your local or HPC environment, copy the generated controller or policy files to the robot:
scp -r ~/pupper_lab8/deploy_policy/* pi@<PUPPER_IP_ADDRESS>:~/pupper_deploy/policies/
Replace `<PUPPER_IP_ADDRESS>` with the real IP address of the robot.
On the Pupper robot:
cd ~/pupper_deploy
python launch.py
or use the command provided by the instructor for the specific robot setup.
Observe the robot carefully.
Stop the program immediately if:
Compare the behavior in simulation with the behavior on the real robot.
Answer:
This section is for the instructor or lab assistant.
The tested working stack on the HPC cluster was:
The final import check must confirm:
isaacgym import: OK
gymtorch import: OK
rsl_rl import: OK
legged_gym import: OK
CUDA available: True
The final training script must export the same environment variables used during the successful import check, especially:
export PYTHONUSERBASE=$LAB_DIR/pyuser_isaac
export TOOL_PREFIX=$LAB_DIR/conda_tools
export PATH=$TOOL_PREFIX/bin:$PYTHONUSERBASE/bin:$PATH
export PYTHONPATH=$PYTHONUSERBASE/lib/python3.7/site-packages:${PYTHONPATH:-}
export LD_LIBRARY_PATH=$TOOL_PREFIX/lib:/opt/conda/lib:${LD_LIBRARY_PATH:-}
export LD_PRELOAD=$TOOL_PREFIX/lib/libstdc++.so.6:$TOOL_PREFIX/lib/libgcc_s.so.1
export CPATH=$LAB_DIR/local_include:$TOOL_PREFIX/include:${CPATH:-}
export CC=$TOOL_PREFIX/bin/x86_64-conda-linux-gnu-gcc
export CXX=$TOOL_PREFIX/bin/x86_64-conda-linux-gnu-c++
export TORCH_EXTENSIONS_DIR=$LAB_DIR/torch_extensions
export MAX_JOBS=1
The final `lab8_train_isaacgym.slurm` should pass the task variables into Apptainer using:
export APPTAINERENV_LAB_DIR="$LAB_DIR"
export APPTAINERENV_TASK="$TASK"
export APPTAINERENV_NUM_ENVS="$NUM_ENVS"
export APPTAINERENV_MAX_ITERATIONS="$MAX_ITERATIONS"
export APPTAINERENV_RUN_NAME="$RUN_NAME"
A safe starting point for students is:
RUN_NAME="debug_128_${USER}" NUM_ENVS=128 MAX_ITERATIONS=10 sbatch lab8_train_isaacgym.slurm
A tested larger run is:
RUN_NAME="baseline_zero_reward_${USER}" NUM_ENVS=512 MAX_ITERATIONS=50 sbatch lab8_train_isaacgym.slurm
If `NUM_ENVS=2000` causes an OOM kill, reduce the number of environments or increase the requested memory in the SLURM script.