This example shows how to model a humanoid robot using Simscape Multibody™ and train it using either a genetic algorithm (which requires a Global Optimization Toolbox license) or reinforcement learning (which requires Deep Learning Toolbox™ and Reinforcement Learning Toolbox™ licenses).

The humanoid walker model is based on the example Import a URDF Humanoid Model. Each leg has torque-actuated revolute joints in the frontal hip, knee, and ankle. Each arm has two passive revolute joints in the frontal and sagittal shoulder. The contact forces, position and orientation of the torso, joint states, and forward position are all sensed by the system.

The Spatial Contact Forces blocks are used to model contact between the feet and ground. Contact is simplified and modeled as six spherical contact points matching the shape of the foot.
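As a rough illustration of this style of contact model (a penalty-force sketch in Python, not the Spatial Contact Force block's actual implementation; the function name and the stiffness and damping gains are hypothetical), the normal force on a single contact sphere might look like:

```python
def contact_normal_force(sphere_height, radius, k, c, vz):
    """Penalty-style sphere-on-ground contact: the normal force grows
    with penetration depth, is damped by the vertical velocity, and is
    zero when the sphere is not touching the ground."""
    penetration = radius - sphere_height   # > 0 when the sphere dips below ground
    if penetration <= 0:
        return 0.0                         # no contact, no force
    # Spring-damper normal force, clamped so contact never pulls downward.
    return max(0.0, k * penetration - c * vz)
```

Summing such forces over the six spheres per foot approximates the foot-ground interaction.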

A stiffness-based feedback controller is used to control each joint [1]. Joints are treated as first-order systems with an associated stiffness ($K$) and damping ($B$), which can be set to make the joint behavior critically damped. Consequently, the torque applied when the set-point $\theta_0$ differs from the current joint position $\theta$ is:

$$T = B\dot{\theta} + K(\theta_0 - \theta)$$
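As a numerical sketch of this control law (in Python rather than Simulink; the sign convention here is an assumption, with the damping term written so that a positive $B$ opposes the joint velocity, and $B = 2\sqrt{KI}$ giving critical damping for a joint of inertia $I$):

```python
def joint_torque(theta, theta_dot, theta0, K, B):
    """Spring-damper joint torque: stiffness K pulls the joint toward the
    set-point theta0, while damping B opposes the joint velocity."""
    return K * (theta0 - theta) - B * theta_dot
```

Stepping a unit-inertia joint forward in time with this torque drives it to the set-point without oscillation when $B = 2\sqrt{K}$.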

You can vary the spring set-point $\theta_0$ to elicit a feedback response that moves the joint. This control law can be implemented in Simulink with standard gain and summation blocks.

You can use various methods to train the humanoid to walk.

An objective function is used to evaluate different walking styles. A reward ($r_t$) is given at each timestep based on the following factors:

- Forward velocity ($v_y$) is rewarded.

- Not falling over is rewarded.

- Power consumption ($p$) is penalized.

- Vertical displacement ($\Delta z$) is penalized.

- Lateral displacement ($\Delta x$) is penalized.

The total reward at each timestep is:

$$r_t = w_1 v_y + w_2 t_s - w_3 p - w_4 \Delta z - w_5 \Delta x$$

where $w_{1,\dots,5}$ are weights that represent the relative importance of each term in the reward function. Consequently, the total reward ($R$) for a walking trial is:

$$R = \sum_{t=0}^{T} r_t$$
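The per-step and total rewards can be sketched as follows (illustrative Python with placeholder unit weights and absolute-value displacement penalties; the example's actual weight values are set in the `sm_humanoid_walker_rl_parameters` script):

```python
def step_reward(vy, ts, p, dz, dx, w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """r_t = w1*vy + w2*ts - w3*p - w4*|dz| - w5*|dx|."""
    w1, w2, w3, w4, w5 = w
    return w1 * vy + w2 * ts - w3 * p - w4 * abs(dz) - w5 * abs(dx)

def trial_reward(per_step_rewards):
    """Total reward R for a walking trial: the sum of r_t until termination."""
    return sum(per_step_rewards)
```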

where $T$ is the time at which the simulation terminates. The reward weights can be changed in the `sm_humanoid_walker_rl_parameters` script. The simulation terminates when the simulation time is reached or the robot falls. Falling is defined as:

- The robot dropping below 0.5 m.

- The robot moving laterally by more than 1 m.

- The robot torso rotating by more than 30 degrees.
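These termination conditions amount to a simple threshold check, sketched here with an illustrative function name and signature (not taken from the model):

```python
def has_fallen(torso_height, lateral_offset, torso_tilt_deg):
    """True when any fall condition from the text is met: torso below
    0.5 m, lateral drift beyond 1 m, or torso tilt beyond 30 degrees."""
    return (torso_height < 0.5
            or abs(lateral_offset) > 1.0
            or abs(torso_tilt_deg) > 30.0)
```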

To optimize walking, you can use a genetic algorithm. For more information, see `ga` (Global Optimization Toolbox). A genetic algorithm solves optimization problems based on a natural selection process that mimics biological evolution. Genetic algorithms are especially suited to problems in which the objective function is discontinuous, nondifferentiable, stochastic, or highly nonlinear.

The angular demand for each joint is set to a repeating pattern, analogous to the central pattern generators seen in nature [2]; this is an open-loop controller. The periodicity of the signals is the gait period, the time taken to complete one full step. During each gait period, the signal switches between different angular demand values. Ideally, the humanoid robot walks symmetrically, so the control pattern for each joint in the right leg is transmitted to the corresponding joint in the left leg with a delay of half a gait period. The pattern generator aims to find, for each joint, the control pattern that maximizes the walking objective function.
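A minimal sketch of such a pattern generator, assuming a piecewise-constant demand signal (illustrative Python; the real demand shapes are what the optimization tunes):

```python
def joint_demand(t, pattern, gait_period):
    """Open-loop, piecewise-constant angular demand that repeats every
    gait period; `pattern` holds the demand values visited in one cycle."""
    phase = (t % gait_period) / gait_period           # fraction of the cycle
    return pattern[int(phase * len(pattern)) % len(pattern)]

def left_leg_demand(t, right_pattern, gait_period):
    """Symmetric walking: the left leg replays the right-leg pattern,
    delayed by half a gait period (Python's % keeps negative times in range)."""
    return joint_demand(t - gait_period / 2.0, right_pattern, gait_period)
```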

To train using a genetic algorithm, open the `sm_humanoid_walker_ga_train` script. By default, this example uses a pretrained humanoid walker. To train the humanoid walker, set `trainWalker` to `true`.
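To show the natural-selection process in miniature, here is a toy genetic algorithm (not the Global Optimization Toolbox `ga` function; in the example, evaluating the fitness of a genome would mean running a full walking simulation):

```python
import random

def ga_optimize(fitness, n_genes, pop_size=40, generations=60,
                mutation_rate=0.1, bounds=(-1.0, 1.0)):
    """Toy GA: truncation selection, one-point crossover, per-gene
    mutation. Each genome is a list of n_genes real values."""
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]          # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_genes):              # random per-gene mutation
                if random.random() < mutation_rate:
                    child[i] = random.uniform(lo, hi)
            children.append(child)
        pop = parents + children                  # elitism: parents survive
    return max(pop, key=fitness)
```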

You can also train the walker using a deep deterministic policy gradient (DDPG) reinforcement learning agent. A DDPG agent is an actor-critic reinforcement learning agent that computes an optimal policy that maximizes the long-term reward. DDPG agents can be used in systems with continuous actions and states. For details about DDPG agents, see `rlDDPGAgent` (Reinforcement Learning Toolbox).
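The core DDPG updates can be sketched as follows (illustrative NumPy showing the critic-target and soft-target-update rules the algorithm is built on, not Reinforcement Learning Toolbox internals):

```python
import numpy as np

def ddpg_targets(rewards, next_q, gamma=0.99, done=None):
    """Critic targets y = r + gamma * Q'(s', mu'(s')) for a batch, where
    next_q comes from the target actor/critic pair; terminal steps
    (done == 1) drop the bootstrapped term."""
    done = np.zeros_like(rewards) if done is None else done
    return rewards + gamma * (1.0 - done) * next_q

def soft_update(target_params, params, tau=0.005):
    """Target networks slowly track the learned networks:
    target <- tau * params + (1 - tau) * target."""
    return tau * params + (1.0 - tau) * target_params
```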

To train using reinforcement learning, open the `sm_humanoid_walker_rl_train` script. By default, this example uses a pretrained humanoid walker. To train the humanoid walker, set `trainWalker` to `true`.

[1] Kalveram, Karl T., Thomas Schinauer, Steffen Beirle, Stefanie Richter, and Petra Jansen-Osmann. “Threading Neural Feedforward into a Mechanical Spring: How Biology Exploits Physics in Limb Control.” Biological Cybernetics 92, no. 4 (April 2005): 229–40. https://doi.org/10.1007/s00422-005-0542-6.

[2] Jiang Shan, Cheng Junshi, and Chen Jiapin. “Design of Central Pattern Generator for Humanoid Robot Walking Based on Multi-Objective GA.” In Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000) (Cat. No.00CH37113), 3:1930–35. Takamatsu, Japan: IEEE, 2000. https://doi.org/10.1109/IROS.2000.895253.