Why do industrial robots require teams of engineers and thousands of lines of code to perform even the most basic, repetitive tasks while giraffes, horses, and many other animals can walk within minutes of their birth?
My colleagues and I at the USC Brain-Body Dynamics Lab began to address this question by creating a robotic limb that learned to move, with no prior knowledge of its own structure or environment [1,2]. Within minutes, G2P, our reinforcement learning algorithm implemented in MATLAB®, learned how to move the limb to propel a treadmill (Figure 1).
Tendon-Driven Limb Control Challenges
The robotic limb has an architecture resembling the muscle and tendon structure that powers human and vertebrate movement [1,2]. Tendons connect muscles to bones, making it possible for the biological motors (muscles) to exert force on bones from a distance [3,4]. (The dexterity of the human hand is achieved through a tendon-driven system; there are no muscles in the fingers themselves!)
While tendons have mechanical and structural advantages, a tendon-driven robot is significantly more challenging to control than a traditional robot, for which a simple PID controller acting directly on joint angles is often sufficient. In a tendon-driven robotic limb, multiple motors may act on a single joint, and a given motor may act on multiple joints. As a result, the system is simultaneously nonlinear, over-determined, and under-determined, greatly increasing the control design complexity and calling for a new control design approach.
The G2P Algorithm
The learning process for the G2P (general-to-particular) algorithm has three phases: motor babbling, exploration, and exploitation. Motor babbling is a five-minute period in which the limb performs a series of random movements similar to the movements a baby vertebrate uses to learn the capabilities of its body.
During the motor babbling phase, the G2P algorithm randomly produces a series of step changes to the current of the limb’s three DC motors (Figure 2), and encoders at each limb joint measure joint angles, angular velocities, and angular accelerations.
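The babbling signal described above can be sketched in a few lines. This is an illustrative Python stand-in, not the lab's MATLAB code: the sampling interval, the hold time of each step, and the normalized current range are all assumptions chosen for the example, and only the five-minute duration and the three motors come from the description above.

```python
import random

def babble_currents(n_motors=3, duration_s=300.0, dt=0.01,
                    min_hold_s=0.2, max_hold_s=1.0, max_current=1.0, seed=0):
    """Return one current command per time step, each a list of n_motors values.

    Every motor gets its own piecewise-constant signal: a random level is
    held for a random time, then stepped to a new random level.
    dt, the hold range, and max_current are hypothetical values.
    """
    rng = random.Random(seed)
    n_steps = int(round(duration_s / dt))
    signals = []
    for _ in range(n_motors):
        signal = []
        while len(signal) < n_steps:
            level = rng.uniform(0.0, max_current)  # new random step level
            hold = max(1, int(round(rng.uniform(min_hold_s, max_hold_s) / dt)))
            signal.extend([level] * hold)          # hold the level
        signals.append(signal[:n_steps])
    # Transpose so that each entry holds the commands for one time step.
    return [list(sample) for sample in zip(*signals)]

commands = babble_currents()  # 300 s of babbling for three motors
```

On real hardware, each command would be sent to the motor drivers while the joint encoders are logged, yielding the paired current and kinematics data used in the next step.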
The algorithm then generates a multilayer perceptron artificial neural network (ANN) using Deep Learning Toolbox™. Trained with the angular measurements as input data and the motor currents as output data, the ANN serves as an inverse map linking the limb kinematics to the motor currents that produce them (Figure 3).
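To make the idea of an inverse map concrete, here is a minimal pure-Python sketch of a one-hidden-layer perceptron trained to map kinematics to motor currents. It is not the Deep Learning Toolbox network used in the project. The layer sizes, learning rate, and the synthetic `fake_currents` function that stands in for recorded babbling data are all assumptions made for illustration.

```python
import math
import random

class TinyMLP:
    """One-hidden-layer perceptron: kinematics in, motor currents out."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.05):
        """One stochastic gradient step on the squared error."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Backpropagate to the hidden layer before the output weights change.
        grad_h = [sum(err[k] * self.w2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
                  for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, g in enumerate(grad_h):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g

def mean_squared_error(net, data):
    return sum(sum((yi - ti) ** 2 for yi, ti in zip(net.predict(x), t))
               for x, t in data) / len(data)

# Synthetic stand-in for babbling data: 6 kinematic inputs (angle, velocity,
# acceleration for two joints) and 3 motor currents. Purely hypothetical.
rng = random.Random(1)
mix = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(3)]

def fake_currents(x):
    return [0.5 + 0.4 * math.tanh(sum(m * xi for m, xi in zip(row, x))) for row in mix]

data = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0) for _ in range(6)]
    data.append((x, fake_currents(x)))

net = TinyMLP(n_in=6, n_hidden=8, n_out=3)
loss_before = mean_squared_error(net, data)
for _ in range(30):            # a few passes over the data
    for x, t in data:
        net.train_step(x, t)
loss_after = mean_squared_error(net, data)
```

Once trained, `net.predict(kinematics)` plays the role of the inverse map: given a desired set of joint angles, velocities, and accelerations, it returns an estimate of the motor currents that would produce them.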
Next, the G2P algorithm enters an exploration phase, the first of two phases of reinforcement learning. In the exploration phase, the algorithm directs the robotic limb to repeat a series of cyclic movements and then measures how far the treadmill moved. For the cyclic movements, the algorithm uses a uniform random distribution to generate 10 points, with each point representing a pair of joint angles. These 10 points are interpolated to create a complete trajectory in joint space for the cyclic movement. The algorithm then calculates the angular velocities and accelerations for this trajectory and uses the inverse map to obtain the associated motor activation values for the complete cycle. The algorithm feeds these values to the limb’s three motors, repeating the cycle 20 times before checking how far the treadmill moved.
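The trajectory-generation step can be illustrated with a short Python sketch. The interpolation scheme is not specified above, so this example assumes simple linear interpolation and central finite differences. The joint-angle ranges, the number of samples per segment, and the time step are likewise hypothetical.

```python
import random

def random_cycle(n_points=10, samples_per_segment=20,
                 angle_ranges=((-0.5, 0.5), (-1.0, 0.0)), seed=0):
    """Draw n_points random joint-angle pairs and interpolate a closed cycle."""
    rng = random.Random(seed)
    waypoints = [tuple(rng.uniform(lo, hi) for lo, hi in angle_ranges)
                 for _ in range(n_points)]
    closed = waypoints + [waypoints[0]]  # return to the start to close the loop
    trajectory = []
    for start, end in zip(closed, closed[1:]):
        for s in range(samples_per_segment):
            frac = s / samples_per_segment
            trajectory.append(tuple(a + frac * (b - a) for a, b in zip(start, end)))
    return waypoints, trajectory

def cyclic_derivative(samples, dt):
    """Central difference that wraps around, since the movement is cyclic."""
    n = len(samples)
    return [tuple((nxt - prv) / (2.0 * dt)
                  for prv, nxt in zip(samples[i - 1], samples[(i + 1) % n]))
            for i in range(n)]

dt = 0.01  # assumed sampling interval in seconds
waypoints, angles = random_cycle()
velocities = cyclic_derivative(angles, dt)
accelerations = cyclic_derivative(velocities, dt)

# One feature row per time step: angles, velocities, accelerations for both joints.
features = [a + v + acc for a, v, acc in zip(angles, velocities, accelerations)]
```

Each row of `features` is what would be passed to the inverse map to obtain motor activations for that instant. Linear interpolation produces abrupt velocity changes at the waypoints, so a smoother interpolant such as a periodic spline may be a better choice in practice.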
The distance that the limb propels the treadmill is the reward for that attempt: the greater the distance, the higher the reward. If the reward is small or nonexistent, then the algorithm generates a new random cycle and makes another attempt. The algorithm updates the inverse map with the new kinematic information gathered during each attempt. If, however, the reward exceeds a baseline performance threshold (an empirically determined 64 mm), then the algorithm enters its second reinforcement learning phase: exploitation.
In this phase, having identified a series of movements that works reasonably well, the algorithm begins looking for a better solution in the vicinity of the trajectory it previously tested. It does this by using a Gaussian distribution to generate random values near the values used in the previous attempt. If the reward for this new set of values is higher than the previous set, it keeps going, recentering the Gaussian distribution on the new best set of values. When an attempt produces a reward that is lower than the current best, those values are rejected in favor of the “best-so-far” values (Figure 4).
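The two reinforcement learning phases can be sketched as a single loop. The following Python example is a simplified illustration, not the G2P implementation: `simulated_reward` is a hypothetical stand-in for the measured treadmill distance, the parameter vector is normalized to the range 0 to 1, and the Gaussian width `sigma` is an assumed value. Only the 64 mm threshold comes from the description above. The sketch also omits the inverse-map update that follows each attempt.

```python
import random

THRESHOLD_MM = 64.0  # baseline performance threshold quoted in the article

def simulated_reward(params):
    """Hypothetical treadmill distance in mm; largest when every value is 0.5."""
    error = sum((p - 0.5) ** 2 for p in params) / len(params)
    return max(0.0, 100.0 * (1.0 - 6.0 * error))

def explore_then_exploit(reward_fn, n_params=20, max_attempts=500,
                         sigma=0.05, seed=0):
    """Uniform random search until the threshold is exceeded, then local
    Gaussian search recentered on the best-so-far parameters."""
    rng = random.Random(seed)
    best_params, best_reward = None, float("-inf")
    history = []
    for _ in range(max_attempts):
        if best_reward <= THRESHOLD_MM:
            # Exploration: a brand-new random cycle (10 points x 2 joint angles).
            phase = "exploration"
            candidate = [rng.uniform(0.0, 1.0) for _ in range(n_params)]
        else:
            # Exploitation: sample near the best-so-far values.
            phase = "exploitation"
            candidate = [min(1.0, max(0.0, rng.gauss(p, sigma))) for p in best_params]
        reward = reward_fn(candidate)
        if reward > best_reward:
            # Keep the improvement; worse attempts are simply discarded.
            best_params, best_reward = candidate, reward
        history.append((phase, reward, best_reward))
    return best_params, best_reward, history

best_params, best_reward, history = explore_then_exploit(simulated_reward)
```

Because a candidate replaces the current best only when its reward is higher, the best-so-far reward in `history` never decreases, which mirrors the keep-or-reject rule described above.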
The Emergence of Unique Gaits
Each time the G2P algorithm runs, it starts to learn anew, exploring the dynamics of the robotic limb with a new randomized set of movements. When, by chance, the motor babbling or exploration phases are particularly effective, the algorithm learns faster and needs fewer attempts to reach the exploitation phase (Figure 5). The algorithm does not seek the optimal set of movements for propelling the treadmill, only movements that are good enough. Humans and other organisms also learn to use their bodies “well enough” because there is a cost associated with each practice attempt, including risk of injury, fatigue, and the expenditure of time and energy that could be applied to learning other skills.
One remarkable consequence of starting with random movements and searching for a “good enough” solution is that the algorithm produces a different gait every time it is run. We’ve seen the G2P algorithm produce a wide variety of gait patterns, from heavy stomping to dainty tip-toeing. We call these unique gaits the “motor personalities” that our robot can develop. We believe such approaches will enable robots to have more anthropomorphic features and traits in the future.
Adding Feedback and Future Enhancements
The initial implementation of G2P was entirely feed-forward. As a result, it had no way of responding to perturbations, such as a collision, other than the system’s passive response. To address this issue, we implemented a version of G2P that incorporates minimal feedback. Even in the presence of reasonably lengthy sensory delays (100 ms), we found that the addition of simple feedback enabled this new G2P algorithm to compensate for errors arising from impacts or imperfections in the inverse map. We also found that feedback accelerates learning, so that shorter motor babbling sessions or fewer exploration/exploitation attempts are needed.
We plan to extend the principles embodied in the G2P algorithm to the development of biped and quadruped robots, as well as robotic manipulation.
Our team decided to use MATLAB for this project over other available software packages for a number of reasons. Firstly, our research is multidisciplinary, and involves neuroscientists and computer scientists as well as biomedical, mechanical, and electrical engineers. Whatever their discipline, every member of the team knows MATLAB, making it a common language and an effective means of collaborating.
Another reason for choosing MATLAB is that it makes the work easier for other researchers to replicate and extend. The code we wrote can be run on any version of MATLAB. If, for example, we apply zero-phase digital filtering using filtfilt() in MATLAB, we can be confident that others will be able to use that same function and get the same results. In Python or C, by contrast, there would be packages or library versions to worry about, as well as dependencies requiring updates or even downgrades to other packages already in use. In my experience, MATLAB has no such limitations.
Lastly, I want to mention the outstanding customer support that comes with MATLAB. The customer support team helped us with some problems that we were having with our data acquisition. Their response time and the level of expertise they had on the topic were impressive.
I gratefully acknowledge my colleagues Darío Urbina-Meléndez and Brian Cohn, as well as Dr. Francisco Valero-Cuevas, director of the Brain-Body Dynamics Lab (ValeroLab.org) and PI, who collaborated with me on the projects described in this article. I also want to thank our sponsors including DoD, DARPA, NIH, and USC graduate school for their support for this project.