sample
Sample experiences from replay memory buffer
Syntax
experience = sample(buffer,batchSize)
experience = sample(buffer,batchSize,Name=Value)
[experience,Mask] = sample(buffer,batchSize,Name=Value)
Description
experience = sample(buffer,batchSize) returns a mini-batch of N experiences from the replay memory buffer, where N is specified using batchSize.
experience = sample(buffer,batchSize,Name=Value) specifies additional sampling options using one or more name-value pair arguments.
[experience,Mask] = sample(buffer,batchSize,Name=Value) also returns the sequence padding mask Mask, which indicates which sampled experiences are real data and which are padding.
Examples
Create Experience Buffer
Define observation specifications for the environment. For this example, assume that the environment has a single observation channel with three continuous signals in specified ranges.
obsInfo = rlNumericSpec([3 1],...
    LowerLimit=0,...
    UpperLimit=[1;5;10]);
Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.
actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);
Create an experience buffer with a maximum length of 20,000.
buffer = rlReplayMemory(obsInfo,actInfo,20000);
Append a single experience to the buffer using a structure. Each experience contains the following elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate that this experience is not a terminal condition by setting the IsDone
value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
Append the experience to the buffer.
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this example, append a sequence of 100 random experiences, with the final experience representing a terminal condition.
for i = 1:100
    expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
    expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
    expBatch(i).Reward = 10*rand(1);
    expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
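The sampled experiences are batched together within each field of the returned structure. As a quick check (a sketch, not part of the original example), you can inspect how the single observation channel is batched:
% Observation is a cell array with one cell per observation channel;
% each cell batches the 50 sampled observations together.
size(miniBatch.Observation{1})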
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
    NStepHorizon=10,...
    DiscountFactor=0.95);
The returned sample includes the following information.
- Observation and Action are the observation and action from the first experience in the horizon.
- NextObservation and IsDone are the next observation and termination signal from the final experience in the horizon.
- Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
    SequenceLength=20);
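You can also request the second output, Mask, to see which steps of a sampled sequence are real data rather than padding. This is a brief sketch based on the Mask output described below.
% Sample one 20-step sequence together with its padding mask.
[sequenceSample,mask] = sample(buffer,1,SequenceLength=20);

% mask is a logical array of length 20; true marks real experiences.
numRealSteps = sum(mask)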
Create Experience Buffer with Multiple Observation Channels
Define observation specifications for the environment. For this example, assume that the environment has two observation channels: one channel with two continuous observations and one channel with a three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];
Define action specifications for the environment. For this example, assume that the environment has a single action channel with two continuous signals in specified ranges.
actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);
Create an experience buffer with a maximum length of 5,000.
buffer = rlReplayMemory(obsInfo,actInfo,5000);
Append a sequence of 50 random experiences to the buffer.
for i = 1:50
    exp(i).Observation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    exp(i).NextObservation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    exp(i).Reward = 10*rand(1);
    exp(i).IsDone = 0;
end
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
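Because the buffer was created with two observation channels, the Observation field of the sampled structure is a two-element cell array, one cell per channel. For example (a quick sketch, not part of the original example):
% Continuous channel samples and discrete channel samples, respectively.
contObsBatch = miniBatch.Observation{1};
discObsBatch = miniBatch.Observation{2};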
Input Arguments
buffer — Experience buffer
rlReplayMemory object
Experience buffer, specified as an rlReplayMemory object.
batchSize — Batch size
positive integer
Batch size of experiences to sample, specified as a positive integer.
If batchSize is greater than the current length of the buffer, then sample returns no experiences.
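Because of this behavior, a common pattern is to guard the call until the buffer contains enough experiences. The following is a minimal sketch; it assumes the buffer's Length property reports the current number of stored experiences.
batchSize = 64;
% Sample only after enough experiences have been appended (Length is
% assumed to report the current number of stored experiences).
if buffer.Length >= batchSize
    miniBatch = sample(buffer,batchSize);
end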
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: DiscountFactor=0.95
SequenceLength — Sequence length
1 (default) | positive integer
Sequence length, specified as a positive integer. For each batch element, sample up to SequenceLength consecutive experiences. If a sampled experience has a nonzero IsDone value, stop the sequence at that experience.
NStepHorizon — N-step horizon length
1 (default) | positive integer
N-step horizon length, specified as a positive integer. For each batch element, sample up to NStepHorizon consecutive experiences. If a sampled experience has a nonzero IsDone value, stop the horizon at that experience. Return the following experience information based on the sampled horizon.
- Observation and Action values from the first experience in the horizon
- NextObservation and IsDone values from the final experience in the horizon
- Cumulative reward across the horizon using the specified discount factor, DiscountFactor
Sampling an n-step horizon is not supported when sampling sequences. Therefore, if SequenceLength > 1, then NStepHorizon must be 1.
DiscountFactor — Discount factor
0.99 (default) | nonnegative scalar less than or equal to one
Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a horizon of experiences (NStepHorizon > 1), sample returns the cumulative reward R computed as follows.

R = \sum_{i=1}^{N} \gamma^{i-1} R_i

Here:
- γ is the discount factor.
- N is the sampled horizon length, which can be less than NStepHorizon.
- R_i is the reward for the ith horizon step.
DiscountFactor applies only when NStepHorizon is greater than one.
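For example, you can reproduce this computation by hand for a short horizon. This is a sketch using hypothetical reward values; it is not part of the sample interface.
% Hypothetical rewards for a sampled horizon of length N = 3
rewards = [1 2 3];
gamma = 0.95;

% R = sum over i of gamma^(i-1)*Ri
R = sum(gamma.^(0:numel(rewards)-1).*rewards)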
DataSourceID — Data source index
-1 (default) | nonnegative integer
Data source index, specified as one of the following:
- -1 — Sample from the experiences of all data sources.
- Nonnegative integer — Sample from the experiences of only the data source specified by DataSourceID.
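For example, if experiences from multiple data sources (such as parallel simulation workers) have been appended to the buffer, you can restrict sampling to one of them. This sketch uses a hypothetical data source index of 0, which must match an index used when the experiences were appended.
% Sample 32 experiences recorded by data source 0 only (hypothetical index).
workerBatch = sample(buffer,32,DataSourceID=0);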
Output Arguments
experience — Experience sampled from the buffer
structure
Experience sampled from the buffer, returned as a structure with the following fields.
Observation — Starting state
cell array
Starting state, returned as a cell array with length equal to the number of observation specifications specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-SequenceLength array, where DO is the dimension of the corresponding observation specification.
Action — Agent action from starting state
cell array
Agent action from starting state, returned as a cell array with length equal to the number of action specifications specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward — Reward value
scalar | array
Reward value obtained by taking the specified action from the starting state, returned as a 1-by-1-by-SequenceLength array.
NextObservation — Next state
cell array
Next state reached by taking the specified action from the starting state, returned as a cell array with the same format as Observation.
IsDone — Termination signal
integer | array
Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.
- 0 — This experience is not the end of an episode.
- 1 — The episode terminated because the environment generated a termination signal.
- 2 — The episode terminated by reaching the maximum episode length.
Mask — Sequence padding mask
logical array
Sequence padding mask, returned as a logical array with length equal to SequenceLength. When the sampled sequence length is less than SequenceLength, the data returned in experience is padded. Each element of Mask is true for a real experience and false for a padded experience.
You can ignore Mask when SequenceLength is 1.
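For example, you can use Mask to discard padded steps before computing statistics over a sampled sequence. This is a minimal sketch, assuming a buffer filled as in the earlier examples.
[seq,mask] = sample(buffer,1,SequenceLength=20);

% Reward is 1-by-1-by-SequenceLength; keep only the non-padded steps.
validRewards = seq.Reward(:,:,mask);
meanReward = mean(validRewards(:))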
Version History
See Also