mdp_value_iteration description
mdp_value_iteration
Solves discounted MDP with value iteration algorithm.
Syntax
[V, policy, iter, cpu_time] = mdp_value_iteration (P, R, discount)
[V, policy, iter, cpu_time] = mdp_value_iteration (P, R, discount, epsilon)
[V, policy, iter, cpu_time] = mdp_value_iteration (P, R, discount, epsilon, max_iter)
[V, policy, iter, cpu_time] = mdp_value_iteration (P, R, discount, epsilon, max_iter, V0)
Description
mdp_value_iteration applies the value iteration algorithm to
solve discounted MDP. The algorithm consists in
solving Bellman's equation iteratively.
Iterating is stopped when an epsilon-optimal policy is found or after a specified number (max_iter) of iterations.
This function uses verbose and silent modes. In verbose mode, the function
displays the variation of V (value function) for each iteration and the condition which stopped iterations: epsilon-policy found or maximum number of iterations reached.
Arguments
- P : transition probability array.
P can be a 3 dimensions array (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS).
R can be a 3 dimensions array (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array (SxA) possibly sparse.
- discount : discount factor.
discount is a real which belongs to [0; 1].
For discount equals to 1, a warning recalls to check conditions of convergence.
- epsilon (optional) : search for an epsilon-optimal policy.
epsilon is a real in ]0; 1].
By default, epsilon = 0.01.
- max_iter (optional) : maximum number of iterations.
max_iter is an integer greater than 0.
If the value given in argument is greater than a computed bound, a warning informs that the computed bound will be considered.
By default, if discount is not egal to 1, a bound for max_iter is computed, if not max_iter = 1000.
- V0 (optional) : starting value function.
V0 is a (Sx1) vector.
By default, V0 is only composed of 0 elements.
Evaluations
- V : optimal value fonction.
V is a vector composed by S elements.
policy is a (Sx1) vector. Each element is an integer
corresponding to an action which maximizes the value function.
- iter : number of done iterations.
- cpu_time : CPU time used to run the program.
Example
In grey, verbose mode display.
>> P(:,:,1) = [ 0.5 0.5;   0.8 0.2 ];
>> P(:,:,2) = [ 0 1;   0.1 0.9 ];
>> R = [ 5 10;   -1 2 ];
>> [V, policy, iter, cpu_time] = mdp_value_iteration(P, R, 0.9)
   Iteration   V_variation
     1      8
     2      2.76
     3      1.9872
     4      1.4308
     5      1.0302
     6      0.74172
     7      0.53404
     8      0.38451
     9      0.27684
     10      0.19933
     11      0.14352
     12      0.10333
     13      0.074399
     14      0.053567
     15      0.038568
     16      0.027769
     17      0.019994
     18      0.014396
     19      0.010365
     20      0.0074627
     21      0.0053731
     22      0.0038686
     23      0.0027854
     24      0.0020055
     25      0.001444
     26      0.0010397
MDP Toolbox : iterations stopped, epsilon-optimal policy found
V =
   40.0486
   33.6537
policy =
   2
   1
iter =
   26
cpu_time =
   0.1200
In the above example, P can be a cell array containing sparse matrices:
>> P{1} = sparse([ 0.5 0.5;  0.8 0.2 ]);
>> P{2} = sparse([ 0 1;  0.1 0.9 ]);
The function call is unchanged.
MDPtoolbox/documentation/mdp_value_iteration.html
Page created on July 31, 2001. Last update on August 31, 2009.