mdp_value_iterationGS description
mdp_value_iterationGS
Solves discounted MDP with Gauss-Seidel's value iteration algorithm.
Syntax
[V, policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount)
[V, policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount, epsilon)
[V, policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount, epsilon, max_iter)
[V, policy, iter, cpu_time] = mdp_value_iterationGS (P, R, discount, epsilon, max_iter, V0)
Description
mdp_value_iterationGS applies Gauss-Seidel's value iteration
algorithm to solve discounted MDP. The algorithm
consists, like value iteration, in solving Bellman's equation
iteratively Vn+1(s) calculation is modified. The algorithm
uses Vn+1(s) instead of Vn(s) whenever this value has been
calculated. In this way, convergence speed is improved.
Iterating is stopped when an epsilon-optimal policy is found or
after a specified number (max_iter) of iterations.
This function uses verbose and silent modes. In verbose mode, the function
displays the variation of V (value function) for each iteration and the condition which stopped iterations: epsilon-policy found or maximum number of iterations reached.
Arguments
- P : transition probability array.
P can be a 3 dimensions array (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS).
R can be a 3 dimensions array (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array (SxA) possibly sparse.
- discount : discount factor.
discount is a real which belongs to ]0; 1].
For discount equals to 1, a warning recalls to check conditions of convergence.
- epsilon (optional) : search of an epsilon-optimal policy.
epsilon is a real in ]0; 1].
By default, epsilon is set to 0.01.
- max_iter (optional) : maximum number of iterations to be done.
max_iter is an integer greater than 0.
If the value given in argument is greater than a computed bound, a warning informs that the computed bound will be considered.
By default, if discount is not equal to 1, a bound for max_iter is computed, if not max_iter is set to 1000.
- V0 (optional) : starting value function.
V0 is a (Sx1) vector.
By default, V0 is only composed of 0 elements.
Evaluations
- V : optimal value fonction.
V is a (Sx1) vector.
- policy : epsilon-optimal policy.
policy is a (Sx1) vector. Each element is an integer
corresponding to an action which maximizes the value function.
- iter : number of done iterations.
- cpu_time : CPU time used to run the program.
Example
In grey, verbose mode display.
>> P(:,:,1) = [ 0.5 0.5;   0.8 0.2 ];
>> P(:,:,2) = [ 0 1;   0.1 0.9 ];
>> R = [ 5 10;   -1 2 ];
>> [V, policy, iter, cpu_time] = mdp_value_iterationGS(P, R, 0.9)
   Iteration   V_variation
       1               3.8
       2               0.4464
       3               0.36962
       4               0.30604
       5               0.25341
       6               0.20982
       7               0.17373
       8               0.14385
       9               0.11911
       10               0.09862
       11               0.081658
       12               0.067613
       13               0.055983
       14               0.046354
       15               0.038381
       16               0.03178
       17               0.026314
       18               0.021788
       19               0.01804
       20               0.014937
       21               0.012368
       22               0.010241
       23               0.0084793
       24               0.0070209
       25               0.0058133
       26               0.0048134
       27               0.0039855
       28               0.0033
MDP Toolbox : iterations stopped by maximum number of iteration condition
V =
   42.2774
   35.8952
policy =
   2
   1
iter =
   28
cpu_time =
   0.1900
In the above example, P can be a cell array containing sparse matrices:
>> P{1} = sparse([ 0.5 0.5;  0.8 0.2 ]);
>> P{2} = sparse([ 0 1;  0.1 0.9 ]);
The function call is unchanged.
MDPtoolbox/documentation/mdp_value_iterationGS.html
Page created on July 31, 2001. Last update on August 31, 2009.