It is often useful to describe processes as sets of discrete states with probabilistic transitions. If a transition to a new state is conditioned only on the present state the model is called a Markov chain. An n-th order Markov chain is a generalization to include the past n states in the transition probability. To create a Markov model from data one counts the occurrence of symbols and their transitions and stores the resulting transition probabilities in matrix form.
Given the model one may wish to compute the likelihood of certain statistical properties of the observed data. If the likelihood cannot be derived analytically it can be estimated from the distribution of the statistic given many random, independent realizations of the process. Often these realizations must be generated through simulation. Here we refer to this simulated data as surrogate data. The defining feature of surrogate data in this context is that it has exactly the same transition probabilities as the observed data, but is random in all other respects.
The code in this package produces surrogates sequences iteratively. At each step a symbol is chosen randomly, weighted by the number of remaining sequences given that choice. The number of possible symbol sequences is given by Whittle's formula. This method guarantees an output sequence that is chosen with uniform probability from the set of all possible surrogates.
For a detailed explanation of the algorithm see arXiv:1302.1500 [physics.data-an].
Final version at Physica D 269, 42-47 (2014).
EXAMPLE:
>> d = [0 1 1 0 1 0 1 1 1 0 0 1];
Generate another sequence with exactly the same transition probabilities as sequence d.
First compute the transition count matrix and a list of words along with the start and end word:
>> [f w u v] = trans_count(d,1);
>> full(f)
ans =
1 4
3 3
Input this data into whittle_surrogate.m:
>> seq = whittle_surrogate(f,w,u,v)
seq = 0 1 0 1 1 0 1 1 1 0 0 1
Check transition count:
>> [f2 w2 u2 v2] = trans_count(seq,1);
>> full(f2)
ans =
1 4
3 3
Subsequence calls continue to produce independent, uniformly distributed samples from the set of all possible surrogates:
>> seq = whittle_surrogate(f,w,u,v)
seq = 0 1 0 0 1 1 1 0 1 1 0 1
Each of these sequences has exactly the same transition count and transition probability. If only the transition count is important then use whittle_surrogate_uncond.m.
Notes: There are no restrictions on the size of the symbol alphabet or the order of the Markov model. The transition count matrix is stored in sparse format for size and speed reasons. To evaluate Whittle's formula the log of a determinate is computed which in some cases may not be accurate. The use of LU decomposition avoids this problem but is much slower. No error checking is done.
MarkovOrderTests.m conducts a series of exact Markov order tests using Whittle surrogates as described in the Physica D article.
Example:
>>r = [1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 1 0];
>>[pvalue scount] = MarkovOrderTests(r,10000,2);
>> pvalue
pvalue =
0.0001
0.4429
0.8150
These values are the zeroth, first, and second order Markov p-values. In this example zeroth order is extremely unlikely. |