## Documentation Center |

This example shows how the Parallel Computing Toolbox™ can be used to perform pairwise sequence alignment (PWSA). PWSA has multiple applications in bioinformatics, such as multiple sequence analysis and phylogenetic tree reconstruction. We look at a PWSA that uses a global dynamic programming algorithm to align each pair of sequences, and we then calculate the pairwise distances using the Tajima-Nei metric. This gives us a matrix of distances between sequences that we use for inferring the phylogeny of HIV and SIV viruses. PWSA is a computationally expensive task with complexity O(L*L*N*N), where L is the average length of the sequences and N is the number of sequences [Durbin, et al. Cambridge University Press, 1998].

For details about the computations, view the code for pctdemo_setup_hivview the code for pctdemo_setup_hiv.

Prerequisites:

Related examples:

On this page… |
---|

Analyze the Sequential Problem Load the Example Settings and the Data |

**Analyze the Sequential Problem**

First, we look at how the computations in the sequential example fit into the
model introduced in the Dividing MATLAB
Computations into Tasks example. The call to `seqpdist` is the most
computationally intensive part of the sequential example, and it basically
performs the following parameter sweep:

for all pairs (s1, s2) formed from the elements of seq distance(s1, s2) = seqpdist(s1, s2); end

The distance is, of course, symmetric, i.e., `seqpdist(s1, s2) = seqpdist(s2,
s1)`. Because `seqpdist` re-normalizes the distances based on the entire input
sequence `seq`, we have to calculate that renormalization factor ourselves in
this example, and we also need to identify the pairs between which to measure
the distance.

Since we calculate a rather large number of distances, we have each task
calculate several distances. This requires us to write a simple wrapper task
function that invokes `seqpdist`.

**Load the Example Settings and the Data**

The example uses the default profile when identifying the cluster to use. The profiles documentationprofiles documentation explains how to create new profiles and how to change the default profile. See Customizing the Settings for the Examples in the Parallel Computing Toolbox for instructions on how to change the example difficulty level or the number of tasks created.

[difficulty, myCluster, numTasks] = pctdemo_helper_getDefaults();

The `pctdemo_setup_hiv` function retrieves the protein sequence information
from the NCBI GenBank® database, and the `difficulty` parameter controls how
many protein sequences we retrieve.
You can
view the code for pctdemo_setup_hivview the code for pctdemo_setup_hiv
for full details.

[fig, pol, description] = pctdemo_setup_hiv(difficulty); numViruses = length(pol); startTime = clock;

Downloading data from the NCBI GenBank database Finished downloading

**Divide the Work into Smaller Tasks**

We use the Tajima-Nei method to measure the distance between
the POL coding regions.
Tajima-Nei distances are based on the frequency of nucleotides in
the whole group of sequences. When you call `seqpdist`
for only two sequences, the function would compute the
nucleotide count based on only the two sequences and not on the
whole group. Consequently, we calculate the frequency based on
the whole group and pass that information to `seqpdist`.

bc = basecount(strcat(pol.Sequence)); bf = [bc.A bc.C bc.G bc.T]/(bc.A + bc.C + bc.G + bc.T);

Let's find the parameter space that we want to traverse. The `seqpdist`
documentation gives us useful information in this regard:

% D = SEQPDIST(SEQS) returns a vector D containing biological distances % between each pair of sequences stored in the M elements of the cell % SEQS. D is an (M*(M-1)/2)-by-1 vector, corresponding to the M*(M-1)/2 % pairs of sequences in SEQS. The output D is arranged in the order of % ((2,1),(3,1),..., (M,1),(3,2),...(M,2),.....(M,M-1), i.e., the lower % left triangle of the full M-by-M distance matrix.

Based on this information, we create two vectors, `Aseq` and `Bseq`, that
contain the sequence pairs between which to calculate the distances.

Aseq = struct; Bseq = struct; ind = 1; for j = 1:numViruses for i = j+1:numViruses Aseq(ind).Sequence = pol(j).Sequence; Bseq(ind).Sequence = pol(i).Sequence; ind = ind + 1; end end

We want to divide the vectors `Asplit` and `Bsplit` between the tasks.

[Asplit, numTasks] = pctdemo_helper_split_vector(Aseq, numTasks); Bsplit = pctdemo_helper_split_vector(Bseq, numTasks); fprintf(['This example will submit a job with %d task(s) ' ... 'to the cluster.\n'], numTasks);

This example will submit a job with 4 task(s) to the cluster.

We create a job and the tasks in the job. Task `i` calculates
the pairwise distances between the elements in `Asplit{i}` and `Bsplit{i}`.
You can
view the code for pctdemo_task_hivview the code for pctdemo_task_hiv
for full details.

job = createJob(myCluster); for i = 1:numTasks createTask(job, @pctdemo_task_hiv, 1, {Asplit{i}, Bsplit{i}, bf}); end

We can now submit the job and wait for it to finish.

submit(job); wait(job);

When we have obtained and verified all the results, we allow the cluster
to free its resources. `fetchOutputs` will throw an error if the tasks
did not complete successfully, in which case we need to delete the job
before throwing the error.

try jobResults = fetchOutputs(job); catch err delete(job); rethrow(err); end pold = cell2mat(jobResults(:, 1))';

We have now finished all the verifications, so we can delete the job.

delete(job);

The time used for the distributed computations should be compared against the time it takes to perform the same set of calculations in the Sequential Analysis of the Origin of the Human Immunodeficiency Virus example. The elapsed time varies with the underlying hardware and network infrastructure.

```
elapsedTime = etime(clock, startTime);
fprintf('Elapsed time is %2.1f seconds\n', elapsedTime);
```

Elapsed time is 5.7 seconds

Now that we have all the distances, we can construct the phylogenetic trees for the POL proteins. You can view the code for pctdemo_plot_hivview the code for pctdemo_plot_hiv for full details.

pctdemo_plot_hiv(fig, pold, description);

Was this topic helpful?