# Demo of gaimc - 'Graph Algorithms In Matlab Code'

Matlab includes great algorithms to work with sparse matrices but does provide a reasonable set of algorithms to work with sparse matrices as graph data structures. My other project -- MatlabBGL -- provides a high-performance solution to this problem by directly interfacing the Matlab sparse matrix data structure with the Boost Graph Library. That library, however, suffers from enormous complication because it must be compiled for each platform. The Boost Graph Library heavily uses advanced C++ features that impair easy portability between platforms. In contrast, the gaimc library is implemented in pure Matlab code, making it completely portable.

The cost of the portability for this library is a 2-4x slowdown in the runtime of the algorithms as well as significantly fewer algorithms to choose from.

## Contents

## Sparse matrices as graphs

To store the connectivity structure of the graph, gaimc uses the adjacency matrix of a graph.

A graph is represented by a set of vertices and a set of edges between the vertices. Often, we write to denote the graph, the set of vertices, and the set of edges, respectively. In gaimc, like in my other package MatlabBGL, we represent graphs with their adjacency matrices. This representation is handy in Matlab because Matlab is rather efficient at working with the large sparse matrices that typically arise as adjacency matrices.

To convert from to an adjacency matrix, we identify each vertex with a row of the matrix via a bijective map. The adjacency matrix is then a matrix called A. The entry A(i,j) = 1 for any edge between in and 0 otherwise. Let's look at an example.

load_gaimc_graph('bfs_example'); graph_draw(A,xy,'labels',labels) full(A) labels'

ans = 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 ans = 'r' 's' 't' 'u' 'v' 'w' 'x' 'y'

This output means that vertex 'r' is row 1, vertex 's' is row 2 and because A(1,2) = 1, then there is an edge between them, just like in the picture.

One funny property is that A(2,1) = 1 too! So we actually have to store each edge twice in the adjacency matrix. This might seem wasteful, but its hard to avoid as I've learned while working on graph algorithms. So don't worry about it! It also makes the generalization to directed graphs (below) easy.

For more information about the adjacency matrix representation of a graph, see a standard book on graph algorithms.

## Weighted and directed graphs

Our previous case handled the situation for undirected graphs only. To encode weighted and directed graphs, we use weighted and non-symmetric adjacency matrices.

For a weighted matrix, A(i,j) = distance between i and j for most of the algorithms in gaimc. But A(i,j) = 0 means there is no edge, and so sometimes things can get a little tricky to get what you want.

For a directed graph, just set A(i,j) ~= A(j,i). The adjacency matrix won't be symmetric, but that's what you want!

To understand more, explore the examples or read up on adjacency matrices in graph theory books.

## Loading helper

To make loading our sample graphs easy, gaimc defines it's own function to load graphs.

load_gaimc_graph('dfs_example'); % loads one of our example graphs whos

Name Size Bytes Class Attributes A 9x9 416 double sparse Acc 5x5 176 double sparse As 1x1 40792464 struct At 50000x50000 20395704 double sparse D 5x5 200 double D2 5x5 200 double T 456x456 18216 double sparse X 2730x1 21840 double Y 2730x1 21840 double ai 1249731x1 9997848 double aj 71959x1 575672 double ans 1x8 912 cell ati 1249731x1 9997848 double ax 1x1 8 double cc 9x1 72 double cc1 50000x1 400000 double cc2 50000x1 400000 double cc3 50000x1 400000 double cc4 50000x1 400000 double ccfs 2642x1 21136 double ci 1249731x1 9997848 double cn 2642x1 21136 double comp_results 6x4 192 double cp 50001x1 400008 double cs1 1x1 8 double cs2 1x1 8 double cs3 1x1 8 double cs4 1x1 8 double cwd 1x28 56 char d 456x1 3648 double d1 1024x1 8192 double d2 1024x1 8192 double d3 1024x1 8192 double d4 1024x1 8192 double de 71959x1 575672 double dest 1x1 8 double dirtail 1x10 20 char dt 8x1 64 double f 9x1 9 logical faceColors 1x1 1904 struct gi 1x1 8 double graphdir 1x10 20 char graphs 1x5 642 cell i 1x1 8 double ind 1x1 8 double labels 9x1 1026 cell lat 9865x1 78920 double lax 1x1 8 double long 9865x1 78920 double mat_fast 1x1 8 double mat_std 1x1 8 double matlabbgldir 1x20 40 char mex_fast 1x1 8 double mex_std 1x1 8 double n 1x1 8 double nrep 1x1 8 double ntests 1x1 8 double path 1x3 24 double pred 1x456 3648 double rep 1x1 8 double results 1x3 2140 struct ri 1249731x1 9997848 double rp 50001x1 400008 double rst 1x1 8 double rt 71959x1 575672 double si 71959x1 575672 double source 1x83 166 char start 1x1 8 double states 49x1 190442 struct szi 1x1 8 double szs 1x6 48 double te 71959x1 575672 double ti 1x1 8 double u 1x1 8 double v 1x1 8 double val 1x1 8 double xy 9x2 144 double

This helps make our examples work regardless of where the current directory lies.

## Search algorithms

The two standard graph search algorithms are depth first search and breadth first search. This library implements both.

Load the example matrix from the Boost Graph Library

load_gaimc_graph('dfs_example'); figure(1); graph_draw(A,xy,'labels',labels);

Run a depth first search. The output records the distance to the other vertices, except where the vertices are not reachable starting from the first node A.

d=dfs(A,1)

d = 0 1 3 3 2 4 -1 -1 -1

From this example, we see that vertices a-f are reachable from vertex a, but that verice g-i are not reachable. Given the of the edges, this makes sense.

Let's look at breadth first search too, using a different example.

load_gaimc_graph('bfs_example'); figure(1); clf; graph_draw(A,xy,'labels',labels);

The breadth first search algorithm records the distance from the starting vertex to each vertex it visits in breadth first order. This means it visits all the vertices in order of their distance from the starting vertex. The d output records the distance, and the dt output records the step of the algorithm when the breadth first search saw the node.

[d dt] = bfs(A,2); % draw the graph where the label is the "discovery time" of the vertex. figure(1); clf; graph_draw(A,xy,'labels',num2str(dt));

Notice how the algorithm visits all vertices one edge away from the start vertex (0) before visiting those two edges away.

## Shortest paths

In the previous two examples, the distance between vertices was equivalent to the number of edges. Some graphs, however,
have specific weights, such as the graph of flights between airports. We can use this information to build information about
the *shortest path* between two nodes in a network.

% Find the minimum travel time between Los Angeles (LAX) and % Rochester Minnesota (RST). load_gaimc_graph('airports') A = -A; % fix funny encoding of airport data lax=247; rst=355; [d pred] = dijkstra(A,lax); % find all the shorest paths from Los Angeles. fprintf('Minimum time: %g\n',d(rst)); % Print the path fprintf('Path:\n'); path =[]; u = rst; while (u ~= lax) path=[u path]; u=pred(u); end fprintf('%s',labels{lax}); for i=path; fprintf(' --> %s', labels{i}); end, fprintf('\n');

Minimum time: 244 Path: Los Angeles, CA --> Minneapolis/St Paul, MN --> Rochester, MN

## Minimum spanning trees

A minimum spanning tree is a set of edges from a graph that ...

This demo requires the mapping toolbox for maximum effect, but we'll do okay without it.

Our data comes from a graph Brendan Frey prepared for his affinity propagation clustering tool. For 456 cities in the US, we have the mean travel time between airports in those cities, along with their latitude and longitude.

```
load_gaimc_graph('airports')
```

For some reason, the data is stored with the negative travel time between cities. (I believe this is so that closer cities have larger edges between them.) But for a minimum spanning tree, we want the actual travel time between cities.

A = -A;

Now, we just call MST and look at the result. T = mst_prim(A); This command means we can't run the demo, so it's commented out.

Oops, travel time isn't symmetric! Let's just pick the longest possible time.

```
A = max(A,A');
T = mst_prim(A);
sum(sum(T))/2 % total travel time in tree
```

ans = (1,1) 24550

Well, the total weight isn't that helpful, let's *look* at the data instead.

clf; gplot(T,xy);

Hey! That looks like the US! You can see regional airports and get some sense of the overall connectivity.

## Connected components

The connected components of a network determine which parts of the network are reachable from other parts. One of your first questions about any network should generally be: is it connected?

There are two types of connected components: components and strongly connected components. gaimc only implements an algorithm for the latter case, but that's okay! It turns out it computes exactly the right thing for connected components as well. The difference only occurs when the graph is undirected vs. directed.

```
load_gaimc_graph('dfs_example')
graph_draw(A,xy)
```

This picture shows there are 3 strongly connected components and 2 connected components

```
% get the number of strongly connected components
max(scomponents(A))
```

ans = 3

% get the number of connected components max(scomponents(A|A')) % we make the graph symmetric first by "or"ing each entry

ans = 2

Let's look at the vertices in the strongly connected components

cc = scomponents(A)

cc = 2 2 2 1 2 2 3 3 3

The output tells us that vertices 1,2,3,5,6 are in one strong component, vertex 4 is it's own strong component, and vertices 7,8,9 are in another one. Remember that a strong component is all the vertices mutually reachable from a given vertex. If you start at vertex 4, you can't get anywhere else! That's why it is in a different component than vertices 1,2,3,5,6.

We also have a largest_component function that makes it easy to just get the largest connected component.

clf; [Acc,f] = largest_component(A); graph_draw(Acc,xy(f,:))

The filter variable f, tells us which vertices in the original graph made it into the largest strong component. We can just apply that filter to the coordinates xy and reuse them for drawing the graph!

## Statistics

Graph statistics are just measures that indicate a property of the graph at every vertex, or at every edges. Arguably the simplest graph statistic would be the average vertex degree. Because such statistics are easy to compute with the adjaceny matrix in Matlab, they do not have special functions in gaimc.

Load a road network to use for statistical computations

```
load_gaimc_graph('minnesota');
gplot(A,xy);
```

Average vertex degree

d = sum(A,2); mean(d)

ans = (1,1) 2.5034

So the average number of roads at any intersection is 2.5. My guess is that many roads have artificial intersections in the graph structure that do not correspond to real intersections. Try validating that hypothesis using the library!

Average clustering coefficients

ccfs = clustercoeffs(A); mean(ccfs) % The average clustering coefficient is a measure of the edge density % throughout the graph. A small value indicates that the network has few % edges and they are well distributed throughout the graph.

ans = 0.0160

Average core numbers

cn = corenums(A); mean(cn)

ans = 1.9463

## Efficient repetition

Every time a gaimc function runs, it converts the adjacency matrix into a set of compressed sparse row arrays. These arrays yield efficient access to the edges of the graph starting at a particular vertex. For many function calls on the same graph, this conversion process slows the algorithms. Hence, gaimc also accepts pre-converted input, in which case it skips it's conversion.

Let's demonstrate how this works by calling Dijkstra's algorithm to compute the shortest paths between all vertices in the graph. The Floyd Warshall algorithm computes these same quantities more efficiently, but that would just be one more algorithm to implement and maintain.

Load and convert the graph.

load_gaimc_graph('all_shortest_paths_example'); A = spfun(@(x) x-min(min(A))+1,A); % remove the negative edges As = convert_sparse(A);

Now, we'll run Dijkstra's algorithm for every vertex and save the result On my 2GHz laptop, this takes 0.000485 seconds.

n = size(A,1); D = zeros(n,n); tic for i=1:n D(i,:) = dijkstra(As,i); end toc

Elapsed time is 0.000304 seconds.

Let's try it without the conversion to see if we can notice the difference in speed. On my 2GHz laptop, this takes 0.001392 seconds.

D2 = zeros(n,n); tic for i=1:n D2(i,:) = dijkstra(A,i); end toc

Elapsed time is 0.000541 seconds.

And just to check, let's make sure the output is the same.

isequal(D,D2)

ans = 1