MATLAB Examples

A simple example using the bvgraph class

This sample will demonstrate the bvgraph class. It will show a few features and limitations of the class for working with these large graph files in Matlab.

Contents

bvgraph background

The BV graph compression scheme and the *.graph/BVGraph file type are data structures and techniques created to work with extremely large web graph matrices. Many researchers observed that these matrices have a self-similar structure and exhibit considerable index locality. Both of these features are exploited by the BV graph compression scheme to store these large webgraph matrices at only a few bits per edge.

Paolo Boldi and Sebastiano Vigna wrote the initial BVGraph implementation in Java. While Matlab has nice features to work with Java, these features do not scale to working with extremely large vectors. Consequently, this Matlab package re-implements the procedures necessary to read a BVGraph file in C. Further, it provides routines to operate on these files and makes them look like a Matlab matrix. The goal is to be as efficient as possible and support computing the PageRank vector for extremely large graphs in Matlab.

Load a graph

Loading a graph with the bvgraph class is simple.

G = bvgraph('../data/wb-cs.stanford')
G = 
   bvgraph: 9914-by-9914

There is a slight caveat about loading. The file with most of the data for the bvgraph file is ../data/wb-cs.stanford.graph. However, you must load the file without the .graph extension. For example, you'll get a slightly helpful error if you accidentally forget this. (This may be changed in a future version.)

G2 = bvgraph('../data/wb-cs.stanford.graph');
Error using ==> bvgraph.bvgraph at 67
The file ../data/wb-cs.stanford.graph.graph does not seem to exist!

Loading offline or online

There are two ways of loading a bvgraph file: streaming mode (offline) or inmemory (online). The difference is that when a file is loaded in streaming mode, the class does not actually load the .graph file into memory, but iterates over the file on the disk for every operation. This is slower, but hardly uses any memory on the computer. For example,

G = bvgraph('../data/wb-cs.stanford');
Goff = bvgraph('../data/wb-cs.stanford',struct('load_type','offline'));

whos
  Name         Size              Bytes  Class      Attributes

  G         9914x9914            27263  bvgraph              
  Goff      9914x9914             2296  bvgraph              

Here we find that Goff takes only 3k whereas G takes 28k of memory. In this case, it seems silly to avoid loading 25k of memory to get the offline operations. However, for larger graphs, the difference is much more pronounced. Goff always takes about 3k, where G might takes hundreds of megabytes.

Some simple operations

The bvgraph class in Matlab supports the following simple operations

size_G = size(G)
nzcount = nnz(G)
d = diag(G);
s1 = sum(G);
s2 = sum(G,2);
size_G =

        9914        9914


nzcount =

       36854

Let's use an operation to get the size of the matrix.

n = size(G,1);

Converting to a native Matlab matrix

If the bvgraph file is small enough to fit into Matlab's memory as a Matlab sparse matrix, the sparse call will convert it.

A = sparse(G);

Transposes

The bvgraph class implements an implicit transform. That is, taking the transpose of a matrix does nothing to the underlying data and only changes the interpretation of the data in the Matlab interface.

Gt = G';
isequal(sum(Gt,1)',sum(G,2))
ans =

     1

Because only the interpretation of the data is transposed, the istrans function reveals this state

istrans(G)
istrans(G')
ans =

     0


ans =

     1

PageRank opertions

The main goal with implementing the BVGraph format inside of Matlab is to support computing PageRank on really large graphs. The key computation in PageRank is the sub-stochastic matrix vector multiplication

where x is the current iterate, D is the diagonal matrix of degrees, and

is the substochastic matrix corresponding to a random walk on the graph for adjacency matrix A. The bvgraph class supports a specialized method to compute this product without storing a vector for the diagonal of D. However, the substochastic_mult function is a little quirky.

% create the substochastic matrix corresponding to a random walk on A
[i j v] = find(A);
d = sum(A,2);
P = sparse(i, j, v(i)./d(i), size(A,1), size(A,2));
% now P = D^+ A;
x = ones(n,1);
y_G = substochastic_mult(G,x);
y_P = P*x;
norm(y_G-y_P,inf)
ans =

     0

When applied to G', the substochastic_mult function computes the transpose of the sub-stochastic product.

y_Gt = substochastic_mult(G',x);
y_Pt = P'*x;
norm(y_Gt-y_Pt,inf)
ans =

     0

Since one goal is PageRank, the bvgraph class has a specific PageRank method.

help bvpagerank

x = bvpagerank(G);
 --- help for bvgraph/bvpagerank.m ---

  BVGRAPH/BVPAGERANK A simple implementation of PageRank for a bvgraph object.
 
  x = bvpagerank(G) computes an approximate PageRank vector x such that 
    ||alpha*(P+d*v')*x + (1-alpha)*(e*v')*x - x||_1 < tol
  where alpha = 0.85, v = 1/n for all vertices, tol = 1e-8, and maxiter =
  1000.
 
  x = bvpagerank(G,alpha,v,tol,maxiter) specifies all of the options of the
  PageRank system.  The type of PageRank system is a strongly personalized
  PageRank system.  To use the default value of an option, specify [].  
 
  By default, the pagerank function computes the PageRank operation for the
  matrix as stored _on disk_.  For example, bvpagerank(G) and bvpagerank(G')
  will return the same vector.  We use this behavior because computing
  bvpagerank(G') takes additional memory with respect to pagerank(G).  Since
  the goal of the bvgraph library is to support large-scale PageRank, we
  chose the memory-minimizing interpretation.  This implementation of
  PageRank will compute the vector for G', however, you must set a final
  parameter to 0 and call x = pagerank(G',[],[],[],[],0); 
 
  The algorithm used is the Power method for the PageRank algorithm.
 
  Example:
    G = bvgraph('data/wb-cs.stanford');
    x = bvpagerank(G);
    x = bvpagerank(G,0.9,[],1e-5,10);
    y = bvpagerank(G',[],[],[],[],0); % turn off native optimizations

iter    1            0.588174  0.01
iter    2            0.286471  0.01
iter    3            0.153072  0.01
iter    4           0.0889456  0.01
iter    5           0.0542172  0.01
iter    6           0.0363644  0.01
iter    7           0.0252051  0.01
iter    8           0.0184926  0.01
iter    9           0.0139464  0.01
iter   10           0.0107167  0.01
iter   11          0.00839594  0.01
iter   12          0.00659468  0.01
iter   13          0.00522328  0.01
iter   14          0.00414234  0.01
iter   15          0.00330867  0.01
iter   16          0.00262997  0.01
iter   17          0.00211307  0.01
iter   18          0.00168965  0.01
iter   19          0.00135813  0.01
iter   20           0.0010915  0.01
iter   21         0.000880287  0.01
iter   22         0.000706975  0.01
iter   23         0.000572634  0.01
iter   24         0.000461322  0.01
iter   25         0.000373298  0.01
iter   26         0.000302176  0.01
iter   27         0.000245035  0.01
iter   28         0.000198232  0.01
iter   29         0.000161401  0.01
iter   30         0.000130781  0.01
iter   31         0.000106367  0.01
iter   32        8.65371e-005  0.01
iter   33        7.04309e-005  0.01
iter   34        5.72682e-005  0.01
iter   35        4.67709e-005  0.01
iter   36        3.80577e-005  0.01
iter   37        3.10717e-005  0.01
iter   38        2.53809e-005  0.01
iter   39        2.07268e-005  0.01
iter   40        1.69279e-005  0.01
iter   41        1.38672e-005  0.01
iter   42        1.13277e-005  0.01
iter   43         9.2783e-006  0.01
iter   44        7.60369e-006  0.01
iter   45        6.22754e-006  0.01
iter   46        5.10692e-006  0.01
iter   47        4.19651e-006  0.01
iter   48        3.44086e-006  0.01
iter   49        2.82904e-006  0.01
iter   50        2.32741e-006  0.01
iter   51        1.91483e-006  0.01
iter   52        1.57779e-006  0.01
iter   53        1.30177e-006  0.01
iter   54        1.07226e-006  0.01
iter   55        8.85194e-007  0.01
iter   56        7.30781e-007  0.01
iter   57        6.03263e-007  0.01
iter   58        4.98316e-007  0.01
iter   59        4.12097e-007  0.01
iter   60        3.40546e-007  0.01
iter   61        2.81734e-007  0.01
iter   62        2.33177e-007  0.01
iter   63        1.93033e-007  0.01
iter   64        1.59868e-007  0.01
iter   65        1.32535e-007  0.01
iter   66        1.09897e-007  0.01
iter   67        9.11464e-008  0.01
iter   68        7.56805e-008  0.01
iter   69        6.28511e-008  0.01
iter   70        5.22238e-008  0.01
iter   71        4.34037e-008  0.01
iter   72        3.60912e-008  0.01
iter   73        3.00383e-008  0.01
iter   74        2.49936e-008  0.01
iter   75        2.08175e-008  0.01
iter   76        1.73507e-008  0.01
iter   77         1.4479e-008  0.01
iter   78        1.20793e-008  0.01
iter   79        1.00859e-008  0.01

As noted in the documentation, this method works on the data as on disk and removes any transpose operations performed on the matrix.

y = bvpagerank(G');

norm(x-y,inf)
iter    1            0.588174  0.01
iter    2            0.286471  0.01
iter    3            0.153072  0.01
iter    4           0.0889456  0.01
iter    5           0.0542172  0.01
iter    6           0.0363644  0.01
iter    7           0.0252051  0.01
iter    8           0.0184926  0.01
iter    9           0.0139464  0.01
iter   10           0.0107167  0.01
iter   11          0.00839594  0.01
iter   12          0.00659468  0.01
iter   13          0.00522328  0.01
iter   14          0.00414234  0.01
iter   15          0.00330867  0.01
iter   16          0.00262997  0.01
iter   17          0.00211307  0.01
iter   18          0.00168965  0.01
iter   19          0.00135813  0.01
iter   20           0.0010915  0.01
iter   21         0.000880287  0.01
iter   22         0.000706975  0.01
iter   23         0.000572634  0.01
iter   24         0.000461322  0.01
iter   25         0.000373298  0.01
iter   26         0.000302176  0.01
iter   27         0.000245035  0.01
iter   28         0.000198232  0.01
iter   29         0.000161401  0.01
iter   30         0.000130781  0.01
iter   31         0.000106367  0.01
iter   32        8.65371e-005  0.01
iter   33        7.04309e-005  0.01
iter   34        5.72682e-005  0.01
iter   35        4.67709e-005  0.01
iter   36        3.80577e-005  0.01
iter   37        3.10717e-005  0.01
iter   38        2.53809e-005  0.01
iter   39        2.07268e-005  0.01
iter   40        1.69279e-005  0.01
iter   41        1.38672e-005  0.01
iter   42        1.13277e-005  0.01
iter   43         9.2783e-006  0.01
iter   44        7.60369e-006  0.01
iter   45        6.22754e-006  0.01
iter   46        5.10692e-006  0.01
iter   47        4.19651e-006  0.01
iter   48        3.44086e-006  0.01
iter   49        2.82904e-006  0.01
iter   50        2.32741e-006  0.01
iter   51        1.91483e-006  0.01
iter   52        1.57779e-006  0.01
iter   53        1.30177e-006  0.01
iter   54        1.07226e-006  0.01
iter   55        8.85194e-007  0.01
iter   56        7.30781e-007  0.01
iter   57        6.03263e-007  0.01
iter   58        4.98316e-007  0.01
iter   59        4.12097e-007  0.01
iter   60        3.40546e-007  0.01
iter   61        2.81734e-007  0.01
iter   62        2.33177e-007  0.01
iter   63        1.93033e-007  0.01
iter   64        1.59868e-007  0.01
iter   65        1.32535e-007  0.01
iter   66        1.09897e-007  0.01
iter   67        9.11464e-008  0.01
iter   68        7.56805e-008  0.01
iter   69        6.28511e-008  0.01
iter   70        5.22238e-008  0.01
iter   71        4.34037e-008  0.01
iter   72        3.60912e-008  0.01
iter   73        3.00383e-008  0.01
iter   74        2.49936e-008  0.01
iter   75        2.08175e-008  0.01
iter   76        1.73507e-008  0.01
iter   77         1.4479e-008  0.01
iter   78        1.20793e-008  0.01
iter   79        1.00859e-008  0.01

ans =

     0

To call PageRank on the tranpose (which takes additional memory, see

What isn't implemented (yet)

Currently, we do not support indexed accesses to the bvgraph type, so the following feature will fail

G(1,2)
Error using ==> evalin
Index exceeds matrix dimensions.

Additionally, the multiplication operation is only supported for a double type.

x = true(size(G,1),1);
G*x
the vector x is not a double

Finally, there is no way to output a Matlab matrix as a bvgraph file.