SC1

Statistical Computing 1

Sparse Matrices


Matrix package

The R Matrix package provides additional functions for both dense and sparse matrices that extend the basic matrix data type. In the Matrix package, dense matrices are stored as dgeMatrix objects. One useful function it adds for dense matrices is rankMatrix, which returns the rank of the input matrix, up to a certain tolerance if desired. Another useful function is rcond, which gives the condition number of a square matrix, defined to be the product of the norm of the matrix and the norm of its inverse.

library(Matrix)
A <- Matrix(c(1,2,1,2), nrow=2, ncol=2)
B <- Matrix(c(1,2,1,2) + rnorm(4)/1e8, nrow=2, ncol=2)
C <- Matrix(rnorm(4), nrow=2, ncol=2)
c(rankMatrix(A), rankMatrix(B), rankMatrix(B, tol=1e-7))
## [1] 1 2 1
c(rcond(A), rcond(B), rcond(C), rcond(C/1e5))
## [1] 0.000000e+00 2.882359e-09 6.976583e-01 6.976583e-01

A sparse matrix is one where most entries will be 0. It is inefficient to store all these 0’s in memory.

set.seed(7)
nrows <- 1000
ncols <- 1000
vals <- sample(x=c(0, 1, 2), prob=c(0.98, 0.01, 0.01), size=nrows*ncols, replace=TRUE)
m1 <- matrix(vals, nrow=nrows, ncol=ncols)
m2 <- Matrix(vals, nrow=nrows, ncol=ncols, sparse=TRUE)
m1[1:2, 1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    0    0    0    0    0    0    0    0     0
## [2,]    0    0    0    0    0    0    0    0    0     0
m2[1:2, 1:10]
## 2 x 10 sparse Matrix of class "dgCMatrix"
##                         
## [1,] 2 . . . . . . . . .
## [2,] . . . . . . . . . .
c(object.size(m1), object.size(m2))
## [1] 8000216  245856

By default, a sparse matrix in R is stored as a dgCMatrix, “d” for digit, “g” for general (i.e. not triangular or symmetric), “C” for column (not row or triplet). If one converts a sparse matrix to a dense dgeMatrix, then one loses the advantage in terms of memory storage. A dgCMatrix is stored using compressed sparse column (CSC) format. An alternative is the compressed sparse row (CSR) format, dgRMatrix, or triplets dgTMatrix. One can coerce a dgCMatrix into a dgTMatrix (and from dgTMatrix to dgCMatrix), but not a dgRMatrix from either dgCMatrix or dgTMatrix .

str(m2, max.level=1)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
str(m2)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:20029] 0 18 22 24 158 360 362 513 524 550 ...
##   ..@ p       : int [1:1001] 0 23 46 68 90 110 127 142 165 188 ...
##   ..@ Dim     : int [1:2] 1000 1000
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:20029] 2 2 1 2 1 2 1 1 1 1 ...
##   ..@ factors : list()
which(m2[,1]>0)
##  [1]   1  19  23  25 159 361 363 514 525 551 561 610 645 652 719 737 754 762 763
## [20] 816 878 910 963
object.size(as(m2, 'dgeMatrix'))
## 'as(<dgCMatrix>, "dgeMatrix")' is deprecated.
## Use 'as(., "unpackedMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").
## 8001176 bytes
m3 <- as(m1, 'dgRMatrix')
object.size(m3)
## 245856 bytes
m3[1:2, 1:10]
## 2 x 10 sparse Matrix of class "dgRMatrix"
##                         
## [1,] 2 . . . . . . . . .
## [2,] . . . . . . . . . .

Note that even though m3 is a dgRMatrix, m3[1:2, 1:10] returns a dgTMatrix.

Basic Operations of sparse matrices

A <- as(matrix(c(1,0,0,0,2,0), nrow=3, ncol=2), 'dgCMatrix')
A / 10
## 3 x 2 sparse Matrix of class "dgCMatrix"
##             
## [1,] 0.1 .  
## [2,] .   0.2
## [3,] .   .
A + 10 # + or - operation turns a dgCMatrix into a dgeMatrix
## 3 x 2 Matrix of class "dgeMatrix"
##      [,1] [,2]
## [1,]   11   10
## [2,]   10   12
## [3,]   10   10
A %*% c(1, 0) # matrix multiplication by a dense vector outputs a dgeMatrix
## 3 x 1 Matrix of class "dgeMatrix"
##      [,1]
## [1,]    1
## [2,]    0
## [3,]    0
A %*% Matrix(c(1, 0), nrow=2, ncol=1, sparse=TRUE)
## 3 x 1 sparse Matrix of class "dgCMatrix"
##       
## [1,] 1
## [2,] .
## [3,] .
A %*% t(A)
## 3 x 3 sparse Matrix of class "dgCMatrix"
##           
## [1,] 1 . .
## [2,] . 4 .
## [3,] . . .
rbind(A, A)
## 6 x 2 sparse Matrix of class "dgCMatrix"
##         
## [1,] 1 .
## [2,] . 2
## [3,] . .
## [4,] 1 .
## [5,] . 2
## [6,] . .

Obtaining the dependency graph of R packages

We use the miniCRAN package to build an R package dependency graph. A graph \(G=(V,E)\), where \(V\) denotes the set of vertices, and \(E\) denotes the set of edges, can also be represented by an adjacency matrix \(M\). The matrix \(M\) is square with each row and column corresponding to a vertex. If there is an edge \((i,j)\), then \(M_{ij}=1\); all other entries in \(M\) are 0. Thus one can often expect \(M\) to be quite sparse. The following piece of code reads the package dependency information and constructs the corresponding adjacency matrix.

library(miniCRAN) # for obtaining dependency relations of R packages
load('matrix_packages.RData')
pkg <- 'igraph'
pkgDep(pkg, suggests=FALSE, enhances=FALSE, includeBasePkgs=TRUE)
##  [1] "igraph"    "methods"   "cli"       "graphics"  "grDevices" "lifecycle"
##  [7] "magrittr"  "Matrix"    "pkgconfig" "rlang"     "stats"     "utils"    
## [13] "vctrs"     "cpp11"     "glue"      "grid"      "lattice"
plot(g1, vertex.size=20)
## This graph was created by an old(er) igraph version.
##   Call upgrade_graph() on it to use with the current igraph version
##   For now we convert it on the fly...

m_pkg1
## 11 x 11 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 11 column names 'igraph', 'methods', 'graphics' ... ]]
##                                
## igraph    . . . . . . . . . . .
## methods   1 . . . . . . . . 1 .
## graphics  1 . . . . . 1 . . 1 .
## grid      . . . . . . 1 . . 1 .
## stats     1 . . . . . 1 . . 1 .
## utils     1 . . . . . 1 . . 1 1
## lattice   . . . . . . . . . 1 .
## grDevices 1 . . . . . 1 . . . .
## magrittr  1 . . . . . . . . . .
## Matrix    1 . . . . . . . . . .
## pkgconfig 1 . . . . . . . . . .

The file matrix_packages.RData is generated by the following piece of code.

library(Matrix)
library(miniCRAN)
library(igraph) # for turning a graph into an adjacency matrix
#pkgs_db = available.packages('https://mran.revolutionanalytics.com/snapshot/2019-08-19/src/contrib',
pkgs_db = available.packages('https://cran.microsoft.com/snapshot/2020-11-17/src/contrib/', type='source', filters=NULL)
g1 <- makeDepGraph('igraph', includeBasePkgs=TRUE, suggests=FALSE, enhances=FALSE)
# here we obtain graphs of "enhancement", e.g. Rcpp is enhanced by many packages but enhances none
g2 <- makeDepGraph(pkgs_db[, 'Package'], availPkgs=pkgs_db, includeBasePkgs=FALSE, suggests=FALSE, enhances=TRUE)
m_pkg1 <- get.adjacency(g1, type="both", attr=NULL, names=TRUE, sparse=TRUE)
# take transpose since our convention is storage by row
m_pkg2 <- t(get.adjacency(g2, type="both", attr=NULL, names=TRUE, sparse=TRUE))
save(g1, g2, m_pkg1, m_pkg2, file='matrix_packages.RData')

The Pagerank algorithm

The pagerank algorithm is an iterative algorithm that takes a (directed) graph as input and outputs a probability distribution on the set of vertices. The probability distribution is essentially the stationary distribution of the random walk on the graph, hence the higher the weight of a vertex, the longer a random walk spends at this vertex, hence the more “important” it is.

Let \(PR^{(k)}(v_i)\) be the \(k\)th iterate for the calculation of pagerank of the \(i\)th vertex, then the pagerank algorithm performs the following computation: \[PR^{(k+1)}(v_i) = \frac{1-d}{N} + d \sum_{j: M_{ij}>0} \frac{PR^{(k)}(v_j)}{L(v_j)} \]

where \(L(v_j)\) is the number of outbound links of the \(j\)th vertex. Here \(d\) is a damping factor that has the effect of the random walk uniformly picking a vertex to visit with probability \(1-d\) at each iteration. In matrix format, we can write \(PR(v_i)\) as a horizontal vector, then \[ PR^{(k+1)} = \left[\frac{1-d}{N}, \ldots, \frac{1-d}{N} \right] + d \ PR^{(k)} \ \left[ \begin{array}{llll} M_{11}/L(v_1) & M_{12}/L(v_1) & \ldots & M_{1N}/L(v_1) \\ M_{21}/L(v_2) & \ddots & & \vdots \\ \vdots & & M_{ij}/L(v_i) & \\ M_{N1}/L(v_N) & \cdots & & M_{NN}/L(v_N) \end{array} \right]\]

compute_pagerank <- function(M) {
  d <- 0.85 # damping factor
  name_pkgs <- M@Dimnames[[1]]
  N <- dim(M)[1]
  M_rowsum <- rowSums(M)
  for (i in which(M_rowsum==0)) M[i,i] = 1
  M_rowsum <- rowSums(M)
  # This operation divides each column of M element-wise by M_rowsum
  A <- M / M_rowsum # A is a stochastic matrix
  PR <- rep(1/N, N)
  for (i in 1:10) {
    PR <- (1-d)/N + d * (PR %*% A)
    # print out 7 most important packages and their corresponding weights
    i7 <- order(PR, decreasing=TRUE)[1:7]
    print(name_pkgs[i7])
    print(PR[i7])
  }
}

compute_pagerank(m_pkg1)
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.47083333 0.23901515 0.16174242 0.03295455 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.67882576 0.17333333 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph"    "Matrix"    "lattice"   "pkgconfig" "methods"   "graphics" 
## [7] "grid"     
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
compute_pagerank(m_pkg2)
## [1] "Rcpp"     "MASS"     "ggplot2"  "Matrix"   "dplyr"    "survival" "mvtnorm" 
## [1] 0.04544432 0.02491786 0.02007975 0.01328318 0.01251681 0.01162591 0.01088882
## [1] "Rcpp"     "MASS"     "Matrix"   "lattice"  "magrittr" "jsonlite" "mvtnorm" 
## [1] 0.08255496 0.03512211 0.02273513 0.02220904 0.01740401 0.01386460 0.01325651
## [1] "Rcpp"     "MASS"     "lattice"  "magrittr" "chron"    "jsonlite" "mvtnorm" 
## [1] 0.09199156 0.03805808 0.02260036 0.01925890 0.01906416 0.01447589 0.01371033
## [1] "Rcpp"     "MASS"     "magrittr" "chron"    "zoo"      "lattice"  "jsonlite"
## [1] 0.09352076 0.03858840 0.01983434 0.01938121 0.01805550 0.01619707 0.01455902
## [1] "Rcpp"     "MASS"     "lattice"  "magrittr" "zoo"      "jsonlite" "mvtnorm" 
## [1] 0.09373549 0.03870802 0.02267705 0.01991354 0.01807738 0.01456921 0.01417933
## [1] "Rcpp"     "MASS"     "lattice"  "magrittr" "chron"    "jsonlite" "mvtnorm" 
## [1] 0.09376950 0.03871963 0.02232004 0.01992365 0.01944531 0.01456987 0.01425596
## [1] "Rcpp"     "MASS"     "magrittr" "chron"    "lattice"  "zoo"      "jsonlite"
## [1] 0.09377607 0.03871999 0.01992479 0.01914185 0.01814680 0.01808562 0.01456989
## [1] "Rcpp"     "MASS"     "lattice"  "magrittr" "zoo"      "chron"    "jsonlite"
## [1] 0.09377831 0.03872004 0.02204347 0.01992488 0.01782642 0.01559460 0.01456989
## [1] "Rcpp"     "MASS"     "lattice"  "magrittr" "chron"    "zoo"      "jsonlite"
## [1] 0.09377940 0.03872004 0.02178517 0.01992489 0.01890677 0.01481098 0.01456989
## [1] "Rcpp"     "MASS"     "magrittr" "lattice"  "chron"    "zoo"      "jsonlite"
## [1] 0.09377989 0.03872005 0.01992489 0.01920662 0.01868721 0.01762628 0.01456989

As can be seen from the results, the package Rcpp is the most enhanced package. Even with a large graph with 16236 vertices, the convergence of the pagerank algorithm is rapid.

Solving large linear systems

When it comes to sparse systems, it becomes even worse if one tries to invert any sparse matrix, as the inverse of a sparse matrix is not guaranteed to be sparse. We illustrate this with a tri-diagonal matrix

n <- 100
A1 <- bandSparse(n, k=c(0,1), diag=list(rep(1,n), rep(-0.2, n)), symm=T)
A1inv <- solve(A1)
c(object.size(A1), object.size(A1inv))
## [1] 11904 62728
c(sum(abs(A1) < 1e-100), sum(abs(A1inv) < 1e-100))
## [1] 9702    0

Instead, just as in the case of dense matrix, one should try to use solve(A, b) whenever possible.