Sparse Matrices
Matrix
package
The R
Matrix
package provides additional functions for both dense and sparse matrices that extend the basic matrix data type. In the Matrix
package, dense matrices are stored as dgeMatrix
objects. One useful function it adds for dense matrices is rankMatrix
, which returns the rank of the input matrix, up to a certain tolerance if desired. Another useful function is rcond
, which gives the condition number of a square matrix, defined to be the product of the norm of the matrix and the norm of its inverse.
library(Matrix)
A <- Matrix(c(1,2,1,2), nrow=2, ncol=2)
B <- Matrix(c(1,2,1,2) + rnorm(4)/1e8, nrow=2, ncol=2)
C <- Matrix(rnorm(4), nrow=2, ncol=2)
c(rankMatrix(A), rankMatrix(B), rankMatrix(B, tol=1e-7))
## [1] 1 2 1
c(rcond(A), rcond(B), rcond(C), rcond(C/1e5))
## [1] 0.000000e+00 2.882359e-09 6.976583e-01 6.976583e-01
A sparse matrix is one where most entries will be 0. It is inefficient to store all these 0’s in memory.
set.seed(7)
nrows <- 1000
ncols <- 1000
vals <- sample(x=c(0, 1, 2), prob=c(0.98, 0.01, 0.01), size=nrows*ncols, replace=TRUE)
m1 <- matrix(vals, nrow=nrows, ncol=ncols)
m2 <- Matrix(vals, nrow=nrows, ncol=ncols, sparse=TRUE)
m1[1:2, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
m2[1:2, 1:10]
## 2 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] 2 . . . . . . . . .
## [2,] . . . . . . . . . .
c(object.size(m1), object.size(m2))
## [1] 8000216 245856
By default, a sparse matrix in R
is stored as a dgCMatrix
, “d” for digit, “g” for general (i.e. not triangular or symmetric), “C” for column (not row or triplet). If one converts a sparse matrix to a dense dgeMatrix
, then one loses the advantage in terms of memory storage. A dgCMatrix
is stored using compressed sparse column (CSC) format. An alternative is the compressed sparse row (CSR) format, dgRMatrix
, or triplets dgTMatrix
. One can coerce a dgCMatrix
into a dgTMatrix
(and from dgTMatrix
to dgCMatrix
), but not a dgRMatrix
from either dgCMatrix
or dgTMatrix
.
str(m2, max.level=1)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
str(m2)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:20029] 0 18 22 24 158 360 362 513 524 550 ...
## ..@ p : int [1:1001] 0 23 46 68 90 110 127 142 165 188 ...
## ..@ Dim : int [1:2] 1000 1000
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ x : num [1:20029] 2 2 1 2 1 2 1 1 1 1 ...
## ..@ factors : list()
which(m2[,1]>0)
## [1] 1 19 23 25 159 361 363 514 525 551 561 610 645 652 719 737 754 762 763
## [20] 816 878 910 963
object.size(as(m2, 'dgeMatrix'))
## 'as(<dgCMatrix>, "dgeMatrix")' is deprecated.
## Use 'as(., "unpackedMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").
## 8001176 bytes
m3 <- as(m1, 'dgRMatrix')
object.size(m3)
## 245856 bytes
m3[1:2, 1:10]
## 2 x 10 sparse Matrix of class "dgRMatrix"
##
## [1,] 2 . . . . . . . . .
## [2,] . . . . . . . . . .
Note that even though m3
is a dgRMatrix
, m3[1:2, 1:10]
returns a dgTMatrix
.
Basic Operations of sparse matrices
A <- as(matrix(c(1,0,0,0,2,0), nrow=3, ncol=2), 'dgCMatrix')
A / 10
## 3 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 0.1 .
## [2,] . 0.2
## [3,] . .
A + 10 # + or - operation turns a dgCMatrix into a dgeMatrix
## 3 x 2 Matrix of class "dgeMatrix"
## [,1] [,2]
## [1,] 11 10
## [2,] 10 12
## [3,] 10 10
A %*% c(1, 0) # matrix multiplication by a dense vector outputs a dgeMatrix
## 3 x 1 Matrix of class "dgeMatrix"
## [,1]
## [1,] 1
## [2,] 0
## [3,] 0
A %*% Matrix(c(1, 0), nrow=2, ncol=1, sparse=TRUE)
## 3 x 1 sparse Matrix of class "dgCMatrix"
##
## [1,] 1
## [2,] .
## [3,] .
A %*% t(A)
## 3 x 3 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 . .
## [2,] . 4 .
## [3,] . . .
rbind(A, A)
## 6 x 2 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 .
## [2,] . 2
## [3,] . .
## [4,] 1 .
## [5,] . 2
## [6,] . .
Obtaining the dependency graph of R
packages
We use the miniCRAN
package to build an R
package dependency graph. A graph \(G=(V,E)\), where \(V\) denotes the set of vertices, and \(E\) denotes the set of edges, can also be represented by an adjacency matrix \(M\). The matrix \(M\) is square with each row and column corresponding to a vertex. If there is an edge \((i,j)\), then \(M_{ij}=1\); all other entries in \(M\) are 0. Thus one can often expect \(M\) to be quite sparse. The following piece of code reads the package dependency information and constructs the corresponding adjacency matrix.
library(miniCRAN) # for obtaining dependency relations of R packages
load('matrix_packages.RData')
pkg <- 'igraph'
pkgDep(pkg, suggests=FALSE, enhances=FALSE, includeBasePkgs=TRUE)
## [1] "igraph" "methods" "cli" "graphics" "grDevices" "lifecycle"
## [7] "magrittr" "Matrix" "pkgconfig" "rlang" "stats" "utils"
## [13] "vctrs" "cpp11" "glue" "grid" "lattice"
plot(g1, vertex.size=20)
## This graph was created by an old(er) igraph version.
## Call upgrade_graph() on it to use with the current igraph version
## For now we convert it on the fly...
m_pkg1
## 11 x 11 sparse Matrix of class "dgCMatrix"
## [[ suppressing 11 column names 'igraph', 'methods', 'graphics' ... ]]
##
## igraph . . . . . . . . . . .
## methods 1 . . . . . . . . 1 .
## graphics 1 . . . . . 1 . . 1 .
## grid . . . . . . 1 . . 1 .
## stats 1 . . . . . 1 . . 1 .
## utils 1 . . . . . 1 . . 1 1
## lattice . . . . . . . . . 1 .
## grDevices 1 . . . . . 1 . . . .
## magrittr 1 . . . . . . . . . .
## Matrix 1 . . . . . . . . . .
## pkgconfig 1 . . . . . . . . . .
The file matrix_packages.RData
is generated by the following piece of code.
library(Matrix)
library(miniCRAN)
library(igraph) # for turning a graph into an adjacency matrix
#pkgs_db = available.packages('https://mran.revolutionanalytics.com/snapshot/2019-08-19/src/contrib',
pkgs_db = available.packages('https://cran.microsoft.com/snapshot/2020-11-17/src/contrib/', type='source', filters=NULL)
g1 <- makeDepGraph('igraph', includeBasePkgs=TRUE, suggests=FALSE, enhances=FALSE)
# here we obtain graphs of "enhancement", e.g. Rcpp is enhanced by many packages but enhances none
g2 <- makeDepGraph(pkgs_db[, 'Package'], availPkgs=pkgs_db, includeBasePkgs=FALSE, suggests=FALSE, enhances=TRUE)
m_pkg1 <- get.adjacency(g1, type="both", attr=NULL, names=TRUE, sparse=TRUE)
# take transpose since our convention is storage by row
m_pkg2 <- t(get.adjacency(g2, type="both", attr=NULL, names=TRUE, sparse=TRUE))
save(g1, g2, m_pkg1, m_pkg2, file='matrix_packages.RData')
The Pagerank algorithm
The pagerank algorithm is an iterative algorithm that takes a (directed) graph as input and outputs a probability distribution on the set of vertices. The probability distribution is essentially the stationary distribution of the random walk on the graph, hence the higher the weight of a vertex, the longer a random walk spends at this vertex, hence the more “important” it is.
Let \(PR^{(k)}(v_i)\) be the \(k\)th iterate for the calculation of pagerank of the \(i\)th vertex, then the pagerank algorithm performs the following computation: \[PR^{(k+1)}(v_i) = \frac{1-d}{N} + d \sum_{j: M_{ij}>0} \frac{PR^{(k)}(v_j)}{L(v_j)} \]
where \(L(v_j)\) is the number of outbound links of the \(j\)th vertex. Here \(d\) is a damping factor that has the effect of the random walk uniformly picking a vertex to visit with probability \(1-d\) at each iteration. In matrix format, we can write \(PR(v_i)\) as a horizontal vector, then \[ PR^{(k+1)} = \left[\frac{1-d}{N}, \ldots, \frac{1-d}{N} \right] + d \ PR^{(k)} \ \left[ \begin{array}{llll} M_{11}/L(v_1) & M_{12}/L(v_1) & \ldots & M_{1N}/L(v_1) \\ M_{21}/L(v_2) & \ddots & & \vdots \\ \vdots & & M_{ij}/L(v_i) & \\ M_{N1}/L(v_N) & \cdots & & M_{NN}/L(v_N) \end{array} \right]\]
compute_pagerank <- function(M) {
d <- 0.85 # damping factor
name_pkgs <- M@Dimnames[[1]]
N <- dim(M)[1]
M_rowsum <- rowSums(M)
for (i in which(M_rowsum==0)) M[i,i] = 1
M_rowsum <- rowSums(M)
# This operation divides each column of M element-wise by M_rowsum
A <- M / M_rowsum # A is a stochastic matrix
PR <- rep(1/N, N)
for (i in 1:10) {
PR <- (1-d)/N + d * (PR %*% A)
# print out 7 most important packages and their corresponding weights
i7 <- order(PR, decreasing=TRUE)[1:7]
print(name_pkgs[i7])
print(PR[i7])
}
}
compute_pagerank(m_pkg1)
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.47083333 0.23901515 0.16174242 0.03295455 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.67882576 0.17333333 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
## [1] "igraph" "Matrix" "lattice" "pkgconfig" "methods" "graphics"
## [7] "grid"
## [1] 0.78583239 0.06632670 0.03585227 0.01653409 0.01363636 0.01363636 0.01363636
compute_pagerank(m_pkg2)
## [1] "Rcpp" "MASS" "ggplot2" "Matrix" "dplyr" "survival" "mvtnorm"
## [1] 0.04544432 0.02491786 0.02007975 0.01328318 0.01251681 0.01162591 0.01088882
## [1] "Rcpp" "MASS" "Matrix" "lattice" "magrittr" "jsonlite" "mvtnorm"
## [1] 0.08255496 0.03512211 0.02273513 0.02220904 0.01740401 0.01386460 0.01325651
## [1] "Rcpp" "MASS" "lattice" "magrittr" "chron" "jsonlite" "mvtnorm"
## [1] 0.09199156 0.03805808 0.02260036 0.01925890 0.01906416 0.01447589 0.01371033
## [1] "Rcpp" "MASS" "magrittr" "chron" "zoo" "lattice" "jsonlite"
## [1] 0.09352076 0.03858840 0.01983434 0.01938121 0.01805550 0.01619707 0.01455902
## [1] "Rcpp" "MASS" "lattice" "magrittr" "zoo" "jsonlite" "mvtnorm"
## [1] 0.09373549 0.03870802 0.02267705 0.01991354 0.01807738 0.01456921 0.01417933
## [1] "Rcpp" "MASS" "lattice" "magrittr" "chron" "jsonlite" "mvtnorm"
## [1] 0.09376950 0.03871963 0.02232004 0.01992365 0.01944531 0.01456987 0.01425596
## [1] "Rcpp" "MASS" "magrittr" "chron" "lattice" "zoo" "jsonlite"
## [1] 0.09377607 0.03871999 0.01992479 0.01914185 0.01814680 0.01808562 0.01456989
## [1] "Rcpp" "MASS" "lattice" "magrittr" "zoo" "chron" "jsonlite"
## [1] 0.09377831 0.03872004 0.02204347 0.01992488 0.01782642 0.01559460 0.01456989
## [1] "Rcpp" "MASS" "lattice" "magrittr" "chron" "zoo" "jsonlite"
## [1] 0.09377940 0.03872004 0.02178517 0.01992489 0.01890677 0.01481098 0.01456989
## [1] "Rcpp" "MASS" "magrittr" "lattice" "chron" "zoo" "jsonlite"
## [1] 0.09377989 0.03872005 0.01992489 0.01920662 0.01868721 0.01762628 0.01456989
As can be seen from the results, the package Rcpp
is the most enhanced package. Even with a large graph with 16236 vertices, the convergence of the pagerank algorithm is rapid.
Solving large linear systems
When it comes to sparse systems, it becomes even worse if one tries to invert any sparse matrix, as the inverse of a sparse matrix is not guaranteed to be sparse. We illustrate this with a tri-diagonal matrix
n <- 100
A1 <- bandSparse(n, k=c(0,1), diag=list(rep(1,n), rep(-0.2, n)), symm=T)
A1inv <- solve(A1)
c(object.size(A1), object.size(A1inv))
## [1] 11904 62728
c(sum(abs(A1) < 1e-100), sum(abs(A1inv) < 1e-100))
## [1] 9702 0
Instead, just as in the case of dense matrix, one should try to use solve(A, b)
whenever possible.