Matrix computation
Matrices are ubiquitous in statistics. But most statistical models and methods are only useful once turned into computer code, and this often means computing with matrices. It is easy to ruin a beautiful piece of theory by poor use of numerical linear algebra. This section covers some basic numerical linear algebra, including the important topics of efficiency and stability.
1 Efficiency in matrix computation
Recall the simple example
<- 2000
n <- matrix(runif(n*n),n,n)
A <- matrix(runif(n*n),n,n)
B <- runif(n)
y system.time(f0 <- A%*%B%*%y) # case 1
user system elapsed
2.424 0.046 2.482
system.time(f1 <- A%*%(B%*%y)) # case 2
user system elapsed
0.005 0.000 0.004
What happened here?
The answer is all to do with how the multiplications were ordered in the two cases, and the number of floating point operations (flops) required by the two alternative orderings.
In the first case \({\bf AB}\) was formed first, and the resulting matrix used to pre-multiply the vector \({\bf y}\).
In the second case, the vector \({\bf By}\) was formed first, and was then pre-multiplied by \({\bf A}\).
Now we need to count the floating point operations. Try a couple of simple examples first.
- How many flops \((+, -, *, /)\) are involved in the multiplication \[ \left ( \begin{array}{ccc} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23}\\ b_{31} & b_{32} & b_{33} \end{array}\right ) \left ( \begin{array}{c} y_1 \\ y_2 \\ y_3 \end{array}\right )? \]
How many operations are involved in the multiplication \[ \left ( \begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{array}\right ) \left ( \begin{array}{ccc} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23}\\ b_{31} & b_{32} & b_{33} \end{array}\right )? \]
Generalizing
In terms of \(n\), what is the flop cost of forming \({\bf AB}\)?
What is the cost of forming \({\bf By}\)?
Now write down the flop cost of the two ways of forming \({\bf ABy}\).
Another simple matrix operation is the evaluation of the trace of a matrix product. Again different ways of computing the same quantity lead to radically different computation times. Consider the computation of \({\rm tr}({\bf AB})\) where \(\bf A\) is \(2000 \times 1000\) and \(\bf B\) is \(1000 \times 2000\).
library(rbenchmark)
<- 2000; m <- 1000
n <- matrix(runif(n*m), n, m)
A <- matrix(runif(n*m), m, n)
B benchmark("sdAB" = { sum(diag(A %*% B)) },
"sdBA" = { sum(diag(B %*% A)) },
"sABt" = { sum(A*t(B)) },
replications = 10, order=NULL,
columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
test replications elapsed relative user.self sys.self
1 sdAB 10 12.303 153.787 12.162 0.133
2 sdBA 10 6.246 78.075 6.184 0.058
3 sABt 10 0.080 1.000 0.072 0.007
The first method forms \(\bf AB\), at a flop cost of \(2n^2m\) and then extracts the leading diagonal and sums it.
The second method uses the fact that \({\rm tr}({\bf AB}) = {\rm tr}({\bf BA})\). It forms \(\bf BA\) at flop cost of \(2nm^2\) and then extracts the leading diagonal of the result and sums it.
The third method makes direct use of the fact that \({\rm tr}({\bf AB}) = \sum_{ij} A_{ij} B_{ji}\), at a cost of \(2nm\) (in principle, but the R code above leads to an additional \(nm\) copy operations due to the transpose). It is the fastest because no effort is wasted in calculating the off diagonal elements of the target matrix.
Notice that method 1 is not just wasteful of flops, it also requires storage of an \(n \times n\) matrix, which is much larger than either \(\bf A\) or \(\bf B\).
1.1 Flop counting in general
Clearly it is important to count flops, but it is also tedious to do a complete count for complex algorithms. In many cases an exact count is not necessary. Instead we simply need to know how the computation scales with problem size, as it becomes large. For this reason the leading order operation count is all that is usually reported. For example the naive computation of \({\rm tr}({\bf AB})\), above, required at least \(n^2m + n^2(m-1) + n - 1\) operations. As \(n\) and \(m\) tend to infinity this is dominated by a term proportional to \(n^2m\), so we write that the computational cost is \(O(n^2m)\), i.e. order \(n^2m\).
Note also that flops are not always defined in the same way. For some authors one flop is one floating point addition, subtraction, division or multiplication, but for others one flop is one multiplication/division with one addition/subtraction.
1.1.1 There is more to efficiency than flop counting
Flop count is clearly important, but it is not the only aspect of efficiency. With current computing technology (driven initially by the computer games market) a floating point operation often takes the same amount of time as an operation on an integer or a memory address. Hence how matrices are stored and addressed in memory is often as important as the flop count in determining efficiency.
As an example consider the multiplication of the elements of two matrices as part of a matrix multiplication performed in C. The code
C[i][j] += A[i][k]*B[k][j]
requires at least 6 address/integer calculations in order to figure out where the relevant elements of A
, B
and C
are located, before the 2 floating point operations can be performed. To avoid this, efficient C
code stores pointers which keep a record of the memory locations of the ‘current’ relevant elements of A
, B
and C
. The multiplication loops are then structured so as to minimise the work needed to update such pointers, for each new element multiplication.
To further complicate matters, computer processors often contain features that can be exploited to increase speed. For example, fetching data from the processor memory cache is much quicker than fetching data from general memory (i.e. Random Access Memory, RAM, not to be confused with the yet slower disk). An algorithm structured to break a problem into chunks that fit entirely into cache, and are re-used, can therefore be very quick. Similarly, some processors can execute multiple floating point operations simultaneously, but careful coding is required to use such a feature.
To avoid having to re-write whole matrix algebra libraries to maximise speed on every new processor that comes along, standard numerical linear algebra libraries, such as LAPACK (used by R under the hood), adopt a highly structured design. Wherever possible, the high level routines in the library (e.g. algorithms for finding eigenvalues), are written so that most of the flop intensive work is performed by calling a relatively small set of routines which implement comparatively simple basic matrix operations (e.g. multiplying a matrix by a vector), known as the Basic Linear Algebra Subprograms (BLAS). The BLAS can then be tuned to make use of particular features of particular computers, without the library that depends on it needing to change. This provides one way of exploiting multi-core computers. Multi-threaded BLAS can gain efficiency by using multiple CPU cores, without any of the higher level code having to change.
Whatever language or library you are using for statistical computing, you should be using native BLAS and LAPACK libraries “under–the–hood” for your matrix computations. But it is important to be aware that not all BLAS and LAPACK libraries are the same, and “default” versions shipped with Linux distributions and similar operating systems are not always the best. You can very often swap out a default BLAS library and replace it with an optimized BLAS for your system without changing or recompiling any code. This can easily make a factor of 2 or more difference to the running time of codes for which matrix computation is a bottleneck, so it is well worth investigating how to do this for an easy way of speeding up your code.
On Linux or MacOS, you can check which LAPACK and BLAS are being used by R using sessionInfo()
or
La_library()
[1] "/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib"
extSoftVersion()["BLAS"]
BLAS
"/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib"
but these commands don’t work on Windows.
1.1.2 Efficiency conclusions
In conclusion:
- Think about the flop count when computing with matrices (or computing in general).
- If you are writing compiled code for manipulating matrices (or other arrays), then be aware of memory management efficiency as well as flops.
- For optimal performance try and use libraries such as LAPACK, with a fast BLAS, wherever possible.
2 Some terminology
Now consider some more interesting things to do with matrices. First recall the following basic terminology.
\[ \begin{array}{|c|c|c|} \hline ~&~&~\\ \left ( \begin{array}{ccc} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \\ \end{array}\right )& \left ( \begin{array}{ccc} x_{11} & x_{12} & x_{13} \\ x_{12} & x_{22} & x_{23} \\ x_{13} & x_{23} & x_{33} \\ \end{array}\right )& {\begin{array}{c}{\bf y}^{\sf T}{\bf Xy} > 0 ~ \forall~{\bf y}\ne {\bf 0}\\ \left ({\bf y}^{\sf T}{\bf Xy} \ge 0 ~ \forall~{\bf y} \ne {\bf 0}\right )\end{array}} \\ {\bf square} & {\bf symmetric} & {\bf positive ~(semi-) definite} \\ \hline ~&~&~\\ \left ( \begin{array}{ccc} x_{11} & 0 & 0 \\ x_{21} & x_{22} & 0 \\ x_{31} & x_{32} & x_{33} \\ \end{array}\right )& \left ( \begin{array}{ccc} x_{11} & x_{12} & x_{13} \\ 0 & x_{22} & x_{23} \\ 0 & 0 & x_{33} \\ \end{array}\right )& \left ( \begin{array}{ccc} x_{11} & 0 & 0 \\ 0 & x_{22} & 0 \\ 0 & 0 & x_{33} \\ \end{array}\right )\\ {\bf lower~triangular} & {\bf upper~triangular} & {\bf diagonal}\\ \hline \end{array} \]
\(\bf Q\) is an orthogonal matrix iff \({\bf Q}^{\sf T}{\bf Q} = {\bf QQ}^{\sf T}= {\bf I}\).
3 Cholesky decomposition: a matrix square root
Positive definite matrices are the ‘positive real numbers’ of matrix algebra. They have particular computational advantages and occur frequently in statistics, since covariance matrices are usually positive definite (and always positive semi-definite). So let’s start with positive definite matrices, and their matrix square roots. To see why matrix square roots might be useful, consider the following.
Example. Generating multivariate normal random variables. Suppose that there exist very quick and reliable methods for simulating i.i.d. \(N(0,1)\) random deviates, but we need to simulate random vectors from \(N(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). Clearly we can generate vectors \({\bf z}\) from \(N({\bf 0},{\bf I})\). If we could find a square root matrix \(\bf R\) such that \({\bf R}^{\sf T}{\bf R} = \boldsymbol{\Sigma}\) then \[{\bf y} \equiv {\bf R}^{\sf T}{\bf z} + \boldsymbol{\mu} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}),\] since the covariance matrix of \(\bf y\) is \({\bf R}^{\sf T}{\bf I} {\bf R} = {\bf R}^{\sf T}{\bf R} = \boldsymbol{\Sigma}\) and \(\mathbb{E}({\bf y}) = \mathbb{E}({\bf R}^{\sf T}{\bf z} + \boldsymbol{\mu}) = \boldsymbol{\mu}\).
In general the square root of a positive definite matrix is not uniquely defined, but there is a unique upper triangular square root of any positive definite matrix: its Cholesky factor. The algorithm for finding the Cholesky factor is easily derived. Consider a \(4 \times 4\) example first. The defining matrix equation is \[
\left ( \begin{array}{cccc}
R_{11} & 0 & 0 & 0 \\
R_{12} & R_{22} & 0 & 0 \\
R_{13} & R_{23} & R_{33} & 0 \\
R_{14} & R_{24} & R_{34} & R_{44}
\end{array}\right )\left ( \begin{array}{cccc}
R_{11} & R_{12} & R_{13} & R_{14} \\
0 & R_{22} & R_{23} & R_{24} \\
0& 0& R_{33} & R_{34} \\
0 & 0 & 0 & R_{44}
\end{array}\right )=
\left ( \begin{array}{cccc}
A_{11} & A_{12} & A_{13} & A_{14}\\
A_{12} & A_{22} & A_{23} & A_{24}\\
A_{13} & A_{23} & A_{33} & A_{34}\\
A_{14} & A_{24} & A_{34} & A_{44}
\end{array}\right )
\] If the component equations of this expression are written out and solved in the right order, then each contains only one unknown, as the following illustrates (unknown in grey) \[\begin{eqnarray*}
A_{11} &=& \textcolor{gray}{R_{11}}^2 \\
A_{12} &=& R_{11} \textcolor{gray}{R_{12}} \\
A_{13} &=& R_{11} \textcolor{gray}{R_{13}} \\
A_{14} &=& R_{11} \textcolor{gray}{R_{14}} \\
A_{22} &=& R_{12}^2 + \textcolor{gray}{R_{22}}^2\\
A_{23} &=& R_{12} R_{13} + R_{22} \textcolor{gray}{R_{23}}\\
&.&\\
&.&
\end{eqnarray*}\] Generalising to the \(n \times n\) case, and using the convention that \(\sum_{k=1}^0 x_i \equiv 0\), we have \[
R_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} R_{ki}^2 }, {\rm ~~and~~}
R_{ij} = \frac{A_{ij} - \sum_{k=1}^{i-1} R_{ki}R_{kj}}{R_{ii}}, ~~ j> i.
\] Working through these equations in row order, from row one, and starting each row from its leading diagonal component ensures that all r.h.s. quantities are known at each step. Cholesky decomposition requires \(n^3/3\) flops and \(n\) square roots. In R it is performed by function chol
, which calls routines in LAPACK or LINPACK.
Example (continued). The following simulates a random draw from \[ N\left (\left ( \begin{array}{c} 1 \\ -1 \\ 3 \end{array}\right ), \left ( \begin{array}{ccc} 2 & -1 & 1 \\ -1 & 2 & -1 \\ 1 & -1 & 2 \end{array}\right )\right ) \]
<- matrix(c(2,-1,1,-1,2,-1,1,-1,2),3,3) # Covariance matrix
V <- c(1,-1,3)
mu <- chol(V) ## Cholesky factor of V (upper triangle)
R <- rnorm(3) ## iid N(0,1) deviates
z <- t(R)%*%z + mu ## N(mu,V) deviates
y y
[,1]
[1,] 0.3503822
[2,] -2.6983111
[3,] 6.9109113
The following simulates 1000 such vectors, and checks their observed mean and covariance.
<- matrix(rnorm(3000),3,1000) ## 1000 N(0,I) 3-vectors
Z <- t(R)%*%Z + mu ## 1000 N(mu,V) vectors
Y ## and check that they behave as expected...
rowMeans(Y) ## Empirical mu
[1] 0.9527966 -0.9956858 2.9462114
-mu)%*%t(Y-mu)/1000 ## Empirical V (Y
[,1] [,2] [,3]
[1,] 2.133478 -1.073804 1.035891
[2,] -1.073804 2.129278 -1.106442
[3,] 1.035891 -1.106442 2.182957
As a second application of the Cholesky decomposition, consider evaluating the log likelihood of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) \[ l = -\frac{n}{2} \log(2 \pi) - \frac{1}{2} \log|\boldsymbol{\Sigma}| - \frac{1}{2} ({\bf y} - \boldsymbol{\mu})^{\sf T}\boldsymbol{\Sigma}^{-1} ({\bf y} - \boldsymbol{\mu}). \] If we were simply to invert \(\boldsymbol{\Sigma}\) to evaluate the final term, it would cost \(2n^3\) flops, and we would still need to evaluate the determinant. A Cholesky based approach is much better. It is easy to see that \(\boldsymbol{\Sigma}^{-1} = {\bf R}^{-1} {\bf R} ^{\sf -T}\), where \(\bf R\) is the Cholesky factor of \(\boldsymbol{\Sigma}\). So the final term in the log likelihood can be written as \({\bf z}^{\sf T}{\bf z}\) where \({\bf z} = {\bf R}^{\sf -T}({\bf y} - \boldsymbol{\mu})\). Notice that we don’t actually need to evaluate \({\bf R}^{\sf -T}\), but simply to solve \({\bf R}^{\sf T}{\bf z} = {\bf y} - \boldsymbol{\mu}\) for \(\bf z\). To see how this is done, consider a \(4 \times 4\) case again: \[ \left ( \begin{array}{cccc} R_{11} & 0 & 0 & 0 \\ R_{12} & R_{22} & 0 & 0 \\ R_{13} & R_{23} & R_{33} & 0 \\ R_{14} & R_{24} & R_{34} & R_{44} \end{array}\right ) \left ( \begin{array}{c} z_1 \\ z_2 \\ z_3 \\ z_4 \end{array}\right )= \left ( \begin{array}{c} y_1 - \mu_1 \\ y_2 - \mu_2 \\ y_3 - \mu_3 \\ y_4 - \mu_4 \end{array}\right ) \] If this system of equations is solved from the top down, then there is only one unknown (shown in grey) at each stage: \[\begin{eqnarray*} R_{11} \textcolor{gray}{z_1} &=& y_1 - \mu_1 \\ R_{12} z_1 + R_{22} \textcolor{gray}{z_2} &=& y_2 - \mu_2 \\ R_{13} z_1 + R_{23} z_2 + R_{33} \textcolor{gray}{z_3} &=& y_3 - \mu_3 \\ R_{14 } z_1 + R_{24} z_2 + R_{34} z_3 + R_{44} \textcolor{gray}{z_{4}} &=& y_4 - \mu_4 \end{eqnarray*}\] The generalisation of this forward substitution process to \(n\) dimensions is obvious, as is the fact that it cost \(O(n^2)\) flops — much cheaper than explicit formation of \({\bf R}^{\sf -T}\), which would involve applying forward substitution to find each column of the unknown \({\bf R}^{\sf -T}\) in the equation \({\bf R}^{\sf T}{\bf R}^{\sf -T}= {\bf I}\), at \(O(n^3)\) cost.
In R there is a routine forwardsolve
for doing forward substitution with a lower triangular matrix (and a routine backsolve
for performing the equivalent back substitution with upper triangular matrices). Before using it, we still need to consider \(|\boldsymbol{\Sigma}|\). Again the Cholesky factor helps. From general properties of determinants we know that \(|{\bf R}^{\sf T}|| {\bf R}| =|\boldsymbol{\Sigma}|\), but because \(\bf R\) is triangular \(|{\bf R}^{\sf T}| = |{\bf R}| = \prod_{i=1}^n R_{ii}\). So given the Cholesky factor the calculation of the determinant is \(O(n)\).
Example. The following evaluates the log likelihood of the covariance matrix V
and mean vector mu
, from the previous example, given an observed \({\bf y}^{\sf T}= (1,2,3)\).
<- 1:3
y <- forwardsolve(t(R),y-mu)
z <- -length(y)*log(2*pi)/2 - sum(log(diag(R))) - sum(z*z)/2
logLik logLik
[1] -6.824963
Note that Cholesky decomposition of a matrix that is not positive definite will fail. Even positive semi-definite will not work, since in that case a leading diagonal element of the Cholesky factor will become zero, so that computation of the off diagonal elements on the same row is impossible. Positive semi-definite matrices are reasonably common, so this is a practical problem. For the positive semi-definite case, it is possible to modify the Cholesky decomposition by pivoting: that is by re-ordering the rows/columns of the original matrix so that the zeroes end up at the end of the leading diagonal of the Cholesky factor, in rows that are all zero. This will not be pursued further here. Rather we will consider a more general matrix decomposition, which provides matrix square roots along with much else.
4 Eigen-decomposition (spectral-decomposition)
Any symmetric matrix, \(\bf A\) can be written as \[ {\bf A} = {\bf U}\boldsymbol{\Lambda}{\bf U}^{\sf T} \tag{1}\] where the matrix \(\bf U\) is orthogonal and \(\boldsymbol{\Lambda}\) is a diagonal matrix, with \(i^{th}\) leading diagonal element \(\lambda_i\) (conventionally \(\lambda_i\ge\lambda_{i+1}\)). Post-multiplying both sides of the decomposition by \(\bf U\) we have \[ {\bf AU} = {\bf U}\boldsymbol{\Lambda}. \] Considering this system one column at a time and writing \({\bf u}_i\) for the \(i^{\rm th}\) column of \(\bf U\) we have: \[ {\bf Au}_i = \lambda_i {\bf u}_i. \] i.e. the \(\lambda_i\) are the eigenvalues of \(\bf A\), and the columns of \(\bf U\) are the corresponding eigenvectors. Equation 1 is the eigen-decomposition or spectral decomposition of \(\bf A\).
We will not go over the computation of the eigen-decomposition in detail, but it is not done using the determinant and characteristic equation of \(\bf A\). One practical scheme is as follows: (i) the matrix \(\bf A\) is first reduced to tri-diagonal form using repeated pre and post multiplication by simple rank-one orthogonal matrices called Householder rotations, (ii) an iterative scheme called QR-iteration then pre-and post-multiplies the tri-diagonal matrix by even simpler orthogonal matrices, in order to reduce it to diagonal form. At this point the diagonal matrix contains the eigenvalues, and the product of all the orthogonal matrices gives \(\bf U\). Eigen-decomposition is \(O(n^3)\), but in practice a efficient eigen
routine for symmetric matrices is around 10 times as computationally costly as a Cholesky routine.
An immediate use of the eigen-decomposition is to provide an alternative characterisation of positive (semi-) definite matrices. All the eigenvalues of a positive (semi-)definite matrix must be positive (non-negative) and real. This is easy to see. Were some eigenvalue, \(\lambda_i\) to be negative (zero) then the corresponding eigenvector \({\bf u}_i\) would result in \({\bf u}_i^{\sf T}{\bf A} {\bf u}_i\) being negative (zero). At the same time the existence of an \(\bf x\) such that \({\bf x}^{\sf T}{\bf Ax}\) is negative (zero) leads to a contradiction unless at least one eigenvalue is negative (zero). Indeed, we can write \({\bf x} = {\bf Ub}\) for some vector \(\bf b\). So \({\bf x}^{\sf T}{\bf Ax}<0\) \(\Rightarrow {\bf b}^{\sf T}\boldsymbol{\Lambda}{\bf b}<0\) \(\Rightarrow \sum b_i^2 \Lambda_i <0\) \(\Rightarrow \Lambda_i<0\) for some \(i\).
4.1 Powers of matrices
Consider raising \(\bf A\) to the power \(m\). \[ {\bf A}^m = {\bf AAA} \cdots {\bf A} = {\bf U}\boldsymbol{\Lambda}{\bf U}^{\sf T}{\bf U}\boldsymbol{\Lambda}{\bf U}^{\sf T}\cdots {\bf U}\boldsymbol{\Lambda}{\bf U}^{\sf T}= {\bf U} \boldsymbol{\Lambda}\boldsymbol{\Lambda} \cdots \boldsymbol{\Lambda} {\bf U}^{\sf T}= {\bf U}\boldsymbol{\Lambda}^m{\bf U}^{\sf T} \] where \(\boldsymbol{\Lambda}^m\) is just the diagonal matrix with \(\lambda_i^m\) as the \(i^{\rm th}\) leading diagonal element. This suggests that any real valued function, \(f\), of a real valued argument, which has a power series representation, has a natural generalisation to a symmetric matrix valued function of a symmetric matrix argument, i.e. \[ f^\prime({\bf A}) \equiv {\bf U}f^\prime(\boldsymbol{\Lambda}){\bf U} ^{\sf T} \] where \(f^\prime(\boldsymbol{\Lambda})\) denotes the diagonal matrix with \(i^{\rm th}\) leading diagonal element \(f(\lambda_i)\). E.g. \(\exp({\bf A}) = {\bf U}\exp(\boldsymbol{\Lambda}){\bf U} ^{\sf T}\).
4.2 Another matrix square root
For matrices with non-negative eigenvalues we can generalise to non-integer powers. For example it is readily verified that \(\sqrt{\bf A} = {\bf U}\sqrt{\boldsymbol{\Lambda}}{\bf U}^{\sf T}\) has the property that \(\sqrt{\bf A}\sqrt{\bf A} = {\bf A}\). Notice (i) that \(\sqrt{\bf A}\) is not the same as the Cholesky factor, emphasising the non-uniqueness of matrix square roots, and (ii) that, unlike the Cholesky factor, \(\sqrt{\bf A}\) is well defined for positive semi-definite matrices (and can therefore be applied to any covariance matrix). Also, the symmetry of this square root can sometimes be convenient.
4.3 Matrix inversion, rank and condition
Continuing in the same vein we can write \[ {\bf A}^{-1} = {\bf U} \boldsymbol{\Lambda}^{-1} {\bf U}^{\sf T} \] where the diagonal matrix \(\boldsymbol{\Lambda}^{-1}\) has \(i^{\rm th}\) leading diagonal element \(\lambda_i^{-1}\). Clearly we have a problem if any of the \(\lambda_i\) are zero, for the matrix inverse will be undefined. A matrix with no zero eigen-values is termed full rank. A matrix with any zero eigenvalues is rank deficient and does not have an inverse. The number of non-zero eigenvalues is the rank of a matrix.
For some purposes it is sufficient to define a generalised inverse or pseudo-inverse when faced with rank deficiency, by finding the reciprocal of the non-zero eigenvalues, but setting the reciprocal of the zero eigenvalues to zero. We will not pursue this here.
It is important to understand the consequences of rank deficiency quite well when performing matrix operations involving matrix inversion/matrix equation solving. This is because near rank deficiency is rather easy to achieve by accident, and in finite precision arithmetic is as bad as rank deficiency. First consider trying to solve \[ {\bf Ax} = {\bf y} \] for \(\bf x\) when \({\bf A}\) is rank deficient. In terms of the eigen-decomposition the solution is \[ {\bf x} = {\bf U}\boldsymbol{\Lambda}^{-1}{\bf U}^{\sf T}{\bf y} \] i.e. \(\bf y\) is rotated to become \({\bf y}^\prime = {\bf U}^{\sf T}{\bf y}\), the elements of \({\bf y}^\prime\) are then divided by the eigenvalues, \(\lambda_i\), and the reverse rotation is applied to the result. The problem is that \(y^\prime_i/\lambda_i\) is not defined if \(\lambda_i = 0\). This is just a different way of showing something that you already know: rank deficient matrices can not be inverted. But the approach also helps in the understanding of near rank deficiency and ill conditioning.
An illustrative example highlights the problem. Suppose that \(n \times n\) symmetric matrix \(\bf A\) has \(n-1\) distinct eigenvalues ranging from 0.5 to 1, and one much smaller magnitude eigenvalue \(\epsilon\). Further suppose that we wish to compute with \(\bf A\), on a machine that can represent real numbers to an accuracy of 1 part in \(\epsilon^{-1}\). Now consider solving the system \[ {\bf Ax} = {\bf u}_1 \label{ill.cond.sys} \tag{2}\] for \(\bf x\), where \({\bf u}_1\) is the dominant eigenvector of \(\bf A\). Clearly the correct solution is \({\bf x} = {\bf u}_1\), but now consider computing the answer. As before we have a formal solution \[ {\bf x} = {\bf U}\boldsymbol{\Lambda}^{-1}{\bf U}^{\sf T}{\bf u}_1, \] but although analytically \({\bf u}^\prime_1={\bf U}^{\sf T}{\bf u}_1 = (1,0,0,\ldots,0)^{\sf T}\), the best we can hope for computationally is to get \({\bf u}^\prime_1 = (1+e_1,e_2,e_3,\ldots,e_n)^{\sf T}\) where the number \(e_j\) are of the order of \(\pm \epsilon\). For convenience, suppose that \(e_n = \epsilon\). Then, approximately, \(\boldsymbol{\Lambda}^{-1}{\bf u}^\prime_1 = (1,0,0, \ldots,0,1)^{\sf T}\), and \({\bf x} = {\bf U}\boldsymbol{\Lambda}^{-1}{\bf u}^\prime_1 = {\bf u}_1 + {\bf u}_n\), which is not correct. Similar distortions would occur if we used any of the other first \(n-1\) eigenvectors in place of \({\bf u}_1\): they all become distorted by a spurious \({\bf u}_n\) component, with only \({\bf u}_n\) itself escaping.
Now consider an arbitrary vector \(\bf y\) on the r.h.s. of Equation 2. We can always write it as some weighted sum of the eigenvectors \({\bf y} = \sum_i w_i {\bf u}_i\). This emphasises how bad the ill-conditioning problem is: all but one of \(\bf y\)s components are seriously distorted when multiplied by \({\bf A}^{-1}\). By contrast multiplication by \(\bf A\) itself would lead only to distortion of the \({\bf u}_n\) component of \(\bf y\), and not the other eigen-vectors, but the \({\bf u}_n\) component is the component that is so heavily shrunk by multiplication by \(\bf A\) that it makes almost no contribution to the result, unless we have the misfortune to choose a \(\bf y\) that is proportional to \({\bf u}_n\) and nothing else.
A careful examination of the preceding argument reveals that what really matters in determining the seriousness of the consequences of near rank deficiency is the ratio of the largest magnitude to the smallest magnitude eigenvalues, i.e. \[
\kappa = \max|\lambda_i|/\min|\lambda_i|.
\] This quantity is a condition number for \(\bf A\). Roughly speaking it is the factor by which errors in \(\bf y\) will be multiplied when solving \({\bf Ax} = {\bf y}\) for \(\bf x\). Once \(\kappa\) begins to approach the reciprocal of the machine precision we’re in trouble. Because the condition number is so important in numerical computation, there are several methods for getting an approximate condition number more cheaply than via eigen decomposition — e.g. see ?kappa
in R. A system with a large condition number is referred to as ill-conditioned. Orthogonal matrices have \(\kappa=1\), which is why numerical analysts like them so much.
Example. We are now in a position to understand what happened when we tried to use the `normal equations’ to fit the simple quadratic model in the introduction. Let’s explicitly calculate the condition number of \({\bf X}^{\sf T}{\bf X}\).
set.seed(1)
<- sort(runif(100))
xx <- xx + 100 # regenerated same x data
x <- model.matrix(~x + I(x^2)) # and the model matrix
X <- crossprod(X) # form t(X)%*%X (with half the flop count)
XtX <- eigen(XtX)$values
lambda 1]/lambda[3] # the condition number of X'X lambda[
[1] 2.426288e+18
1/.Machine$double.eps # Reciprocal of machine precision
[1] 4.5036e+15
Of course this raises a couple of obvious questions. Could we have diagnosed the problem directly from \(\bf X\)? And how does the lm
function avoid this problem? Answers soon, but first consider a trick for reducing \(\kappa\).
4.4 Preconditioning
The discussion of condition numbers given above was for systems involving unstructured matrices (albeit presented only in the context of symmetric matrices). Systems involving matrices with special structure are sometimes less susceptible to ill-conditioning than naive computation of the condition number would suggest. For example if \(\bf D\) is a diagonal matrix, then we can accurately solve \({\bf Dy} = {\bf x}\) for \(\bf y\), however large \(\kappa({\bf D})\) is: overflow or underflow are the only limits.
This basic fact can sometimes be exploited to re-scale a problem to improve computational stability. As an example consider diagonal preconditioning of the computation of \({\bf X}^{\sf T}{\bf X}\), above. For the original \({\bf X}^{\sf T}{\bf X}\) we have
solve(XtX)
Error in solve.default(XtX): system is computationally singular: reciprocal condition number = 3.98648e-19
But now suppose that we create a diagonal matrix \(\bf D\), with elements \(D_{ii} =1/ \sqrt{({\bf X}^{\sf T}{\bf X})_{ii}}\). Clearly \[ ({\bf X}^{\sf T}{\bf X})^{-1} = {\bf D}({\bf D}{\bf X}^{\sf T}{\bf X}{\bf D})^{-1} {\bf D}, \] but \(({\bf D}{\bf X}^{\sf T}{\bf X}{\bf D})^{-1}\) turns out to have a much lower condition number than \({\bf X}^{\sf T}{\bf X}\)
<- diag(1/diag(XtX)^.5)
D <- D%*%XtX%*%D
DXXD <- eigen(DXXD)$values
lambda 1]/lambda[3] lambda[
[1] 429386510115
As a result we can now compute the inverse of \({\bf X}^{\sf T}{\bf X}\)
<- D%*%solve(DXXD,D) ## computable inverse of X'X
XtXi %*% XtX ## how accurate? XtXi
(Intercept) x I(x^2)
[1,] 1.000010e+00 1.059008e-03 0.106773827
[2,] -9.431497e-08 9.999916e-01 -0.002492462
[3,] 4.270717e-10 5.214940e-08 1.000008353
not perfect, but better than no answer at all.
4.5 Asymmetric eigen-decomposition
If positive definite matrices are the positive reals of the square matrix system, and symmetric matrices are the reals, then asymmetric matrices are the complex numbers. As such they have complex eigenvectors and eigenvalues. It becomes necessary to distinguish right and left eigenvectors (one is no longer the transpose of the other), and the right and left eigenvector matrices are no longer orthogonal matrices (although they are still inverses of each other). Eigen-decomposition of asymmetric matrices is still \(O(n^3)\), but is substantially more expensive than the symmetric case. Using R on the laptop used to prepare these notes, eigen-decomposition of a symmetric \(1000 \times 1000\) matrix took 0.5 seconds. The asymmetric equivalent took 3 seconds.
The need to compute with complex numbers somewhat reduces the practical utility of the eigen-decomposition in numerical methods for statistics. It would be better to have a decomposition that provides some of the useful properties of the eigen-decomposition without the inconvenience of complex numbers. The singular value decomposition meets this need.
5 Singular value decomposition
The singular values of an \(r \times c\) matrix, \(\bf A\) with \(r \ge c\) are the positive square roots of the eigenvalues of \({\bf A}^{\sf T}{\bf A}\). If \(\bf A\) is positive semi-definite then its singular values are just its eigenvalues, of course. For symmetric \(\bf A\), eigenvalues and singular values differ only in sign, if at all. However the singular values are also well defined, and real, for matrices that are not even square, let alone symmetric.
Related to the singular values is the singular value decomposition \[ {\bf A} = {\bf UDV}^{\sf T} \] where \(\bf U\) has orthogonal columns and is the same dimension as \(\bf A\), while \(\bf D\) is a \(c \times c\) diagonal matrix of the singular values (usually arranged in descending order), and \(\bf V\) is a \(c \times c\) orthogonal matrix.
The singular value decomposition is computed using a similar approach to that used for the symmetric eigen problem: orthogonal bi-diagonalization, followed by QR iteration, and totalling an \(O(rc^2)\) cost (it does not involve forming \({\bf A}^{\sf T}{\bf A}\)) . It is more costly than symmetric eigen-decomposition, but cheaper than the asymmetric equivalent. Here are some example R timings for decomposing \(1000 \times 1000\) matrices
<- 1000
n = matrix(rnorm(n*n),ncol=n)
XA = XA %*% t(XA)
XXt system.time(eigen(XXt, symmetric=TRUE))
user system elapsed
0.969 0.011 0.980
system.time(svd(XXt))
user system elapsed
1.975 0.016 1.992
system.time(chol(XXt))
user system elapsed
0.153 0.001 0.155
system.time(eigen(XA))
user system elapsed
3.253 0.026 3.290
system.time(svd(XA))
user system elapsed
2.026 0.017 2.043
system.time(qr(XA))
user system elapsed
0.426 0.004 0.429
The number of its non-zero singular values gives the rank of a matrix, and the SVD is the most reliable method for numerical rank determination (by examination of the size of the singular values relative to the largest singular value). In a similar vein, a general definition of condition number is the ratio of largest and smallest singular values: \(\kappa = d_1/d_c\).
Example. Continuing the simple regression disaster from the introduction, consider the singular values of X
<- svd(X)$d # get the singular values of X
d d
[1] 1.010455e+05 2.662169e+00 6.474081e-05
Clearly, numerically X
is close to being rank 2, rather than rank 3. Turning to the condition number
1]/d[3] d[
[1] 1560769713
\(\kappa \approx 2 \times 10^9\) is rather large, especially since it is easy to show that the condition number of \({\bf X}^{\sf T}{\bf X}\) must be the square of this. Which explains why computing \(({\bf X}^{\sf T}{\bf X})^2\) via solve(XtX)
leads to a problem (see above).
In fact the SVD not only provides a diagnosis of the problem, but also a possible solution. We can re-write the solution to the normal equations in terms of the SVD of X. \[\begin{eqnarray*} ({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf y} &=& ({\bf V} {\bf D} {\bf U}^{\sf T}{\bf U}{\bf D} {\bf V}^{\sf T})^{-1} {\bf V}{\bf D}{\bf U}^{\sf T}{\bf y}\\ &=& ({\bf V} {\bf D}^2 {\bf V}^{\sf T})^{-1} {\bf V}{\bf D}{\bf U}^{\sf T}{\bf y} \\ &=& {\bf V} {\bf D}^{-2} {\bf V}^{\sf T}{\bf V} {\bf D}{\bf U}^{\sf T}{\bf y} \\ &=& {\bf V} {\bf D}^{-1} {\bf U}^{\sf T}{\bf y} \end{eqnarray*}\] Notice two things
The condition number of the system that we have ended up with is exactly the condition number of \(\bf X\), i.e. the square root of the condition number involved in the direct solution of the normal equations.
Comparing the final RHS expression to the representation of an inverse in terms of its eigen-decomposition, it is clear that \({\bf V} {\bf D}^{-1} {\bf U}^{\sf T}\) is a sort of pseudo-inverse of \(\bf X\).
The SVD has many uses. One interesting one is low rank approximation of matrices. In a well defined sense, the best rank \(k \le {\rm rank}({\bf X})\) approximation to a matrix \(\bf X\) can be expressed in terms of the SVD of \({\bf X}\) as \[
\tilde {\bf X}= {\bf U} \tilde {\bf D} {\bf V} ^{\sf T}
\] where \(\tilde {\bf D}\) is \(\bf D\) with all but the \(k\) largest singular values set to \(0\). Using this result to find low rank approximations to observed covariance matrices is the basis for several dimension reduction techniques in multivariate statistics (although, of course, a symmetric eigen decomposition is then equivalent to SVD). One issue with this sort of approximation is that the full SVD is computed, despite the fact that part of it is then to be discarded (don’t be fooled by routines that ask you how many eigen or singular vectors to return — I saved .1 of a second, out of 13 by getting R routine svd
to only return the first columns of \(\bf U\) and \(\bf V\)). Look up Lanczos methods and Krylov subspaces for approaches that avoid this sort of waste.
6 The QR decomposition
The SVD provided a stable solution to the linear model fitting example, but at a rather high computational cost, prompting the question of whether similar stability could be obtained without the full cost of SVD? The QR decomposition provides a positive answer. We can write any \(r \times c\) rectangular matrix \({\bf X}\) (\(r \ge c\)) as the product of columns of an orthogonal matrix and an upper triangular matrix. \[ {\bf X}= {\bf QR} \] where \(\bf R\) is upper triangular and \(\bf Q\) is of the same dimension as \({\bf X}\), with orthogonal columns (so \({\bf Q}^{\sf T}{\bf Q} = {\bf I}\), but \({\bf QQ}^{\sf T}\ne {\bf I}\)). The QR decomposition has a cost of \(O(rc^2)\), but it is about 1/3 of the cost of SVD.
Consider the linear modelling example again \[\begin{eqnarray*}
({\bf X}^{\sf T}{\bf X})^{-1}{\bf X}^{\sf T}{\bf y} &=& ({\bf R}^{\sf T}{\bf Q}^{\sf T}{\bf QR})^{-1} {\bf R}^{\sf T}{\bf Q} ^{\sf T}{\bf y} \\
&=& ({\bf R}^{\sf T}{\bf R})^{-1} {\bf R}^{\sf T}{\bf Q} ^{\sf T}{\bf y} \\
&=& {\bf R}^{-1} {\bf R}^{\sf -T}{\bf R}^{\sf T}{\bf Q} ^{\sf T}{\bf y} \\
&=& {\bf R}^{-1} {\bf Q} ^{\sf T}{\bf y} \\
\end{eqnarray*}\] Now the singular values of \(\bf R\) are the same as the singular values of \({\bf X}\), so this system has the same condition number as \({\bf X}\). i.e. the QR method provides the stability of SVD without the computational cost (although it still has about twice the cost of solving the normal equations via Cholesky decomposition). In fact the QR method for solving least squares estimates can be derived from first principles, without using the normal equations, in which case it also provides a very compact way of deriving much of the distributional theory required for linear modelling. R routine lm
uses the QR approach, which is why, in the Introduction, it was able to deal with the case for which the normal equations failed. Of course it is not magic: the subsequent failure even of lm
arose because I had caused the model matrix condition number to become so bad that even the QR approach fails (I should have centred \(x\), by subtracting its mean, or used orthogonal polynomials, via poly
).
It is also worth noting the connection between the QR and Cholesky decompositions. If \({\bf X = QR}\), then \({\bf X^{\sf T}X = R^{\sf T}R}\), where \({\bf R}\) is upper triangular, so the QR decomposition of \({\bf X}\) gives the Cholesky factorisation of \({\bf X^{\sf T}X}\) “for free”. Note that most QR routines don’t bother ensuring that the diagonal elements of \({\bf R}\) are positive. If the positivity of the diagonal elements is an issue (it rarely matters), it can easily be fixed by flipping the signs of the relevant rows of \({\bf R}\). This relationship between the QR and Cholesky can be used to directly construct the Cholesky decomposition of a covariance matrix in a stable way, without ever directly constructing the covariance matrix.
Another application of the QR decomposition is determinant calculation. If \({\bf A}\) is square and \({\bf A} = {\bf QR}\) then \[ |{\bf A}| = |{\bf Q}||{\bf R}| = |{\bf R}| = \prod_i R_{ii} \] since \(\bf R\) is triangular, while \(\bf Q\) is orthogonal with determinant 1. Usually we need \[ \log |{\bf A}| = \sum_i \log |R_{ii}| \] which underflows to zero much less easily than \(|{\bf A}|\).
7 Woodbury’s formula
There are a large number of matrix identities that are useful for simplifying computations, and the matrix cookbook is an excellent source of these. One useful set of identities relates to the update of the inverse of a matrix, when it is subject to a low rank modification. There are several different variants, and each goes by different names. The Woodbury identity, Sherman-Morrison-Woodbury formula, Sherman-Morrison formula, etc., are all common names for various versions and special cases.
Suppose that \({\bf A}^{-1}\) is already known, or, more usually, that we already have a matrix decomposition allowing efficient computations involving \({\bf A}^{-1}\). Now suppose that \({\bf A}\) is subject to a low-rank update of the form \({\bf A+UV}^{\sf T}\), for \({\bf U}, {\bf V}\) both \(m\times n\) with \(n << m\). The problem is how to compute with the updated inverse \(({\bf A+UV}^{\sf T})^{-1}\) in an efficient way. Woodbury’s formula gives one answer. \[ ({\bf A+UV}^{\sf T})^{-1} = {\bf A}^{-1} - {\bf A}^{-1}{\bf U}({\bf I}_n+{\bf V}^{\sf T}{\bf A}^{-1}{\bf U})^{-1}{\bf V}^{\sf T}{\bf A}^{-1} \] Crucially, the inverse on the LHS is \(m\times m\), whereas the new inverse required on the RHS is only \(n\times n\). This is what makes the result so useful. A very commonly encountered special case is where \({\bf U}\) and \({\bf V}\) are just \(m\times 1\) vectors, \({\bf u, v}\). In that case the identity is often referred to as the Sherman-Morrison formula, and reduces to \[ ({\bf A+uv}^{\sf T})^{-1} = {\bf A}^{-1} - \frac{\bf A^{-1}uv^{\sf T}A^{-1}}{1 + {\bf v^{\sf T}A^{-1}u}}. \] Writing it this way makes it clear that no new matrix inversion is required to perform the update.
The form given above is sufficiently general for most applications in statistical computing, but it is worth noting a modest generalisation, often referred to as the Sherman-Morrison-Woodbury formula, which introduces an additional \(n\times n\) matrix, \({\bf C}\). \[ ({\bf A+UCV}^{\sf T})^{-1} = {\bf A}^{-1} - {\bf A}^{-1}{\bf U}({\bf C}^{-1}+{\bf V}^{\sf T}{\bf A}^{-1}{\bf U})^{-1}{\bf V}^{\sf T}{\bf A}^{-1} \] Each of these results is easily proved by direct multiplication.
There are numerous applications of these identities in statistical computing. Updating a precision matrix based on a low-rank update of a covariance matrix is one example, but note that in many applications we would avoid explicitly forming the modified inverse, and would simply use the expressions alongside a suitable decomposition of \(\bf A\) to compute with the modified inverse.
8 Further reading on matrices
These notes can only scratch the surface of numerical linear algebra. Hopefully you now have a clearer idea of the importance of stability and efficiency in matrix computation, and some understanding of a few key decompositions and some of their applications. My hope is that when writing any code involving matrices, you’ll at least think about flops and condition numbers. If you want to find out more, here are some starting points.
- Golub, GH and CF Van Loan (1996) Matrix Computations 3rd ed. John Hopkins. This is the reference for numerical linear algebra, but not the easiest introduction. It is invaluable if you are writing carefully optimized matrix computations, directly calling BLAS and LAPACK functions.
- Watkins, DS (2002) Fundamentals of Matrix Computations 2nd ed. Wiley. This is a more accessible book, but still aimed at those who really want to understand the material (the author encourages you to write your own SVD routine, and with this book you can).
- Press WH, SA Teukolsky, WT Vetterling and BP Flannery (2007) Numerical Recipes 3rd ed. Cambridge. Very easily digested explanation of how various linear algebra routines work, with code. But there are better versions of most of the algorithms in this book.
- The GNU scientific library is a better bet than Numerical Recipes for actual code.
- For state of the art numerical linear algebra use LAPACK.
- Matlab offers a convenient, but expensive, platform for matrix computation. For an introduction see e.g. Pratap, R (2006) Getting Started with Matlab, Oxford.
- Octave is Matlab-like free software.
- Julia is another dynamically typed scripting language aimed at scientific computing.
- Monahan, JF (2001) Numerical Methods of Statistics provides a useful treatment of numerical linear algebra for statistics.
- Harville, DA (1997) Matrix Algebra From a Statistician’s Perspective, Springer, provides a wealth of helpful theoretical material.