Don’t Solve the Normal Equations

The (ordinary) linear least squares problem is as follows: given an m\times n matrix A and a vector b of length m, find the vector x such that Ax is as close to b as possible, when measured using the two-norm \| \cdot \|. That is, we seek to

(1)   \begin{equation*} \mbox{find } x \in\mathbb{R}^n \mbox{ such that }\| b - Ax \|^2 = \sum_{i=1}^m \left(b_i - \sum_{j=1}^n A_{ij} x_j \right)^2 \mbox{ is minimized}. \end{equation*}

From this equation, the name “least squares” is self-explanatory: we seek x which minimizes the sum of the squared discrepancies between the entries of b and Ax.

The least squares problem is ubiquitous in science, engineering, mathematics, and statistics. If we think of each row a_i of A as an input and its corresponding entry b_i of b as an output, then the solution x to the least squares problem gives the coefficients of a linear model for the input–output relationship. Given a new, previously unseen input a_{\rm new}, our model predicts the corresponding output to be approximately b_{\rm new} \approx a_{\rm new}^\top x = \sum_{i=1}^n x_i (a_{\rm new})_i. The least squares solution has the property that the average squared difference between the outputs b_i and the predictions a_i^\top x is as small as it could possibly be over all choices of the coefficient vector x.
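
For concreteness, here is a small MATLAB sketch of this fitting-and-predicting workflow. The data, the intercept column, and the noise level are all made up for illustration; the solve itself uses MATLAB's backslash, which is discussed below.

```matlab
% Hypothetical example: fit a linear model with an intercept to noisy data
% and use the fitted coefficients to predict the output for a new input.
m = 50;                            % number of observations
t = randn(m, 1);                   % a single scalar feature per observation
A = [ones(m, 1), t];               % each row a_i = [1, t_i]: intercept plus feature
b = 2 - 3*t + 0.1*randn(m, 1);     % noisy outputs from a made-up linear relationship
x = A \ b;                         % least squares coefficients (more on this solve below)
a_new = [1, 0.5];                  % a new, previously unseen input
b_new = a_new * x;                 % predicted output a_new' * x
```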

How do we solve the least squares problem? A classical solution approach, ubiquitous in textbooks, is to solve a system of linear equations known as the normal equations. The normal equations associated with the least squares problem (1) are given by

(2)   \begin{equation*} A^\top A \,x = A^\top b. \end{equation*}

This system of equations always has a solution. If A has full column rank, then A^\top A is invertible and the unique least squares solution to (1) is given by (A^\top A)^{-1} A^\top b. We assume that A has full column rank for the rest of this discussion. To solve the normal equations in software, we compute A^\top A and A^\top b and solve (2) using a linear solver like MATLAB’s “\”. (Even better, we could use a Cholesky decomposition, since the matrix A^\top A is symmetric positive definite.) As is generally true in matrix computations, it is almost never a good idea to explicitly form the inverse of the matrix A^\top A, or indeed of any matrix. We can also solve the normal equations using an iterative method like (preconditioned) conjugate gradient.
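
In MATLAB, the normal equations approach might look like the following minimal sketch (shown only to make the approach concrete; the matrix A and vector b are assumed given, and, as argued below, this is usually not the method to prefer).

```matlab
% Direct solve of the normal equations A'*A x = A'*b.
G = A' * A;               % n-by-n Gram matrix
c = A' * b;
x_direct = G \ c;         % generic linear solve ("\")

% Exploiting that G is symmetric positive definite via a Cholesky factorization.
R = chol(G);              % G = R'*R with R upper triangular
x_chol = R \ (R' \ c);    % two triangular solves

% Iterative alternative: (preconditioned) conjugate gradient applied to G*x = c.
x_cg = pcg(G, c, 1e-10, 500);
```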

The purpose of this article is to advocate against the use of the normal equations for solving least squares problems, at least in most cases. So what’s wrong with the normal equations? The problem is not that the normal equations aren’t mathematically correct. Instead, the problem is that the normal equations often lead to poor accuracy for the least squares solution when carried out in computer arithmetic.

Most of the time when using computers, we store real numbers as floating point numbers. (One can instead represent rational numbers on a computer as fractions of integers, and operations on them can be done exactly. However, this is prone to gross inefficiencies, as the number of digits in the rational numbers can grow to be very large, making the storage and time needed to solve linear algebra problems with rationals dramatically more expensive. For these reasons, the vast majority of numerical computations use floating point numbers, which store only a finite number of digits for any given real number.) In this model, except in extremely rare circumstances, rounding errors during arithmetic operations are a fact of life. At a coarse level, the right model to have in your head is that real numbers on a computer are stored in scientific notation with only 16 decimal digits after the decimal point. (This is a simplification in multiple ways. First, computers store numbers in binary, so rather than storing 16 decimal digits they store 52 binary digits, which amounts to roughly 16 decimal digits. Second, there are different floating point formats with different numbers of stored digits: the widely used IEEE double precision format has about 16 decimal digits of accuracy, while IEEE single precision has roughly 8.) When two numbers are added, subtracted, multiplied, or divided, the answer is computed and then rounded to 16 decimal digits; any extra digits of information are thrown away. Thus, the result of our arithmetic on a computer is the true answer to the arithmetic problem plus a small rounding error. These rounding errors are small individually, but solving even a modestly sized linear algebra problem requires thousands of such operations. Making sure many small errors don’t pile up into a big error is part of the subtle art of numerical computation.
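
A few lines of MATLAB make these rounding errors visible (a quick illustration; nothing here depends on anything more exotic than IEEE double precision arithmetic):

```matlab
format long
eps                  % spacing of double precision numbers near 1: about 2.2e-16
(0.1 + 0.2) - 0.3    % not exactly zero; a tiny rounding error is left over
1 + 1e-20 == 1       % true: 1e-20 lies beyond the 16th digit and is simply lost
```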

To make a gross simplification, if one solves a system of linear equations Mx = c on a computer using a well-designed piece of software, one obtains an approximate solution \hat{x} which is, after accounting for the accumulation of rounding errors, close to x. But just how close the computed solution \hat{x} is to the true solution x depends on how “nice” the matrix M is. The “niceness” of a matrix M is quantified by a quantity known as the condition number of M, which we denote \kappa(M). (In fact, there are multiple definitions of the condition number, depending on the norm one uses to measure the sizes of vectors. Since we use the 2-norm, the appropriate condition number is the ratio \kappa(M) = \sigma_{\rm max}(M)/\sigma_{\rm min}(M) of the largest and smallest singular values of M.) As a rough rule of thumb, the relative error between x and \hat{x} is bounded as

(3)   \begin{equation*} \frac{\| \hat{x} - x \|}{\|x\|} \lessapprox \kappa(M)\times 10^{-16}. \end{equation*}

The “10^{-16}” corresponds to the fact that we have roughly 16 decimal digits of accuracy in double precision floating point arithmetic. Thus, if the condition number of M is roughly 10^{10}, then we should expect around 6 digits of accuracy in our computed solution.
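
As a hedged illustration of the rule of thumb (3), the following MATLAB sketch manufactures a matrix M with condition number about 10^{10} (by choosing its singular values directly), solves a linear system with a known solution, and measures the relative error, which typically lands in the vicinity of 10^{-6}:

```matlab
n = 50;
[U, ~] = qr(randn(n));                   % random orthogonal factor
[V, ~] = qr(randn(n));                   % another random orthogonal factor
M = U * diag(logspace(0, -10, n)) * V';  % singular values from 1 down to 1e-10
x = randn(n, 1);                         % "true" solution
c = M * x;                               % right-hand side
x_hat = M \ c;                           % computed solution
cond(M)                                  % about 1e10
norm(x_hat - x) / norm(x)                % typically around 1e-6, per the rule of thumb
```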

The accuracy of the least squares problem is governed by its own condition number \kappa(A). We would hope that we can solve the least squares problem with an accuracy like the rule-of-thumb error bound (3) we had for linear systems of equations, namely a bound like \|\hat{x} - x\|/\|x\| \lessapprox \kappa(A)\times 10^{-16}. But this is not the kind of accuracy we get for the least squares problem when we solve it using the normal equations. Instead, we get accuracy like

(4)   \begin{equation*} \frac{\| \hat{x} - x \|}{\|x\|} \lessapprox \left(\kappa(A)\right)^2\times 10^{-16}. \end{equation*}

By solving the normal equations, we effectively square the condition number! Perhaps this is not surprising, as the normal equations also more-or-less square the matrix A by computing A^\top A. This squared condition number drastically affects the accuracy of the computed solution. If the condition number of A is 10^{8}, then the normal equations give us absolute nonsense for \hat{x}; we expect to get no digits of the answer x correct. Contrast this with the situation above, where we were able to get 6 correct digits in the solution to Mx = c despite the condition number of M being 100 times larger than that of A!
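
This gap is easy to observe in a small experiment. The MATLAB sketch below (errors will vary from run to run) manufactures a least squares problem whose matrix has condition number about 10^{8} and a known coefficient vector, then compares the normal equations against MATLAB's QR-based backslash, which is discussed next:

```matlab
m = 100; n = 10;
[U, ~] = qr(randn(m, n), 0);             % economy-size factor with orthonormal columns
[V, ~] = qr(randn(n));
A = U * diag(logspace(0, -8, n)) * V';   % condition number about 1e8
x = randn(n, 1);                         % "true" coefficients
b = A * x;                               % right-hand side with (essentially) zero residual

x_normal = (A' * A) \ (A' * b);          % normal equations
x_qr     = A \ b;                        % QR-based solve (discussed below)

norm(x_normal - x) / norm(x)             % typically around 1: no correct digits
norm(x_qr - x) / norm(x)                 % typically around 1e-8
```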

All of this would be just a sad fact of life for the least squares problem if the normal equations and their poor accuracy properties were the best we could do. But we can do better! One can solve linear least squares problems by computing a so-called QR factorization of the matrix A. (In MATLAB, the least squares problem can be solved with a QR factorization by calling “A\b”. One way of computing a QR factorization is Gram–Schmidt orthogonalization, but its accuracy properties are poor too; the gold-standard way is by means of Householder reflectors, which has excellent accuracy properties.) Without going into details, the upshot is that solving the least squares problem with a well-designed QR factorization requires a similar amount of time to solving the normal equations and has dramatically improved accuracy, achieving the desirable rule-of-thumb behavior

(5)   \begin{equation*} \frac{\| \hat{x} - x \|}{\|x\|} \lessapprox \kappa(A)\times 10^{-16}. \end{equation*}

(More precisely, the rule of thumb for the QR approach is \|\hat{x} - x\|/\|x\| \lessapprox \kappa(A)\times 10^{-16} \times (1+ \kappa(A)\| b - Ax \|/(\|A\|\|b\|)). So even if we solve the least squares problem with a QR factorization, a squared condition number still appears in the error bound, but it is multiplied by the residual \|b - Ax\|, which is small if the least squares fit is good. The least squares solution is usually only interesting when the residual is small, which justifies dropping this term from the rule of thumb (5).)

I have not described how the QR factorization is accurately computed, nor how to use the QR factorization to solve least squares problems, nor even what the QR factorization is. All of these topics are explained excellently by the standard textbooks in this area, as well as by publicly available resources like Wikipedia. There’s much more that can be said about the many benefits of solving the least squares problem with the QR factorization. (For example, it can work for sparse matrices while the normal equations often do not, it has superior accuracy to Gaussian elimination with partial pivoting even for solving linear systems, and the “Q” matrix in the QR factorization can be represented implicitly as a product of easy-to-compute-with Householder reflectors, which is much more efficient when m \gg n.) But in the interest of brevity, let me just say this: TL;DR, when presented in the wild with a least squares problem, the solution method one should default to is one based on a well-implemented QR factorization, not solving the normal equations.
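
For the curious, here is what the QR route looks like in MATLAB, as a minimal sketch; in practice one can simply write A\b and let a well-implemented library routine do the work.

```matlab
% Solve the least squares problem via an economy-size QR factorization
% A = Q*R, where Q has orthonormal columns and R is upper triangular; then
% minimizing ||b - A*x|| reduces to the triangular system R*x = Q'*b.
[Q, R] = qr(A, 0);
x = R \ (Q' * b);

% For a rectangular A, MATLAB's x = A \ b solves the least squares problem
% with a QR-based method for you.
```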

Suppose for whatever reason we don’t have a high-quality QR factorization algorithm at our disposal. Must we then resort to the normal equations? Even in this case, there is a way to reduce the least squares problem to a linear system of equations without squaring the condition number! (For those interested: to do this, we recognize the normal equations as a Schur complement of a somewhat larger system of linear equations and then solve that larger system. See Eq. (7) in this post for more discussion of this approach.)
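
As a hedged sketch of what such a reformulation can look like in MATLAB: the code below uses the standard scaled augmented (“saddle point”) system, which may differ in details from Eq. (7) of the linked post, and the scale parameter alpha is chosen here for illustration only.

```matlab
% The least squares solution x and the scaled residual s = (b - A*x)/alpha
% together satisfy the larger ("augmented", or saddle point) system
%   [ alpha*I   A ] [s]   [b]
%   [ A'        0 ] [x] = [0].
% Eliminating s recovers the normal equations A'*A x = A'*b as a Schur
% complement, but for a well-chosen alpha (roughly on the order of the
% smallest singular value of A) the augmented matrix has condition number
% on the order of kappa(A) rather than kappa(A)^2.
[m, n] = size(A);
alpha = min(svd(A)) / sqrt(2);          % for illustration only; in practice, use a cheap estimate
K   = [alpha*eye(m), A; A', zeros(n)];  % (m+n)-by-(m+n) symmetric indefinite matrix
rhs = [b; zeros(n, 1)];
z   = K \ rhs;                          % any general-purpose linear solver
x   = z(m+1:end);                       % least squares solution
r   = alpha * z(1:m);                   % residual b - A*x
```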

The title of this post, “Don’t Solve the Normal Equations”, is deliberately overstated. There are times when solving the normal equations is appropriate. If A is well-conditioned, with a small condition number, squaring the condition number might not be that bad. If the matrix A is too large to store in memory, one might want to solve the least squares problem using the normal equations and the conjugate gradient method.

However, the dramatically reduced accuracy of solving the normal equations should disqualify the approach from being the de facto way of solving least squares problems. Unless you have good reason to think otherwise, when you see A^\top A, solve a different way.
