{"In mathematics you don't understand
  things. You just get used to them."
  John von Neumann.}

Did you, like me, have to learn linear regression the hard way during your education, by means of tedious calculus? This might explain why many persons tend to a just-compute-it attitude using R or any other statistical package. However, you may have encountered the linear algebra approach to linear regression, which in my humble opinion is more intuitive. There are formal posts on this on the web; this post is my two cents. Math gurus may rightfully point out that different approaches yielding a same result can still be conceptually different. But as much as I love an esoteric discussion, I am a pragmatic kind of guy.

Suppose we have regressor matrix X. Think of X of storing observations per sample in the columns, while we use one row per sample. In ANOVA, this could be your design matrix. We would also have a set of response values organized as vector y. The idea is that once we know the observations of a future sample, we predict the response value of this sample.

Suppose we have reason to assume that the response is somehow linear, or can be made linear, in the observations. In that case we want to compute a linear model y = X β + ε. Then the model can be represented by (β, ε), and with it we can predict the outcome for a new sample by taking a dot product of the observations with β and adding ε.

We aspire to solve X β = y, an exact model with ε = 0. This means we naively would want to compute β = X-1 y. Now consider that X and y generally consist of experimental data. Then y likely is outside the range R of X, the plane of all exact predictions of form X β, and no exact relation exists between X and y. Also, if we have less samples than we have observations per sample, X is not square and X-1 does not exist. We must do something smart.

A common way out is this - a matrix multiplied by its flipped-over-the-diagonal transposed form is square, symmetric and invertible. So we multiply both sides of the equation with the transpose of X, and solve XT X β = XT y to find β = (XT X)-1 XT y and ε = y - X β. You will recognize this result since all derivations in literature, however intricate, end up here. But this still is not intuitive - why is this a least-square fit?

Note that we have replaced a possibly non-existent inverse X-1 by an existing pseudo-inverse (XT X)-1 XT which you may recognise as the Moore-Penrose inverse X+. Thus, β = X+ y. Practical linear regression is the same as seeking the best approximate solution to a set of linear equations that does not have an exact solution.

In order to see that the Moore-Penrose inverse gives a least-square fit, we will slightly change our perspective. To keep the argument more easy to envisage, we consider the special case where y is a single-column vector with N elements and X is a square N×N matrix. This simplified argument can be generalised.

Imagine y not as a vector, an array of numbers, but as a single point in N-dimensional space. Can you see y floating just above R? The closest point y' on R is the orthogonal projection of y on R; if you would shine a light perpendicularly on R then y' is the shadow of y.

The residual term ε = y - y' is the shortest vector connecting any point on R to y, so we obviously have a least-square fit since the sum of squared residuals, that is the square of the length of ε, has lowest possible value.

We solve X β = y' instead of X β = y, using fundamental properties of X. We can bypass unknown y' since ε, perpendicular to R, is in the null space of the transpose of X. This means that XT ε = 0. Substitution gives XT (y - X β) = 0 leading again to β = X+ y and ε = y - X β.

We can solve for unknown y' using above expressions trivially as y' = X X+ y. This may look familiar to you since in statistics, the matrix X X+ is the projection matrix (now you understand why) or influence matrix.

In a few intuitive steps, we have arrived at the same solution as derived from tedious calculus. All we did was replace a point off a plane by its nearest projection on it, and apply basic linear algebra.

Published in Chemometrics

More on Machine Learning, Mathematics, Statistics, Tutoring



© 2002-2025 J.M. van der Veer (jmvdveer@xs4all.nl)