By Marcel van der Veer
September 2022

Published in Mathematics

More on Artificial Intelligence, Mathematics, Statistics

{"In mathematics you don't understand
  things. You just get used to them."
  John von Neumann.}

Did you, like me, have to learn linear regression the hard way during your education, by means of tedious calculus? This might explain why many persons tend to a just-compute-it attitude using R or any other statistical package. However, you may have encountered the linear algebra approach to linear regression, which in my humble opinion is more intuitive. There are formal posts on this on the web; this post is my two cents.

Math gurus may rightfully point out that different approaches yielding a same result can still be conceptually different. But as much as I love an esoteric discussion, I am a pragmatic kind of guy.

Suppose we have regressor matrix X. In ANOVA, this could be your design matrix. We would also have a set of response values organized as vector y. We want to compute a linear model y = X β + ε.

We aspire to solve X β = y, an exact model with ε = 0. Now consider that X and y generally consist of experimental data. Then most likely y is outside the range R of X, the plane of all exact predictions of form X β. When y is outside R no exact relation exists between X and y. This means the inverse X-1 does not exist and we cannot compute β = X-1 y. We must do something smart.

A common way out is this - a matrix multiplied by its flipped-over-the-diagonal transposed form is square, symmetric and invertible. So we multiply both sides of the equation with the transpose of X, and solve XT X β = XT y to find β = (XT X)-1 XT y and ε = y - X β.

You will recognize above result since all derivations in literature, however intricate, end up here. But this still is not intuitive - why is this a least-square fit?

Note that we have replaced non-existent inverse X-1 by an existing pseudo-inverse (XT X)-1 XT which is the Moore-Penrose inverse X+. Thus, β = X+ y. Practical linear regression is the same as seeking the best approximate solution to a set of linear equations that does not have an exact solution.

In order to see that the Moore-Penrose inverse gives a least-square fit, we will slightly change our perspective. To keep the argument more easy to envisage, we consider the special case where y is a single-column vector with N elements and X is a square N×N matrix.

Imagine y not as a vector, an array of numbers, but as a single point in N-dimensional space. Can you see y floating just above R? The closest point y' on R is the orthogonal projection of y on R; if you would shine a light perpendicularly on R then y' is the shadow of y.

The residual term ε = y - y' is the shortest vector connecting any point on R to y, so we obviously have a least-square fit since the sum of squared residuals, that is the square of the length of ε, has lowest possible value.

We solve X β = y' instead of X β = y, using fundamental properties of X. We can bypass unknown y' since ε, perpendicular to R, is in the null space of the transpose of X. This means that XT ε = 0. Substitution gives XT (y - X β) = 0 leading again to β = X+ y and ε = y - X β.

We can solve for unknown y' using above expressions trivially as y' = X X+ y. This may look familiar to you since in statistics, the matrix X X+ is the projection matrix (now you know why) or influence matrix.

In a few intuitive steps, we have arrived at the same solution as derived from tedious calculus. All we did was replace a point off a plane by its nearest projection on it, and apply basic linear algebra.

All blog posts