Linear regression explained
By Marcel van der Veer
September 2022
Published in category Mathematics, Statistics
"In mathematics you don't understand things. You just get used to them." (John von Neumann)

Did you, like me, have to learn linear regression the hard way during your education, by means of tedious calculus? This might explain why many people tend towards a just-compute-it attitude using R or some other statistical package. However, you may have encountered the linear algebra approach to linear regression, which in my humble opinion is more intuitive. There are formal posts on this on the web; this post is my two cents.
Math gurus may rightfully point out that different approaches yielding the same result can still be conceptually different. But as much as I love an esoteric discussion, I am a pragmatic kind of guy.
Suppose we have a regressor matrix X. In ANOVA, this could be your design matrix. We also have a set of response values organized as a vector y. We want to compute a linear model y = X β + ε.
We aspire to solve X β = y, an exact model with ε = 0. Now consider that X and y generally consist of experimental data. Then most likely y lies outside the range R of X, the plane of all exact predictions of the form X β. So when y is outside R, no exact relation between X and y exists: the system X β = y has no solution, and there is no inverse X^{-1} we could apply. We must do something smart.
A common way out is this: a matrix multiplied by its transpose (its flipped-over-the-diagonal form) is square, symmetric and, when the columns of X are independent, invertible. So we multiply both sides of the equation by the transpose of X, and solve X^{T} X β = X^{T} y to find β = (X^{T} X)^{-1} X^{T} y and ε = y - X β. You will recognize this result, since all derivations in the literature, however intricate, end up here. But this still is not intuitive: why is this a least-squares fit?
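These formulas are easy to check numerically. Here is a minimal sketch in NumPy, with made-up data, comparing the normal-equations solution above against NumPy's own least-squares solver:

```python
import numpy as np

# Made-up data for illustration: five observations, an intercept
# column plus one regressor.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# beta = (X^T X)^{-1} X^T y, straight from the normal equations.
beta = np.linalg.inv(X.T @ X) @ X.T @ y
eps = y - X @ beta

# Cross-check against NumPy's dedicated least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True
```

In practice one would use `np.linalg.lstsq` (or solve the normal equations with `np.linalg.solve` rather than forming the inverse explicitly), but the point here is that both routes land on the same β.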
To understand that, we will slightly change our perspective. To keep the argument easy to envisage, we consider the case where y is a single-column vector with N elements and X is an N×p matrix with fewer columns than rows, so the columns of X span only a p-dimensional subspace of N-dimensional space.
Imagine y not as a vector, an array of numbers, but as a single point in N-dimensional space. Can you see y floating just above R? The closest point y' on R is the orthogonal projection of y onto R; if you shone a light perpendicularly onto R, y' would be the shadow of y.
The residual term ε = y - y' is the shortest vector connecting any point on R to y, so we obviously have a least-squares fit, since the sum of squared residuals, that is, the squared length of ε, has the lowest possible value.
We solve X β = y' instead of X β = y, using fundamental properties of X. Math says that ε, being perpendicular to R, is in the null space of the transpose of X, meaning X^{T} ε = 0. Substitution gives X^{T} (y - X β) = 0, leading again to β = (X^{T} X)^{-1} X^{T} y and ε = y - X β.
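Both properties claimed above, that ε is orthogonal to the range of X and that the fit minimizes the sum of squared residuals, can be verified numerically. A sketch with random made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tall made-up regressor matrix: y will generally lie outside its range.
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)

beta = np.linalg.inv(X.T @ X) @ X.T @ y
eps = y - X @ beta

# The residual is orthogonal to the range of X: X^T eps = 0.
print(np.allclose(X.T @ eps, 0))  # True

# Minimality: any perturbation of beta yields a larger sum of squares,
# because ||y - X(beta + d)||^2 = ||eps||^2 + ||X d||^2 >= ||eps||^2.
worse = all(
    np.sum((y - X @ (beta + d)) ** 2) >= np.sum(eps ** 2)
    for d in rng.normal(size=(100, 2))
)
print(worse)  # True
```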
We can trivially solve for y' using the above expressions: y' = X (X^{T} X)^{-1} X^{T} y. This may look familiar to you, since in statistics the matrix P = X (X^{T} X)^{-1} X^{T} is the projection matrix (now you know why) or influence matrix.
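The geometric picture predicts two properties of P: it is symmetric, and projecting a shadow again changes nothing, so P is idempotent (P² = P). A quick numerical check, again with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))   # made-up tall regressor matrix
y = rng.normal(size=6)

# Projection matrix P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P, P.T))    # symmetric: True
print(np.allclose(P @ P, P))  # idempotent: True

# y' = P y lies on R, so projecting it again leaves it unchanged.
y_fit = P @ y
print(np.allclose(P @ y_fit, y_fit))  # True
```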
In a few intuitive steps, we have arrived at the same solution as derived from tedious calculus. All we did was replace a point off a plane by its nearest projection on it, and apply basic linear algebra.