
"In mathematics you don't understand things. You just get used to them." (John von Neumann)
Did you, like me, have to learn linear regression the hard way during your education, by means of tedious calculus? This might explain why many people adopt a just-compute-it attitude, using R or any other statistical package. You may, however, have encountered the linear algebra approach to linear regression, which in my humble opinion is more intuitive. There are formal posts on this on the web; this post is my two cents.
Math gurus may rightfully point out that different approaches yielding the same result can still be conceptually different. But as much as I love an esoteric discussion, I am a pragmatic kind of guy.
Suppose we have a regressor matrix X; in ANOVA, this could be your design matrix. We also have a set of response values organized as a vector y. We want to compute a linear model y = X β + ε.
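As a concrete sketch (the data here are invented for illustration): for a simple straight-line fit, X holds a column of ones for the intercept next to a column of x-values.

```python
import numpy as np

# Hypothetical data for a straight-line fit y ~ b0 + b1 * x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Regressor (design) matrix: intercept column of ones, then the x-values.
X = np.column_stack([np.ones_like(x), x])
print(X.shape)  # (4, 2): N = 4 observations, 2 regressors
```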
We aspire to solve X β = y, an exact model with ε = 0. But X and y generally consist of experimental data, so most likely y lies outside the range R of X, the plane of all exact predictions of the form X β. When y is outside R, no β satisfies the equation exactly, and since X is generally not even square, an inverse X^{-1} does not exist. We must do something smart.
A common way out is this: a matrix multiplied by its transpose (its flipped-over-the-diagonal form) is square, symmetric and, provided the matrix has independent columns, invertible. So we multiply both sides of the equation by the transpose of X, and solve X^{T} X β = X^{T} y to find β = (X^{T} X)^{-1} X^{T} y and ε = y - X β. You will recognize this result, since all derivations in the literature, however intricate, end up here. But this still is not intuitive: why is this a least-squares fit?
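A minimal numerical sketch of this result, on invented toy data; note that in practice one would call a dedicated least-squares solver such as `np.linalg.lstsq` rather than forming the inverse explicitly, which here also serves as a cross-check:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
X = np.column_stack([np.ones_like(x), x])  # intercept + slope columns

# Normal equations: beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
eps = y - X @ beta  # residual vector

# The same answer from NumPy's least-squares solver:
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True
```

The agreement with the solver is what the derivation above promises: multiplying through by X^{T} turns an unsolvable system into the least-squares solution.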
To understand that, we will slightly change our perspective. To keep the argument easy to envisage, we consider the case where y is a single-column vector with N elements and X is an N×p matrix with fewer independent columns than rows, p < N.
Imagine y not as a vector, an array of numbers, but as a single point in N-dimensional space. Can you see y floating just above R? The closest point y' on R is the orthogonal projection of y onto R; if you were to shine a light perpendicularly onto R, then y' would be the shadow of y.
The residual term ε = y - y' is the shortest vector connecting any point on R to y, so we obviously have a least-squares fit: the sum of squared residuals, that is, the squared length of ε, has the lowest possible value.
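The Pythagorean step implicit here can be made explicit. For any candidate β, split y - X β into the fixed part y - y' (perpendicular to R) and the part y' - X β (lying inside R); since the two parts are orthogonal,

```latex
\|y - X\beta\|^2 \;=\; \|y - y'\|^2 + \|y' - X\beta\|^2 \;\ge\; \|y - y'\|^2 ,
```

with equality exactly when X β = y'. So no choice of β can beat the projection.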
We solve X β = y' instead of X β = y, using a fundamental property of X. We can bypass the unknown y' because ε, being perpendicular to R, lies in the null space of the transpose of X. This means that X^{T} ε = 0. Substitution gives X^{T} (y - X β) = 0, leading again to β = (X^{T} X)^{-1} X^{T} y and ε = y - X β.
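Numerically, the orthogonality X^{T} ε = 0 is easy to verify on the same kind of invented toy data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
X = np.column_stack([np.ones_like(x), x])

beta = np.linalg.inv(X.T @ X) @ X.T @ y
eps = y - X @ beta  # residual: y minus its shadow on R

# The residual is orthogonal to every column of X (up to rounding error):
print(np.allclose(X.T @ eps, 0.0))  # True
```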
We can now solve for the unknown y' using the above expressions: y' = X (X^{T} X)^{-1} X^{T} y. This may look familiar, since in statistics the matrix P = X (X^{T} X)^{-1} X^{T} is known as the projection matrix (now you know why), hat matrix or influence matrix.
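The projection matrix can be checked directly: an orthogonal projection should be idempotent (projecting twice changes nothing) and symmetric. A sketch on the same invented data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
X = np.column_stack([np.ones_like(x), x])

P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection ("hat") matrix

y_hat = P @ y                    # the shadow y' of y on the range of X
print(np.allclose(P @ P, P))     # idempotent: True
print(np.allclose(P, P.T))       # symmetric: True
```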
In a few intuitive steps, we have arrived at the same solution as derived from tedious calculus. All we did was replace a point off a plane by its nearest projection onto it, and apply basic linear algebra.