
{Plus ça change …}

When I was a student in the 1980s, chemometrics was a nascent discipline. To us undergraduates, it seemed an esoteric subject worked on in niche projects at analytical chemistry departments. In general, computer capacity did not allow for processing the large datasets we see today. Chemometrics could not yet show its full potential; in hindsight it was just another field where theory was well ahead of application. For serious number crunching in that era, you'd choose a theoretical subject in quantum chemistry or statistical physics. I did.
We could only dream of computing power increasing the way it did over the last forty years. Chemometrics can finally process the large datasets we once thought impossible to handle. The field leans heavily on multivariate statistics, which is considered a machine learning technique nowadays. Having started out in statistical chemical physics, I have ended up in chemometrics, and chances are I will happily stay put. Chemometrics is an excellent option for people like me who love to work on projects where chemistry, physics, mathematics and computer science meet. In recent years a new umbrella term for those projects appeared: cheminformatics. I like that term.
Chemometrics algorithms are facilitated by the many tools for machine learning application development. I program most of my algorithms myself, in Python of course, which seems a modern lingua franca. And since my daughter, a student now, explained Jupyter Notebooks to me, I feel au courant and confident. Note that "modern" is a relative notion: isn't Python a thirtysomething, a Millennial so to speak? Still, I cannot help wondering what daily practice must have been like for those pioneers in the 1980s.
Machine learning often relies on applied linear algebra, which Python developers may handle with numpy.linalg. But mature libraries for numerical linear algebra were available forty years ago as well, like BLAS, LINPACK and EISPACK. The latter two were assimilated over time into LAPACK, which is a reference implementation for numerical linear algebra. Python libraries, too, may use LAPACK under the hood.
Since LAPACK and its predecessors are programmed in Fortran, I presume much of the chemometrics work back then was written in vintage editions of that language. Here I want to show how some of those algorithms might have looked in old-school Fortran. Since this is merely an illustration, the algorithms have been simplified. The code in this post can be run with VIF, my own experimental vintage Fortran compiler, probably the only implementation that supports REAL*32 and COMPLEX*64 out of the box.
The Moore-Penrose inverse, or pseudoinverse, is often used in chemometrics as it can give you a least-squares solution to an underdetermined system of linear equations. For instance, a dataset containing results on fewer lab samples than there are observations per sample will be underdetermined. Given a matrix A, its pseudoinverse A^{+} can readily be computed from the singular value decomposition of A.
The singular value decomposition states that any m×n matrix A is a product A = U Σ V^{T}, where U and V are orthogonal, so U U^{T} = 1 and V V^{T} = 1, and Σ is a diagonal matrix holding the singular values. Then A^{+} = V Σ^{+} U^{T}, where Σ^{+} is the diagonal matrix whose elements are the reciprocals of the nonzero diagonal elements of Σ. In general, A A^{+} A = A. When A is square and invertible, A^{+} = A^{-1} and A A^{+} = 1.
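These relations are easy to check with a few lines of NumPy, a convenience the pioneers did not have; the small matrix below is made up purely for illustration:

```python
import numpy as np

# A small underdetermined system: 2 samples, 3 observations each.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Full SVD: a = u @ sigma @ vt, with s holding the min(m, n) singular values.
u, s, vt = np.linalg.svd(a, full_matrices=True)

# Build sigma^+: transpose the shape, take reciprocals of nonzero singular values.
sigma_plus = np.zeros((a.shape[1], a.shape[0]))
sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

# A^+ = V sigma^+ U^T.
a_plus = vt.T @ sigma_plus @ u.T

print(np.allclose(a @ a_plus @ a, a))          # defining property holds
print(np.allclose(a_plus, np.linalg.pinv(a)))  # agrees with NumPy's pinv
```

Both checks print True: the reconstructed pseudoinverse satisfies A A^{+} A = A and matches numpy.linalg.pinv.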
The code below computes the pseudoinverse of a matrix. The subroutine calls LAPACK's DGESDD, singular value decomposition by divide-and-conquer. Note that Python's numpy.linalg.pinv also calls DGESDD. BLAS has no routine for diagonal matrix multiplication, so that part is written out in this routine.
subroutine pinv(a, m, n, ainv, u, vt, s, tol, info)
! Compute pseudoinverse of 'A'.
! 'A' is a 'm'x'n' matrix which is overwritten.
! 'Ainv' will contain the 'n'x'm' pseudoinverse.
! 'U' will contain U, and 'VT' will contain V^T.
! 'S' will contain reciprocal singular values.
! 'Tol' is the threshold ratio relative to the largest singular value.
! 'Info' is a status indicator. Zero on exit is good.
implicit none
integer i, j, k, m, n, lwork, info, iwork, lwmax
real*8 a(m, n), ainv(n, m), u(m, m), vt(n, n), s(n), tol, work, smax
! Large arrays are shared between subroutines.
parameter (lwmax = 100000)
common /share/ work(lwmax), iwork(lwmax)
! We sum in quad precision.
real*16 sum
! Integer work size must be at least 8 * min(m, n).
if (lwmax < 8 * min(m, n)) then
  info = 1
  return
end if
! Find optimal workspace; lwork = -1 requests a query.
lwork = -1
call dgesdd('a', m, n, a, m, s, u, m, vt, n, work, lwork, iwork, info)
if (lwmax < work(1)) then
  info = 2
  return
end if
lwork = min(lwmax, int(work(1)))
! Compute SVD.
call dgesdd('a', m, n, a, m, s, u, m, vt, n, work, lwork, iwork, info)
if (info > 0) then
  info = 3
  return
end if
! Construct S^+. There are min(m, n) singular values, in descending order.
! Too small singular values cause instability, so reject.
smax = s(1)
do i = 1, min(m, n)
  if (s(i) > tol * smax) then
    s(i) = 1 / s(i)
  else
    s(i) = 0
  end if
end do
! Compute pseudoinverse V * S^+ * U^T.
! BLAS has no routines for multiplying diagonal matrices.
do i = 1, n
  do j = 1, m
    sum = 0
    do k = 1, min(m, n)
      sum = sum + vt(k, i) * s(k) * u(j, k)
    end do
    ainv(i, j) = sum
  end do
end do
info = 0
return
end
Now, I would say that even to someone who has never seen Fortran before, the code above cannot appear alien. Note that old Fortran cannot allocate memory dynamically. Therefore all arrays must be declared in the caller and passed to the routine.
Let us program a basic application of the pseudoinverse. Ordinary Least Squares (OLS) regression solves A x = y for x when an exact solution does not exist. In chemometrics, measurement error typically implies the absence of an exact solution. If a dataset in the form of an m×n matrix A represents m samples with n observations per sample, and y is the required outcome for each sample, then x is a vector of calibration factors. This is a trivial example of "training" a machine learning "model". The inner product of an unseen set of n observations with x predicts the outcome for that sample.
OLS regression can be straightforwardly programmed in Fortran as shown below. Computing the pseudoinverse is not strictly required for OLS, but remember that this is an illustration. The subroutine calls BLAS's DGEMV, a generalized matrix-vector product.
subroutine ols(a, m, n, x, y, ainv, u, vt, s, tol, info)
! Solve 'A' 'x' = 'y' by computing the pseudoinverse of 'A'.
! 'A' is a 'm'x'n' matrix which is overwritten.
! 'Ainv' will contain the 'n'x'm' pseudoinverse.
! 'U' will contain U, and 'VT' will contain V^T.
! 'S' will contain reciprocal singular values.
! 'Tol' is the threshold ratio relative to the largest singular value.
! 'Info' is a status indicator. Zero on exit is good.
implicit none
integer m, n, info
real*8 a(m, n), ainv(n, m), x(n), y(m), u(m, m), vt(n, n), s(n), tol
! Solve A x = y via x = A^+ y.
call pinv(a, m, n, ainv, u, vt, s, tol, info)
if (info > 0) return
call dgemv('n', n, m, 1.0d0, ainv, n, y, 1, 0.0d0, x, 1)
return
end
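For comparison, the same "train then predict" calibration can be sketched in a few lines of NumPy. All numbers here are invented for illustration, and x_true only exists so we can generate plausible measurements:

```python
import numpy as np

# Invented data: 5 samples, 3 observations per sample.
rng = np.random.default_rng(42)
a = rng.normal(size=(5, 3))
x_true = np.array([0.5, -1.0, 2.0])
y = a @ x_true + rng.normal(scale=0.01, size=5)  # add measurement noise

# "Train": the least-squares solution of A x = y is x = A^+ y.
x = np.linalg.pinv(a) @ y

# "Predict": inner product of an unseen observation vector with x.
unseen = np.array([1.0, 0.0, 1.0])
print(unseen @ x)  # close to unseen @ x_true = 2.5
```

With only 1% noise, the recovered calibration vector is close to x_true, so the prediction lands near 2.5.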
Since chemists often produce datasets that are rife with correlations, a better approach than OLS is to base the regression on dominant eigenvectors. Eigenvectors with large eigenvalues indicate directions of large variation in a dataset; those with small eigenvalues indicate directions dominated by correlation. The best prediction can be established by minimizing prediction error as a function of the number of eigenvectors considered. Prediction error typically first decreases, as we are initially underfitting, but after a minimum rises again when correlations start wreaking havoc on the quality of prediction. This process is called dimension reduction for obvious reasons, and building a regression this way is called Principal Component Regression (PCR).
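A minimal NumPy sketch of the truncation step, using an invented dataset whose third column nearly duplicates the first (a strong correlation, as a spectroscopist might see):

```python
import numpy as np

# Invented, correlated data: column 3 is column 1 plus tiny noise.
rng = np.random.default_rng(0)
t = rng.normal(size=(20, 2))
a = np.column_stack([t[:, 0], t[:, 1],
                     t[:, 0] + rng.normal(scale=1e-4, size=20)])
y = a @ np.array([1.0, 2.0, 1.0]) + rng.normal(scale=0.01, size=20)

def pcr(a, y, k):
    # Keep only the k dominant singular directions, then regress:
    # x = V_k S_k^{-1} U_k^T y, a truncated pseudoinverse applied to y.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return vt[:k].T @ np.diag(1.0 / s[:k]) @ u[:, :k].T @ y

# Two components suffice: the third direction is pure correlation,
# and its tiny singular value would amplify noise if included.
x2 = pcr(a, y, 2)
print(np.allclose(a @ x2, y, atol=0.1))  # fits within the noise level
```

Here pcr and the dataset are of course hypothetical; the point is only that dropping the short, correlation-dominated direction still reproduces y to within the measurement noise, while its reciprocal singular value (about 10^4) would otherwise blow noise up into the regression vector.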
You will note that OLS is PCR with all eigenvectors included. That is why OLS is not a priori a good approach in chemometrics. If you want to do even better than PCR, you may want to use Partial Least Squares (PLS). But that is the subject of a future post.
© 2002-2024 J.M. van der Veer (jmvdveer@xs4all.nl)
