Plus ça change …

When I was a student in the 1980s, chemometrics was a nascent discipline. To us undergraduates, it seemed an esoteric subject worked on in niche projects at analytical chemistry departments. In general, computer capacity did not allow for processing the large datasets we see today. Chemometrics could not yet show its full potential - in hindsight it was just another field where theory was well ahead of application. For serious number crunching in that era, you would choose a theoretical subject in quantum chemistry or statistical physics. At least, that is what I did.

We could only dream of computing power increasing the way it did over the last forty years. Chemometrics can finally process the large datasets we once thought impossible to handle. The field leans heavily on multivariate statistics, which nowadays is considered a machine learning technique. Having started out in statistical chemical physics, I have ended up in chemometrics, and chances are I will happily stay put. Chemometrics is an excellent option for people like me who love to work on projects where chemistry, physics, mathematics and computer science meet. In recent years a new umbrella term for those projects appeared - cheminformatics. I like that term.

Chemometrics algorithms are well served by the many tools for machine learning application development. I program most of my algorithms myself, in Python of course, which seems to be the modern lingua franca. And since my daughter, a student now, explained Jupyter Notebooks to me, I feel au courant and confident. Note that "modern" is a relative notion - isn't Python a thirtysomething, a Millennial so to speak? Still, I cannot help wondering what daily practice must have been like for those pioneers in the 1980s.

Machine learning often relies on applied linear algebra, which Python developers may handle with numpy.linalg. But mature libraries for numerical linear algebra were available forty years ago as well, like BLAS, LINPACK or EISPACK. The latter two were over time assimilated into LAPACK, which is a reference implementation for numerical linear algebra. Python libraries, too, may use LAPACK under the hood.

Since LAPACK and its predecessors are programmed in Fortran, I presume much of the chemometrics work back then was programmed in vintage editions of that language. Here I want to show how some of those algorithms might have looked in old-school Fortran. Since this is merely an illustration, the algorithms have been simplified. The code in this post can be run with VIF, my own experimental vintage Fortran compiler, probably the only implementation that supports REAL*32 and COMPLEX*64 out of the box.

Moore-Penrose inverse

The Moore-Penrose inverse, or pseudo-inverse, is often used in chemometrics as it can give you a least-squares solution to an under-determined system of linear equations. For instance, a dataset containing results on fewer lab samples than there are observations per sample will be under-determined. Given a matrix A, its pseudo-inverse A^+ can readily be computed from the singular value decomposition of A.

Singular value decomposition states that any m×n matrix A is a product A = U Σ V^T, where U U^T = 1 and V V^T = 1. Then A^+ = V Σ^+ U^T, where Σ^+ is a diagonal matrix whose elements are the reciprocals of the non-zero diagonal elements of Σ, the diagonal matrix of singular values. In general, A A^+ A = A. When A is square and invertible, A^+ = A^-1 and A A^+ = 1.

The code below computes the pseudo-inverse of a matrix. The subroutine calls LAPACK's DGESDD, singular value decomposition by divide-and-conquer. Note that Python's numpy.linalg.pinv also calls DGESDD. BLAS has no routine for multiplication by a diagonal matrix, so that part is written out in this routine.

      subroutine pinv(A, m, n, A inv, U, V T, S, tol, info)
! Compute pseudo-inverse of 'A'.
! 'A' is a 'm'x'n' matrix which is overwritten.
! 'A inv' will contain the 'n'x'm' pseudo-inverse.
! 'U' will contain U, and 'V T' will contain V^T.
! 'S' will contain reciprocal singular values.
! 'tol' is the threshold ratio relative to the largest singular value.
! 'info' is a status indicator. Zero on exit is good.
      implicit none
      integer i, j, k, m, n, lwork, info, lwmax
      real*8 A(m, n), A inv(n, m), U(m, m), V T(n, n), S(n), tol
! Large arrays are shared between subroutines.
      parameter (lwmax = 100000)
      common /share/ work(lwmax), iwork(lwmax)
      real*8 work, smax
      integer*4 iwork
! We sum in quad precision.
      real*16 sum
! The integer workspace must hold at least 8 * min(m, n) elements.
      if (lwmax < 8 * min(m, n)) then
         info = 1
         return 
      end if
! Find optimal workspace.
      lwork = -1
      call dgesdd('a', m, n, A, m, S, U, m, V T, n, work, lwork, iwork, info)
      if (lwmax < work(1)) then
         info = 2
         return 
      end if
      lwork = min(lwmax, int (work(1)))
! Compute SVD.
      call dgesdd('a', m, n, A, m, S, U, m, V T, n, work, lwork, iwork, info)
      if (info > 0) then
         info = 3
         return
      end if
! Construct S^+.
! Too small singular values cause instability, so reject.
      smax = S(1)
! Only min(m, n) singular values are computed.
      do i = 1, min(m, n)
        if (S(i) > tol * smax) then
          S(i) = 1 / S(i)
        else
          S(i) = 0
        end if
      end do
! Compute pseudo inverse V * S^+ * U^T
! BLAS has no routines for multiplying diagonal matrices.
      do i = 1, n
        do j = 1, m
          sum = 0
          do k = 1, min(m, n)
            sum = sum + V T(k, i) * S(k) * U(j, k)
          end do
          A inv(i, j) = dble (sum)
        end do
      end do
      info = 0
      return
      end

Now, I would say that even to someone who has never seen Fortran before, the code above cannot appear alien. Note that old Fortran cannot dynamically allocate memory. Therefore all arrays must be declared in the caller and passed to the routine.
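
To make that calling convention concrete, below is a minimal driver of my own. The program name, the 3×2 example matrix and the tolerance of 1.0d-12 are arbitrary choices for illustration, not part of any library.

      program demo
! Illustrative driver for 'pinv'; data and tolerance are arbitrary.
      implicit none
      integer m, n, info
      parameter (m = 3, n = 2)
      real*8 A(m, n), A inv(n, m), U(m, m), V T(n, n), S(n)
! A 3x2 example matrix, stored column by column.
      data A /1d0, 2d0, 3d0, 4d0, 5d0, 6d0/
! All arrays are declared here and passed to the routine.
      call pinv (A, m, n, A inv, U, V T, S, 1.0d-12, info)
      if (info .ne. 0) stop 'pinv failed'
! Print the 2x3 pseudo-inverse row by row.
      print *, A inv(1, 1), A inv(1, 2), A inv(1, 3)
      print *, A inv(2, 1), A inv(2, 2), A inv(2, 3)
      end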

Ordinary Least Squares regression

Let us program a basic application of the pseudo-inverse. Ordinary Least Squares (OLS) regression solves A x = y for x when an exact solution does not exist. In chemometrics, measurement error typically implies the absence of an exact solution. If a dataset in the form of an m×n matrix A represents m samples with n observations per sample, and y holds the required outcome for each sample, then x is a vector of calibration factors. This is a trivial example of "training" a machine learning "model". The inner product of x with an unseen set of n observations then predicts the outcome for that sample in the experiment at hand.

OLS regression can be straightforwardly programmed in Fortran as shown below. Computing the pseudo-inverse is not required for OLS, but remember that this is an illustration. The subroutine calls BLAS's DGEMV, a generalized matrix-vector product.

      subroutine ols(A, m, n, x, y, A inv, U, V T, S, tol, info)
! Solve 'A' 'x' = 'y' by computing the pseudo-inverse of 'A'.
! 'A' is a 'm'x'n' matrix which is overwritten.
! 'A inv' will contain the 'n'x'm' pseudo-inverse.
! 'U' will contain U, and 'V T' will contain V^T.
! 'S' will contain reciprocal singular values.
! 'tol' is the threshold ratio relative to the largest singular value.
! 'info' is a status indicator. Zero on exit is good.
      implicit none
      integer m, n, info
      real*8 A(m, n), A inv(n, m), x(n), y(m), U(m, m), V T(n, n), S(n), tol
! Solve A x = y.
      call pinv (A, m, n, A inv, U, V T, S, tol, info)
      if (info > 0) return
      call dgemv ('n', n, m, 1.0d0, A inv, n, y, 1, 0.0d0, x, 1)
      return
      end
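
To connect this to the "training" and "prediction" steps described above, a caller might look as follows. This is a sketch of my own: the data, the dimensions, the tolerance and the use of BLAS's DDOT for the inner product are illustrative choices, not part of the routines above.

      program calib
! Illustrative: calibrate with 'ols', predict with BLAS's ddot.
! The data, dimensions and tolerance are arbitrary choices.
      implicit none
      integer m, n, info
      parameter (m = 4, n = 2)
      real*8 A(m, n), A inv(n, m), U(m, m), V T(n, n), S(n)
      real*8 x(n), y(m), obs(n), ddot
      external ddot
! Four samples with two observations each, and their outcomes.
      data A /1d0, 2d0, 3d0, 4d0, 1d0, 1d0, 2d0, 2d0/
      data y /3d0, 4d0, 7d0, 8d0/
! An unseen sample to predict an outcome for.
      data obs /5d0, 3d0/
! "Train": compute the calibration vector x.
      call ols (A, m, n, x, y, A inv, U, V T, S, 1.0d-12, info)
      if (info .ne. 0) stop 'ols failed'
! "Predict": inner product of the unseen observations with x.
      print *, 'prediction =', ddot (n, obs, 1, x, 1)
      end

With these numbers an exact solution x = (1, 2) exists, so the program should print 11 for the unseen sample, up to rounding.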

Since chemists often produce datasets that are rife with correlations, a better approach than OLS is to base the regression on the dominant eigenvectors of the covariance matrix of your dataset. Eigenvectors belonging to large eigenvalues indicate directions of large variation in a dataset, while eigenvectors belonging to small eigenvalues indicate directions dominated by correlation. The best prediction can be established by minimizing prediction error as a function of the number of eigenvectors considered. Prediction error typically first decreases (as we are initially underfitting) but, after a minimum, rises again when correlations start wreaking havoc on the quality of prediction. This process is called dimension reduction for obvious reasons, and building a regression this way is called Principal Component Regression (PCR); a sketch follows below.
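
To sketch how PCR could be coded on top of the pseudo-inverse routine, the fragment below keeps only the k dominant singular directions. It assumes that A (and preferably y) has been mean-centered beforehand and that pinv has already been called, so that S holds reciprocal singular values in descending order; the routine name and interface are my own invention, not an established library call.

      subroutine pcr(U, V T, S, m, n, k, x, y)
! Sketch of Principal Component Regression built on 'pinv'.
! 'U', 'V T' and 'S' are the outputs of 'pinv' for a mean-centered 'A',
! so 'S' holds reciprocal singular values in descending order.
! Only the 'k' dominant components contribute to the regression vector 'x'.
! 'k' must not exceed min(m, n). Written for clarity, not for speed.
      implicit none
      integer i, j, l, m, n, k
      real*8 U(m, m), V T(n, n), S(n), x(n), y(m), p
      real*16 sum
      do i = 1, n
        sum = 0
        do j = 1, k
! Project y on the j-th left singular vector.
          p = 0
          do l = 1, m
            p = p + U(l, j) * y(l)
          end do
          sum = sum + V T(j, i) * S(j) * p
        end do
        x(i) = dble (sum)
      end do
      return
      end

The number of components k would then be chosen by minimizing prediction error on samples held out from the calibration, as described above; with k = min(m, n) and no rejected singular values, the result reduces to the OLS solution.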

You will note that OLS is PCR with all eigenvectors included. That is why OLS is not a priori a good approach in chemometrics. If you want to do even better than PCR, you may want to use PLS. But that is the subject of a future post.
