Recently I wrote a post on coding chemometrics algorithms in the pre-Python era, with vintage Fortran as the language of choice. I promised in that post a Partial Least Squares Regression algorithm in a pre-Python language. Instead of Fortran, however, I present that algorithm in Algol 68. Its predecessor Algol 60 was mostly used for research by computer scientists in the United States and in Europe. When it came to serious number crunching, languages in the Algol family could not compete with Fortran for various practical reasons.
In my previous post on the subject we programmed a basic Ordinary Least Squares (OLS) algorithm to solve the linear equation A β = y for β, our "model" in machine-learning patois, when an exact solution does not exist. In chemometrics, measurement errors typically imply the absence of an exact solution. If A is a dataset in the form of an m×n matrix, representing m samples with n observations per sample, and y holds the required m outcomes, one per sample, then β is a vector of calibration factors.
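For reference, OLS minimizes the residual ‖A β − y‖²; the solution follows from the normal equations and reads β = (AᵀA)⁻¹ Aᵀ y, provided AᵀA is invertible.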
Solving for β is the "training" of our machine learning "model". In a recent essay I addressed the mystifying effect of the anthropomorphic language used in Artificial Intelligence. Others may associate chemometrics with machine learning; personally, I associate this wonderful field with multivariate statistics and linear algebra. But that is just me.
Since datasets in chemistry usually contain correlated data, a better approach is to base the regression on dominant eigenvectors of the covariance matrix of your dataset. Long eigenvectors indicate directions of large variation in a dataset, while short eigenvectors indicate directions of strong correlation. One then minimizes the prediction error as a function of the number of eigenvectors taken into account. This process is called dimension reduction, for obvious reasons. A well-known method is Principal Component Regression (PCR).
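In equation form: with V holding the k dominant eigenvectors of AᵀA as columns, PCR computes scores T = A V on the mean-centered data and regresses y on T by ordinary least squares, so that β = V (TᵀT)⁻¹ Tᵀ y.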
In chemical datasets it is likely that the material properties used for prediction are correlated, at a molecular level, to the property to be predicted. For instance, if you use spectroscopic measurements to predict material properties like viscosity, you know that both are influenced by the behaviour of the same molecules. This is a caveat for PCR, since in that case the selection of eigenvectors is not optimal. A better approach is Partial Least Squares (PLS) regression, which strives to account for that correlation. PLS found its way into chemometrics in the 1980s, and has become a workhorse for reasons you now understand.
Let us have a look at the PLS algorithm. PLS can be trained to predict multiple properties from one dataset, but it is generally recommended to predict only one property and train a model per required property. This is called the PLS1 algorithm. Needless to say, this PLS1 code can be executed with Algol 68 Genie. First we read from file a training set including the known constituent values and the feature values.
PRAGMAT need gsl PRAGMAT

# Read a training set with (single-column) constituent values.
  Training set entries are in a CSV file formatted per record as:
  sample name, constituent value, feature values #

INT samples = read int, features = read int;
VOID (open (standin, "training-set.csv", standin channel));
make term (standin, ",");
[samples, features] REAL training set, [samples, 1] REAL constituents;
FOR i TO samples
DO read ((LOC STRING, space, constituents[i, 1], space));
   FOR j TO features
   DO read (training set[i, j]);
      IF j < features THEN read (space) FI
   OD;
   IF i < samples THEN read (newline) FI
OD;
Once we have read the data, we define required data types and mean-center the data.
# Centering columns is required for PCR and PLS, not for OLS. #
MODE MATRIX = FLEX [1 : 0, 1 : 0] REAL,
     VECTOR = FLEX [1 : 0] REAL,
     COLVEC = MATRIX; # Column vector #
MATRIX col mean set = MEAN training set,
COLVEC col mean consts = MEAN constituents;
MATRIX mean set = training set - col mean set,
COLVEC mean consts = constituents - col mean consts;
Now we present a naive implementation of PLS1. We start by computing the eigenvectors, in order of descending length.
COMMENT Partial Least Squares Regression.
        NIPALS PLS1 following Ulf Indahl. COMMENT
MATRIX e := mean set, new set, new consts, eigens,
COLVEC f := mean consts;
TO samples # For this demo we compute all latent variables #
DO # E weight (eigen component) Eᵀf. The norm is an eigenvalue #
   IF REAL norm = NORM (T e * f);
      norm > small real
   THEN COLVEC eigen = (T e * f) / norm;
        # Compute latent variable t from E factor score #
        COLVEC t := (e * eigen) / NORM (e * eigen);
        # Use factor loadings p, q to partial out t from e, f (deflation) #
        COLVEC p := T e * t, q := T t * f;
        e -:= t * T p;
        f -:= t * T q;
        # Build matrices #
        eigens := eigens BEFORE eigen;
        new set := new set BEFORE p; # P #
        new consts := new consts ABOVE q # Qᵀ #
   FI
OD;
Next we compute the vector β. Here we predict the constituent values of the training set to check the accuracy of our PLS model. The inner product of the β vector with a new, mean-centered set of measurements for a single sample gives the predicted value.
MATRIX projected set = T new set * eigens;
COLVEC beta = eigens * INV projected set * new consts; # Compare to PCR #
MATRIX predictions := mean set * beta + col mean consts;
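As a sketch of how the model would be applied to an unseen sample: for a feature row x (a hypothetical new measurement, not part of the code above), the prediction is ŷ = (x − x̄) · β + ȳ, where x̄ holds the column means of the training set and ȳ the mean constituent value; this mirrors the mean set * beta + col mean consts expression above.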
You will want to do something with the results, like printing them:
print matrix (beta, 0); print matrix (predictions, 0)
As a last step, I present a better thought-out algorithm for PLS1. The multiplication of the inverted projected set by new consts actually solves a linear system of equations. On paper, the two approaches are equivalent. In practice, given the finite precision of computers, solving the linear system is numerically more sensible than explicitly inverting a matrix. Below is a common approach for solving linear equations, using singular value decomposition (SVD).
# Solve the linear equation for beta in eigen-space, which may be more accurate than INV #
OP SOLVE = (MATRIX a, COLVEC b) COLVEC:
   BEGIN MATRIX u, v, VECTOR s;
         svd decomp (a, u, s, v);
         FOR k FROM 2 TO UPB s # Ignore short eigenvectors #
         DO IF s[k] / s[1] < 1e-15 # cf. Python numpy package #
            THEN s[k] := 0
            FI
         OD;
         CV svd solve (u, s, v, b[:, 1])
   END;
PRIO SOLVE = 9;
COLVEC beta := eigens * projected set SOLVE new consts; # Compare to PCR #
MATRIX predictions := mean set * beta + col mean consts
That concludes our demonstration. You may have noticed that the SVD, PCR and PLS routines actually are in the standard environment of Algol 68 Genie. The source code of those runtime routines is based on the GNU Scientific Library, which implements BLAS. That code is in C, so now that I have presented Partial Least Squares in not one but two pre-Python languages, I consider my promise fulfilled.
© 2002-2024 J.M. van der Veer (jmvdveer@xs4all.nl)