PLSR (Partial Least Squars Regression)

https://doi.org/10.1098/rsta.2015.0202 (opens in a new tab)

Partial Least Squares Regression (PLSR) is a method that combines features of both PCA and multiple regression. It is particularly effective when predicting a response variable from a large set of explanatory variables that are highly collinear or when the number of predictors exceeds the number of observations. Unlike PCR, which deals with dimensionality reduction followed by regression, PLSR constructs new predictors as linear combinations of the original ones while simultaneously considering the response variable.


Configurable Parameters:
  • Targets: Defines the dependent variable or variables PLSR aims to predict.
  • Number of components: Specifies the number of latent variables (components) to extract and use in the model.

Visual Example

We train the model using a spectral database with known values. This enables the machine to recognize subtle changes in the spectra related to the contents, thereby allowing it to predict the contents of spectra with unknown values.


PLSR implementation for predicting housing prices

Internally, the machine learning algorithm generates a calibration curve, positioning both known and unknown samples along it and quantizing them accordingly.


PLSR predicting housing prices from multiple features

Key Concepts

  • Latent Variables: New variables created from linear combinations of the original variables, optimized to explain covariance between predictors and the response.

    These variables serve as the foundation for the regression model in PLSR.


  • Covariance Explained: Measures the amount of covariance between the predicted values and actual values that each component explains.

    This metric helps in determining the effectiveness of each component in predicting the response variable.


Data example

Consider an example dataset with attributes of different tech products used for predicting sales performance.

ProductFeature 1Feature 2Sales
A6150200
B8220300
C3100150
D4180250

After applying PLSR with 2 components, the transformed data (Scores) and their relationship to sales might look like this:

ProductLV1LV2
A-1.20.5
B1.8-0.4
C-0.40.8
D0.5-0.6

Here the model might provide coefficients:

CoefficientValue
Intercept100
LV175
LV250

The predicted sales are calculated using:

Predicted Sales=100+75×LV1+50×LV2 \text{Predicted Sales} = 100 + 75 \times \text{LV1} + 50 \times \text{LV2}

Mathematical Explanation of PLSR

  1. Standardize the data:

xijstd=xijμjσj x*{ij}^{\text{std}} = \frac{x*{ij} - \mu_j}{\sigma_j}

  1. Perform the PLS algorithm to extract the number of components L\mathbf{L}, which are also known as latent variables:

    • Optimization occurs such that these variables explain a maximum of the covariance between the predictors and the response.
  2. Model Fitting:

    • Fit a linear model using these latent variables.

    Y=β0+β1LV1+β2LV2++βlLVl+ϵ Y = \beta_0 + \beta_1 \text{LV1} + \beta_2 \text{LV2} + \ldots + \beta_l \text{LVl} + \epsilon

    Where βl\beta_l are the coefficients and ϵ\epsilon is the model's error term.

  3. Prediction and Interpretation:

    • Predict the response using the regression model and interpret the coefficients to understand the impact of each component on the target.

PLSR is particularly valuable in situations where predictive accuracy is crucial and the relationship between variables and the outcome needs to be clearly understood and described. This method is frequently used in chemometrics, economics, and other fields involving large datasets with many collinear variables.