PCR (Principal Component Regression)

https://doi.org/10.1098/rsta.2015.0202 (opens in a new tab)

Principal Component Regression (PCR) is a technique that combines the features of Principal Components Analysis (PCA) and multiple linear regression. It is used primarily when there are multicollinearity issues within the predictor variables in regression analysis or when the number of predictor variables exceeds the number of observations. PCR first reduces the dimensionality of the predictor variables using PCA and then fits a linear regression model on the transformed predictors.


Configurable Parameters:
  • Targets: The response variable(s) that the model aims to predict. This could be a single variable or a multivariate target, depending on the problem at hand.
  • Number of components: Defines the number of principal components to be retained for the regression model. This parameter impacts both the performance and accuracy of the regression model, balancing between underfitting and overfitting.

Visual Example

We train the model using a spectral database with known values. This enables the machine to recognize subtle changes in the spectra related to the contents, thereby allowing it to predict the contents of spectra with unknown values.


PCR implementation for predicting wine quality

Internally, the machine learning algorithm generates a calibration curve, positioning both known and unknown samples along it and quantizing them accordingly.


PLSR predicting housing prices from multiple features

Key Concepts

  • Principal Components: Derived from the PCA step, these components are the new variables obtained from an orthogonal transformation, intended to capture the significant variance and patterns in the original variables.

    Similar to PCA, principal components in PCR serve as inputs in the subsequent regression analysis.


  • Regression Coefficients: These coefficients correspond to each principal component and indicate their contribution to the target variable.

    The coefficients are calculated during the regression phase where the dependent variable (e.g., wine quality) is predicted using the significant principal components.


  • Variance Ratio: Indicates what proportion of the variance in the original variables is retained in the selected principal components.

    This metric guides the selection of the number of components to use. It is crucial for ensuring sufficient variability is retained to maintain model accuracy while also promoting simplicity.


Data Example

Let's consider an example dataset consisting of different types of vehicles and their attributes measured for predicting fuel efficiency.

VehicleWeightEngine SizeCO2 Emissions
Car 115001.6120
Car 220002.0135
Car 318002.2150
Car 412001.4110

After applying PCA and retaining 2 PCs, the transformed data (Scores) might look like this:

VehiclePC1PC2
Car 1-2.150.30
Car 21.85-0.75
Car 30.951.20
Car 4-0.65-0.75

PCR then uses these scores to model the CO2 Emissions. Supposing the regression model gives us these coefficients:

CoefficientValue
Intercept100
PC18.0
PC21.5

The predicted CO2 emissions are estimated using the formula:

Predicted CO2=100+8.0×PC1+1.5×PC2\text{Predicted CO2} = 100 + 8.0 \times \text{PC1} + 1.5 \times \text{PC2}

Mathematical Explanation of PCR

  1. Standardize the data (same as in PCA):

xijstd=xijμjσjx_{ij}^{\text{std}} = \frac{x_{ij} - \mu_j}{\sigma_j}

  1. Apply PCA to the standardized data to obtain principal components.

  2. Select the number of components K\mathbf{K} based on the cumulative explained variance.

  3. Fit a linear regression model using the selected principal components as predictors.

    The regression equation can be represented as:

    Y=β0+β1PC1+β2PC@++βkPCk+ϵY = \beta_0 + \beta_1 \text{PC1} + \beta_2 \text{PC@} + \ldots + \beta_k \text{PCk} + \epsilon

    Where:

    • βk\beta_k are the coefficients corresponding to each principal component.
    • ϵ\epsilon is the model's error term.
  4. Use the regression model to predict and interpret results.

    The calculated regression coefficients indicate the influence of each principal component on the target variable. The intercept is the average predicted value when all principal components are zero.

PCR is especially useful when dealing with high-dimensional data, helping to prevent overfitting by reducing the number of predictors and elucidating the most impactful variables.