Principal Component Analysis (PCA)

What is Principal Component Analysis?

Principal component analysis (PCA) is a statistical technique for dimensionality reduction that transforms a set of possibly correlated features into a smaller set of linearly uncorrelated features called principal components. In this way, PCA preserves as much of the variance in the dataset as possible while reducing the number of features. (It is related to, but distinct from, factor analysis.)

 

Why is dimensionality reduction important in data analysis?

Datasets with many variables require significant computational power and memory. High-dimensional data can also cause machine learning models to overfit, meaning they memorize the training data rather than learning general patterns. Reducing the dimensionality simplifies the dataset, decreases computation time, and filters out noise, leading to more robust models.

 

How does PCA work in practice?

PCA first centres (and typically standardizes) the data, then calculates the covariance matrix to quantify how the variables relate to one another. From this matrix, it computes the eigenvectors and eigenvalues: the eigenvectors define the directions of the new feature space, while the eigenvalues give the amount of variance each direction explains. The original data points are then projected onto the top-ranked directions.
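The steps above can be sketched with NumPy alone. The dataset here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples of 3 features, with feature 1 correlated to feature 0.
X = rng.normal(size=(200, 3))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)

# 1. Centre the data so the covariance matrix is meaningful.
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the features (3 x 3).
cov = np.cov(X_centred, rowvar=False)

# 3. Eigen-decomposition; eigh is the right choice for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort directions by explained variance, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the data onto the top 2 principal directions.
X_reduced = X_centred @ eigenvectors[:, :2]
print(X_reduced.shape)  # (200, 2)
```

Note that the variance of each projected column equals the corresponding eigenvalue, which is exactly the "magnitude of variance" described above.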

 

What exactly is a "principal component"?

A principal component is a new variable constructed as a linear combination of the original variables in the dataset. The first principal component is the direction that captures the maximum variance in the data. Each subsequent component captures the maximum remaining variance, under the constraint that it is orthogonal to, and therefore uncorrelated with, all preceding components.
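Both properties, orthogonal directions and uncorrelated components, can be verified directly with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)  # add correlation

pca = PCA(n_components=4)
scores = pca.fit_transform(X)

# The component directions are orthonormal: W @ W.T is the identity matrix.
W = pca.components_
print(np.allclose(W @ W.T, np.eye(4)))  # True

# The projected scores are uncorrelated: off-diagonal covariances vanish.
cov = np.cov(scores, rowvar=False)
print(np.allclose(cov - np.diag(np.diag(cov)), 0, atol=1e-10))  # True

# Each component explains no more variance than the one before it.
print(np.all(np.diff(pca.explained_variance_) <= 1e-12))  # True
```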

 

What are the main limitations of using PCA?

PCA assumes that the relationships between variables are linear; it cannot capture non-linear structure. It is also sensitive to outliers, which can heavily skew the covariance matrix. Additionally, because principal components are combinations of the original features, they lose direct interpretability: it is difficult to map a component back to a specific, real-world measurement.
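The outlier sensitivity is easy to demonstrate. In this synthetic sketch, a single extreme point rotates the first principal component away from the true trend of the data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.2, size=100)  # strong linear trend

# First principal direction of the clean data.
pc1_clean = PCA(n_components=1).fit(X).components_[0]

# Add one extreme outlier perpendicular to the trend and refit.
X_out = np.vstack([X, [[-15.0, 15.0]]])
pc1_outlier = PCA(n_components=1).fit(X_out).components_[0]

# Absolute cosine similarity between the two first-PC directions;
# a value well below 1 means the single outlier rotated the component.
cos = abs(pc1_clean @ pc1_outlier)
print(round(cos, 3))
```

Robust variants (for example, PCA on rank-based or winsorized data) exist precisely because of this behaviour.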

 

How is PCA applied in a machine learning workflow, and what tools are used?

A common application of PCA is facial recognition in computer vision. An image dataset contains thousands of pixels, where each pixel acts as an individual feature.

Feeding thousands of pixels directly into a classifier algorithm is computationally expensive. PCA is applied to reduce these thousands of pixel-features into a few dozen principal components that retain the most critical visual variance (such as contrasting edges and structural boundaries). The machine learning classifier is then trained on these few principal components instead of the raw pixels.
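A minimal sketch of this workflow, using scikit-learn's built-in digits dataset (8x8 grayscale images, so 64 pixel features) as a stand-in for a face dataset; the component count and classifier choice here are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each 8x8 image is flattened into 64 pixel features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale, reduce 64 pixel features to 20 principal components, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

print(f"test accuracy: {model.score(X_test, y_test):.3f}")
print("variance retained:",
      round(model.named_steps["pca"].explained_variance_ratio_.sum(), 3))
```

The classifier trains on 20 components instead of 64 raw pixels, yet most of the visual variance survives the reduction.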

Programming context:

  • In Python, PCA is typically applied with the scikit-learn machine learning library, specifically through the sklearn.decomposition.PCA class.
  • In R, it is implemented using base statistical functions such as prcomp() or through packages such as FactoMineR.