Principal Component Analysis (PCA) belongs to the family of techniques known as unsupervised learning, which seeks to extract information from the predictors alone, without a response variable, for example, to identify subgroups. The PCA method ‘compresses’ the information provided by multiple variables into a few components.
“It is a statistical method that simplifies the complexity of sample spaces with many dimensions, showing their information in a less complex way with fewer components than dimensions.”
PCA identifies patterns based on the correlation between features. It finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with the same number of dimensions as the original or fewer. The orthogonal axes of the new subspace (the principal components) can be interpreted as the directions of maximum variance, given the constraint that the new feature axes be orthogonal to each other. In Figure 1, X1 and X2 are the original feature axes, and PC1 and PC2 are the principal components.
Figure 1. Taken from Raschka et al. 2019
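To make the idea of orthogonal directions of maximum variance concrete, here is a minimal sketch in Python with NumPy. It is not taken from the cited book: the function name pca, the toy data and the seed are assumptions chosen for illustration. It computes the principal components as the eigenvectors of the covariance matrix, which is one common way to obtain them.

```python
import numpy as np

def pca(X, n_components):
    """Illustrative helper (hypothetical name): project X (samples x features)
    onto its top principal components."""
    # Center each feature so variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix between features (features x features).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the direction with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    # Share of the total variance captured by each kept component.
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    scores = X_centered @ components
    return scores, components, explained_ratio

# Toy data (assumed for illustration): two strongly correlated features,
# playing the role of X1 and X2 in Figure 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

scores, components, explained_ratio = pca(X, n_components=2)
print("Principal directions (columns):")
print(components)
print("Share of variance per component:", explained_ratio)
```

Because the two toy features are strongly correlated, the first component should capture most of the variance, which is why keeping only PC1 would lose little information in a case like this.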
A very interesting application of PCA is the study ‘Genes mirror geography within Europe’, published in the journal Nature, whose authors collected genetic information from 3000 European individuals.
Despite the low mean levels of genetic differentiation among Europeans, they found a close correspondence between genetic and geographic distances. Most interestingly, when PCA is applied, a geographical map of Europe emerges as a two-dimensional summary of all the genetic variables. The overlap between the two maps is surprisingly precise (Figure 2).
Figure 2. More of the article here
Zoom in closer and the map even reveals distinct genetic clusters within Switzerland based on the language people speak. Even so, the clusters overlap, and in general, the data reveal a genetic continuum among Europeans, where the borders of the genetic map are fuzzier than those of its geographical counterpart. As far as genes are concerned, the closer together two people live, the more similar their DNA is.
This is a very clear example of a PCA application. In data science, PCA is a very useful technique for compressing data while keeping the most relevant information. It can also improve predictive performance by mitigating the curse of dimensionality.
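The compression idea can be sketched in a few lines of Python with NumPy. The data below is synthetic and purely an assumption for illustration: ten observed features generated from two hidden factors plus a little noise. The sketch keeps only k = 2 components, reconstructs the original features from them, and reports how much variance was retained.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumed for illustration): 300 samples, 10 features
# driven by 2 hidden factors plus a small amount of noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Center the data, then take the SVD; the rows of Vt are the principal directions.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                        # number of components to keep
Z = Xc @ Vt[:k].T            # compressed representation: 300 x 2 instead of 300 x 10
X_hat = Z @ Vt[:k] + mean    # approximate reconstruction of the original features

# Fraction of total variance kept, and relative reconstruction error.
retained = (S[:k] ** 2).sum() / (S ** 2).sum()
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(Xc)

print("Compressed shape:", Z.shape)
print("Variance retained:", retained)
print("Relative reconstruction error:", rel_error)
```

In practice the number of components is often chosen by looking at the retained variance; on real data the drop-off is rarely as clean as in this constructed example.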
Written by María Coria