Principal Component Analysis (PCA)

Principal component analysis (PCA), also known as principal components analysis, is a technique from statistics for simplifying a data set. It was developed by Pearson (1901) and Hotelling (1933), whilst the best modern reference is Jolliffe (2002). The aim of the method is to reduce the dimensionality of multivariate data whilst preserving as much of the relevant information as possible. It is a form of unsupervised learning in that it relies entirely on the input data itself, without reference to any corresponding target data; the criterion to be maximised is the variance.

PCA is a linear transformation to a new coordinate system in which the new variables, the principal components, are linear functions of the original variables and are mutually uncorrelated. The components are ordered so that the greatest variance of any projection of the data lies along the first coordinate, the second greatest along the second coordinate, and so on. In practice, this is achieved by computing the covariance matrix of the full data set, then computing the eigenvectors and eigenvalues of that matrix and sorting them in order of decreasing eigenvalue; projecting the data onto the leading eigenvectors yields the reduced representation. Note that PCA's bias is not always appropriate: features with low variance may still have high predictive relevance, depending on the application.
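The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration of the covariance/eigendecomposition route, not a production implementation (the function name `pca` and its parameters are chosen here for illustration):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (observations in rows) onto its leading principal components."""
    # Centre the data: subtract the per-feature mean.
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features.
    cov = np.cov(Xc, rowvar=False)
    # Eigendecomposition; eigh is appropriate since cov is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort by decreasing eigenvalue (variance explained).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project the centred data onto the leading eigenvectors.
    return Xc @ eigvecs[:, :n_components], eigvals
```

Because the eigenvectors of a symmetric matrix are orthogonal, the projected variables come out uncorrelated, as the text above requires; the returned eigenvalues give the variance captured by each component.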