Principal Components Analysis
Under Heavy Construction
This page is a short, practical introduction to Principal Component Analysis (PCA). It aims to highlight the meaning of the values returned by PCA. In addition, it addresses numerical accuracy issues with the default implementation of PCA in GRASS through the i.pca module.
Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique used extensively in Remote Sensing studies (e.g. in change detection, image enhancement tasks and more). PCA is in fact a linear transformation applied to (usually) highly correlated multidimensional (e.g. multispectral) data. The input dimensions are transformed into a new coordinate system in which the derived dimensions (called principal components) contain, in decreasing order, the greatest share of the total variance; in change detection studies, the leading components typically capture the unchanged landscape features.
PCA has two algebraic solutions:
1. Eigenvectors of Covariance
2. Singular Value Decomposition (The SVD method is used for numerical accuracy [R Documentation])
Background
The basic steps of the transformation are:
1. organizing a dataset in a matrix
2. data centering (= subtracting each dimension's mean so the dataset has mean = 0)
3. calculate
(a) the covariance matrix (non-standardised PCA)
or
(b) the correlation matrix (standardised PCA, also known as scaling)
4. calculate
(a) the eigenvectors and eigenvalues of the covariance (or correlation) matrix
or
(b) the SVD(=singular value decomposition)
5. sort variances in decreasing order (decreasing eigenvalue)
6. project the original dataset onto the eigenvectors (PCs = eigenvectors * input data)
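The six steps above can be sketched with NumPy on a small synthetic dataset (all names and numbers below are illustrative, not part of GRASS or R):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1. organise the dataset in a matrix (rows = pixels, columns = bands);
#    the mixing matrix below just makes the three "bands" correlated
X = rng.normal(size=(1000, 3)) @ np.array([[3.0, 1.0, 0.5],
                                           [0.0, 2.0, 1.0],
                                           [0.0, 0.0, 0.5]])

# 2. centre the data: subtract each dimension's mean
Xc = X - X.mean(axis=0)

# 3a. covariance matrix (non-standardised PCA)
C = np.cov(Xc, rowvar=False)

# 4a. eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)

# 5. sort by decreasing eigenvalue (= decreasing variance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 6. project the centred dataset onto the eigenvectors
pcs = Xc @ eigvecs

# each principal component's variance equals its eigenvalue
print(np.allclose(pcs.var(axis=0, ddof=1), eigvals))  # True
```

The final check makes the point of step 5 concrete: sorting the eigenvalues sorts the components by the amount of variance they carry.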
Solutions to PCA
The "Eigenvector" solution to PCA involves
1. calculation of:
- the covariance matrix of the given multidimensional dataset (non-standardised PCA)
- the correlation matrix of the given multidimensional dataset (standardised PCA)
2. calculation of the eigenvalues and eigenvectors of the covariance (or correlation) matrix
3. transformation of the input dataset using the eigenvectors as weighting coefficients
???
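As a sanity check that the two algebraic solutions agree, the following NumPy sketch (synthetic data, illustrative names) compares the eigendecomposition of the covariance matrix with the SVD of the centred data matrix; the variances follow from the singular values as s^2 / (n - 1):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X[:, 1] += 2.0 * X[:, 0]            # correlate two of the "bands"
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# solution 1: eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# solution 2: singular value decomposition of the centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# same variances: eigenvalue = s^2 / (n - 1)
print(np.allclose(s**2 / (n - 1), eigvals))           # True
# same axes, up to an arbitrary sign per component
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))     # True
```

The sign ambiguity in the second check is expected: an eigenvector and its negation describe the same axis, and different implementations may return either.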
Terminology
Eigenvalues represent the variance of the original data contained in each principal component.
Also known as: +++
Eigenvectors act as weighting coefficients; they represent the contribution of each original dimension to the principal components.
Also known as: loadings, +++
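Both terms can be illustrated with a small NumPy example (synthetic stand-ins for image bands): the eigenvalues together carry the total variance of the data, and the eigenvector entries are the weights with which the original dimensions enter, and can be recovered from, the components.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3)) * np.array([5.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # decreasing variance
scores = Xc @ eigvecs

# eigenvalues: together they partition the total variance of the data
print(np.isclose(eigvals.sum(), Xc.var(axis=0, ddof=1).sum()))  # True
# eigenvectors: weighting coefficients mapping scores back to the bands
print(np.allclose(scores @ eigvecs.T, Xc))                      # True
```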
Implementation of PCA with GRASS
1. Manually by using the m.eigensystem module
The m.eigensystem module implements the eigenvector solution to PCA. The corresponding function in R is princomp(). A comparison of their results confirms that they perform almost identically.
- Eigenvalues: the variances reported by m.eigensystem are (almost) identical to the squared standard deviations (sdev^2) reported by princomp().
- Eigenvectors: princomp() scales the eigenvectors, and so does m.eigensystem. The scaled (= normalised) eigenvectors produced by m.eigensystem are marked with the capital letter N.
2. Using the i.pca module
The i.pca module performs PCA based on the SVD solution without data centering or scaling. A comparison of the results derived by i.pca and R's prcomp() function confirms this. Specifically, i.pca yields the same eigenvectors as prcomp() called with the following options:
prcomp( , center = FALSE, scale. = FALSE)
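To see why the centering option matters, here is a hedged NumPy sketch (synthetic data with large mean offsets, standing in for reflectance values; NumPy's SVD plays the role of the SVD inside prcomp()/i.pca):

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic "reflectance" bands: largest spread in band 1, large mean
# offsets in bands 2 and 3 (all values here are made up for illustration)
X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 1.0]) \
    + np.array([0.0, 100.0, 100.0])

# centred PCA (what princomp() / m.eigensystem do)
Xc = X - X.mean(axis=0)
_, _, Vt_centred = np.linalg.svd(Xc, full_matrices=False)

# uncentred SVD (what prcomp(..., center = FALSE) does)
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)

mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
# the leading uncentred axis is pulled towards the mean of the bands ...
print(abs(Vt_raw[0] @ mean_dir))       # close to 1
# ... while the leading centred axis follows the direction of largest spread
print(abs(Vt_centred[0] @ mean_dir))   # close to 0
```

In other words, without centering the first "component" largely encodes the overall brightness of the bands rather than the direction of greatest spread.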
Open issues with i.pca
Currently the eigenvalues reported by i.pca do not agree with the respective variances reported by m.eigensystem. By contrast, a cross-comparison of the results derived by m.eigensystem and the R functions that perform PCA yields almost identical results.
Examples
All examples presented below were performed using MODIS surface reflectance products.
Perform PCA using m.eigensystem based on the covariance matrix
Command in one line
(echo 3; r.covar b02,b06,b07) | m.eigensystem
Output (note: the scaled (= normalised) eigenvectors, marked with #, correspond to the princomp() results presented after this output)
r.covar: complete ... 100%
E   778244.0258462029      .0000000000    79.20
V       .5006581842        .0000000000
V       .8256483300        .0000000000
V       .6155834548        .0000000000
N #     .4372107421   #    .0000000000
N #     .7210155161   #    .0000000000
N #     .5375717557   #    .0000000000
W    385.6991853500        .0000000000
W    636.0664787886        .0000000000
W    474.2358050886        .0000000000
E   192494.5769628266      .0000000000    19.59
V      -.8689798010        .0000000000
V       .0996340298        .0000000000
V       .5731134848        .0000000000
N #    -.8309940700   #    .0000000000
N #     .0952787255   #    .0000000000
N #     .5480609638   #    .0000000000
W   -364.5920328433        .0000000000
W     41.8027823088        .0000000000
W    240.4573848757        .0000000000
E    11876.4548199713      .0000000000     1.21
V       .2872248982        .0000000000
V      -.5731591248        .0000000000
V       .5351449518        .0000000000
N #     .3439413070   #    .0000000000
N #    -.6863370819   #    .0000000000
N #     .6408165005   #    .0000000000
W     37.4824307850        .0000000000
W    -74.7964308085        .0000000000
W     69.8356366100        .0000000000
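Reading the output: for each component, E is the eigenvalue (followed by its percentage of the total variance), V an unnormalised eigenvector, N the normalised (unit-length) eigenvector, and W the eigenvector scaled by the component's standard deviation (W = N * sqrt(E)). These relations can be verified numerically against the first component's reported values:

```python
import numpy as np

# values copied from the m.eigensystem output for the first component
E1 = 778244.0258462029
V1 = np.array([0.5006581842, 0.8256483300, 0.6155834548])
N1 = np.array([0.4372107421, 0.7210155161, 0.5375717557])
W1 = np.array([385.6991853500, 636.0664787886, 474.2358050886])

print(np.allclose(V1 / np.linalg.norm(V1), N1))   # N = V / |V|
print(np.allclose(N1 * np.sqrt(E1), W1))          # W = N * sqrt(E)
```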
Repeat the same solution by using R's princomp() function
Command
princomp(modis)
Output
Call:
princomp(x = modis)

Standard deviations:
  Comp.1   Comp.2   Comp.3
857.5737 436.0922 108.5083

3 variables and 350596 observations.
Get the loadings (note: R's print method for loadings suppresses, by default, entries whose absolute value is below a cutoff of 0.1, which is why the entry in row 2, column 2 appears blank; use print(..., cutoff = 0) to display all entries)
(princomp(modis))$loadings

Loadings:
    Comp.1 Comp.2 Comp.3
b02 -0.418  0.839  0.348
b06 -0.725        -0.684
b07 -0.547 -0.539  0.641

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000
Comments +++
Perform PCA using i.pca
Command
i.pca input=b2,b6,b7 output=pca.b267
Output
Eigen values, (vectors), and [percent importance]:
PC1 6307563.04 ( -0.64 -0.65 -0.42 ) [ 98.71% ]
PC2   78023.63 ( -0.71  0.28  0.64 ) [  1.22% ]
PC3    4504.60 ( -0.30  0.71 -0.64 ) [  0.07% ]
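The bracketed percentages in the i.pca output are simply each eigenvalue's share of the total variance, and can be reproduced from the reported eigenvalues:

```python
import numpy as np

# eigenvalues copied from the i.pca output above
eigvals = np.array([6307563.04, 78023.63, 4504.60])
percent = 100.0 * eigvals / eigvals.sum()
print(np.round(percent, 2))   # [98.71  1.22  0.07]
```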
Repeat PCA with R's prcomp() using the same data with options set to center=FALSE, scale=FALSE
Command
prcomp(mod07, center = FALSE, scale. = FALSE)   <<== this corresponds to i.pca
Output
Standard deviations:
[1] 4288.3788  476.8904  114.3971

Rotation:
                                    PC1        PC2        PC3
MOD2007_242_500_sur_refl_b02 -0.6353238  0.7124070 -0.2980602
MOD2007_242_500_sur_refl_b06 -0.6485551 -0.2826985  0.7067234
MOD2007_242_500_sur_refl_b07 -0.4192135 -0.6423066 -0.6416403
- The eigenvector matrices match, although prcomp() reports the loadings (= eigenvectors) column-wise and i.pca row-wise.
- The eigenvalues do not match. For example, the standard deviation for PC1 reported above by prcomp() is 4288.3788, while the variance reported by i.pca is 6307563.04 (i.e. a standard deviation of sqrt(6307563.04) = 2511.486).
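If the two tools agreed, the variance reported by i.pca would equal the squared standard deviation reported by prcomp(). A quick check on the PC1 values reported above (copied verbatim from the outputs) shows that this does not hold:

```python
import numpy as np

sdev_prcomp = 4288.3788     # PC1 standard deviation reported by prcomp()
eigval_ipca = 6307563.04    # PC1 eigenvalue (variance) reported by i.pca

print(sdev_prcomp**2)                  # the variance prcomp() implies
print(np.sqrt(eigval_ipca))            # ~2511.486, the sdev i.pca implies
print(np.isclose(sdev_prcomp**2, eigval_ipca))   # False: the open issue
```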
More examples to be added
References
Posts in the GRASS-user mailing list
There are many posts concerning the functionality of i.pca. Most of them question the non-reporting of eigenvalues (an issue recently fixed).
Old posts
Recent posts
More sources to be added