Principal Components Analysis: Difference between revisions
m (urls restored) |
|||
Line 694: | Line 694: | ||
[[Category: Documentation]] | [[Category: Documentation]] | ||
[[Category: Image processing]] |
Revision as of 16:04, 5 April 2012
A practical introduction in Principal Components Analysis (or Transformation) that aims to:
- highlight the importance of the values returned by PCA
- address numerical accuracy issues with respect to the default implementation of PCA in GRASS through the i.pca module.
This page is still not complete
Principal Components Analysis
Principal Components Analysis (PCA) is a dimensionality reduction technique used extensively in Remote Sensing studies (e.g. in change detection studies, image enhancement tasks and more). PCA is in fact a linear transformation applied on (usually) highly correlated multidimensional (e.g. multispectral) data. The input dimensions are transformed in a new coordinate system in which the produced dimensions (called principal components) contain, in decreasing order, the greatest variance related with unchanged landscape features.
PCA has two algebraic solutions:
- Eigenvectors of Covariance (or Correlation) of a given data matrix
- Singular Value Decomposition of a given data matrix
The SVD method is used for numerical accuracy [R Documentation]
Background
The basic steps of the transformation are:
- organizing a dataset in a matrix
- data centering (that is: subtracting the dimensions means from themself so each of the dimensions in the dataset has zero mean)
- calculate
- the covariance matrix (non-standardised PCA) or
- the correlation matrix (standartised PCA, also known as scaling)
- calculate either
- the eigenvectors and eigenvalues of the covariance (or the correlation) matrix or
- the SVD of the data matrix
- sort variances in decreasing order (decreasing eigenvalues; this is default in eigenvalue analysis)
- project original dataset signals (PC's or PC scores: eigenvector * input-data) to get
Why is data centering performed?
Data centering reduces the square mean error of approximating the input data [A.A. Miranda, Y.-A. Le Borgne, and G. Bontempi. New Routes from Minimal Approximation Error to Principal Components, Volume 27, Number 3 / June, 2008, Neural Processing Letters, Springer].
Why is data scaling performed?
Scaling normalises(=standartises) the variables to have unit variance before the analysis takes place. This normalisation prevents certain features to dominate the analysis because of their large numerical values [Duda and Hart (1973), Eklundh and Singh (1993)]. However "if the variables are measured on comparable scales, unstandartised data may be appropriate". [Maindonald, John; Braun, John: Data Analysis and Graphics Using R. 2. Aufl. 2007, Cambridge University Press]
Solutions to PCA
The Eigenvector solution to PCA involves:
- calculation of
- the covariance matrix of the given multidimensional dataset (non-standardised PCA) or
- the correlation matrix of the given multidimensional dataset (standardised PCA)
- calculation of the eigenvalues and eigenvectors of the covariance (or correlation) matrix
- transformation of the input dataset using the eigenvalues as weighting coefficients
Terminology
eigenvalues represent either the variance of the original data contained in each principal component (in case they were computed from the covariance matrix), or the amount of correlation captured by the respective principle components (in case they were computed from the correlation matrix).
eigenvectors
- act as weighting coefficients
- represent the contribution of each original dimension the principal components
- tell how the principle components relate to the the data variables (dimensions).
Suppose an eigenvector is (0, 0, 1, 0), this indicates that the corresponding principle component is equal to dimension 3 and perpendicular to dimensions 1, 2 and 4. If it is (0.7, 0.7, 0, 0) the PC is equally determined by dimension 1 and 2 (it averages them), and not determined by dimensions 3 and 4. If it is (0.7, 0, -0.7, 0) it is the difference between dimension 1 and 3. Eigenvectors are normalized, meaning that the sum of its squared elements equals 1.
loadings The individual numbers in an eigenvector are called loadings.
scores When the original data are projected onto the eigenvectors, the resulting new variables are called the principle components; the new data values are called PC scores.
Performing PCA with GRASS
- m.eigensystem
The m.eigensystem module implements the eigenvector solution to PCA. The respective function in R is princomp(). A comparison of their results confirms their almost identical performance. Specifically,
- the standard deviations (sdev) reported by princomp() are (almost) identical with the variances (eigenvalues) reported by m.eigensystem.
- princomp() scales (also referred as normalization) the eigenvectors and so does m.eigensystem. The scaled(=normalised) eigenvectors produced by m.eigensystem are marked with the capital letter N.
- i.pca
The i.pca module performs PCA based on the SVD solution with data centering but without scaling. A comparison of the results derived by i.pca and R's prcomp() function confirms this. Specifically, i.pca yields the same eigenvectors as R's prcomp() function does with the following options:
prcomp(x, center=TRUE, scale=FALSE)
where x is a numeric or complex matrix (or data frame) which provides the data for the principal components analysis (R Documentation).
Issues concerning the i.pca module
In general i.pca works as expected (tested with SPOT and Landsat5 satellite data from the spearfish location) producing slightly different eigenvectors. Nevertheless, the reported eigenvalues are very different than the ones reported by m.eigensystem or R's equivalent functions. In addition, using some MODIS surface reflectance bands, there is one case in which the reported eigenvectors, are not centered prior to the Principal Component Analysis as it should be done. Need to re-check if this is still an issue.
Overview of points for discussion
Eigenvalues vary between i.pca and the rest of modules/functions
- Data centering not performed in one case
Useful details
princomp()
If one computes principle components with R, the loadings are printed by default such that loadings close to 0 (between -.1 and .1, this can be controlled) are left out. This can be overriden (see the help page of function loadings). The reason for this is that for large loading tables, the real information is in loadings not close to 0; reading large loading tables is much easier when only important loadings are printed.
Examples using SPOT imagery
Get SPOT images from the Spearfish dataset
Eigenvectors solution
Based on the covariance matrix
Using m.eigensystem
Note: download m.eigensystem from GRASS_AddOns#m.eigensystem GRASS-Addons
Command in one line
(echo 3; r.covar spot.ms.1,spot.ms.2,spot.ms.3) | m.eigensystem
Output
r.covar: complete ... 100% E 1159.7452017844 .0000000000 88.38 V .6910021591 .0000000000 V .7205280412 .0000000000 V .4805108400 .0000000000 N .6236808478 .0000000000 N .6503301526 .0000000000 N .4336967751 .0000000000 W 21.2394712045 .0000000000 W 22.1470141296 .0000000000 W 14.7695575384 .0000000000 E 5.9705414972 .0000000000 .45 V .7119385973 .0000000000 V -.6358200627 .0000000000 V -.0703936743 .0000000000 N .7438340890 .0000000000 N -.6643053754 .0000000000 N -.0735473745 .0000000000 W 1.8175356507 .0000000000 W -1.6232096923 .0000000000 W -.1797107407 .0000000000 E 146.5031967184 .0000000000 11.16 V .2265837636 .0000000000 V .3474697082 .0000000000 V -.8468727535 .0000000000 N .2402770238 .0000000000 N .3684685345 .0000000000 N -.8980522763 .0000000000 W 2.9082771721 .0000000000 W 4.4598880523 .0000000000 W -10.8698904856 .0000000000
Note that the output is not sorted by relative importance.
The red-bold numbers are the normalised eigen vectors. Compare the above results with R's princomp() function below.
Using R's princomp() function
Launch R from within GRASS
R
Load spgrass6() and projection information in R
library(spgrass6) G <- gmeta6()
Read spot.ms bands
spot.ms <- readRAST6(c('spot.ms.1', 'spot.ms.2', 'spot.ms.3'))
Work around NA's
spot.ms.nas <- which(is.na(spot.ms@data$spot.ms.1) & is.na(spot.ms@data$spot.ms.2) & is.na(spot.ms@data$spot.ms.3)) spot.ms.values <- which(!is.na(spot.ms@data$spot.ms.1) & !is.na(spot.ms@data$spot.ms.2) & !is.na(spot.ms@data$spot.ms.3)) spot.ms.nonas <- spot.ms.values@data[spot.ms.values, ]
A better option is to use R's complete.cases() function. See ?complete.cases within R.
Command
princomp(spot.ms.nonas)
Output
Call: princomp(x = modis) Standard deviations: Comp.1 Comp.2 Comp.3 34.055018 12.103846 2.443468 3 variables and 1231860 observations.
Get loadings
princomp(spot.ms.nonas)$loadings Loadings: Comp.1 Comp.2 Comp.3 spot.ms.1 -0.624 0.240 0.744 spot.ms.2 -0.650 -0.368 -0.664 spot.ms.3 -0.434 0.898 Comp.1 Comp.2 Comp.3 SS loadings 1.000 1.000 1.000 Proportion Var 0.333 0.333 0.333 Cumulative Var 0.333 0.667 1.000
The red-bold eigen vectors of the first principal component, ease of the comparison with the results derived from GRASS' m.eigensystem module above.
Note the missing eigen value in row 3, column 3. It is omitted on purpose. You will appreciate how useful this is when you compute 20 PC's from 120 dimensions. It will be printed when you explicitly ask for showing loadings between -0.1 and 0.1, by
print(princomp(spot.ms.nonas)$loadings, cutoff=0)
Based on the correlation matrix
Using m.eigensystem
Command in one line
(echo 3; r.covar -r spot.ms.1,spot.ms.2,spot.ms.3) | m.eigensystem
Output
r.covar: complete ... 100% E 2.5897901435 .0000000000 86.33 V -.7080795162 .0000000000 V -.6979341819 .0000000000 V -.6128387525 .0000000000 N -.6062688915 .0000000000 N -.5975822957 .0000000000 N -.5247222419 .0000000000 W -.9756579133 .0000000000 W -.9616787268 .0000000000 W -.8444263177 .0000000000 E .0123265666 .0000000000 .41 V -.6690685456 .0000000000 V .6302711261 .0000000000 V .0552608155 .0000000000 N -.7265842171 .0000000000 N .6844516242 .0000000000 N .0600112449 .0000000000 W -.0806690651 .0000000000 W .0759912909 .0000000000 W .0066627528 .0000000000 E .3978832898 .0000000000 13.26 V .3005969725 .0000000000 V .3883277727 .0000000000 V -.7895613377 .0000000000 N .3232853332 .0000000000 N .4176378502 .0000000000 N -.8491555920 .0000000000 W .2039218921 .0000000000 W .2634375639 .0000000000 W -.5356302845 .0000000000
Note that the output is not sorted by relative importance. In this example the second principal component (accounts for 13.26% of the original variance) can be created by using numbers (i.e. the W lines) from the third group of eigen vectors. To compare with princomp()'s results look at column Comp.2 below.
Using R's princomp() function
Perform PCA
princomp(spot.ms.nonas, cor=TRUE)
Output
Call: princomp(x = spot.ms.nonas, cor = TRUE) Standard deviations: Comp.1 Comp.2 Comp.3 1.6092826 0.6307795 0.1110256 3 variables and 1231860 observations.
Get loadings
princomp(spot.ms.nonas, cor=TRUE)$loadings
Output
Loadings: Comp.1 Comp.2 Comp.3 spot.ms.1 -0.606 -0.323 0.727 spot.ms.2 -0.598 -0.418 -0.684 spot.ms.3 -0.525 0.849 Comp.1 Comp.2 Comp.3 SS loadings 1.000 1.000 1.000 Proportion Var 0.333 0.333 0.333 Cumulative Var 0.333 0.667 1.000
SVD solution
Using i.pca
Command
i.pca input=spot.ms.1,spot.ms.2,spot.ms.3 out=pca.spot.ms rescale=0,0
Output
Eigen values, (vectors), and [percent importance]:
PC1 1170.12 (-0.6251,-0.6536,-0.4268)[88.07%]
PC2 152.49 ( 0.2328, 0.3658,-0.9011)[11.48%]
PC3 6.01 ( 0.7450,-0.6626,-0.0765) [0.45%]
Using R's prcomp() function
The following example replicates i.pca's solution using the same data with data centering and without scaling (options center=TRUE and scale=FALSE). Note that these settings are the defaults.
Command
prcomp(spot.ms.nonas)
Output
Standard deviations: [1] 34.055032 12.103851 2.443469 Rotation: PC1 PC2 PC3 spot.ms.1 -0.6236808 0.2402770 -0.74383409 spot.ms.2 -0.6503302 0.3684685 0.66430538 spot.ms.3 -0.4336968 -0.8980523 0.07354738
In this example i.pca's performance seems to be identical to R's prcomp() function which means that data centering is applied prior to the actual PCA.
The three SPOT bands used in this example have the following ranges:
- spot.ms.1
min=24 max=254
- spot.ms.2
min=14 max=254
- spot.ms.3
min=12 max=254
Nevertheless, as shown in the examples using MODIS surface reflectance products (read below) in which the range of the bands varies significantly, i.pca's results do not match the results of R's prcomp() function with the parameter center set to TRUE. Instead, the results derived from i.pca are almost identical when the parameter center is set to FALSE.
The following example performs PCA without data centering and scaling (options center=FALSE and scale=FALSE).
Command
prcomp(spot.ms.nonas, center=FALSE, scale=FALSE)
Output
Standard deviations: [1] 101.00927 15.79861 2.83775 Rotation: PC1 PC2 PC3 spot.ms.1 -0.5605487 -0.3652694 0.7432116 spot.ms.2 -0.4726613 -0.5958032 -0.6493149 spot.ms.3 -0.6799827 0.7152600 -0.1613280
+++
The following example performs PCA with data centering and scaling (options center=TRUE and scale=TRUE)
Command
prcomp(spot.ms.nonas, center=TRUE, scale=TRUE)
Output
Standard deviations: [1] 1.6092826 0.6307795 0.1110256 Rotation: PC1 PC2 PC3 spot.ms.1 -0.6062688 0.3232856 -0.72658417 spot.ms.2 -0.5975823 0.4176378 0.68445170 spot.ms.3 -0.5247224 -0.8491555 0.06001103
Examples using MODIS surface reflectance products
Eigenvectors solution
Based on the covariance matrix
Using m.eigensystem
Command in one line
(echo 3; r.covar b02,b06,b07) | m.eigensystem
Output
- The E line is the eigen value. (Real part, imaginary part, percent importance)
- The V lines are the eigen vectors associated with E.
- The N lines are the V vectors normalized to have a magnitude of 1. These are the scaled eigen vectors that correspond to princomp()'s results presented in the following section.
- The W lines are the N vector multiplied by the square root of the magnitude of the eigen value(E). Generally referred to as "the factor loadings", also called "component loadings".
r.covar: complete ... 100% E 778244.0258462029 .0000000000 79.20 V .5006581842 .0000000000 V .8256483300 .0000000000 V .6155834548 .0000000000 N .4372107421 .0000000000 N .7210155161 .0000000000 N .5375717557 .0000000000 W 385.6991853500 .0000000000 W 636.0664787886 .0000000000 W 474.2358050886 .0000000000 E 192494.5769628266 .0000000000 19.59 V -.8689798010 .0000000000 V .0996340298 .0000000000 V .5731134848 .0000000000 N -.8309940700 .0000000000 N .0952787255 .0000000000 N .5480609638 .0000000000 W -364.5920328433 .0000000000 W 41.8027823088 .0000000000 W 240.4573848757 .0000000000 E 11876.4548199713 .0000000000 1.21 V .2872248982 .0000000000 V -.5731591248 .0000000000 V .5351449518 .0000000000 N .3439413070 .0000000000 N -.6863370819 .0000000000 N .6408165005 .0000000000 W 37.4824307850 .0000000000 W -74.7964308085 .0000000000 W 69.8356366100 .0000000000
In general, the solution to the eigen system results in complex
numbers (with both real and imaginary parts). However, in the example
above, since the input matrix is symmetric (i.e., inverting the rows and columns
gives the same matrix) the eigen system has only real values (i.e., the
imaginary part is zero).
This fact makes it possible to use eigen vectors to perform principle component
transformation of data sets. The covariance or correlation
matrix of any data set is symmetric
and thus has only real eigen values and vectors.
The red-bold eigen vectors of the first principal component, ease of the comparison with the results derived from R's princomp() function that follows.
Using the W vector, new maps can be created:. The new maps (coordinate system) system is formed by the normalized eigenvectors (N) of the variance–covariance (or correlation) matrix [Tso, B. & P.M. Mather. Classification methods for remotely sensed data. 2001. Taylor & Francis, London ; New York].
r.mapcalc 'pc.1 = .4372107421*map.1 +.7210155161 *map.2 + .5375717557*map.3' r.mapcalc 'pc.2 = -.8309940700 *map.1 + .0952787255*map.2 + .5480609638*map.3' r.mapcalc 'pc.3 = .3439413070*map.1 - .6863370819*map.2 + .6408165005*map.3'
Visualize results:
d.mon x0 d.rast pc.1 d.rast pc.2 d.rast pc.3
Using R's princomp() function
Command
princomp(modis)
Output
Call: princomp(x = modis) Standard deviations: Comp.1 Comp.2 Comp.3 857.5737 436.0922 108.5083 3 variables and 350596 observations.
Get loadings
(princomp(modis))$loadings Loadings: Comp.1 Comp.2 Comp.3 b02 -0.418 0.839 0.348 b06 -0.725 -0.684 b07 -0.547 -0.539 0.641
Comp.1 Comp.2 Comp.3 SS loadings 1.000 1.000 1.000 Proportion Var 0.333 0.333 0.333 Cumulative Var 0.333 0.667 1.000
The red-bold eigen vectors of the first principal component, ease of the comparison with the results derived from GRASS' m.eigensystem module above.
Based on the correlation matrix
Using m.eigensystem
Command in one line
(echo 3; r.covar -r MOD07_b02,MOD07_b06,MOD07_b07)|m.eigensystem
Output
r.covar: complete ... 100% E 2.2915877718 .0000000000 76.39 V -.5755655569 .0000000000 V -.7660355041 .0000000000 V -.6809380186 .0000000000 N -.4896413269 .0000000000 N -.6516766616 .0000000000 N -.5792830912 .0000000000 W -.7412186091 .0000000000 W -.9865075560 .0000000000 W -.8769182329 .0000000000 E .6740687010 .0000000000 22.47 V .8667178982 .0000000000 V -.1116525720 .0000000000 V -.6069908335 .0000000000 N .8145815825 .0000000000 N -.1049362531 .0000000000 N -.5704780699 .0000000000 W .6687852213 .0000000000 W -.0861544341 .0000000000 W -.4683721194 .0000000000 E .0343435272 .0000000000 1.14 V .2486404469 .0000000000 V -.6006166822 .0000000000 V .4655120098 .0000000000 N .3109794470 .0000000000 N -.7512029762 .0000000000 N .5822249325 .0000000000 W .0576307320 .0000000000 W -.1392129859 .0000000000 W .1078979635 .0000000000
The red-bold eigen vectors of the second principal component, ease of the comparison with the results derived from R's princomp() function that follows.
Using R's princomp() function
Command
princomp(mod07, cor=TRUE)
Output
Call: princomp(x = mod07, cor = TRUE)
Standard deviations: Comp.1 Comp.2 Comp.3 1.5030740 0.8397807 0.1885121 3 variables and 350596 observations.
Get loadings
(princomp(mod07, cor=TRUE))$loadings
Output
Loadings: Comp.1 Comp.2 Comp.3 MOD2007_242_500_sur_refl_b02 -0.481 0.820 0.310 MOD2007_242_500_sur_refl_b06 -0.656 -0.102 -0.748 MOD2007_242_500_sur_refl_b07 -0.582 -0.563 0.587 Comp.1 Comp.2 Comp.3 SS loadings 1.000 1.000 1.000 Proportion Var 0.333 0.333 0.333 Cumulative Var 0.333 0.667 1.000
The red-bold eigen vectors of the second component, ease of the comparison with the results derived from GRASS' m.eigensystem module above.
Comments
Add comments here...
SVD
Using i.pca
Command
i.pca input=b2,b6,b7 output=pca.b267
Output
Eigen values, (vectors), and [percent importance]:
PC1 6307563.04 ( -0.64, -0.65, -0.42 ) [ 98.71% ]
PC2 78023.63 ( -0.71, 0.28, 0.64 ) [ 1.22% ]
PC3 4504.60 ( -0.30, 0.71, -0.64 ) [ 0.07% ]
Using R's prcomp() function
The following example shows that i.pca' does not perform data centering in this specific case:
Command
prcomp(mod07, center=FALSE, scale=FALSE)
Output
Standard deviations: [1] 4288.3788 476.8904 114.3971 Rotation: PC1 PC2 PC3 MOD2007_242_500_sur_refl_b02 -0.6353238 0.7124070 -0.2980602 MOD2007_242_500_sur_refl_b06 -0.6485551 -0.2826985 0.7067234 MOD2007_242_500_sur_refl_b07 -0.4192135 -0.6423066 -0.6416403
- Performing data centering manually in grass (using r.mapcalc) and repeating the i.pca gives the same results as prcomp(x, center=TRUE, scale=FALSE). Example to be added.
- The eigenvector matrices match although prcomp() reports loadings (=eigenvectors) column-wise and i.pca row-wise.
- The eigenvalues do not match. To exemplify, the standard deviation for PC1 reported by prcomp() is 4288.3788 and the variance reported by i.pca is 6307563.04 [ sqrt(6307563.04) = 2511.486 ]
More examples to be added
References
Jon Shlens, "Tutorial on Principal Component Analysis, Dec 2005," [1] (accessed on March, 2009).
e-mails in GRASS-user mailing list
There are many posts concerning the functionality of i.pca. Most of them are questioning the non-reporting of eigenvalues (an issue recently fixed).
Posts in grass-user mailing list
More sources to be added