Principal Components Analysis

From GRASS-Wiki
Revision as of 06:10, 17 March 2009

Under Construction

This page is a short, practical introduction to Principal Component Analysis (PCA). It aims to highlight the meaning of the values returned by PCA. In addition, it addresses numerical accuracy issues in the default implementation of PCA in GRASS, the i.pca module.


Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique used extensively in remote sensing studies (e.g. in change detection, image enhancement tasks and more). PCA is in fact a linear transformation applied to (usually) highly correlated multidimensional (e.g. multispectral) data. The input dimensions are transformed into a new coordinate system in which the resulting dimensions (called principal components) capture, in decreasing order, the greatest share of the variance of the original data; in change detection applications, the leading components typically relate to unchanged landscape features.


PCA has two algebraic solutions:

  • Eigendecomposition of the covariance (or correlation) matrix
  • Singular Value Decomposition (SVD)


The SVD method is generally preferred for numerical accuracy (R Documentation).
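
The equivalence of the two solutions can be sketched numerically. The following Python/NumPy illustration (synthetic data standing in for a multispectral image) shows that the eigenvalues of the covariance matrix equal the squared singular values of the centred data matrix divided by n − 1; the SVD route simply avoids forming the covariance matrix explicitly.

```python
import numpy as np

# Synthetic correlated "bands" standing in for a multispectral image.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 1.0, 0.5],
                                          [0.0, 2.0, 1.0],
                                          [0.0, 0.0, 1.0]])
Xc = X - X.mean(axis=0)                    # centre each band

# Route 1: eigendecomposition of the covariance matrix.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

# Route 2: SVD of the centred data matrix (no covariance matrix formed).
s = np.linalg.svd(Xc, compute_uv=False)
svd_vals = s**2 / (len(X) - 1)             # singular values -> variances

print(np.allclose(eigvals, svd_vals))      # True: the two routes agree
```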


Background

The basic steps of the transformation are:

  1. organizing a dataset in a matrix
  2. data centering (that is: subtracting each dimension's mean from it, so the dataset has mean = 0)
  3. calculate
    1. the covariance matrix (non-standardised PCA) or
    2. the correlation matrix (standardised PCA, also known as scaling)
  4. calculate
    1. the eigenvectors and eigenvalues of the covariance (or correlation) matrix or
    2. the SVD
  5. sort variances in decreasing order (decreasing eigenvalues)
  6. project the original dataset onto the new axes (PCs = eigenvectors × input data)
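
The steps above can be condensed into a short sketch (Python/NumPy for illustration; the standardise flag switches between the covariance- and correlation-based variants):

```python
import numpy as np

def pca(X, standardise=False):
    """Follow the steps above: centre, covariance (or correlation),
    eigendecomposition, sort by decreasing variance, project."""
    Xc = X - X.mean(axis=0)                    # step 2: data centering
    if standardise:                            # step 3.2: correlation = scaling
        Xc = Xc / Xc.std(axis=0, ddof=1)
    C = np.cov(Xc, rowvar=False)               # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # step 4: eigensystem (symmetric)
    order = np.argsort(eigvals)[::-1]          # step 5: decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                      # step 6: project the data
    return eigvals, eigvecs, scores

# Illustrative use on synthetic data (50 "pixels", 3 "bands"):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
eigvals, eigvecs, scores = pca(X)
```

Note that the eigenvalues sum to the total variance of the data, and the projected components are mutually uncorrelated.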


Solutions to PCA

The Eigenvector solution to PCA involves

  1. calculation of
    1. the covariance matrix of the given multidimensional dataset (non-standardised PCA) or
    2. the correlation matrix of the given multidimensional dataset (standardised PCA)
  2. calculation of the eigenvalues and eigenvectors of the covariance (or correlation) matrix
  3. transformation of the input dataset using the eigenvectors as weighting coefficients


Terminology

Eigenvalues represent the variance of the original data contained in each principal component.

Also known as: +++

Eigenvectors act as weighting coefficients; they represent the contribution of each original dimension to the principal components.

Also known as: loadings, +++


Implementation of PCA with GRASS

  • Manually by using the m.eigensystem module

The m.eigensystem module implements the eigenvector solution to PCA. The corresponding function in R is princomp(). A comparison of their results confirms that they are (almost) identical.

-Eigenvalues: The variances (= sdev^2) derived from the standard deviations (sdev) reported by princomp() are (almost) identical to the eigenvalues reported by m.eigensystem.

-Eigenvectors: princomp() scales the eigenvectors and so does m.eigensystem. The scaled(=normalised) eigenvectors produced by m.eigensystem are marked with the capital letter N.


  • Using the i.pca module

The i.pca module performs PCA based on the SVD solution, without data centering or scaling. A comparison of the results derived by i.pca and R's prcomp() function confirms this. Specifically, i.pca yields the same eigenvectors as R's prcomp() function does with the following options:

prcomp(x, center=FALSE, scale=FALSE)

where x is a numeric or complex matrix (or data frame) which provides the data for the principal components analysis (R Documentation).
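
Assuming the description above is accurate, prcomp(x, center=FALSE, scale=FALSE) boils down to an SVD of the raw (uncentred) data matrix. A Python/NumPy sketch with hypothetical data:

```python
import numpy as np

# Hypothetical, uncentred data standing in for three flattened image bands.
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, size=(200, 3))

# SVD of the raw matrix, mirroring prcomp(x, center=FALSE, scale=FALSE):
U, s, Vt = np.linalg.svd(X, full_matrices=False)
sdev = s / np.sqrt(len(X) - 1)   # prcomp's "Standard deviations"
rotation = Vt.T                  # prcomp's "Rotation" (eigenvectors, column-wise)
scores = X @ rotation            # the principal component "images"
```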


Open issues with i.pca

Currently the eigenvalues reported by i.pca do not agree with the respective variances reported by m.eigensystem. In contrast, a cross-comparison of the results derived by m.eigensystem and R functions that perform PCA yields almost identical results.


Examples

All examples presented below were performed using MODIS surface reflectance products.


Eigenvector solution

Based on the covariance matrix

Using m.eigensystem

Command in one line

(echo 3; r.covar b02,b06,b07) | m.eigensystem


Output

  • The E line is the eigenvalue (real part, imaginary part, percent importance).
  • The V lines are the eigenvectors associated with E.
  • The N lines are the V vectors normalised to a magnitude of 1. These are the scaled eigenvectors that correspond to princomp()'s results presented in the following section.
  • The W lines are the N vectors multiplied by the square root of the magnitude of the eigenvalue (E).
r.covar: complete ...
100%
E    778244.0258462029          .0000000000    79.20
V          .5006581842          .0000000000
V          .8256483300          .0000000000
V          .6155834548          .0000000000
N          .4372107421          .0000000000
N          .7210155161          .0000000000
N          .5375717557          .0000000000
W       385.6991853500          .0000000000
W       636.0664787886          .0000000000
W       474.2358050886          .0000000000

E    192494.5769628266          .0000000000    19.59
V         -.8689798010          .0000000000
V          .0996340298          .0000000000
V          .5731134848          .0000000000
N         -.8309940700          .0000000000
N          .0952787255          .0000000000
N          .5480609638          .0000000000
W      -364.5920328433          .0000000000
W        41.8027823088          .0000000000
W       240.4573848757          .0000000000

E     11876.4548199713          .0000000000     1.21
V          .2872248982          .0000000000
V         -.5731591248          .0000000000
V          .5351449518          .0000000000
N          .3439413070          .0000000000
N         -.6863370819          .0000000000
N          .6408165005          .0000000000
W        37.4824307850          .0000000000
W       -74.7964308085          .0000000000
W        69.8356366100          .0000000000


In general, the solution to an eigensystem results in complex numbers (with both real and imaginary parts). However, in the example above, since the input matrix is symmetric (i.e. transposing it gives the same matrix), the eigensystem has only real values (i.e. the imaginary parts are zero). This fact makes it possible to use eigenvectors to perform the principal component transformation of data sets: the covariance or correlation matrix of any data set is symmetric and thus has only real eigenvalues and eigenvectors.
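
This can be checked numerically (Python/NumPy sketch with synthetic data standing in for image bands):

```python
import numpy as np

# Any covariance matrix is symmetric, so its eigenvalues are real.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))              # synthetic stand-in for image bands
C = np.cov(X, rowvar=False)

w = np.linalg.eigvals(C)
print(np.allclose(C, C.T))                 # True: symmetric by construction
print(np.allclose(w.imag, 0))              # True: imaginary parts vanish
```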

Note: The (normalised) eigenvectors of the first principal component ease the comparison with the results derived from R's princomp() function that follows.


Using the W vector, new maps can be created:

r.mapcalc 'pc.1 =  385.6992*map.1 +636.0665*map.2 + 474.2358*map.3'
r.mapcalc 'pc.2 = -364.5920*map.1 + 41.8027*map.2 + 240.4573*map.3'
r.mapcalc 'pc.3 =   37.4824*map.1 - 74.7964*map.2 +  69.8356*map.3'
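
The three r.mapcalc expressions above amount to a single matrix product. A Python/NumPy sketch, with the W vectors printed above as the columns of W and a couple of made-up pixel vectors:

```python
import numpy as np

# Columns of W are the three W vectors reported by m.eigensystem above.
W = np.array([[385.6992, -364.5920,  37.4824],
              [636.0665,   41.8027, -74.7964],
              [474.2358,  240.4573,  69.8356]])

# Two illustrative pixel vectors (map.1, map.2, map.3 values are made up).
pixels = np.array([[0.12, 0.30, 0.25],
                   [0.10, 0.28, 0.22]])

pcs = pixels @ W   # column j holds pc.(j+1), as the r.mapcalc lines compute
```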


Using R's princomp() function

Command

princomp(modis)


Output

Call:
princomp(x = modis)
Standard deviations:
  Comp.1   Comp.2   Comp.3
857.5737 436.0922 108.5083
3  variables and  350596 observations.


Get loadings

(princomp(modis))$loadings
Loadings:
    Comp.1 Comp.2 Comp.3
b02 -0.418  0.839  0.348
b06 -0.725        -0.684
b07 -0.547 -0.539  0.641

The loadings, that is, the eigenvectors of the first principal component, ease the comparison with the results derived from GRASS' m.eigensystem module above. Note the blank loading for b06 in Comp.2: by default, R's print method for loadings suppresses entries whose absolute value falls below a cutoff of 0.1 (the value here is approximately 0.095, as the corresponding N vector from m.eigensystem shows).

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000


Based on the correlation matrix

Using m.eigensystem

Command in one line

(echo 3; r.covar -r MOD07_b02,MOD07_b06,MOD07_b07)|m.eigensystem


Output

r.covar: complete ...
 100%
E         2.2915877718          .0000000000    76.39
V         -.5755655569          .0000000000
V         -.7660355041          .0000000000
V         -.6809380186          .0000000000
N         -.4896413269          .0000000000
N         -.6516766616          .0000000000
N         -.5792830912          .0000000000
W         -.7412186091          .0000000000
W         -.9865075560          .0000000000
W         -.8769182329          .0000000000

E          .6740687010          .0000000000    22.47
V          .8667178982          .0000000000
V         -.1116525720          .0000000000
V         -.6069908335          .0000000000
N          .8145815825          .0000000000
N         -.1049362531          .0000000000
N         -.5704780699          .0000000000
W          .6687852213          .0000000000
W         -.0861544341          .0000000000
W         -.4683721194          .0000000000

E          .0343435272          .0000000000     1.14
V          .2486404469          .0000000000
V         -.6006166822          .0000000000
V          .4655120098          .0000000000
N          .3109794470          .0000000000
N         -.7512029762          .0000000000
N          .5822249325          .0000000000
W          .0576307320          .0000000000
W         -.1392129859          .0000000000
W          .1078979635          .0000000000

The (normalised) eigenvectors of the second principal component ease the comparison with the results derived from R's princomp() function that follows.


Using R's princomp() function

Command

princomp(mod07, cor=TRUE)


Output

Call:
princomp(x = mod07, cor = TRUE)
Standard deviations:
   Comp.1    Comp.2    Comp.3
1.5030740 0.8397807 0.1885121

 3  variables and  350596 observations.


Get loadings

(princomp(mod07, cor=TRUE))$loadings


Output

Loadings:
                             Comp.1 Comp.2 Comp.3
MOD2007_242_500_sur_refl_b02 -0.481  0.820  0.310
MOD2007_242_500_sur_refl_b06 -0.656 -0.102 -0.748
MOD2007_242_500_sur_refl_b07 -0.582 -0.563  0.587

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000

The loadings, that is, the eigenvectors of the second principal component, ease the comparison with the results derived from GRASS' m.eigensystem module above.


Comments

Add comments here...

SVD solution

Using i.pca

Command

i.pca input=b2,b6,b7 output=pca.b267


Output

Eigen values, (vectors), and [percent importance]:
PC1  6307563.04 ( -0.64, -0.65, -0.42 ) [ 98.71% ]
PC2    78023.63 ( -0.71,  0.28,  0.64 ) [  1.22% ]
PC3     4504.60 ( -0.30,  0.71, -0.64 ) [  0.07% ] 


Using R's prcomp() function

The following example replicates i.pca's solution, that is, it uses the same data without data centering or scaling (options center=FALSE and scale=FALSE).


Command

prcomp(mod07, center=FALSE, scale=FALSE)   # this corresponds to i.pca


Output

Standard deviations:
[1] 4288.3788  476.8904  114.3971
Rotation:
                                   PC1        PC2        PC3
MOD2007_242_500_sur_refl_b02 -0.6353238  0.7124070 -0.2980602
MOD2007_242_500_sur_refl_b06 -0.6485551 -0.2826985  0.7067234
MOD2007_242_500_sur_refl_b07 -0.4192135 -0.6423066 -0.6416403


Comments

  • The eigenvector matrices match although prcomp() reports loadings (=eigenvectors) column-wise and i.pca row-wise.
  • The eigenvalues do not match. To exemplify, the standard deviation for PC1 reported by prcomp() is 4288.3788 (variance ≈ 1.84 × 10^7), whereas the variance reported by i.pca is 6307563.04 [ sqrt(6307563.04) = 2511.486 ].
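
Such a mismatch is expected whenever the data are not centred: without centring, the leading "eigenvalue" measures the raw second moment, which the band means inflate. A Python/NumPy sketch with synthetic data:

```python
import numpy as np

# Hypothetical bands with a large nonzero mean, as reflectance data often have.
rng = np.random.default_rng(3)
X = rng.normal(loc=100.0, size=(1000, 3))

# Uncentred route (SVD of the raw data, as in i.pca / prcomp without centring):
s = np.linalg.svd(X, compute_uv=False)
uncentred = s**2 / (len(X) - 1)

# Centred route (eigenvalues of the covariance matrix, as m.eigensystem reports):
centred = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

print(uncentred[0] > centred[0])   # True: the band means inflate the first component
```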


More examples to be added


References

Jon Shlens, "Tutorial on Principal Component Analysis", December 2005 [1] (accessed March 2009).


e-mails in GRASS-user mailing list

There are many posts concerning the functionality of i.pca. Most of them question the non-reporting of eigenvalues (an issue recently fixed).


Old posts

[2]


Recent posts

Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

[3]

[4]

[5]


More sources to be added