Bioinformatics for Microbial Research: September 2011

Today's very late update concerns the coding and theory behind principal components. First off, the data set we will be using is a Made Up data set available at the link.

A refresher from the previous week on how to import data:

>data<-read.table("MadeUp.txt",header=TRUE,row.names=1)

Now take a quick look at your data. You should have 6 columns (headers 1-6) and 13 rows (A-M).
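One way to take that quick look is sketched here. The toy table is a hypothetical stand-in for MadeUp.txt (random numbers, not the real values), and the file is assumed to be whitespace-delimited with the row labels in the first column. It is written to a temporary file so the snippet runs on its own.

```r
# Hypothetical stand-in for MadeUp.txt: 13 rows (A-M), 6 columns headed 1-6.
set.seed(1)
toy <- data.frame(ID = LETTERS[1:13],
                  matrix(round(rnorm(13 * 6), 2), nrow = 13))
names(toy) <- c("ID", 1:6)
path <- tempfile(fileext = ".txt")
write.table(toy, path, quote = FALSE, row.names = FALSE)

# Same call as above, pointed at the temporary file.
data <- read.table(path, header = TRUE, row.names = 1)

dim(data)        # number of rows, then number of columns
head(data, 3)    # first three rows
rownames(data)   # taken from the first column because row.names = 1
colnames(data)   # numeric headers are made syntactic by prefixing "X"
```

The last line is why the loadings further down are labelled X1 to X6 rather than 1 to 6.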

What we are actually going to look at today is finding and displaying the Principal Components of the dataset. Principal component analysis takes a set of observations and finds the linear combination of the variables that describes the most variance in the data. It then repeats this along an axis that is uncorrelated with (orthogonal to) the first to find the next principal component, and so on for the remaining components. In effect, it both describes and transforms the dataset.
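To make the idea concrete, here is a minimal sketch that does the same thing by hand with an eigen-decomposition of the correlation matrix. The data are random numbers made up for illustration (not MadeUp.txt), and the variable names are placeholders.

```r
# Hypothetical data: 13 observations of 6 variables (random, not MadeUp.txt).
set.seed(42)
x <- matrix(rnorm(13 * 6), nrow = 13)

# Eigen-decomposition of the correlation matrix.
e <- eigen(cor(x))

e$values                          # variance along each component, largest first
round(crossprod(e$vectors), 10)   # identity matrix: the directions are orthogonal

# Project the standardised data onto the components.
scores <- scale(x) %*% e$vectors
round(cov(scores), 10)            # diagonal: the new variables are uncorrelated
```

Each column of e$vectors is one "linear equation", and the transformed data (scores) have no covariance left between components.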

A useful package to know in R is vegan. Install it on your R platform. I'll let you read up on that one on your own and see why it is actually easier.

To delve into the meat of how to perform PCA in R, first we need to find the actual fit. This is achieved by:

>fit<-princomp(data,cor=TRUE)

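When cor is TRUE, the PCA is computed from the correlation matrix: each variable is scaled to unit variance first, so possibly correlated variables are re-expressed in uncorrelated space. If it is FALSE, then the function uses the covariance matrix instead, so variables with larger variances weigh more heavily, and possibly covarying variables are re-expressed in non-covarying space.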
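The difference is easiest to see when one variable is on a much larger scale than the others. The sketch below uses three made-up variables (a, b and c are hypothetical names, and the values are random), with c multiplied by 1000.

```r
# Hypothetical data: three variables, the third on a much larger scale.
set.seed(7)
x <- matrix(rnorm(13 * 3), nrow = 13,
            dimnames = list(NULL, c("a", "b", "c")))
x[, "c"] <- x[, "c"] * 1000

fit_cor <- princomp(x, cor = TRUE)    # PCA of the correlation matrix
fit_cov <- princomp(x, cor = FALSE)   # PCA of the covariance matrix

# With cor = TRUE the component variances are the eigenvalues of cor(x).
fit_cor$sdev^2
eigen(cor(x))$values

# With cor = FALSE the large-scale variable dominates the first component.
unclass(loadings(fit_cov))[, 1]
```

If your columns are measured in different units, cor=TRUE is usually the safer starting point, since it stops one column from dominating simply because its numbers are bigger.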

Let's look at the summary of the fit.

>summary(fit)

Importance of components:
                          Comp.1    Comp.2       Comp.3
Standard deviation     1.7856433 1.6750222 0.0558639339
Proportion of Variance 0.5314204 0.4676166 0.0005201299
Cumulative Proportion  0.5314204 0.9990369 0.9995570583
                             Comp.4       Comp.5       Comp.6
Standard deviation     0.0391451350 0.0304540019 1.406635e-02
Proportion of Variance 0.0002553903 0.0001545744 3.297705e-05
Cumulative Proportion  0.9998124486 0.9999670229 1.000000e+00

Now, the Proportion of Variance is pretty key. It shows how much of the data's variability is captured by each component. PC1 captures 53% and PC2 47%, so between these two about 99.9% of the variability is captured. Note that real data rarely looks like this, so sometimes PC6 or PC7 is still very descriptive.
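The Proportion of Variance row is just each squared standard deviation divided by the total. A short sketch, using the standard deviations copied from the summary above (the variable names are my own):

```r
# Standard deviations copied from the summary(fit) output above.
sdev <- c(1.7856433, 1.6750222, 0.0558639339,
          0.0391451350, 0.0304540019, 1.406635e-02)

variance   <- sdev^2                      # variance captured by each component
proportion <- variance / sum(variance)    # Proportion of Variance row
cumulative <- cumsum(proportion)          # Cumulative Proportion row

round(proportion, 4)
round(cumulative, 4)
sum(variance)   # close to 6, the number of variables, because cor = TRUE
```

With a real fit object the same numbers come from fit$sdev, so you do not need to retype them.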

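Let's look at the actual loadings, the coefficients of the linear equation for each component. Entries smaller than 0.1 in absolute value are left blank in the printout.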

>loadings(fit)

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
X1 -0.415 -0.401  0.271 -0.283  0.512  0.501
X2 -0.408 -0.409 -0.240 -0.191  0.104 -0.749
X3 -0.406 -0.411         0.477 -0.611  0.252
X4 -0.405  0.412 -0.194  0.629  0.481
X5 -0.405  0.411 -0.542 -0.493 -0.247  0.261
X6 -0.411  0.405  0.732 -0.129 -0.247 -0.233

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000
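To see how the loadings act as a linear equation, the sketch below rebuilds the component scores by hand. The data are random placeholders (not MadeUp.txt), but the row and column labels mimic the ones used here.

```r
# Hypothetical data: 13 observations of 6 variables (random, not MadeUp.txt).
set.seed(3)
x <- matrix(rnorm(13 * 6), nrow = 13,
            dimnames = list(LETTERS[1:13], paste0("X", 1:6)))
fit <- princomp(x, cor = TRUE)

L <- unclass(loadings(fit))   # plain 6 x 6 matrix of coefficients
colSums(L^2)                  # every column has unit length

# Scores are the standardised data multiplied by the loadings.
z <- scale(x, center = fit$center, scale = fit$scale)
manual <- z %*% L
all.equal(unname(manual), unname(fit$scores))

# Show every coefficient, including those the default print leaves blank.
print(loadings(fit), cutoff = 0)
```

The unit-length columns are also why the "SS loadings" row is always 1.000 for princomp; that row tells you nothing about how much variance each component captures.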

A quick way to assess this data is through either heatmap or biplot analysis. Since we are primarily concerned with the first two PCs, we will use a biplot. A biplot plots one PC on each axis, as shown below.

>biplot(fit)

This visualizes what we saw in the tables above. Note that columns 1-3 and 4-6 are completely orthogonal to each other. This indicates that the correlation between these two groups is near zero and that they are describing different things. Additionally, note that the letters (the rows A-M) are plotted in this same vector space. If a letter lies in the direction an arrow points, it is well described by, or correlated with, that vector. In the opposite direction, it is anti-correlated. Orthogonal to the arrow, there is no correlation.
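The orthogonality can be checked numerically from the first two columns of the loadings table. This is a rough sketch using the raw loadings; biplot() additionally rescales the arrows by the component standard deviations, so the angles in the plot will differ slightly from these.

```r
# First two columns of the loadings table printed above.
L12 <- rbind(X1 = c(-0.415, -0.401),
             X2 = c(-0.408, -0.409),
             X3 = c(-0.406, -0.411),
             X4 = c(-0.405,  0.412),
             X5 = c(-0.405,  0.411),
             X6 = c(-0.411,  0.405))

unit    <- L12 / sqrt(rowSums(L12^2))   # scale each arrow to unit length
cosines <- unit %*% t(unit)             # cosine of the angle between arrows

round(cosines, 2)
```

Cosines near 1 mean two arrows point the same way, and cosines near 0 mean they are at right angles, which is the pattern between X1-X3 and X4-X6.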

Play around with the data. There are other visualization and analysis techniques, but these are just the basics to get you on your way.
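As a starting point for playing around, here is a small sketch that saves a biplot and a correlation heatmap to a PDF. The data are again random placeholders, and the output goes to a temporary file; swap in your own data and file name.

```r
# Hypothetical data: 13 observations of 6 variables (random, not MadeUp.txt).
set.seed(5)
x <- matrix(rnorm(13 * 6), nrow = 13,
            dimnames = list(LETTERS[1:13], paste0("X", 1:6)))
fit <- princomp(x, cor = TRUE)

out <- tempfile(fileext = ".pdf")   # placeholder output path
pdf(out)
biplot(fit, choices = 1:2)          # PC1 on the x axis, PC2 on the y axis
heatmap(cor(x), symm = TRUE)        # correlations between the columns
dev.off()
```

Changing choices (for example to c(1, 3)) plots a different pair of components.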

Thursday, September 29, 2011

Principal Components Analysis
