Welcome to a Short Course!

This seminar series will hopefully introduce some useful tools for bioinformatic research on microbial data. Enjoy!

Thursday, September 29, 2011

Principal Components Analysis

Today's very late update concerns the coding and theory behind principal components. First off, the data set we will be using is a Made Up data set available at the link.

A refresher from the previous week on how to import data:

>data<-read.table("MadeUp.txt",header=TRUE,row.names=1)

Now take a quick look at your data. You should have 6 columns (headers 1-6) and 13 rows (A-M). R prefixes numeric column headers with an X, so the columns will show up as X1-X6.
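If you want to see what header=TRUE and row.names=1 do without the real file, here is a minimal sketch using a tiny stand-in file. The file contents and the name toy are made up for illustration and are not the MadeUp.txt data.

```r
# Write a tiny stand-in file (hypothetical values, NOT the real MadeUp.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID 1 2 3",
             "A 0.5 1.2 3.1",
             "B 0.7 1.0 2.9",
             "C 0.2 1.5 3.3"), tmp)

# header=TRUE uses the first line as column names;
# row.names=1 uses the first column as row labels
toy <- read.table(tmp, header = TRUE, row.names = 1)

dim(toy)       # number of rows, then number of columns
names(toy)     # numeric headers get an X prefix
rownames(toy)  # labels taken from the first column
head(toy)      # first few rows
```

Running dim, names and head on your own data object is a quick check that the import did what you expected before you go any further.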

What we are actually going to look at today is finding and displaying the principal components of the dataset. Principal component analysis takes a set of observations and finds the linear combination of the variables that describes the most variance in the data; this is the first principal component. It then repeats the process along an axis that is orthogonal to (uncorrelated with) the first to find the next principal component, and so on. In effect, it both describes and transforms the dataset.
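To make the idea concrete, here is a rough sketch of doing the same thing by hand on simulated data. The data frame toy and its columns are invented for this example. It relies on the standard result that the principal components of standardized data are the eigenvectors of the correlation matrix.

```r
set.seed(1)                      # make the simulated toy data reproducible
n <- 50
shared <- rnorm(n)
toy <- data.frame(a = shared + rnorm(n, sd = 0.1),  # a and b move together
                  b = shared + rnorm(n, sd = 0.1),
                  c = rnorm(n))                      # c is unrelated

# Eigen-decomposition of the correlation matrix
e <- eigen(cor(toy))
e$values             # variance along each principal component
e$vectors            # directions (one column per component)

# The directions are orthogonal: this is (numerically) the identity matrix
round(crossprod(e$vectors), 10)

# princomp reports the square roots of the same eigenvalues
fit_toy <- princomp(toy, cor = TRUE)
all.equal(unname(fit_toy$sdev), sqrt(e$values))
```

Because a and b are nearly copies of each other, most of the variance ends up in a single component, which is the "describing" part of PCA.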

A useful package to know in R is vegan. Install it on your R platform. I'll let you read up on that one on your own and see why it is actually easier.

To delve into the meat of how to perform PCA in R, first we need to find the actual fit. This is achieved by:

>fit<-princomp(data,cor=TRUE)

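When cor is TRUE, the PCA is computed from the correlation matrix, so every variable is standardized and the correlated variables are re-expressed in uncorrelated space. If it is FALSE, then the function computes the PCA from the covariance matrix, re-expressing covarying variables in non-covarying space while keeping their original scales.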
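The choice matters most when the variables are on very different scales. Here is a small sketch with simulated data (toy and its column names are made up for illustration) that compares the two settings.

```r
set.seed(2)
toy <- data.frame(small = rnorm(40, sd = 1),
                  mid   = rnorm(40, sd = 10),
                  large = rnorm(40, sd = 100))

fit_cov <- princomp(toy, cor = FALSE)  # covariance matrix: original scales
fit_cor <- princomp(toy, cor = TRUE)   # correlation matrix: standardized

# Share of variance per component under each choice
prop_cov <- fit_cov$sdev^2 / sum(fit_cov$sdev^2)
prop_cor <- fit_cor$sdev^2 / sum(fit_cor$sdev^2)
round(prop_cov, 3)
round(prop_cor, 3)

# With cor = TRUE the component variances add up to the number of variables
sum(fit_cor$sdev^2)
```

With cor=FALSE the first component is dominated by whichever variable has the largest numbers, so cor=TRUE is usually the safer choice when the columns are measured in different units.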

Let's look at the summary of the fit.

>summary(fit)


Importance of components:
                          Comp.1      Comp.2       Comp.3       
Standard deviation     1.7856433    1.6750222    0.0558639339 
Proportion of Variance 0.5314204    0.4676166    0.0005201299 
Cumulative Proportion  0.5314204    0.9990369    0.9995570583 
                          Comp.4      Comp.5       Comp.6
Standard deviation     0.0391451350 0.0304540019 1.406635e-02
Proportion of Variance 0.0002553903 0.0001545744 3.297705e-05
Cumulative Proportion  0.9998124486 0.9999670229 1.000000e+00


Now, the Proportion of Variance is pretty key. This shows how much of the data's variability is captured by each component. PC1 captures 53% and PC2 47%, so between these two nearly all (99.9%) of the variability is captured. Note that real data rarely looks like this, so sometimes even PC6 or PC7 is still very descriptive.
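The Proportion of Variance row is just the squared standard deviations divided by their total. A quick sketch that reproduces it from the numbers printed in the summary (on the real object you would use fit$sdev instead of typing them in):

```r
# Standard deviations copied from the summary(fit) printout
sdev <- c(1.7856433, 1.6750222, 0.0558639339,
          0.0391451350, 0.0304540019, 1.406635e-02)

variance   <- sdev^2                    # variance of each component
proportion <- variance / sum(variance)  # Proportion of Variance row
cumulative <- cumsum(proportion)        # Cumulative Proportion row

round(proportion, 4)
round(cumulative, 4)
```

Since this fit used cor=TRUE, the variances add up to (about) 6, the number of variables.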


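Let's look at the actual loadings, the coefficients of the linear equation for each component. Blank entries in the printout are small loadings (below 0.1 in absolute value) that R hides.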


>loadings(fit)

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
X1 -0.415 -0.401  0.271 -0.283  0.512  0.501
X2 -0.408 -0.409 -0.240 -0.191  0.104 -0.749
X3 -0.406 -0.411         0.477 -0.611  0.252
X4 -0.405  0.412 -0.194  0.629  0.481       
X5 -0.405  0.411 -0.542 -0.493 -0.247  0.261
X6 -0.411  0.405  0.732 -0.129 -0.247 -0.233


               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000
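To see how the loadings act as a linear equation, here is a rough sketch on simulated data (toy and fit_toy are names invented for this example). It rebuilds the component scores by multiplying the standardized data by the loadings, using the center and scale that princomp stores in the fit.

```r
set.seed(3)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b <- toy$a + 0.5 * toy$b          # make a and b correlated

fit_toy <- princomp(toy, cor = TRUE)
L <- unclass(loadings(fit_toy))       # plain numeric matrix of loadings

# Each column of loadings has length 1, hence "SS loadings" of 1.000
colSums(L^2)

# Standardize the data the same way princomp did, then apply the loadings
z <- scale(toy, center = fit_toy$center, scale = fit_toy$scale)
manual_scores <- z %*% L

# Same numbers as the scores stored in the fit
all.equal(unname(manual_scores), unname(fit_toy$scores))
```

This also shows why the SS loadings table always reads 1.000 and 0.167 here: it describes the length of the loading vectors, not the variance each component explains. For that, use the summary.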



A quick way to assess this data is through either heatmap or biplot analysis. Since we are primarily concerned with the first two PCs, we will use a biplot. A biplot plots one PC on each axis, as shown below.

>biplot(fit)


This visualizes what we saw in the tables above. Note that the arrows for columns 1-3 and columns 4-6 are almost completely orthogonal to each other. This indicates that the correlation between these two groups is near zero and that they are describing different things. Additionally, note that the letters (the rows) are plotted in this same vector space. If they lie in the same direction as an arrow, they are well described by/correlated with that variable. Opposite direction: anti-correlated. Orthogonal: no correlation.
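You can check the arrow geometry numerically from the printouts. This sketch assumes the fit used cor=TRUE (as above), in which case a loading multiplied by its component's standard deviation is the correlation between that variable and that component, and dot products between the scaled vectors approximate the correlations between variables. It is only an approximation, since it uses two components and the rounded printed values.

```r
# First two columns of the loadings printout (rounded values)
L <- matrix(c(-0.415, -0.401,
              -0.408, -0.409,
              -0.406, -0.411,
              -0.405,  0.412,
              -0.405,  0.411,
              -0.411,  0.405),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("X", 1:6), c("Comp.1", "Comp.2")))

# Standard deviations of PC1 and PC2 from the summary printout
sdev <- c(1.7856433, 1.6750222)

# Scale each loading by its component's standard deviation
scaled <- sweep(L, 2, sdev, "*")

# Dot products between the scaled vectors approximate correlations
approx_cor <- scaled %*% t(scaled)
round(approx_cor, 2)
```

Entries within X1-X3 or within X4-X6 come out close to 1, while entries between the two groups come out close to 0, which matches the two bundles of arrows at right angles. On the real data, cor(data) gives the exact values to compare against.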

Play around with the data. There are other visualization and analysis techniques, but these are the basics to get you on your way.

Thursday, September 15, 2011

Tomorrow's Session : BiPlots and Heatmaps

Hello all-

Outside of a few other basic components of R, we will be going over some BiPlots and simple heatmaps. Again, time is 10 AM and room is 208 Hollister.

See you then!

-Cresten

Friday, September 9, 2011

Bootstrap Dendrogram from Today's Talk

Here is the bootstrap dendrogram from today's talk. I truncated the experimental names for data privacy reasons. I will put together a summary of what we went over and post it on Monday. Next week, we will go through PCA, biplots, and some R Graphics and R basics.


A little on the image above...

The red is the "au" value, which is the approximately unbiased p-value. The green is the "bp" value, which is the bootstrap probability. One of the things that irks me about R is how unfriendly it is to people who are red-green color-blind. If you ever consider presenting or publishing anything in color, make sure to switch this color scheme.

Sunday, September 4, 2011

Meeting Time and Location

Hello all-

After hearing back from a few of you, the meeting time and location will be at 10 AM on Fridays in Hollister Hall. I will get a specific room this week. Looking forward to it myself!

Make sure to bring a laptop; it will be most useful that way.

-C