1. Introduction
  1.1 General Notes
  1.2 Getting Help Inside of R
  1.3 Helpful Links
2. Writing and Running Programs
  2.1 The Basics
  2.2 Libraries
3. Data Entry
4. Univariate Statistics
5. Regression
  5.1 Simple Linear Regression
  5.2 Multiple Linear Regression
  5.3 Generalized Linear Models
  5.4 Fitting Nonlinear Models Using Least Squares
6. Comparing Discrete Treatment Effects
  6.1 Chi-Squared
  6.2 T-tests
  6.3 ANOVA and ANCOVA
  6.4 Post-Hocs
  6.5 Mixed Models
  6.6 MANOVA
  6.7 Repeated Measures ANOVA
7. Multivariate Techniques
  7.1 Principal Components Analysis
  7.2 Structural Equation Modeling and Path Analysis
8. The Kitchen Sink
Appendix A: Random Graphing Notes
Appendix B: Installing R Commander and Xgobi on OSX
Appendix C: Changelog
1. Introduction

  1.1 General Notes

In many ways, R will feel familiar, as it resembles several languages and packages you may already know:

1) SAS - Not only does it have all of the basic statistical tools at your easy disposal, but it also contains library upon library of obscure and powerful statistical methods that you might need at a moment's notice. That, and the code, once you wrap your head around the basic syntax, is very simple. You can do an ANOVA on your dataset in three lines. All of this said, R is also just like S, or S-Plus, or whatever flavor of S you have worked on, in that it essentially is a free open source implementation of S. Rock.
2) MATLAB - It operates from a command line. After running your analytical program, if you want to do additional things with the results, you don't need to write and run a whole new program, but can just do it at the prompt. Similarly, you can just run your analyses at the command line without ever writing a separate standalone program. Also like MATLAB, it is a fully programmable language. Any modeling, graphing, etc., that you can do in MATLAB, you can do in R. There's even a package that makes R use MATLAB syntax, if you really don't want to learn any new tricks.
3) JAVA - This is where some folk who are used to the functional approach of SAS and MATLAB will stumble at first. R is largely object oriented. This means that, instead of saying "run my anova!" and getting results printed up, you create an anova object from your data. The object contains every bit of information about the ANOVA you would ever want. To see this information, or work with it later, you need to call a function on the ANOVA object itself! E.g., my.anova <- aov(response ~ treatment); summary(my.anova)
For future reference, text like that is R code.
4) PERL - Like PERL, R can be very opaque. But, like PERL, R is part of the open source community. It has a HUGE number of people out there constantly tweaking, modifying, writing new statistical methods, and answering questions about it. Go to http://www.r-project.org to begin to get a sense of it. There are wonderful mailing lists with searchable archives, and people always willing to give you pointers.
  1.2 Getting Help Inside of R

You can get help on any function by typing help(function), ?function, or by asking for an example with example(function). Also, note, if you NEED a graphical point n' click interface, and coding terrifies you, there is an option. Make sure you have X11 installed on your computer with the tcltk library, and then you can run the package R Commander, which creates a point and click interface for R. It's not as powerful, but it can yield up some code snippets for you to use later on. See Appendix B for detailed install instructions.
2. Writing and Running Programs

  2.1 The Basics

To run a program you have saved in a file, use

source("/path/to/file/filename.r")

Or set your working directory to be the same as the file's, and you can leave out the path. R uses UNIX directory syntax. There should also be a menu entry to change your working directory. FYI, I often write my data analysis code in the same directory as my datasets, that way I don't have to bother with remembering all sorts of directory paths.
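For example (the path and filename here are hypothetical):

setwd("/Users/me/analysis") #now R looks in this directory for files
source("my_analysis.r")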
  2.2 Libraries

To use the functions in a library, put

library(libraryname)

in your program.
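For example, to grab and load the car library used throughout this guide:

install.packages("car") #only needed once, to download it from CRAN
library(car) #loads it for this session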
3. Data Entry

You can enter data by hand, creating a vector at the prompt:

vector <- c(1, 2, 3, 4)
But this gets tedious rather quickly. FYI, a few language pointers from the above statement: vector is the name of the object you are creating and assigning your data to, <- is the assignment operator, and c combines the arguments in parentheses into a vector. To avoid the tedium, let's assume you've entered your data into excel, gnumeric, openoffice, or whatever. Save them as a csv (comma separated document). Then in R, you can use the easy command:
my.data <- read.csv("/path/to/data/data.csv")
Yup, read.csv. It's that easy. There's also a nice read.delim for other kinds of text files. You now have an object for your data, my.data, that you can do many wonderful things with in the future. If you ever want to look at your data in tabular format again, type edit(my.data). If you ever want to see what the columns of your data are in R, type labels(my.data).
Note, the edit command will NOT edit your data. It just brings it up in an editor window. If you want to actually make changes to your data, either creating a new data object, or overwriting your old data, you need to do something like my.new.data<-edit(my.data)
4. Univariate Statistics

For univariate statistics on a vector of values, the basics are:

mean(values)
sd(values)
var(values)
length(values) #for the sample size
shapiro.test(values)
As a side note, the # denotes a comment. Anything on a line after that # will not be parsed. You can also get more information with summary(values) and see a boxplot with boxplot(values).
But, let's say you have imported data from a csv, and it's multivariate data. Tricky! Here, you need to use the following syntax for any of the above functions
with(data_name, by(response, treatment, function))
#or, use $ syntax instead of with
by(data_name$response, data_name$treatment, function)
The with says "use data_name as your dataset!", the by says "show me the values of response, split by treatment, and apply function to them." So, to get the mean of a dataset's response by treatment, you would use with(dataset, by(response, treatment, mean)). To plot your data, by treatment, you use a slightly different syntax. If you were to use the with-by method, you would get overlapping plots of each treatment, which would be somewhat useless. Instead, use
boxplot(response ~ treatment, data=my.data)
Note how the ~ is standing in here for by. Why can you not use this syntax with univariate tests? Beats me! But, you WILL be using this syntax with 99% of your later statistical tests, so get used to it! In fact, here's another dose. If you wish to look at the simple effects of a two-way model, try this
boxplot(response ~ treatment1:treatment2, data=my.data)
Unlike SAS, where * signifies an interaction, in R the interaction operator is :. R does use *, but it has the same meaning as SAS's |, e.g. do the full factorial of all of these treatments for my statistical model. More on that later.
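To spell that out, these two calls fit exactly the same model:

lm(response ~ treatment1*treatment2, data=my.data)
lm(response ~ treatment1 + treatment2 + treatment1:treatment2, data=my.data) #the * expanded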
There is also a function plotMeans (note the caps; it comes with the Rcmdr package) which can plot simple effects with error bars beautifully. It's the command I use the most for ANOVAs.
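A sketch of how I call it; the error.bars argument here is from my reading of the Rcmdr version of plotMeans, so double-check against ?plotMeans:

library(Rcmdr)
with(my.data, plotMeans(Response, Treatment1, Treatment2, error.bars="se"))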
Instead of with, you can use the attach statement. This will make all of the column headings in your data file have meaning in the global environment. Why might you not want to do this? Let's say you had a program analyzing two different data files, and each had a column with the same name. If the values were not the same, however, things would be awfully messy. If you must use attach(dataset), once you are done with the dataset, please, be hygienic, and detach(dataset).
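In practice:

attach(my.data)
mean(Response)  #column names now work on their own
detach(my.data) #be hygienic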
5. Regression

  5.1 Simple Linear Regression

The workhorse of regression in R is lm, or linear model. Below is the code to run a simple linear regression, and plot it.
my.slr<-lm(response~treatment, data=dataset)
my.slr #see the stats
anova(my.slr) #see the type I analysis of deviance table
plot(response~treatment, data=dataset)
abline(my.slr)
Note a few things. First, we are creating a linear model object, so anova is needed to get the statistical table out of it. Just typing the object name gives you a table of parameter estimates. Note: anova works on sequential type I sums of squares. To get type III (which is what you should use if you're being conservative - especially for multiple linear regression - SAS and JMP use type III for regressions) use Anova from the car library. Secondly, we can plot the line quite easily for SLR with abline. What about assumptions and error? You can examine some of those qualities graphically with the following - and this will be the same for most stats from here on out.
par(mfrow=c(2,2)) #sets plotting to a 2x2 format
plot(my.slr)
This gives you a Fitted v. residual plot, a Q-Q plot, a Standardized residual v. fitted plot, and a Cook's Distance plot. What about something more rigorous? Try the following code (after setting your plot back to a 1x1)
my.errorfit<-lm(residuals(my.slr)~fitted(my.slr))
plot(residuals(my.slr)~fitted(my.slr))
abline(my.errorfit)
summary(my.errorfit)
This gives both plots and gives you the regression results for fitted against residuals. residuals and fitted can be used as methods to generate vectors wherever you wish. For example, if you wish to do an observed against predicted plot, try creating a new regression fit with my.goodfit<-lm(Response ~ fitted(my.slr), data=dataset)
  5.2 Multiple Linear Regression

Multiple linear regression works the same way; the following two models are identical:

my.lm <- lm(response ~ treatment1 + treatment2 + treatment1:treatment2, data=dataset)
my.lm <- lm(response ~ treatment1*treatment2, data=dataset)
If you wish to do a stepwise model selection, create the full model, and then use step(my.lm)
to get the best model based on minimization of the AIC.
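For instance, a quick sketch:

full.lm <- lm(response ~ treatment1*treatment2, data=dataset)
best.lm <- step(full.lm) #drops terms until the AIC stops improving
summary(best.lm)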
You can also use curve (described below in GLM) here to great effect, letting one of the variables be continuous, while holding another constant. Then just plot multiple lines at different levels of the second variable, and, voilà, you have one hot looking graph.
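A minimal sketch of that trick, assuming a fit with two continuous predictors x1 and x2 (all names here hypothetical):

my.lm <- lm(y ~ x1 + x2, data=my.data)
plot(y ~ x1, data=my.data)
#one line per value at which we hold x2 constant
for(x2.level in c(1, 5, 10)) {
  curve(predict(my.lm, data.frame(x1=x, x2=x2.level)), add=TRUE)
}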
  5.3 Generalized Linear Models

For generalized linear models, you leave lm and go to glm, which is subtly different in the things you can do. Many of the functions are the same once you create a glm object, but some are not. To be brief, the following code will create a glm object, give you its output, and plot the function overlaid on your data points.
my.glm <- glm(Response ~ Treatment,
    data = my.data,
    family = poisson(link = "log"))
summary(my.glm) #see the output
plot(Response ~ Treatment, data = my.data)
my.curve <- function(x) predict(my.glm,
    data.frame(Treatment=c(x)),
    type = "response")
curve(my.curve(x), add = TRUE)
There's a lot here. First off, note that you specify the error family. FYI, if you wish to use the negative binomial family, you need to load the package MASS and use glm.nb. Each family will only take certain link functions. ?family to find out what goes with what. Next, you'll note that we can't just use abline. Bummer, but, we are using nonlinear fits. Instead, you have to specify a function. Rather than type out the actual function, and have to modify the coefficients, use predict. The type="response" tells predict to use the nonlinear fit with the coefficients, rather than use them in a linear equation. curve then uses the function you just created to draw the curve, and add=TRUE makes sure that your curve is overlaid on the already plotted points. If you had no add statement, you could instead say from=xmin, to=xmax, and you'll get a curve that spans xmin to xmax.
  5.4 Fitting Nonlinear Models Using Least Squares

If your model isn't linear at all, use nls, giving it the formula and starting values for the search:

my.nls<-nls(Response ~ a + b*Treatment^2, data=my.data, start=list(a=1, b=1))

This begins a series of iterations that stop when the least squares of the error have been minimized. You can plot this function in the same manner as the glm above.
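That is, something like:

plot(Response ~ Treatment, data=my.data)
nls.curve <- function(x) predict(my.nls, data.frame(Treatment=x))
curve(nls.curve(x), add=TRUE)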
6. Comparing Discrete Treatment Effects

  6.1 Chi-Squared

For a simple goodness of fit test on count data, use chisq.test(my.data$countdata). If you are dealing with a contingency table (i.e. 2x2), you need to do a bit more weedling. You will need to start with your data in a single column. For example, if your table is:
        | Treatment 1 | Treatment 2
Group A |     10      |     20
Group B |     70      |     80
Read the counts down each column into a single vector, then fold that vector into a matrix (matrix fills column by column by default):

data.vector<-c(10,70,20,80)
data.table<-matrix(data=data.vector, nrow=2, ncol=2)
my.chi<-chisq.test(data.table)
my.chi #print the results
  6.2 T-Tests
my.t.test <- t.test(Response ~ Treatment, data=my.data, alternative="two.sided")
my.t.test #print the results
The alternative argument can also be "greater" or "less" for a one tailed test, depending on which tail you are using.
  6.3 ANOVA and ANCOVA

You can use lm to calculate an ANOVA. You can then use Anova to get the relevant analysis of deviance tables, as summary merely gives you parameter values. There is also aov. I find aov useful if some of my treatments could be construed as continuous variables, as aov will treat them as categories.
But first, a brief digression on sums of squares. There is no quicker way to start a holy war about hypothesis testing than to ask a question about type II or III sums of squares on the R discussion list. I know, I've tried. Part of the problem is that many of us have been using something like SAS or JMP for years, and want to know how to get the equivalent output. Well, there's the rub. It appears that SAS uses an awfully strange definition of what is the type III sums of squares method. R is consistent. Or so I've been told. If you want to achieve parity between the two different stats packages, when doing a strict regression model, type III in SAS = type III in R. Use it. When doing an ANOVA, type III in SAS = type II in R. I know. Odd. But, I've tested this on the same data set, and found it to be true. The problem comes in when doing an ANCOVA or other glm. There, type III in SAS = some hybrid between type II and III in R. From my relatively ignorant perspective (I admit this), I would advise using type II for an experiment where one is testing a hypothesis, as interactions without a main effect make little sense. For modelling uncontrolled pattern data, perhaps type III would be more appropriate? If someone thinks this is poppycock and has a better explanation, please, let me know, and I will correct this portion of the guide. To get your sums of squares of choice, use Anova from the package car. The following will produce the same output:
library(car) #load your library
my.lm.anova <- lm(Response ~ Treatment, data = my.data)
anova(my.lm.anova)
Anova(my.lm.anova, type=c("II"))
my.anova <- aov(Response ~ Treatment, data = my.data)
summary(my.anova)
Anova(my.anova, type=c("II"))
To run a factorial ANOVA, just use *. Also, you can use lm, if you want. The following three examples will give you the same output.
#1
my.anova <-aov(Response ~ Treatment1 + Treatment2 + Treatment1:Treatment2, data=my.data)
Anova(my.anova, type=c("II"))
#2
my.anova2 <-aov(Response ~ Treatment1*Treatment2, data=my.data)
Anova(my.anova2, type=c("II"))
#3
my.anova3 <- lm(Response ~ Treatment1*Treatment2, data=my.data)
Anova(my.anova3, type=c("II"))
To see the interaction, use the following
with(my.data, interaction.plot(Treatment1, Treatment2, Response))
If one of your terms is nested (e.g. you have replicates and subsamples of those replicates) add the effect as replicate %in% treatment to your lm (see the sketch below). Note, lm will assume any variables that are composed of numbers are indeed continuous, unless you create a factor for the variable. For example, let's say you wanted to turn a continuous treatment into a categorical one in my.data. Just use my.data$CategoricalTreatment<-as.factor(my.data$Treatment).
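Concretely, a sketch of such a nested call:

nested.lm <- lm(response ~ treatment + replicate %in% treatment, data=my.data)
anova(nested.lm)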
To make sure you have not violated the assumptions of homoscedasticity and normally distributed residuals with a mean of 0, you should use a Levene test and a Shapiro test, respectively. For the preceding example, you would use the following code:

#check homoscedasticity (levene.test lives in the car library)
with(my.data, levene.test(Response, Treatment1:Treatment2))
#check the residuals
shapiro.test(my.anova3$residuals)
  6.4 Post-Hocs

For post-hocs, you have the base TukeyHSD command and the broader simint command from the multcomp package. The simint command can do a wide variety of different multiple comparisons, and is also useful for ANCOVA and other more complicated models that the base TukeyHSD cannot handle.
my.anova<-aov(Response ~ Treatment, data=my.data)
my.tukey<-TukeyHSD(my.anova, "Treatment", ordered=TRUE)
my.tukey #print the results
plot(my.tukey)
The ordered argument will at least let you see factors in their order by means. The output will be confidence intervals; if an interval does NOT cross 0, that indicates a significant difference between two treatments. This can be difficult to grasp in the text, but the plot makes it pretty clear. It's then up to you to assign grouping values, etc. You can get similar results for other sorts of tests using simint in the multcomp library (which also does more than just Tukey tests, and is useful for ANCOVA and more complicated models) like so:
#now with simint
library(multcomp)
my.tukey<-simint(Response ~ Treatment, data=my.data, whichf="Treatment", type="Tukey")
summary(my.tukey)
plot(my.tukey)
I know, this lack of groupings is pretty unsatisfying, but, oh well! Go write something that'll do the grouping for you! And write a Ryan's Q test while you're at it...
You can also run a priori contrasts with fit.contrast from the gmodels library. Set up a contrast matrix, with one row per contrast:

library(gmodels)
my.aov<-aov(Response ~ Treatment, data=my.data)
contr.matrix <- rbind ("Control v. Treatment" =c(-3,1,1,1),
        "Trt2 v Trt 3 and 4" =c(0,-2,1,1))
fit.contrast(my.aov, "Treatment", contr.matrix)
Personally, I like this as it forces you to think out your contrast matrix structure, and make sure it's orthogonal. The output is nice and straightforward as well.
  6.5 Mixed Models

For a mixed model fit by least squares, you provide the error term yourself:

attach(my.data)
nested.ls <- aov(response ~ treatment + Error(treatment:replicate))
summary(nested.ls)
And here is the same model by REML, using lme from the nlme library:

library(nlme)
nested.reml <- lme(response ~ treatment, random=~treatment|replicate)
anova.lme(nested.reml, type="marginal") #gets fixed effects
nested.effects <-lm(response ~ treatment)
anova.lme(nested.reml, nested.effects, type="marginal") #shows effect of the random variable
This shows three important things. 1) For least squares, you're on your own to provide your own error terms in a mixed model. 2) You can only get sequential sums of squares for the least squares method, as Anova from the car package doesn't work on an aov fit with an Error term, and Error doesn't work with lm. 3) You can, however, get marginal sums of squares using REML. Yay REML! It's what all of the cool kids are doing these days. The syntax can be a little dicey, but, if you want to compare it to what you used to do with proc mixed, there's a library called SASmixed which has R code for examples from published texts on proc mixed. The only bugaboo here is that you need to compare models with and without random effects to get the p values for random effects (see the sketch below). With complex models, things can get dicey, so, be careful.
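A sketch of that with-and-without comparison for the p value of a random effect; gls, the fixed-effects-only analogue of lme, also lives in nlme:

library(nlme)
with.random <- lme(response ~ treatment, random=~1|replicate, data=my.data)
no.random <- gls(response ~ treatment, data=my.data)
anova(with.random, no.random) #likelihood ratio test for the random effect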
  6.6 MANOVA

To run a MANOVA, first bind your multiple responses into a single matrix, then feed that to manova:

aggregate_response <- with(my.data, cbind(response1, response2))
my.manova <- manova(aggregate_response ~ Treatment, data=my.data)
summary.aov(my.manova) #gives univariate ANOVA responses
summary.manova(my.manova, test="Wilks") #or use your favorite MANOVA test
I'm still not quite sure how to extract canonical values, and make a proper centroid plot. Any advice on this would be helpful.
  6.7 Repeated Measures ANOVA

You can run a repeated measures ANOVA using aov with the appropriate error term (where each Replicate was sampled multiple times) as follows
my.rm.anova<-aov(Response ~ Treatment*Time + Error(Time:Replicate), data=my.data)
summary(my.rm.anova)
There are serious problems with this method if sphericity is at all violated (i.e. the difference between time point 1 and time point 2 is not the same as between 2 and 3, or 3 and 4, etc). To solve this, it is best to use a repeated measures MANOVA approach which will report adjusted p values using two different approaches. Basically, you are comparing multivariate linear models here, so the code can get a little long. The code is as follows, assuming that you have your data in the format of:
Treatment | Time1 | Time2 | Time3 |
Control | 1 | 2 | 3 |
Trt1 | 5 | 8 | 9 |
Control | 1 | 1 | 3 |
Trt1 | 5 | 7 | 9 |
Control | 1 | 2 | 2 |
Trt1 | 5 | 6 | 10 |
response <- with(my.data, cbind(Time1, Time2, Time3))
#first, fit the full and reduced multivariate linear models
mlmfit <- lm(response ~ Treatment, data=my.data)
mlmfit1 <- lm(response ~ 1)
mlmfit0 <- lm(response ~ 0)
#now, do the testing for effects
anova.mlm(mlmfit, mlmfit1, M=~1) #tests for the treatment effect
anova.mlm(mlmfit1, mlmfit0, X=~1, test="Spherical") #tests for the time effect
anova.mlm(mlmfit, mlmfit1, X=~1, test="Spherical") #tests the interaction
Yes, this works. No, I don't quite know the mechanics of why. This was code given to me by a statistician, so take it with that grain of salt, and get thee a better understanding of using repeated measures for MANOVA before you entirely trust this. But, that said, it should be all kosher.
7. Multivariate Techniques

  7.1 Principal Components Analysis

To run a PCA, bind your variables into a matrix and use princomp:

my.matrix<-with(my.data, cbind(factor1, factor2, factor3....))
my.pc<-princomp(my.matrix, cor=TRUE) #does the PCA on correlations, not covariances
summary(my.pc) #gives the importance of each PC
loadings(my.pc) #gives the coefficients to create each PC
biplot(my.pc) #shows a plot of your PCs
my.pc$scores #shows the values of each PC, can be used as new data
If you are really slick, and have X11 on your computer along with xgobi, you can then use the xgobi library to visualize your data with xgobi(cbind(my.matrix, my.pc$scores)).
  7.2 Structural Equation Modeling and Path Analysis

#first, check your data for multivariate normality
library(mvnormtest)
mshapiro.test(t(my.data[1:99, 1:4]))
#assuming all is well, calculate the covariance matrix
cov_matrix<-var(my.data)
#if, however, you have only been given a correlation matrix
#and a vector of standard deviations, calculate the cov_matrix
#as in the following again with x1, x2, x3, and y
#note: correlation values are made up
sd.vect<-c(std.dev.x1, std.dev.x2, std.dev.x3, std.dev.y)
corr_matrix<-matrix(c(1.0000000, 0.8627184, -0.6551867, -0.2508666,
0.8627184, 1.0000000, -0.6302580, 0.2680659,
-0.6551867, -0.6302580, 1.0000000, 0.1432521,
-0.2508666, 0.2680659, 0.1432521, 1.0000000), nrow=4, ncol=4)
cov_matrix<-outer(sd.vect, sd.vect)*corr_matrix
Now, assuming all is well, let's specify the model using the RAM format. You have an a priori hypothesis that X2 and X3 are both in part determined by X1, and that all three feed into Y. Your path model (as this is a path analysis - there are no latent variables - yet) should look like this:

[path diagram: X1 -> X2, X1 -> X3, X1 -> Y, X2 -> Y, X3 -> Y]

You build this with the specify.model function from the sem library, and then list all of the links in the following format: Var1 -> Var2, path coefficient name, starting value for ML search. You also MUST specify an error term for each variable like so: Var <-> Var, error coefficient name, starting value. Note the double headed arrow. Double headed arrows are also how you represent covariances between two variables in your model. As model specification from the prompt is interactive, in code for a script, you end with an extra blank line and the name of your model. To set up a model for the path diagram above, use the following code.
library(sem)
model.path <- specify.model()
X1 -> Y, xy1, NA
X1 -> X3, x13, NA
X1 -> X2, x12, NA
X2 -> Y, xy2, NA
X3 -> Y, xy3, NA
X3 <-> X3, x3error, NA
X2 <-> X2, x2error, NA
Y <-> Y, yerror, NA
model.path
Great. Now we want to fit our model, and look at its χ² goodness of fit, as well as the significance values from Wald tests for each path, the coefficient values, and the standardized coefficients. For that, we use the following
model.path.sem<-sem(model.path, cov_matrix, N=99, fixed.x="X1")
#note, if your covariance matrix doesn't have row and column names (i.e. you
#calculated it from a correlation matrix and std dev vector), you need to
#tell sem what the row and column names are, as in the following syntax
#that does the same thing
obs.var.names<-c('X1', 'X2', 'X3', 'Y')
model.path.sem<-sem(model.path, cov_matrix, N=99, obs.var.names, fixed.x="X1")
#get the coefficient values, and see relevant statistics about various paths
#and the model as a whole
summary(model.path.sem)
#check the standardized coefficients
std.coef(model.path.sem)
So, does your model fit the data? The sem package provides several indices to help you out. Specifically, the Goodness of fit index, the Bentler-Bonnet NFI, and the Bentler CFI should all be greater than 0.9. The CFI is more reliable than the NFI with small sample sizes (Palomares et al, 1998 Journal of Ecology).
#output the model to a dotfile from which you can generate an image
path.diagram(model.path.sem, "model.path.sem.dot", edge.labels="values")
Note that this outputs text to a file model.path.sem.dot, which is a dotfile. Dotfiles are wonderful little creations that can be read by the program graphviz. You can actually just open the file in a text editor and add detail (line widths, standardized coefficients, and more); the dot language specification is quite straightforward. Of course, there are more GUI oriented programs out there that read dotfiles and let you edit to your heart's content. OSX installs often come with OmniGraffle installed by default. If you know of any others you'd like to see listed here, let me know. Finally, if you want a p value for your model's χ² statistic, you can get one with pchisq.
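For example (the χ² value and degrees of freedom here are made up):

pchisq(4.2, df=3, lower.tail=FALSE) #p value for a model chi-square of 4.2 on 3 df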
Appendix A: Random Graphing Notes

Some useful plot parameters:

par(cex=1.5) #specifies the ratio of text to numbers on the axis
plot(Y~X,
  font=2, #bold font
  xlab="X Axis Label",
  ylab="Y Axis Label",
  pch=19) #makes the points filled circles
Use ?par to find a lot more about creating graph windows. You can also specify line thickness with lwd=3, as well as a color with col="red". To color your points by some grouping factor, try
plot(y~x,
  col=c("red", "blue", "green")[unclass(Grouping)]
)
This also works for pch
and other similar properties.
To plot means with standard error bars, first calculate the means and SEs, then draw the bars with segments:

data.means<-with(my.data, by(Response, Treatment, mean))
data.se<-with(my.data, by(Response, Treatment, function(x) sd(x)/sqrt(length(x))))
data.summary<-data.frame(means=as.vector(data.means),
  se=as.vector(data.se),
  levels=c(list,of,levels))
plot(data.summary$levels, data.summary$means)
segments(data.summary$levels, data.summary$means+data.summary$se,
  data.summary$levels, data.summary$means-data.summary$se)
You can adjust xlim=range(..) and ylim as you will in the plot statement. If you have a multifactorial thang that you want to plot, you can split up your dataset using subset, calculate all of your means and standard errors, then moosh it back into a single dataframe and plot from there (see the sketch below). One day, I will write a function to do this for n factors so I don't have to think about it any more. One day...
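Here's a minimal sketch of that subset-and-recombine dance for two factors (column and level names hypothetical):

sub.a <- subset(my.data, Trt2 == "A")
sub.b <- subset(my.data, Trt2 == "B")
means.a <- with(sub.a, by(Response, Trt1, mean))
means.b <- with(sub.b, by(Response, Trt1, mean))
#moosh it all back into one dataframe for plotting
all.means <- data.frame(Trt1=rep(names(means.a), 2),
  Trt2=rep(c("A", "B"), each=length(means.a)),
  means=c(as.vector(means.a), as.vector(means.b)))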
Appendix B: Installing R Commander and Xgobi on OSX

Check whether anything already lives in /usr/local with ls /usr/local/bin and ls /usr/local/lib. If neither returns anything, or the directories don't exist, then you can run the following commands at the prompt:
sudo rm -rf /usr/local
sudo ln -s /sw /usr/local
5) At this point, you should be able to run Rcommander after you install it using the package installer. To run Rcommander, start up X11 first, then simply load the package from the package manager. There will be a few associated packages you will need to install, but you'll know what those are the first time you try to click on "load R Commander" and, instead of getting a new set of windows, you get an error message saying "package X not found". Install it and keep going.

To build Xgobi, edit its Makefile so that IDIR=/usr/X11R6/include and LDIR=/usr/X11R6/lib. Also, fix the corresponding code in the lint statement. Then just type make. This SHOULD generate xgobi and xgvis. You need to have a directory to plunk them into, so type mkdir /usr/local/bin. If you don't have a /usr/local already, make it first. Then copy the two files into /usr/local/bin. In R, load the xgobi library and type xgobi(my.data) to pop up a data visualization window.