1. Introduction
1.1 General Notes
1.2 Getting Help Inside of R
1.3 Helpful Links
2. Writing and Running Programs
2.1 The Basics
3. Data Entry
4. Univariate Statistics
5.1 Simple Linear Regression
5.2 Multiple Linear Regression
5.3 Generalized Linear Models
5.4 Fitting Nonlinear Models Using Least Squares
6. Comparing Discrete Treatment Effects
6.3 ANOVA and ANCOVA
6.5 Mixed Models
6.7 Repeated Measures ANOVA
7. Multivariate Techniques
7.1 Principal Components Analysis
7.2 Structural Equation Modeling and Path Analysis
8. The Kitchen Sink
Appendix A: Random Graphing Notes
Appendix B: Installing R Commander and Xgobi on OSX
Appendix C: Changelog
1) SAS - Not only does it have all of the basic statistical tools at your easy disposal, but it also contains library upon library of obscure and powerful statistical methods that you might need at a moment's notice. That, and the code, once you wrap your head around the basic syntax, is very simple. You can do an ANOVA on your dataset in three lines. All of this said, R is also just like S, or S-Plus, or whatever flavor of S you have worked on, in that it essentially is a free open source implementation of S. Rock.
2) MATLAB - It operates from a command line. After running your analytical program, if you want to do additional things with the results, you don't need to write and run a whole new program; you can just do it at the prompt. Similarly, you can just run your analyses at the command line without ever writing a separate standalone program. Also like MATLAB, it is a fully programmable language. Any modeling, graphing, etc., that you can do in MATLAB, you can do in R. There's even a package that makes R use MATLAB syntax, if you really don't want to learn any new tricks.
3) JAVA - This is where some folks who are used to the functional approach of SAS and MATLAB will stumble at first. R is largely object oriented. This means that, instead of saying "run my anova!" and getting results printed up, you create an anova object with your data. The object contains every bit of information about the ANOVA you would ever want. To see this information, or work with it later, you need to call a function on the ANOVA object itself! E.g.

my.anova <- aov(response ~ treatment); summary(my.anova)

For future reference, text like that is R code.
4) PERL - Like PERL, R can be very opaque. But, like PERL, R is part of the open source community. It has a HUGE number of people out there constantly tweaking, modifying, writing new statistical methods, and answering questions about it. Go to http://www.r-project.org to begin to get a sense of it. There are wonderful mailing lists with searchable archives, and people always willing to give you pointers.
You can get help on any function by typing ?function, or by asking for an example with example(function). Also, note, if you NEED a graphical point n' click interface, and coding terrifies you, there is an option. Make sure you have X11 installed on your computer with the tcltk library, and then you can run the package R Commander, which creates a point and click interface for R. It's not as powerful, but it can yield up some code snippets for you to use later on. See Appendix B for detailed install instructions.
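For instance, a hypothetical help session for boxplot (any function name works the same way) looks like:

```r
?boxplot             #opens the help page for boxplot
help.search("anova") #searches all installed packages for a topic
example(boxplot)     #runs the examples from the boxplot help page
```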
Or set your working directory to be the same as the file, and you can leave out the path. R uses UNIX directory syntax. There should also be a menu entry to change your working directory. FYI, I often write my data analysis code in the same directory as my datasets; that way I don't have to bother with remembering all sorts of directory paths.
To use an add-on package, put library(packagename) in your program.
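For instance, to use the car package discussed later (a sketch, assuming you haven't installed it yet):

```r
install.packages("car") #downloads from CRAN; only needed once
library(car)            #needed in every session or program that uses it
```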
But this gets tedious rather quickly. FYI, a few language pointers from the statement

vector <- c(1, 2, 3, 4)

vector is the name of the object you are assigning your data to. <- is the symbol of that assignation. c combines the arguments in its parentheses into a vector. To avoid the tedium, let's assume you've entered your data into excel, gnumeric, openoffice, or whatever. Save them as a csv (comma separated values file). Then in R, you can use the easy command:
my.data <- read.csv("/path/to/data/data.csv")
Note the command read.csv. It's that easy. There's also a nice read.delim for other kinds of text files. You now have an object for your data, my.data, that you can do many wonderful things with in the future. If you ever want to look at your data in tabular format again, type edit(my.data). If you ever want to see what the columns of your data are in R, type names(my.data). As a side note, the edit command will NOT edit your data. It just brings it up in an editor window. If you want to actually make changes to your data, either creating a new data object or overwriting your old data, you need to do something like my.data <- edit(my.data).
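Beyond edit, two standard inspection functions (not covered above, but worth knowing) are str and head:

```r
my.data <- read.csv("/path/to/data/data.csv") #as above; the path is a stand-in
names(my.data) #the column names
str(my.data)   #the type of each column
head(my.data)  #the first six rows
```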
length(values) #for the sample size

The # denotes a comment. Anything on a line after that will not be parsed. You can also get more information with summary(values) and see a boxplot with boxplot(values). But, let's say you have imported data from a csv, and it's multivariate data. Tricky! Here, you need to use the following syntax for any of the above functions:
with(data_name, by(response, treatment, function))
#or, use $ syntax instead of with
by(data_name$response, data_name$treatment, function)
Here, with says "Use data_name as your dataset!", and by says "show me the values of response, split by treatment, and apply function to them." So, to get the mean of a dataset's response by treatment, you would use
with(dataset, by(response, treatment, mean)). To plot your data by treatment, you use a slightly different syntax. If you were to use the with-by method, you would get overlapping plots of each treatment, which would be somewhat useless. Instead, use

boxplot(response ~ treatment, data=my.data)

Note how the ~ is standing in here for by. Why can you not use this syntax with univariate tests? Beats me! But you WILL be using this syntax with 99% of your later statistical tests, so get used to it! In fact, here's another dose. If you wish to look at the simple effects of a two-way model, try this:
boxplot(response ~ treatment1:treatment2, data=my.data)

Unlike SAS, : signifies an interaction. * is also used, but it has the same meaning as SAS's |, e.g. do the full factorial of all of these treatments for my statistical model. More on that later.
Also take a look at plotMeans (note the caps), which can plot simple effects with error bars beautifully. It's the command I use the most for ANOVAs.
Instead of with, you can use the attach statement. This will make all of the column headings in your data file have meaning in the global environment. Why might you not want to do this? Let's say you had a program analyzing two different data files, and each had a column with the same name. If the values were not the same, however, things would be awfully messy. If you must use attach(dataset), once you are done with the dataset, please, be hygienic, and detach(dataset).
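Here's a made-up example of exactly how attach bites, with two data frames sharing a column name:

```r
d1 <- data.frame(mass = c(1, 2, 3))
d2 <- data.frame(mass = c(10, 20, 30))
attach(d1)
mean(mass) #uses d1$mass
attach(d2) #d2's mass now masks d1's
mean(mass) #silently uses d2$mass instead
detach(d2)
detach(d1)
```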
Simple linear regression is done with lm, or linear model. Below is the code to run a simple linear regression, and plot it.
Note a few things. First, we are creating a linear model object:

my.slr <- lm(response ~ treatment, data=my.data)
my.slr #see the stats
anova(my.slr) #see the type I analysis of deviance table

anova is needed to get the statistical table out of it. Just typing the object name gives you a table for parameter estimates. Note: anova works on sequential type I sums of squares. To get type III (which is what you should use if you're being conservative - especially for multiple linear regression - SAS and JMP use type III for regressions) use Anova from the car library. Secondly, we can plot the line quite easily for SLR with abline(my.slr). What about assumptions and error? You can examine some of those qualities graphically with the following - and this will be the same for most stats from here on out.
par(mfrow=c(2,2)) #sets plotting to a 2x2 format
plot(my.slr)

This gives you a fitted v. residual plot, a Q-Q plot, a standardized residual v. fitted plot, and a Cook's distance plot. What about something more rigorous? Try the following code (after setting your plot back to a 1x1). This gives both plots and gives you the regression results for fitted against residuals.
residuals and fitted can be used as methods to generate vectors wherever you wish. For example, if you wish to do an observed against predicted plot, try creating a new regression fit

my.goodfit <- lm(Response ~ fitted(my.lm), data=dataset)
If you wish to do a stepwise model selection, create the full model:

my.lm <- lm(response ~ treatment1 + treatment2 + treatment1:treatment2)
#or, equivalently
my.lm <- lm(response ~ treatment1*treatment2)

Then use step(my.lm) to get the best model based on minimization of the AIC.
You can use curve (described below in GLM) here to great effect, letting one of the variables be continuous, while holding another constant. Then just plot multiple lines at different levels of the second variable, and, voilà, you have one hot looking graph.
For generalized linear models, we leave lm and go to glm, which is subtly different in the things you can do. Many of the functions are the same once you create a glm object, but some are not. To be brief, the following code will create a glm object, give you its output, and plot the function overlaid on your data points.
my.glm <- glm(Response ~ Treatment,
    data = my_data,
    family = poisson(link = "log"))
summary(my.glm) #the output

plot(Response ~ Treatment, data = my_data)
my.curve <- function(x) predict(my.glm,
    newdata = data.frame(Treatment = x),
    type = "response")
curve(my.curve(x), add = TRUE)

There's a lot here. First off, note that you specify the error family. FYI, if you wish to use the negative binomial family, you need to load the package MASS and use glm.nb. Each family will only take certain link functions. Use ?family to find out what goes with what. Next, you'll note that we can't just use abline. Bummer, but, we are using nonlinear fits. Instead, you have to specify a function. Rather than type out the actual function, and have to modify the coefficients, use predict (note the newdata argument, which lets the function evaluate the fit at arbitrary x values). type="response" tells predict to use the nonlinear fit with the coefficients, rather than use them in a linear equation. curve then uses the function you just created to draw the curve, and add=TRUE makes sure that your curve is overlaid on the already plotted points. If you had no add statement, you could instead say from=xmin, to=xmax, and you'll get a curve that spans xmin to xmax.
my.nls <- nls(Response ~ a + b*Treatment^2, data=my.data, start=list(a=1, b=1))

This begins a series of iterations that stop when the least squares of the error have been minimized. You can plot this function in the same manner as the glm above.
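As a sketch of that plot (using the hypothetical my.nls fit above and the same predict-and-curve trick from the GLM section):

```r
plot(Response ~ Treatment, data=my.data)
nls.curve <- function(x) predict(my.nls, newdata=data.frame(Treatment=x))
curve(nls.curve(x), add=TRUE)
```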
For count data, a chi-squared test is just chisq.test(countdata). If you are dealing with a contingency table (i.e. 2x2), you need to do a bit more wheedling. You will need to start with your data in a single column - a vector read down the columns of your table (Treatment 1 first, then Treatment 2) - and then shape it into a matrix:

data.table <- matrix(data=data.vector, nrow=2, ncol=2)
chisq.test(data.table)
my.t.test <- t.test(Response ~ Treatment, data=my.data, alternative="two.sided")
my.t.test #see the results

alternative can also be "greater" or "less" for a one tailed test, depending on which tail you are using.
You can use lm to calculate an ANOVA. You can then use Anova to get the relevant analysis of deviance tables, as summary merely gives you parameter values.
Or, use aov. I find aov useful if some of my treatments could be construed as continuous variables; aov will treat them as categories. But first, a brief digression on sums of squares. There is no quicker way to start a holy war about hypothesis testing than to ask a question about type II or III sums of squares on the R discussion list. I know, I've tried. Part of the problem is that many of us have been using something like SAS or JMP for years, and want to know how to get the equivalent output. Well, there's the rub. It appears that SAS uses an awfully strange definition of what is the type III sums of squares method. R is consistent. Or so I've been told. If you want to achieve parity between the two different stats packages: when doing a strict regression model, type III in SAS = type III in R. Use it. When doing an ANOVA, type III in SAS = type II in R. I know. Odd. But I've tested this on the same data set, and found it to be true. The problem comes in when doing an ANCOVA or other glm. There, type III in SAS = some hybrid between type II and III in R. From my relatively ignorant perspective (I admit this), I would advise using type II for an experiment where one is testing a hypothesis, as interactions without a main effect make little sense. For modelling uncontrolled pattern data, perhaps type III would be more appropriate? If someone thinks this is poppycock and has a better explanation, please, let me know, and I will correct this portion of the guide.
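To make the digression concrete, here is a hypothetical two-way fit and the car calls that request each type (my.data, Treatment1, and Treatment2 are stand-ins):

```r
library(car)
#for type III tests to be sensible, use sum-to-zero contrasts first
options(contrasts = c("contr.sum", "contr.poly"))
my.lm <- lm(Response ~ Treatment1*Treatment2, data=my.data)
Anova(my.lm, type="II")  #type II sums of squares
Anova(my.lm, type="III") #type III sums of squares
```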
To get type II or III sums of squares, use Anova from the package car. The following will produce the same output:

library(car) #load your library
my.lm.anova <- lm(Response ~ Treatment, data = my.data)
Anova(my.lm.anova)
my.anova <- aov(Response ~ Treatment, data = my.data)
Anova(my.anova)

To run a factorial ANOVA, just use *. Also, you can use lm, if you want. The following three examples will give you the same output:
my.anova <- aov(Response ~ Treatment1 + Treatment2 + Treatment1:Treatment2, data=my.data)
my.anova2 <- aov(Response ~ Treatment1*Treatment2, data=my.data)
my.anova3 <- lm(Response ~ Treatment1*Treatment2, data=my.data)

To see the interaction, use the following:

with(my.data, interaction.plot(Treatment1, Treatment2, Response))

If one of your terms is nested (e.g. you have replicates and subsamples of those replicates) add the effect as replicate %in% treatment.
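A minimal sketch of that nesting syntax in a full aov call (column names are stand-ins):

```r
my.nested <- aov(response ~ treatment + replicate %in% treatment, data=my.data)
summary(my.nested)
```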
Note that lm will assume any variables that are composed of numbers are indeed continuous, unless you create a factor for the variable. For example, let's say you wanted to turn a continuous treatment into a categorical one in my.data. Just use my.data$CategoricalTreatment <- as.factor(my.data$Treatment). To make sure you have not violated the assumptions of homoscedasticity and normally distributed residuals with a mean of 0, you should use both a Levene test and a Shapiro test respectively. For the preceding example, you would use the following code:
with(my.data, levene.test(Response, Treatment1:Treatment2))
#check the residuals
shapiro.test(residuals(my.anova))
For post hoc comparisons, there is the TukeyHSD command and the broader simint command from the multcomp package. The simint command can do a wide variety of different multiple comparisons, and is also useful for ANCOVA and other more complicated models that the base TukeyHSD command cannot handle.

my.anova <- aov(Response ~ Treatment, data=my.data)
my.tukey <- TukeyHSD(my.anova, "Treatment", ordered=TRUE)

The ordered argument will at least let you see factors in their order by means. The output will be confidence intervals that, if they pass over 0, indicate a significant difference between two treatments. This can be difficult to grasp in the text, but the plot makes it pretty clear. It's then up to you to assign grouping values, etc. I know, this lack of groupings is pretty unsatisfying, but, oh well! Go write something that'll do the grouping for you! And write a Ryan's Q test while you're at it... You can get similar results for other sorts of tests using simint in the multcomp library like so:

#now with simint
my.tukey <- simint(Response ~ Treatment, data=my.data, whichf="Treatment", type="Tukey")
For a priori contrasts, use fit.contrast (from the gmodels package). Personally, I like this as it forces you to think out your contrast matrix structure, and make sure it's orthogonal. The output is nice and straightforward as well.

my.aov <- aov(Response ~ Treatment, data=my.data)
contr.matrix <- rbind("Control v. Treatment" = c(-3,1,1,1),
                      "Trt2 v Trt 3 and 4"   = c(0,-2,1,1))
fit.contrast(my.aov, "Treatment", contr.matrix)
#you'll need the nlme library for lme and anova.lme
library(nlme)
nested.ls <- aov(response ~ treatment + Error(treatment:replicate))
nested.reml <- lme(response ~ treatment, random=~treatment|replicate)
anova.lme(nested.reml, type="marginal") #gets fixed effects
nested.effects <- lm(response ~ treatment)
anova.lme(nested.reml, nested.effects, type="marginal") #shows effect of the random variable

This shows three important things. 1) For least squares, you're on your own to provide your own error terms in a mixed model. 2) You can only get sequential sums of squares for the least squares method, as Anova from the car package doesn't work with an Error term, and Error doesn't work with lm. 3) You can, however, get marginal sums of squares using REML. Yay REML! It's what all of the cool kids are doing these days. The syntax can be a little dicey, but, if you want to compare it to what you used to do with proc mixed, there's a library called SASmixed which has R code for examples from published texts on proc mixed. The only bugaboo here is that you need to compare models with and without random effects to get the p values for random effects. With complex models, things can get dicey, so, be careful.
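As a sketch of that model-comparison recipe (assuming the nlme library and the same hypothetical columns as above), fit the model with and without the random term and compare:

```r
library(nlme)
m.fixed <- gls(response ~ treatment, data=my.data, method="REML")
m.mixed <- lme(response ~ treatment, random = ~1|replicate,
               data=my.data, method="REML")
anova(m.fixed, m.mixed) #likelihood ratio test for the random effect
```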
my.data$aggregate_response <- cbind(my.data$response1, my.data$response2)
my.manova <- manova(aggregate_response ~ Treatment, data=my.data)
summary.aov(my.manova) #gives univariate ANOVA responses
summary.manova(my.manova, test="Wilks") #or use your favorite MANOVA test

I'm still not quite sure how to extract canonical values and make a proper centroid plot. Any advice on this would be helpful.
For a univariate repeated measures ANOVA, you can use aov with the appropriate error term (where each Replicate was sampled multiple times) as follows:

my.rm.anova <- aov(Response ~ Treatment*Time + Error(Time:Replicate), data=my.data)

There are serious problems with this method if sphericity is at all violated (i.e. the difference between time point 1 and time point 2 is not the same as between 2 and 3, or 3 and 4, etc.). To solve this, it is best to use a repeated measures MANOVA approach, which will report adjusted p values using two different approaches. Basically, you are comparing multivariate linear models here, so the code can get a little long. The code is as follows, assuming that you have your data in a format with one column for Treatment and one column for the response at each time (Time1, Time2, Time3). Yes, this works. No, I don't quite know the mechanics of why. This was code given to me by a statistician, so take it with that grain of salt, and get thee a better understanding of using repeated measures for MANOVA before you entirely trust this. But, that said, it should be all kosher.

response <- with(my.data, cbind(Time1, Time2, Time3))
#first, fit the multivariate linear models
mlmfit <- lm(response ~ Treatment, data=my.data)
mlmfit1 <- lm(response ~ 1)
mlmfit0 <- lm(response ~ 0)
#now, do the testing for effects
anova.mlm(mlmfit, mlmfit1, M=~1) #tests for the treatment effect
anova.mlm(mlmfit1, mlmfit0, X=~1, test="Spherical") #tests for the time effect
anova.mlm(mlmfit, mlmfit1, X=~1, test="Spherical") #tests the interaction
If you are really slick and have X11 on your computer along with xgobi, you can then use the xgobi library to visualize your data with xgobi(my.matrix).
my.matrix<-with(my.data, cbind(factor1, factor2, factor3....))
my.pc<-princomp(my.matrix, cor=TRUE) #does the PCA on correlations, not covariances
summary(my.pc) #gives the importance of each PC
loadings(my.pc) #gives the coefficients to create each PC
biplot(my.pc) #shows a plot of your PCs
my.pc$scores #shows the values of each PC, can be used as new data
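For instance - and this is just an illustrative sketch, assuming a grouping column Group in my.data - the scores make a quick two-component plot:

```r
plot(my.pc$scores[,1], my.pc$scores[,2],
     xlab="PC1", ylab="PC2",
     col=c("red", "blue")[unclass(my.data$Group)]) #color by group
```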
Now, assuming all is well, let's specify the model using the RAM format. You have an a priori hypothesis that X2 and X3 are both in part determined by X1. Your path model (as this is a path analysis - there are no latent variables - yet) should look like this:
#first, check your data for multivariate normality
#assuming all is well, calculate the covariance matrix
#if, however, you have only been given a correlation matrix
#and a vector of standard deviations, calculate the cov_matrix
#as in the following again with x1, x2, x3, and y
#note: correlation values are made up
sd.vect<-c(std.dev.x1, std.dev.x2, std.dev.x3, std.dev.y)
corr_matrix<-matrix(c(1.0000000, 0.8627184, -0.6551867, -0.2508666,
0.8627184, 1.0000000, -0.6302580, 0.2680659,
-0.6551867, -0.6302580, 1.0000000, 0.1432521,
-0.2508666, 0.2680659, 0.1432521, 1.0000000), nrow=4, ncol=4)
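The conversion itself is just a rescaling of the correlation matrix by the standard deviations; one standard way to do it (a sketch; the variable names match the snippet above) is:

```r
#cov = D %*% R %*% D, where D is a diagonal matrix of standard deviations
cov_matrix <- diag(sd.vect) %*% corr_matrix %*% diag(sd.vect)
#note: the result has no row or column names
```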
To build the model, use the specify.model function from the sem library and then list all of the links in the following format: Var1 -> Var2, path coefficient name, starting value for ML search. You also MUST specify an error term for each variable like so: Var <-> Var, error coefficient name, starting value. Note the double headed arrow. Double headed arrows are also how you represent covariances between two variables in your model. As model specification from the prompt is interactive, in code for a script, you end with an extra blank line and the name of your model. To set up a model for the path diagram above, use the following code:

model.path <- specify.model()
X1 -> Y, xy1, NA
X1 -> X3, x13, NA
X1 -> X2, x12, NA
X2 -> Y, xy2, NA
X3 -> Y, xy3, NA
X3 <-> X3, x3error, NA
X2 <-> X2, x2error, NA
Y <-> Y, yerror, NA

Great. Now we want to fit our model and look at its χ² goodness of fit, as well as the significance values from Wald tests for each path, the coefficient values, and the standardized coefficients. For that, we use the following:
So, does your model fit the data? The sem package provides several indices to help you out. Specifically, the Goodness of fit index, the Bentler-Bonett NFI, and the Bentler CFI should all be greater than 0.9. The CFI is more reliable than the NFI with small sample sizes (Palomares et al, 1998 Journal of Ecology).
model.path.sem<-sem(model.path, cov_matrix, N=99, fixed.x="X1")
#note, if your covariance matrix doesn't have row and column names (i.e. you
#calculated it from a correlation matrix and a std dev vector), you need to
#tell sem what the row and column names are, as in the following syntax
#that does the same thing
obs.var.names<-c('X1', 'X2', 'X3', 'Y')
model.path.sem<-sem(model.path, cov_matrix, N=99, obs.var.names, fixed.x="X1")
#get the coefficient values, and see relevant statistics about various paths
#and the model as a whole
summary(model.path.sem)
#check the standardized coefficients
std.coef(model.path.sem)
#output the model to a dotfile from which you can generate an image
path.diagram(model.path.sem, "model.path.sem.dot", edge.labels="values")

Note that this outputs text to a file model.path.sem.dot, which is a dotfile. Dotfiles are wonderful little creations that can be read by the program graphviz. You can actually just open the file in a text editor and add detail (line widths, standardized coefficients, and more). The language specification is quite straightforward. Of course, there are more GUI oriented programs out there that read dotfiles and let you edit to your heart's content. OSX installs often come with OmniGraffle installed by default. If you know of any others you'd like to see listed here, let me know.
par(cex=1.5) #specifies the ratio of text to numbers on the axis
plot(Response ~ Treatment, data=my.data,
     font=2, #font face (2 is bold)
     xlab="X Axis Label",
     ylab="Y Axis Label",
     pch=19) #makes the points filled circles

Use ?par to find a lot more about creating graph windows.
You can also specify lwd=3 for line thickness, as well as a color for each group with

col=c("red", "blue", "green")[unclass(Grouping)]

This also works for pch and other similar properties.
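Putting these pieces together, a full plot call might look like this sketch (Grouping is a hypothetical two-level factor):

```r
with(my.data, plot(Response ~ Treatment,
     pch=c(17, 19)[unclass(Grouping)],        #symbol by group
     col=c("red", "blue")[unclass(Grouping)], #color by group
     lwd=3))
```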
data.means <- with(my.data, by(Response, Treatment, mean))
data.se <- with(my.data, by(Response, Treatment, function(x) sd(x)/sqrt(length(x))))

You can adjust xlim=range(..) and ylim as you will in the plot statement. If you have a multifactorial thang that you want to plot, you can split up your dataset using subset, calculate all of your means and standard errors, then moosh it back into a single dataframe and plot from there. One day, I will write a function to do this for n factors so I don't have to think about it any more. One day...
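The plotting step itself is left out above; here is one sketch using tapply (a close cousin of the by calls above, whose output is easier to do arithmetic on) plus barplot and arrows, with hypothetical column names:

```r
data.means <- with(my.data, tapply(Response, Treatment, mean))
data.se <- with(my.data, tapply(Response, Treatment,
                                function(x) sd(x)/sqrt(length(x))))
mids <- barplot(data.means, ylim=c(0, max(data.means + data.se)))
arrows(mids, data.means - data.se, mids, data.means + data.se,
       angle=90, code=3, length=0.1) #flat caps on both ends of each bar
```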
ls /usr/local/lib. If both don't return anything, or the directories don't exist, then you can run the following commands at the prompt:

sudo rm -rf /usr/local
sudo ln -s /sw /usr/local

5) At this point, you should be able to run Rcommander after you install it using the package installer. To run Rcommander, start up X11 first, then simply load the package from the package manager. There will be a few associated packages you will need to install, but you'll know what those are the first time you try to click on "load R Commander" and, instead of getting a new set of windows, you get an error message saying "package X not found". Install it and keep going.
In the Makefile, set LDIR=/usr/X11R6/lib. Also, fix the corresponding code in the link statement. Then just type make. This SHOULD generate xgobi and xgvis. You need to have a directory to plunk them into, so type mkdir /usr/local/bin. If you don't have a /usr/local already, make it first. Then copy the two files into /usr/local/bin. Now, from R, you can load the xgobi library and use xgobi(my.data) to pop up a data visualization window.