1. Introduction
  1.1 General Notes
  1.2 Getting Help Inside of R
  1.3 Helpful Links
2. Writing and Running Programs
  2.1 The Basics
  2.2 Libraries
3. Data Entry
4. Univariate Statistics
5. Regression
  5.1 Simple Linear Regression
  5.2 Multiple Linear Regression
  5.3 Generalized Linear Models
  5.4 Fitting Nonlinear Models Using Least Squares
6. Comparing Discrete Treatment Effects
  6.1 Chi-Squared
  6.2 T-tests
  6.3 ANOVA and ANCOVA
  6.4 Post-Hocs
  6.5 Mixed Models
  6.6 MANOVA
  6.7 Repeated Measures ANOVA
7. Multivariate Techniques
  7.1 Principal Components Analysis
  7.2 Structural Equation Modeling and Path Analysis
8. The Kitchen Sink
Appendix A: Random Graphing Notes
Appendix B: Installing R Commander and Xgobi on OSX
Appendix C: Changelog
1. Introduction

  1.1 General Notes

In many ways, R will feel familiar, as it resembles several languages and packages you may already know:

1) SAS - Not only does it have all of the basic statistical tools at your easy disposal, but it also contains library upon library of obscure and powerful statistical methods that you might need at a moment's notice. That, and the code, once you wrap your head around the basic syntax, is very simple. You can do an ANOVA on your dataset in three lines. All of this said, R is also just like S, or S-Plus, or whatever flavor of S you have worked on, in that it essentially is a free open source implementation of S. Rock.
2) MATLAB - It operates from a command line. After running your analytical program, if you want to do additional things with the results, you don't need to write and run a whole new program, but can just do it at the prompt. Similarly, you can just run your analyses at the command line without ever writing a separate standalone program. Also like MATLAB, it is a fully programmable language. Any modeling, graphing, etc., that you can do in MATLAB, you can do in R. There's even a package that makes R use MATLAB syntax, if you really don't want to learn any new tricks.
3) JAVA - This is where some folk who are used to the functional approach of SAS and MATLAB will stumble at first. R is largely object oriented. This means that, instead of saying "run my anova!" and getting results printed up, you create an anova object from your data. The object contains every bit of information about the ANOVA you would ever want. To see this information, or work with it later, you need to call a function on the ANOVA object itself! E.g., my.anova <- aov(response ~ treatment); summary(my.anova)
For future reference, text like that is R code.
4) PERL - Like PERL, R can be very opaque. But, like PERL, R is part of the open source community. It has a HUGE number of people out there constantly tweaking, modifying, writing new statistical methods, and answering questions about it. Go to http://www.r-project.org to begin to get a sense of it. There are wonderful mailing lists with searchable archives, and people always willing to give you pointers.
  1.2 Getting Help Inside of R

You can get help on any function by typing help(function), ?function, or by asking for an example with example(function). Also, note, if you NEED a graphical point n' click interface, and coding terrifies you, there is an option. Make sure you have X11 installed on your computer with the tcltk library, and then you can run the package R Commander, which creates a point and click interface for R. It's not as powerful, but it can yield up some code snippets for you to use later on. See Appendix B for detailed install instructions.
2. Writing and Running Programs

  2.1 The Basics

To run a program you have saved in a file, use

source("/path/to/file/filename.r")

Or set your working directory to be the same as the file's, and you can leave out the path. R uses UNIX directory syntax. There should also be a menu entry to change your working directory. FYI, I often write my data analysis code in the same directory as my datasets, that way I don't have to bother with remembering all sorts of directory paths.
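For example (the path and filename here are hypothetical):

setwd("/Users/me/analysis") #now R looks in this directory for files
source("my_analysis.r")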
  2.2 Libraries

To use the functions in a library, put

library(libraryname)

in your program.
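For example, to grab and load the car library used throughout this guide:

install.packages("car") #only needed once, to download it from CRAN
library(car) #loads it for this session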
3. Data Entry

You can enter data by hand, creating a vector at the prompt:

vector <- c(1, 2, 3, 4)
But this gets tedious rather quickly. FYI, a few language pointers from the above statement: vector is the name of the object you are creating and assigning your data to, <- is the assignment operator, and c combines the arguments in parentheses into a vector. To avoid the tedium, let's assume you've entered your data into excel, gnumeric, openoffice, or whatever. Save them as a csv (comma separated document). Then in R, you can use the easy command:
my.data <- read.csv("/path/to/data/data.csv")
Yup, read.csv. It's that easy. There's also a nice read.delim for other kinds of text files. You now have an object for your data, my.data, that you can do many wonderful things with in the future. If you ever want to look at your data in tabular format again, type edit(my.data). If you ever want to see what the columns of your data are in R, type labels(my.data).
Note, the edit command will NOT edit your data. It just brings it up in an editor window. If you want to actually make changes to your data, either creating a new data object, or overwriting your old data, you need to do something like my.new.data<-edit(my.data)
4. Univariate Statistics

For univariate statistics on a vector of values, the basics are:

mean(values)
sd(values)
var(values)
length(values) #for the sample size
shapiro.test(values)
As a side note, the # denotes a comment. Anything on a line after that # will not be parsed. You can also get more information with summary(values) and see a boxplot with boxplot(values).
But, let's say you have imported data from a csv, and it's multivariate data. Tricky! Here, you need to use the following syntax for any of the above functions
with(data_name, by(response, treatment, function))
#or, use $ syntax instead of with
by(data_name$response, data_name$treatment, function)
The with says "use data_name as your dataset!", the by says "show me the values of response, split by treatment, and apply function to them." So, to get the mean of a dataset's response by treatment, you would use with(dataset, by(response, treatment, mean)). To plot your data, by treatment, you use a slightly different syntax. If you were to use the with-by method, you would get overlapping plots of each treatment, which would be somewhat useless. Instead, use
boxplot(response ~ treatment, data=my.data)
Note how the ~ is standing in here for by. Why can you not use this syntax with univariate tests? Beats me! But, you WILL be using this syntax with 99% of your later statistical tests, so get used to it! In fact, here's another dose. If you wish to look at the simple effects of a two-way model, try this
boxplot(response ~ treatment1:treatment2, data=my.data)
Unlike SAS, where * signifies an interaction, in R the interaction operator is :. R does use *, but it has the same meaning as SAS's |, e.g. do the full factorial of all of these treatments for my statistical model. More on that later.
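To spell that out, these two calls fit exactly the same model:

lm(response ~ treatment1*treatment2, data=my.data)
lm(response ~ treatment1 + treatment2 + treatment1:treatment2, data=my.data) #the * expanded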
There is also a function plotMeans (note the caps; it comes with the Rcmdr package) which can plot simple effects with error bars beautifully. It's the command I use the most for ANOVAs.
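A sketch of how I call it; the error.bars argument here is from my reading of the Rcmdr version of plotMeans, so double-check against ?plotMeans:

library(Rcmdr)
with(my.data, plotMeans(Response, Treatment1, Treatment2, error.bars="se"))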
Instead of with, you can use the attach statement. This will make all of the column headings in your data file have meaning in the global environment. Why might you not want to do this? Let's say you had a program analyzing two different data files, and each had a column with the same name. If the values were not the same, however, things would be awfully messy. If you must use attach(dataset), once you are done with the dataset, please, be hygienic, and detach(dataset).
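In practice:

attach(my.data)
mean(Response)  #column names now work on their own
detach(my.data) #be hygienic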
5. Regression

  5.1 Simple Linear Regression

The workhorse of regression in R is lm, or linear model. Below is the code to run a simple linear regression, and plot it.
my.slr<-lm(response~treatment, data=dataset)
my.slr #see the stats
anova(my.slr) #see the type I analysis of deviance table
plot(response~treatment, data=dataset)
abline(my.slr)
Note a few things. First, we are creating a linear model object, so anova is needed to get the statistical table out of it. Just typing the object name gives you a table of parameter estimates. Note: anova works on sequential type I sums of squares. To get type III (which is what you should use if you're being conservative - especially for multiple linear regression - SAS and JMP use type III for regressions) use Anova from the car library. Secondly, we can plot the line quite easily for SLR with abline. What about assumptions and error? You can examine some of those qualities graphically with the following - and this will be the same for most stats from here on out.
par(mfrow=c(2,2)) #sets plotting to a 2x2 format
plot(my.slr)
This gives you a Fitted v. residual plot, a Q-Q plot, a Standardized residual v. fitted plot, and a Cook's Distance plot. What about something more rigorous? Try the following code (after setting your plot back to a 1x1)
my.errorfit<-lm(residuals(my.slr)~fitted(my.slr))
plot(residuals(my.slr)~fitted(my.slr))
abline(my.errorfit)
summary(my.errorfit)
This gives both plots and gives you the regression results for fitted against residuals. residuals and fitted can be used as methods to generate vectors wherever you wish. For example, if you wish to do an observed against predicted plot, try creating a new regression fit with my.goodfit<-lm(Response ~ fitted(my.slr), data=dataset)
  5.2 Multiple Linear Regression

Multiple linear regression works the same way; the following two models are identical:

my.lm <- lm(response ~ treatment1 + treatment2 + treatment1:treatment2, data=dataset)
my.lm <- lm(response ~ treatment1*treatment2, data=dataset)
If you wish to do a stepwise model selection, create the full model, and then use step(my.lm)
to get the best model based on minimization of the AIC.
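For instance, a quick sketch:

full.lm <- lm(response ~ treatment1*treatment2, data=dataset)
best.lm <- step(full.lm) #drops terms until the AIC stops improving
summary(best.lm)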
You can also use curve (described below in GLM) here to great effect, letting one of the variables be continuous, while holding another constant. Then just plot multiple lines at different levels of the second variable, and, voilà, you have one hot looking graph.
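A minimal sketch of that trick, assuming a fit with two continuous predictors x1 and x2 (all names here hypothetical):

my.lm <- lm(y ~ x1 + x2, data=my.data)
plot(y ~ x1, data=my.data)
#one line per value at which we hold x2 constant
for(x2.level in c(1, 5, 10)) {
  curve(predict(my.lm, data.frame(x1=x, x2=x2.level)), add=TRUE)
}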
  5.3 Generalized Linear Models

For generalized linear models, you leave lm and go to glm, which is subtly different in the things you can do. Many of the functions are the same once you create a glm object, but some are not. To be brief, the following code will create a glm object, give you its output, and plot the function overlaid on your data points.
my.glm <- glm(Response ~ Treatment,
    data = my.data,
    family = poisson(link = "log"))
summary(my.glm) #see the output
plot(Response ~ Treatment, data = my.data)
my.curve <- function(x) predict(my.glm,
    data.frame(Treatment=c(x)),
    type = "response")
curve(my.curve(x), add = TRUE)
There's a lot here. First off, note that you specify the error family. FYI, if you wish to use the negative binomial family, you need to load the package MASS and use glm.nb. Each family will only take certain link functions. ?family to find out what goes with what. Next, you'll note that we can't just use abline. Bummer, but, we are using nonlinear fits. Instead, you have to specify a function. Rather than type out the actual function, and have to modify the coefficients, use predict. The type="response" tells predict to use the nonlinear fit with the coefficients, rather than use them in a linear equation. curve then uses the function you just created to draw the curve, and add=TRUE makes sure that your curve is overlaid on the already plotted points. If you had no add statement, you could instead say from=xmin, to=xmax, and you'll get a curve that spans xmin to xmax.
  5.4 Fitting Nonlinear Models Using Least Squares

If your model isn't linear at all, use nls, giving it the formula and starting values for the search:

my.nls<-nls(Response ~ a + b*Treatment^2, data=my.data, start=list(a=1, b=1))

This begins a series of iterations that stop when the least squares of the error have been minimized. You can plot this function in the same manner as the glm above.
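That is, something like:

plot(Response ~ Treatment, data=my.data)
nls.curve <- function(x) predict(my.nls, data.frame(Treatment=x))
curve(nls.curve(x), add=TRUE)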
6. Comparing Discrete Treatment Effects

  6.1 Chi-Squared

For a simple goodness of fit test on count data, use chisq.test(my.data$countdata). If you are dealing with a contingency table (i.e. 2x2), you need to do a bit more weedling. You will need to start with your data in a single column. For example, if your table is:
        | Treatment 1 | Treatment 2
Group A |     10      |     20
Group B |     70      |     80
Read the counts down each column into a single vector, then fold that vector into a matrix (matrix fills column by column by default):

data.vector<-c(10,70,20,80)
data.table<-matrix(data=data.vector, nrow=2, ncol=2)
my.chi<-chisq.test(data.table)
my.chi #print the results
  6.2 T-Tests
my.t.test <- t.test(Response ~ Treatment, data=my.data, alternative="two.sided")
my.t.test #print the results
The alternative argument can also be "greater" or "less" for a one tailed test, depending on which tail you are using.
  6.3 ANOVA and ANCOVA

You can use lm to calculate an ANOVA. You can then use Anova to get the relevant analysis of deviance tables, as summary merely gives you parameter values. There is also aov. I find aov useful if some of my treatments could be construed as continuous variables, as aov will treat them as categories.
But first, a brief digression on sums of squares. There is no quicker way to start a holy war about hypothesis testing than to ask a question about type II or III sums of squares on the R discussion list. I know, I've tried. Part of the problem is that many of us have been using something like SAS or JMP for years, and want to know how to get the equivalent output. Well, there's the rub. It appears that SAS uses an awfully strange definition of what is the type III sums of squares method. R is consistent. Or so I've been told. If you want to achieve parity between the two different stats packages, when doing a strict regression model, type III in SAS = type III in R. Use it. When doing an ANOVA, type III in SAS = type II in R. I know. Odd. But, I've tested this on the same data set, and found it to be true. The problem comes in when doing an ANCOVA or other glm. There, type III in SAS = some hybrid between type II and III in R. From my relatively ignorant perspective (I admit this), I would advise using type II for an experiment where one is testing a hypothesis, as interactions without a main effect make little sense. For modelling uncontrolled pattern data, perhaps type III would be more appropriate? If someone thinks this is poppycock and has a better explanation, please, let me know, and I will correct this portion of the guide. To get your sums of squares of choice, use Anova from the package car. The following will produce the same output:
library(car) #load your library
my.lm.anova <- lm(Response ~ Treatment, data = my.data)
anova(my.lm.anova)
Anova(my.lm.anova, type=c("II"))
my.anova <- aov(Response ~ Treatment, data = my.data)
summary(my.anova)
Anova(my.anova, type=c("II"))
To run a factorial ANOVA, just use *. Also, you can use lm, if you want. The following three examples will give you the same output.
#1
my.anova <-aov(Response ~ Treatment1 + Treatment2 + Treatment1:Treatment2, data=my.data)
Anova(my.anova, type=c("II"))
#2
my.anova2 <-aov(Response ~ Treatment1*Treatment2, data=my.data)
Anova(my.anova2, type=c("II"))
#3
my.anova3 <- lm(Response ~ Treatment1*Treatment2, data=my.data)
Anova(my.anova3, type=c("II"))
To see the interaction, use the following
with(my.data, interaction.plot(Treatment1, Treatment2, Response))
If one of your terms is nested (e.g. you have replicates and subsamples of those replicates) add the effect as replicate %in% treatment to your lm (see the sketch below). Note, lm will assume any variables that are composed of numbers are indeed continuous, unless you create a factor for the variable. For example, let's say you wanted to turn a continuous treatment into a categorical one in my.data. Just use my.data$CategoricalTreatment<-as.factor(my.data$Treatment).
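Concretely, a sketch of such a nested call:

nested.lm <- lm(response ~ treatment + replicate %in% treatment, data=my.data)
anova(nested.lm)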
To make sure you have not violated the assumptions of homoscedasticity and normally distributed residuals with a mean of 0, you should use a Levene test and a Shapiro test, respectively. For the preceding example, you would use the following code:

#check homoscedasticity (levene.test lives in the car library)
with(my.data, levene.test(Response, Treatment1:Treatment2))
#check the residuals
shapiro.test(my.anova3$residuals)
  6.4 Post-Hocs

For post-hocs, you have the base TukeyHSD command and the broader simint command from the multcomp package. The simint command can do a wide variety of different multiple comparisons, and is also useful for ANCOVA and other more complicated models that the base TukeyHSD cannot handle.
my.anova<-aov(Response ~ Treatment, data=my.data)
my.tukey<-TukeyHSD(my.anova, "Treatment", ordered=TRUE)
my.tukey #print the results
plot(my.tukey)
The ordered argument will at least let you see factors in their order by means. The output will be confidence intervals; if an interval does NOT cross 0, that indicates a significant difference between two treatments. This can be difficult to grasp in the text, but the plot makes it pretty clear. It's then up to you to assign grouping values, etc. You can get similar results for other sorts of tests using simint in the multcomp library (which also does more than just Tukey tests, and is useful for ANCOVA and more complicated models) like so:
#now with simint
library(multcomp)
my.tukey<-simint(Response ~ Treatment, data=my.data, whichf="Treatment", type="Tukey")
summary(my.tukey)
plot(my.tukey)
I know, this lack of groupings is pretty unsatisfying, but, oh well! Go write something that'll do the grouping for you! And write a Ryan's Q test while you're at it...
You can also run a priori contrasts with fit.contrast from the gmodels library. Set up a contrast matrix, with one row per contrast:

library(gmodels)
my.aov<-aov(Response ~ Treatment, data=my.data)
contr.matrix <- rbind ("Control v. Treatment" =c(-3,1,1,1),
        "Trt2 v Trt 3 and 4" =c(0,-2,1,1))
fit.contrast(my.aov, "Treatment", contr.matrix)
Personally, I like this as it forces you to think out your contrast matrix structure, and make sure it's orthogonal. The output is nice and straightforward as well.
  6.5 Mixed Models

For a mixed model fit by least squares, you provide the error term yourself:

attach(my.data)
nested.ls <- aov(response ~ treatment + Error(treatment:replicate))
summary(nested.ls)
And here is the same model by REML, using lme from the nlme library:

library(nlme)
nested.reml <- lme(response ~ treatment, random=~treatment|replicate)
anova.lme(nested.reml, type="marginal") #gets fixed effects
nested.effects <-lm(response ~ treatment)
anova.lme(nested.reml, nested.effects, type="marginal") #shows effect of the random variable
This shows three important things. 1) For least squares, you're on your own to provide your own error terms in a mixed model. 2) You can only get sequential sums of squares for the least squares method, as Anova from the car package doesn't work on an aov fit with an Error term, and Error doesn't work with lm. 3) You can, however, get marginal sums of squares using REML. Yay REML! It's what all of the cool kids are doing these days. The syntax can be a little dicey, but, if you want to compare it to what you used to do with proc mixed, there's a library called SASmixed which has R code for examples from published texts on proc mixed. The only bugaboo here is that you need to compare models with and without random effects to get the p values for random effects (see the sketch below). With complex models, things can get dicey, so, be careful.
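A sketch of that with-and-without comparison for the p value of a random effect; gls, the fixed-effects-only analogue of lme, also lives in nlme:

library(nlme)
with.random <- lme(response ~ treatment, random=~1|replicate, data=my.data)
no.random <- gls(response ~ treatment, data=my.data)
anova(with.random, no.random) #likelihood ratio test for the random effect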
  6.6 MANOVA

To run a MANOVA, first bind your multiple responses into a single matrix, then feed that to manova:

aggregate_response <- with(my.data, cbind(response1, response2))
my.manova <- manova(aggregate_response ~ Treatment, data=my.data)
summary.aov(my.manova) #gives univariate ANOVA responses
summary.manova(my.manova, test="Wilks") #or use your favorite MANOVA test
I'm still not quite sure how to extract canonical values, and make a proper centroid plot. Any advice on this would be helpful.
  6.7 Repeated Measures ANOVA

You can run a repeated measures ANOVA using aov with the appropriate error term (where each Replicate was sampled multiple times) as follows
my.rm.anova<-aov(Response ~ Treatment*Time + Error(Time:Replicate), data=my.data)
summary(my.rm.anova)
There are serious problems with this method if sphericity is at all violated (i.e. the difference between time point 1 and time point 2 is not the same as between 2 and 3, or 3 and 4, etc). To solve this, it is best to use a repeated measures MANOVA approach which will report adjusted p values using two different approaches. Basically, you are comparing multivariate linear models here, so the code can get a little long. The code is as follows, assuming that you have your data in the format of:
Treatment | Time1 | Time2 | Time3 |
Control | 1 | 2 | 3 |
Trt1 | 5 | 8 | 9 |
Control | 1 | 1 | 3 |
Trt1 | 5 | 7 | 9 |
Control | 1 | 2 | 2 |
Trt1 | 5 | 6 | 10 |
response <- with(my.data, cbind(Time1, Time2, Time3))
#first, fit the full and reduced multivariate linear models
mlmfit <- lm(response ~ Treatment, data=my.data)
mlmfit1 <- lm(response ~ 1)
mlmfit0 <- lm(response ~ 0)
#now, do the testing for effects
anova.mlm(mlmfit, mlmfit1, M=~1) #tests for the treatment effect
anova.mlm(mlmfit1, mlmfit0, X=~1, test="Spherical") #tests for the time effect
anova.mlm(mlmfit, mlmfit1, X=~1, test="Spherical") #tests the interaction
Yes, this works. No, I don't quite know the mechanics of why. This was code given to me by a statistician, so take it with that grain of salt, and get thee a better understanding of using repeated measures for MANOVA before you entirely trust this. But, that said, it should be all kosher.
7. Multivariate Techniques

  7.1 Principal Components Analysis

To run a PCA, bind your variables into a matrix and use princomp:

my.matrix<-with(my.data, cbind(factor1, factor2, factor3....))
my.pc<-princomp(my.matrix, cor=TRUE) #does the PCA on correlations, not covariances
summary(my.pc) #gives the importance of each PC
loadings(my.pc) #gives the coefficients to create each PC
biplot(my.pc) #shows a plot of your PCs
my.pc$scores #shows the values of each PC, can be used as new data
If you are really slick, and have X11 on your computer along with xgobi, you can then use the xgobi library to visualize your data with xgobi(cbind(my.matrix, my.pc$scores)).
  7.2 Structural Equation Modeling and Path Analysis

#first, check your data for multivariate normality
library(mvnormtest)
mshapiro.test(t(my.data[1:99, 1:4]))
#assuming all is well, calculate the covariance matrix
cov_matrix<-var(my.data)
#if, however, you have only been given a correlation matrix
#and a vector of standard deviations, calculate the cov_matrix
#as in the following again with x1, x2, x3, and y
#note: correlation values are made up
sd.vect<-c(std.dev.x1, std.dev.x2, std.dev.x3, std.dev.y)
corr_matrix<-matrix(c(1.0000000, 0.8627184, -0.6551867, -0.2508666,
0.8627184, 1.0000000, -0.6302580, 0.2680659,
-0.6551867, -0.6302580, 1.0000000, 0.1432521,
-0.2508666, 0.2680659, 0.1432521, 1.0000000), nrow=4, ncol=4)
cov_matrix<-outer(sd.vect, sd.vect)*corr_matrix
Now, assuming all is well, let's specify the model using the RAM format. You have an a priori hypothesis that X2 and X3 are both in part determined by X1, and that all three feed into Y. Your path model (as this is a path analysis - there are no latent variables - yet) should look like this:

[path diagram: X1 -> X2, X1 -> X3, X1 -> Y, X2 -> Y, X3 -> Y]

You build this with the specify.model function from the sem library, and then list all of the links in the following format: Var1 -> Var2, path coefficient name, starting value for ML search. You also MUST specify an error term for each variable like so: Var <-> Var, error coefficient name, starting value. Note the double headed arrow. Double headed arrows are also how you represent covariances between two variables in your model. As model specification from the prompt is interactive, in code for a script, you end with an extra blank line and the name of your model. To set up a model for the path diagram above, use the following code.
library(sem)
model.path <- specify.model()
X1 -> Y, xy1, NA
X1 -> X3, x13, NA
X1 -> X2, x12, NA
X2 -> Y, xy2, NA
X3 -> Y, xy3, NA
X3 <-> X3, x3error, NA
X2 <-> X2, x2error, NA
Y <-> Y, yerror, NA
model.path
Great. Now we want to fit our model, and look at its χ² goodness of fit, as well as the significance values from Wald tests for each path, the coefficient values, and the standardized coefficients. For that, we use the following
model.path.sem<-sem(model.path, cov_matrix, N=99, fixed.x="X1")
#note, if your covariance matrix doesn't have row and column names (i.e. you
#calculated it from a correlation matrix and std dev vector), you need to
#tell sem what the row and column names are, as in the following syntax
#that does the same thing
obs.var.names<-c('X1', 'X2', 'X3', 'Y')
model.path.sem<-sem(model.path, cov_matrix, N=99, obs.var.names, fixed.x="X1")
#get the coefficient values, and see relevant statistics about various paths
#and the model as a whole
summary(model.path.sem)
#check the standardized coefficients
std.coef(model.path.sem)
So, does your model fit the data? The sem package provides several indices to help you out. Specifically, the Goodness of fit index, the Bentler-Bonnet NFI, and the Bentler CFI should all be greater than 0.9. The CFI is more reliable than the NFI with small sample sizes (Palomares et al, 1998 Journal of Ecology).
#output the model to a dotfile from which you can generate an image
path.diagram(model.path.sem, "model.path.sem.dot", edge.labels="values")
Note that this outputs text to a file model.path.sem.dot, which is a dotfile. Dotfiles are wonderful little creations that can be read by the program graphviz. You can actually just open the file in a text editor and add detail (line widths, standardized coefficients, and more); the dot language specification is quite straightforward. Of course, there are more GUI oriented programs out there that read dotfiles and let you edit to your heart's content. OSX installs often come with OmniGraffle installed by default. If you know of any others you'd like to see listed here, let me know. Finally, if you want a p value for your model's χ² statistic, you can get one with pchisq.
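For example (the χ² value and degrees of freedom here are made up):

pchisq(4.2, df=3, lower.tail=FALSE) #p value for a model chi-square of 4.2 on 3 df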
Appendix A: Random Graphing Notes

Some useful plot parameters:

par(cex=1.5) #specifies the ratio of text to numbers on the axis
plot(Y~X,
  font=2, #bold font
  xlab="X Axis Label",
  ylab="Y Axis Label",
  pch=19) #makes the points filled circles
Use ?par to find a lot more about creating graph windows. You can also specify line thickness with lwd=3, as well as a color with col="red". To color your points by some grouping factor, try
plot(y~x,
  col=c("red", "blue", "green")[unclass(Grouping)]
)
This also works for pch
and other similar properties.
To plot means with standard error bars, first calculate the means and SEs, then draw the bars with segments:

data.means<-with(my.data, by(Response, Treatment, mean))
data.se<-with(my.data, by(Response, Treatment, function(x) sd(x)/sqrt(length(x))))
data.summary<-data.frame(means=as.vector(data.means),
  se=as.vector(data.se),
  levels=c(list,of,levels))
plot(data.summary$levels, data.summary$means)
segments(data.summary$levels, data.summary$means+data.summary$se,
  data.summary$levels, data.summary$means-data.summary$se)
You can adjust xlim=range(..) and ylim as you will in the plot statement. If you have a multifactorial thang that you want to plot, you can split up your dataset using subset, calculate all of your means and standard errors, then moosh it back into a single dataframe and plot from there (see the sketch below). One day, I will write a function to do this for n factors so I don't have to think about it any more. One day...
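Here's a minimal sketch of that subset-and-recombine dance for two factors (column and level names hypothetical):

sub.a <- subset(my.data, Trt2 == "A")
sub.b <- subset(my.data, Trt2 == "B")
means.a <- with(sub.a, by(Response, Trt1, mean))
means.b <- with(sub.b, by(Response, Trt1, mean))
#moosh it all back into one dataframe for plotting
all.means <- data.frame(Trt1=rep(names(means.a), 2),
  Trt2=rep(c("A", "B"), each=length(means.a)),
  means=c(as.vector(means.a), as.vector(means.b)))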
Appendix B: Installing R Commander and Xgobi on OSX

Check whether anything already lives in /usr/local with ls /usr/local/bin and ls /usr/local/lib. If neither returns anything, or the directories don't exist, then you can run the following commands at the prompt:
sudo rm -rf /usr/local
sudo ln -s /sw /usr/local
5) At this point, you should be able to run Rcommander after you install it using the package installer. To run Rcommander, start up X11 first, then simply load the package from the package manager. There will be a few associated packages you will need to install, but you'll know what those are the first time you try to click on "load R Commander" and, instead of getting a new set of windows, you get an error message saying "package X not found". Install it and keep going.

To build Xgobi, edit its Makefile so that IDIR=/usr/X11R6/include and LDIR=/usr/X11R6/lib. Also, fix the corresponding code in the lint statement. Then just type make. This SHOULD generate xgobi and xgvis. You need to have a directory to plunk them into, so type mkdir /usr/local/bin. If you don't have a /usr/local already, make it first. Then copy the two files into /usr/local/bin. In R, load the xgobi library and type xgobi(my.data) to pop up a data visualization window.