Geography 414/514:  Advanced Geographic Data Analysis
Spring 2008

Exercise 7:  Regression analysis
Finish by Thurs., May 29th

1. Introduction

The objective of this exercise is to illustrate regression analysis as well as multivariate plotting, which will be used to examine the data set that will be used in the regression analysis.  The focus of the exercise is on the relationship between annual mean temperature at climate stations in Oregon and elevation and location as expressed by latitude and longitude.  The specific goal will be to build a regression model that fits the response-variable data well, and does not suffer from assumption violations. 

Here's a link to a .csv file containing the data:  [ortann.csv].

And here are links to the components of the Oregon county outline shape files:

[orotl.shp]     [orotl.dbf]     [orotl.shx]

2.  Maps

The exercise will be involve fitting a succession of models, the "goodness-of-fit" of which can be viewed by examining the patterns in maps of the residuals.  These should show progressively less obvious patterns as the fit improves.  To start, the following script will plot the data:

# Block 1: set up -- do this once
library(maptools)
library(RColorBrewer)
library(classInt)
# read a county outline shape file, and .csv data file, if necessary
orotl.shp <- readShapeLines(file.choose(),
     proj4string=CRS("+proj=longlat"))
ortann <- read.csv(file.choose())

# attach climate data
attach(ortann)

# Block 2: set variable and get colors
plottitle <- "Annual Temperature (C)"
varname <- "tann" # name of variable to plot
plotvar <- tann # assignment of variable to plotvar
nclr <- 8 # number of colors
plotclr <- brewer.pal(nclr,"PuOr")
plotclr <- plotclr[nclr:1] # reorder colors
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)

# Block 3: plot the shape file and the selected variable
plot(orotl.shp, xlim=c(-124.5, -115), ylim=c(41,47))
# add points
points(longitude, latitude, pch=16, col=colcode, cex=2)
points(longitude, latitude, cex=2)
# add legend
legend(locator(1), legend=names(attr(colcode, "table")),
fill=attr(colcode, "palette"), cex=0.6, bty="n")
title(plottitle)

To plot maps of additional variables, only Blocks 2 and 3 will need to be repeated, after changing the variable name assignment on the second line of Block 2.  It will be convenient to copy Blocks 2 and 3 into WordPad or Word to for editing.  It will also be extremely useful to copy the maps to a Word document as they are created.

3.  An initial "null" model

The simplest of models is one in which a single value, usually the mean, is used as the "fitted value" or prediction for each observation.  Such a model can be fit using the lm() function that is used to fit more elaborate models.

# a null model, mean as the prediction
tann.lm0 <- lm(tann ~ 1)

Note the "formula" in the function above;  tann ~ 1 means "predict tann as a function of a constant value" (namely the mean).

# plot the regression line
plot(tann ~ elevation)
abline(abline(mean(tann),0))

Here the regression line is simply a horizontal line at the mean value of the response variable.  Another way of looking at this is that the slope of the regression line is 0, and the intercept is equal to the mean of the dependent or response variable.

# examine the model object
summary(tann.lm0)

The summary() function here provides, in the case of this initial model, a somewhat overblown description of the mean and standard deviation of tann, which can be verified using the mean() and sd() functions.

The plot() function with a model object as an argument provides a sequence of four diagnostic plots.  These include.  (Note that an error message will be produced, because in this naive model, Cook's distance is not available.)

  • residuals vs fit:  there should be no discernable pattern on this plot.
  • normal QQ plot for the residuals:  if the residuals are normally distributed, the points on this plot should plot along a straight line.
  • residuals vs. leverage plot:  if the fit of the model is good, and there are no distinctive outliers, there should be no discernable pattern on this plot.
  • Cook's distance:  spikes on this plot indicate observations that have an unusually large influence on the regression analysis.

# standard regression diagnostics (4-up)
oldpar <- par(mfrow = c(2, 2))
plot(tann.lm0, which=c(1,2,4))
par(oldpar)

The oldpar() function sets up the RGraphics window to contain four plots on a single page, and par(oldpar) restores the original one-plot per page format.  In the case of this "null" model, the QQ plot will provide information on the normality of the residuals, while the Cook's distance plot indicates which observations have large influence on the mean.

The residuals (deviations from the mean) can be referenced as tann.lm0$residuals.  Use the code in Part 2 above to map tann.lm0$residuals.  The second and third lines of Block 2 should read:

varname <- "tann.lm0$residuals" 
plotvar <- tann.lm0$residuals   

Q1:  Compare the values and the patterns of the maps of tann and tann.lm0$residuals.  How are they alike, and how are they different?  (Hint:  Be sure to create and examine the legends.)

4.  Bivariate regression

From previous analyses, it's apparent that tann has a clear relationship with elevation, and so the successive regressions will begin using elevation as a predictor.  Before fitting the regression model, explore the relationship between tann and elevation. 

# first regression model -- tann ~ elev
tann.lm1 <- lm(tann ~ elevation)

The lm() function fits linear (straight-line) relationships between the response variable (tann) and the predictor (elevation).  Note the formula used here, tann varies as a function of elevation. 

# plot the regression line
plot(tann ~ elevation)
abline(tann.lm1, col="blue")

The plot() and abline() functions plot the data and draw regression line

# examine the model object
summary(tann.lm1)

Note that the summary table includes the regression coefficients and several goodness-of-fit statistics.

# standard regression diagnostics (4-up)
oldpar <- par(mfrow = c(2, 2))
plot(tann.lm1, which=c(1,2,4,5))
par(oldpar)

Replot the residual scatter plot (residuals vs. fitted values), adding a lowess curve to emphasize the pattern.

# another view of the residual scatter diagram
plot(tann.lm1$fitted.values, tann.lm1$residuals)
lines(lowess(tann.lm1$fitted.values, tann.lm1$residuals, f=0.80),
    col="red")

Q2:  Examine the regression output in the Reports window.  Is there a significant relationship between tann and elevation?  What are the F statistic, its p-value, the Multiple R-Squared value, and the Residual standard error?  Are the intercept and slope significantly different from zero (and how do you know)?

Q3:  Describe the patterns on the residuals vs. fit plot and normal QQ plot.  Do these plots suggest that there is no information in the residuals, or is there some kind of pattern?  Are there any unusually influential observations?  Is this first analysis ok, or should we consider modifying it?.  (I'll answer this one--there is a distinct arch-shaped pattern on the residuals vs fit plot, nicely summarized by the loess curve, and while the points are generally linear on the residual QQ plot, they begin to drift away from the 1:1 line for residual values greater than +1.  The Crater Lake and Seneca values seem to have a large influence on the regression model.  It looks like for this simple model, several of the underlying assumptions have been violated, and we should consider looking for a better model.)

Q4:  Plot the residual values from this simple linear regression model (tann.lm1.residuals), and compare their pattern with those of from the null model.

5. Multiple regression

One of the assumptions that underlies regression analysis is that we have fit the right model, and one way of expressing the assumptions that describe the properties of the residuals is that the residuals should appear free of any pattern.  This is clearly not the case for the simple model, and the reason should be apparent--the relationship between tann and elevation is not linear.  There are two possible solutions.  One would be to fit a non-linear model, for example a parabola or quadratic curve to the relationship, but there really isn't a physical justification for why the temperature decrease with increasing elevation should be non-linear.  An alternative approach is to look for additional predictors variables that might be able to account for the pattern in the residuals, and include these in a "multiple regression model."

An obvious set of additional predictor variables for this exercise is provided by latitude and longitude.  (In practice, additional variables might not immediately come to mind.)  Check to see whether any of the "information" (pattern) in the residuals might be explained by latitude and longitude by constructing scatter plots with loess curves for the residuals:

# plot residuals vs. other predictors
plot(tann.lm1$residuals ~ longitude)
lines(lowess(tann.lm1$residuals ~ longitude, f=0.80), col="red")

plot(tann.lm1$residuals ~ latitude)
lines(lowess(tann.lm1$residuals ~ latitude, f=0.80), col="red")

Q5:  Describe the relationships between the residuals from the linear model and latitude and longitude.  (Hint:  It might be good to look at the maps of residuals too.)

The relationships you should have seen in the above question are important enough that we should consider including longitude and latitude as predictors in the regression model. 

# second regression -- tann ~ elevation + latitude + longitude
tann.lm2 <- lm(tann ~ elevation + latitude + longitude)

This code creates another model object tann.lm2, which contains the results of the multiple regression with tann as the response, and elevation, latitude, and longitude as predictors.

Q6:  Look at the summary for this model.  Describe how the F-statistic, it's p-value, the R-squared, and Residual Standard Error values have changed.  Are all predictors significant (Hint:  examine their t-statistics.)  

Examine the diagnostic plots and maps of the residuals from tann.lm2.  (Hint remember to change all instances of the model object and variable names in the blocks of code necessary to do this.)

Q7:  Are patterns in the residuals less obvious?  Are they completely gone?

Several other multiple regression models with elevation, latitude, and longitude as predictors, each allowing for interactions among the predictors, in a sense by creating additional predictors (latitude x longitude, elevation x latitude), which in the case of spatial data as here, allow the predicted "surface" more flexibility.  Here are two such interaction models:

  • tann.lm3 <- lm(tann ~ elevation + latitude*longitude)
    (allows for interactions between latitude and longitude, fitting a surface slightly more complex than a simple plane).
  • tann.lm4 <- lm(tann ~ elevation*latitude*longitude)
    (simultaneous interactions among all three predictors)

Because the exercise goes on below, you might infer that fitting these more elaborate models doesn't quite take care of the model deficiencies noted above.

6.  Nonparametric (Loess) regression

It could be the case that the individual and joint relationships between tann and elevation, latitude and longitude in the previous multiple regressions still aren't quite perfectly represented by linear relationships.  The loess fitting procedure can be generalized to more than one predictor variable, which allows the flexibility of the loess curve to be extended to the multiple regression context.  In essence, the method produces a number of local regression analyses, allowing the form of the relationship to vary across the space defined by the predictor variables, which in this example, turns out to be geographical space.

Fit a "local" (loess) regression model to the tann data, using elevation as a predictor:

# loess -- simplest model
tann.lo1 <- loess(tann ~ elevation, span=0.80, degree=2)
tann.lo1

# examine the fit
summary(tann.lo1)
plot(tann ~ elevation)
hat <- predict(tann.lo1)
lines(elevation[order(elevation)], hat[order(elevation)], col="red")
cor(tann, hat)^2

The extraction of information from the model object (tann.lo1) is a little different when fitting a loess regression model than when fitting a linear model.  Because loess regression is "nonparametric" there are no regression coefficients, and F- and t-statistics produced, but the goodness-of-fit of the regression can still be determined using R-squared and residual standard error values.  For example, the cor() function is used to get the correlation between the response variable and the fitted values.  Squaring this value gives a statistic comparable with the R-squared values from earlier regressions.

# residual scatter diagram
plot(tann.lo1$fitted, tann.lo1$residuals)
lines(lowess(tann.lo1$fitted, tann.lo1$residuals, f=0.80), col="red")

# normal probablility plot
qqnorm(tann.lo1$residuals)

This block of code constructs the residual scatter diagram.  Note that the fitted (predicted) values and the residuals are contained in the variables tann.lo1$fitted, and tann.lo1$residuals.

Q8:  Compare the results of this regression with the first linear regression (tann.lm1), in particular the goodness of fit, residual scatter diagram, and map patterns of residual.  Does fitting a flexible or local curve improve the fit relative to fitting a straight line?

Now fit an expanded model with elevation, latitude, and longitude as predictors (equivalent to tann.lm2):

# loess -- elevation, latitude, and longitude
tann.lo2 <- loess(tann ~ elevation + latitude + longitude, span=0.80,
    degree=2)
tann.lo2

# examine the fit
tann.lo2
hat <- predict(tann.lo2)
cor(tann, hat)^2

# residual scatter diagram
plot(tann.lo2$fitted, tann.lo2$residuals)
lines(lowess(tann.lo2$fitted, tann.lo2$residuals, f=0.80), col="red")

# normal probablility plot
qqnorm(tann.lo2$residuals)

Q9:  Does the loess regression fit the tann data better, worse, or about the same as the multiple regression?  (Note:  it might be easier to answer Q8-Q10 simultaneously.)

Q10:  Examine the residual scatter diagrams and QQ Normal plots.  Has there been any improvement (less pattern) in the residuals of the model?, relative to the other regressions

Q11:  Which of the various models seem to be the optimal one, and why do you think it is?

7.  What to turn in

Turn in answers to the above questions, and one or more of the plots you generated in section 2.

That's all!