Geography 414/514: Advanced
Geographic Data Analysis
Spring 2008Exercise 7: Regression analysis
Finish by Thurs., May 29th
1. Introduction
The objective of this exercise is to illustrate regression analysis
as well as multivariate plotting, which will be used to examine the data
set that will be used in the regression analysis. The focus of the
exercise is on the relationship between annual mean temperature at
climate stations in Oregon and elevation and location as expressed by
latitude and longitude. The specific goal will be to build a
regression model that fits the response-variable data well, and does not
suffer from assumption violations.
Here's a link to a .csv file containing the data:
[ortann.csv].
And here are links to the components of the Oregon county outline
shape files:
[orotl.shp]
[orotl.dbf]
[orotl.shx]
2. Maps
The exercise will be involve fitting a succession of models, the
"goodness-of-fit" of which can be viewed by examining the patterns in
maps of the residuals. These should show progressively less
obvious patterns as the fit improves. To start, the following
script will plot the data:
# Block 1: set up -- do this
once
library(maptools)
library(RColorBrewer)
library(classInt)
# read a county outline shape file,
and .csv data file, if necessary
orotl.shp <- readShapeLines(file.choose(),
proj4string=CRS("+proj=longlat"))
ortann <- read.csv(file.choose())
# attach climate data
attach(ortann)
# Block 2: set variable and get colors
plottitle <- "Annual Temperature (C)"
varname <- "tann" # name of variable to plot
plotvar <- tann # assignment of variable to plotvar
nclr <- 8 # number of colors
plotclr <- brewer.pal(nclr,"PuOr")
plotclr <- plotclr[nclr:1] # reorder colors
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)
# Block 3: plot the shape file and the selected variable
plot(orotl.shp, xlim=c(-124.5, -115), ylim=c(41,47))
# add points
points(longitude, latitude, pch=16, col=colcode, cex=2)
points(longitude, latitude, cex=2)
# add legend
legend(locator(1), legend=names(attr(colcode, "table")),
fill=attr(colcode, "palette"), cex=0.6, bty="n")
title(plottitle)
To plot maps of additional variables, only Blocks 2 and 3 will need
to be repeated, after changing the variable name assignment on the second
line of Block 2. It will be convenient to copy Blocks 2 and 3 into
WordPad or Word to for editing. It will also be extremely useful
to copy the maps to a Word document as they are created. 3.
An initial "null" model The simplest of models is one in which a
single value, usually the mean, is used as the "fitted value" or
prediction for each observation. Such a model can be fit using the
lm() function that is used
to fit more elaborate models.
# a null model, mean as the
prediction
tann.lm0 <- lm(tann ~ 1) Note the "formula" in the function
above; tann ~ 1
means "predict tann as a
function of a constant value" (namely the mean).
# plot the regression line
plot(tann ~ elevation)
abline(abline(mean(tann),0)) Here the regression line is
simply a horizontal line at the mean value of the response variable.
Another way of looking at this is that the slope of the regression line
is 0, and the intercept is equal to the mean of the dependent or
response variable.
# examine the model object
summary(tann.lm0)
The summary() function here provides, in the case of this initial
model, a somewhat overblown description of the mean and standard
deviation of tann, which
can be verified using the mean()
and sd() functions. The
plot() function with a
model object as an argument provides a sequence of four diagnostic plots.
These include. (Note that an error message will be produced, because
in this naive model, Cook's distance is not available.)
- residuals vs fit: there should be no discernable pattern on
this plot.
- normal QQ plot for the residuals: if the residuals
are normally distributed, the points on this plot should plot along
a straight line.
- residuals vs. leverage plot: if the fit of the model is good,
and there are no distinctive outliers, there
should be no discernable pattern on this plot.
- Cook's distance: spikes on this plot indicate
observations that have an unusually large influence on the
regression analysis.
# standard regression diagnostics
(4-up)
oldpar <- par(mfrow = c(2, 2))
plot(tann.lm0, which=c(1,2,4))
par(oldpar)
The oldpar() function
sets up the RGraphics
window to contain four plots on a single page, and
par(oldpar) restores the
original one-plot per page format. In the case of this "null"
model, the QQ plot will provide information on the normality of the
residuals, while the Cook's distance plot indicates which observations
have large influence on the mean. The residuals (deviations from the
mean) can be referenced as
tann.lm0$residuals. Use the code in Part 2 above to map
tann.lm0$residuals.
The second and third lines of Block 2 should read:
varname <- "tann.lm0$residuals"
plotvar <-
tann.lm0$residuals
Q1: Compare the values and the patterns of the maps of
tann and
tann.lm0$residuals.
How are they alike, and how are they different? (Hint: Be
sure to create and examine the legends.)
4. Bivariate regression
From previous analyses, it's apparent that tann has a clear
relationship with elevation, and so the successive regressions will
begin using elevation as a predictor. Before fitting the regression model, explore the relationship between
tann and elevation.
# first regression model -- tann
~ elev
tann.lm1 <- lm(tann ~ elevation) The lm() function fits
linear (straight-line) relationships between the response variable (tann)
and the predictor (elevation). Note the formula used here, tann
varies as a function of elevation.
# plot the regression line
plot(tann ~ elevation)
abline(tann.lm1, col="blue") The plot() and abline() functions plot the
data and draw regression line #
examine the model object
summary(tann.lm1) Note that the summary table includes
the regression coefficients and several goodness-of-fit statistics.
# standard regression diagnostics
(4-up)
oldpar <- par(mfrow = c(2, 2))
plot(tann.lm1, which=c(1,2,4,5))
par(oldpar) Replot the residual scatter plot (residuals
vs. fitted values), adding a lowess curve to emphasize the
pattern. # another view of the
residual scatter diagram plot(tann.lm1$fitted.values, tann.lm1$residuals) lines(lowess(tann.lm1$fitted.values, tann.lm1$residuals, f=0.80), col="red") Q2: Examine the regression output in the Reports window.
Is there a significant relationship between tann and elevation?
What are the F statistic, its p-value, the Multiple R-Squared value,
and the Residual standard error? Are the intercept and slope
significantly different from zero (and how do you know)?
Q3: Describe the patterns on the residuals vs. fit plot and
normal QQ plot. Do these plots suggest that there is no
information in the residuals, or is there some kind of pattern? Are there any unusually
influential observations? Is this first analysis ok, or should
we consider modifying it?. (I'll answer this one--there is a
distinct arch-shaped pattern on the residuals vs fit plot, nicely
summarized by the loess curve, and while the points are generally
linear on the residual QQ plot, they begin to drift away from the 1:1
line for residual values greater than +1. The Crater Lake and
Seneca values seem to have a large influence on the regression model.
It looks like for this simple model, several of the underlying
assumptions have been violated, and we should consider looking for a
better model.)
Q4: Plot the residual values from this simple linear
regression model (tann.lm1.residuals), and compare their pattern with
those of from the null model.
5. Multiple regression
One of the assumptions that underlies regression analysis is that we
have fit the right model, and one way of expressing the assumptions that
describe the properties of the residuals is that the residuals should
appear free of any pattern. This is clearly not the case for the
simple model, and the reason should be apparent--the relationship
between tann and elevation is not linear. There are two possible
solutions. One would be to fit a non-linear model, for example a
parabola or quadratic curve to the relationship, but there really isn't
a physical justification for why the temperature decrease with
increasing elevation should be non-linear. An alternative approach
is to look for additional predictors variables that might be able to
account for the pattern in the residuals, and include these in a
"multiple regression model."
An obvious set of additional predictor variables for this exercise is
provided by latitude and longitude. (In practice, additional
variables might not immediately come to mind.) Check to see
whether any of the "information" (pattern) in the residuals
might be explained by latitude and longitude by constructing scatter
plots with loess curves for the residuals:
# plot residuals vs. other
predictors
plot(tann.lm1$residuals ~ longitude)
lines(lowess(tann.lm1$residuals ~ longitude, f=0.80), col="red")
plot(tann.lm1$residuals ~ latitude)
lines(lowess(tann.lm1$residuals ~ latitude, f=0.80), col="red")
Q5: Describe the relationships between the residuals from the
linear model and latitude and longitude. (Hint: It might
be good to look at the maps of residuals too.)
The relationships you should have seen in the above question are
important enough that we should consider including longitude and
latitude as predictors in the regression model.
# second regression -- tann ~
elevation + latitude + longitude
tann.lm2 <- lm(tann ~ elevation + latitude + longitude)
This code creates another model object tann.lm2, which contains the
results of the multiple regression with tann as the response, and
elevation, latitude, and longitude as predictors.
Q6: Look at the summary for this model. Describe how the F-statistic, it's p-value, the
R-squared, and Residual Standard Error values have changed. Are
all predictors significant (Hint: examine their t-statistics.)
Examine the diagnostic plots and maps of the residuals from tann.lm2.
(Hint remember to change all instances of the model object and variable
names in the blocks of code necessary to do this.)
Q7: Are patterns in the residuals less obvious?
Are they completely gone?
Several other multiple regression models with elevation, latitude,
and longitude as predictors, each allowing for interactions among the
predictors, in a sense by creating additional predictors (latitude x
longitude, elevation x latitude), which in the case of spatial data as
here, allow the predicted "surface" more flexibility. Here are
two such interaction models:
- tann.lm3 <- lm(tann ~
elevation + latitude*longitude)
(allows for interactions
between latitude and longitude, fitting a surface slightly more
complex than a simple plane).
- tann.lm4 <- lm(tann ~
elevation*latitude*longitude)
(simultaneous interactions
among all three predictors)
Because the exercise goes on below, you might infer that fitting
these more elaborate models doesn't quite take care of the model
deficiencies noted above.
6. Nonparametric (Loess) regression
It could be the case that the individual and joint relationships
between tann and elevation, latitude and longitude in the previous
multiple regressions still aren't quite perfectly represented by linear
relationships. The loess fitting procedure can be generalized to
more than one predictor variable, which allows the flexibility of the
loess curve to be extended to the multiple regression context. In
essence, the method produces a number of local regression analyses,
allowing the form of the relationship to vary across the space defined
by the predictor variables, which in this example, turns out to be
geographical space.
Fit a "local" (loess) regression model to the tann data,
using elevation as a predictor:
# loess -- simplest model
tann.lo1 <- loess(tann ~ elevation, span=0.80, degree=2)
tann.lo1
# examine the fit
summary(tann.lo1)
plot(tann ~ elevation)
hat <- predict(tann.lo1)
lines(elevation[order(elevation)], hat[order(elevation)], col="red")
cor(tann, hat)^2
The extraction of information from the model object (tann.lo1)
is a little different when fitting a loess regression model than when
fitting a linear model. Because loess regression is "nonparametric" there are no
regression coefficients, and F- and t-statistics produced, but the
goodness-of-fit of the regression can still be determined using
R-squared and residual standard error values. For example, the
cor() function is used to
get the correlation between the response variable and the fitted values.
Squaring this value gives a statistic comparable with the R-squared
values from earlier regressions.
# residual scatter diagram
plot(tann.lo1$fitted, tann.lo1$residuals)
lines(lowess(tann.lo1$fitted, tann.lo1$residuals, f=0.80), col="red")
# normal probablility plot
qqnorm(tann.lo1$residuals)
This block of code constructs the residual scatter diagram.
Note that the fitted (predicted) values and the residuals are contained
in the variables tann.lo1$fitted,
and tann.lo1$residuals.
Q8: Compare the results of this regression with the first
linear regression (tann.lm1), in particular the goodness of fit,
residual scatter diagram, and map patterns of residual. Does
fitting a flexible or local curve improve the fit relative to fitting a
straight line?
Now fit an expanded model with elevation, latitude, and longitude as
predictors (equivalent to tann.lm2):
# loess -- elevation, latitude,
and longitude
tann.lo2 <- loess(tann ~ elevation + latitude + longitude, span=0.80,
degree=2)
tann.lo2
# examine the fit
tann.lo2
hat <- predict(tann.lo2)
cor(tann, hat)^2
# residual scatter diagram
plot(tann.lo2$fitted, tann.lo2$residuals)
lines(lowess(tann.lo2$fitted, tann.lo2$residuals, f=0.80), col="red")
# normal probablility plot
qqnorm(tann.lo2$residuals)
Q9: Does the loess regression fit the tann data better,
worse, or about the same as the multiple regression? (Note:
it might be easier to answer Q8-Q10 simultaneously.)
Q10: Examine the residual scatter diagrams and QQ Normal plots. Has there been any improvement (less
pattern) in the residuals of the model?, relative to the other
regressions
Q11: Which of the various models seem to be the optimal one, and why do
you think it is?
7. What to turn in
Turn in answers to the above questions, and one or more of the plots
you generated in section 2.
That's all!
|