Geography 414/514:  Advanced Geographic Data Analysis
Spring 2008

Exercise 5:  Spatial Plots and Statistics
Finish by Thusday, May 8

1.  Introduction

The purpose of this exercise is to examine some of the ways of characterizing spatial variations in a data set.  In particular, the exercise focuses on displaying variograms, the identfication of spatial neighbors, and the calculation of overall measures of spatial autocorrelation.  The exercise begins by examining some variograms for the Oregon climate-station data, and concludes with the calculation of spatial autocorrelation statistics for the Oregon county census data.

Read through the exercise before attempting to complete it.  More so than some of the other exercises, this one will result in the creation of a number of new objects, and stopping in the middle without saving the workspace might result in a lot of backtracking.

This exercise uses the RColorBrewer, maptools, gstat, and spdep packages.

2.  (Semi)Variograms

The semi-variogram (usually just variogram) summarizes in a descriptive plot the scale and nature of the spatial variability in a variable.  In this part of the exercise, the spatial variations in precipitation will be examined (we examined the variations of temperature in the lectures).

Here is a link to the data:  [orstationc.csv] and the Oregon county outline shape files:

[orotl.shp]     [orotl.dbf]     [orotl.shx]

Begin by mapping annual precipitation (pann) as in Section 2 of the previous exercise, plus any other way of plotting the data that might make sense. 

Q1:  Describe the broadscale pattern of annual precipitation.  Where is it highest and lowest?  Which direction does it vary the most in (remembering that up-and-down is a direction too).

Now examine the basic variogram for pann.  Here's the set up.

# 1 -- set up for calculating variograms using the gstat package
library(maptools)   # loads sp() too
library(RColorBrewer)
library(gstat)

# 2 -- open a new Lattice graphics window
library(lattice)
trellis.device(color=TRUE, theme = "col.whitebg")

# 3 -- get the county outline shapefile, and attach orstationc
orotl.shp <- readShapeLines(file.choose(),
     proj4string=CRS("+proj=longlat"))

attach(orstationc)

# 4 -- get x and y coordinates (in km) of county centroids
orstationc$y <- 110.*(lat-42)
orstationc$x <- 110.*(lon-(-125))*cos(lat/(360/(2*pi)))
plot(orstationc$x, orstationc$y) # plot the projected station locations

Step 1 does the necessary library() function calls..  The plotting functions in gstat use the lattice graphics, so step 2 opens a new graphics window.  Step 3 reads the shapefile and attaches the climate station data.  Step 4 "projects" the station locations in latitude and longitude into kilometers north and east of the southwestern corner of the region.  (The variogram calculations assume isotropy--that unit increments in the locational coordinates represent the same distance--which latitudes and longitudes don't (except right at the equator).)

# 5 -- assign a varible to examine
plotvar <- pann
plottitle <- "Annual Precipitation"

# basic variogram
vgm <- variogram(plotvar ~1, loc= ~x+y, data=orstationc)
plot(vgm, main=paste("Variogram: ",plottitle))

The variogram() function has as its first argument a formula that indicates what to find the variogram of.  In the above function call, the construct plotvar ~1 means just find the variogram of plotvar.  Note also that the variogram() function isn't restricted to examining only two-dimensional data, other dimensional coordinates my be specified in assignment to the loc keyword).

Q2:  Examine the variogram  Describe in particular the "range" of the variogram--the distance (on the horizontal axis) at which the semivariance values begin to level off.  (It's a judgement call, but it will be around the distance where the semivariance values stop continuously rising and start bouncing around).  To what extent does this estimate of the distance within which precipitation values are relatively homogeneous confirm or contradict your subjective observations about the map pattern of annual precipitation you described above?

The variogram cloud plot shows the contribution of individual pairs of locations to the overall variogram, and allows individual point-pairs to be identified.

# variogram cloud plot
vgm.cloud <- variogram(plotvar ~1, loc= ~x+y, data=orstationc, cloud=T)
plot(vgm.cloud, main=paste("Variogram: ",plottitle), identify=TRUE)

Click near individual points to identify them (the row numbers of the individual stations that contribute to the semivariance of each point clicked on are revealed), and right-click and click Stop to stop.  Then get a map of the station numbers:

# get a map, with labeled station row numbers
plot(orotl.shp)
text(lon, lat, as.character(seq(1:length(plotvar))), col="red", cex=.6)

The basic map will help figure out the number corresponding to each station.  There are a lot of stations, and they appear in alphabetical order in the data set, which won't make finding them on the map easy.  You can list the values for one particular station by typing, for example, orstationc[37,] for station 37, or orstationc[65,] for station 65 (note the use of the square brackets).

Q3:  Do you see anything interesting in the cloud plot?  Are some stations greater contributors to the larger values of gamma than other locations?  What might that indicate about the nature of those locations?

The gstat variogram() function allows large-scale trends in the data to be removed before calculating the variogram, something that might be desirable if some particular trend (i.e. western Oregon vs. eastern Oregon "swamped" the overall pattern.  This can be done by specifying some predictor variables in the formula in the variogram() function call.  Here, specifying plotvar ~x+y  is equivalent to removing a linear trend in space from the data.

# detrended variogram
detrend.vgm <- variogram(plotvar ~x+y, loc= ~x+y, data=orstationc)
plot(detrend.vgm, main=paste("Detrended variogram: ",plottitle))

Q4:  Examine how detrending changes the shape of the variogram for annual precipitation.  Does it increase or decrease the large- or small-scale variability apparent in the variogram.

A variation on the basic variogram is the directional variogram, that illustrates how spatial variability may vary as a function of direction in a data set. 

# directional variogram
dir.vgm <- variogram(plotvar ~1, loc=~x+y, data=orstationc,
     alpha=c(0,45,90,135))
plot(dir.vgm, main=paste("Directional variogram: ",plottitle))

The panel labels on the lattice plots work as follows:  0 describes N-S variations, 90 describes E-W variations, 45 describes NE-SW variations, etc.)

Q5:  Examine the plots for a couple of the climate variables.  Is there any evidence that the spatial variations in climate have any directional components to them (i.e. is there more N-S than E-W variation of climate in Oregon)?  (The answer probably should be "Well duh, yes, of course there is." but the thing to remember is that the variogram and other plots and statistics allow one to be fairly specific about how much more or less variation there is.)

3.  Mapping the Oregon census data

Begin by downloading and saving in your default working directory the components of a shapefile of Oregon county-level census data.

[orcounty.dbf] [orcounty.sbn] [orcounty.sbx] [orcounty.shp] [orcounty.shx]

Here are the Oregon county outline files:

[orotl.dbf] [orotl.shp] [orotl.shx]

The data can be viewed by opening orcounty.dbf in Excel.  Variable names are largely self explanatory.

Simple thematic maps can be produced using the maptools package, which permits a shapefile to be read, and contains a method for plotting individual variables (attributes) contained in an associated .dbf file.  Other functions in the package allow the extraction and manipulation of components of the shape file (i.e. polygons, polygon centroids, data, etc.).  The following code will read a shape file, extract some data, and plot a single variable.

# 1 -- read a shapefile of Oregon county-level census data
library(maptools)
library(RColorBrewer)
library(classInt)
# outlines of Oregon counties (lines)
orotl.shp <- readShapeLines(file.choose(),
     proj4string=CRS("+proj=longlat"))

# Oregon county census data (polygons)
orcounty.shp <- readShapePoly(file.choose(),
     proj4string=CRS("+proj=longlat"))

The readShapeLines() function requires only the name of the file with the .shp extension, and gets the remaining components automatically.  The RColorBrewer() function is called to get nice colors.

# 2 -- get names of variables in .dbf file in shape file
names(orcounty.shp@data)

The names of the variables are displayed using the names() function.  Note the hierarchical structure of the shape file:  the name of the shape file is orcounty.shp, and the variables are contained in a data frame-like object, @data, and so an individual variable, say, area of a county, can be referenced as orcounty.shp@data$AREA.  Note that you could attach the data attributes of the orcounty.shp shapefile (attach(orcounty.shp@data)), and be able to refer to individual variables by their "short" names.

# 3 -- select a variable and get color numbers for plotting
plotvar <- orcounty.shp@data$POP1990
nclr <- 8
plotclr <- brewer.pal(nclr,"BuPu")
#plotclr <- plotclr[nclr:1] # reorder colors if appropriate
# find equal-frequency intervals
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)

This fragment of code assigns a particular color number, and in turn color, to each polygon (county).  The assignments of values to plotvar and nclr make it easy to change the R code to plot a different variable, perhaps with a different number of class intervals.

# 4 plot the shapefile
plot(orcounty.shp, xlim=c(-125, -114), ylim=c(42,47))
plot(orcounty.shp, col=colcode, add=T)

The first plot() function call sets up a plotting region that is a little wider than the default one, which allows room for the legend, but doesn't actually plot anything (type="n").  The second plot() function call plots the shape file, with the "foreground" color set to the color assigned to the county.

# 6 -- add a title and legend
title("Area: Equal-Frequency Class Intervals")
legend(-117, 44,
legend=names(attr(colcode, "table")),
    fill=attr(colcode, "palette")
, cex=0.6, bty="n")

You may change the variable to be mapped and the number of categories.  Change the variable by changing the assignment to mapvar, its title by changing the assignment to maptitle, and the number of categories by changing the assignment to nclr.  Moreover, transformations can be made during this assignment.  For example, the log of county area can be plotted by changing the assignment, i.e. mapvar <-log10(orcounty$att.data$AREA).   To map a different variable, repeat steps 4 through 6.

A second variety of simple map is the graduated-symbol map (e.g. a "bubble plot"), formed by plotting symbols over an unfilled county outline map.  

# bubble plot equal-frequency class intervals
plotvar <- orcounty.shp@data$POP1990
plottitle <- "1990 Population"
nclr <- 8
plotclr <- brewer.pal(nclr,"BuPu")
#plotclr <- plotclr[nclr:1] # reorder colors if appropriate
max.symbol.size=12
min.symbol.size=1
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)
symbol.size <- ((plotvar-min(plotvar))/
    (max(plotvar)-min(plotvar))*(max.symbol.size-min.symbol.size)
    +min.symbol.size)

plot(orcounty.shp, xlim=c(-125, -115), ylim=c(42,47))
orcounty.cntr <- coordinates(orcounty.shp)
points(orcounty.cntr, pch=16, col=colcode, cex=symbol.size)
points(orcounty.cntr, cex=symbol.size)
title(plottitle)
legend(locator(1), legend=names(attr(colcode, "table")),
    fill=attr(colcode, "palette"), cex=0.6, bty="n")

The first block of code above sets the colors as usual, and also determines the symbol size for each county (which is proportional to the value of the plotting variable).. The second block establishes a plotting region and plots the counties, then gets the centroids of each polygon.  The first call to points() adds graduated-size symbols, where the color (col) and relative size of the dots (cex) are set by the integer-valued colornum., the second call to points() adds a black outline to each symbol, and finally the legend() function adds a legend.

Explore the data set.  Pay attention to the eastern Oregon/western Oregon and urban/rural contrasts.  

Q6:  For a couple of variables, attempt to subjectively characterize the broad-scale patterns in the data.  Are there general trends visible, or do smaller-scale variations prevail?  In comparing two variables, can you distinguish between different overall "styles" of spatial variability?  (Suggestion:  "Interesting" variables include county area (AREA), population (POP1990), or perhaps the number of mobile homes (MOBILEHOME), which might work better if reexpressed as a density e.g. mapvar <- orcounty$att.data$MOBILEHOME/orcounty$att.data$AREA
   maptitle <- Mobilehomes/sq mi
or alternatively
mapvar <- orcounty$att.data$MOBILEHOME/orcounty$att.data$HOUSEHOLDS
maptitle <- Mobilehomes/Total Households

4.  Spatial autocorrelation

Next, we'll examine the spatial autocorrelation of individual variables using the Moran statistic, a single-value description of the extent of spatial correlation of an individual variable--something that might seem quite useful compared with the sometimes-ambiguous nature of the variogram.  This part of the exercise also requires spatial-neighbor matrices, sometimes called "contiguity" matrices, that are of interest in their own right, inasmuch as they can reveal how the array of data points are spatially related to one another.

  • Spatial neighbors

Spatial-neighbor matrices are n x n matrices (or tables), where n is the number of points, that show how each point is related to the others.  The ith, jth element of the matrix (i.e. the entry in the ith row and jth column of the table) may be a distance (i.e. the distance between the ith and jth point), or a 1 or 0 (indicating that the ith point is a neighbor or or is adjacent to the jth point), or some other indicator of adjacency or distance.  Spatial neighbor matrices are provided by the spdep package.

Here's a set-up for obtaining and displaying the spatial-neighbor matrices for the Oregon county data, that provides a basemap to display the networks that illustrate the spatial neighbors:

library(maptools)
library(RColorBrewer)
library(spdep)

# Oregon county census data (polygons)
orcounty.shp <- readShapePoly(file.choose(),
     proj4string=CRS("+proj=longlat"))

Get the locations of the centroids of the counties .

# get centroids
orcounty.cntr <- coordinates(orcounty.shp)

# plot the county outline shape file and the variable
plot(orotl.shp, xlim=c(-125, -115), ylim=c(42,47))

# add county centroid points
points(
orcounty.cntr, pch=16, cex=1.2)

The last block of code generates a base map with county centroid points on it.

The "k-nearest neighbor" spatial spatial neighbors shows for each point the closest other point (the first or nearest neighbor), the next most closest point (the second nearest neighbor), and so on until k neighbors have been identified.  The following code gets the k-nearest neighbors (4 in this case), and plots line segments on the basemap linking those neighbors. 

# k nearest neighbors
k <- 4
nn <- knearneigh(orcounty.cntr, k, longlat=TRUE)
orcounty.neighbors.knn <- knn2nb(nn)
plot(orcounty.neighbors.knn, orcounty.cntr, add=T, lwd=2, col="green")

(Note that if the jth point is the nearest neighbor to the ith point, that does not imply that the ith point is the nearest neighbor of the jth point, because some other point may be closer to the jth point than the ith point is.  In other words, the k-nearest neighbor spatial neighbor matrix is not necessarily symmetrical.)

The distance-based neighbors of a point are all points that fall within a particular distance of the point.  The following code produces a distance-based spatial neighbors matrix:

# distance neighbors
d <- 150 # counties are neighbors if centroids are within d km of oneanother
orcounty.neighbors.dist <- dnearneigh(orcounty.cntr, 0, d, longlat=TRUE)
plot(orcounty.neighbors.dist, orcounty.cntr, add=T, lwd=2, col="magenta")

(Note, you may wish to save the previous map and reproduce the base map before plotting the distance-based spatial neighbors.)

The actual matrices (as opposed to the links between points) can be visualized by displaying the matrix as an "image"  Here's the code to display the k-nearest neighbor matrix.

# display spatial neighbors matrix
w.mat <- nb2mat(orcounty.neighbors.knn, zero.policy=TRUE)
w.cols <- 1:36
w.rows <- 1:36
image(w.cols, w.rows, w.mat)

(The default color scale used by the image() function is the "heatmap" scale, in which low values are plotted as red, and progressively higher values as orange, yellow and white (physicists love it almost as much as the rainbow scale.)

There is also a summary() function for a spatial-neighbor object, that produces some useful statistics about the links:

summary(orcounty.neighbors.dist, orcounty.cntr)

Q7:  Examine the two spatial neighbor networks, and describe the (sometimes subtle) difference between them.  (It will be easier to see the difference than describe it.)  You can experiment with the number of neighbors and the distance used to define neighbors to see the impact of those parameters.

  • The Moran statistic

The Moran statistic is a single-number summary of spatial autocorrelation.  The specific algorithm used by spdep allows the "significance" of the statistic to be judged (i.e. whether it is large enough to indicate appreciable spatial autocorrelation).  Significant Moran statistics are those with very low p-values, indicating very small chances of observing the particular value of the statistic by chance.

Here's the set-up and function (moran.test()) for calculating the Moran statistic for county area:

# Moran test
testvarname <- "AREA"
testdata <-
orcounty.shp@data
plotlab <- as.character(testdata$NAME)
moran.test(spNamedVec(testvarname, testdata),
    nb2listw(orcounty.neighbors.knn, style="B"))

Again, the assignments to testvarname (the actual character-string name of the variable, not the variable itself), testdata (the dataset name) and plotlab (the variable to label plots with (see below)) make the arguments of moran.test() easier to specify.  The function spNamedVec() checks to make sure that the order of the observations in the dataframe match those of the dimensions of the spatial-neighbors matrix.  Here the automatically do, but in general it's extremely easy for the listing of observations in the dataframe to differ from the corresponding points in the spatial-weights matrix (e.g. one might be in alphabetical order, the other in north-south order.)

An interesting plot is available that attempts to portray the strength of the spatial autocorrelation using a scatter diagram showing the relationship between the "spatially lagged" values of a variable and the values at the data points.  In addition, some "influence statistics" can be calculated that can identify those points that have a large nfluence the overall measure of spatial autocorrelation.

moran.plot(spNamedVec(testvarname, testdata),
    nb2listw(orcounty.neighbors.knn), labels=plotlab, pch=19,
    xlab=testvarname, ylab=c("Spatially lagged  ",testvarname))

Q8:  Calculate the Moran statistic for a couple of variables using the k-nearest neighbor spatial-neighbors matrix.  How does the apparent magnitude or significance of the spatial correlation vary among the individual calculations?  Does this variation make sense?  (Hint:  be sure to map the data you're describing with the Moran statistic.)