|
Geography 414/514: Advanced Geographic Data Analysis
Spring 2008
Exercise 5: Spatial Plots and Statistics
Finish by Thusday, May 8
1. Introduction
The purpose of this exercise is to examine some of the ways of characterizing
spatial variations in a data set. In particular, the exercise focuses on
displaying variograms, the identfication of spatial neighbors, and the
calculation of overall measures of spatial autocorrelation. The exercise
begins by examining some variograms for the Oregon climate-station data, and
concludes with the calculation of spatial autocorrelation statistics for the
Oregon county census data.
Read through the exercise before attempting to complete it. More
so than some of the other exercises, this one will result in the creation of a
number of new objects, and stopping in the middle without saving the workspace
might result in a lot of backtracking.
This exercise uses the
RColorBrewer,
maptools,
gstat,
and spdep
packages.
2.
(Semi)Variograms
The semi-variogram
(usually just variogram)
summarizes in a descriptive plot the scale and nature of the spatial variability
in a variable. In this part of the exercise, the spatial variations in
precipitation will be examined (we examined the variations of temperature in the
lectures).
Here is a link to
the data: [orstationc.csv]
and the Oregon county outline shape files:
[orotl.shp]
[orotl.dbf]
[orotl.shx]
Begin by mapping
annual precipitation (pann) as in Section 2 of the previous exercise, plus any
other way of plotting the data that might make sense.
Q1: Describe
the broadscale pattern of annual precipitation. Where is it highest and
lowest? Which direction does it vary the most in (remembering that
up-and-down is a direction too).
Now examine the basic variogram for pann. Here's
the set up.
# 1 -- set up
for calculating variograms using the gstat package
library(maptools) # loads sp() too
library(RColorBrewer)
library(gstat)
# 2 -- open
a new Lattice graphics window
library(lattice)
trellis.device(color=TRUE, theme = "col.whitebg")
# 3 -- get the
county outline shapefile, and attach orstationc
orotl.shp <- readShapeLines(file.choose(),
proj4string=CRS("+proj=longlat"))
attach(orstationc)
# 4 -- get x and
y coordinates (in km) of county centroids
orstationc$y <- 110.*(lat-42)
orstationc$x <- 110.*(lon-(-125))*cos(lat/(360/(2*pi)))
plot(orstationc$x, orstationc$y) # plot the projected station locations
Step 1 does the necessary
library()
function calls.. The plotting functions in
gstat
use the lattice graphics, so step 2 opens a new graphics window.
Step 3 reads the shapefile and attaches the climate station data.
Step 4 "projects" the station locations in latitude and longitude into
kilometers north and east of the southwestern corner of the region.
(The variogram calculations assume isotropy--that unit increments in the
locational coordinates represent the same distance--which latitudes and
longitudes don't (except right at the equator).)
# 5 -- assign a
varible to examine
plotvar <- pann
plottitle <- "Annual Precipitation"
# basic
variogram
vgm <- variogram(plotvar ~1, loc= ~x+y, data=orstationc)
plot(vgm, main=paste("Variogram: ",plottitle))
The variogram()
function has as its first argument a formula that indicates what to find
the variogram of. In the above function call, the construct
plotvar ~1 means just find
the variogram of plotvar.
Note also that the variogram()
function isn't restricted to examining only two-dimensional data, other
dimensional coordinates my be specified in assignment to the
loc keyword).
Q2: Examine the variogram Describe in particular the
"range" of the variogram--the distance (on the horizontal axis) at which
the semivariance values begin to level off. (It's a judgement
call, but it will be around the distance where the semivariance values
stop continuously rising and start bouncing around). To what
extent does this estimate of the distance within which precipitation
values are relatively homogeneous confirm or contradict your subjective
observations about the map pattern of annual precipitation you described
above?
The variogram cloud plot shows
the contribution of individual pairs of locations to the overall variogram,
and allows individual point-pairs to be identified.
# variogram
cloud plot
vgm.cloud <- variogram(plotvar ~1, loc= ~x+y, data=orstationc, cloud=T)
plot(vgm.cloud, main=paste("Variogram: ",plottitle), identify=TRUE)
Click near individual points to identify them (the row
numbers of the individual stations that contribute to the semivariance of
each point clicked on are revealed), and right-click and click Stop to
stop. Then get a map of the station numbers:
# get a map,
with labeled station row numbers
plot(orotl.shp)
text(lon, lat, as.character(seq(1:length(plotvar))), col="red", cex=.6)
The basic map will help figure out the number
corresponding to each station. There are a lot of stations, and they
appear in alphabetical order in the data set, which won't make finding
them on the map easy. You can list the values for one particular
station by typing, for example,
orstationc[37,]
for station 37, or
orstationc[65,]
for station 65 (note the use of the square brackets).
Q3: Do you see anything
interesting in the cloud plot? Are some stations greater
contributors to the larger values of gamma than other locations?
What might that indicate about the nature of those locations?
The gstat
variogram()
function allows large-scale trends in the data to be removed before
calculating the variogram, something that might be desirable if some
particular trend (i.e. western Oregon vs. eastern Oregon "swamped" the
overall pattern. This can be done by specifying some predictor
variables in the formula in the
variogram()
function call. Here, specifying
plotvar ~x+y
is equivalent to removing a linear trend in space from the data.
# detrended
variogram
detrend.vgm <- variogram(plotvar ~x+y, loc= ~x+y, data=orstationc)
plot(detrend.vgm, main=paste("Detrended variogram: ",plottitle))
Q4: Examine how detrending changes the shape of
the variogram for annual precipitation. Does it increase or
decrease the large- or small-scale variability apparent in the variogram.
A variation on the basic variogram is
the directional variogram, that illustrates how spatial variability
may vary as a function of direction in a data set.
# directional
variogram
dir.vgm <- variogram(plotvar ~1, loc=~x+y, data=orstationc,
alpha=c(0,45,90,135))
plot(dir.vgm, main=paste("Directional variogram: ",plottitle))
The panel labels on the lattice plots work as
follows: 0 describes N-S variations, 90 describes E-W variations,
45 describes NE-SW variations, etc.)
Q5: Examine the plots for a
couple of the climate variables. Is there any evidence that the
spatial variations in climate have any directional components to them
(i.e. is there more N-S than E-W variation of climate in Oregon)?
(The answer probably should be "Well duh, yes, of course there is." but
the thing to remember is that the variogram and other plots and
statistics allow one to be fairly specific about how much more or
less variation there is.)
3. Mapping the Oregon census data
Begin by downloading and saving in your default working directory the
components of a shapefile of Oregon county-level census data.
[orcounty.dbf]
[orcounty.sbn]
[orcounty.sbx]
[orcounty.shp]
[orcounty.shx]
Here are the Oregon county outline files:
[orotl.dbf]
[orotl.shp]
[orotl.shx]
The data can be viewed by opening orcounty.dbf in Excel. Variable names are largely self explanatory.
Simple thematic maps can be produced using the maptools package, which
permits a shapefile to be read, and contains a method for plotting individual
variables (attributes) contained in an associated .dbf file. Other
functions in the package allow the extraction and manipulation of components of
the shape file (i.e. polygons, polygon centroids, data, etc.). The
following code will read a shape file, extract some data, and plot
a single variable.
# 1 -- read a shapefile of Oregon
county-level census data
library(maptools)
library(RColorBrewer)
library(classInt)
# outlines of Oregon counties
(lines)
orotl.shp <- readShapeLines(file.choose(),
proj4string=CRS("+proj=longlat"))
# Oregon county census data (polygons)
orcounty.shp <- readShapePoly(file.choose(),
proj4string=CRS("+proj=longlat"))
The
readShapeLines()
function requires only the name of the file with the .shp extension, and
gets the remaining components automatically. The
RColorBrewer()
function is called to get nice colors.
# 2 -- get names of variables in
.dbf file in shape file
names(orcounty.shp@data)
The names of the variables are displayed using the
names()
function. Note the hierarchical structure of the shape file:
the name of the shape file is orcounty.shp, and the variables are contained in
a data frame-like object,
@data, and so an individual variable,
say, area of a county, can be referenced as
orcounty.shp@data$AREA.
Note that you could attach the data attributes of the
orcounty.shp
shapefile (attach(orcounty.shp@data)),
and be able to refer to individual variables by their "short" names.
# 3 -- select a variable and get
color numbers for plotting
plotvar <- orcounty.shp@data$POP1990
nclr <- 8
plotclr <- brewer.pal(nclr,"BuPu")
#plotclr <- plotclr[nclr:1] # reorder colors if appropriate
# find equal-frequency intervals
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)
This fragment of code assigns a particular color number, and in turn
color, to each polygon (county). The assignments of values to
plotvar
and nclr
make it easy to change the R code to plot a different variable, perhaps
with a different number of class intervals.
# 4 plot the shapefile
plot(orcounty.shp, xlim=c(-125,
-114), ylim=c(42,47))
plot(orcounty.shp, col=colcode, add=T)
The first plot()
function call sets up a plotting region that is a little wider than the
default one, which allows room for the legend, but doesn't actually plot
anything (type="n").
The second plot()
function call plots the shape file, with the "foreground" color set to the
color assigned to the county.
# 6 -- add a title and legend
title("Area: Equal-Frequency Class Intervals")
legend(-117, 44,
legend=names(attr(colcode, "table")),
fill=attr(colcode, "palette"), cex=0.6, bty="n")
You may change the variable to be mapped and the number of categories.
Change the variable by changing the assignment to
mapvar, its title by changing
the assignment to maptitle, and the number of
categories by changing the assignment to
nclr.
Moreover, transformations can be made during this assignment. For example,
the log of county area can be plotted by changing the assignment, i.e.
mapvar <-log10(orcounty$att.data$AREA).
To map a different variable, repeat steps 4 through 6.
A second variety of simple map is the graduated-symbol
map (e.g. a "bubble plot"), formed by plotting symbols over an
unfilled county outline map.
# bubble plot equal-frequency
class intervals
plotvar <- orcounty.shp@data$POP1990
plottitle <- "1990 Population"
nclr <- 8
plotclr <- brewer.pal(nclr,"BuPu")
#plotclr <- plotclr[nclr:1] # reorder colors if appropriate
max.symbol.size=12
min.symbol.size=1
class <- classIntervals(plotvar, nclr, style="quantile")
colcode <- findColours(class, plotclr)
cutpts <- round(class$brks, digits=1)
symbol.size <- ((plotvar-min(plotvar))/
(max(plotvar)-min(plotvar))*(max.symbol.size-min.symbol.size)
+min.symbol.size)
plot(orcounty.shp, xlim=c(-125, -115), ylim=c(42,47))
orcounty.cntr <- coordinates(orcounty.shp)
points(orcounty.cntr, pch=16, col=colcode, cex=symbol.size)
points(orcounty.cntr, cex=symbol.size)
title(plottitle)
legend(locator(1), legend=names(attr(colcode, "table")),
fill=attr(colcode, "palette"), cex=0.6, bty="n")
The first block of code above sets the colors as
usual, and also determines the symbol size for each county (which is
proportional to the value of the plotting variable).. The second block
establishes a plotting region and plots the counties, then gets the
centroids of each polygon. The first call to
points() adds graduated-size symbols, where the color (col)
and relative size of the dots (cex)
are set by the integer-valued colornum., the second call to
points()
adds a black outline to each symbol, and finally the
legend()
function adds a legend.
Explore the data set. Pay attention to the eastern
Oregon/western Oregon and urban/rural contrasts.
Q6: For a couple of variables, attempt to
subjectively characterize the broad-scale patterns in the data. Are there
general trends visible, or do smaller-scale variations prevail? In
comparing two variables, can you distinguish between different overall
"styles" of spatial variability? (Suggestion:
"Interesting" variables include county area (AREA), population (POP1990), or
perhaps the number of mobile homes (MOBILEHOME), which might work better if
reexpressed as a density e.g. mapvar <-
orcounty$att.data$MOBILEHOME/orcounty$att.data$AREA
maptitle <- Mobilehomes/sq mi
or alternatively
mapvar <- orcounty$att.data$MOBILEHOME/orcounty$att.data$HOUSEHOLDS
maptitle <- Mobilehomes/Total Households
4. Spatial autocorrelation
Next, we'll examine the spatial autocorrelation of individual variables using
the Moran statistic, a single-value description of the extent of spatial
correlation of an individual variable--something that might seem quite
useful compared with the sometimes-ambiguous nature of the variogram.
This part of the exercise also requires spatial-neighbor matrices,
sometimes called "contiguity" matrices, that are of interest in their own
right, inasmuch as they can reveal how the array of data points are
spatially related to one another.
Spatial-neighbor matrices are n x n matrices (or tables),
where n is the number of points, that show how each point is related to
the others. The ith, jth element of the matrix (i.e.
the entry in the ith row and jth column of the table) may be
a distance (i.e. the distance between the ith and jth
point), or a 1 or 0 (indicating that the ith point is a neighbor or
or is adjacent to the jth point), or some other indicator of
adjacency or distance. Spatial neighbor matrices are provided by the spdep
package.
Here's a set-up for obtaining and displaying the spatial-neighbor
matrices for the Oregon county data, that provides a basemap to display
the networks that illustrate the spatial neighbors:
library(maptools)
library(RColorBrewer)
library(spdep)
# Oregon county census data (polygons)
orcounty.shp <- readShapePoly(file.choose(),
proj4string=CRS("+proj=longlat"))
Get the locations of the centroids of
the counties .
# get centroids
orcounty.cntr <- coordinates(orcounty.shp)
# plot the county outline shape
file and the variable
plot(orotl.shp, xlim=c(-125, -115), ylim=c(42,47))
# add county centroid points
points(orcounty.cntr,
pch=16, cex=1.2)
The last block of code generates a base map with county centroid points
on it.
The "k-nearest neighbor" spatial spatial neighbors shows for each point
the closest other point (the first or nearest neighbor), the next most
closest point (the second nearest neighbor), and so on until k neighbors
have been identified. The following code gets the k-nearest
neighbors (4 in this case), and plots line segments on the basemap linking
those neighbors.
# k nearest neighbors
k <- 4
nn <- knearneigh(orcounty.cntr, k, longlat=TRUE)
orcounty.neighbors.knn <- knn2nb(nn)
plot(orcounty.neighbors.knn, orcounty.cntr, add=T, lwd=2, col="green")
(Note that if the jth point is the nearest neighbor to the ith point,
that does not imply that the ith point is the nearest neighbor of the jth
point, because some other point may be closer to the jth point than the
ith point is. In other words, the k-nearest neighbor spatial
neighbor matrix is not necessarily symmetrical.)
The distance-based neighbors of a point are all points that fall within
a particular distance of the point. The following code produces a
distance-based spatial neighbors matrix:
# distance neighbors
d <- 150 # counties are neighbors if centroids are within d km of
oneanother
orcounty.neighbors.dist <- dnearneigh(orcounty.cntr, 0, d, longlat=TRUE)
plot(orcounty.neighbors.dist, orcounty.cntr, add=T, lwd=2, col="magenta")
(Note, you may wish to save the previous map and reproduce the base map
before plotting the distance-based spatial neighbors.)
The actual matrices (as opposed to the links between points) can be
visualized by displaying the matrix as an "image" Here's the code to
display the k-nearest neighbor matrix.
# display spatial neighbors matrix
w.mat <- nb2mat(orcounty.neighbors.knn, zero.policy=TRUE)
w.cols <- 1:36
w.rows <- 1:36
image(w.cols, w.rows, w.mat)
(The default color scale used by the image() function is the "heatmap"
scale, in which low values are plotted as red, and progressively higher
values as orange, yellow and white (physicists love it almost as much as
the rainbow scale.)
There is also a summary() function for a spatial-neighbor object, that
produces some useful statistics about the links:
summary(orcounty.neighbors.dist,
orcounty.cntr)
Q7: Examine the two spatial neighbor networks, and describe the (sometimes subtle) difference between them.
(It will be easier to see the difference than describe it.) You can
experiment with the number of neighbors and the distance used to define
neighbors to see the impact of those parameters.
The Moran statistic is a single-number summary of spatial autocorrelation.
The specific algorithm used by spdep allows the "significance" of the
statistic to be judged (i.e. whether it is large enough to indicate
appreciable spatial autocorrelation). Significant Moran statistics are
those with very low p-values, indicating very small chances of observing the
particular value of the statistic by chance.
Here's the set-up and function (moran.test())
for calculating the Moran statistic for county area:
# Moran test
testvarname <- "AREA"
testdata <-
orcounty.shp@data
plotlab <- as.character(testdata$NAME)
moran.test(spNamedVec(testvarname, testdata),
nb2listw(orcounty.neighbors.knn, style="B"))
Again, the assignments to
testvarname
(the actual character-string name of the variable, not the variable itself),
testdata
(the dataset name) and plotlab (the variable to label plots with (see below))
make the arguments of moran.test()
easier to specify. The function
spNamedVec()
checks to make sure that the order of the observations in the dataframe match
those of the dimensions of the spatial-neighbors matrix. Here the
automatically do, but in general it's extremely easy for the listing of
observations in the dataframe to differ from the corresponding points in the
spatial-weights matrix (e.g. one might be in alphabetical order, the other in
north-south order.)
An interesting plot is available that attempts to
portray the strength of the spatial autocorrelation using a scatter diagram
showing the relationship between the "spatially lagged" values of a variable
and the values at the data points. In addition, some "influence
statistics" can be calculated that can identify those points that have a large
nfluence the overall measure of spatial autocorrelation.
moran.plot(spNamedVec(testvarname,
testdata),
nb2listw(orcounty.neighbors.knn), labels=plotlab, pch=19,
xlab=testvarname, ylab=c("Spatially lagged ",testvarname))
Q8: Calculate the Moran statistic for a couple of variables using
the k-nearest neighbor spatial-neighbors matrix. How does the apparent
magnitude or significance of the spatial correlation vary among the
individual calculations? Does this variation make sense?
(Hint: be sure to map the data you're describing with the Moran
statistic.)
|