GEOG 414/514:  Advanced Geographic Data Analysis
Multivariate displays (Part 2)

4. High-Dimensional Data Plots

There are a number of "high-dimensional" data plots implemented in R that use different graphical approaches for illustrating relationships among many variables at a time; some recent developments will be illustrated in Lecture 9.  One such plot is the scatter plot matrix, which we have seen previously.  The other plot types are most conveniently described by simply constructing them for a typical data set.

  • Stars plot

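library(maps)
library(maptools) # provides readShapeLines(); maptools has since been retired from CRAN
attach(orstationc) # makes the columns (lon, lat, pjan, pjul, pann, ...) visible by name
# outlines of Oregon counties (lines)
orotl.shp <- readShapeLines(file.choose(),
     proj4string=CRS("+proj=longlat"))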

# stars: one star per station, drawn at the station's longitude and latitude
plot(orotl.shp)
col.red <- rep("red", nrow(orstationc))
stars(orstationc[,5:10], locations=as.matrix(cbind(lon, lat)),
     col.stars=col.red, len=0.5, key.loc=c(-118,41.2), labels=NULL, add=TRUE)

  • Rectangles plot

# rectangles
plot(orotl.shp)
pjul.scale <- (pjul-min(pjul))/(max(pjul)-min(pjul)) # width
pjan.scale <- (pjan-min(pjan))/(max(pjan)-min(pjan)) # height
rects <- cbind(pjul.scale, pjan.scale)
symbols(lon, lat, rectangles=rects, add=TRUE)

  • Thermometers plot

# thermometers
plot(orotl.shp)
t1 <- rep(0.05, nrow(orstationc)) # width
t2 <- (pann-min(pann))/(max(pann)-min(pann)) # height
t3 <- pjan/pann # shaded proportion
thermo <- cbind(t1, t2, t3)
symbols(lon, lat, thermometers=thermo, add=TRUE)

The stars plot (top) shows six variables at each station: January, July and annual temperature and precipitation.  The values are scaled to the overall range of each variable.  The contrasts across the state in temperature and precipitation are easy to see, with the western half of the state being warmer and wetter than the eastern half.  The large range in temperature between January and July is also noticeable in the eastern half of the state.  The rectangles plot (middle) shows July precipitation as the width and January precipitation as the height of each rectangle, with the values rescaled by the “minimax” transformation (the difference between each value and the minimum value, divided by the range).  Again the precipitation contrasts are easy to see.  The “thermometers” plot (bottom) shows annual precipitation by the height of the rectangle and the ratio of January to annual precipitation by the shaded portion.
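The "minimax" transformation used for the rectangles and thermometers can be wrapped in a small function.  This is only a sketch: rescale01() is a hypothetical helper name that does not appear in the original script, and the three input values are made up purely to show the arithmetic.

```r
# Min-max ("minimax") rescaling: (x - min) / range, which maps x onto [0, 1].
# rescale01() is a hypothetical helper, not part of the original script.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative values only (not station data)
x <- c(10, 25, 40)
rescale01(x)   # 0.0 0.5 1.0
```

With the station data attached, rescale01(pjul) and rescale01(pjan) would give the same values as the longer expressions in the rectangles example.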

  • Parallel coordinates display

# parallel coordinates display
library(lattice)
# parallel() was renamed parallelplot() in later versions of lattice
parallelplot(~cbind(pjul,pjan,pann,tjul,tjan,tann))

The parallel coordinates display shows each observation by a line connecting the scaled value of each variable.  Clusters of observations that have similar values of each variable can be seen as bundles of lines.  For example, the bundle of lines with low precipitation values (at the lower left of the plot) can be seen to have the highest relative values of July temperature, and relatively low values of January temperature.  Eastern Oregon, in other words.
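For readers without the station data at hand, the same kind of display can be tried on a data set that ships with R.  The sketch below uses iris as a stand-in (an assumption for illustration, unrelated to the Oregon data) and colors the lines by group, which makes the "bundles" easier to pick out.  In current versions of lattice the function is called parallelplot().

```r
library(lattice)

# iris ships with R and stands in here for the station data.
# Each line is one observation; each axis is one variable, scaled to its own range.
p <- parallelplot(~iris[1:4], data = iris, groups = Species,
                  horizontal.axis = FALSE)
print(p)
```

Lines that stay close together across all of the axes correspond to observations with similar values on every variable.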

5. Trellis Graphics

Many data sets include a mixture of both "continuous" variables (ordinal-, interval- or ratio-scale) and "discrete" variables (nominal-scale).  Often the question arises of how a particular relationship between variables differs among groups.  Information of that nature can be gained using conditioning plots (or coplots).  Such plots are part of a general scheme of visual data analysis, known as Trellis Graphics, that was created by the developers of the S language.  Trellis Graphics are implemented in R by the lattice package (the coplot() function used below is part of R's base graphics).

Coplots (conditioning scatter plots)

Conditioning scatter plots involve creating a multipanel display, where each panel contains a subset of the data.  This subset can be either a) those observations that fall in a particular group, or b) those observations whose values fall within a particular range of the values of a variable.  The idea is that the individual panels should illustrate the relationship between a pair of variables over part of the range of the two marginal "conditioning" variables (i.e. the relationship conditional on one marginal variable lying in one particular interval, and the other lying in a different interval).
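For case b), the ranges that define the panels can be inspected directly with co.intervals(), the function coplot() uses to build its conditioning intervals.  The sketch below uses the quakes data set that ships with R as a stand-in (an assumption for illustration only), with depth playing the role of the continuous conditioning variable.

```r
# quakes ships with R; depth stands in for a continuous conditioning variable.
# Ask for three intervals that share roughly half of their observations.
ci <- co.intervals(quakes$depth, number = 3, overlap = 0.5)
ci   # one row per interval: lower limit, upper limit

# Adjacent intervals overlap when the upper limit of one
# exceeds the lower limit of the next.
ci[1, 2] > ci[2, 1]
```

Because the intervals overlap, an observation can appear in more than one panel; setting overlap = 0 would give disjoint intervals instead.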

  • Coplot, conditioning by one factor variable

This coplot contains scatter diagrams of Yes as a function of the base-10 logarithm of population (Pop), conditioned by Country:

attach(scanvote)
coplot(Yes ~ log10(Pop) | Country, columns=3,
     panel=function(x,y,...) {
          panel.smooth(x,y,span=.8,iter=5,...)
          abline(lm(y ~ x), col="blue")
     }
)

Note the use of the "panel" function here.  The coplot() function determines which subset of observations should appear in each panel, while the two function calls within the panel function (panel.smooth() and abline()) perform their tasks on that subset of observations.  In other words, coplot() selects the observations of Yes and log10(Pop) for a particular panel (i.e. country) and sends them to the panel function (relabeled as x and y); panel.smooth() then plots the points and draws a lowess curve, and abline() adds the least-squares line for those observations.  The general idea is to compare the panels (countries), seeing where in each panel the points lie and what the relationship looks like.  The general relationship between population and percent of Yes votes is apparent, as are country-to-country differences, like the generally greater proportion of Yes votes in Finland.
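The division of labor between coplot() and the panel function can be imitated by hand: split the data by the conditioning factor, then run the same calculation on each piece.  The sketch below is not how coplot() is implemented internally; it only mimics the idea, and it uses the iris data that ships with R as a stand-in (Species in the role of the conditioning factor), since the voting data are not reproduced here.

```r
# iris ships with R; Species plays the role of the conditioning factor.
groups <- split(iris, iris$Species)

# For each subset, fit the same kind of least-squares line that
# abline(lm(y ~ x)) draws inside a panel, and keep its slope.
slopes <- sapply(groups, function(d) {
  x <- d$Petal.Length
  y <- d$Sepal.Length
  coef(lm(y ~ x))[["x"]]
})
slopes   # one slope per group, named by factor level
```

Comparing the slopes across groups is the numerical counterpart of comparing the blue lines across panels.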

  • Coplot, conditioning by one continuous numeric variable

Most of the time, the conditioning variables are continuous numeric variables.  Here's a coplot for WidthWS as a function of DepthWS in the Summit Cr. data set, conditioned by CumLen (or distance downstream):

attach(sumcr)
coplot(WidthWS ~ DepthWS | CumLen, pch=14+as.integer(Reach), cex=1.5,
     number=3, columns=3,
     panel=function(x,y,...) {
          panel.smooth(x,y,span=.8,iter=5,...)
          abline(lm(y ~ x), col="blue")
     }
)

We know the arrangement of the reaches, and so the resulting plot should be no surprise.  The plotting characters are determined by Reach, to reveal the extent of overlap in the conditioning "shingles."  The plot could be regenerated using Reach as the conditioning variable, which would result in no overlap between the individual panels.

It’s easy to see that the two grazed reaches (A upstream and C downstream) have generally wider channels, which would be expected.  Something that is not apparent in ordinary plots of the data is that the “normal” or expected inverse relationship between width and depth (as one gets bigger the other gets smaller) does not apply in the middle (exclosure) reach.
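Regenerating the plot with Reach as the conditioning variable, as suggested above, might look like the following.  This is a sketch under stated assumptions: coplot_by_reach() is a hypothetical wrapper name, and it assumes a data frame with columns WidthWS and DepthWS and a factor Reach, as the sumcr data appear to have.

```r
# Sketch only: assumes a data frame d with columns WidthWS, DepthWS
# and a factor Reach. coplot_by_reach() is a hypothetical wrapper name.
coplot_by_reach <- function(d) {
  coplot(WidthWS ~ DepthWS | Reach, data = d,
         pch = 14 + as.integer(d$Reach), cex = 1.5, columns = 3,
         panel = function(x, y, ...) {
           panel.smooth(x, y, span = 0.8, iter = 5, ...)  # points and lowess curve
           abline(lm(y ~ x), col = "blue")                # least-squares line
         })
}

# Usage, once the data are loaded:
# coplot_by_reach(sumcr)
```

Because Reach is a factor, each observation falls in exactly one panel, so all of the points in a panel share one plotting symbol.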

The main documentation for Trellis graphics includes two .pdf documents published by the developers of the S language and Trellis Graphics at Lucent Technologies:

  • Trellis Graphics User Manual

  • A Tour of Trellis Graphics