Geography 414/514:  Advanced Geographic Data Analysis
Spring 2008

Exercise 8 – Multivariate Analysis
Finish by June 11

1.  Introduction

The data set used for most of this exercise consists of a subset of the variables and observations in a data set from Drew Lamb’s thesis, observed for 120 streams in northeastern Oregon.  The variables are defined as follows:

1) Stream – stream name   8) SmLWD – no. of pieces of small large woody debris ( LWD) per mile
2) DA –  drainage area 9) LrgLWD – no. of pieces of small large woody debris (LWD) per mile
3) Health – stream “health” (factor)   10) PoolDepth – depth of pools  
4) ReachLen – length of the reach   11) PoolArea – area of habitat units that are pools  
5) Pools – number of pools per mile of stream 12) WDriff – width- to- depth ratio in pools  
6) DeepPools – number of deep pools per mile 13) PRratio – pool- to- riffle ratio  
7) LargePools – number of large pools per mile 14) BFWDratio -- bankfull width-to-depth ratio

There are two version of the data set:  1) the original values [streams4.csv] and 2) transformed values [tstreams4.csv].  The exercise will focus on the latter.

NOTE:  The variable names are the same in the two data sets, and so they both cannot be attached at the same time. 

2.  Multivariate plots

To start, download the .csv files to your working directory, and read them in as "streams" and "tstreams":

streams <- read.csv("streams4.csv")
tstreams <- read.csv("tstreams4.csv")

or

streams <- read.csv(file.choose())
tstreams <- read.csv(file.choose())

Get matrix scatter diagrams of the two data sets:

plot(streams[3:13])
plot(tstreams[3:13])

A few features should emerge, including the fact that for some variables, many zeros are observed (i.e. the count or frequency variables), and that some values for BFWDratio have evidently been estimated using drainage area.  (How can you tell?)  Overall the transformed data set exhibits stronger linear relationships among variables, and so this one will be used below.

3.  Principal Components Analysis (PCA)

Begin by doing a simple principal components analysis on variables 3 through 13 (i.e. just the continuous variables, and not the factor Health).

# principal components analysis of (transformed) NE Oregon streams data
attach(tstreams)
tstreams.matrix <- data.matrix(tstreams[,3:13])
tstreams.pca <- princomp(tstreams.matrix, cor=T)

Columns 3 through 13 are copied into a matrix for convenience, and then the princomp() function does the analysis.  Now take a look at the results:

tstreams.pca
summary(tstreams.pca)
print(loadings(tstreams.pca),cutoff=0.0)
screeplot(tstreams.pca)
biplot(tstreams.pca, xlabs=rownames(tstreams),
     main="PCA Biplot", cex=0.8)

Typing the name of the resulting PCA object along with the application of summary() function produces information on the relative importance of the principal components.  Important components are generally regarded as those with variances (eigenvalues) greater than 1.0.  Printing the results of the application of the  "loadings()" function produces a table that contains the correlations between the original variables and the new components, while the "screeplot()" function gives a graphic picture of the importance of the first few components.

The biplot() function shows both the observations (labeled by row number here) and the variables (represented by vectors) on the same display (a biplot).  See Jacoby (1998, Ch. 7) for a discussion of this plot.

Q.1:  How many important components are there?  Is there a natural breakpoint between the important and less-important components?  Does the biplot clearly reveal the “structure” in the data (i.e. groups of vectors at right-angles to one another), or is it ambiguous (vectors aimed all over the place)?  Are the locations of the individual (numbered) observations on the biplot consistent with your interpretation of the components?  (You'll have to look at the data in some way to answer the last part.)

4.  Simple Factor Analysis

Next, do a simple (unrotated) factor analysis of the same variables:  

# factor analysis:  extract 4 factors, no rotation
tstreams.fa1 <- factanal(tstreams.matrix, factors=4, rotation="none",
     scores="regression")
tstreams.fa1 # list the results
print(loadings(tstreams.fa1),cutoff=0.7)
plot(loadings(tstreams.fa1)[,1:2])
text(loadings(tstreams.fa1)[,1:2], labels=colnames(tstreams.matrix))
plot(tstreams.fa1$scores[,1:2], type="n")
text(tstreams.fa1$scores[,1:2], labels=rownames(tstreams.matrix))
biplot(tstreams.fa1$scores[,1:2], loadings(tstreams.fa1)[,1:2],
     main="PCA/Factor Analysis Biplot -- 4 Components", cex=0.8)

Q2:  Compare the results of the factor analysis to the principal components analysis.  What looks similar, what looks different?   Hint:  Focus on the biplot, and the relative position of the vectors and points as opposed to their absolute positions on the diagrams.  What are the basic “dimensions” of the data?  From the perspective of somebody administering a field measurement and monitoring program, are there any redundancies among the variables that might be eliminated?

5.  Rotated Factors

Next, we’ll rotate the factors to try to make them more interpretable. 

tstreams.fa2 <- factanal(tstreams.matrix, factors=4, rotation="varimax",
     scores="regression")
tstreams.fa2
print(loadings(tstreams.fa2),cutoff=0.7)
plot(loadings(tstreams.fa2)[,1:2])
text(loadings(tstreams.fa2)[,1:2], labels=colnames(tstreams.matrix))
plot(tstreams.fa2$scores[,1:2], type="n")
text(tstreams.fa2$scores[,1:2], labels=rownames(tstreams.matrix))
biplot(tstreams.fa2$scores[,1:2], loadings(tstreams.fa2)[,1:2],
     main="PCA/Factor Analysis Biplot -- 4 Rotated Components", cex=0.8)

Q3:  How do the loadings and the biplot change?  Which result, the unrotated or the rotated factors seem more interpretable?  Hint:  a more complicated analysis might not be better.  What are the basic “dimensions” of the data?  From the perspective of somebody administering a field measurement and monitoring program, are there any redundancies among the variables that might be eliminated?

6.  Multivariate Analysis of Variance (MANOVA)

The status of the streams were described by the land managers in terms of their perceived “ecological health” this is coded here by the variable Health.  A multivariate analysis of variance is aimed at answering the question "do the healthy (Health=H) and "less-healthy" (Health=U) groups of reaches similar in their fluvial geomorphic characteristics?

Take a look at some plots first, to get a subjective idea of how the groups of observations differ (histogram() is a lattice function, so load the lattice library first).  Also attach tstreams (up to this point variables have been implicitly referred to as columns in data matrices).

library(lattice)
trellis.device(windows, theme = col.whitebg())

# attach tstreams
attach(tstreams)

# plot distributions by groups
histogram(~ PoolDepth | Health, nint=20, aspect=1/2)

Take a look at a univariate analyses of variance (ANOVA) first to look at the significance of the differences in means between groups for one of the variables (e.g. PoolDepth):

# one-way analysis of variance
# set variable to analyze

# test for homogeneity of group variances
tapply(PoolDepth, Health, var)
bartlett.test(PoolDepth~ Health)

# analysis of variance
streams.aov1 <- aov(PoolDepth~ Health)
summary(streams.aov1)

The test for homogeneity of group variances is done first, because if the variances are significantly different (i.e. the p-value for the Bartlett test is close to zero), then it doesn't make sense to ask whether the means are different.  In the analysis of variance, large F statistics (and consequently small p-values) signal significant differences between (or among) groups.

You could examine each variable in the data set this way, and it's likely that some variables will appear to have means that are significantly different between groups, others that aren't, and others that may be borderline, and so it may be difficult to get and overall sense of whether the two groups of streams are different.  Also, it's likely that if enough pair-wise comparisons are done, some "significant" results (1 in 20) will turn up even if the groups of observations have identical means.

Multivariate analysis of variance (MANOVA) provides a single-number test statistic that can be used to answer the question "overall, are the group means significantly different?"

# MANOVA
Y <- cbind(DA, ReachLen, Pools, DeepPools, LargePools, SmLWD, LrgLWD,
    PoolDepth, PoolArea, PRratio, BFWDratio)
streams.mva1 <- manova(Y ~ Health)
summary.aov(streams.mva1)
summary(streams.mva1, test="Wilks")

Note the creation of a new temporary variable (Y) that makes it efficient to apply the manova() function.  Wilks lambda statistic can be interpreted as a multivariate generalization of the univariate F statistic.   The summary.aov() function creates most of the output -- individual ANOVAs of for the variables in the data set.  The summary() function applied to a  manova() object produces the Wilk's lambda statistic.

Q4:  Do the groups of observations differ significantly?  (Remember, you can look at
http://geography.uoregon.edu/GeogR/topics/interpstats.htm for a guide to interpreting test statistics.)

7.  Discriminant Analysis

A discriminant analysis focuses on the issue of how two (or more) groups of observations differ (and also includes the sort of information provided by an analysis of variance on whether the groups are different).  Whereas MANOVA answers the question "are the groups different?" discriminant analysis answers the question "how are they different.)

Do a simple discriminant analysis on these data, using the lda() function from the Venebles and Ripley MASS library:  

# linear discriminant analysis
library(MASS)

health.lda1 <- lda(Health ~ DA + ReachLen + Pools + DeepPools + LargePools
    + SmLWD + LrgLWD + PoolDepth + PoolArea + PRratio + BFWDratio)

Note the formula, Health depends on the other variables.

health.lda1
plot(health.lda1)

List and plot the results.  Because there are only two groups of observations here, there will be only one discriminant function. 

health.dscore <- predict(health.lda1, dimen=1)$x
boxplot(health.dscore ~ Health)
cor(tstreams[3:13],health.dscore)

The predict(...)$x function gets the "discriminant scores" (the x's) which are assigned to the variable health.dscore, the boxplot() function plots them by group (and should show clear variations of the discriminant scores among groups if the groups are distinctly different), while the cor() function gets the correlations between the new "discrimimant function" and the original variables (called "canonical coefficients"), which may be interpreted like PCA loadings.  These values show which variables are most closely correlated with the discriminant scores, and consequently which variables best illustrate the differences between or among the groups.

Examine the ability of the new discriminant function to correctly classify the healthiness of the stream reaches.

health.class <- predict(health.lda1, dimen=1)$class
class.table <- table(Health, health.class)
mosaicplot(class.table, color=T)

The predict(...)$class function gets the group that the new discriminant function would assign each original observation to based on the values of the individual variables, the table() function provides a cross tabulation of the observed healthiness of the streams (Health), and that predicted by the discriminant function (health.class).  The mosaic() function plots the table.

Q5:  How do the two groups of observations differ?  Would knowing the values of a few variables allow you to correctly classify the healthiness of streams for which stream health has yet to be measured in the field?

8.  Cluster analysis 

Attempt to define the climate regions of Oregon (as expressed in the climate-station data in [orstationc.csv] via an "agglomerative-hierarchical" cluster analysis.  In such an analysis, individual objects (observations) are combined according to their (dis)similarity, with the two most similar objects being combined first (to create a new composite object), then the next most similar objects, and so on until a single object is defined.  The sequence of combinations is illustrated using a "dendrogram" (cluster tree), and this can be examined to determine the number of clusters.  The function cutree determines the cluster membership of each observation, which can be listed or plotted.

Begin with a map, labeled by station name (see Exercise 7)

# hierarchical clustering of Oregon climate station data
library(maptools)

# outlines of Oregon counties (lines)
orotl.shp <- readShapeLines(file.choose(),
     proj4string=CRS("+proj=longlat"))

# Oregon climate station data (points)
attach(orstationc)

plot(lon, lat, xlim=c(-125, -116), ylim=c(41,47),
   xlab="Latitude", ylab="Longitude", type="n")
plot(orotl.shp, add=T)
text(lon, lat, labels=station, cex=0.6)

Calculate a dissimilarity (distance) matrix:

# get dissimilarities (distances) and cluster
X <- as.matrix(orstationc[,5:10])
X.dist <- dist(scale(X))
image(seq(1:92),seq(1:92),as.matrix(X.dist))

For convenience, a matrix, X, is created to omit the variables we don't want to include in the clustering (i.e. the non-climatic information).  The image() function plots the dissimilarity matrix that the cluster analysis works with.  Now do the cluster analysis:

X.clusters <- hclust(X.dist, method = "complete")
plot(X.clusters, labels=station, cex=0.5)

The plot() function (or pltclust() function) plots the dendrogram. 

The dendrogram can be inspected to get an idea of how many distinct clusters there are.  In this example, let's say there were three distinct clusters evident in the dendrogram.

# cut dendrogram and plot points in each cluster (chunk 1)
nclust <- 3
clusternum <- cutree(X.clusters, k=nclust)

plot(lon, lat, xlim=c(-125, -116), ylim=c(41,47),
     xlab="Latitude", ylab="Longitude",
     main=paste(as.character(nclust), "Clusters"), type="n")
plot(orotl.shp, add=T)
text(lon, lat, labels=clusternum)
print(cbind(station, clusternum, as.character(Name)))

The cutree() function determines the cluster membership of each observation, and a map of the cluster membership of each station is generated.  A list of the station id's, the cluster number and name of the stations is provided by the print() function

Just looking at the map suggests that maybe three isn't the right number of clusters--Burns and Eugene are in the same cluster.  The clusters of stations (and how their climates differ) can be examined a number of different ways:

# examine clusters -- simple plots (chunk 2)
tapply(tjan, clusternum, mean)
histogram(~ tjan | clusternum, nint=20, aspect=1/3)

Note how many of the histograms seem multimodal--this is a clear sign of inhomogeneity of the clusters.

# discriminant analysis (chunk 3)
library(MASS)
cluster.lda1 <- lda(clusternum ~ tjan + tjul + tann + pjan + pjul + pann)
cluster.lda1
plot(cluster.lda1)

cluster.dscore <- predict(cluster.lda1, dimen=2)$x
boxplot(cluster.dscore[,1] ~ clusternum, xlab="LD1")
boxplot(cluster.dscore[,2] ~ clusternum, xlab="LD2")
cor(orstationc[5:10],cluster.dscore)

The discriminant analysis here is used to answer the question "how do the clusters of stations differ climatically?"  In this case, it looks like pann and tann are the variables most closely correlated with each discriminant function, and because each of these variables are more-or-less averages of the seasonal extreme variables, that might explain why the clusters seem inhomogeneous.

Q6:  In this first iteration, 3 clusters were identified.  Try varying this number between, say, 2 and 8 clusters.  (The easiest way to do this is to copy the three chunks of code above into a text editor, changing the assignment to nclust before executing each chunk, one at a time.)  Describe the trade-off between the number of clusters and the homogeneity of the clusters.