Geography 414/514: Advanced Geographic Data Analysis
Exercise 8 – Multivariate Analysis

1. Introduction

The data set used for most of this exercise consists of a subset of the variables and observations in a data set from Drew Lamb’s thesis, observed for 120 streams in northeastern Oregon. The variables are defined as follows:
There are two versions of the data set: 1) the original values [streams4.csv] and 2) transformed values [tstreams4.csv]. The exercise will focus on the latter. NOTE: The variable names are the same in the two data sets, so the two cannot both be attached at the same time.

2. Multivariate plots

To start, download the .csv files to your working directory, and read them in as "streams" and "tstreams":
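A minimal sketch of the read-in step, assuming both files have been saved under the names given above in the current working directory. The existence check is an addition for illustration, so that the snippet reports a missing file instead of stopping with an error.

```r
# File names come from the exercise; both are assumed to be in the working directory.
files <- c(streams = "streams4.csv", tstreams = "tstreams4.csv")
found <- file.exists(files)

if (all(found)) {
  streams  <- read.csv(files[["streams"]])
  tstreams <- read.csv(files[["tstreams"]])
  str(tstreams)                                       # variable types and first few values
  print(identical(names(streams), names(tstreams)))   # the two files share variable names
} else {
  message("Not found in ", getwd(), ": ", paste(files[!found], collapse = ", "))
}
```

Because the variable names are shared, referring to variables as tstreams$PoolDepth (or passing data = tstreams to a modeling function) avoids any ambiguity about which data set a name comes from.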
Get matrix scatter diagrams of the two data sets:
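The comparison can be sketched with simulated data. Everything in the block is a stand-in: the column names A, B and C are hypothetical, and the log transformation is assumed purely for illustration (the exercise does not say how tstreams4.csv was derived from streams4.csv). With the real data, the two pairs() calls would be applied to the data frames read in above.

```r
set.seed(1)
n <- 120

# Hypothetical stand-in: three positively skewed variables driven by one gradient.
g <- rnorm(n)
streams <- data.frame(A = exp( 1.0 * g + rnorm(n, sd = 0.5)),
                      B = exp( 0.8 * g + rnorm(n, sd = 0.5)),
                      C = exp(-0.6 * g + rnorm(n, sd = 0.5)))
tstreams <- log(streams)   # assumed transformation, for illustration only

# Matrix scatter diagrams of the two versions
pairs(streams,  main = "Original values")
pairs(tstreams, main = "Transformed values")

# Correlation matrices give a numerical summary of the same comparison
round(cor(streams), 2)
round(cor(tstreams), 2)
```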
A few features should emerge, including the fact that for some variables (i.e. the count or frequency variables) many zeros are observed, and that some values of BFWDratio have evidently been estimated using drainage area. (How can you tell?) Overall, the transformed data set exhibits stronger linear relationships among variables, so it will be used below.

3. Principal Components Analysis (PCA)

Begin by doing a simple principal components analysis on variables 3 through 13 (i.e. just the continuous variables, and not the factor Health).
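One way this step might look, sketched on simulated data. The stand-in data frame only mimics the layout described above (the factor Health plus 11 continuous variables in columns 3 to 13); the column names ID and V1 to V11, and the choice of princomp() with cor = TRUE, are assumptions.

```r
set.seed(42)
n <- 120

# Simulated stand-in for tstreams; names ID and V1..V11 are hypothetical.
latent <- matrix(rnorm(n * 2), n, 2)                     # two underlying gradients
vars <- latent %*% matrix(rnorm(2 * 11), 2, 11) + matrix(rnorm(n * 11), n, 11)
colnames(vars) <- paste0("V", 1:11)
tstreams <- data.frame(ID = 1:n,
                       Health = factor(sample(c("H", "U"), n, replace = TRUE)),
                       vars)

# Columns 3 to 13 only: the continuous variables, not the factor Health
tstreams.matrix <- data.matrix(tstreams[, 3:13])
tstreams.pca <- princomp(tstreams.matrix, cor = TRUE)    # PCA of the correlation matrix

tstreams.pca                     # standard deviations of the components
summary(tstreams.pca)            # proportion of variance accounted for
print(loadings(tstreams.pca))    # coefficients defining each component
screeplot(tstreams.pca)
biplot(tstreams.pca)

# Component variances (eigenvalues) and the number that exceed 1.0
eigenvalues <- tstreams.pca$sdev^2
sum(eigenvalues > 1)
```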
Typing the name of the resulting PCA object and applying the summary() function to it produces information on the relative importance of the principal components. When the analysis is based on the correlation matrix, important components are generally regarded as those with variances (eigenvalues) greater than 1.0. Printing the results of the loadings() function produces a table of the coefficients that define each component; in a PCA of the correlation matrix, multiplying these by the standard deviation of the corresponding component gives the correlations between the original variables and the new components. The screeplot() function gives a graphic picture of the importance of the first few components. The biplot() function shows both the observations (labeled by row number here) and the variables (represented by vectors) on the same display (a biplot). See Jacoby (1998, Ch. 7) for a discussion of this plot.
4. Simple Factor Analysis

Next, do a simple (unrotated) factor analysis of the same variables:
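A hedged sketch of an unrotated factor analysis with factanal(), run on simulated data with a known two-factor structure. The variable names V1 to V11 and the choice of two factors are assumptions made so that the example is self-contained; the number of factors appropriate for the stream data may differ.

```r
set.seed(42)
n <- 120

# Simulated stand-in for the 11 continuous variables (names are hypothetical):
# V1-V6 load on the first latent factor, V7-V11 on the second.
latent <- matrix(rnorm(n * 2), n, 2)
true.loadings <- cbind(c(rep(0.8, 6), rep(0, 5)),
                       c(rep(0, 6), rep(0.8, 5)))
tstreams.matrix <- latent %*% t(true.loadings) + matrix(rnorm(n * 11, sd = 0.6), n, 11)
colnames(tstreams.matrix) <- paste0("V", 1:11)

# Maximum-likelihood factor analysis with no rotation
tstreams.fa1 <- factanal(tstreams.matrix, factors = 2, rotation = "none")
print(tstreams.fa1)              # uniquenesses, loadings, and a test of fit

# Communalities: the share of each variable's variance explained by the factors
round(1 - tstreams.fa1$uniquenesses, 2)
```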
5. Rotated Factors

Next, we’ll rotate the factors to try to make them more interpretable.

tstreams.fa2 <- factanal(tstreams.matrix, factors=4, rotation="varimax",

6. Multivariate Analysis of Variance (MANOVA)

The status of the streams was described by the land managers in terms of their perceived “ecological health”; this is coded here by the variable Health. A multivariate analysis of variance is aimed at answering the question "are the healthy (Health=H) and less-healthy (Health=U) groups of reaches similar in their fluvial geomorphic characteristics?" Take a look at some plots first, to get a subjective idea of how the groups of observations differ (histogram() is a lattice function, so load the lattice library first). Also attach tstreams (up to this point, variables have been implicitly referred to as columns in data matrices).
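The exploratory plot can be sketched as follows, using simulated values in place of the stream data. The group difference built into the simulation is hypothetical. The sketch passes data = tstreams to histogram() rather than attaching the data frame, which is a substitution made to keep the example self-contained.

```r
library(lattice)

set.seed(7)
n <- 120

# Simulated stand-in: 60 "H" and 60 "U" reaches, with a hypothetical
# difference in mean PoolDepth between the groups.
Health <- factor(rep(c("H", "U"), each = n / 2))
PoolDepth <- rnorm(n, mean = ifelse(Health == "H", 0.5, 0))
tstreams <- data.frame(Health, PoolDepth)

# One histogram panel per group, stacked vertically for easy comparison
p <- histogram(~ PoolDepth | Health, data = tstreams, layout = c(1, 2))
print(p)
```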
Take a look at a univariate analysis of variance (ANOVA) first, to assess the significance of the differences in means between groups for one of the variables (e.g. PoolDepth):
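A minimal sketch of the two-step univariate analysis, again on simulated stand-in data (the size of the group difference is hypothetical). It runs Bartlett's test of homogeneity of variances first, then the one-way analysis of variance.

```r
set.seed(7)
n <- 120

# Simulated stand-in for tstreams (hypothetical group difference in PoolDepth)
Health <- factor(rep(c("H", "U"), each = n / 2))
PoolDepth <- rnorm(n, mean = ifelse(Health == "H", 0.5, 0))
tstreams <- data.frame(Health, PoolDepth)

# Step 1: test for homogeneity of group variances
tstreams.bart <- bartlett.test(PoolDepth ~ Health, data = tstreams)
tstreams.bart

# Step 2: one-way analysis of variance on the group means
tstreams.aov1 <- aov(PoolDepth ~ Health, data = tstreams)
summary(tstreams.aov1)

# Group means, for reference when reading the ANOVA table
tapply(tstreams$PoolDepth, tstreams$Health, mean)
```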
The test for homogeneity of group variances is done first, because if the variances are significantly different (i.e. the p-value for the Bartlett test is close to zero), then it doesn't make sense to ask whether the means are different. In the analysis of variance, large F statistics (and consequently small p-values) signal significant differences between (or among) groups. You could examine each variable in the data set this way, and it's likely that some variables will appear to have means that are significantly different between groups, others that aren't, and others that may be borderline, so it may be difficult to get an overall sense of whether the two groups of streams are different. Also, it's likely that if enough pair-wise comparisons are done, some "significant" results (about 1 in 20 at the 0.05 significance level) will turn up even if the groups of observations have identical means. Multivariate analysis of variance (MANOVA) provides a single-number test statistic that can be used to answer the question "overall, are the group means significantly different?"
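The MANOVA step might be sketched like this. PoolDepth and BFWDratio are variable names from the exercise, while Var3 and all of the simulated values are hypothetical stand-ins. Note that summary() on a manova object reports Pillai's trace unless test = "Wilks" is requested.

```r
set.seed(7)
n <- 120

# Simulated stand-in for tstreams; Var3 and the group shifts are hypothetical.
Health <- factor(rep(c("H", "U"), each = n / 2))
shift <- ifelse(Health == "H", 0.5, 0)
tstreams <- data.frame(Health,
                       PoolDepth = rnorm(n) + shift,
                       BFWDratio = rnorm(n) - shift,
                       Var3      = rnorm(n))

# Temporary response matrix: one column per dependent variable
Y <- as.matrix(tstreams[, c("PoolDepth", "BFWDratio", "Var3")])

tstreams.mva1 <- manova(Y ~ Health, data = tstreams)
summary.aov(tstreams.mva1)                  # individual ANOVAs, one per variable
wilks <- summary(tstreams.mva1, test = "Wilks")
wilks                                       # overall test using Wilks' lambda
```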
Note the creation of a new temporary variable (Y) that makes it efficient to apply the manova() function. The Wilks' lambda statistic can be interpreted as a multivariate generalization of the univariate F statistic. The summary.aov() function creates most of the output: individual ANOVAs for each of the variables in the data set. The summary() function applied to a manova() object produces the Wilks' lambda statistic when called with test="Wilks" (the default test is Pillai's trace).
7. Discriminant Analysis

A discriminant analysis focuses on the issue of how two (or more) groups of observations differ (and also includes the sort of information provided by an analysis of variance on whether the groups are different). Whereas MANOVA answers the question "are the groups different?", discriminant analysis answers the question "how are they different?" Do a simple discriminant analysis on these data, using the lda() function from the Venables and Ripley MASS library:
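A hedged sketch of the discriminant analysis on simulated stand-in data (Var3 and all values are hypothetical; PoolDepth, BFWDratio, Health and health.dscore are names used in the exercise). With two groups there is a single discriminant function, so predict(...)$x has one column.

```r
library(MASS)

set.seed(7)
n <- 120

# Simulated stand-in for tstreams; Var3 and the group shifts are hypothetical.
Health <- factor(rep(c("H", "U"), each = n / 2))
shift <- ifelse(Health == "H", 0.5, 0)
tstreams <- data.frame(Health,
                       PoolDepth = rnorm(n) + shift,
                       BFWDratio = rnorm(n) - shift,
                       Var3      = rnorm(n))

# Linear discriminant analysis of Health on the continuous variables
health.lda1 <- lda(Health ~ PoolDepth + BFWDratio + Var3, data = tstreams)
health.lda1

# Discriminant scores, plotted by group
health.dscore <- predict(health.lda1)$x
boxplot(health.dscore[, 1] ~ tstreams$Health,
        xlab = "Health", ylab = "Discriminant score")

# Correlations between the original variables and the discriminant scores
cor(tstreams[, c("PoolDepth", "BFWDratio", "Var3")], health.dscore)
```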
The predict(...)$x expression gets the "discriminant scores" (the x's), which are assigned to the variable health.dscore. The boxplot() function plots them by group (and should show clear differences in the discriminant scores among groups if the groups are distinctly different), while the cor() function gets the correlations between the new "discriminant function" and the original variables (called "canonical coefficients"), which may be interpreted like PCA loadings. These values show which variables are most closely correlated with the discriminant scores, and consequently which variables best illustrate the differences between or among the groups.
The predict(...)$class expression gets the group to which the new discriminant function would assign each original observation, based on the values of the individual variables. The table() function provides a cross-tabulation of the observed healthiness of the streams (Health) and that predicted by the discriminant function (health.class). The mosaic() function plots the table.
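The classification check can be sketched as below, on the same kind of simulated stand-in data (Var3 and all values are hypothetical). The base-graphics mosaicplot() function is used here in place of mosaic(), a substitution made so that only MASS needs to be loaded.

```r
library(MASS)

set.seed(7)
n <- 120

# Simulated stand-in for tstreams; Var3 and the group shifts are hypothetical.
Health <- factor(rep(c("H", "U"), each = n / 2))
shift <- ifelse(Health == "H", 0.5, 0)
tstreams <- data.frame(Health,
                       PoolDepth = rnorm(n) + shift,
                       BFWDratio = rnorm(n) - shift,
                       Var3      = rnorm(n))
health.lda1 <- lda(Health ~ PoolDepth + BFWDratio + Var3, data = tstreams)

# Group assigned to each observation by the discriminant function
health.class <- predict(health.lda1)$class

# Observed versus predicted group membership
class.table <- table(Observed = tstreams$Health, Predicted = health.class)
class.table
mosaicplot(class.table, color = TRUE, main = "Observed vs. predicted Health")

# Proportion of observations assigned to their observed group
sum(diag(class.table)) / sum(class.table)
```

Because the same observations are used to fit and to evaluate the discriminant function, the proportion correctly classified is an optimistic estimate.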
8. Cluster analysis

Attempt to define the climate regions of Oregon (as expressed in the climate-station data in [orstationc.csv]) via an "agglomerative-hierarchical" cluster analysis. In such an analysis, individual objects (observations) are combined according to their (dis)similarity, with the two most similar objects being combined first (to create a new composite object), then the next most similar objects, and so on until a single object is defined. The sequence of combinations is illustrated using a "dendrogram" (cluster tree), and this can be examined to determine the number of clusters. The function cutree() determines the cluster membership of each observation, which can be listed or plotted. Begin with a map, labeled by station name (see Exercise 7).
Calculate a dissimilarity (distance) matrix:
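A minimal sketch of the distance calculation on simulated stations. Only tann and pann are variable names taken from the exercise; the other column names, the number of stations, and all values are hypothetical. Standardizing the variables with scale() before calling dist() is an assumption made here so that variables in different units contribute comparably.

```r
set.seed(3)
n <- 30
region <- rep(1:3, each = n / 3)      # hidden groups, used only to simulate the data

# Simulated stand-in for orstationc (tjan, tjul, station, lon, lat are hypothetical names)
orstationc <- data.frame(
  station = paste0("S", seq_len(n)),
  lon  = runif(n, -124, -117),
  lat  = runif(n, 42, 46),
  tjan = rnorm(n, mean = c(5, 0, -3)[region]),
  tjul = rnorm(n, mean = c(17, 19, 21)[region]),
  tann = rnorm(n, mean = c(11, 9, 7)[region]),
  pann = rnorm(n, mean = c(1500, 900, 300)[region], sd = 100)
)

# Matrix of climate variables only (no station names or coordinates)
X <- as.matrix(orstationc[, c("tjan", "tjul", "tann", "pann")])

# Euclidean distances among stations, computed on standardized variables
X.dist <- dist(scale(X))

# Image plot of the dissimilarity matrix
image(seq_len(n), seq_len(n), as.matrix(X.dist),
      xlab = "Station", ylab = "Station")
```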
For convenience, a matrix, X, is created to omit the variables we don't want to include in the clustering (i.e. the non-climatic information). The image() function plots the dissimilarity matrix that the cluster analysis works with. Now do the cluster analysis:
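The clustering step can be sketched with hclust(). The simulated stations are the same hypothetical stand-in as before (only tann and pann are names from the exercise), and Ward's method ("ward.D2") is an assumed choice of linkage; other methods such as "average" or "complete" are called the same way.

```r
set.seed(3)
n <- 30
region <- rep(1:3, each = n / 3)      # hidden groups, used only to simulate the data

# Simulated stand-in for orstationc (tjan, tjul, station are hypothetical names)
orstationc <- data.frame(
  station = paste0("S", seq_len(n)),
  tjan = rnorm(n, mean = c(5, 0, -3)[region]),
  tjul = rnorm(n, mean = c(17, 19, 21)[region]),
  tann = rnorm(n, mean = c(11, 9, 7)[region]),
  pann = rnorm(n, mean = c(1500, 900, 300)[region], sd = 100)
)
X <- as.matrix(orstationc[, c("tjan", "tjul", "tann", "pann")])
X.dist <- dist(scale(X))

# Agglomerative hierarchical clustering; the linkage method is an assumption
X.clusters <- hclust(X.dist, method = "ward.D2")

# Dendrogram, with leaves labeled by station
plot(X.clusters, labels = orstationc$station, cex = 0.7)
```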
The dendrogram can be inspected to get an idea of how many distinct clusters there are. In this example, let's say there were three distinct clusters evident in the dendrogram.
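Cutting the tree into three clusters and mapping the result might look like this. The stations are simulated (only tann and pann are names from the exercise; station, lon, lat, tjan and tjul are hypothetical), and the map is a plain longitude-latitude scatter rather than the mapping approach of Exercise 7.

```r
set.seed(3)
n <- 30
region <- rep(1:3, each = n / 3)      # hidden groups, used only to simulate the data

# Simulated stand-in for orstationc
orstationc <- data.frame(
  station = paste0("S", seq_len(n)),
  lon  = runif(n, -124, -117),
  lat  = runif(n, 42, 46),
  tjan = rnorm(n, mean = c(5, 0, -3)[region]),
  tjul = rnorm(n, mean = c(17, 19, 21)[region]),
  tann = rnorm(n, mean = c(11, 9, 7)[region]),
  pann = rnorm(n, mean = c(1500, 900, 300)[region], sd = 100)
)
X <- as.matrix(orstationc[, c("tjan", "tjul", "tann", "pann")])
X.clusters <- hclust(dist(scale(X)), method = "ward.D2")   # linkage is an assumption

# Cluster membership for a three-cluster solution
nclust <- 3
clusternum <- cutree(X.clusters, k = nclust)

# Simple map: each station plotted as its cluster number
plot(orstationc$lon, orstationc$lat, type = "n",
     xlab = "Longitude", ylab = "Latitude")
text(orstationc$lon, orstationc$lat, labels = clusternum, col = clusternum)

# Listing of stations with their cluster numbers, and cluster sizes
print(data.frame(station = orstationc$station, cluster = clusternum))
table(clusternum)
```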
The cutree() function determines the cluster membership of each observation, and a map of the cluster membership of each station is generated. A list of the station IDs, the cluster numbers, and the names of the stations is provided by the print() function. Just looking at the map suggests that maybe three isn't the right number of clusters: Burns and Eugene are in the same cluster. The clusters of stations (and how their climates differ) can be examined in a number of different ways:
The discriminant analysis here is used to answer the question "how do the clusters of stations differ climatically?" In this case, it looks like pann and tann are the variables most closely correlated with each discriminant function, and because each of these variables is more or less an average of the seasonal extreme variables, that might explain why the clusters seem inhomogeneous.