Geography 414/514:  Advanced Geographic Data Analysis
Spring 2008

Exercise 3:  Bivariate Plots & Descriptive Statistics
Finish by Tuesday, April 22

1.  Introduction

The purpose of this exercise is to introduce some bivariate plots and basic descriptive statistics. 

Read through the exercise before attempting to complete it.

Begin by starting R.  Check to make sure that you're back in your working directory.  You can verify that R is looking at the correct folder by clicking on File > Change dir... on the RGui menu.  If you're in the folder you created in the first exercise, fine, otherwise you could browse to it here.

2. Scatter Diagrams

Simple scatter diagrams

Check to make sure the Summit Creek "data frame" (the data set after it has been read into R's workspace) is still in the "search path":. 

ls()

if not, re-read the sumcr.csv file as in Section 8 of Exercise 1.  Also, use the names() function to get a reminder of the names of the variables in the data frame, and attach the data frame so you can refer to individual variables by their actual names (e.g. WidthWS as opposed to sumcr$WidthWS):

names(sumcr)
attach(sumcr)

The scatter diagram  (sometimes also referred to as a "scatter plot", "scattergram" or "X-Y plot" is the most basic of multivariate (more that one variable) plots, and is the workhorse among bivariate plots.  Produce a scatter diagram (with water-surface depth (DepthWS) as the Y-axis variable, and cumulative length (CumLen) as the X-axis variable:

plot(CumLen, DepthWS)   # X-axis variable first, Y-axis variable second

Produce a second scatter diagram using the above steps for water surface width (WidthWS) versus bankfull width (WidthBF).  Before creating this second plot, click on the R graphics window to give it the focus, and then use the menu command History > Recording to start saving copies of the plots so that you can use PgUp and PgDn to review previous plots.

Q1:  What do the overall relationships between DepthWS and CumLen and between WidthWS and WidthBF look like?  (Contrast them:  Which is stronger, which is weaker?  What is the sense of the relationships (positive or negative)?)

Annotated scatter diagrams

Download the Oregon climate station data from this link:  [orstationc.csv] and save it in your working directory.  Read the data into R's workspace using the following command:

orstationc <- read.csv("orstationc.csv", as.is=T)

(The "as.is=T" argument in the read.csv() function tells R not to convert text information (station and Name) into group-membership factors, which is the default behavior.)

Attach the data frame, and create a crude map by plotting longitude on the X-axis and latitude on the Y-axis:

attach(orstationc)
plot(lon, lat)

Label the points with the station-name abbreviations using the text() function

text(lon, lat, labels=station)

(Note that text() won't work unless there is a plot with the same X-axis and Y-axis variables already open.)

Now plot annual precipitation (pann) as a function of elevation (elev) (i.e. elev on the X-axis, pann on the Y-axis):

plot(elev, pann)

Q2:  Describe the relationship.  Are the data "well behaved" in the sense that most points fall in a distinct cloud of points, or are there outliers or unusual points?

Label individual points using the identify() function:

identify(elev, pann, labels=station)

To stop identifying points, right-click in the R graphics window, and click on Stop.

Q3:  Use the crude map as well as simple inspection of the data to describe local variations in the simple relationship between elevation and annual precipitation.  It may be useful to create other scatter plots, say, of elevation and longitude, or latitude or longitude and annual precipitation to get further insight.  (A quick way of seeing lots of plots is to use the plot command with a list of numeric variables, e.g., plot(orstationc[,2:10])(the "construct" [,2:10] means all rows, and columns 2 through 10 of the dataframe).)  Unfortunately identify() only works on simple scatter plots.

2. Descriptive Statistics

Simple descriptive statistics

This part of the exercise involves producing some descriptive statistics for the Summit Cr. data.  The easiest way to get a sense of what information is provided by a particular statistic (say, the standard deviation of water-surface width within a particular reach) is to compare descriptive plots for the variable with the values of a descriptive statistic.

Produce some descriptive statistics for the variable WidthWS (This is almost too easy!) :

attach(sumcr)
summary(WidthWS)

By themselves, the statistics don't mean much.  However, you can contrast the amount of information provided by the numerical and graphical descriptions, and the efficiency with which it is presented.  Get the summary statistics for bankfull width (WidthBF) as well.

Q4:  Compare the two variables using the summary statistics and whatever descriptive plots that may seem useful.  Which variable has typically larger values, which is more variable?

Descriptive statistics by observation-group membership

The calculation of descriptive statistics for subsets of a data set provides one means for investigating the influence of another variable on the variations within a data set of a variable of interest.

Remember that the Summit Cr. study area is divided into three sections: an upstream reach (reach A, represented by an A in column 2) in which cattle are permitted to graze, a middle reach (reach B, represented by a B) from which cattle have been excluded, and a downstream reach (reach C, represented by a C), in which cattle were again permitted to graze.

Use the tapply() function to get descriptive statistics for WidthWS, stratified by reach.

tapply(WidthWS, Reach, summary)

The arguments of the function are 1) the name of the variable (WidthWS), the name of the grouping variable to stratify the data by (Reach), and the function to apply to the stratified data (in this case, summary()).

You should see three blocks of summary statistics, one for each reach.  Before attempting to answer the following, produce a univariate scatter diagram (see Section 1 of Exercise 2) that shows the values of WidthWS plotted versus observation number.

Q5: Describe the variations in the location (e.g. mean) and dispersion (i.e. variance or standard deviation) of water-surface width across the reaches (i.e. from the upstream grazed reach, the middle ungrazed reach, and the downstream grazed reach). Note: This is a purposefully vague question. Think about what kind of information you will need to present to adequately describe each reach, and to compare the values of WidthWS from one reach to another. You might think about a one-page display that would contain a couple of plots and an abbreviated table.

3.  Correlations

Inferring the strength of a relationship (and sometimes even the sign) from a scatter plot is somewhat of a judgement call.  (E.g., what kind of pattern on a scatter plot indicates a "strong" relationship?)  Consequently, it would be convenient to have some kind of more objective measure of the strength of a relationship than simple visual interpretation.  This is provided by the correlation coefficient--a single number description of the extent to which the relationship between two variables can be described by a straight line.

Get the correlation coefficients between annual precipitation and elevation :

attach(orstationc)
cor(elev, pann)

By itself, a single correlation coefficient is difficult to interpret, so you may wish to calculate correlations for other pairs of variables, and compare these to their scatter plots.  Any easy way to do this to create a "correlation matrix" equivalent to the scatter plot matrix you created earlier:

plot(orstationc[,2:10])
cor(orstationc[,2:10])

Q6:  Which pairs of variables are most highly correlated, and which less so?  Are the correlations consistent with your subjective interpretation of the scatter plots or not?

Next, produce a scatter plot for CumLen vs WidthWS, using the alternative "equation" form of specifying the variables to be plotted.

plot(WidthWS ~ CumLen)

Also get the correlation between these two variables. 

Q7:  Describe the relationship evident in the scatter plot.  Is it strong or weak?  Now look at the correlation coefficient.  Is its value consistent with your inference of the strength of the relationship?  If so, does that make sense, if not, why not?

Now construct a smooth curve to summarize the relationship apparent on the scatter diagram using the loess() function:  (note--if for some reason the plot of CumLen vs WidthWS disappeared, regenerate it using the above example.)

loess.model <- loess(WidthWS ~ CumLen)
loess.model
hat <- predict(loess.model)
lines(CumLen[order(CumLen)], hat[order(CumLen)], col="red")

The loess() function finds the smooth curve and saves the details (goodness-of-fit-statistics and fitted values) in the object loess.model, typing loess.model prints out the summary statistics, the predict() function puts the fitted values into the variable hat, and the lines() function draws the line on the current scatter plot.

Construct a scatter diagram of WidthWS plotted as a function of hat, and get the correlation between those variables.

Q8:  What does the relationship between WidthWS and the fitted values (hat) from the loess curve look like?  What does this scatter diagram (and companion correlation coefficient) suggest now about the strength of the relationship between WidthWS and CumLen.?

4.  What to hand in.

Answers to the questions, and a couple of the plots.