|
Geography 414/514: Advanced Geographic Data Analysis Exercise 3: Bivariate Plots & Descriptive Statistics 1. Introduction The purpose of this exercise is to introduce some bivariate plots and basic descriptive statistics. Read through the exercise before attempting to complete it. Begin by starting R. Check to make sure that you're back in your working directory. You can verify that R is looking at the correct folder by clicking on File > Change dir... on the RGui menu. If you're in the folder you created in the first exercise, fine, otherwise you could browse to it here. 2. Scatter Diagrams Simple scatter diagrams Check to make sure the Summit Creek "data frame" (the data set after it has been read into R's workspace) is still in the "search path":.
if not, re-read the sumcr.csv file as in Section 8 of Exercise 1. Also, use the names() function to get a reminder of the names of the variables in the data frame, and attach the data frame so you can refer to individual variables by their actual names (e.g. WidthWS as opposed to sumcr$WidthWS):
The scatter diagram (sometimes also referred to as a "scatter plot", "scattergram" or "X-Y plot" is the most basic of multivariate (more that one variable) plots, and is the workhorse among bivariate plots. Produce a scatter diagram (with water-surface depth ( DepthWS) as the Y-axis variable, and cumulative length (CumLen) as the X-axis variable:
Produce a second scatter diagram using the above steps for water surface width ( WidthWS) versus bankfull width (WidthBF). Before creating this second plot, click on the R graphics window to give it the focus, and then use the menu command History > Recording to start saving copies of the plots so that you can use PgUp and PgDn to review previous plots.DepthWS and CumLen and between WidthWS and WidthBF look like? (Contrast them: Which is stronger, which is weaker? What is the sense of the relationships (positive or negative)?) Annotated scatter diagrams Download the Oregon climate station data from this link: [orstationc.csv] and save it in your working directory. Read the data into R's workspace using the following command:
(The " as.is=T" argument in the read.csv() function tells R not to convert text information (station and Name) into group-membership factors, which is the default behavior.)Attach the data frame, and create a crude map by plotting longitude on the X-axis and latitude on the Y-axis:
Label the points with the station-name abbreviations using the text() function
(Note that text() won't work unless there is a plot with the same X-axis and Y-axis variables already open.)Now plot annual precipitation (pann) as a function of elevation (elev) (i.e. elev on the X-axis, pann on the Y-axis):
Label individual points using the identify() function:
To stop identifying points, right-click in the R graphics window, and click on Stop.
2. Descriptive Statistics Simple descriptive statistics This part of the exercise involves producing some descriptive statistics for the Summit Cr. data. The easiest way to get a sense of what information is provided by a particular statistic (say, the standard deviation of water-surface width within a particular reach) is to compare descriptive plots for the variable with the values of a descriptive statistic. Produce some descriptive statistics for the variable WidthWS (This is almost too easy!) :
By themselves, the statistics don't mean much. However, you can contrast the amount of information provided by the numerical and graphical descriptions, and the efficiency with which it is presented. Get the summary statistics for bankfull width ( WidthBF) as well.
Descriptive statistics by observation-group membership The calculation of descriptive statistics for subsets of a data set provides one means for investigating the influence of another variable on the variations within a data set of a variable of interest. Remember that the Summit Cr. study area is divided into three sections: an upstream reach (reach A, represented by an A in column 2) in which cattle are permitted to graze, a middle reach (reach B, represented by a B) from which cattle have been excluded, and a downstream reach (reach C, represented by a C), in which cattle were again permitted to graze. Use the tapply() function to get descriptive statistics for WidthWS, stratified by reach.
The arguments of the function are 1) the name of the variable ( WidthWS), the name of the grouping variable to stratify the data by (Reach), and the function to apply to the stratified data (in this case, summary()).You should see three blocks of summary statistics, one for each reach. Before attempting to answer the following, produce a univariate scatter diagram (see Section 1 of Exercise 2) that shows the values of WidthWS plotted versus observation number. WidthWS from one reach to another. You might think about a one-page display that would contain a couple of plots and an abbreviated table. 3. Correlations Inferring the strength of a relationship (and sometimes even the sign) from a scatter plot is somewhat of a judgement call. (E.g., what kind of pattern on a scatter plot indicates a "strong" relationship?) Consequently, it would be convenient to have some kind of more objective measure of the strength of a relationship than simple visual interpretation. This is provided by the correlation coefficient--a single number description of the extent to which the relationship between two variables can be described by a straight line. Get the correlation coefficients between annual precipitation and elevation :
By itself, a single correlation coefficient is difficult to interpret, so you may wish to calculate correlations for other pairs of variables, and compare these to their scatter plots. Any easy way to do this to create a "correlation matrix" equivalent to the scatter plot matrix you created earlier:
Next, produce a scatter plot for CumLen vs WidthWS, using the alternative "equation" form of specifying the variables to be plotted.
Also get the correlation between these two variables.
Now construct a smooth curve to summarize the relationship apparent on the scatter diagram using the loess() function: (note--if for some reason the plot of CumLen vs WidthWS disappeared, regenerate it using the above example.)
The loess() function finds the smooth curve and saves the details (goodness-of-fit-statistics and fitted values) in the object loess.model, typing loess.model prints out the summary statistics, the predict() function puts the fitted values into the variable hat, and the lines() function draws the line on the current scatter plot.Construct a scatter diagram of WidthWS plotted as a function of hat, and get the correlation between those variables.WidthWS and the fitted values (hat) from the loess curve look like? What does this scatter diagram (and companion correlation coefficient) suggest now about the strength of the relationship between WidthWS and CumLen.? 4. What to hand in. Answers to the questions, and a couple of the plots. |