|
Geography 414/514: Advanced Geographic Data Analysis Spring 2008 Exercise 6: Reference Distributions and Significance Testing
1. Introduction The objective of this exercise is to use R to make some statistical inferences about the significance of particular data values, statistics derived from data, or of the observed differences between the means of groups of observations. 2. Confidence Interval for the Mean Begin by downloading the Specmap data set [specmap.csv] (and name it specmap in R). These data consist of three columns: 1) the age (in ka (kiloanno =1000 yrs)) of 783 observations of d18O values (oxygen-isotope values that indicate the size of the continental ice sheets, with positive values indicating large, and negative values small, ice sheets), and insolation (incoming solar radiation, expressed as differences from present values in July at 65oN in W/m2) for about the past 800,000 years. The first row in the data set is the present (0 ka) and row 783 represents the values for 782,000 years before present (782 ka). The last glacial maximum occurred around 20 ka.It’s always a good idea to take a look at data before starting any analyses. Because both the oxygen-isotope and insolation data are time series, a 2-D Line Plot (or line plot of X-Y pairs) of both insolation and d18O values will quickly display the data (Hint: Because the Y-axis variables have differing scales, putting both on the same plot would lead to a plot that is difficult to interpret, so it would be better to produce two separate plots.)
Next, take a look at the distribution of d18O, using a histogram or density plot, as in Exercise 2
How unusual is the present-day value of d18O? Are current conditions representative of those that prevailed, on average, throughout the part of the Quaternary represented by these data? To get an idea of how representative or unusual the present is, compare its value to a 90% confidence interval for the mean of the whole series. If the present value falls outside of that interval, then that would suggest that the present value would not be a good estimate of the long-term average d18O, or conversely, that the long-term average d18O is quite different from the present value.The following code will get a 0.95 (or 95%) confidence interval for the mean d18O,
Getting the standard error and confidence limits for a number of variables would quickly become tedious. R allows (and encourages) the creation of functions that can be "reused" for different variables. The following is such a function that describes (descr) a variable:
Copy and paste the function into R (making sure you include the last curly bracket). Nothing will happen, but the function can now be used as follows:
etc. 3. Probability Plots The question concerning how well an observed distribution matches the normal distribution (Q19) can be answered more directly using a graphical display called a "quantile plot" or "QQ" plot. The normal probability plot (what R calls a QQ Normal ( qqnorm()) plot) is a scatter diagram containing the values of a particular variable (plotted along the x-axis) and the values that would be returned by the cumulative density function (cdf) of the normal distribution with the same mean and standard deviation as the variable under consideration (plotted along the y-axis). The idea is that if the data really are normally distributed, then the normal probability plot should appear as a straight line, and as the distribution gets less-and-less normal, the normal probability plot will become more nonlinear.Get a normal probability plot or QQ Normal plot for the d18O data:
4. Two-Sample Difference-of-Means Tests (t-tests) The question of whether two groups of observations are similar often arises. Similarity between groups of observations could be indicated by differences in the descriptive statistics of each group, in group-wise descriptive plots or in the distributions of variables. The most common index of "differentness" of groups of observations is provided by a comparison of the means of particular variables among the groups of observations. The test statistic that is appropriate for comparing two means is the t-statistic, which can be thought of as a "scaled" difference between two means, where the observed difference between two means is scaled by an estimate of the uncertainty in or variability of the the means. Download and read the data set [orsim.csv] (and call it orsim in R). This data set includes climate values (January and July temperature and precipitation) simulated by a regional climate model for present, and for potential future conditions when the concentration of carbon dioxide in the atmosphere has doubled, for a set of model grid points that cover Oregon. The variables named TJan1x, TJul1x, PJan1x and PJul1x are the simulated present values of temperature (T) and precipitation (P), while the variables with "2x" in their names are those for the "doubled CO2 scenarios" that are intended to describe the climate of the next century. Use one of the mapping procedures described in previous exercises to get an impression of the basic pattern of climate over Oregon. Note that the grid pattern in this particular data set is not parallel to the graticule (lat/lon lines). Check to make shure the Oregon county outlines shape files (orotl.shp) are available; if not, see Part 2 of Exercise 4. Here's some code
Note the plotvar assignment computes the difference between the 1x and 2x CO2 simulation. Also note that for temperature, it makes sense to reorder the colors, so that blue is low and orange is high, while for precipitation, the unreordered sequence orange for low and blue for high works better. The line plotclr <- plotclr[nclr:1] does the reordering. Comment it out (with a #) if you don't want to reorder, remove the # if you do.
How would a doubling of carbon dioxide in the atmosphere affect temperature over Oregon? From what we already know about the greenhouse effect, atmospheric change, and basic climatology, we can suspect that the temperatures will increase with an increase in carbon dioxide. The question is, are the simulated temperatures for the "enhanced greenhouse"-situation significantly higher than those for present? A two-sample t-test can be performed as follows:
In performing a t-test, the "default" null hypothesis, Ho, is that the means of the two groups of observations are equal, and the alternative hypothesis (Ha) is that the are not. A more explicit alternative hypothesis can be stated when, as is the case for this climate data, we expect the mean of one group (the present in this case) to be smaller than the mean of the other group (the doubled carbon dioxide scenario). In such a situation, where the potential sign of the difference between group means is known, a more "decisive" test can be made, by adopting a different, "one-sided" alternative hypothesis: For January precipitation (PJan1x and PJan2x) test the hypotheses that PJan1x and PJan2x are equal and that PJan1x is less than PJan2x.
boxplot(PJan2x, PJan1x) mean(PJan2x)-mean(PJan1x) tt.PJan.1 <- t.test(PJan2x, PJan1x) tt.PJan.1 # one-tailed test Q8: Is future winter precipitation (PJan2x) the same as present (PJan1x), or is it significantly greater than present? 5. Analysis of Variance Many times, more than two groups of observations exist. Although it is possible to compare pairs of groups of observations, what often happens is that some comparisons appear significant, some don’t, and it may be difficult to reach an overall conclusion about whether the groups jointly are significantly different. The procedure known as "analysis of variance" (ANOVA) allows a single overall measure of the difference in the means among more than two groups to be computed. The statistic generated in an analysis of variance is the F-statistic, which is a measure of the variability among groups of data relative to the variability within groups—as the group means become more distinct, the F-statistic increases in size, while as the groups become more internally variable, the F-statistic decreases. Download and read in the Scandinavian EU Preference Vote data [scanvote.csv] (call it scanvote in R, and attach it). These data describe the percentages of votes in favor of membership in the European Union (Yes(%)), by administrative district in Finland, Norway, and Sweden, collected by Alec Murphy. As always, plot the data first. A boxplot or stripchart by country would be good. Use the descr() function and from section 1 above and tapply() to calculate standard errors of the mean for each country, e.g.:
Now do an analysis of variance.
The F-statistic has as its reference distribution, the F-distribution (no surprise there). The F-distribution varies considerably with the number of groups and the number of observations within groups, and is not as easily visualized as the normal or t-distributions. Q10: What is the value of the F-statistic, and how unusual is this value for the analysis of variance of these data? (Hint, look at the p-value.) Are the variances similar among groups? What does this suggest about the difference among countries in preference for joining the EU?
Finally, open the Summit Cr. geomorphic data project [sumcr.csv] and do an analysis of variance on water-surface width (WidthWS), with group membership determined by Reach. Compare the extent of overlap of the confidence intervals for the group means and the values of the F statistics for this analysis, and the previous one on preferences. You should see how the size of the F-statistic reflects the distinctiveness of the groups. 6. What to hand in Hand in the answers to the above questions, plus the plot you made of the Scandinavian vote data. |