Geography 414/514: Advanced Geographic Data Analysis

Exercise 2: Univariate Descriptive Plots

The objective of this exercise is to demonstrate some of the basic univariate plots for displaying data. Read through the exercise before attempting to complete it.

1. Univariate scatter plot (scatter diagrams) -- plot() version

The univariate "scatter diagram" is a very simple plot of the values of a variable, plotted vs the observation number, or row number in a rectangular data set (labeled "Index" on the plot). (Ordinary scatter diagrams or scatterplots will be described later.) As it happens, in this data set the observations are arranged in downstream order, so there actually is some meaning to the observation number. That won't always be the case, however.

Check to see if the "sumcr" data set ("data frame") is still in your workspace using the ls() or "list" function:
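The check is presumably just a bare call to ls(). A minimal sketch follows; the small data frame is a made-up stand-in for sumcr so the lines run on their own, and its values are placeholders, not the Summit Cr. data.

```r
# Stand-in for the real sumcr data frame (values are made up for illustration)
sumcr <- data.frame(Length = c(12.1, 8.4, 15.0), WidthWS = c(2.3, 1.9, 3.1))

ls()               # lists the names of the objects in the workspace
exists("sumcr")    # TRUE if an object called sumcr can be found
```

In the exercise itself only ls() is needed; sumcr should appear in the list it prints.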
If not, read it in again as in Exercise 1 (here's a link: [sumcr.csv]). Next, to make it easy to refer to individual variables by their simple names (e.g. WidthWS) rather than their compound names (e.g. sumcr$WidthWS), use the attach() function:
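A sketch of what attaching does, again using a made-up stand-in for sumcr (only the column names Length and WidthWS come from the exercise; the values are placeholders):

```r
# Stand-in for sumcr (made-up values, not the Summit Cr. data)
sumcr <- data.frame(Length = c(12.1, 8.4, 15.0), WidthWS = c(2.3, 1.9, 3.1))

attach(sumcr)        # put the columns of sumcr on the search path
m <- mean(WidthWS)   # WidthWS is now found without the sumcr$ prefix
m                    # same value as mean(sumcr$WidthWS)
```

With the real data frame already in the workspace, the single line attach(sumcr) is all that is required.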
To create a univariate scatter diagram for the variable "Length", type
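The command is most likely a one-argument call to plot(). The sketch below defines a made-up Length vector so that it runs without the data file; in the exercise, Length comes from the attached sumcr data frame.

```r
# Stand-in values for Length (made up for illustration)
Length <- c(12.1, 8.4, 15.0, 9.7, 11.2, 8.4)

# Given a single numeric vector, plot() puts the observation number
# ("Index") on the x-axis and the values on the y-axis
plot(Length)
```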
Note that there is no information lost in this display. The original values of the variable, and even the order of the observations, could be reconstructed using a ruler.

It's sometimes helpful to be able to return to previously generated plots. The obvious way to do that is to just reissue the commands, but the individual plots can also be saved. After creating a plot, make sure the RGraphics window is active (by clicking on it), and then use the RGui History > Recording menu to turn recording on. Subsequent plots will then be added, and can be viewed again by using the PgUp and PgDn keys while the RGraphics window is active.

2. Univariate scatter diagram -- stripchart() version

(This plot is also known as a "strip plot" or "dot plot".) Type the following:

    stripchart(Length) # stripchart of Length

Here's an alternative version in which points with identical values are "stacked":
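One way to get the stacked version is the method argument of stripchart(); this is a sketch under that assumption, with a made-up Length vector that deliberately contains a repeated value.

```r
# Stand-in values for Length (made up); 8.4 occurs twice
Length <- c(12.1, 8.4, 15.0, 9.7, 11.2, 8.4)

# method = "stack" piles identical values on top of one another
# instead of overplotting them
stripchart(Length, method = "stack")
```

The other documented choices for method are "overplot" (the default) and "jitter".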
The two plots each have one point for each observation, and each point is plotted using the particular value of Length, but each gives a slightly different view of the data. That's actually a desirable situation because one view may allow you to see a pattern that is not evident in the other view.

3. Dotplots (dotcharts)

Another way to examine a single variable and gain some insight into its variations across the observations in a data set is through Cleveland's dotplots (called "dotchart" in R). In the simple version here, the plot again shows individual observations. By default, the observations are arranged in the row-order of the data frame (i.e. first observation on the bottom of the plot, last on the top). Try

    dotchart(WidthWS)

Here's an alternative, this time with each line labeled by the Location variable:
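A sketch of the labeled version, using the labels and cex arguments of dotchart(). The WidthWS values and the site names are invented placeholders; only the variable names come from the exercise.

```r
# Stand-ins for the attached sumcr columns (made-up values and labels)
WidthWS  <- c(2.3, 1.9, 3.1, 2.7)
Location <- factor(c("Site A", "Site B", "Site C", "Site D"))

# Label each line with the Location text; cex = 0.5 halves the text size
dotchart(WidthWS, labels = as.character(Location), cex = 0.5)
```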
The as.character() function converts the Location variable back to a character string (it was read in as a factor by the read.csv() function; from R 4.0.0 on, read.csv() leaves text columns as character by default, in which case as.character() simply has no effect). The cex=0.5 parameter makes the characters smaller for legibility.

Here's a version in which the observations are sorted by the values of WidthWS. An index created with the order() function is used to rearrange the values of WidthWS and the corresponding values of the Location character-string label:
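The sorting step can be sketched as follows. The name index and the data values are assumptions made for illustration; the same index is applied to both the values and the labels so that they stay paired.

```r
# Stand-ins for the attached sumcr columns (made-up values and labels)
WidthWS  <- c(2.3, 1.9, 3.1, 2.7)
Location <- factor(c("Site A", "Site B", "Site C", "Site D"))

# order() returns the row positions that would sort WidthWS ascending
index <- order(WidthWS)

# Apply the same index to the values and to their labels
dotchart(WidthWS[index], labels = as.character(Location[index]), cex = 0.5)
```

Because dotchart() draws the first element at the bottom, the smallest width ends up on the bottom line and the largest on the top.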
4. Boxplots

A boxplot contains an object (a box) and some decorations (lines, etc.) that are drawn to illustrate certain aspects of a variable. The box is drawn in such a way that the box itself encloses half of the data points. The top edge of the box is drawn so that 1/4 of the observations have values greater than that value, the bottom edge is drawn so that 1/4 of the observations have values that are less than that value, and the line in the middle of the box is drawn so that half of the observations have values greater than its value and half have values less than its value (i.e., at the median). The other parts of the box plot will be discussed in class. Try
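The basic call is presumably boxplot(Length). The sketch below uses a made-up Length vector and also keeps the value that boxplot() returns, which holds the numbers behind the box and whiskers.

```r
# Stand-in values for Length (made up for illustration)
Length <- c(12.1, 8.4, 15.0, 9.7, 11.2, 8.4)

# boxplot() draws the plot and invisibly returns the statistics it used
bp <- boxplot(Length)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```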
(By default, the "whiskers" extend to the most extreme data point lying no more than 1.5 times the interquartile range beyond the box.) Here's an alternative version where the "whiskers" of the plot extend to the extremes of the data:

    boxplot(Length, range=0)

At this point we're done with the Summit Cr. data set, so it's good practice to "detach" it using the detach() function. This removes the shorthand way of referring to the variables in that data frame, which avoids possible collisions with identically named variables in another data frame that might be read in later (a data frame from another study site, for example). detach() doesn't remove the data frame from the workspace, which you can verify using the ls() function. To detach the sumcr data frame, type

    detach(sumcr)

5. Histograms

A histogram is essentially a bar chart of a frequency table, where the heights of the bars reflect the relative proportion (or absolute number) of observations that fall within particular class intervals of the variable of interest. The shape of the histogram reveals the distribution of the individual observations.

Open Alec Murphy's Scandinavian EU voter-preference data. The data include the name of the commune (or county) (Commune), the percentage of Yes votes (Yes), the population of each commune (Pop), and a country code (Country). Here's a link to the .csv file: [scanvote.csv]. Download it and save it in your working directory. The command to read the .csv file is a little different from the one used for reading the Summit Cr. data:
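A sketch of the read step with the as.is argument. So that it runs without the download, it first writes a tiny temporary file whose rows (commune names, numbers, and country codes) are invented; only the four column names are taken from the exercise.

```r
# Write a tiny stand-in file (made-up rows; the real file is scanvote.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Commune,Yes,Pop,Country",
             "Alpha,45.2,12000,N",
             "Beta,61.8,54000,S"), tmp)

# as.is = 1 keeps column 1 (Commune) as plain text rather than a factor
scanvote <- read.csv(tmp, as.is = 1)
str(scanvote)

# With the real file saved in the working directory, the call would be:
# scanvote <- read.csv("scanvote.csv", as.is = 1)
```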
The as.is=1 parameter prevents R from turning the commune name (Commune) in column 1 into a "factor" (like Country), leaving it as a text label. Here's the alternative approach using file.choose():
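file.choose() opens a file-selection dialog and returns the chosen path, so it can stand in for the file name in read.csv(). The sketch wraps this in a small function; read_chosen_csv is a hypothetical helper name introduced here, not part of the exercise.

```r
# Hypothetical helper: the dialog only opens if no path is supplied,
# because the default argument is evaluated lazily
read_chosen_csv <- function(path = file.choose()) {
  read.csv(path, as.is = 1)
}

# At an interactive prompt, pick scanvote.csv in the dialog:
# scanvote <- read_chosen_csv()
```

Typed directly at the prompt, the equivalent one-liner is scanvote <- read.csv(file.choose(), as.is=1).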
Attach the data frame by typing attach(scanvote), and take a look at it using the data editor (see Exercise 1, e.g. fix(scanvote)). Now, get histograms for the variable "Yes" (the percentage of voters in each commune (county) expressing a positive preference for joining the EU). To get a basic histogram, type
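The basic call is presumably hist(Yes). In this sketch Yes is a made-up vector of percentages standing in for the attached scanvote column.

```r
# Stand-in for the Yes column (made-up percentages)
Yes <- c(32.1, 41.5, 44.8, 47.0, 49.3, 51.6, 53.2, 55.9, 58.4, 63.7, 66.0, 71.2)

# hist() draws the histogram and invisibly returns the bin information
h <- hist(Yes)
h$breaks   # class-interval boundaries chosen by R
h$counts   # number of observations in each interval
```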
Experiment with the number of bars in the histogram. Type the following:
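The number of bars is controlled by the breaks argument of hist(). The particular values below are arbitrary choices for illustration, and Yes is again a made-up stand-in vector.

```r
# Stand-in for the Yes column (made-up percentages)
Yes <- c(32.1, 41.5, 44.8, 47.0, 49.3, 51.6, 53.2, 55.9, 58.4, 63.7, 66.0, 71.2)

# A single number for breaks is treated as a suggestion: R rounds the
# class boundaries to "pretty" values, so the bar count may differ
h5  <- hist(Yes, breaks = 5)
h10 <- hist(Yes, breaks = 10)
h20 <- hist(Yes, breaks = 20)
```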
breaks=40?

6. Density Plots

Evidently, the shape of the histogram (and consequently what it may imply about the distribution of the data) can vary considerably depending on the bin widths or the number of bars (bins) used to summarize the data. An alternative plot type is the "kernel density smoother plot". This plot is produced by first using the density() function to estimate the density of data points in the vicinity of different values of the Yes percentages, and then plotting these estimates. To produce the plot, type the following two lines at the command prompt:
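A sketch of the two steps, estimate then plot. The object name Yes.density is an assumption, and Yes is a made-up stand-in vector.

```r
# Stand-in for the Yes column (made-up percentages)
Yes <- c(32.1, 41.5, 44.8, 47.0, 49.3, 51.6, 53.2, 55.9, 58.4, 63.7, 66.0, 71.2)

Yes.density <- density(Yes)   # kernel density estimate (Gaussian kernel by default)
plot(Yes.density)             # draw the smoothed curve
```

The smoothness of the curve depends on the bandwidth, which density() chooses automatically unless the bw argument is supplied.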
7. A composite plot

The views of the data provided by the different plotting methods vary quite a bit. Some retain a lot of information, but may be hard to interpret (particularly if there are a lot of data), while others are very simple in appearance, but lose information. One strategy for dealing with this is to produce a plot that superimposes several different plots. Type the following, one line at a time:
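One plausible combination, not necessarily the one intended in the exercise, is a histogram on the density scale with the kernel density curve and a "rug" of the individual values added on top. Yes is a made-up stand-in vector.

```r
# Stand-in for the Yes column (made-up percentages)
Yes <- c(32.1, 41.5, 44.8, 47.0, 49.3, 51.6, 53.2, 55.9, 58.4, 63.7, 66.0, 71.2)

h <- hist(Yes, freq = FALSE)   # bar areas sum to 1, matching the density scale
lines(density(Yes))            # add the kernel density curve to the same plot
rug(Yes)                       # tick marks along the axis, one per observation
```

Setting freq = FALSE matters here: with counts on the y-axis the density curve would be drawn on a different scale and lie almost flat along the bottom.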
8. What to hand in

Answers to the seven questions. Do not go overboard; all of them should fit on a single typed page.