Scatter diagrams (scatter plots, X-Y plots)
The scatter diagram or scatter plot is the workhorse bivariate plot, and is probably the most frequently generated plot type in practice (which is why it is what the plot() function in R produces by default for two numeric variables).
Scatter diagram
Displays the values of two variables at a time using symbols, where the value of one variable determines the relative position of the symbol along the X-axis and the value of a second variable determines the relative position of the symbol along the Y-axis.
Traditionally the dependent or response variable is plotted on the vertical or Y-axis, while the "independent" or predictor variable is plotted on the horizontal or X-axis. This convention can be imposed on plots in R by listing the horizontal or X-axis variable first and the Y-axis variable second (as in "X-Y plots").
# use Oregon climate-station data [orstationc.csv]
attach(orstationc)
plot(elev,tann)
detach(orstationc)
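The argument-order convention can also be expressed with R's formula interface, where the response is written first. The following is a minimal sketch rather than part of the original notes; the `stations` data frame and its values are made-up stand-ins for orstationc.csv.

```r
# Hypothetical stand-in data (assumption): the notes themselves use orstationc.csv
set.seed(1)
stations <- data.frame(elev = runif(50, 0, 2000))
stations$tann <- 12 - 0.005 * stations$elev + rnorm(50)

# Positional form: X-axis variable first, Y-axis variable second
plot(stations$elev, stations$tann, xlab = "elev", ylab = "tann")

# Formula form: response ~ predictor, so the Y-axis variable is written first
plot(tann ~ elev, data = stations)
```

Both calls should draw the same points; the formula form also avoids the need for attach() and detach().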
Variations on basic scatter plots
There are several variations on the basic scatter plot that can be made to enhance interpretability of the plots. These include:
Axis scaling – Linear, log (base 10), ln (base e) and probability scale axes may be used to linearize the relationship between variables
# use Scandinavian EU vote data [scanvote.csv]
attach(scanvote)
plot(Pop, Yes) # arithmetic axis
plot(log10(Pop), Yes) # logarithmic axis
detach(scanvote)
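As an alternative to transforming the variable, the `log` argument of plot() draws the original values on a logarithmic axis. This is a sketch under stated assumptions: the `votes` data frame is invented for illustration and only mimics the column names of scanvote.csv.

```r
# Hypothetical stand-in data (assumption): column names mimic scanvote.csv
set.seed(2)
votes <- data.frame(Pop = round(10^runif(40, 3, 6)))
votes$Yes <- 30 + 8 * log10(votes$Pop) + rnorm(40, sd = 3)

# Transform the variable: the axis is labeled in log10 units (3, 4, 5, ...)
plot(log10(votes$Pop), votes$Yes)

# Transform the axis: the axis is labeled in original units (1e3, 1e4, ...)
plot(votes$Pop, votes$Yes, log = "x")
```

The point pattern is the same in both plots; only the tick labels differ. Note that a logarithmic axis requires strictly positive values.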
Plotting symbols – differing symbol types may be used to distinguish multiple Y-variables plotted versus the same X-variable
# use Oregon climate-station data [orstationc.csv]
attach(orstationc)
opar <- par(mar=c(5,4,4,5)+0.1) # space for second axis
plot(elev, tann) # first plot
par(new=TRUE) # second plot will be drawn on top of the first
plot(elev, pann, pch=3, axes=FALSE, ylab="") # suppress axes and label so the first plot's are not overwritten
axis(side=4) # add axis
mtext(side=4,line=3.8,"pann") # add label
par(opar) # restore plot par
detach(orstationc)
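When the two Y-variables share the same units, a second axis is not needed, and the second series can simply be added with points(). The sketch below assumes two invented temperature variables, `tjan` and `tjul`, which are not taken from the data set used above.

```r
# Hypothetical data (assumption): two temperature series in the same units
set.seed(3)
elev <- sort(runif(30, 0, 2000))
tjan <- 5 - 0.004 * elev + rnorm(30)
tjul <- 20 - 0.005 * elev + rnorm(30)

# Set the y-limits to cover both series before drawing the first one
yrange <- range(tjan, tjul)
plot(elev, tjan, pch = 1, ylim = yrange, ylab = "temperature")
points(elev, tjul, pch = 3)   # add the second series with a different symbol
legend("topright", legend = c("tjan", "tjul"), pch = c(1, 3))
```

Computing `yrange` first matters because plot() sets the axis limits from the first series only, so points from the second series could otherwise fall outside the plot region.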
Line Plots
Line plots are bivariate plots in which the individual symbols are connected by line segments. This generally makes sense when the X-axis variable can be arranged in a natural sequence, such as by time or by distance. Sometimes the symbols are omitted, because they are usually redundant and may clutter the plot.
# use Specmap oxygen-isotope data [specmap.csv]
attach(specmap)
plot(Age, O18)
plot(Age, O18, type="l")
plot(O18, Insol, type="o")
detach(specmap)
However, connecting the symbols only makes sense when the X-axis variable defines a natural sequence:
# use Oregon climate-station data [orstationc.csv]
attach(orstationc)
plot(elev, tann, type="l") # does this make sense?
detach(orstationc)
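Part of the tangle in a plot like this one comes from the fact that type="l" connects points in the order they appear in the data, not in order along the X-axis. A minimal sketch, using invented `elev` and `tann` vectors rather than the station data:

```r
# Hypothetical stand-in data (assumption), stored in no particular order
set.seed(4)
elev <- runif(25, 0, 2000)
tann <- 12 - 0.005 * elev + rnorm(25)

plot(elev, tann, type = "l")            # segments follow storage order

idx <- order(elev)                      # indices that sort elev ascending
plot(elev[idx], tann[idx], type = "l")  # segments now run left to right
```

Sorting removes the back-and-forth segments, but it does not by itself make the line meaningful; that still depends on whether the X-axis variable represents a real sequence.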
Labeled plots -- enhancing information on bivariate plots
There are a number of ways of enhancing the information gained from bivariate displays, in addition to simple alterations of plot axes and symbols. These include:
Symbol variations
One of the most effective ways to add more information to a scatter diagram is to use different symbols (either size, shape or color) to represent the values of a third variable. This technique begins to touch on multivariate descriptive plots.
The symbol styles on scatter diagrams can be used to encode or represent the influence of an additional variable. The symbol properties can be changed on a scatter plot as follows:
attach(scanvote)
plot(log10(Pop),Yes, pch=as.integer(factor(Country))) # different symbol for each country (works whether Country was read as a factor or as text)
plot(log10(Pop),Yes)
text(log10(Pop),Yes, labels=as.character(Country)) # text
plot(log10(Pop),Yes, type="n")
text(log10(Pop),Yes, labels=as.character(District))
detach(scanvote)
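Symbol shape and color can be combined, and a legend makes the encoding readable. The following sketch uses an invented `votes` data frame with made-up country codes; it is an illustration of the idea, not a reproduction of the scanvote data.

```r
# Hypothetical stand-in data (assumption): names and values are invented
set.seed(5)
votes <- data.frame(
  Pop = round(10^runif(30, 3, 6)),
  Country = sample(c("Fin", "Nor", "Swe"), 30, replace = TRUE)
)
votes$Yes <- 30 + 8 * log10(votes$Pop) + rnorm(30, sd = 3)

grp <- factor(votes$Country)   # explicit factor, whatever the import produced
codes <- as.integer(grp)       # integer code 1, 2, ... for each level

plot(log10(votes$Pop), votes$Yes,
     pch = codes,              # symbol shape by country
     col = codes)              # symbol color by country
legend("topleft", legend = levels(grp),
       pch = seq_along(levels(grp)), col = seq_along(levels(grp)))
```

Building the legend from levels(grp) keeps the symbols in the key aligned with the integer codes used for plotting.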
Jittering points
Sometimes symbols are plotted on top of one another so much that the relationship is obscured. An extreme case of this occurs when two factor-type variables are plotted against each other:
# use Summit Cr. geomorph data [sumcr.csv]
attach(sumcr)
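plot(as.integer(factor(Reach)), as.integer(factor(HU))) # an extreme case: many points share each position
plot(jitter(as.integer(factor(Reach))), jitter(as.integer(factor(HU)))) # small random offsets separate the points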
detach(sumcr)
# use Florida 2000 presidential election data [florida.csv]
attach(florida)
plot(BUSH, GORE)
plot(BUSH, jitter(GORE, factor=500))
detach(florida)
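The amount of overplotting can be checked directly by counting distinct plotting positions. This sketch uses invented integer-coded variables `x` and `y` in place of the factor codes from the examples above.

```r
# Hypothetical data (assumption): two variables with only a few distinct values
set.seed(6)
x <- sample(1:3, 200, replace = TRUE)
y <- sample(1:4, 200, replace = TRUE)

# 200 observations collapse onto at most 3 x 4 = 12 positions
n_positions <- nrow(unique(cbind(x, y)))

xj <- jitter(x)   # add a small uniform random offset to each value
yj <- jitter(y)

plot(x, y)        # looks like a handful of points
plot(xj, yj)      # clusters reveal how many points sit at each position
```

Jittering changes only the plotted positions, so the jittered values should be used for display and not for any subsequent calculation.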
Although visual inspection of a scatter plot generally reveals the nature of the relationship between the pair of variables being plotted, sometimes this relationship may be obscured simply by the number of points on the plot. In such cases the relationship, if present, may be detected by a summarization method. Similarly, our tendency to seek order in a chaotic set of points may lead us to perceive a relationship where none is really there. Again, summarizing the scatter plot may be useful.
Slicing scatter diagrams
Slicing scatter diagrams involves first "binning" the data along the X-axis, and then plotting a boxplot for each bin. Variations in the shapes of the box plots may indicate relationships between the variables that are obscured in a basic scatter plot.
# use Sierra Nevada annual climate reconstructions [sierra.csv]
attach(sierra)
plot(PWin, TSum) # scatter diagram
# sliced scatter diagram
PWin.classes <- c(0.0, 25., 50., 75., 100., 125., 150., 175.)
PWin.group <- cut(PWin, PWin.classes)
plot(TSum ~ PWin.group) # note formula in plot function call
detach(sierra)
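The binning step can be inspected numerically before plotting. The sketch below uses invented `pwin` and `tsum` vectors as stand-ins for the Sierra Nevada variables, and adds include.lowest = TRUE, which is an assumption about the desired handling of the lowest break and is not in the code above.

```r
# Hypothetical stand-in data (assumption)
set.seed(7)
pwin <- runif(300, 0, 175)
tsum <- 15 - 0.01 * pwin + rnorm(300)

breaks <- seq(0, 175, by = 25)                    # same class limits as above
grp <- cut(pwin, breaks, include.lowest = TRUE)   # keep a value equal to 0 in the first bin

counts <- table(grp)                  # observations per slice
medians <- tapply(tsum, grp, median)  # one summary value per slice

boxplot(tsum ~ grp)                   # sliced scatter diagram
```

By default cut() builds intervals that are open on the left, so a value exactly equal to the lowest break would otherwise become NA and drop out of the box plots.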
Smoothing a scatter diagram
Scatter plot smoothing involves fitting a smoothed curve through the cloud of points to describe the general relationship between variables. This technique is very generalizable, and a variety of smoothers can be used. The most common is the "lowess" or "loess" smoother, which will be discussed in more detail later:
# use Oregon climate station annual temperature data [ortann]
attach(ortann)
plot(elevation,tann)
lines(lowess(elevation,tann))
detach(ortann)
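The smoothness of the lowess curve is controlled by the span argument `f`, the proportion of points that influence the fit at each location. A minimal sketch with invented `x` and `y` vectors; the value f = 0.2 is chosen only for contrast and is not a recommendation.

```r
# Hypothetical stand-in data (assumption)
set.seed(8)
x <- runif(200, 0, 2000)
y <- 12 - 0.005 * x + rnorm(200)

fit_default <- lowess(x, y)           # default span, f = 2/3
fit_narrow <- lowess(x, y, f = 0.2)   # smaller span gives a wigglier curve

plot(x, y)
lines(fit_default)                    # solid line
lines(fit_narrow, lty = 2)            # dashed line
```

lowess() returns a list with components x and y, sorted by x, which is why the result can be passed directly to lines().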
Annotation of points involves adding information to a plot for individual points, such as the observation number or other labels, either to "explain" individual points that may be unusual or simply to identify a key observation or set of observations.
Identify function
The identify() function allows one to click near points on a scatter plot and add text labels to the plot. Note that the x and y variables passed to identify() must be the same as those used in the most recently created plot.
attach(florida)
plot(GORE, BUCHANAN)
identify(GORE, BUCHANAN, labels=County)
detach(florida)
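Because identify() needs an interactive graphics device, a scripted alternative is to pick the point of interest by rule and label it with text(). This is a sketch with invented data: the `votes` data frame, its column names and the planted outlier are all assumptions for illustration.

```r
# Hypothetical data (assumption) with one planted outlier in row 7
set.seed(9)
votes <- data.frame(county = paste0("C", 1:20),
                    a = runif(20, 1000, 50000))
votes$b <- 0.005 * votes$a + rnorm(20, sd = 20)
votes$b[7] <- 900

plot(votes$a, votes$b)
i <- which.max(votes$b)                  # row index of the largest b value
text(votes$a[i], votes$b[i],
     labels = votes$county[i], pos = 4)  # pos = 4 puts the label to the right
```

The same pattern works for any selection rule, for example a logical condition on the residuals from a fitted line.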
Scatter plot matrices (sometimes called "sploms") are simply sets of scatter plots arranged in matrix form on the page. As is the case for using symbol properties to show the influence of a third variable, scatter plot matrices also touch on multivariate descriptive plots.
Scatter plot matrix
# use large-cities data [cities.csv]
attach(cities)
plot(cities[,2:12]) # scatter plot matrix, omit city name
detach(cities)
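Calling plot() on a data frame of numeric columns produces the same display as pairs(), which also accepts a panel function. The sketch below selects numeric columns by type rather than by position and adds a smooth curve to each panel; the data frame `d` is invented for illustration.

```r
# Hypothetical data (assumption): one label column plus three numeric columns
set.seed(10)
d <- data.frame(id = paste0("city", 1:40),
                v1 = rnorm(40), v2 = rnorm(40), v3 = rnorm(40))

num <- d[, sapply(d, is.numeric)]   # keep only the numeric columns
pairs(num, panel = panel.smooth)    # scatter plot matrix with a smooth in each panel
```

Selecting columns by type avoids hard-coding column positions, which can silently break if the file layout changes.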