Univariate Summary Plots

Summary plots display an object or a graph that gives a more concise expression of the location, dispersion, and distribution of a variable than an enumerative plot does. This comes at the expense of some loss of information: in a summary plot it is no longer possible to retrieve the individual data values. That loss is usually outweighed by the gain in understanding that results from the efficient representation of the data, and summary plots are generally much better than enumerative plots at revealing the distribution of the data.

Histograms

Histograms are a type of bar chart that displays the counts or relative frequencies of values falling in different class intervals or ranges.

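# use Specmap O-18 data [specmap.csv]
# read the file if the data frame is not already loaded
# (assumes specmap.csv is in the working directory)
if (!exists("specmap")) specmap <- read.csv("specmap.csv")
attach(specmap)
hist(O18)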

The overall impression one gets about the distribution of a variable depends somewhat on the way the histogram is constructed: fewer bars give a more generalized view but may obscure details of the distribution (the existence of a bimodal distribution, for example), while more bars may not generalize enough.

hist(O18, breaks=20)
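
The effect of the number of bars can be checked numerically as well as visually. The sketch below uses simulated bimodal data (hypothetical values, not the Specmap record) and the plot=FALSE option of hist to get the class intervals and counts without drawing anything. Note that a single number given to breaks is treated by hist as a suggestion, so the number of bars actually drawn may differ from it.

```r
# Simulated bimodal sample (hypothetical data for illustration only)
set.seed(1)
x.sim <- c(rnorm(200, mean = -2, sd = 0.5), rnorm(200, mean = 2, sd = 0.5))

# plot = FALSE returns the histogram object without drawing it
h.coarse <- hist(x.sim, breaks = 3, plot = FALSE)
h.fine <- hist(x.sim, breaks = 30, plot = FALSE)

h.coarse$breaks            # class-interval boundaries actually used
length(h.coarse$counts)    # number of bars with few breaks
length(h.fine$counts)      # number of bars with many breaks

# Every observation falls in exactly one class interval
sum(h.coarse$counts) == length(x.sim)
sum(h.fine$counts) == length(x.sim)

# Draw both versions side by side for comparison
par(mfrow = c(1, 2))
hist(x.sim, breaks = 3, main = "breaks = 3")
hist(x.sim, breaks = 30, main = "breaks = 30")
par(mfrow = c(1, 1))
```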

Density Plots (or Kernel Plots/Smoothed Histograms)

A density plot is a plot of the local relative frequency or density of points along the number line or x-axis of a plot.  The local density is determined by summing the individual "kernel" densities for each point.  Where points occur more frequently, this sum, and consequently the local density, will be greater.  Density plots avoid some of the problems of histograms (such as the dependence on where the class-interval boundaries fall), but still require some choices to be made, notably the shape of the kernel and its width (the bandwidth).

[Figure: histogram smoothing illustration]

[Figure: different kernels]

O18.density <- density(O18)
plot(O18.density)
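
The idea of summing one kernel per point can be written out directly. The following is a minimal sketch on a simulated sample (hypothetical data), assuming a Gaussian kernel and the default bandwidth rule bw.nrd0. It averages the kernels by hand and compares the result with density(), which uses a fast binned approximation, so the two curves should agree closely rather than exactly.

```r
set.seed(42)
x.sim <- rnorm(50)                   # hypothetical sample
bw <- bw.nrd0(x.sim)                 # default bandwidth rule used by density()
grid <- seq(min(x.sim) - 3 * bw, max(x.sim) + 3 * bw, length.out = 200)

# One Gaussian kernel centered on each observation, averaged at each grid point
manual <- sapply(grid, function(g) mean(dnorm(g, mean = x.sim, sd = bw)))

# Same bandwidth and the same grid limits, computed by density()
d <- density(x.sim, bw = bw, from = min(grid), to = max(grid), n = 200)
max(abs(manual - d$y))               # largest difference between the two curves

plot(grid, manual, type = "l", xlab = "x", ylab = "density")
lines(d, lty = 2)                    # density() result, dashed
rug(x.sim)                           # individual observations
```

Dividing by the number of points (the mean rather than the sum) only rescales the curve so that the area under it is 1.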

Plots with both a histogram and density line can be created:

O18.density <- density(O18)
hist(O18, breaks=40, probability=TRUE)
lines(O18.density)
rug(O18)

detach(specmap)

Boxplot (or Box-and-Whisker Plot)

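A boxplot characterizes the location, dispersion, and distribution of a variable by constructing a box-like figure with a set of lines (whiskers) extending from the ends of the box.  The edges of the box are drawn at the 25th and 75th percentiles of the data, and a line in the middle of the box marks the 50th percentile (the median).  The whiskers and other aspects of the boxplot are drawn in various ways; by default, R's boxplot function extends each whisker to the most extreme data value that lies within 1.5 times the height of the box from the box edge, and plots more distant values as individual points.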

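# use Scandinavian EU-preference vote data [scanvote.csv]
# read the file if the data frame is not already loaded
# (assumes scanvote.csv is in the working directory)
if (!exists("scanvote")) scanvote <- read.csv("scanvote.csv")
attach(scanvote)
boxplot(Pop)           # raw values
boxplot(log10(Pop))    # log-transformed values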
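
The numbers behind a boxplot can be inspected with boxplot.stats. The sketch below uses a simulated right-skewed sample (hypothetical data, not the scanvote values) and assumes the default whisker coefficient of 1.5. R draws the box edges at Tukey's "hinges" (see fivenum), which are close to, but not always identical with, the 25th and 75th percentiles returned by quantile.

```r
set.seed(7)
x.sim <- c(rexp(99), 25)   # hypothetical skewed sample plus one extreme value

b <- boxplot.stats(x.sim)
b$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out      # values plotted individually beyond the whiskers

# The box edges and the middle line are Tukey's hinges and the median
fivenum(x.sim)[2:4]

# The upper whisker ends at the largest value inside the upper fence
hinge.spread <- b$stats[4] - b$stats[2]
upper.fence <- b$stats[4] + 1.5 * hinge.spread
max(x.sim[x.sim <= upper.fence]) == b$stats[5]

boxplot(x.sim)
```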

An Aside on Reference Distributions

A number of "theoretical" reference distributions arise in data analysis. They can be compared with observed or empirical distributions (i.e. the distribution of a set of observations of a particular variable) and used in other ways.  One of the more frequently used reference distributions is the normal distribution (which arises frequently in practice owing to the Central Limit Theorem).

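Theoretical distributions are represented by their probability density function (PDF), their cumulative distribution function (CDF), and their inverse cumulative distribution function.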

For the standard normal distribution (with mean of 0 and a standard deviation of 1), the PDF and CDF can be displayed as follows:

z <- seq(-3.0,3.0,.05)
pdf.z <- dnorm(z)   # get probability density function
plot(z, pdf.z)
cdf.z <- pnorm(z)   # get cumulative distribution function
plot(z, cdf.z)

and the inverse cumulative distribution function as follows:

p <- seq(0,1,.01)
invcdf.z <- qnorm(p)  # get inverse cumulative distribution function
plot(p,invcdf.z)
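
The three functions are tied together, and the relationships can be checked numerically. This is a small sketch using only base R; the names z.check and p.check are introduced here for illustration.

```r
z.check <- seq(-3, 3, 0.5)
p.check <- pnorm(z.check)            # cumulative probabilities for each z

# qnorm undoes pnorm: the inverse CDF recovers the original z values
all.equal(qnorm(p.check), z.check)

# The CDF is the area under the PDF to the left of a given z
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
all.equal(area, pnorm(1.96), tolerance = 1e-4)

# At probabilities of exactly 0 and 1 the inverse CDF is infinite,
# so those two points are left off the plot of p against qnorm(p)
qnorm(c(0, 1))
```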

QQ Plot (or QQ Normal Plot)

A QQ (quantile-quantile) plot is a two-dimensional graph where each observation is shown by a point, so strictly speaking, a QQ plot is an enumerative plot.  The data value for each point is plotted along the vertical or y-axis, while the equivalent quantile (e.g. a percentile) value is plotted along the horizontal or x-axis.  The quantiles plotted along the x-axis could be empirical ones, like the percentile equivalents or rank for each value, or they could be theoretical ones: the quantiles of a reference distribution (e.g. a normal distribution) that correspond to the cumulative probability of each observation.  In practice, it is the shape of the QQ plot that matters: points that fall close to a straight line indicate that the data follow the reference distribution, while systematic curvature indicates a departure from it.

[Figure: a variety of histograms and QQ plots]

The qqnorm function plots the data values along the y-axis and the corresponding theoretical quantiles of the standard normal distribution along the x-axis.  qqline adds a straight line that passes through the first and third quartiles (25th and 75th percentiles) and can be used to assess (a) the overall departure of the observed distribution from a normal distribution whose quartiles match those of the observations, and (b) outliers or unusual points.

qqnorm(Pop)
qqline(Pop)

qqnorm(log10(Pop))
qqline(log10(Pop))

detach(scanvote)
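
What qqnorm and qqline compute can be reproduced by hand. The sketch below uses a simulated right-skewed sample (hypothetical data, not the scanvote values). It assumes the default settings of both functions: plotting positions from ppoints, and a line through the 25th and 75th percentiles.

```r
set.seed(3)
x.sim <- rlnorm(100)                 # hypothetical right-skewed sample

# Coordinates used by qqnorm, without drawing the plot
qq <- qqnorm(x.sim, plot.it = FALSE)

# x-axis: standard normal quantiles at the plotting positions
theo <- qnorm(ppoints(length(x.sim)))
all.equal(sort(qq$x), theo)

# y-axis: the data values themselves
all.equal(sort(qq$y), sort(x.sim))

# Line through the first and third quartiles, as drawn by qqline
yq <- quantile(x.sim, c(0.25, 0.75), names = FALSE)
xq <- qnorm(c(0.25, 0.75))
slope <- diff(yq) / diff(xq)
intercept <- yq[1] - slope * xq[1]

qqnorm(x.sim)
abline(intercept, slope, lty = 2)    # should coincide with qqline(x.sim)
```

For a skewed sample like this one, the points bend away from the line at the upper end, which is the kind of departure the plot is meant to reveal.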