Bivariate descriptive statistics
Correlation and covariance
The correlation coefficient is a simple descriptive statistic that measures the strength of the linear relationship between two interval- or ratio-scale variables (as opposed to categorical, or nominal-scale, variables), as might be visualized in a scatter plot. The value of the correlation coefficient, usually symbolized as r, ranges from -1 for a perfect negative (or inverse) correlation to +1 for a perfect positive (or direct) correlation.
illustrations of the strength of the correlation
attach(cities)
plot(cities[,2:12]) # scatter plot matrix, omit city name
cov(cities[,2:12]) # covariance matrix
cor(cities[,2:12]) # correlation matrix
detach(cities)
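The covariance and correlation matrices are closely related: a correlation is a covariance rescaled by the standard deviations of the two variables. The sketch below checks this on synthetic data; the variables x, y, and z are made up for illustration and are not part of the cities data set.

```r
# Synthetic data for illustration only (not the cities data set)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Correlation as covariance rescaled by the two standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)

# The same rescaling applied to a whole covariance matrix
m <- cbind(x, y, z = x - y)
all.equal(cov2cor(cov(m)), cor(m))
```

Because of this rescaling, correlations are unit-free and can be compared across pairs of variables, while covariances depend on the measurement units of each variable.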
An important issue in the calculation and interpretation of correlations and covariances is that they only measure or describe linear relationships. This can be illustrated by the relationship between water surface width and downstream distance at Summit Cr.:
attach(sumcr)
plot(CumLen, WidthWS)
lines(lowess(CumLen, WidthWS), col="blue", lwd=2)
cor(CumLen, WidthWS)
detach(sumcr)
Does the correlation coefficient make any sense here?
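One way to see why the question matters is a small synthetic case, sketched here with made-up values rather than the Summit Cr. data: y is completely determined by x, yet the correlation coefficient is essentially zero because the relationship is not linear.

```r
# Synthetic example (not the Summit Cr. data): a perfect but nonlinear relationship
x <- seq(-3, 3, by = 0.1)
y <- x^2

# Close to 0: the relationship is symmetric about x = 0, not linear
r_linear <- cor(x, y)
r_linear

# The smooth curve shows the structure that r misses
plot(x, y)
lines(lowess(x, y), col = "blue", lwd = 2)
```

A value of r near zero therefore indicates the absence of a linear relationship, not necessarily the absence of any relationship.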
The χ² (Chi-square) measure of association (for categorical data)
Categorical data are data that take on discrete values corresponding to the particular class interval that observations of ordinal-, interval-, or ratio-scale variables fall in, or to the particular group membership of nominal-scale variables. Before applying a particular descriptive statistic, it's always good to plot the data.
descriptive plots for categorical data—mosaic plots
The χ² statistic measures the strength of association between two categorical variables (nominal- or ordinal-scale variables), summarized by a cross-tabulation: a table that shows the frequency of occurrence of observations with particular combinations of the levels of two (or more) variables.
attach(sumcr)
ReachHU.table <- table(Reach,HU) # cross-tabulation of Reach by HU
ReachHU.table
chisq.test(ReachHU.table)
detach(sumcr)
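To make the statistic less of a black box, the sketch below computes it by hand for a small table and compares the result with chisq.test(). The counts and the labels (group, class) are hypothetical and are not taken from the Summit Cr. data.

```r
# Hypothetical 2 x 3 table of counts (not the Summit Cr. data)
obs <- matrix(c(20, 15,  5,
                10, 15, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              class = c("low", "mid", "high")))

# Expected counts if the row and column variables were independent
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square statistic: sum of (observed - expected)^2 / expected
x2_manual <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Compare with the built-in test
res <- chisq.test(obs)
c(manual = x2_manual, builtin = unname(res$statistic), dof = dof)
```

The larger the departures of the observed counts from those expected under independence, the larger the statistic.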
To illustrate the application of the χ² test, the Sierra Nevada reconstructed climate data and the Oregon climate-station data can be converted to categorical (ordinal-scale) data, and the following scripts employed:
# crosstab & Chi-square -- Sierra Nevada TSum and PWin groups
attach(sierra)
PWin.group <- cut(PWin, 3) # three equal-width classes
TSum.group <- cut(TSum, 3)
PWinTSum.table <- table(PWin.group, TSum.group)
PWinTSum.table
chisq.test(PWinTSum.table)
detach(sierra)
# crosstab & Chi-square -- Oregon station elevation and tann
attach(orstationc)
elev.group <- cut(elev, 3) # three equal-width classes
tann.group <- cut(tann, 3)
elevtann.table <- table(elev.group, tann.group)
elevtann.table
chisq.test(elevtann.table)
detach(orstationc)
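The scripts above rely on cut(x, 3), which splits the range of a variable into three equal-width classes. For skewed variables this can leave some classes nearly empty, and sparse cells can make the Chi-square approximation questionable. The sketch below, on synthetic data, contrasts equal-width classes with tercile-based classes; the tercile version is an alternative offered for comparison, not the approach used in the scripts above.

```r
# Synthetic skewed variable for illustration (not the Sierra Nevada or Oregon data)
set.seed(42)
x <- rexp(90)

# Equal-width classes, as produced by cut(x, 3): counts can be very uneven
equal_width <- cut(x, 3)
table(equal_width)

# Alternative (an assumption, not the approach used above): tercile breaks,
# which give classes with roughly equal counts
breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(x, breaks = breaks, include.lowest = TRUE,
                   labels = c("low", "mid", "high"))
table(equal_count)
```

Which classing is appropriate depends on the question: equal-width classes preserve the measurement scale, while quantile classes balance the cell counts.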
A quick look at the appropriate Chi-square distribution (4 degrees of freedom for the 3 x 3 tables above, since (3 - 1)(3 - 1) = 4):
x <- seq(0, 25, by = .1)
pdf <- dchisq(x, 4)
plot(pdf ~ x, type="l")
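The density curve becomes easier to interpret once a critical value is marked on it. The sketch below assumes a 3 x 3 table and a 0.05 significance level, and uses a hypothetical observed statistic (x2_obs); none of these values come from the data sets above.

```r
# Degrees of freedom for an r x c cross-tabulation: (r - 1) * (c - 1)
n_row <- 3
n_col <- 3
dof <- (n_row - 1) * (n_col - 1)

# Critical value at the 0.05 level (the level is an assumption for illustration)
crit <- qchisq(0.95, df = dof)

# Upper-tail probability of a hypothetical observed statistic
x2_obs <- 12
p_value <- pchisq(x2_obs, df = dof, lower.tail = FALSE)

# Density curve with the critical value marked
x <- seq(0, 25, by = 0.1)
dens <- dchisq(x, dof)
plot(x, dens, type = "l")
abline(v = crit, lty = 2)
```

An observed statistic to the right of the dashed line would be judged unlikely under independence at the chosen level.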