GEOG 414/515:  Advanced Geographic Data Analysis
Multivariate distances and cluster analysis

A broad group of multivariate analyses has as its objective the organization of individual observations (objects, sites, individuals). These analyses are built upon the concept of multivariate distances (expressed as either similarities or dissimilarities) among the objects.

The organization generally takes two forms:

  • the arrangement of the objects in a space of lower dimension than the one in which the data were originally observed;
  • the development of "natural" clusters, or classifications, of the objects.

These analyses share many concepts and techniques (both numerical and practical) with other procedures such as principal components analysis, numerical taxonomy, discriminant analysis and so on.

The analyses generally begin with the construction of an n × n matrix D of the distances between objects.  For example, in a two-dimensional space, the elements $d_{ij}$ of D could be the Euclidean distances between points,

$$d_{ij} = \left[(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2\right]^{1/2}$$

The Euclidean distance and related measures are easily generalized to more than two dimensions.
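The construction of D can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `euclidean_distance_matrix` and the three sample points are hypothetical, and the sum simply runs over however many variables each observation has, which is the generalization to more than two dimensions.

```python
import math

def euclidean_distance_matrix(X):
    """Return the n x n matrix of Euclidean distances among the rows of X.

    X is a list of n observations, each a sequence of p variable values.
    """
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Sum of squared differences over all p variables, then square root.
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d  # D is symmetric with a zero diagonal
    return D

# Hypothetical observations in two dimensions (illustration only).
X2 = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D2 = euclidean_distance_matrix(X2)

# The same function works unchanged in three dimensions.
X3 = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
D3 = euclidean_distance_matrix(X3)
```

Only the upper triangle is computed explicitly, because the distance from object i to object j equals the distance from j to i.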

1.  Basic distances

2.  Mahalanobis distances

The basic Euclidean distance treats each variable as equally important in calculating the distance.  An alternative approach is to scale the contribution of each variable to the distance value according to its variability.  This approach is illustrated by the Mahalanobis distance, which is a measure of the distance between each observation in a multidimensional cloud of points and the centroid of the cloud.  The (squared) Mahalanobis distance $D^2$ is given by

$$D^2 = (x - m)^{T} V^{-1} (x - m)$$

where x is a vector of values for a particular observation, m is the vector of means of each variable, and V is the variance-covariance matrix. 
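As a rough sketch of the calculation, the following Python function computes the squared Mahalanobis distance of each observation from the centroid for the special case of two variables, where the inverse of V can be written out by hand. The function name `mahalanobis_sq_2d` and the four sample points are hypothetical, and the use of the sample (n - 1) variance-covariance matrix is an assumption; the notes do not specify the divisor.

```python
def mahalanobis_sq_2d(X):
    """Squared Mahalanobis distance of each row of X from the centroid.

    X is a list of (x1, x2) pairs. Assumes exactly two variables and a
    non-singular sample variance-covariance matrix (divisor n - 1).
    """
    n = len(X)
    # m: vector of means (the centroid).
    m1 = sum(x1 for x1, _ in X) / n
    m2 = sum(x2 for _, x2 in X) / n
    # V: sample variance-covariance matrix [[v11, v12], [v12, v22]].
    v11 = sum((x1 - m1) ** 2 for x1, _ in X) / (n - 1)
    v22 = sum((x2 - m2) ** 2 for _, x2 in X) / (n - 1)
    v12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in X) / (n - 1)
    det = v11 * v22 - v12 ** 2
    if det == 0:
        raise ValueError("variance-covariance matrix is singular")
    # Elements of the inverse of the 2 x 2 matrix V.
    a11, a22, a12 = v22 / det, v11 / det, -v12 / det
    d2 = []
    for x1, x2 in X:
        u1, u2 = x1 - m1, x2 - m2
        # Quadratic form (x - m)' V^{-1} (x - m), written out for two variables.
        d2.append(a11 * u1 * u1 + 2 * a12 * u1 * u2 + a22 * u2 * u2)
    return d2

# Hypothetical data: the second variable is twice as spread out as the first.
X = [(-1.0, 0.0), (1.0, 0.0), (0.0, -2.0), (0.0, 2.0)]
d2 = mahalanobis_sq_2d(X)
```

In this made-up example the points lie at Euclidean distances of 1 and 2 from the centroid, yet all four have the same Mahalanobis distance, because each variable's contribution is scaled by its own variability.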

3.  Multidimensional scaling (MDS)

The objective of MDS is to portray the relationships among objects observed in a multidimensional space in a lower-dimensional space (usually 2-D), in such a way that the relative distances among objects in the original space are preserved.  The classic illustrative example is the analysis of geographically arrayed data, which can be done with the Oregon climate-station data.
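One way to see how distances alone can yield a configuration of points is classical (metric) MDS, sketched below in plain Python. This is an assumption about the variant: the notes do not say which form of MDS is intended. The function name `classical_mds`, the power-iteration eigen-solver, and the four sample points are all hypothetical choices made to keep the example self-contained; the points are not the Oregon climate-station data.

```python
import math
import random

def classical_mds(D, k=2, iters=500):
    """Classical (metric) MDS: embed a symmetric distance matrix D in k dimensions.

    Assumes D holds Euclidean-like distances, so the leading eigenvalues of the
    double-centred matrix are non-negative and well separated.
    """
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring of the squared distances gives the inner-product matrix B.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(0)  # fixed seed so the result is reproducible
    coords = [[0.0] * k for _ in range(n)]
    for axis in range(k):
        # Power iteration for the current leading eigenvector of B.
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue.
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords

# Hypothetical "map": four points at the corners of a 4 x 3 rectangle.
pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
D = [[math.dist(p, q) for q in pts] for p in pts]
Y = classical_mds(D, k=2)
```

Because these made-up points already lie in a plane, the recovered 2-D coordinates reproduce the original inter-point distances, although the configuration may be rotated, reflected, or shifted relative to the original. In practice a library eigen-solver would replace the power iteration.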

4.  Cluster analysis

In a cluster analysis, the objective is to use similarities or dissimilarities among objects (expressed as multivariate distances) to assign the individual observations to "natural" groups.  Cathy Whitlock's surface sample data from Yellowstone National Park describe the spatial variations in pollen data for that region, and each site was subjectively assigned to one of five vegetation zones.
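The idea of building groups from a distance matrix can be illustrated with a small agglomerative (hierarchical) clustering sketch in Python. The choice of complete linkage, the function name `agglomerative_clusters`, and the six sample points are assumptions made for illustration; the notes do not prescribe a particular clustering method, and the points are not the Yellowstone pollen data.

```python
import math

def agglomerative_clusters(D, k):
    """Merge objects into k clusters using complete linkage on distance matrix D.

    Returns a list of clusters, each a list of object indices.
    """
    clusters = [[i] for i in range(len(D))]  # start with every object alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                d = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical data: two well-separated groups of three points each.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
D = [[math.dist(p, q) for q in pts] for p in pts]
groups = agglomerative_clusters(D, k=2)
```

With real data, the resulting groups could be cross-tabulated against a subjective classification, such as the vegetation zones mentioned above, to see how well the "natural" clusters agree with it.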

Readings

Crawley (Statistical Computing...): Ch. 40; Manly (Multivariate Statistical Methods...): Ch. 9