Cluster analysis

A very large number of clustering procedures exist, and many of them have been implemented in R and S-Plus.  The examples below apply three of them (hierarchical clustering, k-means, and "pam") to the Yellowstone pollen data:

# Cluster analysis of Yellowstone pollen data
# To read the data in from the file yellpolsqrt.csv and recode Veg
# as a factor, uncomment and run the following lines:
# ypolsqrt <- read.csv("yellpolsqrt.csv")
# ypolsqrt$Veg <- as.factor(ypolsqrt$Veg)
# levels(ypolsqrt$Veg) <-
#     c("Steppe","Lodgepole","Trans","Subalpine","Alpine")

# hierarchical clustering, Ward's method, Yellowstone pollen data
attach(ypolsqrt)
X <- as.matrix(ypolsqrt[,4:35])
X.dist <- dist(scale(X))
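# image(1:58, 1:58, as.matrix(X.dist))
# method "ward" was renamed "ward.D" in R 3.1.0; "ward.D2" is the variant
# that squares the distances before applying Ward's criterion
hier.cls <- hclust(X.dist, method = "ward.D")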
plot(hier.cls, labels=Veg, cex=.7)

# cut dendrogram to give 5 clusters
nclust <- 5
clusternum <- cutree(hier.cls, k=nclust)
class.table.hier <- table(Veg, clusternum)
mosaicplot(class.table.hier, color=TRUE)

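# k-means clustering of Yellowstone pollen data
# (kmeans() is in the stats package, so no library() call is needed;
# note that X is not standardized here, unlike in the hierarchical analysis)
X <- as.matrix(ypolsqrt[,4:35])
set.seed(1)  # k-means starts from random centers; fix the seed for repeatable results
kmean.cls <- kmeans(X, centers=5, nstart=25)  # keep the best of 25 random starts

class.table.km <- table(Veg, kmean.cls$cluster)
mosaicplot(class.table.km, color=TRUE)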
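
Because kmeans() starts from randomly chosen centers, repeated runs can number the clusters differently or even partition the data differently. The following is only a minimal sketch of how a fixed seed and several random starts make the result repeatable. It uses the built-in iris data as a stand-in (an assumption, since the pollen file is not bundled with R), and the seed and nstart values are arbitrary illustrative choices:

```r
# Stand-in data: the four numeric columns of the built-in iris data set
# (an assumption for illustration; not the Yellowstone pollen data)
X.demo <- as.matrix(iris[, 1:4])

# Same seed, same number of random starts: the two fits should be identical
set.seed(42)
km1 <- kmeans(X.demo, centers = 3, nstart = 25)
set.seed(42)
km2 <- kmeans(X.demo, centers = 3, nstart = 25)

print(identical(km1$cluster, km2$cluster))  # are the two partitions the same?
print(km1$size)                             # number of observations per cluster
print(km1$tot.withinss)                     # total within-cluster sum of squares
```

With nstart greater than 1, kmeans() keeps the start that gives the smallest total within-cluster sum of squares, which makes a poor local optimum less likely.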

# "pam" (partitioning around medoids) clustering of Yellowstone pollen data
library(cluster)  # pam() is in the cluster package
X <- as.matrix(ypolsqrt[,4:35])
pam.cls <- pam(X, k=5)
plot(pam.cls)  # cluster plot and silhouette plot

class.table.pam <- table(Veg, pam.cls$cluster)
mosaicplot(class.table.pam, color=TRUE)
 

The resemblance between the subjectively derived vegetation zones and the clusters can be judged using some straightforward diagnostic plots, such as mosaic plots of the cross-tabulation of Veg against cluster membership.
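
The same cross-tabulation can also be reduced to a single number. The sketch below is an illustration only, not part of the original analysis: it uses the built-in iris data as a stand-in (Species plays the role of Veg), and the "purity" summary is an assumed, illustrative measure of agreement.

```r
# Stand-in data (an assumption): iris measurements, with Species in the role of Veg
X.demo <- scale(as.matrix(iris[, 1:4]))
groups <- iris$Species

# Hierarchical clustering cut into as many clusters as there are known groups
hc.demo <- hclust(dist(X.demo), method = "ward.D2")
cl.demo <- cutree(hc.demo, k = nlevels(groups))

# Cross-tabulate known groups (rows) against cluster membership (columns)
tab <- table(groups, cl.demo)
print(tab)

# "Purity" (illustrative measure): the share of observations that belong to
# the most common known group within their own cluster
purity <- sum(apply(tab, 2, max)) / sum(tab)
print(round(purity, 3))

# The mosaic plot shows the same table graphically
mosaicplot(tab, color = TRUE, main = "Known groups vs. clusters")
```

A purity near 1 means that each cluster is dominated by a single known group; lower values mean that clusters mix observations from several groups.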