CLUSTERING

 

0.     Review of principal components – another unsupervised learning method

 

> attach(USArrests)    

This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

> names(USArrests)

[1] "Murder"   "Assault"  "UrbanPop" "Rape"   

 

> pc = prcomp(USArrests, scale=TRUE)

> biplot(pc)

 

 

Red vectors are projections of the original X-variables onto the space of the first two principal components. We can see that the first principal component Z1 mostly represents the combined crime rate, and the second principal component Z2 mostly represents the level of urbanization.
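To check this interpretation numerically, we can print the loadings of the first two components; pc$rotation is the standard prcomp output, and the rounding below is just for readability:

> round( pc$rotation[, 1:2], 2 )      # loadings of Z1 and Z2 on the four variables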

 

1.     K-means method

 

Now we use K-means clustering to find more homogeneous groups among the states.

Let’s start with K=2 clusters. The 50 states are partitioned into 2 groups, Cluster 1 with 21 states and Cluster 2 with 29.

 

> X = USArrests

> KM2 = kmeans(X, 2)

> KM2

K-means clustering with 2 clusters of sizes 21, 29

 

Cluster means:

     Murder  Assault UrbanPop     Rape

1 11.857143 255.0000 67.61905 28.11429

2  4.841379 109.7586 64.03448 16.24828

 

Clustering vector:

       Alabama         Alaska        Arizona       Arkansas     California

             1              1              1              1              1

      Colorado    Connecticut       Delaware        Florida        Georgia

             1              2              1              1              1

        Hawaii          Idaho       Illinois        Indiana           Iowa

             2              2              1              2              2

        Kansas       Kentucky      Louisiana          Maine       Maryland

             2              2              1              2              1

 Massachusetts       Michigan      Minnesota    Mississippi       Missouri

             2              1              2              1              2

       Montana       Nebraska         Nevada  New Hampshire     New Jersey

             2              2              1              2              2

    New Mexico       New York North Carolina   North Dakota           Ohio

             1              1              1              2              2

      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina

             2              2              2              2              1

  South Dakota      Tennessee          Texas           Utah        Vermont

             2              1              1              2              2

      Virginia     Washington  West Virginia      Wisconsin        Wyoming

             2              2              2              2              2

 

Within cluster sum of squares by cluster:

[1] 41636.73 54762.30

 (between_SS / total_SS =  72.9 %)
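One caution: kmeans starts from randomly chosen centroids, so the cluster labels, and occasionally the partition itself, can change from run to run. A common precaution, sketched below with arbitrary values, is to set a seed and keep the best of several random starts:

> set.seed(1)                          # make the run reproducible

> KM2 = kmeans( X, 2, nstart = 20 )    # keep the best of 20 random starts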

 

 

Let’s look at the position of these clusters on our biplot. The biplot is drawn on a different scale, so I multiply the principal component scores by a coefficient of 3.5 to match the points to the state names.

 

> points(3.5*pc$x[,1], 3.5*pc$x[,2], col=KM2$cluster, lwd=5)

 

Should we use more clusters? Let’s try K=5.

 

> KM5 = kmeans(X,5)

> points(3.5*pc$x[,1], 3.5*pc$x[,2], col=KM5$cluster, lwd=5)
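One informal way to choose K is an “elbow” plot of the total within-cluster sum of squares against K. A minimal sketch (the wss loop below is my own addition, not part of the original lab):

> wss = sapply( 1:10, function(k) kmeans(X, k, nstart = 20)$tot.withinss )

> plot( 1:10, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS" )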

 

2.     Hierarchical Clustering and Dendrogram

 

So, how many clusters should be used? We can apply the hierarchical clustering algorithm, which does not require us to pre-specify the number of clusters.

 

> HC = hclust( dist(X), method="complete" )

 

Here, “dist” stands for the matrix of pairwise distances between multivariate observations, and method can be “complete”, “single”, “average”, “median”, etc. – it is the linkage method that determines how dissimilarity between clusters is measured.

 

We can see the dendrogram that this method has created.

 

> plot(HC)
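A caveat worth noting: dist on the raw data is dominated by Assault, whose values are an order of magnitude larger than those of the other variables. If we want each variable to count equally, we can standardize the columns first (a sketch; HCs is a hypothetical name):

> HCs = hclust( dist(scale(X)), method="complete" )    # standardize columns before computing distances

> plot(HCs)

Below we continue with the original, unscaled tree HC.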

 

We then cut the tree at some level and create clusters.

 

> cutree(HC,5)

       Alabama         Alaska        Arizona       Arkansas     California

             1              1              1              2              1

      Colorado    Connecticut       Delaware        Florida        Georgia

             2              3              1              4              2

        Hawaii          Idaho       Illinois        Indiana           Iowa

             5              3              1              3              5

        Kansas       Kentucky      Louisiana          Maine       Maryland

             3              3              1              5              1

 Massachusetts       Michigan      Minnesota    Mississippi       Missouri

             2              1              5              1              2

       Montana       Nebraska         Nevada  New Hampshire     New Jersey

             3              3              1              5              2

    New Mexico       New York North Carolina   North Dakota           Ohio

             1              1              4              5              3

      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina

             2              2              3              2              1

  South Dakota      Tennessee          Texas           Utah        Vermont

             5              2              2              3              5

      Virginia     Washington  West Virginia      Wisconsin        Wyoming

             2              2              5              5              2
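It is instructive to compare this 5-cluster cut with the K-means solution from Section 1. Cluster labels are arbitrary in both methods, so agreement shows up as one dominant count in each row of the cross-tabulation (a quick sketch):

> table( cutree(HC, 5), KM5$cluster )    # rows: hierarchical clusters; columns: K-means clusters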

 

3.     College data - K-means method

 

Our task will be to cluster Colleges into more homogeneous groups.

 

> library(ISLR)               # the College data set comes from the ISLR package

> attach(College);   names(College)

 [1] "Private"     "Apps"        "Accept"      "Enroll"      "Top10perc"   "Top25perc"   "F.Undergrad" "P.Undergrad" "Outstate"    "Room.Board"

[11] "Books"       "Personal"    "PhD"         "Terminal"    "S.F.Ratio"   "perc.alumni" "Expend"      "Grad.Rate" 

 

We need to create a matrix of numeric variables. We’ve used this command to prepare data for LASSO.

 

> X = model.matrix( Private ~ . + as.numeric(Private), data=College )

> dim(X)

[1] 777  19
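Note that the first column of X is the constant “(Intercept)” column produced by model.matrix. A constant column shifts all observations equally, so it has no effect on pairwise distances or on K-means, but it can be dropped if preferred (a sketch; X2 is a hypothetical name):

> X2 = X[ , -1]    # copy of X without the "(Intercept)" column; distances are unchanged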

> head(X)      # instead of printing the entire matrix, “head” shows only the first few rows

 

                             (Intercept) Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD

Abilene Christian University           1 1660   1232    721        23        52        2885         537     7440       3300   450     2200  70

Adelphi University                     1 2186   1924    512        16        29        2683        1227    12280       6450   750     1500  29

Adrian College                         1 1428   1097    336        22        50        1036          99    11250       3750   400     1165  53

Agnes Scott College                    1  417    349    137        60        89         510          63    12960       5450   450      875  92

Alaska Pacific University              1  193    146     55        16        44         249         869     7560       4120   800     1500  76

Albertson College                      1  587    479    158        38        62         678          41    13500       3335   500      675  67

                             Terminal S.F.Ratio perc.alumni Expend Grad.Rate as.numeric(Private)

Abilene Christian University       78      18.1          12   7041        60                   2

Adelphi University                 30      12.2          16  10527        56                   2

Adrian College                     66      12.9          30   8735        54                   2

Agnes Scott College                97       7.7          37  19016        59                   2

Alaska Pacific University          72      11.9           2  10922        15                   2

Albertson College                  73       9.4          11   9727        55                   2

 

 

 

Now, let’s create K=5 clusters by the K-means method. No new library is needed; the kmeans command comes with base R.

 

> KM5 = kmeans( X, 5 )

 

> KM5

K-means clustering with 5 clusters of sizes 20, 113, 162, 431, 51

 

Cluster means:

  (Intercept)      Apps    Accept    Enroll Top10perc Top25perc F.Undergrad P.Undergrad  Outstate Room.Board    Books Personal      PhD Terminal

1           1  9341.750 3606.2500 1321.9500  76.05000  91.70000    5283.200    427.2000 18119.750   6042.750 576.6000 1255.550 93.30000 96.80000

2           1  5012.602 3410.1150 1526.5310  21.56637  52.28319    8021.566   2111.3097  6709.283   3703.912 557.1416 1727.186 77.01770 83.65487

3           1  2566.364 1712.7901  521.5123  39.83333  68.96914    2067.241    282.4444 15732.512   5257.864 578.0926 1042.772 83.31481 90.24074

4           1  1140.610  869.9258  341.7007  21.40371  48.75638    1434.332    475.6450  9263.759   4110.290 530.1206 1299.220 65.03016 72.61717

5           1 13169.804 8994.7647 3438.1176  34.84314  67.15686   17836.020   3268.3529  8833.510   4374.353 593.0784 1813.784 85.54902 90.64706

 

  S.F.Ratio perc.alumni    Expend Grad.Rate as.numeric(Private)

1   6.61500    35.35000 32347.900  88.95000            2.000000

2  17.46903    14.02655  7067.257  54.91150            1.079646

3  11.43333    32.76543 13728.735  76.64198            1.993827

4  14.32343    21.36659  7677.035  63.13225            1.856148

5  15.99608    16.92157 10343.882  63.82353            1.117647

 

Clustering vector:

                 Abilene Christian University                            Adelphi University                                Adrian College

                                            4                                             3                                             4

                          Agnes Scott College                     Alaska Pacific University                             Albertson College

                                            3                                             4                                             4

                      Albertus Magnus College                                Albion College                              Albright College

                                            4                                             3                                             3

                    Alderson-Broaddus College                             Alfred University                             Allegheny College

                                            4                                             3                                             3

 

<truncated>

 

Within cluster sum of squares by cluster:

[1] 2115931982 3262290091 3917614114 5524699694 5934672728

 (between_SS / total_SS =  71.2 %)

 

Available components:

 

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"     

 

We can see the cluster assignment (truncated), the multivariate cluster means (centroids), and the within- and between-cluster sums of squares, which measure cluster homogeneity and separation. To explore the obtained clusters, we can plot some pairs of variables along with the assigned clusters:

 

> par(mfrow=c(2,2))

> plot( Outstate, Top10perc, col=KM5$cluster )

> plot( S.F.Ratio, PhD, col=KM5$cluster )

> plot( Apps, Enroll, col=KM5$cluster )

> plot( Room.Board, Private, col=KM5$cluster )

 

For example, we can see here that the green cluster consists of rather expensive, relatively small private colleges with a high percentage of PhD degrees among the faculty and small classes, due to a low student-to-faculty ratio.
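Note also that the within-cluster sums of squares above are in the billions because the variables are on very different raw scales; Apps, F.Undergrad, and Expend dominate the Euclidean distances, so the clustering is driven mainly by these large-scale variables. If we want each variable to contribute comparably, one option is to standardize first (a sketch; Xs and KM5s are hypothetical names):

> Xs = scale( X[ , -1] )                 # standardize every column except the constant intercept

> KM5s = kmeans( Xs, 5, nstart = 20 )    # K-means on the standardized data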

 

 

4.     College data - Hierarchical Clustering

 

Without specifying the number K of clusters in advance, we now apply the hierarchical clustering algorithm to the College data.

 

> HC = hclust( dist(X), method="complete" )

 

As before, “dist” computes the pairwise distances between observations, and method specifies the linkage that determines dissimilarity between clusters.

 

The full dendrogram, with so many leaves, would not be legible.

> plot(HC)

 

 

To illustrate the method, let’s take a small random sample of colleges and cluster them hierarchically.

 

> n = nrow(X)         # number of colleges

> Z = sample(n, 20)

> Y = X[Z,]

> HCZ = hclust( dist(Y), method="complete" )

> plot(HCZ)
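Because sample draws a different random subset on each run, the dendrogram above will vary between sessions. Setting a seed first makes the example reproducible (a sketch; the seed value is arbitrary):

> set.seed(10)        # any fixed seed gives the same random sample every run

> Z = sample(n, 20)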

We can choose where to “cut” this tree to create clusters. For example, let’s create 4 clusters.

 

> HC4 = cutree(HCZ, k = 4)

> HC4

        Christian Brothers University         Nazareth College of Rochester

                                    1                                     1

                  Sweet Briar College                     Dartmouth College

                                    1                                     2

                       Eckerd College          Appalachian State University

                                    1                                     3

< truncated >

 

So, we get the assignment of each of the sampled colleges to a cluster.
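To see the size of each cluster at a glance, we can tabulate the assignments (a quick sketch):

> table(HC4)    # number of colleges in each of the 4 clusters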