recipes : Statistics : How do I perform k-Means clustering?

Problem

I have some data and want to try clustering it with the k-means algorithm. How do I go about doing that?

Solution
The MATLAB Statistics Toolbox has some nice functions for performing
and interpreting k-means clustering analyses. However, before delving into these
we should establish what k-means clustering *is* and what it can and can't tell you.

k-means is a clustering algorithm. Clustering algorithms are unsupervised techniques for sub-dividing a larger dataset into smaller groups. The term "unsupervised" indicates that the data do not originate from clearly defined groups that can be used to label them *a priori*.
For example, say that you're exploring a group of 300 hens in a barn and you've measured a
bunch of parameters from each bird: height, weight, egg size, laying regularity, egg color,
etc. Based on all these parameters, you want to figure out if the hens fall into a small
number of distinct groups or if they constitute just a single homogeneous population. Note that
this question has two features: 1) At the outset you don't know how many groups there are and
therefore 2) you have no way of assigning a given hen to a given group. All hens are treated
equally. For problems such as this, you can use k-means clustering to objectively assign each hen to a
group. Using further techniques (described below) you can objectively test whether the assignments
you've made are reasonable.

As you might imagine, if there are "unsupervised" techniques then there must also be "supervised" techniques. Supervised techniques are often also called "classification" techniques and are applied to data sets where the identities of the groups are known beforehand. We will not discuss these approaches here, but you can read about them in the introductory page on clustering and classification.

*Delving deeper into k-means*

The MATLAB help has a brief overview of different unsupervised clustering techniques, a nice example of how to apply k-means, and an explanation of how it works. The help page for kmeans provides another example. I suggest you check these out in addition to reading this page. Here we will provide a further example and also emphasise some different aspects of the clustering approach, such as validation of the final clustering. So let's kick off with a simple example:

```
%Let's make some fake data with two groups
n=75; %sample size
x=[randn(n,1)+2;randn(n,1)+4.75];
y=[randn(n,1)+2;randn(n,1)+4.75];

%the true group identity
groups=[ones(n,1);ones(n,1)+1];

%plot the data
scatter(x,y,50,groups,'filled')
```

We've made two groups of data that overlap somewhat. In this case, because it's a toy example, we
know from which group each data point originates. Remember that in real-world uses you won't
have this information, but we're going to use it here to demonstrate the effectiveness of k-means.
Since we know the group identities, let's see if *k-means* can figure out from which group each data point
originates. The k-means algorithm must be supplied with two things:
the raw data (obviously!) and the number of groups into which the
data should be partitioned. That second point is important: it's
*your* job to tell the algorithm how many groups there
are. This isn't something it can determine for itself.

Just to hammer home the caveat: normally you would use a supervised classification technique (e.g. discriminant analysis) when you already know the identities of the groups. This is because a supervised technique will use the group identities to search for the boundary that best separates the groups; an un-supervised technique, such as k-means, ignores the group identities and simply asks if the data naturally partition in some way. With that in mind, let's see how k-means divides the data.

```
data=[x,y];
IDX=kmeans(data,2); %Run k-means, asking for two groups
```

The variable IDX tells you the cluster to which k-means has assigned each data point. We can count how often it gets the group right by doing: sum(groups==IDX). In one run of this example that returns 147, so for 147 out of 150 data points k-means has classified things correctly. (Two caveats: the data are random, so your numbers will differ slightly from run to run, and k-means numbers its clusters arbitrarily, so if clusters 1 and 2 come out swapped you'll need to flip the labels before comparing.) We can verify this by re-doing the scatter plot, this time with the colours derived from the k-means results.
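The check and the re-plot described above can be sketched as follows. Since k-means numbers its clusters arbitrarily, this sketch compares both possible labellings and keeps the better one:

```
%k-means numbers clusters arbitrarily, so compare both possible labellings
nCorrect = max(sum(groups==IDX), sum(groups==(3-IDX)));
fprintf('%d of %d points classified correctly\n', nCorrect, length(groups))

%re-do the scatter plot, this time coloured by the k-means assignments
scatter(x,y,50,IDX,'filled')
```

With only two clusters, 3-IDX simply swaps labels 1 and 2; for more clusters you would need a proper matching between true and assigned labels.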

Good, right? Can you spot where it made the mistakes? In this case we knew there were two groups and, remember, k-means will divide up data into as many groups as we like. If we had a unimodal distribution (just one group) we could still run k-means and divide it into two. Watch:

```
%Let's make some fake data with just one group
n=150; %sample size
x=[randn(n,1)+3;randn(n,1)+3];
y=[randn(n,1)+3;randn(n,1)+3];

%plot the data
subplot(1,2,1)
plot(x,y,'ok','MarkerFaceColor','k')

%now divide into two using k-means and plot the results
data2=[x,y];
IDX=kmeans(data2,2);

%plot the k-means results
subplot(1,2,2)
scatter(x,y,50,IDX,'filled')
```

You can see that k-means has divided the distribution into two even though there is only one group. Blindly running k-means isn't enough: you also have to validate what you've done. You have to demonstrate that dividing the data into *k* groups is a reasonable thing to do. MATLAB comes to the rescue by allowing you to calculate a so-called silhouette value for each observation. Silhouette values can be calculated on any data that have been categorised into groups by a clustering algorithm. The silhouette value tells you both how tightly a data point is associated with its assigned cluster and how dissimilar it is from other clusters. Each data point has a silhouette value, and the mean silhouette value over all points summarises the overall quality of the clustering.

*Validating cluster identity*

So how do we use k-means and the silhouette value to decide if the clustering we have produced is reasonable? The approach I'll show you here works as follows: run k-means a bunch of times, each time with a different *k*, and calculate the mean silhouette value for each *k*. If your data really fall into, say, 2 groups then you should see a peak in the silhouette values when *k=2*. Let's run this for the two cases shown above and see if this works.

```
%Run k-means for a range of k
for k=2:6
    IDX=kmeans(data,k); %the data with two groups
    [S,H] = silhouette(data, IDX);
    silA(k)=mean(S); %the mean silhouette value for two groups

    IDX=kmeans(data2,k); %the data with one group
    [S,H] = silhouette(data2, IDX);
    silB(k)=mean(S); %the mean silhouette value for one group
end

%Plot the results
clf %clear the figure window
hold on
plot(1:6, silA,'ok-','MarkerFaceColor','k') %2 groups
plot(1:6, silB,'or-','MarkerFaceColor','r') %1 group
set(gca,'XTick',1:6)
xlabel('k')
ylabel('mean silhouette value')
hold off
```

The black line shows the mean silhouette values for the data with two groups. The red line shows the mean silhouette values for the data with one group. The black line has a peak at 2, which indicates that the distribution may indeed truly be partitioned into two. The red line has no peak, which indicates that the underlying distribution is unimodal. As ever, the devil's in the details. You would probably want to extend such an analysis by repeating it multiple times on randomly chosen sub-samples of the data in order to produce error bars for each point. You'd also want some way to assess how good the clustering is for the value at which you see a peak.
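The resampling idea mentioned above can be sketched as follows. The sub-sample fraction (80%) and the number of repeats (20) are arbitrary choices for illustration, not prescriptions:

```
%Repeat the silhouette analysis on random 80% sub-samples of the data
%(the sub-sample fraction and number of repeats are arbitrary choices)
nReps=20;
nKeep=round(0.8*size(data,1));
silBoot=zeros(6,nReps);
for k=2:6
    for ii=1:nReps
        r=randperm(size(data,1)); %a random sub-sample of the rows
        IDX=kmeans(data(r(1:nKeep),:),k);
        S=silhouette(data(r(1:nKeep),:),IDX);
        silBoot(k,ii)=mean(S);
    end
end

%plot the mean with error bars (standard deviation across repeats)
errorbar(2:6, mean(silBoot(2:6,:),2)', std(silBoot(2:6,:),[],2)', 'ok-')
set(gca,'XTick',1:6)
xlabel('k')
ylabel('mean silhouette value')
```

A clear peak that survives the resampling is much more convincing than one from a single run.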

Discussion

What we've covered here is just one way of doing k-means. You'll notice that I just went with the default options. MATLAB allows you to run the k-means algorithm in various different ways; for example you have several choices for how the algorithm chooses its starting points and what measure of distance it uses. It's beyond the scope of this article to go into all of these in detail. The best advice is to play with them and use a little common sense. For example, if you have no good reason for thinking that the cityblock distance option is a good description of your data then don't use it. You might also want to check out the introductory page on clustering and classification.
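As a taste of those options, here are two of the name/value pairs that kmeans accepts (see the kmeans help page for the full list; whether either is appropriate depends on your data):

```
%City block (L1) distance instead of the default squared Euclidean
IDX = kmeans(data, 2, 'Distance','cityblock');

%Run 10 times from different random starting points; the best solution is kept
IDX = kmeans(data, 2, 'Replicates',10);
```

The 'Replicates' option in particular is worth knowing about, since k-means can converge to different solutions from different starting points.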
