How do I cluster my data?

Problem

I want to cluster my data but I don't know where to begin.

Solution

There are various functions for clustering and classifying data in the MATLAB Statistics Toolbox. Clustering and classifying are different but related techniques, and we'll discuss them together here. If you've never done any of these things before, however, it can be hard to know where to start. The purpose of this recipe is to help you orient yourself and get you going. Remember that clustering and classification are a big field and there's no way a single web page will tell you everything there is to know. With that caveat in mind, let's get started.

These approaches are part of a family of techniques collectively known as "machine learning." This basically means the use of computer algorithms to extract underlying features of the data. For example, an e-mail spam filter qualifies as a machine learning approach since it searches for keywords to separate real messages from junk. There are two broad classes of machine learning techniques: supervised and unsupervised.

*What is supervised learning?*

The term "supervised" means that the algorithm "knows" into which
groups the data fall. Let's clarify what that means with an example.
Say that you have a factory that manufactures widgets and this factory
produces supposedly identical widgets from 3 different production
lines. You want to know if all the lines really are behaving in the
same way. So you measure sizes, weights, and other properties of the
widgets coming off the lines. Given that you also know which line each
widget came from, you are in a position to explore whether there are
significant differences between the widgets from the different
lines. The simplest analysis you could do is to plot each parameter,
such as weight, on a box plot, with one box per production line. Then
you could look for obvious differences between the weights of widgets
from the different production lines. However, you might have measured
a dozen different parameters and perhaps the difference between the
production lines is only evident if you look at the variables in
combination. Perhaps, say, widgets from Line 2 tend to be slightly
lighter *and* slightly narrower *and* a slightly
different colour from the widgets in Lines 1 and 3. A supervised
learning technique can help you identify complicated relationships of
this sort in a way that simple independent comparisons might not.
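
To make the box plot idea concrete, here is a minimal sketch using simulated data. Everything in it is invented for illustration: the variable names (`lineID`, `weight`), the weights in grams, and the assumption that Line 2 runs about 3 g lighter. It uses `boxplot` from the Statistics Toolbox.

```matlab
% Simulated widget weights. All numbers here are made up for illustration.
rng(1);                                    % make the random numbers repeatable
nPerLine = 50;                             % widgets measured per production line
lineID   = repelem((1:3)', nPerLine);      % the line each widget came from
weight   = 100 + 2*randn(3*nPerLine, 1);   % weight in grams
weight(lineID == 2) = weight(lineID == 2) - 3;   % assume Line 2 runs lighter

% One box per production line
figure
boxplot(weight, lineID)
xlabel('Production line')
ylabel('Weight (g)')

% Median weight for each line, to go with the plot
medians = accumarray(lineID, weight, [], @median);
disp(medians')
```

With real data you would replace the simulated `weight` and `lineID` vectors with your own measurements and group labels.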

*What is un-supervised learning?*

The term "un-supervised" means that the algorithm does *not*
"know" into which groups the data fall. In our previous example, we
had 3 production lines that could be used to segregate the widgets and
so we could actively search for differences based upon these
lines. Making a box plot with one box per line constituted a
simple "supervised" analysis. This is because we can organise our
thinking based upon where we know the data (the widgets) came from. If
we didn't have the benefit of the three production lines then no
supervised analysis would be possible. Thus, an example of an
un-supervised analysis scenario would be if our widget factory had
only one production line and wanted to check whether the widgets
coming off the line are uniform, or if they fall into two or more
groups. Anything is possible: perhaps the production line screws up
10% of the time and so 10% of the widgets are slightly larger than
they should be. Clearly we now have no way of dividing up the data
before plotting so all we can do is make a single histogram of the
widget weights. Such a histogram would constitute a simple,
un-supervised, data analysis.
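
The single-line scenario can be sketched in the same way. The data are again simulated: the 10% fault rate comes from the example above, but the size of the fault (10 g) and the variable names are assumptions. Note that `isFaulty` is only used to build the fake data; a real un-supervised analysis would not have it.

```matlab
% Simulated weights from ONE production line. All numbers are invented.
rng(2);
nWidgets = 500;
weight   = 100 + 2*randn(nWidgets, 1);      % weight in grams
isFaulty = rand(nWidgets, 1) < 0.10;        % about 10% of widgets go wrong
weight(isFaulty) = weight(isFaulty) + 10;   % assume faulty widgets are heavier

% We pretend not to know isFaulty, so one histogram is all we can draw
figure
histogram(weight, 40)                       % 40 bins
xlabel('Weight (g)')
ylabel('Number of widgets')
```

If the faulty widgets differ enough from the rest, the histogram should show a second, smaller bump; if they differ only slightly, it may just look like a skewed tail.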

*But why are you showing me how to make simple plots?*

Perhaps what appears to be a tutorial on box plots and histograms
isn't what you expected when you came to a page looking for how to
cluster and classify data? Think again. The key thing about these
machine learning approaches is that they don't perform magic. They're
just a formalised way of exploring structure in data. If your data
have no interesting structure then they're not going to reveal
something that isn't there. It can be worse than that: if misused,
these algorithms may suggest the presence of structure that doesn't exist. You
have to know what your data look like in order to evaluate the results
of any machine learning approach. Thus, the first thing you should do
is plot and explore the data. If your data are already divided in some way
(e.g. you have a "multiple production line" scenario) then make sure you
bring out the identity of the groups in the plots, e.g. by colour-coding
the points in a scatter plot. If you don't have this advantage, then
you should focus on plots that showcase how the data from your single
group are distributed.
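
As a rough illustration of colour-coding by group, the sketch below uses `gscatter` from the Statistics Toolbox on simulated data. The two measurements (weight and width), their units, and the size of the Line 2 offset are all assumptions made for the example.

```matlab
% Simulated widgets from three lines. All numbers are invented.
rng(3);
nPerLine = 50;
lineID   = repelem((1:3)', nPerLine);
weight   = 100 + 2*randn(3*nPerLine, 1);    % grams
width    = 20 + 0.5*randn(3*nPerLine, 1);   % millimetres

% Assume Line 2 is slightly lighter AND slightly narrower
isLine2 = (lineID == 2);
weight(isLine2) = weight(isLine2) - 2;
width(isLine2)  = width(isLine2) - 0.5;

% Scatter plot with one colour per production line
figure
gscatter(weight, width, lineID)
xlabel('Weight (g)')
ylabel('Width (mm)')
```

A difference that is hard to see in either variable alone can be easier to spot when the two are plotted against each other like this.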

*I've made my plots, now what?*

Now that you have some idea what you're looking for, you're ready to
start applying more complicated tools than histograms, scatter plots,
and box plots. What tools? If have an unsupervised scenario then
you'll want to employ tools that help you search your data for
structure and that objectively sub-divide your data. If have a
supervised scenario then you want to employ tools that actively search
for differences between your groups and quantify how well separated
they are. There's nothing stopping you from using un-supervised
techniques on data that are amenable to a supervised algorithm, but
you may not get the most out of your data by doing this.

Unsupervised techniques include a range of approaches from data-visualisation to objective clustering of data. Dendrograms are a type of plot that reveals the hierarchical relationship between data points. They show which data points are similar to each other and which are different. Dendrograms are a useful visualisation tool but are not a formal statistical test, i.e. there is no formal way of judging whether the branches in a dendrogram are in some sense significant. Clustering approaches such as k-means are also unsupervised, because the data need not come from pre-defined groups; the job of the clustering algorithm is to divide data into classes. The results of this depend on the algorithm you've chosen (there are many) and what your data look like. Again, the fact that you've partitioned your data into clusters doesn't mean that those clusters are real. Additional work needs to be done to verify this. An example is provided on the k-means page of this site.
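
Here is a minimal sketch of both ideas on simulated data, using `linkage`, `dendrogram`, `kmeans`, and `silhouette` from the Statistics Toolbox. The data, the choice of average linkage, and the choice of two clusters are all assumptions made for illustration, not recommendations. The silhouette values at the end are one possible sanity check on the clusters; they are not a formal test either.

```matlab
% Simulated widgets: 90 normal and 10 oversized. All numbers are invented.
rng(4);
weight = [100 + 2*randn(90, 1); 110 + 2*randn(10, 1)];    % grams
width  = [20 + 0.5*randn(90, 1); 22 + 0.5*randn(10, 1)];  % millimetres
X  = [weight, width];
Xz = zscore(X);            % standardise so both measurements count equally

% Hierarchical clustering, shown as a dendrogram
Z = linkage(Xz, 'average');
figure
dendrogram(Z, 0)           % 0 means draw every data point as its own leaf

% k-means with an assumed number of clusters
k = 2;
[idx, centroids] = kmeans(Xz, k, 'Replicates', 5);

% Silhouette values lie between -1 and 1; values near 1 suggest that a
% point sits comfortably inside its assigned cluster
s = silhouette(Xz, idx);
fprintf('Mean silhouette value: %.2f\n', mean(s));
```

k-means will always return `k` clusters, whether or not your data contain them, so try a few values of `k` and compare the results with your plots.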

Supervised techniques essentially boil down to classification
algorithms such as the simple nearest neighbour approach, linear
classifiers, or more elaborate techniques such as support vector
machines. Unlike un-supervised approaches, with supervised techniques
you can extract an objective measure of classification success, since
you know both the true group (where your data actually came from) and
the assigned group (where your algorithm *thinks* the data came
from). The results of a classifier can be summarised as a confusion
matrix (see confusionmat.m) or even simply as the proportion of
correct classifications. A *very important* aspect of data
classification is the use of cross-validation. Cross-validation is
employed to avoid over-fitting. Going
into this in detail is beyond the scope of this recipe. Briefly, a
simple cross-validation approach would be to randomly select half of
your data and train the classifier on this. Then use the model
produced by the classifier to partition the remaining data into
classes. The key point is that the classifier is "trained" and
"tested" on independent sub-sets of the data. If you do not do this,
you will get inflated estimates of classification success and your
model will not generalise well to new data sets.
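
The split-half approach described above might look something like the following sketch. It uses `cvpartition`, `classify` (a linear classifier by default), and `confusionmat` from the Statistics Toolbox. The data are simulated and the variable names are invented; with your own data you would supply the measurement matrix `X` and the group labels `lineID`.

```matlab
% Simulated widgets from three lines. All numbers are invented.
rng(5);
nPerLine = 100;
lineID   = repelem((1:3)', nPerLine);
X = [100 + 2*randn(3*nPerLine, 1), ...     % weight (g)
     20 + 0.5*randn(3*nPerLine, 1)];       % width (mm)
isLine2 = (lineID == 2);
X(isLine2, 1) = X(isLine2, 1) - 2;         % assume Line 2 is lighter
X(isLine2, 2) = X(isLine2, 2) - 0.5;       % ... and narrower

% Randomly hold out half of the widgets for testing
cv       = cvpartition(lineID, 'HoldOut', 0.5);
trainIdx = training(cv);
testIdx  = test(cv);

% Train on one half, then classify the half the classifier has never seen
predicted = classify(X(testIdx, :), X(trainIdx, :), lineID(trainIdx));

% Rows are the true line, columns are the assigned line
confMat     = confusionmat(lineID(testIdx), predicted);
propCorrect = sum(diag(confMat)) / sum(confMat(:));
disp(confMat)
fprintf('Proportion correct: %.2f\n', propCorrect);
```

A single random split gives a noisy estimate, so in practice you may want to repeat the split several times, or use k-fold cross-validation, and average the results.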

That was a whirlwind tour of clustering and classification. The key take-home point is that these approaches are much like any other statistical test: if you can't see an effect in your data by eye then chances are there isn't anything going on there. To move forward, I recommend you read through the MATLAB help on these topics and try some of the examples. You can also read the page on this site about k-means clustering.
