When to plot a bar chart and when a box plot?

Problem

When do I choose a bar chart and when a box plot?

Solution

We've already covered box plots, bar charts, and bar charts with overlaid data. For data that arise from a small number of groups, all of these plotting options are valid. Which to choose? Let's take one data set, plot it in each of these three styles, and see how much information we glean from each. First of all, a bar chart whose error bars are 95% confidence intervals for the mean:

[Figure: bar chart example]
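The numbers behind a bar chart of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: the data are made up (they are not the data set plotted here), the helper name `mean_ci95` is hypothetical, and the interval uses a normal approximation (about 1.96 standard errors) rather than a t quantile, so it will be slightly too narrow for small samples.

```python
from statistics import NormalDist, mean, stdev


def mean_ci95(values):
    """Return (mean, half_width) of an approximate 95% CI for the mean."""
    n = len(values)
    sem = stdev(values) / n ** 0.5       # standard error of the mean
    z = NormalDist().inv_cdf(0.975)      # about 1.96; normal approximation (assumption)
    return mean(values), z * sem


# Hypothetical data, not the data set shown in the figures.
tight = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9]
with_outlier = tight + [6.9]             # same values plus one extreme point

for name, values in [("tight", tight), ("with outlier", with_outlier)]:
    m, half = mean_ci95(values)
    print(f"{name}: mean = {m:.2f}, 95% CI = [{m - half:.2f}, {m + half:.2f}]")
```

The bar height is the mean and the error bar is the half-width. Both numbers shift when a single extreme value is added, and nothing in the bar chart itself reveals why.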

What story does that plot convey? It tells us we have three groups of data with means between about 2.0 and 2.5. The error bars look fairly large relative to the means, so our confidence in each mean is not great. The error bars are similar in size across the three groups, with B having the smallest and C the largest. It doesn't look like there are any significant differences between the three groups. OK, so now let's plot the same data as a box plot and compare the two.

[Figure: bar chart example alongside box plot example]
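What a box plot draws can be sketched in the same way. The example below assumes the common convention of whiskers reaching to the most extreme points within 1.5 times the interquartile range of the box, with anything beyond that marked as an outlier. Plotting packages differ in how they compute quartiles and whiskers, so treat the helper `box_stats` and its data as hypothetical rather than as a reproduction of the figure.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data: a tight cluster plus one extreme value.
group = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9, 6.9]
stats = box_stats(group)
print(stats)
```

Because the box and whiskers are built from quartiles, the single extreme value is reported separately instead of stretching the summary the way it stretches an error bar based on the standard error.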

Wow! Now we're seeing a very different story. The large error bar in C is obviously due only to the huge outlier with a value of just under 7. Otherwise the data in C are tightly packed, so without the outlier C would have the smallest error bar of the three groups. The whiskers of the box plots for A and B indicate that the ranges of the data in those two groups are similar. However, the blue boxes, which mark the region containing the middle half of the data, look really different between A and B. That suggests something odd is going on, but we can't see what it is. Let's explore the data further by using a plotting style that overlays the raw data (the code can be found here):

[Figure: the same data plotted with the raw data points overlaid]

In this plot, the red lines represent the mean, the pink region the 95% confidence interval for the mean, and the blue region 1 standard deviation. At last, all is clear! Group A is actually bimodal and not at all normally distributed. Consequently, the standard error of the mean is a useless statistic for it, and something is obviously "wrong" with those data. We also confirm that there is a nice, tight distribution of data points in C. Finally, with the raw data present, it's obvious what sort of sample size we're dealing with, which really helps to put the data into context.
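A plot in this style can be approximated as follows. This is a rough sketch, not the code linked above. The function names, colours and data are all assumptions, the confidence interval again uses a normal approximation, and the drawing step assumes matplotlib is installed (it is imported inside `plot_groups` so that the summary helpers need only the standard library).

```python
import random
from statistics import NormalDist, mean, stdev


def overlay_summary(values):
    """Mean, approximate 95% CI for the mean, and a band of 1 SD either side."""
    m, sd = mean(values), stdev(values)
    half = NormalDist().inv_cdf(0.975) * sd / len(values) ** 0.5
    return {"mean": m, "ci": (m - half, m + half), "sd": (m - sd, m + sd)}


def jittered_x(center, n, width=0.25, seed=0):
    """Spread n points horizontally around `center` to reduce overlap."""
    rng = random.Random(seed)
    return [center + rng.uniform(-width, width) for _ in range(n)]


def plot_groups(groups):
    """Draw the SD band, CI band, mean line and raw points for each group."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    fig, ax = plt.subplots()
    for i, (name, values) in enumerate(groups.items(), start=1):
        s = overlay_summary(values)
        span = [i - 0.35, i + 0.35]
        ax.fill_between(span, *s["sd"], color="lightblue")  # 1 standard deviation
        ax.fill_between(span, *s["ci"], color="pink")       # 95% CI for the mean
        ax.hlines(s["mean"], *span, colors="red")           # mean
        ax.scatter(jittered_x(i, len(values), seed=i), values,
                   color="black", s=12, zorder=3)           # raw data on top
    ax.set_xticks(range(1, len(groups) + 1))
    ax.set_xticklabels(list(groups))
    return fig


# Hypothetical data: "A" is bimodal, "C" is tight apart from one outlier.
groups = {
    "A": [1.0, 1.1, 0.9, 1.2, 3.0, 3.1, 2.9, 3.2],
    "C": [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 6.9],
}
summary_a = overlay_summary(groups["A"])
low, high = summary_a["ci"]
inside_ci = [v for v in groups["A"] if low <= v <= high]
print(f"A: mean = {summary_a['mean']:.2f}, points inside the CI band: {len(inside_ci)}")
```

Calling `plot_groups(groups)` and then `matplotlib.pyplot.show()` would draw the figure. The printed count is a quick numerical hint of the same problem the plot makes visible: in a bimodal group the mean can sit in a region where there are few or no data points.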

Discussion

So which plot style is right? The answer, it seems, is that it doesn't matter all that much whether you plot a box plot or a bar chart. What does matter is that you take the time to overlay the raw data. As you can see, a simple bar chart with only a mean and an error bar can hide all manner of terrible things. You won't know what your data are really doing until you plot all of them. Even with many groups and many data points, it's always possible to show all the data in a way that isn't overwhelming:

[Figure: plot with many groups]
