Potentially Useful

Writing boring posts is okay

Labeling bar charts (and other graphics) in ggplot2

in: Data Science tagged: R Dataviz

Bar charts are rightfully criticized for being potentially misleading, but they’re still useful for some data graphics. They’re a great way to display counts, for example, and non-technical audiences are comfortable interpreting them.

One way that a typical bar plot can be improved is by removing unnecessary axis labels and directly labeling the bars themselves. For example, if we wanted to see how many counties are in some of the states comprising the Midwestern US, we could use the midwest data set that’s packaged with ggplot2 and draw a simple bar plot.

library(ggplot2)

ggplot(midwest, aes(x = state)) + 
  geom_bar() +
  labs(x = "", y = "Number of counties") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 16),
        axis.title.y = element_text(size = 14))

Count of counties per state for IL, IN, MI, OH, and WI
Count of counties per state for IL, IN, MI, OH, and WI

This is fine (provided that you can accept that Ohio is part of the Midwest and Iowa isn’t). However, we can easily strip away some of the “chart junk” by removing unnecessary theme elements and labeling the bars using geom_text.

ggplot(midwest, aes(state)) +
  geom_bar() +
  geom_text(aes(y = ..count.., label = ..count..), stat = "count", 
            vjust = 0, nudge_y = 2, size = 8) +
  expand_limits(y = 110) +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 16)) +
  labs(title = "Number of counties in Midwestern states")

Counties per state (with counts displayed above each bar)
Counties per state (with counts displayed above each bar)

There’s still room for improvement—I’d consider rearranging the bars based on something more meaningful than alphabetic order, and possibly coloring them—but this simultaneously conveys more information than the previous plot (objectively) and is less “cluttered” (subjectively).

How this works

I’m surprised I’m still writing blog posts about ggplot2 after seven years, but I continue to learn new things about it even after regular use. Recently, I decided that it bothered me that there were “stats” (like stat_ecdf) and “geoms” (like geom_bar), and I really didn’t understand the difference between them—they seemed interchange but not identical. I finally looked into it and came across this response on StackOverflow, which contained the following quote from the ggplot2 book:

You only need to set one of stat and geom: every geom has a default stat, and every stat has a default geom.

It turns out that “stats” and “geoms” are largely interchangeable, because they’re both wrappers around layer(), with different ways of handling defaults. The code above works because I changed the stat parameter of geom_text (which, according to the documentation, defaults to "identity") to "count", which is the default stat of geom_bar. Looking at the documentation for stat_count, we see that it provides two computed variables, and we use use the computed count variable in our aesthetics for geom_text to provide both the vertical position and label.

Getting my head around this has helped me iterate through analyses more quickly. For instance, I frequently use stat_ecdf to get a handle on distributions, and now I can label these plots easily. Sticking with data about the Midwest, here’s the ECDF of population density for Midwestern counties, with a few values highlighted.

ggplot(midwest, aes(popdensity)) +
  stat_ecdf() + # This could also be `geom_step(stat = "ecdf")`
  geom_text(aes(label = sprintf("(%.0f: %.0f%%)", ..x.., floor(100 * ..y..))), 
            stat = "ecdf", hjust = 1, vjust = 0, check_overlap = TRUE, size = 5) +
  expand_limits(x = -15000) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  labs(x = "Population density", y = "CDF")

The curve of the CDF of population density in Midwestern counties
The curve of the CDF of population density in Midwestern counties

I wouldn’t call this a beautiful graphic, but visualizations like this are often useful, especially when getting a handle on a new data set. Here we see that over a third of Midwestern counties have population densities fully two orders of magnitude smaller than the most densely populated county, and we’ve only had to draw a single plot.

It’s hard to justify putting much effort into producing plots that aren’t meant to be shared, and a typical data analysis will involve creating dozens of plots that are only useful for the person analyzing the data. Without understanding stats and geoms, I would never have bothered labeling an ECDF plot; “quick and dirty” needs to be quick after all. On the other hand, labeling the values in a bar chart is something that one should do when preparing publication-quality graphics. GGplot2 is useful in both phases of a project.

In the words of Alan Kay, “simple things should be simple, complex things should be possible.” GGplot2 does both of these consistently. Occasionally, it even makes complex things simple.