Potentially Useful

Writing boring posts is okay

Recreating a Tufte Slope Graph

in: Programming Data Science tagged: R Dataviz Design

Recently on Twitter, data visualization guru Edward R. Tufte wrote that graphics produced by R are not “publication ready”. His proposed workflow is to use statistical software to create an initial version of a plot, and then make final improvements in Adobe Illustrator.

I disagree with this advice. First I’ll show the steps a data analyst might take to create a high-quality graphic entirely in R. Then, I’ll explain why I think this is a better approach.

Publication quality graphics in R

Page 158 of Tufte’s classic book, The Visual Display of Quantitative Information (2nd ed.), features a “slope graph” that shows the change in government receipts for several countries between 1970 and 1979. Below are the first few rows of these data in a tidy data frame, receiptData, along with a quick and dirty slopegraph.

country year receipts
Belgium 1970 35.2
Belgium 1979 43.2
Britain 1970 40.7
Britain 1979 39.0
Canada 1970 35.2
Canada 1979 35.8
ggplot(receiptData, aes(year, receipts, group = country)) +
    geom_line() +
    geom_text_repel(aes(label = country)) +
    labs(x = "Year", y = "Government receipts as percentage of GDP")

A quick and dirty slopegraph
A quick and dirty slopegraph

This plot is not attractive, but it is useful for getting a handle on the data. Whether iterating through an exploratory data analysis or preparing a graphic for publication, analysts will create many ugly graphics on the path to settling on a design and refining it.

For our first round of improvements, we can change the aspect ratio of the graphic and arrange the country labels so they don’t overlap with the data. We should also remove the “chart junk” in the background, such as the background grid, and label only the years of interest on the x-axis.

ggplot(receiptData, aes(year, receipts, group = country)) +
    geom_line() +
    geom_text_repel(aes(label = country),
                    data = filter(receiptData, year == 1970),
                    nudge_x = -0.5, hjust = 1,
                    direction = "y", size = 5) +
    geom_text_repel(aes(label = country),
                    data = filter(receiptData, year == 1979),
                    nudge_x = 0.5, hjust = 0, 
                    direction = "y", size = 5) +
    scale_x_continuous(breaks = c(1970, 1979), limits = c(1966, 1983)) +
    theme(panel.background = element_blank(),
          axis.title = element_text(size = 16),
          axis.text = element_text(size = 12)) +
    labs(x = "Year", y = "Government receipts as percentage of GDP")

A better but still imperfect graphic
A better but still imperfect graphic

The ggrepel package has done a great job preventing the labels from overlapping; I usually use geom_text_repel() instead of geom_text() during exploratory data analysis. Here, however, the segments connecting the labels to data points create confusing clutter. While these can be removed within the function, our final figure will be easier to understand if we nudge labels manually so that they’re as close to their data as possible.

We can also drop the axes; the year labels will go right above the data, and we’ll show each country’s value next to its label. Since we’re losing the axis titles, we’ll also embed a caption in this version of the graphic.

Finally, we’ll change the typeface. Since we’re trying to make something that would please Tufte, we’ll use a serif font. If you haven’t done so before, you may need to run loadfonts() from the extrafont package to tell R how to find system fonts. We’ll also further boost our “data-ink ratio” by making the lines thinner.

labelAdjustments <- tribble(
                                ~country, ~year, ~nudge_y,
                           "Netherlands", 1970L,      0.3,
                                "Norway", 1970L,     -0.2,
                               "Belgium", 1970L,      0.9,
                                "Canada", 1970L,     -0.1,
                               "Finland", 1970L,     -0.8,
                                 "Italy", 1970L,      0.4,
                         "United States", 1970L,     -0.5,
                                "Greece", 1970L,      0.4,
                           "Switzerland", 1970L,     -0.3,
                                "France", 1979L,      0.8,
                               "Germany", 1979L,     -0.7,
                               "Britain", 1979L,      0.1,
                               "Finland", 1979L,     -0.1,
                                "Canada", 1979L,      0.4,
                                 "Italy", 1979L,     -0.5,
                           "Switzerland", 1979L,      0.2,
                         "United States", 1979L,     -0.1,
                                 "Spain", 1979L,      0.3,
                                 "Japan", 1979L,     -0.2

receiptData <- left_join(receiptData, labelAdjustments,
                         by = c("country", "year")) %>% 
    mutate(receipts_nudged = ifelse(is.na(nudge_y), 0, nudge_y) + receipts)

update_geom_defaults("text", list(family = "Georgia", size = 4.0))
ggplot(receiptData, aes(year, receipts, group = country)) +
    # Slope lines
    geom_line(size = 0.1) +
    # Country names (manually nudged)
    geom_text(aes(year, receipts_nudged, label = country),
              data = filter(receiptData, year == 1970),
              hjust = 1, nudge_x = -3) +
    geom_text(aes(year, receipts_nudged, label = country),
              data = filter(receiptData, year == 1979),
              hjust = 0, nudge_x = 3) +
    # Data values
    geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
              data = filter(receiptData, year == 1970),
              hjust = 1, nudge_x = -0.5) +
    geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
              data = filter(receiptData, year == 1979),
              hjust = 0, nudge_x = 0.5) +
    # Column labels
    annotate("text", x = c(1970, 1979) + c(-0.5, 0.5), 
             y = max(receiptData$receipts) + 3,
             label = c("1970", "1979"), 
             hjust = c(1, 0)) +
    # Plot annotation
    annotate("text", x = 1945, y = 58, hjust = 0, vjust = 1,
             label = paste("Current Receipts of Government as a", 
                           "Percentage of Gross Domestic\nProduct, 1970 and 1979",
                            sep = "\n"),
             family = "Georgia", size = 4.0) +
    coord_cartesian(xlim = c(1945, 1990)) +
    theme(panel.background = element_blank(),
          axis.title = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank())

A publication-ready graphic
A publication-ready graphic

Why it’s worth it

It took several iterations to get the aesthetics of this plot just right, while an adept user of a graphics editor would be able to recreate it in minutes. However, this initial time savings obscures several advantages to completing this process in code as opposed to switching to a tool like Illustrator.

R is free software

All of the programs I used to create this graphic (and share it with you) are free software, unlike Adobe’s terrific but expensive Illustrator. Inkscape is an excellent open source alternative, but it is not as powerful as its commercial competitor. On the other hand, R is arguably the most advanced and well-supported statistical analysis tool. Even if the politics of the open source movement aren’t important, free is a better price-point.

Code can be reproduced

Anybody who wants to recreate the final figure using a different set of countries, a new pair of years, or a favored economic indicator can run the code above on their own data. With minimal tweaking, they will have a graphic with aesthetics identical to those above. Reproducing this graphic using a multi-tool workflow is more complicated; a novice would first need to learn how to create a “rough draft” of it in software, and then learn how to use a graphics editor to improve its appearance.

Scripted graphics are easier to update

Imagine that an analyst discovered a mistake in their analysis that needed to be corrected. Or imagine the less stressful situation where a colleague suggested replacing some of the countries in the figure. In such cases, the work of creating the graphic will need to be redone: the analyst will need to add data or correct errors, generate a new plot in software, and re-edit the plot in their graphics editor. Using a scripted, code-based workflow, the analyst would only need to do the first step, and possibly update some manual tweaks. The initial time savings afforded by a graphics editor disappears after the first time this happens.

Automation prevents errors

Not only does scripted workflow make it easier to deal with errors, it guards against them. When analysts adjust graphic elements in a GUI, there’s a risk of mistakenly moving, deleting, or mislabeling data points. Such mistakes are difficult to detect, but code-based workflows avoid these risks; if the data source and analysis are error-free, then the graphic will also be.

These considerations are among the reasons why scripted analyses are considered best practice in statistical analysis. Most of Tufte’s advice on creating graphics is excellent, but I recommend ignoring his suggestion to make the final edits in a program like Illustrator. Learning how to make polished graphics in software will ultimately save analysts time and help them avoid mistakes.