Aug 3, 2018

Making Graphs Look Easy with ggplot2 and Tidy Evaluation

At Crunch, one of the ways we try to make data exploration simple is by providing sensible default views that take into account the properties of your data and metadata. We're in the process of releasing plotting support in our crplyr R package that defines methods for ggplot2's autoplot() generic. autoplot() was the ideal place for us to encapsulate the logic of how to "just do the right thing": you do the analysis you want, and it makes smart choices about how to display the result with no additional input, all of which you can still control or override with additional ggplot2 layers.

This lets us plot Crunch variables and high-dimensional survey cross-tabs easily, while making sure that the plot always fits the data.

library(ggplot2)
library(crplyr)

# Authenticate and load a dataset from Crunch
crunch::login()
ds <- loadDataset("Not-so-simple dataset with all types")

# A grouped summary piped directly into autoplot()
ds %>%
    group_by(pasta, food_groups) %>%
    summarize(count = n()) %>%
    autoplot()

# Ask for a tile plot instead of the default
ds %>%
    group_by(abolitionists, food_groups) %>%
    summarize(count = n()) %>%
    autoplot("tile")

# Plot a single Crunch variable as a bar chart
autoplot(ds$ec, "bar")

Making it look easy, though, can be hard work. Our autoplot methods inspect the input objects to understand their dimensionality and data types and choose an appropriate visualization. Figuring out and passing the right arguments to the right places can be messy, so in order to make these functions work, we took advantage of tidyeval, a framework that systematizes non-standard evaluation in R, now also supported in the new 3.0.0 release of ggplot2.

To illustrate this pattern, this blog post goes through a simplified example using the “diamonds” example dataset and pure dplyr methods. We’ll create a plotting function that adjusts to the number of grouping variables in a tibble. This lets you pipe data from a dplyr pipeline into a single function and get a meaningful, appropriate plot.

The actual code for our Crunch autoplot methods is here.

Non-Standard Evaluation: Great Power, Great Pain

One of R's great features is that it lets you access and manipulate the environment in which a function is called. This non-standard evaluation (NSE) gives package authors a flexible and powerful way to build programming interfaces.

At Crunch, we use NSE a lot. One example is in reporting better error messages. If you send the wrong data to the API, you might get back a response like 400: Payload is malformed. That’s not very helpful even for users who know our API well. We use NSE to inspect the user’s calling environment, figure out which variables or data structures are causing the problem, and suggest a fix: instead of a generic validation message, we error with the more helpful ds$some_name must be a Crunch Variable, pointing back to the input that the user typed.
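
To make that concrete, here's a simplified sketch of the pattern (not our actual implementation; check_variable() is a hypothetical helper, and the real class checks are more involved):

check_variable <- function(x) {
    if (!inherits(x, "CrunchVariable")) {
        # substitute() captures the expression the caller typed, so the
        # error can point back at their exact input
        stop(deparse(substitute(x)), " must be a Crunch Variable", call. = FALSE)
    }
    invisible(x)
}

check_variable(ds$some_name)

## Error: ds$some_name must be a Crunch Variable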

Despite working with NSE all the time, we're pretty sure that we've never once gotten it right on the first try. The main reason is that whenever you capture an expression to evaluate later, you also need to keep track of which environment you should evaluate it in. This makes it really difficult to pass unevaluated expressions between functions and have the evaluation occur without error. For instance, take this code, which replaces a missing function argument with a logical value:

f1 <- function(i, j, ...) {
    # Capture the arguments as a list of unevaluated expressions
    args <- eval(substitute(alist(i, j, ...)))
    args <- replace_missing(args)
    return(as.character(args))
}

replace_missing <- function(args) {
    out <- lapply(args, function(x) {
        if (is.symbol(x)) {
            # Try to evaluate the symbol; if the argument turns out to
            # be missing, substitute TRUE for it
            x <- tryCatch(eval(x), error = function(e) {
                msg <- conditionMessage(e)
                if (msg == "argument is missing, with no default") {
                    return(TRUE)
                } else {
                    stop(e)
                }
            })
        }
        return(eval(x))
    })
    return(out)
}

This code captures an expression at the top level and passes it down to a second function, which returns TRUE if it can't find the argument and evaluates the expression if it can. We use code very similar to this for subsetting CrunchCube objects.

There's a tricky mistake here, though: we aren't specifying in which environment we want that evaluation to take place. So if we happen to pass a variable whose name is also used somewhere in the call stack, we'll get the wrong result:

x <- 1
y <- 1

f1(1, , 3)

## [1] "1"    "TRUE" "3"

f1(y, , 3)

## [1] "1"    "TRUE" "3"

f1(x, , 3)

## [1] "x"    "TRUE" "3"

What's happening is that when the final eval(x) runs, x is resolved through lexical scoping, so it finds the x bound inside the lapply anonymous function (the one holding the captured expression) rather than the one in the global environment. To fix this, we need to specify the environment in which the evaluation should take place. That's an easy thing to forget, and it causes its own problems if you move functions around or later call them in a different order.
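
For instance, one way to repair f1() is to capture the caller's environment up front and pass it along explicitly (a sketch; f2() and replace_missing2() are just illustrative names):

f2 <- function(i, j, ...) {
    args <- eval(substitute(alist(i, j, ...)))
    # parent.frame() is the environment f2 was called from, which is
    # where the user's symbols are actually defined
    args <- replace_missing2(args, env = parent.frame())
    return(as.character(args))
}

replace_missing2 <- function(args, env) {
    lapply(args, function(x) {
        # Evaluate in the caller's environment, not wherever we happen
        # to be on the call stack
        tryCatch(eval(x, envir = env), error = function(e) {
            if (conditionMessage(e) == "argument is missing, with no default") {
                return(TRUE)
            } else {
                stop(e)
            }
        })
    })
}

f2(x, , 3)

## [1] "1"    "TRUE" "3"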

Enter the Quosure

Tidyeval offers a solution to this problem: it bundles the expression and its environment in a single object called a “quosure”. What this means is that as a developer, you don’t have to worry about matching expressions to environments and can pass unevaluated expressions between functions with confidence. Because the expression and the environment are bundled together, when you end up evaluating it you won’t ever be surprised by the result.
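
Here's a minimal illustration using rlang directly (the function names are just for demonstration):

library(rlang)

capture <- function(arg) {
    # enquo() bundles the caller's expression together with the
    # caller's environment into a quosure
    enquo(arg)
}

evaluate_elsewhere <- function(q) {
    x <- -999  # would shadow the global x under plain eval()
    eval_tidy(q)
}

x <- 10
q <- capture(x * 2)
evaluate_elsewhere(q)

## [1] 20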

Autoplot

Back to our autoplot() project, our initial goal is to have a single function that will produce different plots based on the number of grouping variables in the tibble it receives. First, we create a general plotting function that figures out which plotting sub-function to use:

library(ggplot2)
library(dplyr)

autoplot <- function(df) {
    # Re-add the grouping variable that summarize() stripped: the
    # column just after the remaining groups is the one that was
    # grouped last
    df <- df %>%
        group_by(!!!groups(df), !!sym(names(df)[length(groups(df)) + 1]))

    # Pick a plotting function based on the number of grouping variables
    if (length(groups(df)) == 1) {
        plot_fun <- plot_1d
    } else {
        plot_fun <- plot_2d
    }
    # Capture all column names as a list of symbols to pass down
    vars <- syms(names(df))
    plot_fun(df, vars)
}

The function inspects the data frame and then selects a plotting function based on the number of groups in it. It then captures the names of the dataset as a list of symbols and passes them down to the plotting function. The next step is to write the two plotting functions that actually do the work:

plot_1d <- function(df, vars) {
    # Separate the grouping symbols from the measure (the last column)
    groups <- vars[vars %in% groups(df)]
    measure <- vars[length(vars)][[1]]
    df %>%
        select(!!groups[[1]], !!measure) %>%
        arrange(desc(!!measure)) %>%
        ggplot(aes(x = !!measure, y = !!groups[[1]])) +
        geom_point() +
        theme_minimal()
}

diamonds %>%
    group_by(cut) %>%
    tally() %>%
    autoplot()

The convenient thing about using tidyeval in this case is that we can confidently pass the unevaluated names into both the dplyr and ggplot2 code without worrying that the evaluation will fail. This means we can arrange the dataset based on the measure name, and then plot that measure even though we don't know ahead of time what the measure will be called. We can do the same thing with the 2d plot:

plot_2d <- function(df, vars) {
    groups <- vars[vars %in% groups(df)]
    measure <- vars[length(vars)][[1]]
    df %>%
        select(!!!groups, !!measure) %>%
        arrange(desc(!!measure)) %>%
        ggplot(aes(x = !!measure, y = !!groups[[1]], color = !!groups[[2]])) +
        geom_point() +
        theme_minimal()
}

diamonds %>%
    group_by(cut, clarity) %>%
    tally() %>%
    autoplot()

This is basically the same code as the 1D plot except that we used the splice operator (!!!) in the select call and added another grouping variable on the color dimension.
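
As a quick aside, here's the difference between the two operators on their own, using the column names from the diamonds example:

g <- syms(c("cut", "clarity"))

diamonds %>% select(!!g[[1]])   # unquote one symbol: same as select(cut)
diamonds %>% select(!!!g)       # splice the list: same as select(cut, clarity)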

What happens when you have more than two grouping variables? ggplot2 3.0.0 also lets us use tidyeval to add facets dynamically.

autoplot <- function(df) {
    # Re-add the grouping variable that summarize() stripped
    df <- df %>%
        group_by(!!!groups(df), !!sym(names(df)[length(groups(df)) + 1]))

    groups <- groups(df)
    if (length(groups) == 1) {
        plot_fun <- plot_1d
    } else {
        plot_fun <- plot_2d
    }
    vars <- syms(names(df))
    out <- plot_fun(df, vars)

    # Any grouping variables beyond the first two become facets. Note
    # that vars() here is ggplot2::vars(): R skips our local `vars`
    # list when resolving a name that is being called as a function.
    if (length(groups) > 2) {
        groups <- syms(groups)
        out <- out +
            facet_wrap(vars(!!!groups[3:length(groups)]))
    }
    return(out)
}

diamonds %>%
    group_by(cut, color, clarity) %>%
    summarize(number_of_diamonds = n()) %>%
    autoplot()

Conclusion

Tidyeval solves the main challenge of working with R's non-standard evaluation by bundling expressions and environments into quosures. The 3.0.0 release of ggplot2 brings that power to visualization. Though we could have written autoplot methods for Crunch objects before tidyeval support, it would have been much more complicated and error-prone. Passing quosures between functions with tidyeval and ggplot2 3.0.0 gives us a powerful new way to build user-friendly functions. And that lets us do what we strive to do most: get out of the way and let our users explore their data quickly, in a way that matches their intuitions about how R and the tidy conventions work.