At Crunch, one of the ways we try to make data exploration simple is by providing sensible default views that take into account the properties of your data and metadata. We’re in the process of releasing some plotting methods in our crplyr R package that define methods for ggplot2’s autoplot() function. autoplot() was the ideal approach for us to encapsulate the logic of how to “just do the right thing.” You can do the analysis you want, and it will make smart choices about how to display it with no additional input, all of which you can control or override with additional ggplot2 layers if you want. This lets us plot Crunch variables and high-dimensional survey cross-tabs easily, while making sure that the plot always fits the data.
library(ggplot2)
library(crplyr)

crunch::login()
ds <- loadDataset("Not-so-simple dataset with all types")

# Cross-tab two variables and let autoplot() choose the display
ds %>%
  group_by(pasta, food_groups) %>%
  summarize(count = n()) %>%
  autoplot()

# The same pattern, explicitly requesting a tile plot
ds %>%
  group_by(abolitionists, food_groups) %>%
  summarize(count = n()) %>%
  autoplot("tile")

# A single Crunch variable, drawn as a bar plot
autoplot(ds$ec, "bar")
Making it look easy, though, can be hard work. Our autoplot methods inspect the input objects to understand their dimensionality and data types and choose an appropriate visualization. Figuring out and passing the right arguments to the right places can be messy, so in order to make these functions work, we took advantage of tidyeval, a framework that systematizes non-standard evaluation in R, now also supported in the new 3.0.0 release of ggplot2.

To illustrate this pattern, this blog post goes through a simplified example using the “diamonds” example dataset and pure dplyr methods. We’ll create a plotting function that adjusts to the number of grouping variables in a tibble. This lets you pipe data from a dplyr pipeline into a single function and get a meaningful, appropriate plot. The actual code for our Crunch autoplot methods is here.
A great feature of the R language is that it lets you access and manipulate the environment in which a function is called. This non-standard evaluation (NSE) gives package authors a flexible and powerful way to build programming interfaces.
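For instance, base R’s subset() uses NSE to evaluate its condition inside the data frame rather than in the caller’s environment:

df <- data.frame(a = 1:3, b = c("x", "y", "z"))
subset(df, a > 1)  # `a` is looked up inside df, not in the calling environment
##   a b
## 2 2 y
## 3 3 z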
At Crunch, we use NSE a lot. One example is in reporting better error messages. If you send the wrong data to the API, you might get back a response like 400: Payload is malformed. That’s not very helpful, even for users who know our API well. We use NSE to inspect the user’s calling environment, figure out which variables or data structures are causing the problem, and suggest a fix: instead of a generic validation message, we error with the more helpful ds$some_name must be a Crunch Variable, pointing back to the input that the user typed.
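Here’s a minimal sketch of that idea. It is not our actual implementation, and it assumes a check against the CrunchVariable class from the crunch package:

must_be_variable <- function(x) {
  if (!is(x, "CrunchVariable")) {
    # substitute() captures the unevaluated argument; deparse() recovers
    # the text the user typed, e.g. "ds$some_name"
    stop(deparse(substitute(x)), " must be a Crunch Variable", call. = FALSE)
  }
  invisible(x)
}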
Despite working with NSE all the time, we’re pretty sure that we’ve never once gotten it right on the first try. The main reason is that whenever you capture an expression to evaluate later, you also need to keep track of which environment that expression should be evaluated in. This makes it really difficult to pass unevaluated expressions between functions and have the evaluation occur without error. For instance, take this code, which replaces a missing function argument with a logical value:
f1 <- function(i, j, ...) {
  args <- eval(substitute(alist(i, j, ...)))
  args <- replace_missing(args)
  return(as.character(args))
}

replace_missing <- function(args) {
  out <- lapply(args, function(x) {
    if (is.symbol(x)) {
      x <- tryCatch(eval(x), error = function(c) {
        msg <- conditionMessage(c)
        if (msg == "argument is missing, with no default") {
          return(TRUE)
        } else {
          stop(c)
        }
      })
    }
    return(eval(x))
  })
  return(out)
}
This code captures an expression at the top level and passes it down to a second function, which returns TRUE if it can’t find the argument and evaluates the expression if it can. We use code very similar to this for subsetting CrunchCube objects.
There’s a tricky mistake here, though: we aren’t specifying in which environment we want that evaluation to take place. So if we happen to pass a variable whose name is also used somewhere in the call stack, we’ll get the wrong result:
x <- 1
y <- 1
f1(1, , 3)
## [1] "1" "TRUE" "3"
f1(y, , 3)
## [1] "1" "TRUE" "3"
f1(x, , 3)
## [1] "x" "TRUE" "3"
What’s happening here is that when the final eval(x) runs, x is resolved through lexical scoping, so it ends up using the x that’s bound inside the lapply environment rather than the one in the global environment. To fix this, we need to specify the environment where that evaluation should take place. This is an easy thing to forget, and it presents its own problems if you move functions around or later call them in a different order.
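To see why the environment matters, here’s a minimal standalone illustration:

x <- 1
f <- function() {
  x <- 99            # a local x that shadows the global one
  expr <- quote(x)
  # eval()'s default environment is where it is called from, so it finds
  # the local x; passing envir explicitly finds the global one instead
  c(eval(expr), eval(expr, envir = globalenv()))
}
f()
## [1] 99  1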
Tidyeval offers a solution to this problem: it bundles the expression and its environment in a single object called a “quosure”. What this means is that as a developer, you don’t have to worry about matching expressions to environments and can pass unevaluated expressions between functions with confidence. Because the expression and the environment are bundled together, when you end up evaluating it you won’t ever be surprised by the result.
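Here’s a minimal illustration of that guarantee, using rlang (the package that implements tidyeval):

library(rlang)

capture <- function(expr) {
  enquo(expr)  # bundles the expression with the environment it came from
}

evaluate_elsewhere <- function(q) {
  x <- "a decoy binding"  # would break a plain eval() of the expression
  eval_tidy(q)            # evaluates in the quosure's own environment
}

x <- 1
evaluate_elsewhere(capture(x + 1))
## [1] 2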
Back to our autoplot() project: our initial goal is to have a single function that produces different plots based on the number of grouping variables in the tibble it receives. First, we create a general plotting function that figures out which plotting sub-function to use:
library(ggplot2)
library(dplyr)

autoplot <- function(df) {
  # Add grouping variable which was stripped by summarize
  df <- df %>%
    group_by(!!!groups(df), !!sym(names(df)[length(groups(df)) + 1]))
  if (length(groups(df)) == 1) {
    plot_fun <- plot_1d
  } else {
    plot_fun <- plot_2d
  }
  vars <- syms(names(df))
  plot_fun(df, vars)
}
The function inspects the data frame and selects a plotting function based on the number of groups in it. It then captures the column names of the data frame as a list of symbols and passes them down to the plotting function. The next step is to write the two plotting functions that actually do the work:
plot_1d <- function(df, vars) {
  groups <- vars[vars %in% groups(df)]
  measure <- vars[length(vars)][[1]]
  df %>%
    select(!!groups[[1]], !!measure) %>%
    arrange(desc(!!measure)) %>%
    ggplot(aes(x = !!measure, y = !!groups[[1]])) +
    geom_point() +
    theme_minimal()
}

diamonds %>%
  group_by(cut) %>%
  tally() %>%
  autoplot()
The convenient thing about using tidyeval in this case is that we can confidently pass the unevaluated names into both the dplyr and ggplot2 code without worrying that the evaluation will fail. This means we can arrange the dataset based on the measure name, and then plot that measure, even though we don’t know ahead of time what the measure will be called. We can do the same thing with the 2D plot:
plot_2d <- function(df, vars) {
  groups <- vars[vars %in% groups(df)]
  measure <- vars[length(vars)][[1]]
  df %>%
    select(!!!groups, !!measure) %>%
    arrange(desc(!!measure)) %>%
    ggplot(aes(x = !!measure, y = !!groups[[1]], color = !!groups[[2]])) +
    geom_point() +
    theme_minimal()
}

diamonds %>%
  group_by(cut, clarity) %>%
  tally() %>%
  autoplot()
This is basically the same code as the 1D plot, except that we used the splice operator (!!!) in the select() call and added another grouping variable on the color dimension.
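As an aside, the splice operator takes a list of symbols and expands it into separate arguments, so these two calls are equivalent (syms() is re-exported by dplyr from rlang):

group_vars <- syms(c("cut", "clarity"))
diamonds %>% select(!!!group_vars)
# ...is the same as:
diamonds %>% select(cut, clarity)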
What happens when you have more than three dimensions, that is, more than two grouping variables plus the measure? The ggplot2 package allows us to use tidyeval to dynamically add facets.
autoplot <- function(df) {
  # Add grouping variable which was stripped by summarize
  df <- df %>%
    group_by(!!!groups(df), !!sym(names(df)[length(groups(df)) + 1]))
  groups <- groups(df)
  if (length(groups) == 1) {
    plot_fun <- plot_1d
  } else {
    plot_fun <- plot_2d
  }
  vars <- syms(names(df))
  out <- plot_fun(df, vars)
  if (length(groups) > 2) {
    # Facet on every grouping variable beyond the first two
    groups <- syms(groups)
    out <- out +
      facet_wrap(vars(!!!groups[3:length(groups)]))
  }
  return(out)
}
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarize(number_of_diamonds = n()) %>%
  autoplot()
Tidyeval solves the main challenge of working with R’s non-standard evaluation by bundling expressions and environments into quosures. The new release of ggplot2 makes it possible to use tidyeval for building powerful visualizations quickly. Though we could have made autoplot methods for Crunch objects before tidyeval support, it would have been much more complicated and error-prone. Using tidyeval and ggplot2 3.0.0 to pass quosures between functions unlocks powerful new mechanisms for building user-friendly functions. And that lets us do what we strive to do most: get out of the way and let our users explore their data quickly, in a way that matches their intuitions for how R and tidy conventions work.