Variable Folders and Organization

In the web application, variables in a dataset are displayed in a list on the left side of the screen.

Typically, when you import a dataset, the variable list is flat, but it can be organized into an accordion-like hierarchy. The variable organizer in the Crunch GUI allows you to organize your variables visually, but you can also manage this metadata from R using the crunch package.

File-system like

This variable hierarchy can be thought of like a file system on your computer, with files (variables) organized into directories (folders). As such, the main functions you use to manage it are reminiscent of a file system.

cd() changes directories, i.e. selects a folder
mkdir() makes a directory, i.e. creates a folder
mv() moves variables and folders to a different folder
rmdir() removes a directory, i.e. deletes a folder

Like a file system, you can express the “path” to a folder as a string separated by a “/” delimiter, like this:

mkdir(ds, "Brands/Cars and Trucks/Domestic")

If your folder names should legitimately have a “/” in them, you can set a different character to be the path separator. See ?mkdir or any of the other functions’ help files for details.

Paths can be expressed relative to the current object—a folder, or in this case, the dataset, which translates to its top-level "/" root folder in path specification—and the file system’s special path segments ".." (go up a level) and "." (this level) are also supported. We’ll use those in examples below.
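To make the relative-path rules concrete, here is a minimal base R sketch of how ".." and "." could be resolved against a current location. The resolve_path() helper is hypothetical and written only for illustration: it is not part of crunch and does not claim to reflect how the package resolves paths internally. It also assumes that a leading "/" means "start from the root folder".

```r
# Hypothetical helper for illustration only; not a crunch function.
# `current` is the current location as a character vector of folder names
# (character(0) stands for the root folder); `path` is a single path string.
resolve_path <- function(current, path, delim = "/") {
    segments <- strsplit(path, delim, fixed = TRUE)[[1]]
    # Assumption: a leading delimiter makes the path absolute (start at root)
    if (startsWith(path, delim)) {
        current <- character(0)
    }
    for (seg in segments) {
        if (seg == "" || seg == ".") {
            next  # "." (and empty segments) leave the location unchanged
        } else if (seg == "..") {
            current <- utils::head(current, -1)  # go up one level
        } else {
            current <- c(current, seg)  # descend into a subfolder
        }
    }
    current
}

resolve_path(c("This week", "Snowden"), "..")  # up one level
resolve_path("This week", "./Snowden")          # relative to this level
resolve_path("This week", "/Demos")             # absolute, from the root
```

The folder names here are placeholders; the point is only that each path is interpreted relative to the object you start from.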

You can also specify paths as a vector of path segments, like

mkdir(ds, c("Brands", "Cars and Trucks", "Domestic"))

which is equivalent to the previous example. One or the other way may be more convenient, depending on what you’re trying to accomplish.
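The equivalence of the two forms can be sketched in base R. The as_segments() helper below is hypothetical, written just for this illustration, and is not a crunch function; it simply shows that a delimited string and a vector of segments describe the same path, and why a different delimiter lets a folder name contain "/".

```r
# Hypothetical helper for illustration only; not a crunch function.
# Accepts either a single delimited string or a vector of segments and
# always returns a vector of segments.
as_segments <- function(path, delim = "/") {
    if (length(path) > 1) {
        return(path)  # already given as segments
    }
    strsplit(path, delim, fixed = TRUE)[[1]]
}

from_string <- as_segments("Brands/Cars and Trucks/Domestic")
from_vector <- as_segments(c("Brands", "Cars and Trucks", "Domestic"))
identical(from_string, from_vector)

# With a different delimiter, "/" can appear inside a folder name
as_segments("Brands|Cars/Trucks|Domestic", delim = "|")
```

The vector form tends to be handy when segments are built programmatically, since no delimiter needs to be chosen or escaped.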

These four functions all take a dataset or a folder as the first argument, and they return the same object passed to them, except for cd(), which returns the selected folder. As such, they are designed to work with magrittr-style piping (%>%) for convenience in chaining together steps, though they don’t require it.

Viewing the folders

To get started, let’s pick up the dataset we used in the array variables vignette and view its starting layout. We can do that by selecting the root folder (“/”) and printing it:

library(magrittr)
ds %>%
    cd("/") %>%
    print()

(The print() isn’t strictly necessary here as cd will return the folder and thus it will print by default, but we’ll use different print arguments later, so it’s included here both for explicitness and illustration.)

It’s flat—there are no folders here, only variables. If you’re importing data from a data.frame or a file, like an SPSS file, this is where you’ll begin.

Creating folders

Let’s make some folders and move some variables into them. To start, I know that the demographic variables are at the back of the dataset, so let’s make a “Demos” folder and move variables 21 to 37 into it:

ds %>%
    mkdir("Demos") %>%
    mv(21:37, "Demos")

Now when I print the top-level directory again, I see a “Demos” folder and don’t see those demographic variables:

ds %>%
    cd("/")

mv() can reference variables or folders within a level in several ways. Numeric indices, as we just used, probably won’t be the most common way to do it: names work just as well and are more transparent. Let’s move the first variable, perc_skipped, into “Demos” as well:

ds %>%
    mv("perc_skipped", "Demos") %>%
    cd("Demos") ## To print the folder contents

A side note: although the last step of that chain was cd(), we haven’t changed state in our R session. There is no “working folder” set globally. cd() is a function that returns a folder; if we had assigned the return from the function (pipeline) to some object, we could then pass that in to another function to “start” in that folder.

Another way we can identify variables is by using the dplyr-like functions starts_with, ends_with, matches, and contains. Let’s use matches to move all of the questions about Edward Snowden or Bradley (Chelsea) Manning to a folder for the topical questions in this week’s survey:

ds %>%
    mkdir("This week") %>%
    mv(matches("manning|snowden", ignore.case = TRUE), "This week")

We can also select all variables in a folder using the variables function (or all folders within a folder using folders). Let’s move all remaining variables from the top level folder to a folder called “Tracking questions”. To do this, we do need to explicitly change to the top level folder.

ds %>%
    cd("/") %>%
    mkdir("Tracking questions") %>%
    mv(., variables(.), "Tracking questions")

(Curious about the “dot” notation? See the magrittr docs.)
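As a quick illustration of the dot, here is a small sketch using only magrittr and base R; the vector x is made up for the example. The dot stands for the value on the left-hand side of the pipe, so it can be passed both as an argument and inside a nested call.

```r
library(magrittr)

x <- c(5, 2, 9)

# The dot is the left-hand side, so this is setdiff(x, max(x))
without_max <- x %>% setdiff(., max(.))
without_max
```

The mv(., variables(.), "Tracking questions") call works the same way: the dot is the folder passed along the pipeline, used once as the first argument and once inside variables().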

The reason we change to the top level folder here is that there is a subtle difference between passing ds to mv() and passing cd(ds, "/"). The object passed into mv(), whether a dataset or a folder, determines the scope from which the objects to move are selected. If you pass the dataset in, you can select any variables in the dataset, regardless of what folder they’re in. If you pass in a folder, you’re selecting just from that folder’s contents. It can be convenient to find all variables across the whole dataset that match some criteria and move them, but sometimes we don’t want that. In this case, we wanted only the variables sitting in the top level folder, not those nested in other folders, so we wanted variables(cd(ds, "/")) and not variables(ds).

Now, our variable tree has some structure. Let’s use print(folder, depth = 1) to see these folders and their contents one level deep:

ds %>%
    cd("/") %>%
    print(depth = 1)

Nested folders

We can create folders within folders as well. In the “This week” folder, we have a set of questions about Edward Snowden. Let’s nest them inside their own subfolder inside “This week”:

ds %>%
    cd("This week") %>%
    mkdir("Snowden") %>%
    mv(matches("snowden", ignore.case = TRUE), "Snowden") %>%
    cd("..") %>%
    print(depth = 2)

Note how we used ".." to change folders up a level, as you can in a file system. We did that just so we can print the folder structure at the top level (and to illustrate that you can specify relative paths).

You could also do this using the full path segments. mkdir will recursively make all path segments it needs in order to ensure that the target folder exists.

ds %>%
    mkdir("This week/Snowden") %>%
    mv(matches("snowden", ignore.case = TRUE), "This week/Snowden") %>%
    cd("/") %>%
    print(depth = 2)

Renaming folders and folder contents

Folders themselves have names, which we can set with setName():

ds %>%
    cd("Demos") %>%
    setName("Demographics")

We can also set the names of the objects contained in a folder with setNames():

ds %>%
    cd("Demographics") %>%  ## The folder was renamed in the previous step
    setNames(c("Birth Year", "Gender", "Political Ideo. (3 category)",
        "Political Ideo. (7 category)", "Political Ideo. (7 category; other)",
        "Race", "Education", "Marital Status", "Phone", "Family Income", "Region",
        "State", "Weight", "Voter Registration (new)", "Is a voter?",
        "Voter Registration (old)", "Voter Registration"))

Ordering within folders

Unlike files in a file system, variables within folders are ordered.

Let’s move “Demographics” to the end. One way to do that is with the setOrder function. This lets you provide a specific order, but it requires you to specify all of the folder’s contents. Let’s use that function to put “Tracking questions” first and “Demographics” last:

ds %>%
    cd("/") %>%
    setOrder(c("Tracking questions", "This week", "Demographics"))

Deleting folders

The cleanest way to delete a folder is with rmdir():

ds %>%
    rmdir("This week/Snowden")

This deletes the folder and all variables contained within it.

Next: transforming and deriving