Fork and Merge a Dataset

One of the main benefits of Crunch is that it lets analysts and clients work with the same datasets. Instead of emailing datasets to clients, you can update the live dataset and ensure that they will see the most up-to-date information. The potential problem with this setup is that it can become difficult to make provisional changes to the dataset without publishing them to the client. Sometimes an analyst wants to investigate or experiment with a dataset without the risk of sending incorrect or confusing information to the end user. This is why we implemented a fork-edit-merge workflow for Crunch datasets.

“Fork” originates in computer version control systems and just means to take a copy of something with the intention of making some changes to the copy, and then incorporating those changes back into the original. A helpful mnemonic is to think of a path which forks off from the main road, but then rejoins it later on. To see how this works, let’s first upload a new dataset to Crunch.

library(crunch)

my_project <- newProject("examples/rcrunch vignette data")
ds <- newDataset(SO_survey, "stackoverflow_survey", project = my_project)

Imagine that this dataset is shared with several users, and you want to update it without affecting their usage. You might also want to consult with other analysts or decision makers to make sure that the data is accurate before sharing it with clients. To do this, call forkDataset() to create a copy of the dataset. The fork is placed in a folder following the same logic as new datasets, so you must specify a project folder unless you have the option R_CRUNCH_DEFAULT_PROJECT set.

forked_ds <- forkDataset(ds, project = "examples/rcrunch vignette data")

You now have a copied dataset which is identical to the original, and are free to make changes without fear of disrupting the client’s experience. You can add or remove variables, delete records, or change the dataset’s organization. These changes are isolated to your fork and won’t be visible to the end user until you decide to merge the fork back into the original dataset.

In this case, let’s create a new categorical array variable.

forked_ds$ImportantHiringCA <- makeArray(forked_ds[, c("ImportantHiringTechExp", "ImportantHiringPMExp")],
    name = "importantCatArray")

Our forked dataset has now diverged from the original dataset, which we can see by comparing their names.

all.equal(names(forked_ds), names(ds))

## [1] "Lengths (22, 23) differ (string compare on first 22)"
## [2] "14 string mismatches"
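
The all.equal() output confirms that the names differ but does not say which ones. A minimal sketch of a more readable comparison uses setdiff() in both directions. The helper name_diff() and the two short name vectors below are hypothetical stand-ins; with live datasets you would pass names(forked_ds) and names(ds) instead.

```r
# Hypothetical helper: list the names unique to each side of a comparison.
name_diff <- function(fork_names, original_names) {
    list(
        only_in_fork = setdiff(fork_names, original_names),
        only_in_original = setdiff(original_names, fork_names)
    )
}

# Made-up stand-ins for names(forked_ds) and names(ds)
fork_names <- c("Respondent", "TabsSpaces", "ImportantHiringCA")
original_names <- c("Respondent", "TabsSpaces",
    "ImportantHiringTechExp", "ImportantHiringPMExp")

diffs <- name_diff(fork_names, original_names)
print(diffs)
```

In this sketch the array appears only on the fork, while the two variables it absorbed appear only on the original, which matches the one-variable difference in length reported above.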

You can work with the forked dataset as long as you like. If you want to see it in the web app or share it with other analysts, you can do so by calling webApp(forked_ds). You might create many forks and discard most of them without merging them into the original dataset.

If you do end up with changes to the forked dataset that you want to include in the original dataset, you can do so with the mergeFork() function. This function figures out what changes you made to the fork, and then applies those changes to the original dataset.

ds <- mergeFork(ds, forked_ds)

After merging, the original dataset includes the categorical array variable which we created on the fork.

ds$ImportantHiringCA

## importantCatArray (categorical_array)
## Subvariables:
##   $ImportantHiringTechExp  | ImportantHiringTechExp
##   $ImportantHiringPMExp    | ImportantHiringPMExp

It’s possible to make changes to a fork which can’t be easily merged into the original dataset. For instance, if someone added another variable called ImportantHiringCA to the original dataset while we were working on this fork, the merge might fail because there’s no safe way to reconcile the two versions. This is called a “merge conflict”, and there are a couple of best practices that you can follow to avoid this problem:

Make minimal changes to dataset forks. Instead of making lots of changes to a fork, make a couple of small changes to the fork, merge it back into the original dataset, and create a new fork for the next set of changes.
Have other analysts work on their own forks. It’s easier to avoid conflicts if each member of the team makes changes to their own fork and then periodically merges those changes back into the original dataset. This lets you coordinate the order in which changes are applied to the original dataset, and so avoid some merge conflicts.
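
When a conflict does happen, it helps to catch the failure rather than let it stop a longer script. The sketch below wraps the merge in tryCatch(). It assumes that a failed mergeFork() signals an ordinary R error, which is not confirmed here. The helper try_merge() and the simulated_conflict() stand-in are hypothetical; in real use you would pass function() mergeFork(ds, forked_ds).

```r
# Hypothetical helper: run a merge and report failure instead of stopping.
# `merge_fn` is any zero-argument function that performs the merge,
# for example function() mergeFork(ds, forked_ds).
try_merge <- function(merge_fn) {
    tryCatch(
        list(ok = TRUE, dataset = merge_fn(), message = NULL),
        error = function(e) {
            list(ok = FALSE, dataset = NULL, message = conditionMessage(e))
        }
    )
}

# Stand-in that simulates a conflict; not a real Crunch call or message
simulated_conflict <- function() stop("simulated merge conflict")

result <- try_merge(simulated_conflict)
if (!result$ok) {
    message("Merge failed: ", result$message)
}
```

Returning a small list keeps the decision about what to do next (retry on a fresh fork, ask a colleague, or abandon the change) in the calling code.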

Appending data

Another good use of the fork-edit-merge workflow is when you want to append data to an existing dataset. When appending data you usually want to check that the append operation completed successfully before publishing the data to users. This might come up if you are adding a second wave of a survey, or including some additional data which came in after the dataset was initially sent to clients. The first step is to upload the second survey wave as its own dataset.

wave2 <- newDataset(SO_survey, "SO_survey_wave2", project = my_project)

We then fork the original dataset and append the new wave onto the forked dataset.

ds_fork <- forkDataset(ds, project = "examples/rcrunch vignette data")
ds_fork <- appendDataset(ds_fork, wave2)

ds_fork now has twice as many rows as ds, which we can verify with nrow:

nrow(ds)

## [1] 1634

nrow(ds_fork)

## [1] 3268
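
Comparing the two counts by eye works for a one-off, but the same check can be made explicit. A minimal sketch, assuming the expected result of an append is simply the original rows plus the new rows: the helper append_row_check() is hypothetical, and wave2 is assumed to have 1634 rows because it was built from the same SO_survey data. With live datasets the three arguments would be nrow(ds), nrow(wave2), and nrow(ds_fork).

```r
# Hypothetical helper: TRUE when the appended dataset has exactly the
# rows of the original plus the rows of the new wave.
append_row_check <- function(n_original, n_new, n_appended) {
    n_original + n_new == n_appended
}

# Row counts from the example: 1634 original rows, an assumed 1634 rows
# in wave2, and 3268 rows in the fork after the append
append_ok <- append_row_check(1634, 1634, 3268)

# Stop before merging if the counts do not add up
stopifnot(append_ok)
```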

Once we’ve confirmed that the append completed successfully, we can merge the forked dataset back into the original one.

ds <- mergeFork(ds, ds_fork)

ds now has the additional rows.

nrow(ds)

## [1] 3268

Merging datasets

Merging two datasets together can often be the source of unexpected behavior like misaligning or overwriting variables, and so it’s a good candidate for this workflow. Let’s create a fake dataset with household size to merge onto the original one.

house_table <- data.frame(Respondent = unique(as.vector(ds$Respondent)))
house_table$HouseholdSize <- sample(
    1:5,
    nrow(house_table),
    TRUE
)
house_ds <- newDataset(house_table, "House Size", project = my_project)

There are a few reasons why we might not want to merge this new table directly onto our user-facing data. For instance, we might make a mistake in constructing the table, or have some category names which don’t quite match up. Merging the data onto a forked dataset again gives us the safety to make changes and verify accuracy without affecting client-facing data.

ds_fork <- forkDataset(ds, project = "examples/rcrunch vignette data")
ds_fork <- merge(ds_fork, house_ds, by = "Respondent")

Before merging the fork back into the original dataset, we can check that everything went well with the join.

crtabs(~ TabsSpaces + HouseholdSize, ds_fork)

##           HouseholdSize
## TabsSpaces   1   2   3   4   5
##     Both   158 124  96 106 102
##     Spaces 214 256 282 282 228
##     Tabs   318 268 230 272 308
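
The crosstab shows that HouseholdSize arrived on the fork. A complementary check looks at the join key itself: each key should appear only once in the lookup table, and every key in the main dataset should have a match. The sketch below runs on plain R vectors and data frames. The helper check_join_key() and the small example data are hypothetical stand-ins for as.vector(ds_fork$Respondent) and house_table.

```r
# Hypothetical helper: check that a lookup table can be joined safely on `key`.
check_join_key <- function(left_keys, lookup, key) {
    lookup_keys <- lookup[[key]]
    list(
        # Duplicate keys in the lookup table make the match ambiguous
        unique_in_lookup = anyDuplicated(lookup_keys) == 0,
        # Unmatched keys would end up with missing values after the join
        all_matched = all(left_keys %in% lookup_keys)
    )
}

# Made-up stand-ins for as.vector(ds_fork$Respondent) and house_table
left_keys <- c(101, 102, 103, 101, 102, 103)
lookup <- data.frame(
    Respondent = c(101, 102, 103),
    HouseholdSize = c(2, 5, 1)
)

key_check <- check_join_key(left_keys, lookup, "Respondent")
stopifnot(key_check$unique_in_lookup, key_check$all_matched)
```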

Finally, once we’re comfortable that everything went as expected, we can send the data to the client by merging the fork back into the original dataset.

ds <- mergeFork(ds, ds_fork)
ds$HouseholdSize

## HouseholdSize (numeric)
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.971   4.000   5.000

Conclusion

Forking and merging datasets is a safe way to make changes to data that clients are already using. It allows you to verify your work and get approval before putting the data in front of clients, and gives you the freedom to make changes and mistakes without worrying about disrupting production data.