
Fix leaky pipes in R

Matt Dray (@mattdray)

Data leaking from a pipe (via Giphy)

TL;DR

You can chain function calls in R with %>%. There are a few ways to catch errors in these pipelines.

C’est un pipe

You know R’s %>% (pipe) operator by now. It lets you chain function calls. It was created for Stefan Milton Bache and Hadley Wickham’s {magrittr} package and popularised by the tidyverse. Par exemple:

# {dplyr} for data manipulation
# it also re-exports the pipe from {magrittr}
library(dplyr)

# Get mean sepal width for two iris species
iris_pipe <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>% 
  group_by(Species) %>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>% 
  mutate(`Mean width` = round(`Mean width`, 1))

# Print
iris_pipe
## # A tibble: 2 x 2
##   Species    `Mean width`
##   <fct>             <dbl>
## 1 setosa              3.4
## 2 versicolor          2.8
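To make the 'chaining' concrete, here's a small sketch of the same kind of pipeline written both ways. The object names (piped, nested) are just for illustration. Each %>% passes the left-hand result in as the first argument of the next call, so the nested form is equivalent but has to be read inside out.

```r
library(dplyr)

# Piped version: read top to bottom
piped <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  group_by(Species) %>%
  summarise(`Mean width` = mean(Sepal.Width))

# Nested version: the same calls, read inside out
nested <- summarise(
  group_by(
    filter(iris, Species %in% c("setosa", "versicolor")),
    Species
  ),
  `Mean width` = mean(Sepal.Width)
)

# Both approaches should give the same result
all.equal(piped, nested)
## [1] TRUE
```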

Ce n’est pas debuggable?

Some are critical of this approach. Long pipes obscure what’s happened to your data and make debugging hard. There’s no clear recommendation for solving this either.

I think most people create pipes interactively and check their outputs at each step. You could also make sensibly sized pipes for each ‘unit’ of wrangling (read, clean, model, etc). Hadley Wickham discusses this in the pipes chapter of the R for Data Science book.
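As a rough sketch of the 'one pipe per unit' idea, the example pipeline could be split into a cleaning step and a summarising step, with a quick check in between. The intermediate names (iris_clean, iris_summary) are my own and only for illustration.

```r
library(dplyr)

# Unit 1: clean (keep the two species of interest)
iris_clean <- iris %>%
  filter(Species %in% c("setosa", "versicolor"))

# Check the intermediate result before moving on
nrow(iris_clean)
## [1] 100

# Unit 2: summarise
iris_summary <- iris_clean %>%
  group_by(Species) %>%
  summarise(`Mean width` = mean(Sepal.Width)) %>%
  mutate(`Mean width` = round(`Mean width`, 1))
```

If something looks off, you only have to re-run and inspect the unit that produced it.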

This post summarises some solutions.

This table summarises it even more:

Package          Description                                     Message  View()  print()  Debug
{tidylog}        Console-printed description of changes
{ViewPipeSteps}  RStudio addin: see changes to dataset per step
{tamper}         Stack trace replacement for pipe debugging
{pipecleaner}    RStudio addin: ‘burst’ pipes and debug
{magrittr}       debug_pipe() function
debug()          R’s good old debug() function
{pipes}          Special assignment operators
Bizarro pipe     Replace %>% with ->.; and observe .Last.value

‘Message’ shows whether it prints something informative to the console; View() and print() tell you whether the dataset can be viewed at each step; and ‘Debug’ whether it opens the debug menu.

Read on for explanations and examples.

Ce n’est pas une probleme?

I’ve gathered some solutions into three categories:

  1. Summary inspection
    1. {tidylog}
    2. {ViewPipeSteps}
  2. Debug mode
    1. {tamper}
    2. {pipecleaner}
    3. {magrittr}
    4. debug()
  3. Operator hacking
    1. {pipes}
    2. Bizarro pipe

1. Summary inspection

These are packages for seeing what happened to your dataset at each step of your pipeline, rather than highlighting where the problem was.

1a. {tidylog}

The {tidylog} package was written by Benjamin Elbers. It prints to the console some summary sentences of the changes that have happened to your data after each {dplyr} step.

# install.packages("tidylog")  # available from CRAN
library(tidylog)  # must be loaded after dplyr
## 
## Attaching package: 'tidylog'
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct,
##     distinct_all, distinct_at, distinct_if, filter, filter_all,
##     filter_at, filter_if, full_join, group_by, group_by_all,
##     group_by_at, group_by_if, inner_join, left_join, mutate,
##     mutate_all, mutate_at, mutate_if, right_join, select,
##     select_all, select_at, select_if, semi_join, summarise,
##     summarise_all, summarise_at, summarise_if, summarize,
##     summarize_all, summarize_at, summarize_if, tally, top_n,
##     transmute, transmute_all, transmute_at, transmute_if
## The following object is masked from 'package:stats':
## 
##     filter

You can see from the output that {tidylog} masks all the {dplyr} functions. In other words, you can continue to use the {dplyr} function names as usual, but with the added {tidylog} side-effect that the changes at each step are reported in the console.

iris_pipe <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>% 
  group_by(Sepal.Width) %>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>% 
  mutate(`Mean width` = round(`Mean width`, 1))
## filter: removed 50 out of 150 rows (33%)
## group_by: one grouping variable (Sepal.Width)
## summarise: now 23 rows and 2 columns, ungrouped
## mutate: no changes

This is a nice passive approach. But how does this help? We can sense-check each step. For example:

iris_pipe <- iris %>%
  filter(Species %in% c("cycliophora", "onychophora")) %>% 
  group_by(Sepal.Width) %>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>% 
  mutate(`Mean width` = round(`Mean width`, 1))
## filter: removed all rows (100%)
## group_by: one grouping variable (Sepal.Width)
## summarise: now 0 rows and 2 columns, ungrouped
## mutate: no changes

Did you spot the contrived error? I filtered for species that don’t exist in the dataset [1]. This was reported as filter: removed all rows (100%) in the first step.

I’ll unload {tidylog} before continuing so it doesn’t interfere with the other examples.

unloadNamespace("tidylog")
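If you'd rather not mask the {dplyr} verbs, a similar sense-check can be rolled by hand. This is only a sketch: log_rows() is a hypothetical helper of my own, not part of {tidylog} or {dplyr}. It reports the row count as a message and passes the data on unchanged, so it can be dropped in between any two steps.

```r
library(dplyr)

# Hypothetical helper: report the number of rows, then return the data as-is
log_rows <- function(data, label) {
  message(label, ": ", nrow(data), " rows")
  data
}

iris_checked <- iris %>%
  log_rows("start") %>%
  filter(Species %in% c("cycliophora", "onychophora")) %>%
  log_rows("after filter")
## start: 150 rows
## after filter: 0 rows
```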

1b. {ViewPipeSteps}

The {ViewPipeSteps} package is an RStudio add-in created by David Ranzolin. Basically it runs View() or print() for each of the steps in your pipeline so you can see what’s happened to the dataset.

# remotes::install_github("daranzolin/ViewPipeSteps")  # not on CRAN
library(ViewPipeSteps)

After installing you can simply highlight your code and select ‘View Pipe Chain Steps’ or ‘Print Pipe Chain Steps’ from the add-ins menu.

Beware if you have lots of steps in your pipeline, because it’s going to fill up your console, or open a whole bunch of tabs or windows, with each cut of the dataset.

2. Debug mode

These are packages that help highlight where a problem occurred. These take you to the debug menu, which is worth reading up on if you haven’t used it before.

2a. {tamper}

Gábor Csárdi’s {tamper} package makes pipe debugging easier with a simple, informative interface. The package is still available on GitHub, but it has been archived.

Once it’s installed and loaded, you set the error option to tamper with options(error = tamper::tamper). From then on, {tamper} will override the default stack trace report you get when an error is found.

When there’s an error, {tamper} highlights the problematic line with an arrow. Typing ‘0’ will exit the {tamper} report; ‘1’ switches you back to the stack trace; ‘2’ will enter debug mode.

This is friendly for beginners especially, since the {tamper} output is more readable.

# remotes::install_github("gaborcsardi/tamper")  # not on CRAN
library(tamper)

options(error = tamper::tamper)  # set error option to tamper

iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>% 
  group_by(Species) %>% 
  summarise(`Mean width` = mean(Sepal.Girth)) %>%  # error here!
  mutate(`Mean width` = round(`Mean width`, 1))
## Error in mean(Sepal.Girth) : object 'Sepal.Girth' not found
## 
## Enter 0 to exit or choose:
## 
## 1:    Switch mode
## 2:    Take me to the error
## 
## 3:    iris %>%
## 4: ->   filter(., Species %in% c("setosa", "versicolor")) %>%
## 5:      group_by(., Species) %>%
## 6:      summarise(., `Mean width` = mean(Sepal.Girth)) %>%
## 7:      mutate(., `Mean width` = round(`Mean width`, 1))
## 
## Selection:
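Because the error option is a global setting, it stays in place for the rest of the session. Here's a sketch of how you might save and restore it using base R only. The handler function below is a stand-in of my own so the example runs without {tamper} installed; in practice you'd put tamper::tamper in its place.

```r
# Swap in a custom error handler and remember the previous setting
# (the placeholder function stands in for tamper::tamper)
old_opts <- options(error = function() message("custom error handler ran"))

# ... run the pipeline you want to investigate here ...

# Put the previous handler back when you're done
options(old_opts)
```

options() returns the old values invisibly, which is what makes this save-then-restore pattern work.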

2b. {pipecleaner}

The {pipecleaner} package is an RStudio addin by Edward Visel. It has the best name.

You highlight your code and select ‘debug pipeline in browser’ from the RStudio addins menu. This ‘bursts’ your pipeline to one intermediate object per function call, then opens the debug menu. You can also simply ‘burst pipes’ from the addins menu without debug mode.

# remotes::install_github("alistaire47/pipecleaner")  # not on CRAN
library(pipecleaner)

# Intact, original pipeline
iris_pipe <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>% 
  group_by(Species) %>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>% 
  mutate(`Mean width` = round(`Mean width`, 1))

# After 'debug pipeline in browser' or 'burst pipes' addins
dot1 <- filter(iris, Species %in% c("setosa", "versicolor"))
dot2 <- group_by(dot1, Species)
dot3 <- summarise(dot2, `Mean width` = mean(Sepal.Width))
iris_pipe <- mutate(dot3, `Mean width` = round(`Mean width`, 1))

So effectively it steps through each new object to report back errors. But it leaves you with multiple objects (with meaningless names) to clean up – there’s no ‘fix pipes’ option to return to your original pipeline.

2c. {magrittr}

Surprise: the {magrittr} package itself has the function debug_pipe() to let you see what’s being passed into the next function.

library(magrittr)

iris_magrittr <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  group_by(Species) %>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>%
  debug_pipe() %>% 
  mutate(`Mean width` = round(`Mean width`, 1))

Not much to say about this one, but worth mentioning because %>% gets re-exported in other packages – check out usethis::use_pipe() – but debug_pipe() doesn’t.
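The idea is simple enough to sketch yourself: pause with browser(), then return the input untouched so the pipeline can carry on. The helper below, pause_pipe(), is a hypothetical name of my own; I've added an interactive() guard so that it does nothing when the script is run non-interactively.

```r
library(dplyr)

# Hypothetical helper: pause in the browser when run interactively,
# then hand the data on unchanged
pause_pipe <- function(x) {
  if (interactive()) browser()
  x
}

iris_paused <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  pause_pipe() %>%  # inspect `x` here, then continue
  group_by(Species) %>%
  summarise(`Mean width` = mean(Sepal.Width))
```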

2d. debug()

You can simply use R’s debug() function, as pointed out by Nathan Werth.

You can do this for a given function in the pipeline:

debug(summarise)

iris_magrittr <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  group_by(Species) %>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>%
  mutate(`Mean width` = round(`Mean width`, 1))
  
undebug(summarise)

Or you can even debug each step by setting up debug(`%>%`), since the pipe is itself a function, after all.
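If you lose track of what you've flagged, isdebugged() will tell you. A small sketch with a hypothetical helper function of my own (mean_width()), so that nothing from {dplyr} is left in debug mode by accident:

```r
# Hypothetical helper, standing in for a step in your pipeline
mean_width <- function(data) mean(data$Sepal.Width)

debug(mean_width)       # flag it: the next call would open the browser
isdebugged(mean_width)
## [1] TRUE

undebug(mean_width)     # clear the flag again
isdebugged(mean_width)
## [1] FALSE
```

There's also debugonce(), which flags a function for a single call only, so there's nothing to undo afterwards.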

3. Operator hacking

It’s possible to make variant pipe operators. But maybe we don’t even need %>%?

3a. {pipes}

Antoine Fabri forked the {magrittr} GitHub repo to add a bunch of %>% variants that have side effects. These are available from his {pipes} package.

A few of direct relevance to this discussion:

  • %P>% will print() the dataset to the console
  • %V>% will View() the full dataset
  • %D>% opens the debug menu

Others apply different functions during the piping step. There are some nice ones for summaries, like %glimpse>% and %skim>%.

Here’s an example of %P>%, which pipes forward into the next function and prints that step’s output to the console. (The final output isn’t printed because I’ve assigned it to iris_pipes, of course.)

# remotes::install_github("moodymudskipper/pipes")  # not on CRAN
library(pipes)
## 
## Attaching package: 'pipes'
## The following object is masked from 'package:dplyr':
## 
##     %>%
iris_pipes <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>% 
  group_by(Species) %P>% 
  summarise(`Mean width` = mean(Sepal.Width)) %>%
  mutate(`Mean width` = round(`Mean width`, 1))
## summarise(., `Mean width` = mean(Sepal.Width))
## # A tibble: 2 x 2
##   Species    `Mean width`
##   <fct>             <dbl>
## 1 setosa             3.43
## 2 versicolor         2.77

So this one could have gone in the ‘summary inspection’ section above, but it contains more functions than for printing and viewing alone.
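To get a feel for how a printing pipe variant might work, here's a toy version in base R. To be clear, this is my own simplified sketch and not how {pipes} or {magrittr} implement their operators: %peek>% is a made-up name, and it only handles a right-hand side written as a call, like sort().

```r
# Toy operator (hypothetical): insert the left-hand side as the first
# argument of the right-hand call, evaluate it, print it, and pass it on
`%peek>%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)  # capture the call without evaluating it
  new_call <- as.call(
    c(list(rhs_call[[1]], quote(.lhs)), as.list(rhs_call)[-1])
  )
  out <- eval(new_call, list(.lhs = lhs), parent.frame())
  print(out)
  invisible(out)
}

res <- c(3, 1, 2) %peek>% sort() %peek>% rev()
## [1] 1 2 3
## [1] 3 2 1
```

Each step's output is printed as it happens, and res still ends up holding the final value.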

3b. Bizarro pipe

Forget the pipe. We can create an operator that acts like a pipe but can be run so that we can check what’s happening at each step.

John Mount’s solution is the ‘Bizarro pipe’, which looks like ->.;.

The ->.; operator reads as ‘right-assign the left-hand side to a period and then perform the next operation’.

Things you might be wondering:

  • yes, you can use a -> for assignment to the right
  • yes, you can assign to a ., but you’ll need to explicitly supply it as the data argument to the next function call in your Bizarro pipeline
  • yes, you can use semi-colons in R for run-on code execution – try head(iris); tail(iris)
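Those three pieces can be tried out in base R alone. A minimal sketch with made-up numbers, where total and wrong are illustrative names of my own:

```r
# Right-assign each result to `.`, then pass `.` explicitly to the next call
c(1, 4, 9) ->.;
  sqrt(.) ->.;
  sum(.) -> total  # capture the final value with one last right-assignment

total
## [1] 6

# Pitfall: `->` binds more tightly than `<-`, so a leading `<-` captures
# the value that was first assigned to `.`, not the end of the chain
wrong <- c(1, 4, 9) ->.;
  sum(.) ->.

identical(wrong, c(1, 4, 9))
## [1] TRUE
```

So if you want to keep the final result, right-assign it at the end of the chain rather than left-assigning at the start.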

So what? Well, you can execute each line in turn and check the output. But wait: an object called . isn’t shown in RStudio’s environment pane, unless you check ‘Show .Last.value in environment listing’ in RStudio’s settings. Now when you run a line you’ll see the ‘.Last.value’ that’s been output.

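# Right-assign at the end: `->` binds tighter than `<-`, so starting with
# `iris_bizarro <- iris ->.;` would store the untouched iris instead
iris ->.;
  filter(., Species %in% c("setosa", "versicolor")) ->.;
  group_by(., Species) ->.;
  summarise(., `Mean width` = mean(Sepal.Width)) ->.;
  mutate(., `Mean width` = round(`Mean width`, 1)) -> iris_bizarro

# Print
iris_bizarro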
## # A tibble: 2 x 2
##   Species    `Mean width`
##   <fct>             <dbl>
## 1 setosa              3.4
## 2 versicolor          2.8

So it’s slightly convoluted and people looking at your code are going to be confused, but hey, no dependencies are needed.

M’aider

What’s your approach to this problem?

What have I missed?

Bonus cat-with-pipe gif (via Giphy)

Session info

## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.1
## 
## Locale: en_GB.UTF-8 / en_GB.UTF-8 / en_GB.UTF-8 / C / en_GB.UTF-8 / en_GB.UTF-8
## 
## Package version:
##   assertthat_0.2.0 base64enc_0.1.3  BH_1.69.0.1      blogdown_0.7    
##   bookdown_0.7     cli_1.0.1        compiler_3.5.3   crayon_1.3.4    
##   digest_0.6.18    dplyr_0.8.0.1    emo_0.0.0.9000   evaluate_0.13   
##   fansi_0.4.0      glue_1.3.0       graphics_3.5.3   grDevices_3.5.3 
##   highr_0.7        htmltools_0.3.6  httpuv_1.4.5.1   jsonlite_1.6    
##   knitr_1.22       later_0.8.0      lubridate_1.7.4  magrittr_1.5    
##   markdown_0.9     methods_3.5.3    mime_0.6         pillar_1.3.1    
##   pipes_0.0.0.9000 pkgconfig_2.0.2  plogr_0.2.0      promises_1.0.1  
##   purrr_0.3.1      R6_2.4.0         Rcpp_1.0.0       rlang_0.3.1     
##   rmarkdown_1.11   rstudioapi_0.10  servr_0.10       stats_3.5.3     
##   stringi_1.3.1    stringr_1.4.0    tibble_2.0.1     tidyselect_0.2.5
##   tinytex_0.10     tools_3.5.3      utf8_1.1.4       utils_3.5.3     
##   xfun_0.5         yaml_2.2.0


  1. Welcome to Biology Geek Corner. Cycliophora is a phylum containing just one genus and (probably) three species. Our own phylum – Chordata – contains 55,000 species. Symbion pandora was the first cycliophoran species to be found, in 1995, and appears to live commensally and exclusively on lobster lips. Onychophora is the velvet worm phylum, which contains wee beasties that spray slime, have adorable little claws and are, surprise, kinda velvety (one species is named ‘totoro’ because of its similarity to My Neighbour Totoro’s Catbus).