rostrum.blog - Fix leaky pipes in R

The character Data from Star Trek: The Next Generation is smoking a pipe. — Data leaking from a pipe.

tl;dr

You can chain function calls in R with %>%. There are a few ways to catch errors in these pipelines.

Note

This post was first published before the base pipe (|>) existed. You should assume that the solutions here work for the {magrittr} pipe (%>%) only.

More solutions have emerged since this post was published, like Antoine Fabri’s {boomer} and Sean Kross’s {mario}. I may update this post in future.

C’est un pipe

R users will be familiar by now with the %>% (pipe) operator, created for Stefan Milton Bache and Hadley Wickham’s {magrittr} package and popularised by the tidyverse.

It lets you chain function calls. x %>% y() evaluates the same way as y(x). Par exemple, avec les pingouins:

library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)

# Get the rounded mean bill length for two species
peng_pipe <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  mutate(mean_length = round(mean_length))

peng_pipe

# A tibble: 2 × 2
  species   mean_length
  <fct>           <dbl>
1 Adelie             39
2 Chinstrap          49
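The claim that x %>% y() evaluates the same way as y(x) is easy to check for yourself. Here's a minimal sketch that assumes only {magrittr} is installed and uses the built-in mtcars data set rather than the penguins; the object names are just for illustration.

```r
library(magrittr)

# Nested calls: read from the inside out
nested <- round(mean(mtcars$mpg), 1)

# Piped calls: read from left to right
piped <- mtcars$mpg %>% mean() %>% round(1)

# Both routes should give the same value
identical(nested, piped)
```

The piped version puts the steps in the order they happen, which is most of the appeal.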

Ce n’est pas debuggable?

Not everyone likes this approach. That’s okay. Multi-step pipes could obscure what’s happened to your data and might make debugging harder.

I think most people create pipes interactively and check outputs as they go, or else make sensibly-lengthed chains for each ‘unit’ of wrangling (read, clean, model, etc). Hadley discusses this in the pipes chapter of the R for Data Science book.

I’ve summarised a few solutions in this post, which can be summarised even more in this table:

Package	Description	Message	`View()`	`print()`	Debug
{tidylog}	Console-printed description of changes	✔	✘	✘	✘
{ViewPipeSteps}	RStudio Addin: see changes to data set per step	✘	✔	✔	✘
{tamper}	Stack trace replacement for pipe debugging	✔	✘	✘	✔
{pipecleaner}	RStudio Addin: ‘burst’ pipes and debug	✘	✘	✘	✔
{magrittr}	`debug_pipe()` function	✘	✘	✘	✔
`debug()`	R’s good old `debug()` function	✘	✘	✘	✔
{pipes}	Special assignment operators	✘	✔	✔	✔
Bizarro pipe	Replace `%>%` with `->.;` and observe `.Last.value`	✘	✘	✘	✘

‘Message’ means whether it prints something informative to the console; View() and print() tell you if the data set can be viewed at each step; and ‘debug’ if it opens the debug menu.

Ce n’est pas une probleme?

I’ve gathered the solutions into three categories:

Summary inspection

{tidylog}
{ViewPipeSteps}

Debug mode

{tamper}
{pipecleaner}
{magrittr}
debug()

Operator hacking

{pipes}
Bizarro pipe

1. Summary inspection

These are packages for seeing what happened to your data set at each step of your pipeline, rather than highlighting where the problem was.

1a. {tidylog}

The {tidylog} package by Benjamin Elbers prints to the console some summary sentences of the changes that have happened to your data after each pipe step. You can install it from CRAN:

install.packages("tidylog")

Make sure you attach it after {dplyr}.

library(tidylog)


Attaching package: 'tidylog'

The following objects are masked from 'package:dplyr':

    add_count, add_tally, anti_join, count, distinct, distinct_all,
    distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
    full_join, group_by, group_by_all, group_by_at, group_by_if,
    inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
    relocate, rename, rename_all, rename_at, rename_if, rename_with,
    right_join, sample_frac, sample_n, select, select_all, select_at,
    select_if, semi_join, slice, slice_head, slice_max, slice_min,
    slice_sample, slice_tail, summarise, summarise_all, summarise_at,
    summarise_if, summarize, summarize_all, summarize_at, summarize_if,
    tally, top_frac, top_n, transmute, transmute_all, transmute_at,
    transmute_if, ungroup

The following object is masked from 'package:stats':

    filter

You can see from the output that {tidylog} masks many of the {dplyr} functions. In other words, you can continue to use the {dplyr} function names as usual, but {tidylog} will add a side effect: it will print a plain-English summary of the changes.

peng_pipe <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  mutate(mean_length = round(mean_length))

filter: removed 124 rows (36%), 220 rows remaining

group_by: one grouping variable (species)

summarise: now 2 rows and 2 columns, ungrouped

mutate: changed 2 values (100%) of 'mean_length' (0 new NA)

This is a nice passive approach. But how does this help? We can sense-check each step. For example:

peng_pipe <- penguins %>%
  filter(species %in% c("Cycliophora", "Onychophora")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  mutate(mean_length = round(mean_length))

filter: removed all rows (100%)

group_by: one grouping variable (species)

summarise: now 0 rows and 2 columns, ungrouped

mutate: no changes

Did you spot the extremely contrived error? I filtered for species that don’t exist in the data set¹. This was reported as filter: removed all rows (100%) in the first step.

I’ll unload {tidylog} before continuing so it doesn’t interfere with the other examples.

unloadNamespace("tidylog")
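If you'd rather not mask any {dplyr} functions, the same sense-checking idea can be approximated by hand. This is only a rough sketch and not how {tidylog} works internally: log_rows() is a hypothetical helper I've made up for illustration, and the example uses {magrittr} with the built-in mtcars data set so that it runs on its own.

```r
library(magrittr)

# Hypothetical helper: report the row count as a message, then pass the
# data along unchanged so the pipeline can continue
log_rows <- function(data, label) {
  message(label, ": ", nrow(data), " rows")
  data
}

cars_pipe <- mtcars %>%
  log_rows("start") %>%
  subset(cyl == 99) %>%  # contrived error: no car has 99 cylinders
  log_rows("after subset")
```

The second message reports zero rows, which flags the bad filter in the same spirit as the 'removed all rows' line from {tidylog}, though with far less detail.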

1b. {ViewPipeSteps}

The {ViewPipeSteps} package by David Ranzolin is an RStudio Addin available from GitHub. Basically it lets you View() or print() at each step of your pipeline so you can see what’s happened to the data set.

remotes::install_github("daranzolin/ViewPipeSteps")
library(ViewPipeSteps)

After installing you can simply highlight your code and select ‘View Pipe Chain Steps’ or ‘Print Pipe Chain Steps’ from the RStudio Addins menu.

Beware if you have lots of steps in your pipeline: each cut of the data set will either fill up your console or open in its own RStudio tab or window.

2. Debug mode

These are packages that help highlight where a problem occurred. These take you to the debug menu, which is worth reading up on if you haven’t used it before.

2a. {tamper}

Gábor Csárdi’s {tamper} package makes pipe debugging easier with a simple, informative interface. The package is currently available on GitHub but is archived.

You set the error argument of options() to tamper once installed and loaded. From now on {tamper} will override the default stack trace report you get when an error is found. Here I’ve used a column that doesn’t exist in the data set:

remotes::install_github("gaborcsardi/tamper")
options(error = tamper::tamper)

penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_girth, na.rm = TRUE)) %>%  # error here!
  mutate(mean_length = round(mean_length))

When there’s an error, {tamper} highlights the problematic line with an arrow. Typing ‘0’ will exit the {tamper} report; ‘1’ switches you back to the stack trace; ‘2’ will enter debug mode. Here’s how that looks in the console at first:

## Error in mean(bill_girth, na.rm = TRUE) : object 'bill_girth' not found
## 
## Enter 0 to exit or choose:
## 
## 1:    Switch mode
## 2:    Take me to the error
## 
## 3:    penguins %>%
## 4:      filter(., species %in% c("Adelie", "Chinstrap")) %>%
## 5:      group_by(., species) %>%
## 6: ->   summarise(., mean_length = mean(bill_girth, na.rm = TRUE)) %>%
## 7:      mutate(., mean_length = round(mean_length))
## 
## Selection:
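Bear in mind that options(error = ...) changes a session-wide setting, so you'll probably want to put things back when you're done. A minimal sketch of the usual save-and-restore pattern follows; the handler is a hypothetical placeholder rather than tamper::tamper, so that the code runs without the package installed.

```r
# Install a custom error handler and keep a copy of the previous setting.
# The handler is a made-up placeholder, not tamper::tamper
old_opts <- options(error = function() message("Something went wrong"))

# ... run the pipeline you want to investigate ...

# Restore whatever was set before
options(old_opts)
```

Calling options(error = NULL) also returns you to R's default behaviour.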

2b. {pipecleaner}

The {pipecleaner} package by Edward Visel is another RStudio Addin available on GitHub. It has the best name.

You highlight your code and select ‘debug pipeline in browser’ from the RStudio Addins menu. This ‘bursts’ your pipeline to one intermediate object per function call, then opens the debug menu. You can also simply ‘burst pipes’ from the Addins menu without debug mode.

remotes::install_github("alistaire47/pipecleaner")
library(pipecleaner)

# Intact, original pipeline
peng_pipe <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  mutate(mean_length = round(mean_length))

# Highlight the original pipeline and select 'debug pipeline in browser' or 
# 'burst pipes' from the RStudio Addins menu
dot1 <- filter(penguins, species %in% c("Adelie", "Chinstrap"))
dot2 <- group_by(dot1, species)
dot3 <- summarise(dot2, mean_length = mean(bill_length_mm, na.rm = TRUE))
peng_pipe <- mutate(dot3, mean_length = round(mean_length))

So effectively it steps through each new object to report back errors. But it leaves you with multiple objects (with meaningless names) to clean up; there’s no ‘fix pipes’ option to return to your original pipeline.

2c. {magrittr}

Surprise: the {magrittr} package itself has the function debug_pipe() to let you see what’s being passed into the next function.

library(magrittr)

peng_pipe <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  debug_pipe() %>%
  mutate(mean_length = round(mean_length))

Not much to say about this one, but worth mentioning because %>% gets re-exported in other packages² but debug_pipe() doesn’t.
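As far as I can tell, debug_pipe() does little more than call browser() and then return its input, which is why it can sit anywhere in a chain. It needs an interactive session, though. Here's a rough, non-interactive stand-in to show the idea: peek() is a hypothetical helper of my own, not part of {magrittr}, and the example uses the built-in mtcars data set.

```r
library(magrittr)

# Hypothetical stand-in for debug_pipe(): print a compact summary of
# whatever arrives, then return it unchanged so the pipeline carries on
peek <- function(x) {
  str(x, give.attr = FALSE)
  x
}

cyl_means <- mtcars %>%
  subset(cyl %in% c(4, 6)) %>%
  aggregate(mpg ~ cyl, data = ., FUN = mean) %>%
  peek() %>%  # inspect the summarised data before rounding
  transform(mpg = round(mpg))
```

Because peek() returns its input untouched, you can remove it later without changing the result.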

2d. `debug()`

You can simply use R’s debug() function, as pointed out by Nathan Werth.

You can do this for a given function in the pipeline:

debug(summarise)

peng_pipe <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %>% 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  mutate(mean_length = round(mean_length))

undebug(summarise)

Or you can even debug each step by setting up debug(`%>%`), since the pipe is itself a function, after all.
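The debug flag stays on a function until you remove it, so it can be handy to check its state. A small base R sketch follows; round_mean() is a hypothetical helper standing in for a pipeline step, and it's never called while flagged so that the code runs unattended.

```r
# Hypothetical helper standing in for a step in a pipeline
round_mean <- function(x) round(mean(x, na.rm = TRUE))

debug(round_mean)                    # calls now open the browser until undebug()
flag_on <- isdebugged(round_mean)    # is the flag set?

undebug(round_mean)                  # remove the flag when you're done
flag_off <- isdebugged(round_mean)

# debugonce(round_mean) would instead flag the function for a single call,
# so there would be no need to remember undebug()
```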

3. Operator hacking

It’s possible to make variant pipe operators. Maybe we don’t even need %>%?
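To give a flavour of what a variant operator involves, here's a toy sketch in base R. It's an assumption-laden illustration and not how {magrittr} or {pipes} implement their operators: %print>% is a made-up name, it only handles a plain function call on the right-hand side, and it always inserts the left-hand side as the first argument.

```r
# Hypothetical operator: pipe the left-hand side into the call on the right,
# print the result, then return it so the chain can continue
`%print>%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)  # capture the call unevaluated, e.g. head(n = 2)
  stopifnot(is.call(rhs_call))

  # Rebuild the call with a placeholder inserted as the first argument
  piped_call <- as.call(
    c(rhs_call[[1]], quote(.lhs), as.list(rhs_call)[-1])
  )

  # Evaluate with the placeholder bound to the left-hand side
  result <- eval(piped_call, list(.lhs = lhs), parent.frame())
  print(result)
  result
}

top_two <- mtcars %print>% head(n = 2)
```

All %...% operators share the same precedence and are evaluated left to right, so a home-made one like this can be mixed into a chain of ordinary pipes.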

3a. {pipes}

Antoine Fabri forked the {magrittr} GitHub repo to add a bunch of %>% variants that have side effects. These are available from his {pipes} package.

A few of direct relevance to this discussion:

%P>% to print() the data set to the console
%V>% will View() the full data set
%D>% opens the debug menu

Others apply different functions during the piping step. There are some nice ones for summaries, like %glimpse>% and %skim>%.

remotes::install_github("moodymudskipper/pipes")
library(pipes)

Attaching package: 'pipes'

The following object is masked from 'package:dplyr':

    %>%

Here’s an example of %P>%, which pipes forward into the next function and prints that function’s output to the console.

peng_pipe <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>% 
  group_by(species) %P>%  # note %P>% operator 
  summarise(mean_length = mean(bill_length_mm, na.rm = TRUE)) %>% 
  mutate(mean_length = round(mean_length))

summarise(., mean_length = mean(bill_length_mm, na.rm = TRUE))

# A tibble: 2 × 2
  species   mean_length
  <fct>           <dbl>
1 Adelie           38.8
2 Chinstrap        48.8

Note that the final output of the chain is assigned to peng_pipe. We’re seeing the printed output of the summarise() step without the following mutate() step, given where we placed the %P>% operator.

So this one could have gone in the ‘summary inspection’ section above, but it contains more functions than for printing and viewing alone.

3b. Bizarro pipe

Forget installing a package to get the pipe. We can create an operator that acts like a pipe but can be run so that we can check what’s happening at each step.

John Mount’s solution is the ‘Bizarro pipe’, which looks like ->.;. This is a simple hack that’s legitimate base R code. The ->.; operator reads as ‘right-assign the left-hand side to a period and then perform the next operation’.

Things you might be wondering:

yes, you can use a -> for assignment to the right
yes, you can assign to ., but you’ll need to explicitly supply it as the data argument to the next function call in your Bizarro pipeline
yes, you can use semi-colons in R for run-on code execution, try head(penguins); tail(penguins)

This means you can execute each line in turn and check the output. But wait: an object called . isn’t shown in RStudio’s Environment pane, because names that start with a period are hidden by default. You can tick ‘Show .Last.value in environment listing’ in RStudio’s settings instead; then, whenever you run a line, you’ll see the ‘.Last.value’ that’s been output.

penguins ->.;
filter(., species %in% c("Adelie", "Chinstrap")) ->.;
group_by(., species) ->.;
summarise(., mean_length = mean(bill_length_mm, na.rm = TRUE)) ->.;
mutate(., mean_length = round(mean_length)) -> peng_pipe

Note that the name of the object comes at the end; we’re always passing the object to the right.
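The three base R features behind the Bizarro pipe can also be checked in isolation. This is a small sketch with made-up object names and no packages required.

```r
# Right assignment: the value on the left is bound to the name on the right
c(3, 1, 2) -> .

# The dot is an ordinary (if hidden) object, so pass it on explicitly.
# The semicolon separates two statements on the same line
sort(.) -> .; rev(.) -> reversed

# Names that start with a period are left out of ls() by default...
dot_listed <- "." %in% ls()

# ...but the object is there if you ask for all names
dot_exists <- "." %in% ls(all.names = TRUE)
```

That hidden-by-default behaviour is why you need to watch .Last.value, or print the dot yourself, to see what each step produced.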

This might confuse your colleagues, but hey, no dependencies are needed!

M’aider

What’s your approach to this problem? What have I missed?

Environment

Session info

Last rendered: 2023-08-02 20:22:59 BST

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] palmerpenguins_0.1.1 dplyr_1.1.2         

loaded via a namespace (and not attached):
 [1] vctrs_0.6.3       cli_3.6.1         knitr_1.43.1      clisymbols_1.2.0 
 [5] rlang_1.1.1       xfun_0.39         purrr_1.0.1       generics_0.1.3   
 [9] jsonlite_1.8.7    glue_1.6.2        htmltools_0.5.5   fansi_1.0.4      
[13] rmarkdown_2.23    evaluate_0.21     tibble_3.2.1      fontawesome_0.5.1
[17] fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3   compiler_4.3.1   
[21] htmlwidgets_1.6.2 pkgconfig_2.0.3   tidyr_1.3.0       rstudioapi_0.15.0
[25] digest_0.6.33     R6_2.5.1          tidyselect_1.2.0  utf8_1.2.3       
[29] pillar_1.9.0      magrittr_2.0.3    tools_4.3.1

Footnotes

¹ Welcome to Biology Geek Corner. Cycliophora is a phylum containing just one genus and (probably) three species. Our own phylum – Chordata – contains 55,000 species. Symbion pandora, the first cycliophoran species, was found in 1995 and appears to live commensally and exclusively on lobster lips. Onychophora is the velvet worm phylum, which contains wee beasties that spray slime, have adorable little claws and are, surprise, kinda velvety (one species is named ‘totoro’ because of its similarity to My Neighbour Totoro’s Catbus).
² Check out usethis::use_pipe() for re-exporting the pipe to use in your own package.

Reuse

CC BY-NC-SA 4.0