rostrum.blog - The life and death of the tidyverse

A name badge that says 'Hello, I am...' at the top and 'Deprecated' is written in the space at the bottom.

tl;dr

In which I try to work out what functions and arguments in the tidyverse are badged as ‘experimental’, ‘deprecated’, ‘superseded’, etc. You can jump to the results table.

Birth

The tidyverse suite of packages develops quickly and there have been many API changes over the years. For example, gather() and spread() were superseded by pivot_longer() and pivot_wider() in {tidyr}, and there was a recent introduction of experimental .by/by arguments in several {dplyr} functions¹.

The tidyverse uses the {lifecycle} package to advertise to users the current state of a function or argument, via a badge in the help files (e.g. ?tidyr::gather). There’s a good explanatory vignette about lifecycles if you want to learn more. The badges look like this:

With this in mind, wouldn’t it be fun (haha, I mean ‘informative’) to try and extract lifecycle information from tidyverse packages?²

Life

Functions

First, we get the names of tidyverse packages from within {tidyverse} itself. Preparing these as e.g. package:tidyr will help us later to ls() (list functions) and detach() (remove the package from the search path).

# Package names in the tidyverse
pkg_names <- tidyverse::tidyverse_packages(include_self = FALSE)
pkg_envs <- paste0("package:", pkg_names)
pkg_names

 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"

Badges

Then we need the badge strings and some regular expression versions that will help with string handling later. ‘Stable’ shouldn’t need to be indicated, but I thought I’d add it for completeness. ‘Maturing’ and ‘Questioning’ have been superseded (lol, so meta), but there might still be some badges in the wild, maybe. I found at least one instance of ‘Soft-deprecated’ as well, which isn’t part of the r-lib lifecycle, so I included it too.

# Badge strings in Rd
life_names <- c(
  "Deprecated", "Experimental", "Superseded",
  "Stable",
  "Maturing", "Questioning",
  "Soft-deprecated"
)

# Regex to help detect lifecycle stages
life_names_rx <- paste(life_names, collapse = "|")

# Regex to help detect lifecycle badge format: '*[Experimental]*'
badges_rx <-
  paste0("\\*\\[(", life_names_rx, ")\\]\\*")
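As a quick sanity check, here's a minimal sketch of how that regex behaves. The strings in sample_lines are made up for illustration (they aren't copied from any real help file), and the block rebuilds the regex so it runs on its own.

```r
# Rebuild the badge regex so this block runs on its own
life_names <- c(
  "Deprecated", "Experimental", "Superseded",
  "Stable", "Maturing", "Questioning", "Soft-deprecated"
)
life_names_rx <- paste(life_names, collapse = "|")
badges_rx <- paste0("\\*\\[(", life_names_rx, ")\\]\\*")

# Made-up lines in the style of rendered help text (not real docs)
sample_lines <- c(
  "     *[Superseded]*",
  "     .by: *[Experimental]*",
  "     Development on this function is complete.",
  "     The word Deprecated alone is not a badge."
)

# Which lines contain a badge?
has_badge <- grepl(badges_rx, sample_lines)

# Pull out just the lifecycle stage from the matching lines
stages <- sub(paste0(".*", badges_rx, ".*"), "\\1", sample_lines[has_badge])
```

Only the first two lines should match, because the regex insists on the full asterisk-and-bracket wrapper and not just the stage name.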

Help files

I went down rabbit holes trying to extract help files for each function, but a Stack Overflow solution by MrFlick is exactly what I was looking for. It grabs a function’s underlying Rd (‘R documentation’) help file and outputs it as a character vector with one element per line of text, thanks to a couple of functions from {tools}: the most underrated R package (prove me wrong).

# Function to extract function help file from Rd
get_help_text <- function(fn, pkg) {
  
  # Locate the help topic and the package's Rd database
  file <- help(fn, (pkg))  # parentheses force evaluation of 'pkg'
  path <- dirname(file)
  rd_db <- file.path(path, pkg)
  
  # Read the parsed function docs (Rd) from the database
  rd <- tools:::fetchRdDB(rd_db, basename(file))  # unexported function (':::')
  
  # Convert Rd to plain text and capture it as strings
  capture.output(
    tools::Rd2txt(rd, out = "", options = list(underline_titles = FALSE))
  )
  
}

Here’s a demo showing the description block of the function documentation for tidyr::gather(), which was superseded by tidyr::pivot_longer(). You can see how the ‘Superseded’ badge is represented: surrounded by square brackets and asterisks. That’s the pattern we’ll need to search for.

get_help_text("gather", "tidyr")[3:13]

 [1] "Description:"                                                             
 [2] ""                                                                         
 [3] "     *[Superseded]*"                                                      
 [4] ""                                                                         
 [5] "     Development on 'gather()' is complete, and for new code we"          
 [6] "     recommend switching to 'pivot_longer()', which is easier to use,"    
 [7] "     more featureful, and still under active development. 'df %>%"        
 [8] "     gather(\"key\", \"value\", x, y, z)' is equivalent to 'df %>%"       
 [9] "     pivot_longer(c(x, y, z), names_to = \"key\", values_to = \"value\")'"
[10] ""                                                                         
[11] "     See more details in 'vignette(\"pivot\")'."

And here’s how the text is laid out for an argument:

get_help_text("mutate", "dplyr")[46:50]

[1] "     .by: *[Experimental]*"                                            
[2] ""                                                                      
[3] "          <'tidy-select'> Optionally, a selection of columns to group" 
[4] "          by for just this operation, functioning as an alternative to"
[5] "          'group_by()'. For details and examples, see ?dplyr_by."
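If you're wondering where the asterisks come from, here's a minimal sketch. It assumes the badge's plain-text fallback is written in the Rd source as \strong{[Superseded]}, which the text converter wraps in asterisks. The Rd page below is entirely made up (oldfn is a hypothetical function), so this shows the mechanism and isn't a copy of any real help file.

```r
# A made-up, minimal Rd page for a hypothetical function 'oldfn'.
# Assumption: the badge's plain-text fallback is \strong{[Superseded]}.
rd_lines <- c(
  "\\name{oldfn}",
  "\\alias{oldfn}",
  "\\title{An old function}",
  "\\description{",
  "\\strong{[Superseded]}",
  "",
  "Please use something newer.",
  "}"
)

# Write to a temporary file and parse it into an Rd object
rd_file <- tempfile(fileext = ".Rd")
writeLines(rd_lines, rd_file)
rd <- tools::parse_Rd(rd_file)

# Render to plain text and capture it, in the same way as get_help_text()
txt <- capture.output(
  tools::Rd2txt(rd, out = "", options = list(underline_titles = FALSE))
)

# Keep only the line(s) that carry the badge
badge_line <- grep("\\*\\[Superseded\\]\\*", txt, value = TRUE)
```

The captured text should include a line containing the asterisk-wrapped badge, which is the pattern the regex from earlier is built to find.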

Loop-de-loop

So, the premise is to iterate over each package and, within each one, iterate through the functions to read their help pages and find any lifecycle badges. This’ll output a list (with an element per package) of lists (an element per function).

Note that I’m retrieving help files from my local computer, having already downloaded the tidyverse packages with install.packages("tidyverse").

There’s always discourse in the R community about for loops. So, as a special surprise, I decided to put a for loop in a for loop (yo dawg)³. I even pre-allocated my vectors, which is for nerds.

# Prepare 'outer' list, where each element is a package
pkg_badges <- vector(mode = "list", length = length(pkg_names))
names(pkg_badges) <- pkg_names

# Iterate over each package to get lifecycle badge usage
for (pkg in pkg_names) {
  
  # Extract package function names
  library(pkg, character.only = TRUE)
  pkg_env <- paste0("package:", pkg)
  fn_names <- ls(pkg_env)
  
  # Ignore these particular functions, which caused errors, lol
  if (pkg == "lubridate") {
    fn_names <- fn_names[!fn_names %in% c("Arith", "Compare", "show")]
  }
  
  # Prepare 'inner' list, where each element is a function
  fn_badges <- vector(mode = "list", length = length(fn_names))
  names(fn_badges) <- fn_names
  
  # Iterate over each function to get lifecycle badge usage
  for (fn in fn_names) {
    
    message(pkg, "::",  fn)
    
    txt <- get_help_text(fn, pkg)  # fetch help file
    lines_with_badges <- grep(badges_rx, txt)  # find rows that contain badges
    
    badge_lines <- character(0)  # default to no badges (empty vector)
    
    # If lines with badges exist, then extract the text
    if (length(lines_with_badges) > 0) {
      badge_lines <- trimws(txt[lines_with_badges])
      badge_lines <- sub("\\*[^\\*]+$", "", badge_lines)  # drop text after badge
    }
    
    fn_badges[[fn]] <- badge_lines  # add to inner list of functions
    
  }
  
  pkg_badges[[pkg]] <- fn_badges  # add to outer list of packages
  
  detach(pkg_env, character.only = TRUE)  # unclutter the search path
  
}
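If you fancy doing the refactor, the badge-extraction step could be pulled out of the inner loop into a small function. This is only a sketch: extract_badges() and fake_help are hypothetical names of my own, and the help text is made up so the block runs without any packages installed.

```r
# Same badge regex as above, rebuilt so this block runs standalone
life_names <- c(
  "Deprecated", "Experimental", "Superseded",
  "Stable", "Maturing", "Questioning", "Soft-deprecated"
)
badges_rx <- paste0("\\*\\[(", paste(life_names, collapse = "|"), ")\\]\\*")

# Hypothetical helper: return the badge lines from one help page's text
extract_badges <- function(txt, rx = badges_rx) {
  hits <- trimws(grep(rx, txt, value = TRUE))  # keep only lines with a badge
  sub("\\*[^\\*]+$", "", hits)  # drop any text trailing the badge
}

# Made-up help text, keyed by hypothetical function name
fake_help <- list(
  old_fn = c("Description:", "", "     *[Superseded]*", "", "     Some text."),
  new_fn = c("Description:", "", "     No badge here."),
  arg_fn = c("Arguments:", "", "     vars: *[Deprecated]* Use something else.")
)

# One element per function; empty character vector where there's no badge
fn_badges <- lapply(fake_help, extract_badges)
```

Note that the sub() call swallows the badge's closing asterisk whenever text follows the badge on the same line, so arg_fn comes back as 'vars: *[Deprecated]' with no trailing asterisk.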

So here’s gather() again, with that ‘Superseded’ badge extracted, as expected. The list element will be empty if there’s no badge.

pkg_badges$tidyr$gather

[1] "*[Superseded]*"

And here’s how the badge for an argument looks in that .by example:

pkg_badges$dplyr$mutate

[1] ".by: *[Experimental]*"

Entabulate

We can convert this to a dataframe for presentational and manipulational purposes. I’m choosing to do that with stack(unlist()), mostly because I haven’t had a chance to use stack() in this blog yet. Handily, this approach also removes all the empty list elements for us.

life_df <- stack(unlist(pkg_badges))  # stack is a nice function
head(life_df)

                values                      ind
1     *[Experimental]* dbplyr.get_returned_rows
2     *[Experimental]* dbplyr.has_returned_rows
3  vars: *[Deprecated]      dbplyr.partial_eval
4 cte: *[Experimental]        dbplyr.remote_con
5 cte: *[Experimental]       dbplyr.remote_name
6 cte: *[Experimental]      dbplyr.remote_query
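To see what stack(unlist()) is doing, here's a toy sketch with a made-up nested list. The package and function names (pkga, fn_one and so on) are hypothetical and don't come from the results.

```r
# Made-up nested list: packages > functions > badge lines (all hypothetical)
toy <- list(
  pkga = list(
    fn_one = "*[Superseded]*",
    fn_two = character(0)  # no badge
  ),
  pkgb = list(
    fn_three = c("x: *[Deprecated]", "y: *[Experimental]*")  # two badge lines
  )
)

# Flatten to a named character vector; empty elements simply vanish
flat <- unlist(toy)

# Convert to a data frame with columns 'values' and 'ind'
toy_df <- stack(flat)
```

The names are joined with a dot (pkga.fn_one), and unlist() appends a number when one function has several badge lines (pkgb.fn_three1, pkgb.fn_three2). That suffix carries through to the function names later on, which would explain a name like across1.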

Then we can do a bit of awkward string manipulation to get each package name, function name, argument names (if relevant) and the associated lifecycle badge(s).

# Uncouple 'tidyr.gather' to 'tidyr' and 'gather'
life_df$Package <- sub("\\..*", "", life_df$ind)
life_df$Function <- sub(".*\\.", "", life_df$ind)

# Clean off the '*[]*' from the lifecycle badge text
life_df$values <- gsub("(\\[|\\]|\\*)", "", life_df$values)

# Arg names are captured as a string before the lifecycle badge
life_df$Args <- gsub(life_names_rx, "", life_df$values)
life_df$Args <- trimws(gsub(":", "", life_df$Args))
life_df$Args[life_df$Args == ""] <- NA

# Badges appear after args (if any)
life_df$Badges <- trimws(sub(".*\\:", "", life_df$values))
life_df$Badges <- gsub(" ", ", ", life_df$Badges)

# Select and reorder
life_df <- life_df[, c("Package", "Function", "Args", "Badges")]
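One thing to watch with the uncoupling step: function names can contain dots too. A small sketch of the difference, using illustrative 'package.function' strings that I've typed by hand (they aren't rows from the results):

```r
# Illustrative 'package.function' strings (not results from the post)
ind <- c("tidyr.gather", "dplyr.mutate", "lubridate.as.duration")

# Package: everything before the first dot (as in the post)
pkg <- sub("\\..*", "", ind)

# Function, the post's way: everything after the LAST dot
fn_last_dot <- sub(".*\\.", "", ind)

# Alternative: remove only the package prefix, up to the FIRST dot
fn_first_dot <- sub("^[^.]*\\.", "", ind)
```

The last-dot version turns 'lubridate.as.duration' into 'duration', while the first-dot version keeps 'as.duration'. The alternative relies on package names themselves having no dots, which holds for the tidyverse packages listed earlier.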

So now we have a table with one row per badge found, showing the package, the function and any arguments the badge applies to:

head(life_df)

  Package          Function Args       Badges
1  dbplyr get_returned_rows <NA> Experimental
2  dbplyr has_returned_rows <NA> Experimental
3  dbplyr      partial_eval vars   Deprecated
4  dbplyr        remote_con  cte Experimental
5  dbplyr       remote_name  cte Experimental
6  dbplyr      remote_query  cte Experimental

Results

Here’s an interactive table of the results. You can click the function name to be taken to the rdrr.io website, which hosts package help files in HTML on the web. Note that this won’t always resolve to a functioning URL for various reasons! If you’ve installed the tidyverse packages, you can of course see a function’s help page by running e.g. ?tidyr::gather.

# Factors allow dropdown search in {DT}
life_df[names(life_df)] <- lapply(life_df[names(life_df)], as.factor)

# Build URL path to rdrr.io docs
life_df$Function <- paste0(
  "<a href='https://rdrr.io/cran/", 
  life_df$Package, "/man/", life_df$Function, ".html'>",
  life_df$Function, "</a>"
)

# Build interactive table
DT::datatable(
  life_df, 
  filter = "top",
  options = list(autoWidth = TRUE, dom = "tp"),
  escape = FALSE  # render the HTML links rather than escaping them
)

    Package  Function           Args  Badges
1   dbplyr   get_returned_rows        Experimental
2   dbplyr   has_returned_rows        Experimental
3   dbplyr   partial_eval       vars  Deprecated
4   dbplyr   remote_con         cte   Experimental
5   dbplyr   remote_name        cte   Experimental
6   dbplyr   remote_query       cte   Experimental
7   dbplyr   remote_query_plan  cte   Experimental
8   dbplyr   remote_src         cte   Experimental
9   dbplyr   src_dbi                  Superseded
10  dplyr    across1            ...   Deprecated

(First 10 rows shown; the interactive table runs to 55 pages.)

Death

You can see a few patterns. For example:

some packages are not represented here at all, while others appear a lot (e.g. {googledrive} has a large number of deprecated functions, maybe due to a change to the API, or overhaul of package design?)
‘Questioning’ is still being used in {rlang}, despite not being part of the {lifecycle} system
{rlang} curiously has functions that are both ‘Experimental’ and ‘Soft-deprecated’ (perhaps an example of trying something and realising it wasn’t the right fit?)
sometimes more than one argument gets a badge, which can happen when the same help page is used by multiple functions (e.g. the help page for slice() and family has ‘Experimental’ for .by, by⁴, use of which differs depending on the exact function)

Plus some other stuff I’m sure you can fathom out yourself.

Of course, this all assumes that the badges are used consistently by developers across the suite of tidyverse packages. The method I used may also miss badges I’m not aware of, like the ‘Soft-deprecated’ example mentioned earlier.

Regardless, the general approach outlined in this post might be useful for exploring other aspects of help pages, like the use of certain terms, grammar or writing styles. Documentation was a theme of the recent R Project Sprint 2023, after all.

Of course, it helps to keep badged functions around so that people’s code remains reproducible. The downside is the potential for clutter and confusion, though the tidyverse packages sometimes warn you when something is old hat and suggest the preferred new method⁵.

But I think it’s an even better idea to keep these vestiges around to remind us that we all make mistakes. Oh, and, of course, that ✨ nothing is permanent ✨.

Environment

Session info

Last rendered: 2023-09-11 14:19:57 BST

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] gtable_0.3.3        bslib_0.5.0         xfun_0.39          
 [4] ggplot2_3.4.2       htmlwidgets_1.6.2   gargle_1.5.2       
 [7] tzdb_0.4.0          crosstalk_1.2.0     vctrs_0.6.3        
[10] tools_4.3.1         generics_0.1.3      tibble_3.2.1       
[13] fansi_1.0.4         pkgconfig_2.0.3     data.table_1.14.8  
[16] tidyverse_2.0.0     dbplyr_2.3.3        readxl_1.4.3       
[19] lifecycle_1.0.3     compiler_4.3.1      stringr_1.5.0      
[22] textshaping_0.3.6   munsell_0.5.0       sass_0.4.7         
[25] htmltools_0.5.5     yaml_2.3.7          jquerylib_0.1.4    
[28] pillar_1.9.0        tidyr_1.3.0         ellipsis_0.3.2     
[31] DT_0.28             googlesheets4_1.1.1 cachem_1.0.8       
[34] tidyselect_1.2.0    rvest_1.0.3         conflicted_1.2.0   
[37] digest_0.6.33       stringi_1.7.12      dplyr_1.1.2        
[40] purrr_1.0.1         forcats_1.0.0       fastmap_1.1.1      
[43] grid_4.3.1          colorspace_2.1-0    cli_3.6.1          
[46] magrittr_2.0.3      utf8_1.2.3          broom_1.0.5        
[49] readr_2.1.4         withr_2.5.0         scales_1.2.1       
[52] backports_1.4.1     lubridate_1.9.2     googledrive_2.1.1  
[55] timechange_0.2.0    rmarkdown_2.23      modelr_0.1.11      
[58] httr_1.4.6          cellranger_1.1.0    ragg_1.2.5         
[61] hms_1.1.3           memoise_2.0.1       evaluate_0.21      
[64] knitr_1.43.1        haven_2.5.3         dtplyr_1.3.1       
[67] rlang_1.1.1         glue_1.6.2          DBI_1.1.3          
[70] xml2_1.3.5          reprex_2.0.2        rstudioapi_0.15.0  
[73] jsonlite_1.8.7      R6_2.5.1            systemfonts_1.0.4  
[76] fs_1.6.3

Footnotes

¹ Ethan White wondered aloud recently if people are teaching learners to ungroup() then summarise(), or to use the ‘experimental’ .by argument within summarise() itself. Opinion: typically I prefer to avoid ‘deprecated’ or ‘superseded’ functions when teaching, like the mutate_*() suite that became mutate(across()). I’m a little wary of anything ‘experimental’ for teaching, for similarish reasons. But I do personally use them.
² I assume a running list of these functions/args must already exist, or this has already been explored by a third party. But forget them; we’re here to have fun!
³ Yeah, this approach is pretty awkward. Basically I was noodling around with some code and then realised I don’t really care to refactor it. That could be a nice treat for you instead.
⁴ Having mentioned teaching earlier, could this be awkward for learners? How do you teach that sometimes it’s by and sometimes it’s .by, especially when the same family of functions (like slice()) is inconsistent? You should teach people to look at help files, sure, but it would be nice if it was always predictable.
⁵ I’ll leave the grumbling to you about whether all this chopping and changing of functions and arguments is A Good Thing or not; that’s not what this post is about.

Reuse

CC BY-NC-SA 4.0