Translate R to English with {r2eng}

Hexagonal logo for the r2eng package showing the package name inside a speech bubble with the URL matt-dray.github.io/r2eng.

tl;dr

I created the work-in-progress {r2eng} package (source, site) to help translate R expressions to speakable English. Inspired by Amelia McNamara and with a huge amount of help from Chung-hong Chan.

Communication is hard

Amelia McNamara (site, Twitter) gave a talk at the useR! 2020 conference called ‘Speaking R’. Watch the video on YouTube, or take a look at the slides.

To summarise greatly: R code should be speakable so that we can teach, learn and communicate with minimal friction.

‘Speakable’ means I should be able to read an R expression to you as an English sentence (rather than reading out individual characters) and we should be able to understand each other.

But this is all easier said than… said.

Ideally we’d have an agreed dictionary that maps each R token to an equivalent English phrase. But there will always be variation between users and across communities; between beginners and aficionados; and given differences in spoken language.

Then there’s context. We might agree that the operator %>% is called a ‘pipe’, but you might say the word ‘then’ when reading an expression. So data %>% print() might be spoken as ‘data then print’.

The order of parsing may also differ between people. Something as simple as x <- 1 could be ‘x gets 1’, ‘assign the value 1 to the variable named x’, or something else entirely. Now imagine that on the scale of an entire script.

I don’t think we can completely ‘solve’ any of this, but we could probably develop the conversation with the help of a tool accessible from R itself.

Hello {r2eng}

This is where the {r2eng} package comes in. The goal is to take an R expression and translate it to an equivalent speakable, English sentence, from within R itself.

The initial focus has been to:

  • concentrate on translating to English (I’m biased)
  • map the most common R operators (e.g. assignment, maths, brackets)
  • use commonly-used (but currently opinionated1) English translations for R operators (‘gets’ for <-)
  • work on a simple one-to-one, left-to-right mapping of R to English
  • keep the API simple, so you just supply an expression and get a result
  • simply begin an approach to address what was raised in Amelia’s presentation

Obviously the package is not finished, let alone perfect, and it requires more theoretical and practical considerations to make it truly useful. Clearly I don’t have all the answers and I’m certainly not an arbiter of language, but I think the purpose is defined and it certainly works for simple cases.

At worst, I hope the package will encourage discussion. There’s been some interest on Twitter and on the RStudio Community site, but do consider contributing thoughts and ideas to the {r2eng} GitHub repo.

In this vein, it’s important to highlight Chung-hong Chan’s valuable contributions to the package. Check out his website and find him on Twitter. Chung-hong was responsible for updating {r2eng} to handle non-standard evaluation in the translate() function and for replacing my original simple R-to-English lookup with a proper parsing method for interpreting R expressions.

Installation

You can install the package from GitHub using the {remotes} package. The package is in development version 0.0.0.9005 at the time of writing and this post discusses functionality for that version. Things may change.

install.packages("remotes")  # if not already installed
remotes::install_github("matt-dray/r2eng")

How it works

Recognising tokens

The secret sauce of the package is that it recognises the ‘tokens’ that make up an R expression. So the assignment operator, <-, is recognised as a single token rather than the less-than (<) and hyphen (-) characters typed sequentially.

This power is brought to {r2eng} thanks to {lintr}, a package by Jim Hester that assesses your code for possible errors and style improvements.

An important part of this process is parsing R expressions and recognising tokens using the lintr::get_source_expressions() function. For example, that <- operator is recognised as the special token LEFT_ASSIGN under the hood.

This is some deep R magic. You can see a special grammar file file in the R source code that carries these mappings.2

Put simply, {r2eng} hijacks this process, adds a step that maps each token to English terms, then recombines the text into a sentence.

Speech

By default, {r2eng} will translate an R expression and then your computer will speak it out loud.

This is relatively straightforward on a Mac: the resulting English text is handed to your machine with a system() call and is vocalised by the built-in VoiceOver text-to-speech converter. This functionality is not built into Windows by default, so the system() call fails silently.

Of course, this assumes that VoiceOver will do a good job of parsing the English expression from {r2eng}, but that isn’t guaranteed because of issues like localised pronunciation and uncommon words. I’ve written before about how text-to-speech isn’t necessarily that good at recognising R package names, for example.

In theory, and assuming that the translation gets good, {r2eng} could be used as a simple accessibility tool because it can interpret R-specific tokens in a way that VoiceOver cannot.

Using {r2eng}

You can find further examples in the package README, but I’ll explain here the main functionality.

First I’ll attach the package after having installed it.

library(r2eng)

The translate() function

There are two main functions in {r2eng}: translate() and translate_string(). They convert from a bare or quoted R expression, respectively, to English.

translate() takes advantage of non-standard evaluation: you pass bare (i.e. unquoted) R code to the expression argument and a few things happen.

translate(data <- 1 + 1)
## Original expression: data <- 1 + 1
## English expression: data gets 1 plus 1

First, it prints both the original R expression and the corresponding English translation. Second, and only if you are using macOS, the English string is passed to a system command that vocalises the string.

Here’s a more complex example using some {dplyr} functions. Note that we don’t need to attach {dplyr} to be able to translate.

translate(
  data %>% select(x, y) %>% dplyr::filter(y == "example"),
  function_call_end = ""
)
## Original expression: data %>% select(x, y) %>% dplyr::filter(y == "example")
## English expression: data then select open paren x , y close paren then dplyr double-colon filter open paren y double equals string "example" close paren

Note the function_call_end argument in this example. The default is "of ", which would would have made the translation data then select of open paren x y close paren, etc. Feedback suggested that this was how some users spoke the English translation out loud, so the functionality has been included for now.

While translate() takes a bare expression, translate_string() takes a string.

library(magrittr)  # attach for pipes
"data<-1+1" %>%
  translate_string(speak = FALSE)
## Original expression: data <- 1 + 1
## English expression: data gets 1 plus 1

We get the same sort of output as the translate() example, but this time I set the speak argument to FALSE to stop the English text from being ‘spoken’ by my computer and I also eliminated spaces from the input to demonstrate that they’re ignored.

r2eng objects

Of course, you can assign a translation to an object.

my_translation <- 
  translate(data <- 1 + 1, speak = FALSE)

The object is a special r2eng S3 class of object, which is also a list.

class(my_translation)
## [1] "list"  "r2eng"

You can apply some methods to such an object: print(), speak() and evaluate().

Print the object to see that custom console output again:

print(my_translation)
## Original expression: data <- 1 + 1
## English expression: data gets 1 plus 1

Re-call the system command that ‘speaks’ the R expression out loud (again, only on macOS).

speak(my_translation)

And you can actually evaluate the R expression you supplied to translate() in the first place. So the expression we supplied, data <- 1 + 1, is evaluated so that calling data gives us the result of 1 + 1.

evaluate(my_translation)
data
## [1] 2

r2eng-class list elements

You can also access the R expression, the English translation and the mapping between them as elements of your r2eng list object.

Here’s the R expression again:

my_translation$r_expression
## [1] "data <- 1 + 1"

And the translated output:

my_translation$eng_expression
## [1] " data gets 1 plus 1 "

And a data.frame object that contains the mapping.

map <- my_translation$translation_map
map[map$text != "", ]  # filter out text spaces
##         token text  eng
## 1      SYMBOL data data
## 2 LEFT_ASSIGN   <- gets
## 4   NUM_CONST    1    1
## 6         '+'    + plus
## 7   NUM_CONST    1    1

This element is a good summary of what {r2eng} is doing under the hood: it breaks the R expression into recognised tokens and maps words to them where it knows what the corresponding English for that token should be. So <- is recognised as the token LEFT_ASSIGN, which is mapped internally to the English text gets.

Bonus material

RStudio Addin

The print() and speak() functions can be accessed via an RStudio addin that’s installed with the package (you may need to restart RStudio after installation). To use them, highlight an R expression in your script and select from the RStudio addins menu:

  • ‘Speak R Expression In English’ to vocalise the expression through your computer’s speakers (macOS only)
  • ‘Print R Expression In English’ to output an English translation to the console

These can be mapped to keyboard shortcuts so you can highlight and translate quickly without specifically calling translate() and print() or speak().

Binder demo

Maybe you don’t want to install the package. That’s fine. Instead, you can try out the package by opening this Binder instance of RStudio with {r2eng} and the tidyverse pre-installed. Click this badge to launch it:

Launch Rstudio Binder

The downside is that you can’t use the speak method. Make sure to set the argument speak = FALSE in translate() or translate_string(), or you’ll get a warning message when you run your code.

You can find the source for this at the try-r2eng GitHub repo and you can read one of my posts on how to set up Binder using Karthik Ram’s {holepunch} package.

There’s lots to do

As mentioned, this is really just the beginning.

There’s plenty of room for simple improvements, as well as some long term possibilities. For example, we could:

  • expand the list of R tokens that can be translated
  • allow for context-dependent translation where the same token can be used for more than one application
  • do user-research to find out the most common English terms used
  • provide instructions to make the speak method possible on non-Mac platforms
  • include full script awareness so that we aren’t limited by left-to-right interpretation
  • expand the package for non-English languages, or create {r2es}, {r2fr}, etc
  • etc

Hopefully we can work on some of these things and this won’t be the last post about {r2eng} on this blog. In the meantime, do consider contributing to the GitHub repo.


Session info

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       macOS Mojave 10.14.6        
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  ctype    en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2020-11-15                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version    date       lib source        
##  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
##  backports      1.1.8      2020-06-17 [1] CRAN (R 4.0.0)
##  blogdown       0.19       2020-05-22 [1] CRAN (R 4.0.0)
##  bookdown       0.19       2020-05-15 [1] CRAN (R 4.0.0)
##  callr          3.4.3      2020-03-28 [1] CRAN (R 4.0.0)
##  cli            2.0.2      2020-02-28 [1] CRAN (R 4.0.0)
##  crayon         1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
##  cyclocomp      1.1.0      2016-09-10 [1] CRAN (R 4.0.2)
##  desc           1.2.0      2018-05-01 [1] CRAN (R 4.0.0)
##  digest         0.6.25     2020-02-23 [1] CRAN (R 4.0.0)
##  evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.0)
##  fansi          0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
##  glue           1.4.1      2020-05-13 [1] CRAN (R 4.0.0)
##  htmltools      0.5.0      2020-06-16 [1] CRAN (R 4.0.2)
##  knitr          1.29       2020-06-23 [1] CRAN (R 4.0.2)
##  lazyeval       0.2.2      2019-03-15 [1] CRAN (R 4.0.0)
##  lintr          2.0.1      2020-02-19 [1] CRAN (R 4.0.2)
##  magrittr     * 1.5        2014-11-22 [1] CRAN (R 4.0.0)
##  processx       3.4.3      2020-07-05 [1] CRAN (R 4.0.2)
##  ps             1.3.3      2020-05-08 [1] CRAN (R 4.0.0)
##  purrr          0.3.4      2020-04-17 [1] CRAN (R 4.0.0)
##  r2eng        * 0.0.0.9004 2020-08-02 [1] local         
##  R6             2.4.1      2019-11-12 [1] CRAN (R 4.0.0)
##  remotes        2.2.0      2020-07-21 [1] CRAN (R 4.0.2)
##  rex            1.2.0      2020-04-21 [1] CRAN (R 4.0.0)
##  rlang          0.4.7      2020-07-09 [1] CRAN (R 4.0.2)
##  rmarkdown      2.1        2020-01-20 [1] CRAN (R 4.0.0)
##  rprojroot      1.3-2      2018-01-03 [1] CRAN (R 4.0.0)
##  sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
##  stringi        1.4.6      2020-02-17 [1] CRAN (R 4.0.0)
##  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
##  withr          2.2.0      2020-04-20 [1] CRAN (R 4.0.0)
##  xfun           0.16       2020-07-24 [1] CRAN (R 4.0.2)
##  xml2           1.3.2      2020-04-23 [1] CRAN (R 4.0.0)
##  xmlparsedata   1.0.3      2019-09-27 [1] CRAN (R 4.0.2)
##  yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

  1. A good example of opinionated word choice is this character: (. Is it an ‘open bracket’, ‘open parenthesis’, ‘open paren’, ‘mouth for an emoticon smiley’, or something else entirely? {r2eng} uses ‘open paren’, largely for brevity.↩︎

  2. I first saw the gram.y file in action in Andrew Craig’s interesting TokyoR talk about extending R to accept function definitions in the form (x) => x + 1.↩︎