7 min read

Mail merge in 2018 with R

Matt Dray

Gif of cool guy ready to surf the web

Two-thousand and late

Clip art! Fax machines! CD-ROMs! Dial-up modems! The World Wide Web! Mail merge!

These exotic terms give me flashbacks to computer class at the turn of the millennium.

The last one–mail merge–was taught as a way to autocomplete template Word letters for made-up customers in an Access database. I no longer need to invoice ‘Mr Sonic The-Hedgehog’ for ordering ‘25 chili dogs’, but the general approach of generating the same template multiple times with different information is still pertinent.

I’ve helped a few colleagues with this recently to avoid the need for hours of copy-pasting. R is a natural choice for automating and maximising the reproducibility of this kind of task.

Who will she choose?

To keep to the theme, I’ll be autogenerating reports that detail each episode of the first season of the hit late-90s-early-2000s-weirdly-articulate-teen-angst-love-triangle masterpiece that is Dawson’s Creek. Data via Wikipedia.

I’ve created an accompanying GitHub repository where you can download or clone a whole R Project that replicates the approach used in this post.

Approach

We’ll make an R Markdown template and fill it with data for each episode of Dawson’s Creek by looping through each of them in turn. Each will be rendered to a Word (.docx) file and formatted according to a style template that we’ve created and specified.

I don’t claim this to be the most efficient method, but it’s useful for beginners and, well, it works.

Your working directory should contain three important files:

  1. your letter or report template prepared in R Markdown (filetype .Rmd)
  2. your style-reference document (.docx) that we’ll put in a folder called ‘style’
  3. a script file for reading your data and looping through each element (.R)

The folder should also contain:

  • your R Project file (.Rproj)
  • a folder to hold the output files (e.g. ‘output’)
  • any additional folders you need for input files (e.g. ‘data’ and ‘images’)
  • a README file to explain the purpose of your work (.md or something)

Something like this:

Screenshot of the folders needed in the directory

1. The template

Your report template should be an R Markdown file. In short, R Markdown files allow you to write plain text formatted with special symbols and insert R code inline or in ‘chunks’. You can find out more about R Markdown from the RStudio website. I’ve also written a short guide to writing R Markdown documents and created a small set of slides, both with beginners in mind.

The R code in your template should refer to the subset of data for each element of your dataset that you’re generating reports for. In our case we need to refer to episodes of the Dawson’s Creek dataset.

We’ll be subsetting the data to create one-row dataframes of each episode, so it’s this episode object that the template should refer to.

The template itself should start with a YAML header section that indicates the output type (word_document) and document containing style information (at the path style/dawson_style.docx in our case). (We’ll look at this style document in the next section.) You can of course add add a title, name, etc, to the YAML section if you like.

---
output:
  word_document:
    reference_docx: style/dawson_style.docx
---

You’re then free to add script to the body of the R Markdown file as you would normally. For example, the following code in the template will be filled with the episode’s production code and title:

`r paste0(episode$production_code, ": '", episode$title, "'")`

We can also do someting like this inside an R Markdown code chunk to get a table containing information about the episodes:

episode %>% 
  select(  # choose these columns for display
    Season = season,
    `Number in season` = number_in_season,
    `Number in series` = number_in_series,
    `Original air date` = original_air_date
  ) %>% 
  kable()  # print to output table

2. The style

The default output when rendering R Markdown to .docx is a bit boring. You can create a separate dummy Word document that contains modified styles to suit your theme.

You can read more about creating a style document in Richard Layton’s ‘happy collaboration with Rmd to docx’ on the R Markdown site. Basically you create a fake document in which you’ve set the formatting for each style element (title, third-level header, paragraph text, etc), store it in your working repository and refer to it from the YAML of your R Markdown report (see secton above).

Here I stole and modified a template from the dfeR package, which is intended for use in presenting official statistics.

Screenshot of the .docx style document

3. Read/loop

We now have the template and style in place.

You should have an R script (.R) file that (1) reads the data and (2) executes a loop that applies episode data to the template and saves it with a unique name to a specified subfolder. You’ll get one document for each unique element, so we’ll get 13 because there are 13 rows in our dataset; one for each episode.

After reading in the data with something like data <- readr::read_csv(file = "data/dawsons-creek-season-1.csv") you can write a loop that looks something like this:

for (i in data$production_code){  # for each unique episode...
  
  # Isolate that episode from the dataset
  
  episode <- data[data$production_code == i, ]  
  
  # The one-row dataframe object 'episode' we just created will be used as the
  # source for filling the details in on the template file
  
  # Now render ('knit') the R Markdown file to Word format, name it and save it
  
  render(
    input = "01_template.rmd",  # path to the template
    output_file = paste0("episode_", i, ".docx"),  # name the output
    output_dir = "output"  # folder in which to put the output file
  )

}  # end of loop

So this code:

  • loops through episodes given unique values in the production_code column
  • subsets the data object (a dataframe of all epiosdes) to isolate the production code for each episode in turn
  • renders the template document after fillling it with the episode data
  • gives the file a name that incorporates the production code
  • saves this file to the ‘output’ folder in our project directory

Execute

Run the R script and the files will be rendered in turn according to the formatting in the style document, and added one by one to the output folder. It’s fun to open this folder and watch them pop into existence.

Congratulations! You now have a separate file that contains the information for each of the episodes in season 1 of Dawson’s Creek! Make sure to share with your friends (assuming you’re not in an awkward love triangle).

Screenshot of the first report showing episode 101 of Dawson's Creek Screenshot of the first report showing episode 102 of Dawson's Creek

R session info

I don’t wanna wait… for this post to be oooover

Gif of the titular Dawson crying

devtools::session_info()
##  setting  value                       
##  version  R version 3.4.2 (2017-09-28)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United Kingdom.1252 
##  tz       Europe/London               
##  date     2018-06-29                  
## 
##  package   * version date       source                          
##  backports   1.1.0   2017-05-22 CRAN (R 3.4.0)                  
##  base      * 3.4.2   2017-09-28 local                           
##  blogdown    0.6     2018-04-18 CRAN (R 3.4.4)                  
##  bookdown    0.5     2017-08-20 CRAN (R 3.4.3)                  
##  compiler    3.4.2   2017-09-28 local                           
##  datasets  * 3.4.2   2017-09-28 local                           
##  devtools    1.13.3  2017-08-02 CRAN (R 3.4.1)                  
##  digest      0.6.15  2018-01-28 CRAN (R 3.4.3)                  
##  evaluate    0.10    2016-10-11 CRAN (R 3.4.0)                  
##  graphics  * 3.4.2   2017-09-28 local                           
##  grDevices * 3.4.2   2017-09-28 local                           
##  htmltools   0.3.6   2017-04-28 CRAN (R 3.4.0)                  
##  knitr       1.20    2018-02-20 CRAN (R 3.4.4)                  
##  magrittr    1.5     2014-11-22 CRAN (R 3.4.0)                  
##  memoise     1.1.0   2017-04-21 CRAN (R 3.4.1)                  
##  methods   * 3.4.2   2017-09-28 local                           
##  Rcpp        0.12.17 2018-05-18 CRAN (R 3.4.4)                  
##  rmarkdown   1.9     2018-03-01 CRAN (R 3.4.4)                  
##  rprojroot   1.2     2017-01-16 CRAN (R 3.4.0)                  
##  stats     * 3.4.2   2017-09-28 local                           
##  stringi     1.1.7   2018-03-12 CRAN (R 3.4.4)                  
##  stringr     1.3.1   2018-05-10 CRAN (R 3.4.4)                  
##  tools       3.4.2   2017-09-28 local                           
##  utils     * 3.4.2   2017-09-28 local                           
##  withr       2.1.2   2018-06-13 Github (jimhester/withr@dbcd7cd)
##  xfun        0.1     2018-01-22 CRAN (R 3.4.4)                  
##  yaml        2.1.19  2018-05-01 CRAN (R 3.4.4)