Take a {ghdump} to download GitHub repos

gh
ghdump
github
r
Published: June 14, 2020

A silhouette of a dump truck dumping trash bags.

My garbage GitHub repos being dumped onto my local machine.

tl;dr

Run ghd_copy() from the {ghdump} package to either clone or download all the GitHub repositories for a given user. Intended for archival purposes or setting up a new computer.

The package comes with no guarantees and will likely be in a perpetual work-in-progress state. Please submit issues or pull requests.

Clone army

Situation:

  • Sometimes I get a new computer and want to clone all my repos to it
  • Sometimes I want to archive my repos so I’m not dependent on GitHub or on any given computer
  • It would be tedious to download or clone the repos one-by-one from the GitHub interface

Wants:

  • To clone (with HTTPS or SSH) or download all of my repos with one command
  • Be able to unzip downloaded repos en masse if I want to
  • Do all this from within R, mostly for the learning experience, but also to allow for user interactivity

Observations:

  • I don’t know of a specific R function that automates mass-downloading or mass-cloning of GitHub repos
  • The {gh} package provides a lightweight GitHub API wrapper for R that’s likely to be helpful
  • R has many file-handling functions that will be helpful

{ghdump}

The result is that I wrote a function, ghd_copy(), that copies (clones or downloads) all the repos for a given user to a specified location. You can get it in the tiny {ghdump} package.

The function interacts with the GitHub API thanks to the {gh} package by Gábor Csárdi, Jenny Bryan and Hadley Wickham, while iterating over repos comes thanks to the {purrr} package by Lionel Henry and Hadley Wickham.

Update

As of May 2022 there’s also a handy rOpenSci package called {gitcellar}, by Maëlle Salmon and Jeroen Ooms, which is for downloading an organisation’s repos for archival purposes.

Get and use

Install with:

remotes::install_github("matt-dray/ghdump")
library(ghdump)

To use the package, you’ll need a GitHub account and a GitHub Personal Access Token (PAT) stored in your .Renviron file. You can do this with the following steps:

usethis::create_github_token()  # opens browser to generate token (replaces the deprecated browse_github_pat())
usethis::edit_r_environ()       # add your token to the .Renviron as GITHUB_PAT=<your token>
# then restart R

You can use {ghdump} to download the repos for a specified user:

ghd_copy(
  gh_user = "matt-dray",           # download repos for this user
  dest_dir = "~/Documents/repos",  # full local file path to copy to
  copy_type = "download"           # "download" or "clone" the repos
)

Or clone them:

ghd_copy(
  gh_user = "matt-dray",
  dest_dir = "~/Documents/repos",
  copy_type = "clone",
  protocol = "https" # specify "https" or "ssh"
)

If you want to use the SSH protocol when cloning, you need to make sure that you’ve set up your keys.

Interactivity

My expectation is to use ghd_copy() infrequently and in a non-programmatic way, so I’ve made it quite interactive. This means user input is required; you’ll get some yes/no questions in the console that will affect how the function runs.

Here’s an imaginary demo of the output from ghd_copy() when copy_type = "download":

ghd_copy("made-up-user", "~/Desktop/test-download", "download")
Fetching GitHub repos for user made-up-user... 3 repos found
Create new directory at path ~/Desktop/test-download? y/n: y
Definitely download all 3 repos? y/n: y
Downloading zipped repositories to ~/Desktop/test-download

trying URL 'https://github.com/made-up-user/fake-repo-1/archive/master.zip'
Content type 'application/zip' length 100 bytes
==================================================
downloaded 100 bytes

trying URL 'https://github.com/made-up-user/fake-repo-2/archive/master.zip'
Content type 'application/zip' length 100 bytes
==================================================
downloaded 100 bytes

trying URL 'https://github.com/made-up-user/fake-repo-3/archive/master.zip'
Content type 'application/zip' length 100 bytes
==================================================
downloaded 100 bytes

Unzip all folders? y/n: y
Unzipping repositories
Retain the zip files? y/n: y
Keeping zipped folders.
Remove '-master' suffix from unzipped directory names? y/n: y
Renaming files to remove '-master' suffix
Finished downloading

And now an imaginary demo of the output from ghd_copy() when copy_type = "clone":

ghd_copy("made-up-user", "~/Desktop/test-clone", "clone", "ssh")
Fetching GitHub repos for user made-up-user... 3 repos found
Create new directory at path ~/Desktop/test-clone? y/n: y
Definitely clone all 3 repos? y/n: y
Cloning repositories to ~/Desktop/test-clone 
Cloning into 'fake-repo-1'...
Cloning into 'fake-repo-2'...
Cloning into 'fake-repo-3'...
Finished cloning

Note that cloning has only been tested on my own macOS machine at this point (June 2020) and is not guaranteed to work elsewhere yet. Please submit issues or pull requests to help improve this.

Under the hood

What are the steps to downloading repos with ghdump::ghd_copy()? None of the functions in this section are exported from the package, but you can access them by prefacing with ghdump::: (the rare triple-colon operator) if you want to see their code.

First, to get repo info:

  1. ghd_get_repos() passes a GitHub username to gh::gh(), which contacts the GitHub API to return a gh_response object that contains info about each of that user’s repos
  2. ghd_extract_names() takes the gh_response object from ghd_get_repos() and extracts the names into a character vector
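
As a rough sketch of those two steps (this is not the package's actual code), the helpers below use made-up names, fetch_repos() and extract_names(), and assume the GitHub REST endpoint GET /users/{username}/repos. The extraction is run on a hand-made list that mimics the shape of a gh_response, so the example works without touching the API.

```r
# Hypothetical helper: ask the GitHub API for a user's repos via {gh}.
# Needs the {gh} package and a network connection; not called in this sketch.
fetch_repos <- function(gh_user) {
  gh::gh(
    "GET /users/{username}/repos",
    username = gh_user,
    .limit = Inf  # keep paginating until every repo has been returned
  )
}

# Hypothetical helper: pull the "name" element out of each repo's info
extract_names <- function(repos) {
  vapply(repos, function(repo) repo[["name"]], character(1))
}

# A hand-made stand-in for the API response: a list with one element per repo
fake_response <- list(
  list(name = "fake-repo-1", fork = FALSE),
  list(name = "fake-repo-2", fork = FALSE),
  list(name = "fake-repo-3", fork = TRUE)
)

repo_names <- extract_names(fake_response)
repo_names
```

With a real account you would swap fake_response for fetch_repos("some-user").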

Then to download (if copy_type = "download"):

  1. ghd_enframe_urls() turns the character vector of repo names into a data frame, with a corresponding column that contains the URL to a zip file for that repo
  2. ghd_copy_zips() takes each zip file URL from that data frame and downloads it to the file path provided by the user
  3. ghd_unzip() unzips the zipped repos
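
Here is a minimal sketch of what the download steps could look like. The helper names (enframe_urls(), copy_zips()) are invented for illustration, and the URL pattern is taken from the demo output above, which assumes the default branch is called "master"; many newer repos use "main" instead.

```r
# Hypothetical helper: one row per repo, with the URL of its zip archive.
# The branch name is an assumption and may need changing per repo.
enframe_urls <- function(gh_user, repo_names, branch = "master") {
  data.frame(
    repo = repo_names,
    url = paste0(
      "https://github.com/", gh_user, "/", repo_names,
      "/archive/", branch, ".zip"
    ),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: download each zip into dest_dir.
# Not called here because it needs a network connection.
copy_zips <- function(repo_urls, dest_dir) {
  for (i in seq_len(nrow(repo_urls))) {
    utils::download.file(
      url = repo_urls$url[i],
      destfile = file.path(dest_dir, paste0(repo_urls$repo[i], ".zip")),
      mode = "wb"  # write as binary so the zip isn't corrupted on Windows
    )
  }
  invisible(repo_urls)
}

repo_urls <- enframe_urls("made-up-user", c("fake-repo-1", "fake-repo-2"))
repo_urls$url
```

Each downloaded file could then be unpacked with utils::unzip(zipfile, exdir = dest_dir).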

You can, of course, use these intermediate functions if you have slightly different needs. Maybe you want to limit the repos that are downloaded; do this by filtering the vector output from ghd_extract_names() for example.
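
For instance, assuming you already have a character vector of repo names (typed out by hand here rather than fetched), filtering is just ordinary subsetting:

```r
# Stand-in for a vector of repo names; these are made up
repo_names <- c("fake-repo-1", "fake-repo-2", "old-experiment")

# Keep only the repos whose names start with "fake-"
repos_to_copy <- repo_names[grepl("^fake-", repo_names)]
repos_to_copy
```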

Or to clone (if copy_type = "clone"):

  1. ghd_clone_multi() iterates over the repos, calling ghd_clone_one() to clone each one
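
The cloning route might be sketched as follows. This is my own illustration, not the package's code: build_clone_url(), clone_one() and clone_multi() are hypothetical names, and the sketch assumes git is installed and on your PATH.

```r
# Hypothetical helper: build the remote URL for either protocol
build_clone_url <- function(gh_user, repo, protocol = c("https", "ssh")) {
  protocol <- match.arg(protocol)
  if (protocol == "https") {
    paste0("https://github.com/", gh_user, "/", repo, ".git")
  } else {
    paste0("git@github.com:", gh_user, "/", repo, ".git")
  }
}

# Hypothetical helper: clone a single repo by shelling out to git
clone_one <- function(gh_user, repo, dest_dir, protocol = "https") {
  target <- path.expand(file.path(dest_dir, repo))  # expand "~" before quoting
  system2(
    "git",
    c("clone", build_clone_url(gh_user, repo, protocol), shQuote(target))
  )
}

# Hypothetical helper: iterate clone_one() over many repos
clone_multi <- function(gh_user, repo_names, dest_dir, protocol = "https") {
  for (repo in repo_names) {
    clone_one(gh_user, repo, dest_dir, protocol)
  }
  invisible(repo_names)
}

# Only the URL builder is run here; cloning needs git and a network connection
https_url <- build_clone_url("made-up-user", "fake-repo-1", "https")
ssh_url <- build_clone_url("made-up-user", "fake-repo-1", "ssh")
```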

Why bother?

What did I learn from doing this? As if I have to explain myself to you, lol.

1. Iteration

Aside from {gh}, the package also depends on {purrr} for iterative programming.

For example, the gh_response object output from ghdump:::ghd_get_repos() is passed to map() with the pluck() function to extract the repo names.

Another example is the use of walk(), which is like map(), except we use it when we care about some ‘side effect’ rather than a return value. By ‘side effect’, we mean that the function is called for what it does, not for an R object it gives back (walk() just returns its input invisibly). For example, we can walk() the unzip() function over the path to each zip file. This doesn’t give us anything new in R; it results in some local files being manipulated.
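
A small illustration of the difference, using a hand-made list in place of a real gh_response (so the repo names are invented and this isn't the package's code):

```r
library(purrr)

# Stand-in for the API response: one list element per repo
repos <- list(
  list(name = "fake-repo-1", fork = FALSE),
  list(name = "fake-repo-2", fork = TRUE)
)

# map_chr() + pluck(): collect the "name" element of each repo
# into a character vector
repo_names <- map_chr(repos, pluck, "name")

# walk(): call a function on each element purely for its side effect
# (printing a message here); the input comes back invisibly
returned <- walk(repo_names, function(x) message("Processing ", x))
```

I've used map_chr() rather than plain map() here so that the result is a character vector instead of a list.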

2. File manipulation

R can be used to interact with files on your computer. There are a number of these base R functions in the package:

  • dir.create() to create a new folder
  • file.remove() to remove a file or folder
  • list.files() and list.dirs() to return a character vector of files and folders at some path
  • file.rename() to change the name of a file or folder
  • unzip() to unpack a zipped folder
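
To see a few of these in action without risking any real files, here's a sketch that works inside a temporary directory. The folder and file names are made up, and the '-master' renaming mirrors the prompt in the demo output rather than the package's actual code.

```r
# Work in a throwaway folder inside the session's temp directory
root <- file.path(tempdir(), "ghdump-demo")
unlink(root, recursive = TRUE)  # start from a clean slate if re-run

# Pretend a repo has already been downloaded and unzipped
dir.create(file.path(root, "fake-repo-1-master"), recursive = TRUE)
writeLines("# fake-repo-1", file.path(root, "fake-repo-1-master", "README.md"))
file.create(file.path(root, "fake-repo-1.zip"))  # empty placeholder, not a real zip

# Find and remove the leftover zip files
zip_files <- list.files(root, pattern = "\\.zip$", full.names = TRUE)
file.remove(zip_files)

# Strip the '-master' suffix from the unzipped folder names
old_dirs <- list.dirs(root, recursive = FALSE)
new_dirs <- sub("-master$", "", old_dirs)
file.rename(old_dirs, new_dirs)

list.files(root, recursive = TRUE)
```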

3. User input

How do you ask questions of your user and get answers? This interactivity is made possible by readline(). You pass it a string to use as the prompt, and it returns whatever the user types as a character string, which you can store.

For example, this is how it looks in the console:

answer <- readline("Do you like pizza? ") 
Do you like pizza? yes
answer
[1] "yes"

Here the user has typed yes after the prompt on the second line.
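
A yes/no prompt like the ones in the demos could be built on top of readline() along these lines. This is a sketch with a hypothetical helper, ask_yes_no(), and not how the package necessarily does it. The read_fun argument is only there so that the function can be tried out with canned answers instead of a live console.

```r
# Hypothetical helper: keep asking until the user answers y or n,
# then return TRUE for "y" and FALSE for "n"
ask_yes_no <- function(question, read_fun = readline) {
  repeat {
    answer <- tolower(trimws(read_fun(paste0(question, " y/n: "))))
    if (answer %in% c("y", "n")) {
      return(answer == "y")
    }
    message("Please answer 'y' or 'n'.")
  }
}

# Try it with a canned answer in place of a real user
said_yes <- ask_yes_no("Unzip all folders?", read_fun = function(prompt) "Y")
said_yes
```

In an interactive session you would drop the read_fun argument and let readline() do the asking.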

4. Stickers

I’ve designed a few hex stickers with the {hexSticker} package; you can see them in my ‘stickers’ GitHub repo. This time I made the sticker for {ghdump} using Dmytro Perepolkin’s {bunny} package, which is a helper for the {magick} package from Jeroen Ooms. It’s a very smooth process with much flexibility.

This belongs in a dump

Yeah, maybe. It’s not sophisticated, but I’ve found it useful for my own specific purposes.

Environment

Session info
Last rendered: 2023-07-21 18:39:31 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.2 compiler_4.3.1    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.1       htmltools_0.5.5   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.23    knitr_1.43.1      jsonlite_1.8.7    xfun_0.39        
[13] digest_0.6.33     rlang_1.1.1       fontawesome_0.5.1 evaluate_0.21    

Reuse

CC BY-NC-SA 4.0