rostrum.blog - Typo-shaming my Git commits

[Image: a line-drawn monkey poking a typewriter. Caption: The author at work (CC BY-SA 3.0 by KaterBegemot)]

tl;dr

Nearly 10 per cent of the commits to this blog’s source involve typo fixes, according to a function I wrote to search commit messages via the {gh} package.

Note

Great news everyone, I improved. I re-rendered this post in July 2023 and the percentage had basically halved to 5%.

Not my typo

I’m sure you’ve seen consecutive Git commits from jaded developers like ‘fix problem’, ‘actually fix problem?’, ‘the fix broke something else’, ‘burn it all down’. Sometimes a few swear words will be thrown in for good measure (look no further than ‘Developers Swearing’ on Twitter).

The more obvious problem from reading the commits for this blog is my incessant keyboard mashing; I think a lot of my commits are there to fix typos.¹

So I’ve prepared a little R function to grab the commit messages for a specified repo and find the ones that contain a given search term, like ‘typo’.²

Search commits

{gh} is a handy R package from Gábor Csárdi, Jenny Bryan and Hadley Wickham that we can use to interact with GitHub’s REST API.³ We can also use {purrr} for iterating over the returned API object.

library(gh)    # CRAN v1.4.0 at last render
library(purrr) # CRAN v1.0.1 at last render

So, here’s one way of forming a function to search commit messages:

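search_commits <- function(owner, repo, string = "typo") {

  # Request every commit for the repo; .limit = Inf pages through all results
  commits <- gh::gh(
    "GET /repos/{owner}/{repo}/commits",
    owner = owner, repo = repo,
    .limit = Inf
  )

  # Extract the message from each commit as a character vector
  messages <- purrr::map_chr(
    commits, ~purrr::pluck(.x, "commit", "message")
  )

  # Keep messages that match the search string (a regex), ignoring case
  matches <- messages[grepl(string, messages, ignore.case = TRUE)]

  # Bundle the inputs, summary counts and messages into one list
  out <- list(
    meta = list(owner, repo),
    counts = list(
      match_count = length(matches),
      commit_count = length(messages),
      match_ratio = length(matches) / length(messages)
    ),
    matches = matches,
    messages = messages
  )

  return(out)

}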

First we pass a GET request to the GitHub API via gh::gh(). The API documentation tells us the form needed to get commits for a given owner’s repo.

Beware: the API returns results in pages (100 records per request in the output further down), but the .limit = Inf argument makes gh::gh() keep sending requests until everything is returned. For a repo with a long history, that means a lot of API calls.
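
If fetching the full history feels excessive, one option is to cap the number of records. The sketch below is a hypothetical variant of the request inside search_commits(); the function name get_recent_commits() and the max_records argument are mine, not part of {gh}. It relies only on the fact that .limit also accepts a finite number.

```r
# Hypothetical helper: same request as in search_commits(), but capped.
# get_recent_commits() and max_records are illustrative names, not part of {gh}.
get_recent_commits <- function(owner, repo, max_records = 200) {
  gh::gh(
    "GET /repos/{owner}/{repo}/commits",
    owner = owner, repo = repo,
    .limit = max_records  # a finite cap instead of Inf
  )
}

# Not run here because it needs network access (and ideally a GitHub token):
# recent <- get_recent_commits("matt-dray", "rostrum-blog")
# length(recent)
```

The trade-off is that any ratio you calculate then describes only the commits that were fetched rather than the whole history.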

Next we can use {purrr} to iteratively pluck() out the commit messages from the list returned by gh::gh(). It’s then a case of finding which ones contain a search string of interest (defaulting to the word ‘typo’).
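
To see the pluck-then-filter step without touching the API, here's a small sketch on a hand-made list. The toy_commits object is invented for illustration and only mimics the nesting used above (a "commit" element that contains a "message"); a real response carries many more fields.

```r
library(purrr)

# Toy stand-in for the list returned by gh::gh(): one element per commit,
# each with a nested "commit" list that holds the "message".
# All values here are made up.
toy_commits <- list(
  list(sha = "a1", commit = list(message = "Fix typo in README")),
  list(sha = "b2", commit = list(message = "Add new post")),
  list(sha = "c3", commit = list(message = "Correct Typos, tidy code"))
)

# Pull the nested message out of each commit as a character vector
toy_messages <- map_chr(toy_commits, ~pluck(.x, "commit", "message"))

# Keep the messages that contain the search string, ignoring case
toy_matches <- toy_messages[grepl("typo", toy_messages, ignore.case = TRUE)]
toy_matches
```

The first and third messages should come back, since ignore.case = TRUE lets ‘typo’ match ‘Typos’ as well.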

The object returned by search_commits() is a list with four elements: meta repeats the owner and repo names; counts is a list with the count of messages containing the search term, the total commit count, and their ratio; matches contains the messages that include the search term; and messages contains all of the messages.

Fniding my typoes

Here’s an example where I look for commit messages to this blog that contain the word ‘typo’. Since the function contains the .limit = Inf argument in gh::gh(), we’ll get an output message for each separate request that’s been made to the API.

blog_typos <- search_commits("matt-dray", "rostrum-blog")

ℹ Running gh query
ℹ Running gh query, got 100 records of about 1900
ℹ Running gh query, got 200 records of about 1900
ℹ Running gh query, got 300 records of about 1900
ℹ Running gh query, got 400 records of about 1900
ℹ Running gh query, got 500 records of about 1900
ℹ Running gh query, got 600 records of about 1900
ℹ Running gh query, got 700 records of about 1900
ℹ Running gh query, got 800 records of about 1900
ℹ Running gh query, got 900 records of about 1900
ℹ Running gh query, got 1000 records of about 1900
ℹ Running gh query, got 1100 records of about 1900
ℹ Running gh query, got 1200 records of about 1900
ℹ Running gh query, got 1300 records of about 1900
ℹ Running gh query, got 1400 records of about 1900
ℹ Running gh query, got 1500 records of about 1900
ℹ Running gh query, got 1600 records of about 1900
ℹ Running gh query, got 1700 records of about 1900
ℹ Running gh query, got 1800 records of about 1900

Here’s a preview of the structure of the returned object. You can see how it’s a list that contains the values and other list elements that we expected.

str(blog_typos)

List of 4
 $ meta    :List of 2
  ..$ : chr "matt-dray"
  ..$ : chr "rostrum-blog"
 $ counts  :List of 3
  ..$ match_count : int 95
  ..$ commit_count: int 1870
  ..$ match_ratio : num 0.0508
 $ matches : chr [1:95] "Improve text, correct typos, add cheatcode to hiscore post" "Fix typo that also made it into a Mastodon post, lol" "Correct typo in games post" "Improve readability of parse post, add renkun post, fix typos" ...
 $ messages: chr [1:1870] "Re-build README.Rmd" "Remove non-existent anchor from hiscore post" "Improve text, correct typos, add cheatcode to hiscore post" "Re-build README.Rmd" ...

You can see there were 1870 commit messages returned, of which 95 contained the string ‘typo’. That’s 5 per cent.
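
As a quick sanity check, the ratio can be recomputed by hand from the two counts printed above. This is just the arithmetic, with the values copied from the str() output rather than fetched again.

```r
# Counts as printed by str(blog_typos) above
match_count <- 95
commit_count <- 1870

match_ratio <- match_count / commit_count
round(match_ratio, 4)        # 0.0508, as in the str() output
round(100 * match_ratio, 1)  # 5.1
```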

Here’s a sample⁴ of those commit messages that contained the word ‘typo’:

set.seed(1337)
sample(blog_typos$matches, 5)

[1] "Fix potatypos"                                         
[2] "Merge pull request #72 from maelle/patch-1\n\ntypo fix"
[3] "Correct typos"                                         
[4] "Correct typo"                                          
[5] "add gapminder example, fix typo"

It seems the typos are often corrected with general improvements to a post’s copy. This usually happens when I read the post the next day with fresh eyes and groan at my ineptitude.⁵

Exposing others

I think typos are probably most often referenced in repos that involve a lot of documentation, or a book or something.

To make myself feel better, I had a quick look at the repo for the {bookdown} project R for Data Science by Hadley Wickham and Garrett Grolemund.

typos_r4ds <- search_commits("hadley", "r4ds")

The result:

str(typos_r4ds)

List of 4
 $ meta    :List of 2
  ..$ : chr "hadley"
  ..$ : chr "r4ds"
 $ counts  :List of 3
  ..$ match_count : int 450
  ..$ commit_count: int 2137
  ..$ match_ratio : num 0.211
 $ matches : chr [1:450] "fix: typo (add missing `to`) (#1529)" "Fix typos in subsection \"6.3.2 How does pivoting work?\" (#1534)\n\n* Add missing word\r\n\r\n* Fix typo" "typo fix in communication.qmd (#1523)" "Typo: \"a new\" instead of \"an new\" (#1515)" ...
 $ messages: chr [1:2137] "Small format for column (#1522)\n\nspecies column name is missing back ticks in this reference" "fix: typo (add missing `to`) (#1529)" "Use dplyr 1.1 'default' parameter in 'case_when()' (#1525)\n\n* Use dplyr 1.1 'default' parameter in 'case_when"| __truncated__ "Update arrow chapter code to avoid errors (#1517)\n\n* Add in `col_types` to specify schema\r\n\r\n* Just use open_dataset()" ...

Surprise: typos happen to all of us. I’m guessing the percentage is quite high because the book has a lot of readers scouring it, finding small issues and providing quick fixes.
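
To put the two repos side by side, here's a minimal sketch that turns the counts into percentages. The results list is built by hand from the numbers in the two str() outputs in this post and mimics the counts element returned by search_commits(); in practice you could pass the real returned objects instead.

```r
# Counts copied from the two str() outputs in this post
results <- list(
  "matt-dray/rostrum-blog" = list(
    counts = list(match_count = 95L, commit_count = 1870L)
  ),
  "hadley/r4ds" = list(
    counts = list(match_count = 450L, commit_count = 2137L)
  )
)

# Percentage of commit messages that matched, one value per repo
typo_pct <- vapply(
  results,
  function(x) round(100 * x$counts$match_count / x$counts$commit_count, 1),
  numeric(1)
)

typo_pct  # 5.1 for the blog, 21.1 for r4ds
```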

In other words

Of course, you can change the string argument of search_commits() to find terms other than the default ‘typo’. Use your imagination.
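
Because the string is handed to grepl(), it's treated as a regular expression, so you can search for several terms at once. Here's a sketch on some made-up messages (toy_messages is invented for illustration) that also catches a typo in the word ‘typo’.

```r
# Made-up commit messages for illustration
toy_messages <- c(
  "Fix typo in title",
  "Correct tpyo, lol",
  "Fix spelling of 'accommodate'",
  "Add session info"
)

# Alternation with | matches any one of the listed terms
pattern <- "typo|tpyo|spelling"
toy_hits <- toy_messages[grepl(pattern, toy_messages, ignore.case = TRUE)]
toy_hits
```

The first three messages match; the same pattern could be passed as the string argument of search_commits().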

Here’s a meta example: messages containing emoji in the commits to the {emo} package by Hadley Wickham, Romain François and Lucy D’Agostino McGowan.

Emoji are expressed in commit messages like :dog:, so we can capture them with a relatively simple regular expression like ":.*:" (match wherever there are two colons with anything, including nothing, in between).

emo_emoji <- search_commits("hadley", "emo", ":.*:")

ℹ Running gh query
ℹ Running gh query, got 100 records of about 200

str(emo_emoji)

List of 4
 $ meta    :List of 2
  ..$ : chr "hadley"
  ..$ : chr "emo"
 $ counts  :List of 3
  ..$ match_count : int 21
  ..$ commit_count: int 112
  ..$ match_ratio : num 0.188
 $ matches : chr [1:21] "need emo:: prefix in that case, bc ji_glue might be called without emo being attached. ping @batpigandme" "rm emoji keyboard (saved in separate branch) but eventually might just go in a separate :package:" "emo::ji_rx a meta regex to catch all emojis. closes #14" "bring in some extra modules (for emo::ji_rx)" ...
 $ messages: chr [1:112] "Imports CRAN glue (#54)" "no longer importing dplyr. #24" "less dependency on dplyr" "clock no longer depends on dplyr" ...

Only 19 per cent? Son, I am disappoint.
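
In fairness, that figure is probably inflated. Several of the matches shown above contain R's namespace operator (emo::ji_rx) rather than an emoji, because ".*" is happy to match nothing at all between two colons. Here's a sketch of a stricter pattern, assuming that shortcodes are made up only of lowercase letters, digits, underscores, plus and minus signs. That alphabet is my assumption rather than an official definition.

```r
# Three of the matched messages shown in the str() output above
msgs <- c(
  "need emo:: prefix in that case, bc ji_glue might be called without emo being attached. ping @batpigandme",
  "rm emoji keyboard (saved in separate branch) but eventually might just go in a separate :package:",
  "emo::ji_rx a meta regex to catch all emojis. closes #14"
)

loose <- ":.*:"             # the pattern used above
strict <- ":[a-z0-9_+-]+:"  # assumed shortcode alphabet, at least one character

loose_hits <- grepl(loose, msgs)    # TRUE TRUE TRUE
strict_hits <- grepl(strict, msgs)  # FALSE TRUE FALSE
```

Only the message containing :package: survives the stricter pattern, so rerunning search_commits() with it would likely give a lower count.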

Environment

Session info

Last rendered: 2023-07-17 22:22:24 BST

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_1.0.1 gh_1.4.0   

loaded via a namespace (and not attached):
 [1] digest_0.6.31     R6_2.5.1          fastmap_1.1.1     xfun_0.39        
 [5] fontawesome_0.5.1 magrittr_2.0.3    rappdirs_0.3.3    glue_1.6.2       
 [9] knitr_1.43.1      gitcreds_0.1.2    htmltools_0.5.5   rmarkdown_2.23   
[13] lifecycle_1.0.3   cli_3.6.1         vctrs_0.6.3       compiler_4.3.1   
[17] rstudioapi_0.15.0 tools_4.3.1       curl_5.0.1        evaluate_0.21    
[21] httr2_0.2.3       yaml_2.3.7        rlang_1.1.1       jsonlite_1.8.7   
[25] htmlwidgets_1.6.2

Footnotes

¹ Yes, I’m aware of Git hooks and various GitHub Actions that could prevent this.
² Though obviously you’ll miss messages containing the word ‘typo’ if you have a typo in the word ‘typo’ in one of your commits…
³ I used it most recently in my little {ghdump} package for downloading or cloning a user’s repos en masse.
⁴ Very rarely do I make myself laugh, but I had forgotten that I used the commit message ‘Fix potatypos’ when correcting typos in the post about the {potato} package, lol. Also thank you to Maëlle, who fixed a typo on my behalf!
⁵ I wonder how many typos I’ll need to correct in this post after publishing. (Edit: turns out I accidentally missed a couple of words, lol.)

Reuse

CC BY-NC-SA 4.0