Markov-chaining my PhD thesis

Matt Dray

A figure from my thesis showing the design of an invertebrate feeding experiment

Doc rot

I wrote a PhD thesis in 2014 called Effects of multiple environmental stressors on litter chemical composition and decomposition. See my viva presentation slides here if you don’t really like words.

On graduation day, a stranger came up to me and, to paraphrase, said ‘you doctors should be proud of what you’ve achieved, you’re doing a great service’. I didn’t have the heart to tell him that I wasn’t a medical doctor. No, I was something nobler and altogether more unique: a doctor of rotting leaves.

I know you’re thinking ‘gosh, what a complicated subject that must be; how could I ever hope to achieve such greatness?’. The answer is that you should simply take my thesis and use a Markov chain to generate new sentences until you have a fresh new thesis. The output will probably make as much sense as the original but won’t be detected easily by plagiarism software.

Heck, I’ll even do it for you in this post.

You’re welcome. Don’t forget to cite me.

Text generation

I’ll be using a very simple approach: Markov chains.

Basically, once it has been fitted to an input dataset, a Markov chain can generate the next word in a sentence given only the current word. The new word is selected at random, but weighted by how often it followed the current word in your input file.
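To make that concrete, here's a minimal sketch in base R. It uses a made-up toy sentence rather than my thesis, counts which word follows which, and then draws the next word with probability proportional to those counts. The names toy_tokens, transitions and next_word() are just for illustration; this isn't how the markovchain package does it internally.

```r
# toy input: a made-up sentence, not the thesis text
toy_tokens <- c("leaf", "litter", "decays", "and", "leaf", "litter", "rots",
                "and", "leaf", "mass", "drops")

# pair each word with the word that follows it
current   <- head(toy_tokens, -1)
following <- tail(toy_tokens, -1)

# count the transitions: rows are the current word, columns the next word
transitions <- table(current, following)

# draw the next word, weighted by how often it followed the current word
next_word <- function(word) {
  counts <- transitions[word, ]
  sample(names(counts), size = 1, prob = counts / sum(counts))
}

set.seed(1)
next_word("leaf")  # either "litter" or "mass"; "litter" is twice as likely
```

In the toy sentence ‘leaf’ is followed by ‘litter’ twice and ‘mass’ once, so ‘litter’ gets picked about two-thirds of the time.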

There’s a great post on Hackernoon that explains Markov chains for text generation. For interactive visuals of Markov chains, go to setosa.io.

Text generation is an expanding field and there are much more successful and complicated methods for doing it. For example, Andrej Karpathy generated some pretty convincing Shakespeare passages, Wikipedia pages and geometry papers in LaTeX using the ‘unreasonably effective’ and ‘magical’ power of Recurrent Neural Networks (RNNs).

Generate text

Code source

I’ll be using modified R code written by Kory Becker that I found in a GitHub gist.

In a similar vein, Roel Hogervorst did a swell job of generating Captain Picard text in R from Star Trek: The Next Generation scripts, which is certainly in our wheelhouse.

Data

Because I’m helpful I’ve created a text file version of my thesis. You can get it raw from my draytasets (haha) GitHub repo.

Alternatively you could get the data from the dray package.

# download package from github
# library(devtools)
# devtools::install_github("matt-dray/dray")

# load dray package and assign data to object
library(dray)
phd_text <- dray::phd

We’ll alter the data slightly for it to be ready for passing into the Markov chain.

# remove blank lines
phd_text <- phd_text[nchar(phd_text) > 0]

# put spaces around common punctuation
# so they're not interpreted as part of a word
phd_text <- gsub(".", " .", phd_text, fixed = TRUE)
phd_text <- gsub(",", " ,", phd_text, fixed = TRUE)
phd_text <- gsub("(", "( ", phd_text, fixed = TRUE)
phd_text <- gsub(")", " )", phd_text, fixed = TRUE)

# split into single tokens
terms <- unlist(strsplit(phd_text, " "))
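If you want to see what those substitutions do, here's a small sketch that runs the same steps on a single made-up sentence (the sentence and the toy_ object names are mine, not from the thesis).

```r
# a made-up sentence in the style of a thesis, with a placeholder citation
toy_text <- "Litter decayed faster (Author, 2010)."

# same substitutions as above: pad common punctuation with a space
toy_text <- gsub(".", " .", toy_text, fixed = TRUE)
toy_text <- gsub(",", " ,", toy_text, fixed = TRUE)
toy_text <- gsub("(", "( ", toy_text, fixed = TRUE)
toy_text <- gsub(")", " )", toy_text, fixed = TRUE)

# split on spaces so each punctuation mark becomes its own token
toy_terms <- unlist(strsplit(toy_text, " "))
toy_terms
# "Litter" "decayed" "faster" "(" "Author" "," "2010" ")" "."
```

Each bracket, comma and full stop is now a token in its own right, so the chain can learn that an opening bracket tends to be followed by an author name, for example.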

Script

Load the markovchain package and fit a Markov chain to the text data.

# load the package we need
# install.packages("markovchain")
library(markovchain)
## Package:  markovchain
## Version:  0.6.9.8-1
## Date:     2017-08-15
## BugReport: http://github.com/spedygiorgio/markovchain/issues

# fit the markov chain to the data
fit <- markovchainFit(data = terms)

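We’re going to generate 200 ‘sentences’ (each a sequence of n words, where we specify n). To make each one reproducible, we’ll supply a different value, from 1 to 200, to the set.seed() function before generating it. The seed fixes the state of R’s random number generator, which in turn determines the randomly-selected starting token and the words drawn after it within the markovchainSequence() function.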
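As a quick illustration of why the seed matters, here's a minimal base R sketch. The toy_states vector is a stand-in vocabulary I've made up; the point is only that the same seed gives the same ‘random’ draws.

```r
# a stand-in vocabulary; any character vector would do
toy_states <- c("leaf", "litter", "stream", "twig", "woodlouse")

# draw three words after setting the seed
set.seed(1)
first_draw <- sample(toy_states, size = 3, replace = TRUE)

# reset to the same seed and draw again
set.seed(1)
second_draw <- sample(toy_states, size = 3, replace = TRUE)

# same seed, same draws
identical(first_draw, second_draw)  # TRUE
```

Using a different seed for each of the 200 iterations means the sequences should differ from each other, but rerunning the whole loop will reproduce the same set.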

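# data frame with one row per generated 'sentence'
markov_output <- data.frame(output = rep(NA, 200))

for (i in 1:200) {
  
  # a different seed each time, so each sequence is reproducible
  set.seed(i)
  
  # generate the tokens and paste them into a single string
  markov_text <- paste(
    markovchainSequence(
      n = 50,  # output length
      markovchain = fit$estimate
    ),
    collapse = " "
  )
  
  # store the string in row i
  markov_output$output[i] <- markov_text
  
}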
  
}
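The loop above lets the random number generator pick each starting token. If you’d rather begin every sequence from a word of your choosing, markovchainSequence() also takes a t0 argument for the initial state. Here’s a minimal sketch on a made-up toy corpus (toy_terms and toy_fit are hypothetical names, and the corpus is not the thesis text); I haven’t used this in the post itself.

```r
library(markovchain)

# toy corpus: a made-up sentence, not the thesis text
toy_terms <- c("leaf", "litter", "decays", "and", "leaf", "litter", "rots",
               "and", "leaf", "mass", "drops", "and")

# fit a chain to the toy tokens, as for the thesis tokens
toy_fit <- markovchainFit(data = toy_terms)

set.seed(1)
toy_sequence <- markovchainSequence(
  n = 10,
  markovchain = toy_fit$estimate,
  t0 = "leaf",         # start from this state rather than a random one
  include.t0 = TRUE    # keep the starting word in the output
)

paste(toy_sequence, collapse = " ")
```

The chosen word has to be one of the states in the fitted chain, which means it must appear somewhere in the input tokens.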

Full output

This table shows 200 samples of length 50 that I generated with the code above, each beginning with a randomly-selected token.

Cherry-picked phrases

The output is mostly trash because the Markov chain doesn’t have built-in grammar or an understanding of sentence structure. It chooses each word given only the current one, with no memory of the words that came before.

You can also see that brackets don’t get closed, for example, though an opening bracket is often followed by an author citation or result of a statistical test, as we might expect given the source material.

I’ve selected some things from the output that basically look like normal(ish) phrases. Simply rearrange these to build a thesis!

My favourites (my comments in square brackets):

  • Not all invertebrate species are among tree species [FACT.]
  • Effect of deciduous trees may be appreciated [They produce oxygen and fruits, after all.]
  • Species-specific utilization of Cardiff University [Well, humans go inside the uni, pigeons sit on the roof; I guess that’s ‘species-specific’.]
  • Litter was affected by Wallace , Dordrecht [Who is this Dutch guy who’s interfering with my studies?]
  • Bags permitted entry of stream ecosystem [I should hope so; I was investigating the effect of the stream ecosystem on the leaf litter stored in those bags, after all.]
  • Permutational Analysis and xylophagous invertebrates can affect ecosystem service provision [My analysis will affect the thing it’s analysing? The observer effect!]
  • Most studies could shift invertebrate communities [Hang on, this is the observer effect again; I thought I was studying ecology, not physics.]
  • This thesis is responsible for broad underlying principles to mass loss [Health warning: my thesis actually causes decay (possibly to your brain cells).]
  • Carbon dioxide enrichment altered chemical composition [Aha! Actually true!]

Some other things that vaguely make sense:

  • The response variables were returned to predict leaf litters
  • shredder feeding was established for nutrient and urban pollution
  • Leaf litter chemical composition are comprised of differing acidity in Ystradffin
  • the no-choice situation with deionised water availability may reflect invertebrate feeding preferences
  • ground coarsely using a wide spectrum of stream ecosystem functioning
  • cages were already apparent
  • Schematic of aquatic invertebrate species for identifying the invertebrate assemblages during model fitting
  • Populus tremuloides clone under elevated CO2 had consistently been related to remove debris dams in woodland environments
  • the need to account for microphytobenthic biofilms are particularly affected by the Linnean Society , and lignin concentration
  • These findings suggest that the roles could not differ between time and bottom-up control of decomposition
  • rural litter decomposition of litter layer of leaf litter will influence invertebrate communities
  • the effects of carbon concentration in species’ feeding responses between tree species with caution given the 1970s , regardless of four weeks
  • Results were visualised in the breakdown
  • Measurements were in altered twig decay rates
  • Litter was little work in decomposing leaf litter of litter resulted in a range of twigs , as a result in upland streams
  • The basis of carbon compounds have influenced feeding preferences
  • Annual Review of the physical toughness of rotting detritus altered chemical composition and woodlice Porcellio species
  • Nitrogen concentrations and nitrogen transformations in both leaves grown under ambient CO2 levels of trembling aspen and invertebrate assemblage
  • Odontocerum albicorne was tied to elevated CO2 and terrestrial study was lower carbon sinks
  • a range of data stretching back 25 years of litter chemical composition ( 1979 ) for all interactions used microcosm type modified by elevated CO2
  • homogeneity of invertebrates as defined by increasing local microclimate and no influence mass loss

Congratulations on your doctorate!

Session info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.3 (2017-11-30)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2018-07-04
## Packages -----------------------------------------------------------------
##  package      * version    date       source                            
##  backports      1.1.2      2017-12-13 cran (@1.1.2)                     
##  base         * 3.4.3      2017-12-07 local                             
##  blogdown       0.6.5      2018-06-10 Github (rstudio/blogdown@ad8be3f) 
##  bookdown       0.7        2018-02-18 cran (@0.7)                       
##  compiler       3.4.3      2017-12-07 local                             
##  crosstalk      1.0.1      2018-03-09 Github (rstudio/crosstalk@0f21b45)
##  datasets     * 3.4.3      2017-12-07 local                             
##  devtools       1.13.5     2018-02-18 CRAN (R 3.4.3)                    
##  digest         0.6.15     2018-01-28 cran (@0.6.15)                    
##  dray         * 0.0.0.9000 2018-07-04 Github (matt-dray/dray@b48ad61)   
##  DT             0.4.5      2018-03-09 Github (rstudio/DT@8ba54ab)       
##  evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                    
##  expm           0.999-2    2017-03-29 CRAN (R 3.4.0)                    
##  graphics     * 3.4.3      2017-12-07 local                             
##  grDevices    * 3.4.3      2017-12-07 local                             
##  grid           3.4.3      2017-12-07 local                             
##  htmltools      0.3.6      2017-04-28 CRAN (R 3.4.0)                    
##  htmlwidgets    1.0        2018-01-20 cran (@1.0)                       
##  httpuv         1.4.3      2018-05-10 cran (@1.4.3)                     
##  igraph         1.1.2      2017-07-21 CRAN (R 3.4.1)                    
##  jsonlite       1.5        2017-06-01 CRAN (R 3.4.0)                    
##  knitr          1.20       2018-02-20 cran (@1.20)                      
##  later          0.7.2      2018-05-01 cran (@0.7.2)                     
##  lattice        0.20-35    2017-03-25 CRAN (R 3.4.3)                    
##  magrittr       1.5        2014-11-22 CRAN (R 3.4.0)                    
##  markovchain  * 0.6.9.8-1  2017-08-16 CRAN (R 3.4.1)                    
##  matlab         1.0.2      2014-06-24 CRAN (R 3.4.0)                    
##  Matrix         1.2-12     2017-11-20 CRAN (R 3.4.3)                    
##  memoise        1.1.0      2017-04-21 CRAN (R 3.4.0)                    
##  methods      * 3.4.3      2017-12-07 local                             
##  mime           0.5        2016-07-07 CRAN (R 3.4.0)                    
##  parallel       3.4.3      2017-12-07 local                             
##  pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.0)                    
##  plotrix        3.7-2      2018-05-27 cran (@3.7-2)                     
##  promises       1.0.1      2018-04-13 cran (@1.0.1)                     
##  R6             2.2.2      2017-06-17 CRAN (R 3.4.0)                    
##  RColorBrewer   1.1-2      2014-12-07 CRAN (R 3.4.0)                    
##  Rcpp           0.12.17    2018-05-18 cran (@0.12.17)                   
##  RcppParallel   4.4.0      2018-03-02 CRAN (R 3.4.3)                    
##  rmarkdown      1.9        2018-03-01 cran (@1.9)                       
##  rprojroot      1.3-2      2018-01-03 cran (@1.3-2)                     
##  shiny          1.1.0      2018-05-17 cran (@1.1.0)                     
##  slam           0.1-42     2017-12-21 CRAN (R 3.4.3)                    
##  stats        * 3.4.3      2017-12-07 local                             
##  stats4         3.4.3      2017-12-07 local                             
##  stringi        1.2.2      2018-05-02 cran (@1.2.2)                     
##  stringr        1.3.1      2018-05-10 cran (@1.3.1)                     
##  tools          3.4.3      2017-12-07 local                             
##  utils        * 3.4.3      2017-12-07 local                             
##  withr          2.1.2      2018-06-28 Github (jimhester/withr@fe56f20)  
##  wordcloud      2.5        2014-06-13 CRAN (R 3.4.0)                    
##  xfun           0.1        2018-01-22 cran (@0.1)                       
##  xtable         1.8-2      2016-02-05 cran (@1.8-2)                     
##  yaml           2.1.19     2018-05-01 cran (@2.1.19)