# Graphing the Relayverse of podcasts

Matt Dray (@mattdray)

Podcasting is becoming big business. Music-streaming giant Spotify just acquired the podcast network Gimlet for a reported $200 million. Other networks include The Incomparable, 5by5 and Radiotopia. Such networks can boost revenue and listener numbers and provide access to expertise, management and resources.

Relay FM is a network that focuses largely on tech content [1]. It was started by Myke Hurley and Stephen Hackett in 2014 and you can find a list of shows and personnel on their site. Many of Relay FM’s hosts have hosted more than one podcast within the network. What do the relationships between them look like?

This post is about preparing and visualising this ‘Relayverse’ using the R packages tidygraph for network data handling and ggraph for network visualisation (both by Thomas Lin Pedersen), along with the visNetwork package (from Benoit Thieurmel and Titouan Robert of Datastorm, wrapping Almende BV’s vis.js library) for interactive visualisation.

Scroll to the bottom to find the interactive tool if you aren’t interested in the code.

# Packages

I’m using two suites of ‘tidy’ packages in this post: one set for data collection and manipulation, and one set for graph network building, analysis and visualisation.

    # Data collection and manipulation
    library(dplyr)    # data manipulation
    library(rvest)    # for scraping webpages
    library(stringr)  # string manipulation
    library(tidyr)    # tidying dataframes
    library(purrr)    # applying functions over data

    # Graph networks
    library(tidygraph)   # set up graph network
    library(ggraph)      # visualise static graphs
    library(cowplot)     # for plot arrangement
    library(visNetwork)  # wrapper for javascript interactive viz

# Harvest data

We can use the rvest package by Hadley Wickham to scrape podcast details from the Wikipedia page for Relay FM. There are two separate tables: one for current and one for retired (discontinued) shows.
read_html() gets the HTML for the selected page; html_node() identifies which element needs to be scraped [2]; and html_table() interprets the HTML information as a dataframe. I’ve removed the show ‘B-Sides’ because it’s clips from other shows.

    # Get the HTML for the selected page
    relay_wiki <- read_html("https://en.wikipedia.org/wiki/Relay_FM")

    # Get the table with current shows
    current <- relay_wiki %>%
      html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
      html_table() %>%
      filter(
        !Podcast %in% c("Members Only", 'Paid "Members Only" Shows', "B-Sides")
      ) %>%
      mutate(Status = "Current")  # label rows as current shows

    # Get the table with retired shows
    retired <- relay_wiki %>%
      html_node(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
      html_table() %>%
      select(-`Number of episodes`) %>%  # backticks: the column name contains spaces
      mutate(Status = "Retired")  # label rows as retired shows

    # Combine the tables into one dataframe
    shows <- bind_rows(current, retired)

    # Look at a few random hosts/podcasts from the table
    select(shows, Podcast, Hosts) %>%
      sample_n(5) %>%
      knitr::kable()

|    | Podcast        | Hosts                         |
|----|----------------|-------------------------------|
| 12 | Liftoff        | Jason SnellStephen M. Hackett |
| 5  | Upgrade        | Myke HurleyJason Snell        |
| 30 | Fusion         | Stephen M. Hackett            |
| 3  | The Pen Addict | Myke HurleyBrad Dowdy         |
| 38 | Almanac        | Andy Ihnatko                  |

Unfortunately the host names are in the form ‘First LastFirst Last’, so we need a regular expression to split the string where a lowercase letter meets an uppercase letter [3]. This leaves us with a list column that we can unnest() to get one row per podcast-host combination.
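As a small illustration of that split-and-unnest step, here is a sketch on a made-up two-row table. The show and host names are placeholders invented for this example, not scraped data.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical input: host names run together, as in the scraped table
toy_shows <- tibble(
  Podcast = c("Show A", "Show B"),
  Hosts = c("Ann ExampleBob Sample", "Cat Placeholder")
)

toy_clean <- toy_shows %>%
  # Split at the zero-width position between a lowercase and an uppercase letter
  mutate(Hosts = str_split(Hosts, "(?<=[a-z])(?=[A-Z])")) %>%
  # Expand the resulting list column to one row per podcast-host pair
  unnest(Hosts)

print(toy_clean)
```

One caveat of this pattern is that it would also split a surname with an internal capital (something like ‘McSample’), so it is worth eyeballing the result.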
    # Clean the host names
    shows_clean <- shows %>%
      mutate(
        Hosts = str_remove_all(Hosts, " \\(originally\\)"),  # drop ' (originally)' notes
        Hosts = str_split(Hosts, "(?<=[a-z])(?=[A-Z])")  # split run-together names
      ) %>%
      unnest(Hosts) %>%  # one row per podcast-host combination
      select(Podcast, Hosts, Status)

    # Print a random sample of 10
    sample_n(shows_clean, 10) %>%
      knitr::kable()

|    | Podcast        | Hosts          | Status  |
|----|----------------|----------------|---------|
| 59 | Automators     | Rose Orchard   | Current |
| 64 | CMD Space      | Myke Hurley    | Retired |
| 12 | BONANZA        | Myke Hurley    | Current |
| 25 | Material       | Florence Ion   | Current |
| 38 | Disruption     | Georgia Dow    | Current |
| 14 | Rocket         | Brianna Wu     | Current |
| 77 | Isometric      | Mikah Sargent  | Retired |
| 6  | The Pen Addict | Myke Hurley    | Current |
| 60 | Parallel       | Shelly Brisbin | Current |
| 39 | Disruption     | Brianna Wu     | Current |

Our dataframe is now tidy and ready for tidygraph, but before moving on, we can do things like look at the hosts with the most active shows.

    shows_clean %>%
      filter(Status == "Current") %>%
      count(Hosts) %>%
      filter(n > 2) %>%
      arrange(desc(n)) %>%
      knitr::kable()

| Hosts              | n  |
|--------------------|----|
| Myke Hurley        | 10 |
| Stephen M. Hackett | 5  |
| Jason Snell        | 4  |
| David Sparks       | 3  |
| Mikah Sargent      | 3  |
| Tiffany Arment     | 3  |

Or, out of interest, the shows that have had the most hosts.

    shows_clean %>%
      count(Podcast) %>%
      arrange(desc(n)) %>%
      slice(1:6) %>%
      knitr::kable()

| Podcast    | n |
|------------|---|
| Isometric  | 5 |
| Disruption | 4 |
| Connected  | 3 |
| Material   | 3 |
| Remaster   | 3 |
| Rocket     | 3 |

# Tidygraph

The tidygraph package was created by Thomas Lin Pedersen. It is:

> an entry into the tidyverse that provides a tidy framework for all things relational (networks/graphs, trees, etc)

To use the package, we need every host combination [4] for each podcast, which can be achieved with the combn() function. This gives us a pair of points (‘nodes’ in graph-speak) that can be connected by a line (an ‘edge’) to indicate their relationship.

But a show with one host doesn’t have a pair of points, so I’m going to duplicate these rows first. If we don’t do this, we can’t plot these shows because they won’t have a connection between two nodes.
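To make the pairing step concrete, here is a base-R sketch of what combn() does for a multi-host show and why a solo host needs duplicating. The host names are invented placeholders.

```r
# Hypothetical three-host show: combn() returns one column per pair,
# so we transpose to get one row per pair
hosts <- c("Host A", "Host B", "Host C")
pairs <- t(combn(hosts, m = 2))
print(pairs)  # three rows: A-B, A-C, B-C

# A single host cannot form a pair, so combn() raises an error
solo_error <- tryCatch(combn("Host D", m = 2), error = function(e) e)

# Duplicating the host first gives a single self-pair (a loop edge)
solo_pair <- t(combn(c("Host D", "Host D"), m = 2))
print(solo_pair)
```

The self-pair is what later shows up as an edge from a node to itself, which lets solo-hosted shows appear in the graph.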
    # Isolate the shows with one host
    solo_vec <- shows_clean %>%
      count(Podcast) %>%
      filter(n == 1) %>%
      pull(Podcast)

    # Filter the show-host dataframe by solo-hosted podcasts
    solo_df <- filter(shows_clean, Podcast %in% solo_vec)

Now we can bind these rows to the data and then get our host combinations. Big shout out to William Chase for a solution to getting all combinations of elements within some parent element (e.g. hosts within podcasts).

    # Prepare host combinations per show
    relay_combos <- shows_clean %>%
      bind_rows(solo_df) %>%  # to duplicate the shows with solo hosts
      group_by(Podcast) %>%  # operate within each podcast
      split(.$Podcast) %>%  # split on podcast
      map("Hosts") %>%  # gets vector of hosts per podcast list element
      map(~combn(.x, m = 2)) %>%  # all pair combinations
      map(~t(.x)) %>%  # transpose the matrix
      map(as_tibble) %>%  # convert to a tibble dataframe (columns V1, V2)
      bind_rows(.id = "Podcast") %>%  # list-element name to column
      select(V1, V2, Podcast)

    sample_n(relay_combos, 10) %>% knitr::kable()  # random sample of 10

| V1               | V2                 | Podcast         |
|------------------|--------------------|-----------------|
| Federico Viticci | Stephen M. Hackett | The Prompt      |
| Brianna Wu       | Mikah Sargent      | Isometric       |
| Myke Hurley      | Tiffany Arment     | Playing for Fun |
| Federico Viticci | Stephen M. Hackett | Connected       |
| Brianna Wu       | Georgia Dow        | Isometric       |
| Federico Viticci | Myke Hurley        | Remaster        |
| Brianna Wu       | Mikah Sargent      | Disruption      |
| Myke Hurley      | Myke Hurley        | Inquisitive     |
| Federico Viticci | Fraser Spiers      | Canvas          |

We can turn this dataframe of host-pair combinations into a tidygraph object with the as_tbl_graph() function. This class of object contains two dataframes (the nodes and the edges) and some metadata.

We can also use functions from the tidygraph package to help calculate various network statistics. For example, the centrality_degree() function tells us the nodes with the most connections. We can add this as a column in our node data and use this later to do things like resize nodes depending on their centrality.

You can manipulate the nodes and edges dataframes in this tidygraph object by using the activate() function to switch between them. Currently the node data are active (it says ‘active’ above the nodes dataframe), so our application of centrality_degree() to the network object will affect the node data specifically.

    relay_graph <- as_tbl_graph(relay_combos, directed = FALSE) %>%
      mutate(connections = centrality_degree()) %>%  # number of connections
      arrange(desc(connections))  # order by number of connections

    print(relay_graph)
    ## # A tbl_graph: 40 nodes and 66 edges
    ## #
    ## # An undirected multigraph with 8 components
    ## #
    ## # Node Data: 40 x 2 (active)
    ##   name               connections
    ##   <chr>                    <dbl>
    ## 1 Myke Hurley                 21
    ## 2 Stephen M. Hackett          10
    ## 3 Brianna Wu                   9
    ## 4 Mikah Sargent                9
    ## 5 Federico Viticci             8
    ## 6 Georgia Dow                  7
    ## # … with 34 more rows
    ## #
    ## # Edge Data: 66 x 3
    ##    from    to Podcast
    ##   <int> <int> <chr>
    ## 1     9     9 Almanac
    ## 2     1    32 Analog(ue)
    ## 3    11    33 Automators
    ## # … with 63 more rows

The edges and nodes are now in the same object. You can see we have 40 nodes and 66 edges and that the network is undirected (nodes don’t ‘point’ to each other). We also have eight ‘components’, which are the isolated groups of nodes that connect to each other, but not to other groups.
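The ideas above (switching between node and edge data with activate(), counting connections and identifying components) can be tried out on a tiny graph. This is a sketch with an invented edge list; the toy_ names and the show column are placeholders, and group_components() is a tidygraph helper that the post itself doesn’t use.

```r
library(dplyr)
library(tidygraph)

# Hypothetical edge list: two shows shared by A and B, a solo show for C
# (a self-loop), and one show shared by D and E
toy_edges <- tibble(
  from = c("A", "A", "C", "D"),
  to   = c("B", "B", "C", "E"),
  show = c("s1", "s2", "s3", "s4")
)

toy_graph <- as_tbl_graph(toy_edges, directed = FALSE)

# Node data are active by default, so mutate() adds node columns
toy_nodes <- toy_graph %>%
  activate(nodes) %>%
  mutate(
    connections = centrality_degree(),  # edges touching each node
    component = group_components()      # id of each isolated group
  ) %>%
  as_tibble()

# Switch to the edge data to filter or inspect the shows instead
toy_shared <- toy_graph %>%
  activate(edges) %>%
  filter(show != "s3") %>%  # drop the solo show's self-loop
  as_tibble()

print(toy_nodes)
print(toy_shared)
```

Here A and B form one component, C sits alone and D and E form a third, which mirrors how the eight components arise in the Relay FM graph.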

# Visualisation

## Static with ggraph

The ggraph package was designed to work seamlessly with tidygraph objects and the ggplot2 plotting package. You simply pass your tidygraph object to the ‘graph’ argument of ggraph(). Below are a couple of examples for demonstration purposes that show you just a handful of options. You can learn more about ggraph’s layouts, nodes and edges in a series of blog posts from Thomas Lin Pedersen.

This example arranges the nodes (hosts) in a circle (plot A) and along a straight line (plot B), and sizes them by the number of connections to other hosts. Each line (edge) represents a co-hosting relationship.

    g1 <- ggraph(
      graph = relay_graph,
      layout = "linear",
      circular = TRUE
    ) +
      geom_node_point(aes(size = connections)) +
      geom_edge_arc() +
      theme_void() +
      theme(legend.position = "none")

    g2 <- ggraph(
      graph = relay_graph,
      layout = "linear"
    ) +
      geom_node_point(aes(size = connections)) +
      geom_edge_arc() +
      theme_void() +
      theme(legend.position = "none")

    plot_grid(g1, g2, labels = c("A", "B"))

This example arranges the nodes (hosts) according to an algorithm specified by the layout argument to ggraph() and sizes them by the number of connections to other hosts. Each line (edge) represents a co-hosting relationship and multiple connections are shown separately by ‘fanning’ them out (hence the geom_edge_fan()).

    ggraph(graph = relay_graph, layout = "nicely") +
      geom_node_point(aes(size = connections)) +
      geom_edge_fan() +
      theme_void() +
      theme(legend.position = "none")

I haven’t added labels to the nodes or the edges because of the visual clutter they create in this particular example. You could use geom_node_text() to add node labels and something like aes(label = <column_name>) in your edge geom to label the connections.

This is where interactive network graphs come in handy, as you can zoom, pan and hover to get more info.
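For a graph small enough to stay readable, the static labelling approach mentioned above might look like the following sketch. The edge list, host names and the show column are invented for illustration, and the circle layout is chosen only because it is deterministic.

```r
library(tidygraph)
library(ggplot2)
library(ggraph)

# Hypothetical three-host network with one show per pair
toy_edges <- data.frame(
  from = c("Host A", "Host A", "Host B"),
  to   = c("Host B", "Host C", "Host C"),
  show = c("Show 1", "Show 2", "Show 3")
)

toy_graph <- as_tbl_graph(toy_edges, directed = FALSE)

labelled <- ggraph(toy_graph, layout = "circle") +
  geom_edge_link(aes(label = show)) +                # label each edge with its show
  geom_node_point() +
  geom_node_text(aes(label = name), nudge_y = 0.1) + # label each node with the host
  theme_void()

print(labelled)
```

With 40 hosts and 66 edges the same labels would overlap heavily, which is the clutter problem described above.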

## Interactive with visNetwork

The visNetwork package wraps the vis.js library to make interactive network graphs. Its visNetwork() function takes separate dataframes of edges and nodes that are in a pre-specified format, so we won’t be able to use our tidygraph object for this.

    # Dataframe of unique nodes with an ID value
    nodes <- shows_clean %>%  # take our podcast-host dataset
      distinct(Hosts) %>%  # get column of unique hosts
      arrange(Hosts) %>%  # alphabetical order
      mutate(id = as.character(row_number()), label = Hosts) %>%
      select(-Hosts)  # keep only the id and label columns

    # Dataframe of 'origin' and 'destination' nodes for edge drawing
    edges <- relay_combos %>%
      left_join(nodes, by = c("V1" = "label")) %>%  # id of the first host
      left_join(nodes, by = c("V2" = "label")) %>%  # id of the second host (ids become id.x, id.y)
      rename(from = id.x, to = id.y, title = Podcast)  # title provides pop up value on viz

Now we can just plug the data into the visNetwork() function and pipe this into some other functions to set some options.

    visNetwork(nodes, edges,
      main = "The Relayverse", submain = "Nodes are hosts, edges are shows"
    ) %>%
      visOptions(
        highlightNearest = TRUE,  # on-hover highlight nearest nodes
        nodesIdSelection = TRUE  # select node from dropdown
      ) %>%
      visPhysics(  # set physics 'engine' and options
        solver = "forceAtlas2Based",
        forceAtlas2Based = list(gravitationalConstant = -50)
      ) %>%
      visLayout(randomSeed = 1337)  # reproduce the same network each time

You can:

• click a node to highlight a host and their co-hosts
• hover over an edge to show a label with the show name
• select a host from the dropdown menu to highlight them and their co-hosts
• click and drag a node to move it

This is a very simple example, but lots of other visNetwork options are available.
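One possible variation, not used in this post, is to pull the node and edge tables out of a tidygraph object and reshape them into the format visNetwork expects, which also makes it easy to size nodes by their number of connections. The sketch below uses an invented toy graph and relies on the edge table’s from and to columns being row indices into the node table.

```r
library(dplyr)
library(tidygraph)
library(visNetwork)

# Hypothetical co-hosting graph with a connections column on the nodes
toy_graph <- as_tbl_graph(
  data.frame(
    from = c("Host A", "Host A", "Host B"),
    to   = c("Host B", "Host C", "Host C"),
    show = c("Show 1", "Show 2", "Show 3")
  ),
  directed = FALSE
) %>%
  mutate(connections = centrality_degree())

# Nodes: visNetwork needs an id column; value scales the node size
vis_nodes <- toy_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  mutate(id = row_number(), label = name, value = connections)

# Edges: from/to already refer to node row numbers; title is the hover text
vis_edges <- toy_graph %>%
  activate(edges) %>%
  as_tibble() %>%
  rename(title = show)

widget <- visNetwork(vis_nodes, vis_edges)
```

Printing widget at the console or in an R Markdown document should render the interactive graph.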

# What now?

This was really just an introduction to tidygraph, ggraph and visNetwork. These packages make network analysis a little more consistent and can provide some interesting stats and visuals very quickly. A next step might be to produce a ‘network of podcast networks’, since Relay FM hosts appear in shows on other networks as well.

1. Some personal favourites: Analog(ue), Cortex, Playing for Fun, Reconcilable Differences, Top Four and Connected.

2. Try SelectorGadget or your browser’s ‘Inspect’ tool to help isolate the xpath of the element of interest.

3. The regex (?<=[a-z])(?=[A-Z]) can be interpreted as ‘split after but not including (?<=) a lowercase letter ([a-z]), and before but not including (?=) a capital letter ([A-Z])’.

4. We want combinations, not permutations. ‘Federico Viticci to Stephen M. Hackett’ is the same as ‘Stephen M. Hackett to Federico Viticci’, for example.