Travel the NBA with {rvest}, {leaflet} and {osrm}

leaflet
osrm
plotly
rvest
sports
webscrape
Author
Published

December 24, 2018

John Stockton passes the basketball to Karl Malone who dunks it.

Classic Jazz: Stockton to Malone for the dunk (via Giphy

tl;dr

The {osrm} R package can retrieve from the OSRM API the travel duration between points. I looked at these data for NBA basketball-team arenas, whose details I scraped from the web using {rvest} and mapped with {leaflet}.

Note

The original version of this post used the {gmapsdistance} package. I updated it extensively in 2020 to use the {osrm} package, which doesn’t require an API key nor billing details.

On the road

Fans don’t have far to travel in the UK if they want to see their favourite sports team play an away match.

The USA is pretty big, though.

The National Basketball Association (NBA) compensates by separating its teams into Eastern and Western conferences, each with three divisions of five teams. This means that the majority of regular-season games aren’t too far away.

But this varies. Teams are clustered near Lakes Michigan and Erie in the Central division, but the Northwest division stretches from Portland in the Pacific Northwest to Oklahoma City in the centre-south of the country.

What would it take to be a basketball fan who wanted to drive to away games? How long would it take?

R can help

Surprise, this is all a ruse for me to practice with some R packages:

  • {rvest} for scraping web pages
  • {leaflet} for interactive mapping
  • {osrm} for calculating duration of travel between points

There’s four main parts to the post (click to jump):

  1. Scrape team data
  2. Map the locations
  3. Get travel duration
  4. Make a heatmap

Let’s start by attaching the packages we need. As always, make sure these are installed first, using install.packages().

suppressPackageStartupMessages({
  
  # Tidyverse and friends
  library(tidyverse)  # data handling and plotting
  library(rvest)      # scrape data
  library(janitor)    # misc cleaning
  
  # Geography and travel
  library(sf)         # handle geographies
  library(osrm)       # fetch travel info
  
  # Interactive elements
  library(leaflet)    # interactive maps
  library(DT)         # interactive tables
  library(plotly)     # interactive plots
  
})

1. Scrape team data

Use {rvest}

The Wikipedia page for the NBA has a table with each team and its location, including coordinates. We can use the {rvest} web-scraping package to extract that table into a data frame with these steps:

  1. Read the HTML of the page with xml2::read_html()
  2. Extract the HTML node for the table with rvest::html_nodes()
  3. Parse the HTML as a table with rvest::html_table()

Note that you have to provide to html_nodes() a CSS selector or an XPath that identifies the table’s ‘location’ in the HTML. You can find these using a tool like SelectorGadget, or with your browser’s ‘inspect’ tool (for Chrome, right-click the element on the page, select ‘inspect’, right-click the HTML for that element, go to ‘Copy’, then ’Copy full XPath). Beware: if the Wikipedia page changes, then this path could change in future.

nba_scrape <-
  read_html("https://en.wikipedia.org/wiki/National_Basketball_Association") %>% 
  html_nodes(xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[4]") %>%
  html_table(fill = TRUE, header = NA) %>%
  .[[1]]  # list was returned, so extract first list element

Here’s a preview:

glimpse(nba_scrape)
Rows: 32
Columns: 9
$ Division    <chr> "Eastern Conference", "Atlantic", "Atlantic", "Atlantic", …
$ Team        <chr> "Eastern Conference", "Boston Celtics", "Brooklyn Nets", "…
$ Location    <chr> "Eastern Conference", "Boston, Massachusetts", "New York C…
$ Arena       <chr> "Eastern Conference", "TD Garden", "Barclays Center", "Mad…
$ Capacity    <chr> "Eastern Conference", "19,156", "17,732", "19,812", "20,47…
$ Coordinates <chr> "Eastern Conference", ".mw-parser-output .geo-default,.mw-…
$ Founded     <chr> "Eastern Conference", "1946", "1967*", "1946", "1946*", "1…
$ Joined      <chr> "Eastern Conference", "1946", "1976", "1946", "1949", "199…
$ ``          <chr> "Eastern Conference", NA, NA, NA, NA, NA, NA, NA, NA, NA, …

So, the table has been returned, but it needs to be tidied up.

Wrangle the data

To summarise the main cleaning steps required:

  • remove the rogue NA-filled column
  • filter out the spanning headers that identify the conferences
  • add a column for each team’s conference
  • make numeric the arena capacity
  • separate city and state into separate columns
  • isolate the latitude and longitude by separating them from the Coordinates column
  • remove the ‘zero width no-break space’ unicode character in the longitude column
  • retain only the columns of interest
nba_wrangle <- nba_scrape %>% 
  select(-length(.)) %>%  # remove the last column (NA)
  dplyr::filter(!str_detect(Division, "Conference")) %>% 
  mutate(
    Conference = c(rep("Eastern", 15), rep("Western", 15)),
    Capacity = as.numeric(str_remove(Capacity, ","))
  ) %>% 
  separate(Location, c("City", "State"), sep = ", ") %>% 
  separate(Coordinates, c("Coords1", "Coords2", "Coords3"), " / ") %>% 
  separate(Coords3, c("Latitude", "Longitude"), sep = "; ") %>% 
  separate(Longitude, c("Longitude", "X"), sep = " \\(") %>% 
  mutate(
    Latitude = as.numeric(Latitude),
    Longitude = as.numeric(str_remove(Longitude, "\\ufeff"))  # rogue unicode
  ) %>% 
  select(
    Team, Conference, everything(),
    -Founded, -Joined, -Coords1, -Coords2, -X
  ) %>% 
  as_tibble()  # convert to tibble

glimpse(nba_wrangle)
Rows: 30
Columns: 9
$ Team       <chr> "Boston Celtics", "Brooklyn Nets", "New York Knicks", "Phil…
$ Conference <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "Eas…
$ Division   <chr> "Atlantic", "Atlantic", "Atlantic", "Atlantic", "Atlantic",…
$ City       <chr> "Boston", "New York City", "New York City", "Philadelphia",…
$ State      <chr> "Massachusetts", "New York", "New York", "Pennsylvania", "O…
$ Arena      <chr> "TD Garden", "Barclays Center", "Madison Square Garden", "W…
$ Capacity   <dbl> 19156, 17732, 19812, 20478, 19800, 20917, 19432, 20332, 179…
$ Latitude   <dbl> 42.36630, 40.68265, 40.75056, 39.90111, 43.64333, 41.88056,…
$ Longitude  <dbl> -71.06223, -73.97469, -73.99361, -75.17194, -79.37917, -87.…

Add more information

I made a table of three-letter team codes and colours for the markers and icons that will appear in the pins on the interactive map. I got these from teamcolorcodes.com. With {leaflet}. The markers can only take a small set of named colours (see ?awesomeIcons), whereas the icon can use any CSS-valid colour (like hex codes).

Click for the code that creates a data frame of team codes and colours
nba_abbr_cols <- tribble(
  ~Code, ~Franchise, ~colour_marker, ~colour_icon,
  "ATL", "Atlanta Hawks",          "red",       "#C1D32F",
  "BKN", "Boston Celtics",         "black",     "#FFFFFF",
  "BOS", "Brooklyn Nets",          "green",     "#BA9653",
  "CHA", "Charlotte Hornets",      "darkblue",  "#00788C",
  "CHI", "Chicago Bulls",          "red",       "#000000",
  "CLE", "Cleveland Cavaliers",    "darkred",   "#FDBB30",
  "DAL", "Dallas Mavericks",       "blue",      "#B8C4CA",
  "DEN", "Denver Nuggets",         "darkblue",  "#FEC524",
  "DET", "Detroit Pistons",        "red",       "#1D42BA",
  "GSW", "Golden State Warriors",  "blue",      "#FFC72C",
  "HOU", "Houston Rockets",        "red",       "#000000",
  "IND", "Indiana Pacers",         "darkblue",  "#FDBB30",
  "LAC", "Los Angeles Clippers",   "red",       "#1D428A",
  "LAL", "Los Angeles Lakers",     "blue",      "#FDB927",
  "MEM", "Memphis Grizzlies",      "lightblue", "#12173F",
  "MIA", "Miami Heat",             "red",       "#F9A01B",
  "MIL", "Milwaukee Bucks",        "darkgreen", "#EEE1C6",
  "MIN", "Minnesota Timberwolves", "darkblue",  "#9EA2A2",
  "NOP", "New Orleans Pelicans",   "darkblue",  "#C8102E",
  "NYK", "New York Knicks",        "blue",      "#F58426",
  "OKC", "Oklahoma City Thunder",  "blue",      "#EF3B24",
  "ORL", "Orlando Magic",          "blue",      "#C4CED4",
  "PHI", "Philadelphia 76ers",     "blue",      "#ED174C",
  "PHX", "Phoenix Suns",           "darkblue",  "#E56020",
  "POR", "Portland Trail Blazers", "red",       "#000000",
  "SAC", "Sacramento Kings",       "purple",    "#63727A",
  "SAS", "San Antonio Spurs",      "black",     "#C4CED4",
  "TOR", "Toronto Raptors",        "red",       "#000000",
  "UTA", "Utah Jazz",              "darkblue",  "#F9A01B",
  "WAS", "Washington Wizards",     "darkblue",  "#E31837"
)
head(nba_abbr_cols)
# A tibble: 6 × 4
  Code  Franchise           colour_marker colour_icon
  <chr> <chr>               <chr>         <chr>      
1 ATL   Atlanta Hawks       red           #C1D32F    
2 BKN   Boston Celtics      black         #FFFFFF    
3 BOS   Brooklyn Nets       green         #BA9653    
4 CHA   Charlotte Hornets   darkblue      #00788C    
5 CHI   Chicago Bulls       red           #000000    
6 CLE   Cleveland Cavaliers darkred       #FDBB30    

Now this extra information can be joined to our scraped and wrangled data frame from before.

nba_table <- nba_wrangle %>% 
  left_join(nba_abbr_cols, by = c("Team" = "Franchise")) %>%
  select(Code, everything())

glimpse(nba_table)
Rows: 30
Columns: 12
$ Code          <chr> "BKN", "BOS", "NYK", "PHI", "TOR", "CHI", "CLE", "DET", …
$ Team          <chr> "Boston Celtics", "Brooklyn Nets", "New York Knicks", "P…
$ Conference    <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "…
$ Division      <chr> "Atlantic", "Atlantic", "Atlantic", "Atlantic", "Atlanti…
$ City          <chr> "Boston", "New York City", "New York City", "Philadelphi…
$ State         <chr> "Massachusetts", "New York", "New York", "Pennsylvania",…
$ Arena         <chr> "TD Garden", "Barclays Center", "Madison Square Garden",…
$ Capacity      <dbl> 19156, 17732, 19812, 20478, 19800, 20917, 19432, 20332, …
$ Latitude      <dbl> 42.36630, 40.68265, 40.75056, 39.90111, 43.64333, 41.880…
$ Longitude     <dbl> -71.06223, -73.97469, -73.99361, -75.17194, -79.37917, -…
$ colour_marker <chr> "black", "green", "blue", "blue", "red", "red", "darkred…
$ colour_icon   <chr> "#FFFFFF", "#BA9653", "#F58426", "#ED174C", "#000000", "…

Now we have everything we need to visualise the data and fetch the travel duration times.

2. Map the locations

So where are all the arenas?

We can create a simple interactive map with {leaflet} by plotting the Latitude and Longitude columns and creating custom point markers with a basketball icon and each team’s colours, as well as an information box that appears on-click.

leaflet(nba_table) %>%
  addProviderTiles(providers$Stamen.TonerLite) %>%  # add basemap
  addAwesomeMarkers(  # add markers
    lng = ~Longitude, lat = ~Latitude,  # coordinates
    popup = ~paste0(  # HTML content for popup info
      "<b>", nba_table$Team, "</b>",  # team name
      "<br>", paste0(nba_table$Arena, ", ", nba_table$City),  # location
      if_else(  # division/conference information
        nba_table$Conference == "Eastern",
        paste0("<br><font color='#0000FF'>", nba_table$Division,
               " Division (Eastern Conference)</font>"),
        paste0("<br><font color='#FF0000'>", nba_table$Division,
               " Division (Western Conference)</font>")
      )
    ),
    icon = awesomeIcons(
      library = "ion", icon = "ion-ios-basketball", # add basketball icon
      markerColor = nba_table$colour_marker,  # colour the marker
      iconColor = nba_table$colour_icon  # colour the basketball icon
    )
  ) %>%
  addMeasure()  # add straight-line distance-measuring tool

You can drag and zoom and click the points.

3. Get travel duration

So how far between these locations?

The {osrm} R package from Timothée Giraud, Robin Cura and Matthieu Viry lets you fetch shortest paths and travel times from OpenStreetMap via the OSRM API. It defaults to driving, but you can select walking and biking too. Since we’re using the demo server for OSRM, we can only fetch duration.

Duration matrix

The osrm::osrmTable() function takes a data frame (or spatial object) where the first three columns are an identifier and coordinates. The return object is a list, where the first element is a matrix of durations for each pair of points.

nba_locs <- nba_table %>% 
  select(Code, Longitude, Latitude) %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)

nba_dur <- osrmTable(loc = nba_locs)

glimpse(nba_dur)
List of 3
 $ durations   : num [1:30, 1:30] 0 282 276 385 650 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:30] "1" "2" "3" "4" ...
  .. ..$ : chr [1:30] "1" "2" "3" "4" ...
 $ sources     :'data.frame':   30 obs. of  2 variables:
  ..$ lon: num [1:30] -71.1 -74 -74 -75.2 -79.4 ...
  ..$ lat: num [1:30] 42.4 40.7 40.7 39.9 43.6 ...
 $ destinations:'data.frame':   30 obs. of  2 variables:
  ..$ lon: num [1:30] -71.1 -74 -74 -75.2 -79.4 ...
  ..$ lat: num [1:30] 42.4 40.7 40.7 39.9 43.6 ...

Duration: all teams

Let’s take this matrix and tidy it into a data frame so there’s one row per team-pair. We can also round to the nearest minute and calculate the nearest number of hours.

nba_dur_all <-
  as.data.frame(nba_dur$durations) %>% 
  rownames_to_column("Start") %>% 
  mutate(Start = nba_locs$Code) %>%
  rename_with(~c("Start", nba_locs$Code), all_of(names(.))) %>%
  pivot_longer(
    cols = BKN:SAS,
    names_to = "End",
    values_to = "Duration (mins)"
  ) %>% 
  mutate(
    `Duration (mins)` = round_half_up(`Duration (mins)`),
    `Duration (hrs)` = round_half_up(`Duration (mins)` / 60)
  ) %>% 
  arrange(desc(`Duration (mins)`))

Here’s a {DT} interactive table sorted by duration that you can filter. Click the ‘CSV’ button to download the data.

nba_dur_all %>% 
  datatable(
    filter = "top",
    extensions = c("Buttons","Scroller"),
    class = "compact", width = "100%",
    options = list(
      dom = "Blrtip",
      scroller = TRUE, scrollY = 300,
      buttons = list("csv")
    )
  )

So an incredible 58 hours of driving to get from Miami to Portland.

Duration: by division

We can also narrow this down to get only the team-pairs that play in the same division as each other.

nba_dur_div <- nba_dur_all %>%
  left_join(select(nba_table, Code, Division), by = c("Start" = "Code")) %>% 
  left_join(select(nba_table, Code, Division), by = c("End" = "Code")) %>% 
  dplyr::filter(Division.x == Division.y, `Duration (mins)` != 0) %>% 
  select(Division = Division.x, everything(), -Division.y) %>% 
  arrange(Division, desc(`Duration (mins)`))

Again, here’s an interactive table that you can use to explore the data. Note that it’s ordered by Division and then duration in minutes. I’ve hidden the code because it’s the same as for the table above.

Click for the {DT} code
nba_dur_div %>% 
  datatable(
    filter = "top",
    extensions = c("Buttons","Scroller"),
    rownames = FALSE,
    class = "compact", width = "100%",
    options = list(
      dom = "Blrtip",
      scroller = TRUE, scrollY = 300,
      buttons = list("csv")
    )
  )

This time we can see that there’s a maximum of 33 hours of driving required between two teams in the same division: Portland to Oklahoma City.

A quick diversion: routing

We know from using osrm::osrmTable() that Miami to Portland has the longest travel duration. What’s the route?

Fortunately, {osrm} has the function osrmRoute() for fetching the routes between a pair of points.

We can grab a vector of coordinates for each team from our nba_table object and set these as our origin (src) and destination (dst) in osrm::osrmRoute(). The return object is a ‘linestring’ object that contains detail on the coordinates and coordinate system for the route.

# Function to extract latlong vectors for teams
get_ll <- function(data, team_code) {
  team_data <- dplyr::filter(data, Code == team_code)
  lng <- pull(team_data, Longitude)
  lat <- pull(team_data, Latitude)
  lnglat <- c(lng, lat)
  return(lnglat)
}

# Get route between latlong pairs
route <- osrmRoute(
  src = get_ll(nba_table, "MIA"),
  dst = get_ll(nba_table, "POR"),
  returnclass = "sf"
)
Warning: "returnclass" is deprecated.
route
Simple feature collection with 1 feature and 4 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: -122.6658 ymin: 25.78202 xmax: -80.15664 ymax: 45.8407
Geodetic CRS:  WGS 84
        src dst duration distance                       geometry
src_dst src dst 3461.665 5247.537 LINESTRING (-80.18809 25.78...

Now we can set up the same type of {leaflet} map as earlier, but we’ll include only Portland and OKC. I’ve hidden the map definition because it’s almost the same as before.

Click for the {leaflet} map definition
mia_por <- nba_table %>%
  dplyr::filter(Code %in% c("MIA", "POR"))

mia_por_map <- 
  leaflet(mia_por) %>%
  addProviderTiles(providers$Stamen.TonerLite) %>%  # add basemap
  addAwesomeMarkers(  # add markers
    lng = ~Longitude, lat = ~Latitude,  # coordinates
    popup = ~paste0(  # HTML content for popup info
      "<b>", mia_por$Team, "</b>",  # team name
      "<br>", paste0(mia_por$Arena, ", ", mia_por$City),  # location
      if_else(  # division/conference information
        mia_por$Conference == "Eastern",
        paste0("<br><font color='#0000FF'>", mia_por$Division,
               " Division (Eastern Conference)</font>"),
        paste0("<br><font color='#FF0000'>", mia_por$Division,
               " Division (Western Conference)</font>")
      )
    ),
    icon = awesomeIcons(
      library = "ion", icon = "ion-ios-basketball", # add basketball icon
      markerColor = mia_por$colour_marker,  # colour the marker
      iconColor = mia_por$colour_icon  # colour the basketball icon
    )
  ) %>%
  addMeasure()  # add straight-line distance-measuring tool

And to that map we can add the line that defines the route

mia_por_map %>% addPolylines(data = st_geometry(route))

That’s a long way.

4. Make a heatmap

A quick way to visualise the data is to create a heatmap, where we take a matrix of teams in each division and colour by duration. Here, lighter colours indicate greater travel duration.

The plot is interactive; you can hover over squares in each facet to see specific information about that pair, including the exact duration value.

p <- nba_dur_div %>% 
  ggplot(aes(Start, End)) +
  geom_tile(aes(fill = `Duration (hrs)`)) +
  xlab("") + ylab("") + 
  facet_wrap(~Division, scales = "free")

ggplotly(p)

Note the light colours in the Northwest division where teams have to travel far (like the 33 hour trip from Portland and Oklahoma City), while travel durations in the Atlantic and Central divisions are shorter. Of course, the Clippers and Lakers both play in the Staples Center in LA, so their journey time is zero.

Ending the journey

So, this post shows the the power of the {osrm} package for travel distance, duration and routing information.

Of course, it’s never usually as simple as having your geographic data ready to go, so I hope this post also provides a good use-case for {rvest} to help you collect information and {tidyverse} for wrangling it.

The plots here are pretty minimal, but they hopefully give a flavour of how to use {leaflet} for plotting points and the routing between them according to {osrm}.

This post was initially written before the travel restrictions brought about by the 2020 pandemic. Of course, the maps would have been much simpler during for the 2020 playoffs, which all took place in a ‘bubble’ at Disney World, Florida.

Environment

Session info
Last rendered: 2023-08-05 16:58:43 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plotly_4.10.2   DT_0.28         leaflet_2.1.2   osrm_4.1.1     
 [5] sf_1.0-14       janitor_2.2.0   rvest_1.0.3     lubridate_1.9.2
 [9] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1    
[13] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2  
[17] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.3            bslib_0.5.0             xfun_0.39              
 [4] htmlwidgets_1.6.2       tzdb_0.4.0              leaflet.providers_1.9.0
 [7] vctrs_0.6.3             tools_4.3.1             crosstalk_1.2.0        
[10] generics_0.1.3          curl_5.0.1              proxy_0.4-27           
[13] fansi_1.0.4             pkgconfig_2.0.3         KernSmooth_2.23-21     
[16] data.table_1.14.8       lifecycle_1.0.3         farver_2.1.1           
[19] compiler_4.3.1          googlePolylines_0.8.3   munsell_0.5.0          
[22] fontawesome_0.5.1       snakecase_0.11.0        sass_0.4.7             
[25] htmltools_0.5.5         class_7.3-22            yaml_2.3.7             
[28] lazyeval_0.2.2          jquerylib_0.1.4         pillar_1.9.0           
[31] ellipsis_0.3.2          classInt_0.4-9          cachem_1.0.8           
[34] tidyselect_1.2.0        digest_0.6.33           stringi_1.7.12         
[37] labeling_0.4.2          fastmap_1.1.1           grid_4.3.1             
[40] colorspace_2.1-0        cli_3.6.1               magrittr_2.0.3         
[43] utf8_1.2.3              e1071_1.7-13            RcppSimdJson_0.1.10    
[46] withr_2.5.0             scales_1.2.1            timechange_0.2.0       
[49] rmarkdown_2.23          httr_1.4.6              hms_1.1.3              
[52] evaluate_0.21           knitr_1.43.1            viridisLite_0.4.2      
[55] rlang_1.1.1             Rcpp_1.0.11             isoband_0.2.7          
[58] glue_1.6.2              DBI_1.1.3               xml2_1.3.5             
[61] mapiso_0.3.0            rstudioapi_0.15.0       jsonlite_1.8.7         
[64] R6_2.5.1                units_0.8-2            

Reuse

CC BY-NC-SA 4.0