I wanted to take a look at some historical data from the Twin Cities Marathon for an ongoing class project. This is not the finished project; this post is intended to highlight one way to build heatmaps in R using usmap and ggplot.
Setup
Naturally, we have to add some packages -
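something like the following should cover everything this post uses (usmap and ggplot2 are the key ones; dplyr, tidyr, stringr, and readr handle the data wrangling):

```r
# Packages used throughout this post
library(usmap)    # plot_usmap(), usmap_transform()
library(ggplot2)  # plotting
library(dplyr)    # filter(), mutate(), count(), %>%
library(tidyr)    # separate()
library(stringr)  # str_remove_all()
library(readr)    # read_csv(), write_csv()
```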
A note about the data
This data comes from the Twin Cities Marathon results page.
For each year, you can get 500 results at a time.
I couldn’t find an endpoint that would let me paginate the data (which would have let me automate the process of getting it), and I didn’t want to spend a day writing a script, so I just clicked through each page and copied and pasted the data.
It copied cleanly into a TSV, which I opened in Excel and saved as a CSV so I could read it here.
I saved each year’s results as its own CSV (2015-2019).
Manipulation
I had to do some manual manipulation to get this data. Because this post is instructional in nature, I have omitted the details and will share them in a later post.
I had to make some manual changes to this data: I wanted the year preserved for each result when I aggregated everything into one CSV covering all five years.
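As a sketch of that step (the file names below are placeholders for however you saved the per-year files):

```r
# Read each year's CSV, tag its rows with the year, and stack them into one frame.
# File names like "tcm_2015.csv" are placeholders, not the actual names I used.
years <- 2015:2019
results <- bind_rows(lapply(years, function(yr) {
  df <- read_csv(paste0("tcm_", yr, ".csv"))
  df$Year <- yr
  df
}))

write_csv(results, "tcm_2015_2019.csv")
```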
OK, here comes the part possibly nobody else cares about. The compiled CSVs were run through a script that replaced participants’ names with hashes (using the same salt for all five years) in case I wanted to check whether a participant ran the race in multiple years.
That script also made requests to Mapbox to geocode each participant’s city and state, adding columns for that city’s latitude/longitude and its distance to the start line (well, its distance to the city center of Minneapolis).
So my data includes the distance to the start line in miles and a latitude/longitude variable stored as a single bracketed [lat, long] value.
It’s worth mentioning that the script making the Mapbox API requests took a long time, because you’re limited to (I think) 600 requests per minute, I had roughly 38,000 rows, and I didn’t want to store the intermediate results.
(Making more requests is cheaper than crashing my MacBook. Thanks, Mapbox! Love ya!)
Below, I had to parse those into individual columns in the data:
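Roughly like this, assuming the combined value lives in a column called LatLong (the real column name in my data may differ):

```r
# Strip the brackets from the "[lat, long]" strings, then split them into
# numeric Lat and Long columns. The column name LatLong is an assumption here.
results <- results %>%
  mutate(LatLong = str_remove_all(LatLong, "\\[|\\]")) %>%
  separate(LatLong, into = c("Lat", "Long"), sep = ",\\s*", convert = TRUE)
```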
Why we want maps
Here’s a small example to demonstrate why we want to use maps to better visualize the data:
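As a rough sketch (assuming the results have a State column with state abbreviations), a plain count of finishers per state makes the problem obvious: Minnesota dwarfs everything else, which squashes the rest of the chart.

```r
# Count finishers by home state -- Minnesota towers over every other state.
results %>%
  count(State) %>%
  ggplot(aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Home state", y = "Finishers")
```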
Ok… But can’t we remove Minnesota if we want to see the data better? Well, kind of:
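Dropping Minnesota first (again assuming a State column with abbreviations) spreads the remaining states out a bit:

```r
# Same chart with Minnesota removed so the other states become visible.
results %>%
  filter(State != "MN") %>%
  count(State) %>%
  ggplot(aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Home state", y = "Finishers (excluding MN)")
```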
It’s better, but more specific data would be nice.
Solving our problems with maps
It would be great if we could see participants’ hometowns on a map.
To start, we need a US mapping package for R.
R US Mapping Packages
I learned everything I needed to know about the mapping packages from the earthquakes example (for geom_point) here.
In short, all I needed to do was this:
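In sketch form (not my exact script): run the longitude/latitude columns through usmap_transform() so they end up in the projection plot_usmap() uses, then layer the points on top with geom_point(). I’m assuming the parsed columns are named Long and Lat here; in the usmap version I was using, the transformed columns come back with a .1 suffix (hence Long.1 and Lat.1 below), while newer versions of the package name the output columns differently.

```r
# usmap_transform() expects longitude first, then latitude, so put those
# columns up front. The transformed columns come back as Long.1 / Lat.1.
map_points <- results %>%
  select(Long, Lat, everything()) %>%
  usmap_transform()

plot_usmap() +
  geom_point(data = map_points,
             aes(x = Long.1, y = Lat.1),
             alpha = 0.3, color = "darkred")
```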
As you can see, there’s still a lot of data here outside the US.
I didn’t have time last night to figure out what was going on there, so I added a filter statement to remove it: %>% filter(Lat.1 >= -4000000) %>% filter(Lat.1 <= 2000000) %>% filter(Long.1 >= -2500000) %>% filter(Long.1 <= 2000000)
These are the approximate transformed coordinate values outside of which I didn’t want to display points on the map.
The Final Result
I added filters on the data to split it into in-Minnesota and outside-of-Minnesota subsets (as well as trying to remove the non-US locations, without a ton of luck, oh well).
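Sketched out, with the same assumed column names and state abbreviations as above, the final maps look roughly like this:

```r
# Outside-of-Minnesota points, clipped to the approximate bounds mentioned above.
outside_mn <- map_points %>%
  filter(State != "MN") %>%
  filter(Lat.1 >= -4000000, Lat.1 <= 2000000,
         Long.1 >= -2500000, Long.1 <= 2000000)

plot_usmap() +
  geom_point(data = outside_mn, aes(x = Long.1, y = Lat.1),
             alpha = 0.4, color = "darkred")

# Minnesota-only points on a Minnesota-only map.
inside_mn <- map_points %>% filter(State == "MN")

plot_usmap(include = "MN") +
  geom_point(data = inside_mn, aes(x = Long.1, y = Lat.1),
             alpha = 0.4, color = "darkblue")
```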
Conclusion
That’s it! There’s a lot I haven’t figured out about maps yet (I’m working on some aggregates now), but if this process walkthrough helps anyone else, I think it was worth it!
If you have any recommendations on how to improve this page, just get in touch.