Tutorial: Building Maps With Marathon Data
A brief overview of the process of creating heat maps using collected marathon data
Summary
- Summary
- Introduction
- Setup
- Why we want maps
- Solving our problems with maps
- The Final Result
- Conclusion
Introduction
I wanted to take a look at some historical data from the Twin Cities Marathon for an ongoing class project. This is not the finished project - this specific post is intended to highlight a means to build heatmaps in R using usmap and ggplot
Setup
Naturally, we have to add some packages -
# install.packages("dplyr") # alternative installation of the %>%
library(dplyr) # alternatively, this also loads %>%
library("ggplot2")
# install.packages("maps")
library(maps) # Provides latitude and longitude data for various maps
included_years<-c(2015,2016,2017,2018,2019)
# String manipulation
library(stringr)
# Verbose regular expressions
library(rebus)
# Eases DateTime manipulation
library(lubridate)
A note about the data
This data comes from the Twin Cities Marathon results page. For each year, you can get 500 results at a time. I couldn’t find an endpoint that would let me paginate the data (which would allow me automate the process of getting the data) and I didn’t want to spend a day writing a script, so I just clicked through each page and copied and pasted the data. It copied cleanly into a tsv, which I opened in excel and saved as a CSV so I could read them here. I saved each year’s results as its own CSV (2015-2019)
Manipulation
I had to do some manual manipulation to get this data. Because this post is instructional in nature, I have omitted the details and will share them in a later post.
tcm<-read.csv('./anonymized_tcm_2019.csv')
tcm_2018<-read.csv('./anonymized_tcm_2018.csv')
tcm_2017<-read.csv('./anonymized_tcm_2017.csv')
tcm_2016<-read.csv('./anonymized_tcm_2016.csv')
tcm_2015<-read.csv('./anonymized_tcm_2015.csv')
#this is just a CSV with each US state's proper name and 2 letter postal code. Originally I wanted one with other countries, with countries' ISO codes, but this was faster and worked for the purposes of what I actually wanted to do.
us_states <- read.csv('./states.csv')
I have to manually make some changes to this data - I wanted the year preserved for results when I aggregated it into one CSV for all five years
tcm_2019 <- tcm_2019%>%rowwise()%>%mutate(Year = 2019)%>%mutate(YearsSince2015 = 2019-2015)
tcm_2018 <- tcm_2018%>%rowwise()%>%mutate(Year = 2018)%>%mutate(YearsSince2015 = 2018-2015)
tcm_2017 <- tcm_2017%>%rowwise()%>%mutate(Year = 2017)%>%mutate(YearsSince2015 = 2017-2015)
tcm_2016 <- tcm_2016%>%rowwise()%>%mutate(Year = 2016)%>%mutate(YearsSince2015 = 2016-2015)
tcm_2015 <- tcm_2015%>%rowwise()%>%mutate(Year = 2015)%>%mutate(YearsSince2015 = 2015-2015)
all_tcms <- rbind(tcm_2019, tcm_2018, tcm_2017, tcm_2016, tcm_2015)
OK, here comes the part possibly nobody else cares about. The CSVs above, after being compiled as CSVs, were run through a script which removed participants’ names (hashed with the same salt for all 5 years) in case I wanted to see if a participant ran the race multiple years.
That script also made a request to Mapbox to read the participant’s city and state and add a column with latitude/longitude and distance to the start line (well, distance to the city center of Minneapolis) from that city.
So my data includes distance to start line in miles and a latitude/longitude variable stored as [
all_tcms <- all_tcms%>%rowwise()%>%mutate(Lat = gsub( "\\[", "",strsplit(LatLong,", " )[[1]][1])[[1]][1] )%>%mutate(Long = gsub( "\\]", "",strsplit(LatLong,", " )[[1]][2])[[1]][1])
head(all_tcms%>%filter(State=="MN"))
## # A tibble: 6 x 17
## # Rowwise:
## Bib Name Sex Age City State Overall SexPl DivPl Time LatLong
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4 bf90… M 29 Minn… MN Mar-47 Mar-… 2 / … 2:15… [44.97…
## 2 17 feac… M 28 St P… MN Jul-47 Jul-… 5 / … 2:19… [44.95…
## 3 19 66c3… M 26 Minn… MN Sep-47 Sep-… 6 / … 2:21… [44.97…
## 4 3516 66ae… M 22 Plym… MN Dec-47 Dec-… 2 / … 2:26… [45.00…
## 5 25 eee2… M 23 Rose… MN 14 / 6… 14 /… 3 / … 2:29… [44.73…
## 6 22 9ead… M 26 Dela… MN 18 / 6… 17 /… 10 /… 2:32… [45.04…
## # … with 6 more variables: DistanceToStartLine <dbl>, seconds <dbl>,
## # Year <dbl>, YearsSince2015 <dbl>, Lat <chr>, Long <chr>
Why we want maps
Here’s a small example to demonstrate why we want to use maps to better visualize the data:
ggplot(data=tcm) + geom_bar(aes(x=State) , position="dodge") + theme(axis.text.x = element_text(angle=90, hjust=1))+
labs(title = "2019 Twin Cities Marathon Finishers by State")
MainStates <- map_data("state")
Ok… But can’t we remove Minnesota if we want to see the data better? Well, kind of:
ggplot(data=tcm%>%filter(State!="MN")) + geom_bar(aes(x=State) , position="dodge") + theme(axis.text.x = element_text(angle=90, hjust=1))+
labs(title = "2019 Twin Cities Marathon Finishers by State without Minnesota")
MainStates <- map_data("state")
It’s better, but more specific data would be nice.
Solving our problems with maps
It would be great if we could see PCPs hometowns on a map.
To start we need
R US Mapping Packages
# install.packages("usmap")
# install.packages("maptools")
# install.packages("rgdal")
library(usmap)
library(maptools)
library(rgdal)
I learned everything I needed to know about the mapping packages from the earthquakes example (for geom_point
) here
In short, all I needed to do was this:
# I had to get use a select statement to get my dataframe to have Long and Lat as the first and second column.
#Then I wanted State so I would be free to remove Minnesota or map only a specific state
#I also wanted to preserve the time it took someone to complete the race as a whole number of hours
pcp_locations<- all_tcms%>%filter(State %in% us_states$Abbreviation)%>%filter(! is.na(State))%>%select(Long, Lat, seconds, State)%>%mutate(Long=as.numeric(Long), Lat=as.numeric(Lat))%>%mutate(Hours=seconds%/%(60*60))
#We want to remove NAs from Lat & Long
pcp_locations <- na.omit(pcp_locations)
eq_transformed <- usmap_transform(pcp_locations)
eq_transformed<-eq_transformed
#When you do the below remember x=Long, y=Lat, this mixes me up every time
plot_usmap() +
geom_point(data = eq_transformed, aes(x = Long.1, y = Lat.1),
color = "red", alpha = 0.1, size=0.1) +
labs(title = "Marathon Finishers") +
theme(legend.position = "right")
As you can see, there’s still a lot of data here outside the US.
I didn’t have time last night to figure out what was going on there, so I added a filter statement to remove it: %>%filter(Lat.1>=-4000000)%>%filter(Lat.1<=2000000)%>%filter(Long.1>=-2500000)%>%filter(Long.1<=2000000)
These are the approximate transformed values outside of which I didn’t want to try to display on the map_data
The Final Result
I added filters on the data to get in minnesota and outside of Minnesota data (as well as trying to remove the non US states, without a ton of luck, oh well)
pcp_locations<- all_tcms%>%filter(State %in% us_states$Abbreviation)%>%filter(! is.na(State))%>%select(Long, Lat, seconds, State)%>%mutate(Long=as.numeric(Long), Lat=as.numeric(Lat))%>%mutate(Hours=seconds%/%(60*60))
out_of_us_locations<- all_tcms%>%filter(!State %in% us_states$Abbreviation)%>%filter(! is.na(State))%>%select(Long, Lat, seconds, State)%>%mutate(Long=as.numeric(Long), Lat=as.numeric(Lat))%>%mutate(Hours=seconds%/%(60*60))
nrow(pcp_locations)
## [1] 37519
nrow(all_tcms)
## [1] 38488
pcp_locations <- na.omit(pcp_locations)
out_of_us_locations<-na.omit(out_of_us_locations)
eq_transformed <- usmap_transform(pcp_locations)
eq_transformed<-eq_transformed%>%filter(Lat.1>=-4000000)%>%filter(Lat.1<=2000000)%>%filter(Long.1>=-2500000)%>%filter(Long.1<=2000000)
eq_mn<-eq_transformed%>%filter(State=="MN")
eq_not_mn<-eq_transformed%>%filter(State!="MN")
eq_transformed_2 <- usmap_transform(out_of_us_locations)
blog_palatte <- c("#005437", "#ffabc8", "#4c476a", "#aa2e00")
plot_usmap() +
geom_point(data = eq_mn, aes(x = Long.1, y = Lat.1),
color = "#ffabc8", alpha = 0.1, size=0.1) +
geom_point(data = eq_not_mn, aes(x = Long.1, y = Lat.1),
color = "#4c476a", alpha = 0.1, size=0.1) +
labs(title = "Marathon Finishers, In & Out of Minnesota") +
theme(legend.position = "right")
Conclusion
That’s it! There’s a lot I haven’t figured out about maps yet (I’m working on some aggregates now) but if this process walkthrough helps anyone else I think it was worth it!
If you have any recommendations on how to improve this page, just get in touch.