I wanted to take a look at some historical data from the Twin Cities Marathon for an ongoing class project. This is not the finished project; this post is intended to highlight one way to build heatmaps in R using usmap and ggplot.
Setup
Naturally, we have to add some packages -
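something like the following should cover everything this post uses (usmap and ggplot2 are the key ones; dplyr, tidyr, stringr, and readr handle the data wrangling):

```r
# Packages used throughout this post
library(usmap)    # plot_usmap(), usmap_transform()
library(ggplot2)  # plotting
library(dplyr)    # filter(), mutate(), count(), %>%
library(tidyr)    # separate()
library(stringr)  # str_remove_all()
library(readr)    # read_csv(), write_csv()
```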
A note about the data
This data comes from the Twin Cities Marathon results page.
For each year, you can get 500 results at a time.
I couldn’t find an endpoint that would let me paginate the data (which would have let me automate the process of getting it), and I didn’t want to spend a day writing a script, so I just clicked through each page and copied and pasted the data.
It copied cleanly into a TSV, which I opened in Excel and saved as a CSV so I could read it here.
I saved each year’s results as its own CSV (2015-2019).
Manipulation
I had to do some manual manipulation to get this data. Because this post is instructional in nature, I have omitted the details and will share them in a later post.
I had to make some manual changes to this data: I wanted the year preserved for each result when I aggregated everything into one CSV covering all five years.
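As a sketch of that step (the file names below are placeholders for however you saved the per-year files):

```r
# Read each year's CSV, tag its rows with the year, and stack them into one frame.
# File names like "tcm_2015.csv" are placeholders, not the actual names I used.
years <- 2015:2019
results <- bind_rows(lapply(years, function(yr) {
  df <- read_csv(paste0("tcm_", yr, ".csv"))
  df$Year <- yr
  df
}))

write_csv(results, "tcm_2015_2019.csv")
```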
OK, here comes the part possibly nobody else cares about. The compiled CSVs were run through a script that replaced participants’ names with hashes (using the same salt for all five years) in case I wanted to check whether a participant ran the race in multiple years.
That script also made requests to Mapbox to geocode each participant’s city and state, adding columns for that city’s latitude/longitude and its distance to the start line (well, its distance to the city center of Minneapolis).
So my data includes the distance to the start line in miles and a latitude/longitude variable stored as a single bracketed [lat, long] value.
It’s worth mentioning that the script making the Mapbox API requests took a long time, because you’re limited to (I think) 600 requests per minute, I had roughly 38,000 rows, and I didn’t want to store the intermediate results.
(Making more requests is cheaper than crashing my MacBook. Thanks, Mapbox! Love ya!)
Below, I had to parse those into individual columns in the data:
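Roughly like this, assuming the combined value lives in a column called LatLong (the real column name in my data may differ):

```r
# Strip the brackets from the "[lat, long]" strings, then split them into
# numeric Lat and Long columns. The column name LatLong is an assumption here.
results <- results %>%
  mutate(LatLong = str_remove_all(LatLong, "\\[|\\]")) %>%
  separate(LatLong, into = c("Lat", "Long"), sep = ",\\s*", convert = TRUE)
```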
Why we want maps
Here’s a small example to demonstrate why we want to use maps to better visualize the data:
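As a rough sketch (assuming the results have a State column with state abbreviations), a plain count of finishers per state makes the problem obvious: Minnesota dwarfs everything else, which squashes the rest of the chart.

```r
# Count finishers by home state -- Minnesota towers over every other state.
results %>%
  count(State) %>%
  ggplot(aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Home state", y = "Finishers")
```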
Ok… But can’t we remove Minnesota if we want to see the data better? Well, kind of:
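Dropping Minnesota first (again assuming a State column with abbreviations) spreads the remaining states out a bit:

```r
# Same chart with Minnesota removed so the other states become visible.
results %>%
  filter(State != "MN") %>%
  count(State) %>%
  ggplot(aes(x = reorder(State, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Home state", y = "Finishers (excluding MN)")
```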
It’s better, but more specific data would be nice.
Solving our problems with maps
It would be great if we could see participants’ hometowns on a map.
To start, we need a US mapping package for R.
R US Mapping Packages
I learned everything I needed to know about the mapping packages from the earthquakes example (for geom_point) here.
In short, all I needed to do was this:
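In sketch form (not my exact script): run the longitude/latitude columns through usmap_transform() so they end up in the projection plot_usmap() uses, then layer the points on top with geom_point(). I’m assuming the parsed columns are named Long and Lat here; in the usmap version I was using, the transformed columns come back with a .1 suffix (hence Long.1 and Lat.1 below), while newer versions of the package name the output columns differently.

```r
# usmap_transform() expects longitude first, then latitude, so put those
# columns up front. The transformed columns come back as Long.1 / Lat.1.
map_points <- results %>%
  select(Long, Lat, everything()) %>%
  usmap_transform()

plot_usmap() +
  geom_point(data = map_points,
             aes(x = Long.1, y = Lat.1),
             alpha = 0.3, color = "darkred")
```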
As you can see, there’s still a lot of data here outside the US.
I didn’t have time last night to figure out what was going on there, so I added a filter statement to remove it: %>% filter(Lat.1 >= -4000000) %>% filter(Lat.1 <= 2000000) %>% filter(Long.1 >= -2500000) %>% filter(Long.1 <= 2000000)
These are the approximate transformed coordinate values outside of which I didn’t want to display points on the map.
The Final Result
I added filters on the data to split it into in-Minnesota and outside-of-Minnesota subsets (as well as trying to remove the non-US locations, without a ton of luck, oh well).
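Sketched out, with the same assumed column names and state abbreviations as above, the final maps look roughly like this:

```r
# Outside-of-Minnesota points, clipped to the approximate bounds mentioned above.
outside_mn <- map_points %>%
  filter(State != "MN") %>%
  filter(Lat.1 >= -4000000, Lat.1 <= 2000000,
         Long.1 >= -2500000, Long.1 <= 2000000)

plot_usmap() +
  geom_point(data = outside_mn, aes(x = Long.1, y = Lat.1),
             alpha = 0.4, color = "darkred")

# Minnesota-only points on a Minnesota-only map.
inside_mn <- map_points %>% filter(State == "MN")

plot_usmap(include = "MN") +
  geom_point(data = inside_mn, aes(x = Long.1, y = Lat.1),
             alpha = 0.4, color = "darkblue")
```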
Conclusion
That’s it! There’s a lot I haven’t figured out about maps yet (I’m working on some aggregates now), but if this process walkthrough helps anyone else, I think it was worth it!
If you have any recommendations on how to improve this page, just get in touch.