For my Data Analysis & Visualization course this fall in my Master of Marketing Analytics program, I analyzed geographical trends in the results from the Twin Cities Marathon from the years 2015-2019.
In total, this came out to about 38,000 rows of data, on which I performed logistic regressions for runner’s home state, the number of times a runner ran the race (with some caveats), and year of participation, in order to understand trends in completion time.
I wanted to take a look at some historical data from the Twin Cities Marathon.
I collected this data by navigating to the event results.
MTEC stores event results over many years (in this case, from 2001-2019), so I was able to get the event results without too much difficulty.
However, the data isn’t returned by an end-user accessible API call, so I wound up retrieving the data manually, 500 results at a time, copying each page’s results into a file, which I saved as .tsv, then converted to .csv using Excel.
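The .tsv to .csv step can also be done without Excel. Here is a minimal sketch using Python's standard csv module; the function name and the sample columns are hypothetical, since the actual column layout of the copied results isn't shown here.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text into comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # csv.writer quotes any field that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample row; the real columns may differ.
sample = "Name\tCity\tState\tTime\nDoe, Jane\tSt. Paul\tMN\t3:29:58\n"
print(tsv_to_csv(sample))
```

Going through the csv module rather than a plain find-and-replace on tabs matters because fields that contain commas need to be quoted to survive the conversion.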
Additionally, I wanted to be able to use this data in the future without worrying about accidentally releasing a CSV of people’s names, hometowns, and ages (just because MTEC does it doesn’t mean I should), so I wrote a Python script to hash the names.
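The script itself isn't reproduced here, so the following is only a sketch of one way to hash names, not necessarily the method used. It assumes a keyed hash (HMAC-SHA256) with a private key, because an unkeyed hash of a name can be reversed by simply hashing a list of likely names. SECRET_KEY and hash_name are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key: in practice, keep this out of the published repository.
SECRET_KEY = b"replace-with-a-private-random-value"


def hash_name(name: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable, keyed hex digest for a participant name."""
    # Normalize whitespace and case so "Jane Doe" and " jane  doe" match.
    normalized = " ".join(name.split()).casefold()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


print(hash_name("Jane Doe") == hash_name("  jane   DOE "))
```

The same name always maps to the same digest, so repeat participants can still be counted after the names are removed.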
In this script, I also retrieved latitude and longitude for the provided hometown of each participant as well as “distance from start line” (which was really just approximate miles to Minneapolis, where the race starts).
Technically, Mapbox, which was my geocoding provider, limits you to 600 requests per minute.
I wasn’t 100% sure whether any requests that were rate-limited would be counted toward my monthly use.
You get 100,000 free requests per month and I have about 38,000 rows, so I only really wanted to have to run this on my complete dataset once.
For those reasons, I added a timer that would wait if we had hit our per-minute limit on requests (the limit was 600, so I stopped at 599 because frankly I think it’s best not to tempt fate).
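A timer like this could be sketched as follows. This is an illustration under assumptions rather than the original code: the class name, the sliding 60-second window, and the injectable clock and sleep functions are all hypothetical choices, and the actual geocoding call is left out.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Allow at most max_per_minute calls in any rolling 60-second window."""

    def __init__(self, max_per_minute=599, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_per_minute = max_per_minute
        self.window_seconds = window_seconds
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of recent requests, oldest first

    def _prune(self, now):
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()

    def wait_for_slot(self):
        """Block until another request is allowed, then record it."""
        now = self.clock()
        self._prune(now)
        while len(self.sent) >= self.max_per_minute:
            # Sleep until the oldest recorded request leaves the window.
            self.sleep(self.window_seconds - (now - self.sent[0]))
            now = self.clock()
            self._prune(now)
        self.sent.append(now)


limiter = MinuteRateLimiter(max_per_minute=599)
for hometown in ["Minneapolis, MN", "Albuquerque, NM"]:
    limiter.wait_for_slot()
    # The geocoding request for `hometown` would go here.
    print("ok to geocode", hometown)
```

Calling wait_for_slot() before every request means the loop over all rows never needs to know about the limit itself.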
I then used geopy against the geocoded latitude/longitude to retrieve distance to the start line.
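To show what that distance calculation involves, here is a dependency-free sketch. It swaps in the great-circle (haversine) formula instead of calling geopy, so its results can differ slightly from an ellipsoidal geodesic distance. The Minneapolis coordinates are approximate and, like the function name, are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius
MINNEAPOLIS = (44.9778, -93.2650)  # approximate (lat, lon), assumed start point


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


# Approximate coordinates for Albuquerque, NM.
albuquerque = (35.0844, -106.6504)
print(round(haversine_miles(albuquerque, MINNEAPOLIS)), "miles (approximate)")
```

For a city-level "how far did this runner travel" measure, the difference between great-circle and geodesic distance is small compared with the imprecision of using a hometown as the starting point.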
I preserved latitude and longitude even though, at the time, I thought I only wanted to use distance to the start line, and I’m glad I did because that let me map the data, which was much easier to think about than aggregates like mean distance.
High-Level Findings: What factors predict a fast race time?
Which factors are good predictors of race time? To answer this, I ran several logistic regressions!
Does state influence race times?
Does the number of times you’ve run TCM between 2015 and 2019 influence the probability of running the race in under 3:30?
I intend to one day run a marathon in under 3:30, and I have run the Twin Cities Marathon myself, so I wanted to understand what factors are correlated with running the race under that amount of time.
Times Run vs State vs Sex
First, I ran a regression for Hours under 3:30 against the number of times run (as an integer) and state (as dummy-coded values); see Regression 1 in the appendix.
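To make the setup concrete, here is a small sketch of how the inputs to such a regression can be prepared: a 0/1 outcome for finishing under 3:30 and dummy-coded state columns. It assumes finish times are stored as H:MM:SS strings; the function names and the choice of baseline state are hypothetical, and the model fitting itself is not shown.

```python
def time_to_hours(finish_time: str) -> float:
    """Convert an 'H:MM:SS' string into hours as a float."""
    hours, minutes, seconds = (int(part) for part in finish_time.split(":"))
    return hours + minutes / 60 + seconds / 3600


def under_330(finish_time: str) -> int:
    """Binary outcome: 1 if the finish time is under 3:30:00, else 0."""
    return 1 if time_to_hours(finish_time) < 3.5 else 0


def dummy_code(values, baseline):
    """One 0/1 column per level except the baseline (the reference level)."""
    levels = sorted(set(values) - {baseline})
    return [{f"state_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# Hypothetical rows: (finish time, home state, times run).
rows = [("3:12:45", "NM", 1), ("4:31:02", "MN", 3), ("3:29:59", "CO", 2)]
outcomes = [under_330(t) for t, _, _ in rows]
states = dummy_code([s for _, s, _ in rows], baseline="CO")
print(outcomes)
print(states)
```

Each dummy coefficient is then read relative to the baseline level, while times run enters as a single numeric column.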
The significant factors (P ≤ 0.1) in this regression were:
Times you’ve run this race between 2015 and 2019
slightly positively correlated with strong significance
Being from the following states/territories:
Positively correlated (0.563433), very strong significance
Positively correlated (0.591973) with significance
Negatively correlated (-0.449123) with very strong significance
Positively correlated (1.969353) with very strong significance
Positively correlated (0.413816) with “weak” (but still statistically significant) significance (0.087 < .1)
Positively correlated (0.60625) with significance
Positively correlated (0.699211) with strong significance
Being (registering for the race as) female:
Negatively correlated (-2.069862) with “weak” significance
The results that stand out here are the “very strong” significances - runners from Colorado, New Mexico, and Washington run faster, and runners from Minnesota run slower.
Are these results surprising? Not especially.
The lowest P value is New Mexico’s (2.63e-11) and it has the highest coefficient as well - I looked at the first 20 men to finish the race each of the five years and seventeen of those 100 total results were out of New Mexico.
New Mexico is at elevation and is hot in the summer, both of which lend themselves well to racing fast in Minnesota’s cold October weather at low elevation (Minneapolis/St. Paul have similar elevations to Austin).
New Mexico’s reputation as a training hotbed is well-documented.
Colorado is similarly unsurprising - Boulder is a training hotbed as well, with multiple pro teams forming training camps there in the past few years.
They have high elevation, lots of summer sun, and hard winters, which build mental toughness, so it’s a great place to train whether you’re a pro or an amateur like me.
Minnesota’s negative correlation makes sense in the context of the popularity of the race among locals. A large field of locals with a wide range of finish times would likely pull the state’s results closer to the mean time (just under 4:30).
Times Run as an Integer or Factor
Next, I ran a regression for Hours under 3:30 against the number of times run as dummy-coded values (Regression 2) and against the number of times run as an integer (Regression 3) to compare results.
In the integer regression, times you’ve run the race is somewhat positively correlated (0.09113) with strong statistical significance (P=2.26e-14, wow!!)
In the dummy coded regression, times you’ve run the race is positively correlated at every statistically significant factor (2, 3, 4, 5 and 10 times) with 10 being weakly significant.
This analysis requires some context on what I’ve called the Common Name Effect, discussed below (under Results For One-Time Participants). In short, the results for more than 5 times run should be ignored because they are an artifact of a participant sharing a name with another participant within or between years.
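One way to read the integer coefficient (0.09113) is as an odds ratio. The sketch below assumes the coefficients reported here are log-odds from a logistic (logit) model, in which case exponentiating a coefficient gives the multiplicative change in the odds of finishing under 3:30. The function names are hypothetical.

```python
import math


def odds_ratio(coef: float) -> float:
    """Multiplicative change in odds for a one-unit increase in the predictor."""
    return math.exp(coef)


def odds_multiplier(coef: float, units: float) -> float:
    """Change in odds for an increase of `units`, assuming a linear log-odds effect."""
    return math.exp(coef * units)


times_run_coef = 0.09113  # integer times-run coefficient from Regression 3

print(f"per additional race: x{odds_ratio(times_run_coef):.3f}")
print(f"five races vs one:   x{odds_multiplier(times_run_coef, 4):.3f}")
```

Under that assumption, each additional time run multiplies the odds of a sub-3:30 finish by roughly 1.1. This describes odds rather than probability, and the compounding across several races only holds to the extent that the integer model's linearity is reasonable.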
Year of Participation
Finally, I did a regression to analyze against the year of participation and number of times run as factors (Regression 4), and just the year of participation (Regression 5):
The coefficients for number of times run changed and running the race 5 times became more strongly significant, but no factors were significant in this regression that weren’t in the previous.
The base case in this regression was 2015 with a participant who only ran the race once. Every other year had a negative coefficient (so, you’re more likely to run under 3:30 in 2015 than any other year).
However, only 2017 was statistically significant (and it was very strongly significant with P=1.18e-05) - it had a coefficient of -0.217072.
To understand this better, I looked at weather on those two days: in 2017 it was 55 degrees Fahrenheit at 8AM whereas in 2015 it was 42 and didn’t hit 55 until noon (at noon in 2017 it was 60).
Finally, in the year-only regression (Regression 5), again only the intercept and 2017 are statistically significant, but this time 2016, 2017, and 2019 are negatively correlated and 2018 is positively (and not statistically significantly) correlated.
Changing Participation Rates
Participation has decreased since 2015:
Chicago Marathon Effect (maybe)
The Twin Cities Marathon and the Chicago Marathon are both held in early October, so it’s worth considering whether holding the events on different weekends would lead to increased participation.
I don’t have a decade’s worth of data, so I can’t say anything conclusive either way, but the effect doesn’t look meaningful.
This makes sense considering what a large percentage of participants are from Minnesota and likely run this race out of convenience or love of their hometown or state. Their pull to the race is likely explained by something other than a love of road racing.
A Side Note: Effect of Sponsorship
Let’s look at sponsor data:
I got the sponsor data by browsing the Internet Archive and manually compiling each page’s results into a CSV including sponsor name, year, and sponsorship level.
If I had more years of data readily accessible, I would investigate how the number of sponsors in a year impacts the rate of participation that year or in subsequent years.
Are Finish Times Slowing Down?
Let’s look at some summary statistics:
In short, not really. The year-to-year differences are most likely explained largely by weather.
How far are participants from the start line?
Minnesotans love this race:
More Minnesotans in Minneapolis/St. Paul participated than those outside the metro.
And, unsurprisingly, more non-Minnesotans who live close to the start line run this race than Minnesotans who live far from it:
Participants out of Minnesota
I wanted to do some visualizations of this data, but the overwhelming number of participants from Minnesota makes it hard to look at, frankly, so I split Minnesota out of the results:
And I removed them altogether to look at non-Minnesotan participant density by state:
How Many Unique Participants?
We have 38488 unique records. Interestingly enough, we have 29504 unique participant names and only 3279 unique participant locations:
Results For One-Time Participants
When doing this analysis, I wanted to look at the differences between participants who had run this event once in the five years and participants who ran the event multiple times in the five years. Of course, the problem with this approach was that some participants will run the event once but have very common names (e.g. the John Smiths of the world), and it isn’t farfetched to expect at least one participant will run the event more than once but change their name between participations (e.g. Beyoncé Knowles and Beyoncé Knowles-Carter).
I originally assumed the data would be completely unusable for this reason - surely there are a large number of participants with the same name running in the same or different years.
To validate my concerns, I looked at the 2019 results on the MTEC website, and what I found surprised me a bit:
I looked at the most common surnames in Minnesota (because, as we’ve seen, most participants are from Minnesota and Minnesota’s ethnic makeup is very different than that of my home state, Texas), to make sure I wasn’t overlooking something that should be obvious.
It looks like the top chances of repeat names are:
Between those surnames for 2019, I only found one instance of triple names and a few double instances (which appeared to increase based on name commonness in and outside of Minnesota, so, more doubles for Andersons and Millers than Larsons).
Based on this soft analysis and the size of the race, I’m less worried about common names messing up the broad trends of this data.
For a larger race such as the Chicago Marathon or for an analysis using more years, I would be much more worried about this effect and how it compounds over the years.
For the remainder of this analysis, I will call this the Common Name Effect.
Cleaning the Analysis
When replicating this research for another race where the same columns are provided, I will consider these steps to get a cleaner analysis (these are ranked based on ease of implementation):
Create a column combining hashed name, city, and state and group for uniqueness on that column
Create a column for approximate year of birth based on age at time of racing and the year of the race and group for uniqueness on that column
The second method is better because people can and do move, but worse because repeat participants born in early October could be counted as unique participants based on whether the race is before or after their birthday that year.
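The two grouping keys described above can be sketched as follows. The function names, the "|" separator, and the sample values are hypothetical, and the name hash is a placeholder string rather than a real digest.

```python
def location_key(name_hash: str, city: str, state: str) -> str:
    """Method 1: identify a participant by hashed name plus hometown."""
    return f"{name_hash}|{city.strip().casefold()}|{state.strip().upper()}"


def birth_year_key(name_hash: str, age: int, race_year: int) -> str:
    """Method 2: identify a participant by hashed name plus approximate birth year."""
    return f"{name_hash}|{race_year - age}"


# Hypothetical runner whose birthday falls after race day in 2015
# but before race day in 2016, so their age jumps by two between races.
first = birth_year_key("abc123", age=29, race_year=2015)
second = birth_year_key("abc123", age=31, race_year=2016)
print(first, second, first == second)
```

The example reproduces the early-October caveat: the same person gets two different approximate birth years, so the second key splits them into two participants.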
Note that I am removing participants who appear in this data more than 5 times, because those records have clearly been affected by the Common Name Effect.
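That filter can be sketched in a few lines. This assumes each result is a dictionary with a hashed-name field; the field name name_hash and the function name are hypothetical.

```python
from collections import Counter


def drop_over_limit(records, key="name_hash", limit=5):
    """Remove every record whose key value appears more than `limit` times."""
    counts = Counter(record[key] for record in records)
    return [record for record in records if counts[record[key]] <= limit]


# Hypothetical data: one name appears 6 times across 5 years of results,
# which is only possible if several people share it.
records = ([{"name_hash": "common", "year": 2015 + i % 5} for i in range(6)]
           + [{"name_hash": "rare", "year": 2016},
              {"name_hash": "rare", "year": 2018}])
kept = drop_over_limit(records)
print(len(records), "->", len(kept))
```

With five years of data, more than five appearances of one name can only come from shared names. Names that appear five times or fewer may still be shared, so this removes only the clearest cases.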