Data has had a big impact on sports in recent years. But coming up with helpful information involves a lot of analysis and data science. Since college football is my favorite sport, I channeled my obsession into a football machine learning project to determine whether or not teams should take the risk and go for it on a fourth down.
It Started With the Cubs
Though this is a story about college football, the tale begins years ago, before that fateful 2016 World Series when the Chicago Cubs ended a 108-year championship drought. The Cubs have a long history of preferring day games, and they kept to them well after other teams installed lights and played night games to bring in more fans. For a university data mining final project, I decided to see whether the Cubs’ propensity for day games, and its effect on players’ circadian rhythms, could be a factor in their poor play.
Travel across time zones without rest days to make up for it seemed to have an impact on players’ performance. My data showed a small yet significant negative impact on a team’s chance to win. Years later it was incredibly validating to see ESPN feature a Northwestern University study that confirmed my results.
Contributing to the sports analytics community has been a way for me to practice advanced analytics. While my interest started with baseball, I’m now moving on to college football. This will be the first blog post in a case study series about this venture as more findings come forth and new topics are explored.
When Should You Go for It on Fourth Down?
In a sport like football, the gameplay has so much variability that it can be difficult to find the type of information that can be coded into data. With players running all over the field and moving in three dimensions, the trackable data is more about the beginning and end of each play: the decisions before the play and the outcomes after. I decided to investigate those crucial decisions that teams make, which allowed me to reduce the complexity of the sport to a smaller set of variables.
Opportunities to score are difficult to come by, so it’s hard to pass up a chance to score some points now for a potential future chance to score even more. This is precisely the decision teams face on fourth down when they are close enough to kick a field goal. Coaches are often known to “take the points” and kick the field goal because it is considered safer. The alternative is going for it, which can produce a touchdown on that play or a first down, and a first down keeps alive the chance of a touchdown later in the possession or of a closer, and thus easier, field goal.
Early in the game, the coach should make the decision that yields the highest expected value, which is very straightforward to calculate in this instance. The expected value of kicking the field goal is 3 (the points awarded for successfully booting the ball through the uprights) times the probability of making that given field goal. The expected value of going for it on fourth down is the expected points of starting first down at the line to gain times the probability of getting to that line to gain. Just comparing and selecting the highest expected point value helps make the decision for you. As a result, I had my basis for an effective machine learning model.
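As a minimal sketch of that comparison, the calculation fits in a few lines of Python. The function names and the numbers plugged in at the bottom are illustrative assumptions, not output from my actual models.

```python
FIELD_GOAL_POINTS = 3  # points for a made field goal


def expected_points_kick(p_make: float) -> float:
    """Expected points from attempting the field goal."""
    return FIELD_GOAL_POINTS * p_make


def expected_points_go(p_convert: float, ep_first_down: float) -> float:
    """Expected points from going for it.

    ep_first_down is the expected points of starting first down at the
    line to gain. As in the text, a miss or a failed conversion is
    treated as worth zero points.
    """
    return p_convert * ep_first_down


def fourth_down_call(p_make: float, p_convert: float, ep_first_down: float) -> str:
    """Pick whichever option has the higher expected points."""
    kick = expected_points_kick(p_make)
    go = expected_points_go(p_convert, ep_first_down)
    return "go for it" if go > kick else "kick"


# Hypothetical inputs: a 75% field goal, a 60% conversion chance, and
# 4.5 expected points for a first down at the line to gain.
print(fourth_down_call(p_make=0.75, p_convert=0.60, ep_first_down=4.5))
```

In the real project, the two probabilities come from the models described in the following sections rather than being typed in by hand.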
Confirming the Data Exists
Finding the right data for this kind of effort can be a challenge. It’s easy to find box scores for every game, but there’s no guarantee of a treasure trove of neatly (and accurately) formatted college football play-by-play data. And there was certainly no guarantee it would be free.
Luckily, the fine folks at Coaches by the Numbers didn’t start charging for their data until 2014, so the previous decade’s worth of data was readily available, and its hundreds of thousands of plays were plenty for my needs. Securing very accurate data in a simple delimited format, ready-made for the Python data analysis and machine learning modules I tend to use, is a gift. As a rule of thumb, as much as 90% of the effort in a data science project goes toward cleaning data and finding the necessary variables, so be prepared for that on any artificial intelligence project in your future. I also supplemented this data with data I scraped from STASSEN. Combining multiple data sources is another common component of data science projects and something to embrace.
Creating the Models
My fourth down scenario had a couple different pieces to the puzzle, so there was ample opportunity to try several different models. The goal of each of these models was to predict the probability that a binary classification would occur:
On the field goal kick path, it was the probability of making the field goal.
On the going-for-it path, it was the probability of converting that fourth down.
A common trope in advanced analytics is that starting with the simplest approach often provides the most success (think Occam’s razor). In a classification task like this one, logistic regression is one of the simplest approaches, and it was the first type of model I used.
Scikit-learn is a popular Python machine learning library that has many algorithms, including logistic regression. I employed scikit-learn for creating models while using pandas (a data manipulation library) to pull the files into data frames to be worked on in memory. Beyond logistic regression, I also tried a common classification method called k-nearest neighbors, along with random forest, Naïve Bayes, and Gaussian process regression.
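To give a feel for what this looks like in practice, here is a small sketch of fitting a logistic regression with scikit-learn and reading off a probability. It is not my project code: the data is synthetic, and the column names kick_distance and made are hypothetical stand-ins for whatever the real play-by-play files contain.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for play-by-play data: longer kicks are made less often.
rng = np.random.default_rng(0)
distance = rng.integers(18, 60, size=2000)          # kick distance in yards
p_true = 1 / (1 + np.exp(0.12 * (distance - 48)))   # made-up "true" make rate
kicks = pd.DataFrame({
    "kick_distance": distance,
    "made": (rng.random(2000) < p_true).astype(int),  # 1 = made, 0 = missed
})

# Fit a logistic regression on a single feature.
model = LogisticRegression(max_iter=1000)
model.fit(kicks[["kick_distance"]], kicks["made"])

# predict_proba returns one column per class; column 1 is P(made = 1).
new_kicks = pd.DataFrame({"kick_distance": [25, 45, 55]})
p_make = model.predict_proba(new_kicks)[:, 1]
print(dict(zip(new_kicks["kick_distance"].tolist(), p_make.round(2).tolist())))
```

The same pattern applies to the fourth down conversion model, with features such as yards to go in place of kick distance.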
Choosing the Most Accurate Model
There are several methods to calculate the accuracy of a model so you can compare different options and configurations of those options. One of the most popular is splitting your data into a group to train the models while setting aside a random portion of the data for testing purposes.
For each of the different algorithms I chose, there are settings called hyperparameters. Hyperparameters are like knobs you can turn to tune your model and try to reach greater accuracy. I wrote some quick functions to test the accuracy of each algorithm on the training set as compared to the test set, and to plot a graph for each.
Looking at these plots allowed me to see that logistic regression was the best option for both predictions while also allowing me to pinpoint the value I should use for C, the main hyperparameter for logistic regression. In scikit-learn, C is the inverse of the regularization strength: smaller values penalize large coefficients more heavily and yield a simpler model, while larger values let the model fit the training data more closely. The goal is to use C to avoid overfitting, where a model predicts outcomes on the training data almost perfectly but predicts outcomes on new data poorly.
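The train/test comparison across values of C can be sketched roughly as follows. This is an illustration rather than my original functions: the data is synthetic, the feature names yards_to_go and yard_line are assumptions, and the plotting step is left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic fourth down attempts: shorter distances convert more often.
rng = np.random.default_rng(42)
n = 3000
yards_to_go = rng.integers(1, 11, size=n)
yard_line = rng.integers(1, 50, size=n)
logit = 1.2 - 0.35 * yards_to_go + 0.01 * yard_line
converted = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([yards_to_go, yard_line])

# Hold out a random 25% of the plays for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, converted, test_size=0.25, random_state=0
)

# Try several values of C and record (train accuracy, test accuracy).
results = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    clf.fit(X_train, y_train)
    results[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"C={C:<7} train={results[C][0]:.3f} test={results[C][1]:.3f}")

# The C with the best held-out accuracy.
best_C = max(results, key=lambda c: results[c][1])
print("best C:", best_C)
```

A large gap between the training and test scores at a given C is the sign of overfitting. A more careful setup would pick C with cross-validation or a separate validation set, so the test set stays untouched for the final accuracy estimate.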
At Credera, we value bringing a diversity of experiences and backgrounds to better solve problems, so I set about finding some help within my network of experienced practitioners. By sharing what I had so far, I was able to make some big improvements:
Former colleague and big data OG Chris Gerken suggested using the decision engine to tweet the results at teams and coaches in real time.
Discussions with another coworker with data science interests, Anna Grace Franklin, about the advanced analytics pieces of the project directly led to a half dozen backlog items to help improve the overall system.
Finally, after a couple weeks of public tweeting and some interactions with fans of the University at Buffalo Bulls, I met with my favorite social media strategy czar, Johna Rutz, and we developed a more complete and focused Twitter strategy. A cadence for content publication and a methodology for increased followership is in the works. My plan is to take her advice to “be flexible and trust the data,” and let it dictate the path of the project moving forward.
Results and Next Steps
My machine learning model found that teams should go for it on almost every fourth down with five or fewer yards to go. But that’s just the beginning.
I’m excited to reconvene in the future for part two of the blog series. Now that a foundation has been laid for my model, I’m looking forward to the upcoming stages in this project’s growth. There is a full backlog of typical back-end fixes and process improvements now that I’ve clarified the direction I want to take, as well as more functional additions in progress. Some of the new functionality will probably be influenced by asking the Twitterverse what they would like to see in the future and for next season. At this point, top priorities include finding a way to include individual kicker accuracy, new non-field goal related decisions, calculating what amount of expected points increase leads to an additional win, and more.
So combining my interests in college football and machine learning has led to some helpful results, the creation of the @topbuttonmafia Twitter account, and interactions with fellow football fans from Central Texas to Upstate New York. I hope this has encouraged many of you to dive in and experiment with machine learning. I think this has shown that curiosity and passion for a field can take you a long way.
Feel free to reach out to me at firstname.lastname@example.org, or find information about our Data Science (ML & AI) focus area online if you have any questions. We are always excited to have a conversation around how we can help you answer your most difficult questions and break away from heuristics using advanced analytics.