The NFL is highly unpredictable. Week to week, the experts tout their favorite teams and squabble among themselves about who’s going to beat whom. The Seattle Seahawks were supposed to be unbeatable this season, yet it took only two weeks to blemish their record with a stunning loss to the San Diego Chargers. At the other extreme, the Dallas Cowboys were written off before the season started but, as of Nov. 12, are battling for the top spot in the NFC East. Meanwhile, it’s safe to say my perennial Super Bowl pick (the Atlanta Falcons) is looking pretty poor, which means I’m sitting somewhere near 0-14 on lifetime Super Bowl picks.
We’re all subject to our own personal biases, but what if we could remove this selection bias by using machine learning? Too often we base predictions on statistics we believe are important, but perhaps how well a team plays on a Monday when there’s a full moon and the temperature is below 80 degrees is irrelevant. Many people have successfully tackled this challenge by building complex models with up-to-the-minute data, but we wanted to see how successful we could be using fairly basic and readily available statistics. Following the process outlined below, we built a predictive model to challenge the experts. Each week we’ll predict every game on the weekend’s schedule and publish the model’s picks in the following table. We’ll also be keeping track of how our model is doing throughout the season in the table at the bottom of the page.
If you’re interested in the process we followed to create the model, the rest of this post looks at how we tackled the problem and produced our predictions.
Predicting the winners of sporting events with statistical models isn’t a new concept, so we researched what others had done to learn what worked and what didn’t. AdvancedFootballAnalytics.com was a great source of knowledge on the subject, as they have deeply explored predicting NFL games using statistics. We took an approach similar to that of Jim Warner (http://www.cs.cornell.edu/courses/cs6780/2010fa/projects/warner_cs6780.pdf) by picking game winners based on statistics from previous games in the current season.
Next, we needed to establish a baseline to give our accuracy percentage context. The Las Vegas betting lines set the standard for deciding who’s going to win a game, so we decided to compare our model with theirs, though this does set a rather high standard. The accuracy of the betting lines varies from year to year, from 60% to as high as 80%, but we estimated that, on average, the betting lines are correct between 65% and 70% of the time.
2. Data Collection
Data was collected from www.pro-football-reference.com for seasons 2000-2013. We converted the data into moving averages by team for each game. While the website has a number of statistics, we used summary statistics from each game: first downs, rushing yards and TDs, passing yards and TDs, turnovers, penalties, and sacks. AdvancedFootballAnalytics.com recommends converting these stats into ‘efficiency stats’ by making them averages, e.g. yards per play, rushing yards per attempt, etc., so we took their suggestion when transforming our data.
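To make the transformation concrete, here is a minimal sketch of turning raw game totals into efficiency stats and then into per-team moving averages. It is written in Python purely for illustration and is not our actual pipeline. The field names, the sample numbers, the four-game window, and the choice to exclude the current game from its own average are all assumptions made for this example.

```python
from collections import defaultdict

def efficiency_stats(game):
    """Turn one team's raw game totals into per-play rates."""
    plays = game["rush_att"] + game["pass_att"]
    return {
        "yards_per_play": (game["rush_yds"] + game["pass_yds"]) / plays,
        "rush_yds_per_att": game["rush_yds"] / game["rush_att"],
        "turnovers_per_play": game["turnovers"] / plays,
    }

def pregame_moving_averages(games, window=4):
    """Average each team's previous `window` games within the same season.

    `games` must be in chronological order. A game never contributes to its
    own features, so the result being predicted cannot leak into the inputs.
    """
    history = defaultdict(list)  # (team, season) -> efficiency stats so far
    rows = []
    for game in games:
        key = (game["team"], game["season"])
        recent = history[key][-window:]
        features = None  # no earlier games this season (e.g. week 1)
        if recent:
            features = {
                name: sum(g[name] for g in recent) / len(recent)
                for name in recent[0]
            }
        rows.append({"team": game["team"], "week": game["week"],
                     "features": features})
        history[key].append(efficiency_stats(game))
    return rows

# Made-up totals for illustration only; these are not real box scores.
games = [
    {"team": "ATL", "season": 2013, "week": 1, "rush_att": 20, "rush_yds": 80,
     "pass_att": 30, "pass_yds": 220, "turnovers": 1},
    {"team": "ATL", "season": 2013, "week": 2, "rush_att": 25, "rush_yds": 125,
     "pass_att": 25, "pass_yds": 275, "turnovers": 2},
    {"team": "ATL", "season": 2013, "week": 3, "rush_att": 22, "rush_yds": 99,
     "pass_att": 28, "pass_yds": 201, "turnovers": 0},
]
rows = pregame_moving_averages(games)
```

In this sketch the week 3 row is built only from weeks 1 and 2, which mirrors the idea of predicting a game from what a team has already done this season.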
3. Model Selection
Using R, we tried various classification algorithms to produce a variety of models, training them on data from seasons 2000-2011. We compared the models by measuring the accuracy of their predictions on 2012-2013 data and eliminated the models that performed poorly. Once we had narrowed the field, we incorporated new data and transformed existing fields to see if we could improve our predictions. After comparing the new accuracies, we combined predictions from multiple models to see if we could improve our accuracy any further. We finally settled on one model and proceeded to collect data for 2014.
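The evaluation loop described above can be sketched as follows. We did this work in R; the Python below is only an illustrative stand-in. The three "models" are hypothetical fixed rules rather than trained classifiers, the field names and game records are invented, and majority voting is just one assumed way of combining predictions.

```python
def split_by_season(games, last_train_season=2011):
    """Split chronologically: train on earlier seasons, hold out later ones."""
    train = [g for g in games if g["season"] <= last_train_season]
    holdout = [g for g in games if g["season"] > last_train_season]
    return train, holdout

def accuracy(predict, games):
    """Fraction of games where the predicted home result matches the outcome."""
    correct = sum(1 for g in games if predict(g) == g["home_win"])
    return correct / len(games)

def majority_vote(models):
    """Combine several predictors by picking whatever most of them pick."""
    def predict(game):
        votes_for_home = sum(1 for model in models if model(game))
        return votes_for_home * 2 > len(models)
    return predict

# Hypothetical stand-ins for trained classifiers (True means "home team wins").
models = {
    "always_home": lambda g: True,
    "better_yards_per_play": lambda g: g["home_ypp"] > g["away_ypp"],
    "fewer_turnovers": lambda g: g["home_to"] < g["away_to"],
}

# Invented toy records; each holds pregame moving averages for both teams.
games = [
    {"season": 2011, "home_ypp": 5.5, "away_ypp": 5.0,
     "home_to": 1.0, "away_to": 1.5, "home_win": True},
    {"season": 2012, "home_ypp": 5.8, "away_ypp": 5.1,
     "home_to": 1.2, "away_to": 1.0, "home_win": True},
    {"season": 2012, "home_ypp": 4.9, "away_ypp": 5.6,
     "home_to": 1.8, "away_to": 0.9, "home_win": False},
    {"season": 2013, "home_ypp": 5.0, "away_ypp": 5.4,
     "home_to": 0.8, "away_to": 1.6, "home_win": True},
    {"season": 2013, "home_ypp": 6.1, "away_ypp": 5.2,
     "home_to": 1.1, "away_to": 1.3, "home_win": False},
]

train, holdout = split_by_season(games)
scores = {name: accuracy(model, holdout) for name, model in models.items()}
ensemble = majority_vote(list(models.values()))
scores["majority_vote"] = accuracy(ensemble, holdout)
```

Splitting by season rather than at random keeps the holdout games strictly later than the training games, which is closer to how the model is used during a live season.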
The accuracy of our model on our holdout set (seasons 2012 and 2013) was 66.8%. This doesn’t break any barriers and falls short of consistently beating ‘the experts,’ but it shows that one can create a respectable model using fairly basic statistics gathered from the current season.
Below we’re keeping track of our model’s accuracy as the 2014 season plays out:
While this is a fun application of machine learning, Credera leverages these same methodologies and ideas to create solutions in the business world. If you’d like to learn more about Credera’s capabilities, follow us on LinkedIn or Twitter, or feel free to contact firstname.lastname@example.org.