After hours and days of trial and error (and error and trial again) I feel confident enough to release the culmination of my two previous articles (part 1 & part 2) — a Machine Learning / Artificial Intelligence fantasy football 2017 cheat sheet.
The R squared scores were .84 and .78 for the top models run on QB and RB respectively!
Did I lose you? Read below… and then buy my cheat sheet and let’s try this thing.
I’m just looking for enough people to show interest so I can justify dedicating some really good hours to making this dream come true: a predictive model for fantasy football.
Here’s a quick look at the top list on the quarterbacks:
PART 1: Dominate your 2017 Fantasy Football League with Artificial Intelligence and Machine Learning
September is just around the corner which means back-to-school shopping specials and more importantly — fantasy football drafts!
I can still recall joining my first fantasy football league circa 1994. Pen and paper in hand with a pitiful “cheat sheet” I had purchased at the grocery store — I threw myself into the 12-team draft as a novice and came out as a seriously damaged novice still.
20+ years later and I really haven’t improved my game that much. Last year I signed up for 3 different online services to get the upper hand and try to win the $250 first place prize. My fellow VP said her strategy was to draft the players with the cutest butts — she won.
It’s draft time again.
This time I’m bringing the big guns and using my favorite tools like the excellent services at BigML.com to apply some modern-day modeling to my upcoming fantasy football draft.
Let’s set expectations here at the outset.
Predicting player performance is more tenuous than predicting stock prices, but not all AI and ML approaches need to beat the humans. Sometimes the best applications of artificial intelligence and machine learning simply uncover new insights or confirm insights already in hand.
Some previous attempts at predicting fantasy football prowess have focused on very detailed player characteristics, schedule robustness, week-over-week outputs. These are all well and good but my two decades of ff play have given me a motto: just do your best — and have fun. In the end I’m always going to pick a major player from the 49ers — I’m hardwired to do so (side note: I sang the national anthem at the 1985 Super Bowl between the 49ers and the Dolphins at Stanford stadium!)
So, let’s stay at a high level and try to use the last few years of football data to predict top performers and improve our draft.
First, we need the data.
After going through half-a-dozen different sources I settled on stats collected from 2012 to 2016 via The Huddle.
Now, how to clean and slice the data.
For supervised machine learning approaches it makes sense to map the multi-year data for each player in a single set of rows. This became immediately problematic because data from the years 2012–2015 used “Aaron Rodgers” as the name format and 2016 used “Rodgers, Aaron.”
Thankfully, Excel has some pretty good tools to split up text columns.
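The author did this split in Excel; for anyone working locally, a minimal pandas sketch of the same normalization might look like this (the sample rows and the `name` column are made up for illustration):

```python
import pandas as pd

# Hypothetical sample rows: 2016 used "Last, First"; earlier years used "First Last".
df = pd.DataFrame({"name": ["Rodgers, Aaron", "Brees, Drew"]})

def to_first_last(name: str) -> str:
    """Convert 'Last, First' to 'First Last'; leave other formats untouched."""
    if ", " in name:
        last, first = name.split(", ", 1)
        return f"{first} {last}"
    return name

df["name"] = df["name"].map(to_first_last)
print(list(df["name"]))  # → ['Aaron Rodgers', 'Drew Brees']
```

With both formats normalized, the multi-year rows can be matched on a single consistent key.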
So, now I have a table of about 1100 players over 5 seasons with basic fantasy football scoring stats and a rank for each year. One foreseeable issue: I have fewer than 300 players who registered stats in all 5 seasons. However, it’s a good first pass so let’s see what we can do.
Using BigML.com I uploaded my source files, created a dataset for a span of years and tried to predict the 2016 player points and rank. If you want a quick tutorial on how to start see my article: “Machine Learning in Action.”
I’m going to do several passes until I see information that I think is valuable. Here’s a rough outline on how I want to proceed:
- Evaluate the entire data source using the dynamic scatterplot tool at BigML.com for some simple correlation.
- Configure a “random forest” model to understand the field conflicts and filtering we may need to do. We’ll see our first predictions here.
- Create a filtered dataset and exclude “conflicting” fields against the entire dataset. Configure another model on this data.
- Manually create 2 datasets from specific season spans. Match one model to another model and see how things come out in predicting 2017 points.
1) Simple Correlation
The first step I take in any exercise like this is to start with correlations: taking two related data points and asking… do they travel together in a predictable way? This is something you can do in Excel but the advanced features provided by BigML.com are extremely valuable. In many instances I uncover the most valuable insights using these simple scatterplots.
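The same check is easy to reproduce locally with pandas. The tiny DataFrame below is made-up illustrative data, not the real stats, and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical slice of the player table; values are illustrative only.
df = pd.DataFrame({
    "rank_2015": [1, 2, 3, 4, 5, 6],
    "rank_2016": [2, 1, 4, 3, 6, 5],
})

# Pearson correlation answers "do these travel together in a predictable way?"
r = df["rank_2015"].corr(df["rank_2016"])
print(round(r, 3))  # → 0.829
```

A value near 1 (or -1) means the two fields move together predictably; near 0 means last year tells you little about this year.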
First turn the entire sheet into a dataset with the highlighted button at the top.
Next we need to set our target field — the field we are trying to predict. We’ll use 2016 rank (we could also use 2016 points but let’s keep it simple).
Click on the “Dynamic Scatterplot” button on the top left.
We’ll put the field “2016-rank” on the Y-axis and then change the X-axis to various other data points. I’ll use “2015 points” as the color of the data point.
This makes a lot of intuitive sense. We’ve clicked on the regression line button at the top and this gives us a good sense as to where things will line up. But something is funny. I should not be seeing a straight set of perfect points in here. At this point — being totally transparent — I realize my failure. I have duplicate data. Aaron Rodgers is listed with his 2012–2016 datapoints as individual rows.
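A quick sketch of the fix in pandas: pivot the long per-season rows into one wide row per player, so each player appears exactly once. The frame below is made-up sample data, not the real sheet:

```python
import pandas as pd

# Illustrative long-form rows: one row per player per season (the duplicate problem).
long_df = pd.DataFrame({
    "name": ["Aaron Rodgers", "Aaron Rodgers", "Drew Brees", "Drew Brees"],
    "season": [2015, 2016, 2015, 2016],
    "points": [350.0, 340.0, 330.0, 320.0],
})

# Pivot to wide form: one row per unique player, one points column per season.
wide = long_df.pivot(index="name", columns="season", values="points")
print(wide.shape)  # → (2, 2): 2 players, 2 season columns
```

After the pivot, each player's multi-year data lives on a single row, which is what the supervised-learning setup needs.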
OK. I went back and narrowed the source data down to about 1110 unique players — this should give us a better dataset. Let’s try that again:
Now we’re finding that the 2015 rank is indeed a very good indicator of how people perform year-over-year.
Notice that the lines towards the higher numbers (lower performance ranks) line up accordingly. This makes more sense. If you’re in the lower ranks the previous year you’ll likely be there the next year too. In fact we can test this out by zooming in with their dynamic selection tool.
We’ll highlight the 2016 and 2015 ranks to see if the correlation deepens. You might be surprised to learn that it doesn’t. While it seems like the data points at the bottom of the rankings line up more tightly, the number of points falling outside the regression line is substantial and the correlation is not as strong.
Let’s look at the other side of the spectrum and the top 200 players 2016 vs 2015.
So we can infer that a player’s previous-year rank is a fairly good indicator of where their rank will land the subsequent year, but if we’re looking to identify top point scorers or exclude the bottom-rung scorers this may not be a great help.
Speaking of points, let’s see if we find a stronger correlation here. We’ll try 2016 points on the y-axis and 2015 points on the x-axis. Given that rank is merely an ordinal listing of the points, we should see similar results.
The granularity hasn’t changed the correlation, but you can now see the grouping of the low point-getters. So, in the end, we can look at last year’s performance and the best we can confidently say is that if a player sucked last year, they’re more likely to suck this year. We can infer the opposite but not with strong confidence:
The score and rank of each year are made up of the various stats for that year. If we do intra-year scatterplots they’re sure to yield some strong correlations that are really self-referential. (e.g. 2016-TDs will correlate with a higher 2016 score).
Let’s move beyond the simple relational approaches and try something new on this data.
2) Let’s Do Some Modeling
Next we’ll be looking to build some models from the data. Our first pass is to do a model across all fields. This is going to yield some “well, duh” moments but it’s instructive to find all the bad boy fields to filter out.
Here we simply click the button at the top to “Configure Model.” You can cut to the chase by choosing “1-click model” to get to the results. Here we specify the target field, which is “2016 rank.” A shorthand for what happens next: the computer in the cloud at BigML.com will start running paths through all of the data points we’ve given it to determine which path is most likely to predict an outcome on our ranking in 2016. Can you guess what the first item will be?
As you can see, 2016 points is the best determinant of our rank in 2016. Say it together: “well, duh.” But some fun ones appear as well as we trace one of the routes. It wants us to exclude quarterbacks to get there — and anyone who has the name “ronnie” :)
Quarterbacks are few and far between, and as we traverse the right side of the tree (the high rank numbers — lower performance) quarterbacks are at the top. The model is also trying to cover the largest number of players at each split, and quarterbacks make up only a fraction of the players. As to the “ronnie” reference? See if you can figure it out.
At this point we know that we have some filtering to do:
- We’ll have to get rid of most of the 2016 numbers. (After all, we’re trying to create a model that can help us predict future performance from previous and past performance.)
- Let’s also plan to filter out name, position, and team. We’ll concentrate just on the numbers.
- Lastly, let’s shift our focus to predict points and not rank to make our data more granular and our left-right direction (lower points → higher points) more readable.
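The filtering above can be sketched in a few lines of pandas. The table and column names below are made up for illustration; the key idea is that any 2016 field other than the target would leak the answer into the model:

```python
import pandas as pd

# Hypothetical wide table; column names are illustrative.
df = pd.DataFrame({
    "name": ["A", "B"], "position": ["QB", "RB"], "team": ["GB", "DAL"],
    "pts_2015": [350.0, 210.0], "rank_2015": [2, 40],
    "pts_2016": [340.0, 190.0], "rank_2016": [3, 55],
})

# Target is now points, not rank. Drop identity fields, and drop every
# other 2016 field: keeping them would leak the answer into the model.
target = "pts_2016"
leaky = [c for c in df.columns if c.endswith("_2016") and c != target]
features = df.drop(columns=["name", "position", "team"] + leaky)
print(list(features.columns))  # → ['pts_2015', 'rank_2015', 'pts_2016']
```

What remains is prior-year numbers plus the single target column — exactly the shape the next model pass needs.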
To keep things clean let’s create a new dataset from the source and run the model tools again.
The first thing we notice confirms an assumption we made from the scatterplots: it’s a lot easier to identify who’s gonna suck. The thickness of the tree branch denotes the breadth of data points adding strength to that assumption path.
We can adjust the output and the expected error range to concentrate on anyone who scored 108 pts or higher in 2016 (that would put them in the top 100 players) and assume a very generous 100 point swing up or down for the year (a bit over a 6 point swing per game across a 16-game regular season).
Here’s our first indicator that our quarterback data may just skew our results but it confirms our thoughts that previous rank is of first importance.
Fortunately, we can traverse the model trying our best to avoid position-specific stats and see what we can find. For example, we can identify some important bookends on the 2015 rank.
Be sure to watch the details at each node. The farther down you go, the thinner the lines get, meaning the data is representative of only a small portion of the dataset. Here we stopped after 2 nodes, indicating that if your 2015 rank was higher than 114 and, at the next node, higher than 324, you are very likely to end up on the bottom of the heap again.
A quick gut-check of 2017 pre-season rankings vs. 2016 rankings bears this out pretty well. Using stats from Fantasy Pros 2017 no player in the top 100 projected scores was at or below 120 in the 2016 rankings. Note: there are plenty of fresh names and faces that we don’t have stats on — can’t help you there.
3) Filter the dataset down further
Let’s start looking at position. We’ll concentrate on RBs, since they register stats in many categories apart from the QB passing fields, and see if we find some goodies.
This time we chose the sunburst model. One of the things we notice right away is that paring things down to a single position really clamps down on the confidence level we can claim.
But we can glean some insights. Following the RB path we can see that 2015 rank falls behind 2015 rushing yards in field importance.
I find that many of the insights I garner from these ML efforts become subconscious decision points that help me tip the balance in my favor in the heat of a draft!
4) Lastly, let’s take a model from our larger set of data and see if we can apply that to 2016 to predict a 2017 outcome.
We’ll do this very simply. We’ll take data from 2014–2015 hoping to predict 2016 points and then take that trained model and batch predict what will happen to 2017 players based on their 2015–2016 stats.
Setting up the two datasets was fairly straightforward. Take the 2014–2015 data points and create a basic model. Then click the icon for prediction and choose “batch prediction.” (See here for an example of batch prediction matching.)
Most importantly, map 2014 to 2015 fields and 2015 to 2016 fields:
Next put in the names for your prediction and confidence fields. This will generate a predicted score for each person in the 2015–2016 dataset. I called it “2017 points”.
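BigML handles the batch prediction in the browser; a local sketch of the same train-then-project loop might use scikit-learn as a stand-in. The data below is synthetic (random numbers, not the real stat sheets) and exists only to show the shape of the workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real tables: two prior-year stat columns
# predicting next-season points. Real data would come from the cleaned sheets.
X_train = rng.uniform(0, 400, size=(200, 2))            # 2014- and 2015-era stats
y_train = X_train.mean(axis=1) + rng.normal(0, 10, 200)  # 2016 points

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# "Batch prediction": feed the 2015-2016 stats through the trained model
# in the same column order to get a projected 2017 points column.
X_2017 = rng.uniform(0, 400, size=(10, 2))
pred_2017 = model.predict(X_2017)
print(pred_2017.shape)  # → (10,)
```

The crucial detail is the field mapping: the 2015–2016 columns must line up positionally with the 2014–2015 columns the model was trained on, just as in BigML’s mapping screen.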
Once it runs you can download the matched batch as a CSV file. I sorted by the 2017 points score, created a sequential rank and compared my QB, RB and WR results with the 2017 projected ranking from FantasyPros.
I used Excel to place a standard deviation at the top of each column. My quarterbacks came out pretty darn close to what FantasyPros.com is predicting! My running backs and wide receivers are noticeably different from what the pros say but not altogether outrageous.
In my next article I’ll try to refine some of the models and do some batching!
Then we’ll throw caution to the wind and try our hand at deep learning to predict outcomes.
Machine Learning Meets Fantasy Football — Part 2
In part 1 of our efforts to apply machine learning to Fantasy Football projections we came up with some positive, some mixed and some very bad results. As Matthew Worley points out my RBs and WRs really missed the mark (Torrey Smith will not be on the top 10 list).
But I’m not giving up!
As with all data analysis, you should refine what works and then tackle what doesn’t. Let’s take another approach and focus just on QB analysis.
In part 1 we attempted to predict the rank of 2017 players across all positions using the basic stats of fantasy football. What’s more, we did this homing in on players as the unique instances (rows, in regular nomenclature) and mapped multi-year data onto each player.
I thought it would be useful to take a different approach and somewhat “anonymize” the players out of the data and refine our goals on successive years to get more instances of the data.
- Part 1 data:
Player Name, 2015 rank, 2016 rank, 2014 TDs, 2015 TDs, … predict rank
- Part 2 data:
QB rank1, 2015 pts, 2015 data… predict 2016 pts
QB rank2, 2015 pts, 2015 data… predict 2016 pts
QB rank1, 2014 pts, 2014 data… predict 2015 pts.
By using year-over-year predictions we can see if the model we propose holds up over multiple iterations of these successive seasons. The commonality across these data points might be more elusive, but if we find one it should be strong enough to apply to the 2016 data points and predict 2017 pts.
First, plan your approach:
- Scrape and refine new data for multi-year stats for the QB position
- Train and test our first pass at the data.
- Tweak the modeling and test again.
- Add data elements and repeat.
Scrape and Refine Data
First, I pulled all of the player data using information from FootballDB.com.
I used old-fashioned scraping — copy and paste and clean in Excel. There are definitely quicker ways to do this, but whenever I’m using new data sources I like to do it manually to get to know the data more intimately. It can be mundane, but pulling down the top players by position for multiple years took me under an hour.
Of course, all of the sources for our data have different structures and for some reason this source appends team name to the player name.
Let’s fix that in Excel:
Look for the “Text to Columns” functionality and use a comma as the delimiter to pull name from the “name, team” field. In the end, we don’t particularly care about name but we have to map a few data points from multiple years.
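The equivalent of that Text to Columns step in pandas is a one-liner. The `player` column and sample rows below are hypothetical:

```python
import pandas as pd

# Illustrative "name, team"-style column from the scraped source.
df = pd.DataFrame({"player": ["Aaron Rodgers, GB", "Drew Brees, NO"]})

# Equivalent of Excel's Text to Columns with a comma delimiter:
# split once on the comma into separate name and team columns.
df[["name", "team"]] = df["player"].str.split(", ", n=1, expand=True)
print(df[["name", "team"]].values.tolist())
```

The cleaned `name` column then becomes the key for matching players across years.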
Organize the data for successive years (e.g. 2014 data points will host the field with the 2015 pts of the player). For the first round I kept the data to the basics provided by FootballDB.
You should get to know how to use VLOOKUP in Excel real quick, but here’s a primer:
1. Get the data into 2 different tabs:
2. Create a new column next to 2014 pts to pull in the 2015 pts from the other tab using VLOOKUP (notice my cleaned player name data):
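For anyone who prefers code to spreadsheets, a pandas `merge` does the same job as VLOOKUP. The two "tabs" below are tiny made-up frames with illustrative columns:

```python
import pandas as pd

# Two "tabs" as DataFrames; the stat columns are illustrative.
tab_2014 = pd.DataFrame({"name": ["Aaron Rodgers", "Drew Brees"],
                         "pts_2014": [320.0, 305.0]})
tab_2015 = pd.DataFrame({"name": ["Aaron Rodgers", "Drew Brees"],
                         "pts_2015": [350.0, 330.0]})

# A left merge on the cleaned name is the VLOOKUP equivalent:
# pull each player's 2015 points in next to their 2014 row.
merged = tab_2014.merge(tab_2015, on="name", how="left")
print(list(merged.columns))  # → ['name', 'pts_2014', 'pts_2015']
```

Players with no 2015 row would simply get a blank (NaN) in the new column, just as an unmatched VLOOKUP returns #N/A.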
Lastly, repeat this for successive years (I went back to 2011: 2011 → 2012; 2012 → 2013; 2013 → 2014; 2015 → 2016). These should all be in one sheet, so you might rename the previous year’s stats “year2pts” and “year2TDs” and the successive year’s “year1pts.”
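The stacking step is worth spelling out: once every season pair uses the same year2*/year1* names, the pairs can all pile into one training table. A minimal pandas sketch, with made-up frames standing in for the merged season pairs:

```python
import pandas as pd

# Two illustrative season-pair frames, already merged as in the VLOOKUP step.
pair_2014_15 = pd.DataFrame({"year2Pts": [320.0, 305.0], "year1Pts": [350.0, 330.0]})
pair_2015_16 = pd.DataFrame({"year2Pts": [350.0, 330.0], "year1Pts": [340.0, 300.0]})

# Because every pair shares the same column scheme, stacking all season
# transitions into one table multiplies the training instances.
stacked = pd.concat([pair_2014_15, pair_2015_16], ignore_index=True)
print(stacked.shape)  # → (4, 2)
```

This is what buys the extra instances part 2 is after: every player contributes one row per season transition instead of a single wide row.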
Of course we’ll be using our favorite tool BigML.com to do all of this analysis.
First Pass: Train and Test
Once you’ve uploaded the source to BigML.com (I find .csv files work best), convert it to a dataset and then split the data into test and training… there’s a one-click function for this or a more manual process, but it’s almost instantaneous either way:
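BigML's one-click 80/20 split has a direct local analogue in scikit-learn, sketched here on synthetic placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and target standing in for the uploaded CSV.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# An 80/20 training/test split, like BigML's one-click function.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)
print(len(X_train), len(X_test))  # → 40 10
```

The fixed `random_state` makes the split reproducible, which matters when you rerun model configurations against the same holdout.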
BigML.com speeds you up and puts you right into the 80% dataset you created. We’ll use that other dataset to test our theory after we create models. See my previous articles on how this is done.
Now… in this example above I actually used ALL of the player positions for my first pass. It was only later that I narrowed the scope down to QBs. I did this for 2 reasons:
- I might get lucky and find a magic formula that would work across positions, and
- I’ll want to do the same data scraping and cleaning for ALL of the positions eventually so why not do it all at once now :)
I’m gonna skip ahead a bit and get right into the meat of the matter because, as you can see from the image on the left, I did A LOT of tests and refinements before I found my momentum.
After every dataset split into testing and training I did the following:
- I created a single model on the 80% database using the 1-click feature.
- Next, I did an ensemble model with the default settings of 10 iterations.
- Then I mixed it up and did the boosted trees option with more iterations and played with some of the more advanced settings.
Each time I got a model, I did an evaluation against the 20% holdout using the evaluate pulldown.
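Those three passes map naturally onto scikit-learn analogues of BigML's model types. The sketch below uses synthetic stand-in data (random features with a known linear signal), not the real QB table:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Synthetic season stats standing in for the real QB table.
X = rng.uniform(0, 1, size=(300, 4))
y = X @ np.array([3.0, 2.0, 1.0, 0.5]) + rng.normal(0, 0.1, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# The three passes: single tree, default 10-tree ensemble, boosted trees.
models = {
    "single": DecisionTreeRegressor(random_state=0),
    "ensemble": RandomForestRegressor(n_estimators=10, random_state=0),
    "boosted": GradientBoostingRegressor(n_estimators=245, random_state=0),
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```

Each model is fit on the 80% split and scored by R squared on the held-out 20%, mirroring the evaluate step in BigML.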
Even my best results were disappointing at first:
The R SQUARED number is the one to focus on. It denotes the proportion of the variance in the held-out testing dataset that the model’s predictions explain.
The closer that gets to 1 the better I feel about our 2017 predictions.
Keep in mind, I’m taking a very pragmatic approach to this. I’m looking for:
- General trends in the data points to make me smarter about how to choose the best QB
- The ability to rule someone in or out of a final choice when I have to choose them or play against them.
- In the end, I love data… and I love applying it to something fun with charts and crap :)
Improve the Data and Models
This is a goldilocks enterprise — sometimes you have to try a lot of beds before you find the one that feels just right.
I decided to add in a few data points including the points of successive years for each player (so yeah, I went back on my approach and did a hybrid from part 1). Then, since I was focusing just on QBs, I decided to map in a QBR rating for each player and then add in the post-season QBR as well.
This gave another ranking beyond the points score with which to order the quality of the QBs. It also provided a way to show post-season quality; I left the field blank for players who didn’t make it that far.
Yes, I had to go back and fill in QBRs for every season… but it was totally worth it, because here are the results of an ensemble model using boosted trees with 245 iterations:
Topping .7 would be a HUGE advantage. I’ve tried adding in team and strength-of-schedule scores, but this model with just the added regular and post-season QBRs seems to work the best.
Here is the fieldset in order of importance. Notice regular season QBR was a key factor in getting to the .67:
1. year2PassingTD: 21.31%
2. reg-QBR: 9.89%
3. year2RushingAtt: 9.47%
4. year2Pts*: 8.43%
5. year2RushingYds: 7.58%
6. year2PassingAtt: 6.73%
7. year2Bye: 5.59%
8. year2PassingCmp: 5.16%
9. year2FumblesFL: 4.37%
10. year2PassingInt: 4.26%
11. year2PassingYds: 4.13%
12. 2012: 2.97%
13. 2011: 2.53%
14. year2RushingTD: 2.50%
15. 2013: 2.32%
16. year2Rushing2Pt: 0.73%
17. year2ReceivingYds: 0.65%
18. year2Passing2Pt: 0.62%
19. year2ReceivingRec: 0.52%
20. post-QBR: 0.21%
21. 2014: 0.02%
22. year2ReceivingTD: 0.01%
Any thoughts on what I can do to improve it even more? Let me know in the comments.