In part 1 of our efforts to apply machine learning to Fantasy Football projections, we came up with some positive, some mixed, and some very bad results. As Matthew Worley points out, my RBs and WRs really missed the mark (Torrey Smith will not be on the top 10 list).
But I’m not giving up!
As with all data analysis, you should refine what works and then tackle what doesn’t. Let’s take another approach and focus just on QB analysis.
In part 1 we attempted to predict the 2017 rank of players across all positions using the basic stats of fantasy football. What’s more, we did this by homing in on players as the unique instances (rows, in regular nomenclature) and mapping multi-year data onto each player.
I thought it would be useful to take a different approach: somewhat “anonymize” the players out of the data and reframe the goal around successive years to get more instances of data.
- Part 1 data:
Player Name, 2015 rank, 2016 rank, 2014 TDs, 2015 TDs, … predict rank
- Part 2 data:
QB rank1, 2015 pts, 2015 data… predict 2016 pts
QB rank2, 2015 pts, 2015 data… predict 2016 pts
QB rank1, 2014 pts, 2014 data… predict 2015 pts.
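For anyone who prefers to script this restructuring, here is a minimal pandas sketch of the same idea (the players and numbers are made up): each row becomes an anonymous QB-season whose target is that player’s points the following year.

```python
import pandas as pd

# Hypothetical per-season QB stats, one row per player per season
stats = pd.DataFrame({
    "player": ["QB A", "QB B", "QB A", "QB B"],
    "season": [2014, 2014, 2015, 2015],
    "pts":    [280, 310, 295, 250],
    "tds":    [28, 33, 30, 24],
})

# Pair each season's stats with the SAME player's points the next year
nxt = stats[["player", "season", "pts"]].copy()
nxt["season"] -= 1                      # shift back so 2015 pts line up with 2014 rows
nxt = nxt.rename(columns={"pts": "next_year_pts"})

rows = stats.merge(nxt, on=["player", "season"])
# Each row is now an anonymous "QB-season": features from year N, target = pts in year N+1
print(rows[["season", "pts", "tds", "next_year_pts"]])
```

Stacking every year-pair this way is what multiplies the number of training instances without needing more players.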
By using year-over-year predictions we can see if the model we propose holds up over multiple iterations of these successive seasons. The commonality across these data points might be more elusive, but if we find one it should be strong enough to apply to 2016 data points and predict 2017 pts.
First, plan your approach:
- Scrape and refine new data for multi-year stats for the QB position
- Train and test our first pass at the data.
- Tweak the modeling and test again.
- Add data elements and repeat.
Scrape and Refine Data
First, I pulled all of the player data using information from FootballDB.com.
I used old-fashioned scraping: copy, paste, and clean in Excel. There are definitely quicker ways to do this, but whenever I’m using new data sources I like to do this manually to get to know the data more intimately. It can be mundane, but pulling down the top players by position for multiple years took me under an hour to accomplish.
Of course, all of the sources for our data have different structures and for some reason this source appends team name to the player name.
Let’s fix that in Excel:
Look for the “Text to Columns” functionality and use a comma as the delimiter to pull name from the “name, team” field. In the end, we don’t particularly care about name but we have to map a few data points from multiple years.
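If you’d rather do this split in code than in Excel, pandas can mimic “Text to Columns” with a comma delimiter in one line (the names below are just placeholders):

```python
import pandas as pd

# Hypothetical scraped column that fuses player name and team, e.g. "Cam Newton, CAR"
df = pd.DataFrame({"name_team": ["Cam Newton, CAR", "Tom Brady, NE"]})

# The pandas equivalent of Excel's "Text to Columns" with a comma delimiter
df[["name", "team"]] = df["name_team"].str.split(",", n=1, expand=True)
df["team"] = df["team"].str.strip()   # drop the leading space after the comma
print(df[["name", "team"]])
```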
Organize the data for successive years (e.g. each 2014 row will carry a field with that player’s 2015 pts). For the first round I kept the data to the basics provided by FootballDB.
You should get to know how to use VLOOKUP in Excel real quick but here’s a primer:
1. Get the data into 2 different tabs:
2. Create a new column next to 2014 pts to pull in the 2015 pts from the other tab using VLOOKUP (notice my cleaned player name data):
Lastly, repeat this for each pair of successive years (I went back to 2011: 2011 → 2012; 2012 → 2013; 2013 → 2014; 2014 → 2015; 2015 → 2016). These should all end up in one sheet, so you might rename the previous year’s stats “year2pts” and “year2TDs” and the successive year’s points “year1pts.”
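The VLOOKUP step maps to a left merge in pandas, if you want to skip the spreadsheet. A small sketch with hypothetical players and points:

```python
import pandas as pd

# Two "tabs": 2014 stats and 2015 points (players and numbers are made up)
y2014 = pd.DataFrame({"player": ["QB A", "QB B", "QB C"],
                      "year2Pts": [280, 310, 190],
                      "year2PassingTD": [28, 33, 15]})
y2015 = pd.DataFrame({"player": ["QB A", "QB B"],
                      "year1pts": [295, 250]})

# Equivalent of VLOOKUP: pull each player's 2015 points alongside the 2014 row
merged = y2014.merge(y2015, on="player", how="left")
print(merged)   # QB C gets NaN, like VLOOKUP's #N/A for a player with no 2015 row
```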
Of course we’ll be using our favorite tool BigML.com to do all of this analysis.
First Pass: Train and Test
Once you’ve uploaded the source to BigML.com (I find .csv files work best), convert it to a dataset and then split the data into test and training sets. There’s a one-click function for this, or a more manual process, but it’s almost instantaneous either way:
BigML.com speeds you up and puts you right into the 80% dataset you created. We’ll use that other data set to test our theory after we create models. See my previous articles on how this is done.
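If you want to reproduce that 80/20 split outside of BigML, a couple of lines of pandas will do it (the columns here are stand-ins for the real stat fields):

```python
import pandas as pd

# Stand-in for the uploaded CSV (hypothetical columns)
data = pd.DataFrame({"year2Pts": range(100),
                     "year2PassingTD": range(100),
                     "year1pts": range(100)})

# The 80/20 split: sample 80% of rows for training, keep the rest for evaluation
train = data.sample(frac=0.8, random_state=42)
test = data.drop(train.index)
print(len(train), len(test))  # 80 20
```

Fixing `random_state` makes the split reproducible, which matters when you re-run evaluations against the same holdout.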
Now… in this example above I actually used ALL of the player positions for my first pass. It was only later that I narrowed the scope down to QBs. I did this for 2 reasons:
- I might get lucky and find a magic formula that would work across positions, and
- I’ll want to do the same data scraping and cleaning for ALL of the positions eventually so why not do it all at once now :)
I’m gonna skip ahead a bit and get right into the meat of the matter because, as you can see from the image on the left, I did A LOT of tests and refinements before I found my momentum.
After every dataset split into testing and training I did the following:
- I created a single model on the 80% dataset using the 1-click feature.
- Next, I built an ensemble model with the default setting of 10 iterations.
- Then I mixed it up and did a boosted tree option with larger iterations and played with some of the more advanced settings.
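Those three passes map roughly onto scikit-learn’s tree models, if you want to experiment locally. This is only a sketch on synthetic data, not BigML’s exact algorithms:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                 # stand-ins for the year2 stat columns
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)   # synthetic "year1pts" target

# 1) Single model: one decision tree, analogous to BigML's 1-click model
single = DecisionTreeRegressor(random_state=0).fit(X, y)

# 2) Ensemble: 10 bagged trees, analogous to BigML's default ensemble
bagged = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# 3) Boosted trees with a much larger iteration count
boosted = GradientBoostingRegressor(n_estimators=245, random_state=0).fit(X, y)
```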
Each time I got a model, I ran an evaluation against the 20% holdout using the evaluate pulldown.
Even my best results were disappointing at first:
The R SQUARED number is the one to focus on. It denotes the proportion of the variance in the held-out testing dataset that is explained by the model trained on the other 80%.
The closer that gets to 1, the better I feel about our 2017 predictions.
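For the curious, R² is easy to compute by hand: a score of 1 means perfect predictions, and 0 means you did no better than always guessing the mean (the points below are made up):

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

actual = [295, 250, 310, 180]               # hypothetical held-out next-year points
perfect = r_squared(actual, actual)          # predictions match exactly -> 1.0
baseline = r_squared(actual, [258.75] * 4)   # always predicting the mean -> 0.0
print(perfect, baseline)
```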
Keep in mind, I’m taking a very pragmatic approach to this. I’m looking for:
- General trends in the data points to make me smarter about how to choose the best QB
- The ability to rule someone in or out of a final choice when I have to choose them or play against them.
- In the end, I love data… and I love applying it to something fun with charts and crap :)
Improve the Data and Models
This is a Goldilocks enterprise: sometimes you have to try a lot of beds before you find the one that feels just right.
I decided to add in a few data points including the points of successive years for each player (so yeah, I went back on my approach and did a hybrid from part 1). Then, since I was focusing just on QBs, I decided to map in a QBR rating for each player and then add in the post-season QBR as well.
This gave another ranking beyond the points score to order the quality of each QB. It also provided a way to capture post-season quality, since I left the field blank for players who didn’t make it that far.
Yes, I had to go back and fill in QBRs for every season… but it was totally worth it, because here are the results of an ensemble model using boosted trees with 245 iterations:
Topping .7 would be a HUGE advantage. I’ve tried adding in team and strength-of-schedule scores, but this model with just the added regular and post-season QBRs seems to work the best.
Here is the field list in order of importance. Notice that regular-season QBR was a key factor in getting to the .67:
1. year2PassingTD: 21.31%
2. reg-QBR: 9.89%
3. year2RushingAtt: 9.47%
4. year2Pts*: 8.43%
5. year2RushingYds: 7.58%
6. year2PassingAtt: 6.73%
7. year2Bye: 5.59%
8. year2PassingCmp: 5.16%
9. year2FumblesFL: 4.37%
10. year2PassingInt: 4.26%
11. year2PassingYds: 4.13%
12. 2012: 2.97%
13. 2011: 2.53%
14. year2RushingTD: 2.50%
15. 2013: 2.32%
16. year2Rushing2Pt: 0.73%
17. year2ReceivingYds: 0.65%
18. year2Passing2Pt: 0.62%
19. year2ReceivingRec: 0.52%
20. post-QBR: 0.21%
21. 2014: 0.02%
22. year2ReceivingTD: 0.01%
Any thoughts on what I can do to improve it even more? Let me know in the comments.