For an avid sports fan and a “numbers” guy (whatever that means), daily fantasy sports is kinda the perfect intersection of all the things I find awesome.
Basically, here is how daily fantasy sports (DFS) works:
- All players playing today are available and have some cost, which is determined by the daily fantasy sports site
- You have some salary cap of fake dollars that you use to buy players
- Your goal is to assemble a team of players that will score the most points, given some scoring structure
- The top scoring teams then get paid out based on the tournament type and the payout structure.
- Here’s a breakdown of the scoring structure on Draftkings
Obviously, a natural question is whether or not it’s possible to predict player performance accurately enough to inform what players one should pick.
I’ve been working on a library (still in active development) that tries to answer this question: it scrapes data from Basketball Reference, stores it in MySQL, and runs a simple linear regression on that data.
As mentioned, I use daily box score data scraped from Basketball Reference (example URL).
Above is an example of predicted player performance for all NBA players playing on 4/10/2015 vs. their actual performance.
Things to note:
- The training data set was from 3/1/2015 to 4/5/2015 and the date for predicted player performance (4/10/2015) is outside this training set.
- R-squared is around 0.47, meaning the model explains roughly 47% of the variance in actual scores and leaves the other 53% or so unexplained. However, this may have less to do with the structure of the prediction model than with the nature of the events being predicted: I’d argue that NBA player performances are highly variable by nature.
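To make the R-squared figure concrete, here is a minimal sketch of how it can be computed from predicted and actual scores. It assumes R-squared means the usual coefficient of determination (1 minus residual sum of squares over total sum of squares); the function name `r_squared` and the five scores are made up for illustration and are not taken from the library or from the 4/10/2015 results.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Variance left over after using the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variance of the actual scores around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up fantasy scores for five players (illustrative only).
actual = [40.0, 22.0, 31.0, 15.0, 27.0]
predicted = [33.0, 25.0, 26.0, 21.0, 30.0]

r2 = r_squared(actual, predicted)
print(f"R-squared: {r2:.2f}, unexplained share: {1 - r2:.2f}")
```

The "unexplained share" printed here is the same quantity as the roughly 53% discussed above, just for toy numbers.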
Let me quickly outline the very simple linear regression model I used to generate the above result.
First, I started with the following logic:
player performance = f(historical player performance, teammate, opponents, schedule)
Basically, I think about a player’s performance as dependent on four factors — how a player has been playing, what teammates are playing, what opponents are playing, and the player’s schedule.
Now, the specific variables I use to represent these four factors are the following:
- Weighted historical performance
  - I calculate this variable using the following weighting: last game’s performance + 0.6 × performance over previous 7 days + 0.3 × performance over previous 14 days + 0.1 × performance over previous 28 days
- Average Draftkings score that the opponent has conceded to the player’s position over the past 28 days
  - Example: Over the past 28 days, point guards have scored, on average, 28.6 points against the Houston Rockets
- Missing teammates’ Draftkings scores
  - Example: Today, there are 3 teammates that are sitting out for the Oklahoma City Thunder. They have averaged 12, 15, and 25 points on Draftkings, respectively. Thus, the number of “missing” points is 52.
- Is the game a back-to-back?
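The four variables above can be sketched as a small feature-building step. This is only an illustration, not the library's actual code: the function names are hypothetical, and I'm assuming "performance over previous N days" means the player's average Draftkings score over that window. The 28.6 and the 12, 15, and 25 come from the examples above; the player's own averages are made up.

```python
def weighted_history(last_game, avg_7d, avg_14d, avg_28d):
    """Weighted historical performance, using the weights listed above."""
    return last_game + 0.6 * avg_7d + 0.3 * avg_14d + 0.1 * avg_28d


def missing_teammate_points(teammate_averages):
    """Sum of the average scores of teammates who are sitting out."""
    return sum(teammate_averages)


def build_features(last_game, avg_7d, avg_14d, avg_28d,
                   opp_allowed_to_position, out_teammate_averages,
                   is_back_to_back):
    """Return the four regressors for one player on one game day."""
    return [
        weighted_history(last_game, avg_7d, avg_14d, avg_28d),
        opp_allowed_to_position,
        missing_teammate_points(out_teammate_averages),
        1.0 if is_back_to_back else 0.0,  # back-to-back as a 0/1 indicator
    ]


# The player's own averages (30, 28, 26, 25) are made up for illustration.
features = build_features(30.0, 28.0, 26.0, 25.0, 28.6, [12, 15, 25], False)
print(features)
```

Each row of features like this, paired with the player's actual score that day, would be one observation for the linear regression.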
There are also a few improvements I’d like to make:
- I should split up historical performance by week, and add each week’s performance as its own variable
- I should actually calculate these regressions on a player-by-player basis rather than aggregating all player information and then running the regression.
  - What I mean by this is that I should aggregate all historical data for a particular player, like Russell Westbrook, and then run a regression to predict Russell Westbrook’s next performance. Then move on to the next player, say, Tim Duncan, and do the same thing.
  - My reasoning is that the impact of these variables (say, whether a game is a back-to-back) will vary greatly from player to player. For somebody who could very well be the human Energizer Bunny (see Westbrook, Russell), a back-to-back may have very little impact. For somebody who could pass as the new “Jake from State Farm” (see Duncan, Tim), back-to-backs could be harder to recover from.
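Here is a rough sketch of what the player-by-player idea could look like. To keep it short, it fits a one-variable least-squares line per player using only the back-to-back indicator, whereas the real model would use all of the variables. The helper `fit_simple_ols`, the player labels, and every score are made up for illustration.

```python
from collections import defaultdict


def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one regressor.

    Assumes xs is not constant (the player has both kinds of games).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Made-up rows: (player, is_back_to_back, fantasy_score). Illustrative only.
games = [
    ("player_a", 0, 52.0), ("player_a", 1, 51.0),
    ("player_a", 0, 50.0), ("player_a", 1, 50.0),
    ("player_b", 0, 34.0), ("player_b", 1, 26.0),
    ("player_b", 0, 36.0), ("player_b", 1, 28.0),
]

# Group each player's games, then fit one regression per player.
by_player = defaultdict(list)
for player, b2b, score in games:
    by_player[player].append((b2b, score))

effects = {}
for player, rows in by_player.items():
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    intercept, slope = fit_simple_ols(xs, ys)
    effects[player] = slope  # estimated change in score on a back-to-back
    print(f"{player}: baseline {intercept:.1f}, back-to-back effect {slope:+.1f}")
```

In this toy data the back-to-back coefficient comes out very different for the two players, which is exactly the kind of difference a single pooled regression would average away. The trade-off is that each per-player fit has far fewer observations to work with.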