I Accurately Predicted the Outcomes of MLB Games.
With the NBA offseason in full swing and the excitement of free agency coming to an end, all of my sports attention has shifted towards baseball. I find the analytics and mathematical approach to the game just as interesting as the sport itself. The other day I came across Win Probability Added (WPA), a performance measure which is derived from predictive modeling instead of counting statistics. The concept of WPA inspired me to write a simulation to predict the outcomes of MLB games and see how it stacks up to some of the best baseball simulations out there.
What is WPA?
If you have ever been to FanGraphs or watched a TV broadcast of a close MLB game, you may have seen a chart depicting win probability. Put simply, WPA quantifies how much a particular at bat changed a team’s chance of winning the game. For a concise walkthrough, this video by Foolish Baseball explains the fundamental idea of WPA from 0:47 to 3:06. If you find WPA interesting, the same video also discusses Championship Win Probability Added (cWPA), which is the same concept but extends the influence of the at bat from winning the game to winning the World Series. Since WPA is contained within a single game, however, it is a much more accessible and straightforward measure to calculate.
How does the simulation work?
The code I wrote implements the core MLB rules: 3 outs end a half-inning, balls in play and walks can drive in runners, and if there is a tie after 9 innings, each extra inning starts with a runner on second. It is worth mentioning that this simulation is relatively simple, so it doesn’t include the possibility of errors, wild pitches, passed balls, or stolen bases.
In my model, the actual stats of hitters and pitchers during the 2021 season drive the outcome of each at bat. For example, if Juan Soto (the MLB leader in On Base Percentage with .447 at time of writing) has an at bat against a league average pitcher, the probability of him getting on base will be around .447. If Soto instead faces a pitcher prone to giving up walks or hits, his chance of getting on base may climb to around .470. The same concept applies to pitchers: a pitcher who gives up fewer walks and hits than the league average will decrease a batter’s chance of getting on base. For example, if Jacob deGrom (widely regarded as the best pitcher in MLB) matches up with Juan Soto, the odds of Soto reaching base will be noticeably lower than .447.
On top of considering the individual stats of players, I thought that accounting for home field advantage would also be important. After looking at trends in overall MLB hitting splits for home and away games, I gave home hitters a 2% boost to their chance of getting on base and applied a 1.5% penalty to away hitters.
The simulation also considers a player’s hit distribution, so a player like Shohei Ohtani (the MLB leader in Home Runs at time of writing) will hit home runs more frequently than a player like Soto, who reaches base proportionally more often through walks and singles. Altogether, these components determine whether a given batter reaches base and how. If you are interested in the datasets I’ve created, I uploaded both position player stats (batters) and pitcher stats to Kaggle.
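The matchup logic described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the odds-ratio formula for combining batter and pitcher, the league-average OBP of .317, the made-up hit mix, and every function and variable name are assumptions. The 2% boost and 1.5% penalty are read here as relative multipliers, which is also an assumption.

```python
import random

LEAGUE_OBP = 0.317  # assumed placeholder for the league-average on-base percentage


def on_base_probability(batter_obp, pitcher_obp_allowed, is_home,
                        league_obp=LEAGUE_OBP):
    """Combine batter and pitcher rates with an odds-ratio formula (an assumption)."""
    def odds(p):
        return p / (1.0 - p)

    # Against a league-average pitcher the pitcher and league terms cancel,
    # so the result is simply the batter's own OBP.
    matchup_odds = odds(batter_obp) * odds(pitcher_obp_allowed) / odds(league_obp)
    prob = matchup_odds / (1.0 + matchup_odds)
    # Home/away adjustment, treated as a relative change (assumption).
    prob *= 1.02 if is_home else 0.985
    return min(prob, 1.0)


def sample_outcome(on_base_prob, hit_distribution, rng):
    """Return "out", or an on-base outcome drawn from the batter's hit mix."""
    if rng.random() >= on_base_prob:
        return "out"
    outcomes = list(hit_distribution)
    weights = [hit_distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


# Hypothetical hit mix for a patient hitter; the shares are invented for illustration.
patient_hitter = {"walk": 0.40, "single": 0.38, "double": 0.10,
                  "triple": 0.01, "home_run": 0.11}

rng = random.Random(42)
p = on_base_probability(0.447, 0.317, is_home=False)
print(round(p, 3), sample_outcome(p, patient_hitter, rng))
```

A pitcher who allows fewer baserunners than the league average pulls the result below the batter's own OBP, and a pitcher who allows more pushes it above, which matches the Soto examples.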
As the game simulates, it records the counting stats of hitters and pitchers, tracks baserunners, and determines pitching changes depending on the pitcher’s stats. If you are familiar with baseball, you may know that batters usually reach base more frequently when facing the same pitcher for a second or third time in the same game. In order to keep things simple, I decided to calculate every at bat independently: previous at bats do not affect the current one.
That is a general overview of how the code works. If you want to investigate the intricacies of the code yourself, the project’s GitHub repository can be found here.
How are the win probabilities calculated?
Before the first at bat and after every following at bat, the code simulates from that point to the end of the game 1000 times. It counts the number of times the home team won, divides that count by 1000 to get the home team’s win probability, then adds that value to a dataframe where the rest of the generated data is stored.
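The procedure above is a Monte Carlo estimate, and a stripped-down version can be sketched in Python. This is an illustration under heavy assumptions, not the project's code: each team is reduced to a single on-base probability, the mix of bases gained is invented, every runner advances as many bases as the batter, and all names (`finish_half_inning`, `simulate_rest_of_game`, `home_win_probability`, the `state` keys) are hypothetical.

```python
import random

BASES_GAINED = [1, 2, 3, 4]
WEIGHTS = [0.78, 0.12, 0.01, 0.09]  # invented mix of 1-, 2-, 3- and 4-base outcomes


def finish_half_inning(p_on_base, rng, outs=0, runners=()):
    """Play a half-inning out from the given state and return the runs scored.

    runners holds the occupied bases (1, 2 or 3).
    Assumption: every runner advances as many bases as the batter.
    """
    runners = list(runners)
    runs = 0
    while outs < 3:
        if rng.random() < p_on_base:
            gain = rng.choices(BASES_GAINED, weights=WEIGHTS)[0]
            moved = [base + gain for base in runners] + [gain]
            runs += sum(1 for base in moved if base >= 4)
            runners = [base for base in moved if base < 4]
        else:
            outs += 1
    return runs


def simulate_rest_of_game(state, p_home, p_away, rng):
    """Return True if the home team wins when play resumes from `state`."""
    inning, top = state["inning"], state["top"]
    home, away = state["home_score"], state["away_score"]
    outs, runners = state["outs"], state["runners"]
    while True:
        if top:
            away += finish_half_inning(p_away, rng, outs, runners)
        elif not (inning >= 9 and home > away):
            # Walk-offs are not modeled: the home half runs to three outs,
            # which can change the final score but not the winner.
            home += finish_half_inning(p_home, rng, outs, runners)
        if not top and inning >= 9 and home != away:
            return home > away
        if not top:
            inning += 1
        top = not top
        outs = 0
        runners = (2,) if inning > 9 else ()  # extra innings start with a runner on second


def home_win_probability(state, p_home, p_away, n_sims=1000, seed=None):
    """Fraction of n_sims simulated finishes in which the home team wins."""
    rng = random.Random(seed)
    wins = sum(simulate_rest_of_game(state, p_home, p_away, rng)
               for _ in range(n_sims))
    return wins / n_sims


pregame = {"inning": 1, "top": True, "outs": 0, "runners": (),
           "home_score": 0, "away_score": 0}
print(home_win_probability(pregame, p_home=0.33, p_away=0.31, seed=1))
```

Calling `home_win_probability` again after each at bat, with `state` updated to the new inning, outs, runners, and score, would give one point per at bat on a win probability chart.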
Results
To test my model’s effectiveness, I simulated two actual games from 9/4/21: Dodgers @ Giants and Astros @ Padres.
Due to the stochastic nature of a simulation derived from over 60 independent events (at bats), there are going to be different results every time the code runs. However, the iteration that runs before any at bats should output very similar win probabilities every time. That is what I consider the baseline measure of my model’s effectiveness.
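To put a rough number on "very similar": if each simulated game is treated as an independent win or loss, an estimate built from n runs has a binomial standard error of sqrt(p(1 - p) / n). The short Python check below uses a hypothetical helper name of my own and plugs in the 60.3% pregame figure reported for the Dodgers with 1000 runs.

```python
import math


def monte_carlo_standard_error(p, n_sims):
    """Standard error of a win probability estimated from n_sims independent simulated games."""
    return math.sqrt(p * (1.0 - p) / n_sims)


se = monte_carlo_standard_error(0.603, 1000)
print(f"standard error: {se:.4f}")
```

With 1000 runs and a probability near 60%, that works out to roughly 1.5 percentage points, so pregame estimates from repeated runs should typically differ by only a point or two.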
Nearly every sportsbook I checked had the Dodgers as heavy favorites against the Giants and had the Padres as slight favorites against the Astros.
Here are the HTML pages I generated from R Markdown if you want to see all the data from the two games: Dodgers @ Giants and Astros @ Padres.
The actual scores of the games were 6-1 Dodgers and 10-2 Padres.
My simulation predicted 7-5 Dodgers and 7-6 Padres, initially giving the Dodgers a 60.3% chance of winning the game and the Padres a 58.4% chance of winning the game.
Conclusion
I’m very happy with how my simulation turned out. Ultimately, the only aspect that really mattered was the win percentage calculated before any at bats were simulated. Due to the inherent “randomness” of baseball, successfully predicting the outcomes of specific at bats is practically impossible. However, considering that the model correctly predicted the winner of both games and produced reasonable, intuitive win probabilities, I would say that this project was an overall success.