In this project we study baseball data from 2002 to 2022 to predict key statistics that can help managers make decisions and improve their team's outcomes.
The aim of our study is to:
- Extract, clean and pre-process data for models.
- Build a machine learning model to predict whether an MLB runner will successfully steal a base, based on various factors.
- Build a ridge regression model to predict the next BsR (Base Running) value for batters.
- Build a ridge regression model to predict the next WAR (Wins Above Replacement) value for pitchers.
- Use HTML, CSS and Flask to create a webpage and visualize the models' predictions.
- Visualize previous baseball games.
- Python 3.7.13
- Jupyter Notebook
- Visual Studio 1.69.1
- HTML/CSS
- Flask
- JavaScript
The data was imported using `pybaseball`, a Python package for baseball data analysis. This package scrapes Baseball Reference, Baseball Savant, and FanGraphs. It retrieves Statcast data, pitching stats, batting stats, division standings/team records, awards data, and more.
We used `from pybaseball import batting_stats` to import all batters' stats and `from pybaseball import pitching_stats` to import all pitchers' stats.
To create the batters DataFrame (see Table 1), we chose seasons from 2007 to 2022 and kept players with at least 200 appearances per season.
To create the pitchers DataFrame (see Table 2), we chose seasons from 2007 to 2022 and kept pitchers who pitched 50+ innings.
Finally, the data was stored in a local database.
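The import and storage steps above can be sketched as follows. This is a minimal sketch rather than the project's actual code: it assumes the `qual` argument of `batting_stats` and `pitching_stats` was used for the 200-appearance and 50-inning thresholds, and that the local database is SQLite. The function names, table names, and the `baseball.db` path are hypothetical.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def fetch_stats(start_season=2007, end_season=2022):
    """Download batting and pitching stats through pybaseball."""
    # Imported here so the rest of the module works without network access.
    from pybaseball import batting_stats, pitching_stats

    # Assumption: `qual` sets the minimum playing-time threshold.
    batters = batting_stats(start_season, end_season, qual=200)
    pitchers = pitching_stats(start_season, end_season, qual=50)
    return batters, pitchers


def store_table(df, table_name, db_path="baseball.db"):
    """Write a DataFrame to a SQLite table, replacing any existing table."""
    with closing(sqlite3.connect(db_path)) as conn:
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        conn.commit()
```

A typical call would be `batters, pitchers = fetch_stats()` followed by `store_table(batters, "batters")` and `store_table(pitchers, "pitchers")`.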
Table 1 - Raw Data for batters from 2007 to 2022.
Table 2 - Raw Data for pitchers from 2007 to 2022.
BsR (Base Running) is the base running component of WAR at FanGraphs. It estimates the value a player adds on the bases, including his ability to steal bases. A great BsR is 6, an above-average BsR is 2, an average BsR is 0, and a below-average BsR is -2.
To predict the next BsR value for each batter, we trained a ridge regression model.
Data
We used batters' stats from 2007 to 2022 (see Table 1). This dataset has 5090 rows and 320 columns.
Preprocessing Data for Model
- We removed all players who played in only one season.
- We wrote a function that, for each player and season, takes the BsR value of the following season and adds it to a new column, `Next_BsR`, to create the dependent variable.
- We cleaned the data by removing all columns with null values and some columns of object data type. Finally, we encoded the `team_code` column by assigning each team a code number instead of its name. This reduced the DataFrame to 4127 rows and 195 columns.
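The preprocessing steps above can be sketched with pandas as follows. This is an illustration under assumptions, not the project's function: the column names `IDfg`, `Season`, and `Team` are assumed identifiers for player, season, and team, and both helper names are hypothetical.

```python
import pandas as pd


def add_next_season_target(df, stat="BsR", id_col="IDfg", season_col="Season"):
    """Return a copy with a Next_<stat> column holding the following season's value."""
    out = df.sort_values([id_col, season_col]).copy()
    # Drop players who appear in only one season.
    out = out[out.groupby(id_col)[season_col].transform("count") > 1]
    # Within each player, pull the stat from the next recorded season.
    out[f"Next_{stat}"] = out.groupby(id_col)[stat].shift(-1)
    return out


def encode_team(df, team_col="Team"):
    """Return a copy with a numeric team_code column in place of team names."""
    out = df.copy()
    out["team_code"] = out[team_col].astype("category").cat.codes
    return out.drop(columns=[team_col])
```

Each player's most recent season has no following season, so its `Next_BsR` is empty and that row has to be dropped before training. Note also that the shift pairs a season with the next recorded one, which is not necessarily the next calendar year if a player missed a season.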
Train Ridge Regression Model
- We initialized the model.
- We used the sequential feature selector `SequentialFeatureSelector` to find the best features for the model. The number of features selected was 20.
- We split the data using `TimeSeriesSplit` to make sure we use past data to predict future data.
- We scaled the data using `MinMaxScaler`.
- To run the model across all years, we created a function that goes through each year, assigning all previous years as training data and the current year as test data. The output of this function is a DataFrame of actual and predicted BsR values.
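The training steps above can be sketched with scikit-learn as follows. This is a sketch under stated assumptions, not the project's code: the ridge penalty `alpha=1.0`, the number of folds, the `start` offset, and all function names are hypothetical. One design choice here differs from scaling the whole dataset up front: the scaler sits in a pipeline with the model, so it is fitted only on the training seasons of each fold.

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def make_model(alpha=1.0):
    """Scale features to [0, 1], then fit ridge regression (alpha is assumed)."""
    return make_pipeline(MinMaxScaler(), Ridge(alpha=alpha))


def select_features(data, candidates, target, n_features=20, n_splits=3,
                    season_col="Season"):
    """Forward-select n_features columns, scoring on time-ordered folds."""
    ordered = data.sort_values(season_col)
    selector = SequentialFeatureSelector(
        make_model(),
        n_features_to_select=n_features,
        direction="forward",
        cv=TimeSeriesSplit(n_splits=n_splits),
    )
    selector.fit(ordered[candidates], ordered[target])
    return [c for c, keep in zip(candidates, selector.get_support()) if keep]


def backtest(data, predictors, target, season_col="Season", start=5):
    """Train on all earlier seasons, then predict each later season in turn."""
    seasons = sorted(data[season_col].unique())
    results = []
    for season in seasons[start:]:
        train = data[data[season_col] < season]
        test = data[data[season_col] == season]
        model = make_model().fit(train[predictors], train[target])
        results.append(pd.DataFrame(
            {"actual": test[target], "prediction": model.predict(test[predictors])},
            index=test.index,
        ))
    return pd.concat(results)
```

`backtest` returns one row per player season with `actual` and `prediction` columns, which is the shape needed for the evaluation metrics.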
Evaluate the model
To evaluate the model, we have calculated the following:
- Mean absolute error (MAE) = 1.89
- Mean squared error (MSE) = 5.96
- Root mean squared error (RMSE) = 2.44
- Median absolute error (MedAE) = 1.51
- Explained variance score = 0.41
- R2 score = 0.41
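The metrics listed above can all be computed with `sklearn.metrics`. The sketch below assumes the actual and predicted values are available as two equal-length sequences; the `evaluate` helper is hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)


def evaluate(actual, predicted):
    """Return the evaluation metrics for one set of predictions as a dict."""
    mse = mean_squared_error(actual, predicted)
    return {
        "MAE": mean_absolute_error(actual, predicted),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "MedAE": median_absolute_error(actual, predicted),
        "Explained variance": explained_variance_score(actual, predicted),
        "R2": r2_score(actual, predicted),
    }
```

Because RMSE is the square root of MSE, the reported values are consistent with each other: the square root of 5.96 is approximately 2.44.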
Model Optimization
To improve accuracy, we used stats from 2002 to 2022 instead of 2007 to 2022, which increased the R2 score to 0.43. We also tried dropping some columns, but this did not improve accuracy (see Table 3).
Table 3 - Model evaluation used to predict Next_BsR
WAR (Wins Above Replacement) measures a player's value in all facets of the game by estimating how many more wins he is worth than a replacement-level player at the same position. The higher the WAR, the better.
To predict the next WAR value for each pitcher, we trained a ridge regression model following the same steps we used to train the ridge regression model for the next BsR.
In Table 4 we can see the model evaluation results.
Table 4 - Model evaluation used to predict Next_WAR
To visualize our predictions, we took the following steps (see video below):
- Created HTML/CSS files.
- Created the Flask app and connected it to `app.py`.
- Used Postman to test the connection between the server and the webpage.
- Created a responsive navbar with multiple sub-pages.
- Wrote JavaScript code to:
  - Connect the predictions data to the webpage.
  - Plot the predicted and actual (calculated) values for each player.
- Added an autocomplete function to complete players' names.
Visualization.mp4
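The server side of the steps above could look like the following Flask sketch. It is an illustration only: the routes, the `BSR_PREDICTIONS` dictionary, and the sample values are hypothetical placeholders, and the project's `app.py` may be organized differently.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory predictions; a real app would load these from the
# stored model output instead.
BSR_PREDICTIONS = {
    "example batter": {"season": 2021, "actual": 1.2, "prediction": 0.9},
}


@app.route("/api/bsr/<name>")
def bsr_prediction(name):
    """Return the actual and predicted BsR for one player as JSON."""
    row = BSR_PREDICTIONS.get(name.strip().lower())
    if row is None:
        return jsonify({"error": "player not found"}), 404
    return jsonify(row)


@app.route("/api/players")
def player_names():
    """Return all known player names, for example to feed an autocomplete."""
    return jsonify(sorted(BSR_PREDICTIONS))
```

With the file saved as `app.py`, the server can be started with `flask run`, and the routes can then be requested from Postman or from JavaScript on the page.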
We faced multiple challenges during this project:
- Finding the right data: Although plenty of free baseball data is available, finding the right data was challenging. It took us time to find the `pybaseball` package, and we could not find the right data to predict stolen bases.
- Optimizing the ridge regression model: While we were able to accurately predict the next BsR and WAR values for some players, the predictions for others were far from the actual values. Some players could be miscategorized because BsR and WAR drop after an injury and injured players appear in fewer games, while the model only includes pitchers who pitched 50+ innings and batters who appeared at least 200 times per year. Moreover, we started by studying data from 2007 to 2022 and then added more data (2002 to 2022) for BsR prediction, but we could not do the same for WAR prediction because the `pybaseball` package stopped loading pitcher data.
- Project timeline: The time available was less than what was needed to reach all goals.
At the end of this project, we were able to create a website where users can navigate among multiple pages to:
- Enter a batter's name to predict his next BsR (Base Running) value.
- Enter a pitcher's name to predict his next WAR (Wins Above Replacement) value.
- Plot the predicted and actual (calculated) values for each player.