Supervised Machine Learning
Regression Analysis

After initial data processing and cleaning, our dataset contained 1384 rows and 31 columns of MLB statistics from 1970 to 2019, covering 36 teams and the following metrics and variables:

year          walks                   saves
team_name     strikeouts_by_batters   outs_pitches
games_played  stolen_bases            hits_allowed
wins          caught_stealing         homeruns_allowed
losses        batters_hit_by_pitch    walks_allowed
runs_scored   sacrifice_flies         strikeouts_by_pitchers
hits          opponents_runs_scored   errors
doubles       earned_runs_allowed     double_plays
triples       complete_games          fielding_percentage
homeruns      shutouts                class

We wanted to understand how much of the variability in the response variable y (the number of "wins") can be explained by a set of X independent variables. Additionally, we wanted to find out whether the number of variables could be reduced without significantly affecting the model score. Our approach was stepwise regression analysis: the iterative construction of a model in which independent variables are removed in succession, with statistical significance tested after each iteration (Kwok, 2021). This process is illustrated in the figure below.
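The elimination loop described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function names, the 0.05 significance threshold, the normal approximation to the t distribution, and the synthetic data are all assumptions made for the example. Only the column names are borrowed from the dataset.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS with an intercept and return a two-sided p-value per column of X."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])        # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)            # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    # Assumption: n is large, so the t distribution is approximated by a normal.
    p = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return p[1:]                                     # drop the intercept's p-value

def backward_elimination(X, y, names, alpha=0.05):
    """Repeatedly drop the least significant variable until all have p <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break                                    # every remaining variable is significant
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic stand-in data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["runs_scored", "opponents_runs_scored", "noise_a", "noise_b"]

selected = backward_elimination(X, y, names)
print(selected)
```

Because each noise column still has a small chance of appearing significant at the chosen threshold, the selected list can occasionally retain an uninformative variable. That is one reason to follow the procedure with a separate collinearity check.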



After this process, our final model contained 10 independent variables (down from 24). Furthermore, collinearity was tested using the Variance Inflation Factor (VIF), and one additional variable was dropped. The final list of independent variables is shown in the table below.

runs_scored            saves
walks                  outs_pitches
opponents_runs_scored  walks_allowed
complete_games         strikeouts_by_pitchers
shutouts               errors
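For reference, the Variance Inflation Factor check mentioned above can be sketched as follows. For each variable j, the VIF is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing that variable on all the others. The function name and the synthetic columns are assumptions for illustration. The cutoff used to drop a variable is not stated in this report; values of 5 or 10 are common rules of thumb.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        vifs.append(ss_tot / ss_res)                 # identical to 1 / (1 - R^2)
    return np.array(vifs)

# Synthetic example: column c is nearly a copy of column a, b is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)  # a and c should show large values, b a value close to 1
```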

We also tested the model's performance; the results are listed below:
- R2 Score: 0.9304714673783749
- Training Score: 0.9303466731994289
- Testing Score: 0.9297116942632093
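Training and testing scores like those listed above are typically obtained by fitting the model on one part of the data and computing R2 on both parts. The sketch below shows one way to do this under stated assumptions: the 75/25 split, the random seed, the helper name `r2_score`, and the synthetic data are illustrative and are not taken from the analysis.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data with a known linear relationship plus noise.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.5, 0.5]) + rng.normal(scale=0.5, size=n)

# Shuffle the row indices and hold out 25% for testing (assumed ratio).
idx = rng.permutation(n)
cut = int(0.75 * n)
train, test = idx[:cut], idx[cut:]

# Fit ordinary least squares on the training rows only.
A_train = np.column_stack([np.ones(len(train)), X[train]])
A_test = np.column_stack([np.ones(len(test)), X[test]])
beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

train_score = r2_score(y[train], A_train @ beta)
test_score = r2_score(y[test], A_test @ beta)
print(train_score, test_score)
```

A testing score close to the training score, as in the results reported above, suggests the model is not overfitting the training rows.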
A pairplot was generated to visually explore how each independent variable correlated with the dependent variable ("wins"). The figure below shows the top row.
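As a numeric counterpart to the top row of the pairplot, the Pearson correlation between each independent variable and "wins" can be computed directly. The sketch below uses synthetic values generated only for illustration. They are not MLB data, and the relationship built into them is an arbitrary assumption.

```python
import numpy as np

# Synthetic stand-in columns (arbitrary values, not taken from the dataset).
rng = np.random.default_rng(7)
n = 400
runs_scored = rng.normal(700, 80, size=n)
opponents_runs_scored = rng.normal(700, 80, size=n)
errors = rng.normal(110, 15, size=n)
# Assumed relationship: wins rise with run differential, plus noise.
wins = 81 + 0.1 * (runs_scored - opponents_runs_scored) + rng.normal(0, 4, size=n)

predictors = {
    "runs_scored": runs_scored,
    "opponents_runs_scored": opponents_runs_scored,
    "errors": errors,
}

# Pearson correlation of each predictor with wins.
correlations = {
    name: float(np.corrcoef(column, wins)[0, 1])
    for name, column in predictors.items()
}
for name, r in correlations.items():
    print(f"{name}: {r:+.2f}")
```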

Kwok, Ryan. 2021. Stepwise Regression Tutorial in Python.