Updated: May 21
IT'S HERE! IT'S HERE! Finally! After an entire year of waiting, the 2022 NCAA Division 1 Women's Softball Tournament has arrived. Not coincidentally, we are excited to share with you our most recent predictions for the women's tournament. We are going to be putting our Round 1 predictions up for everyone to see. In this blog, we are going to stay pretty high level in order to make it palatable. We are going to discuss what features we used to train the model and what the model output is going to provide. We will display our predictions day by day, so come back and look at what the model is saying! At the end, we will discuss some interesting aspects of the results and the immediate reaction. Without further ado, let's get into it!
courtesy of NCAA
Recapping what has been discussed in previous blogs, our model is a game prediction model.
The ML model we chose is a tree-based model that takes a set of 12 statistics from the two teams and predicts the eventual score. Quick note: this combination of statistics and their weights is unique to the model that we developed. There are a large number of other combinations of statistics that would produce different weights, so don't read too much into the relative weighting as you watch the games. The team statistics are (in order of weight, or in other words, importance to the model):
Win-Loss Record (22%)
Earned Run Average (9%)
Fielding Percentage (7.8%)
Triples per Game (7.1%)
Batting Average (7%)
Stolen Bases per Game (6.8%)
Doubles per Game (6.8%)
Double Plays per Game (6.6%)
Slugging Percentage (6.2%)
Home Runs per Game (6.2%)
Hit Batters (5.6%)
The model uses the fields' natural variability, in an approach similar to the random forest algorithm, to create 1000 game scenarios for each unique team combination. The output of each scenario is a score prediction.
The eventual output is the average game score between the two teams and a % chance of each team winning.
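To make that output concrete, here is a minimal sketch of how 1000 scenarios could be rolled up into an average score and a win percentage. It is not our actual model: `simulate_matchup` and `toy_sampler` are hypothetical names, and the score distributions inside `toy_sampler` are made up purely so the example runs. In the real pipeline, each scenario's score would come from the trained tree-based model.

```python
import random
from statistics import mean


def simulate_matchup(sample_score, n_scenarios=1000, seed=0):
    """Summarise n_scenarios simulated games between team A and team B.

    sample_score(rng) must return one (runs_a, runs_b) pair. Here it is a
    hypothetical stand-in for a single scenario's score prediction.
    """
    rng = random.Random(seed)
    games = [sample_score(rng) for _ in range(n_scenarios)]

    wins_a = sum(a > b for a, b in games)
    wins_b = sum(b > a for a, b in games)
    decided = wins_a + wins_b  # tied scenarios are left out of the win %

    return {
        "avg_score_a": mean(a for a, _ in games),
        "avg_score_b": mean(b for _, b in games),
        "win_pct_a": 100 * wins_a / decided,
        "win_pct_b": 100 * wins_b / decided,
    }


def toy_sampler(rng):
    # Assumption: made-up score distributions, NOT the trained model.
    runs_a = max(0, round(rng.gauss(5.0, 2.0)))
    runs_b = max(0, round(rng.gauss(3.5, 2.0)))
    return runs_a, runs_b


summary = simulate_matchup(toy_sampler)
print(summary)
```

Dropping tied scenarios from the win percentage is one design choice among several; another option would be to split ties evenly between the two teams.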
Round 1 Predictions
Some interesting comments about the predicted results:
In a couple of instances (Auburn-ULL and Minnesota-TAMU), the team which loses more games in the simulations also has a higher average score than their opponent.
Upsets: Miami (OH) over Kentucky, Weber St. over Texas, LMU over Ole Miss, Murray St. over Stanford, and South Dakota St. over Michigan.
Potentially tighter games than the eye test would tell you: Clemson and UNCW, Missouri and Missouri St.
Alabama, Oklahoma, Arkansas, and Florida State are heavy favorites over their opponents.
Immediate reaction to the results:
How can the losing team at times score more than the winning team? This is possible, as the predicted score at this point is simply an average and it is possible that the losing team had larger margins of victory in their wins. In future model iterations, we will most likely try to incorporate some logic to provide a likely winning score rather than just an average of the simulations.
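A small worked example may help here. The ten scores below are hand-picked for illustration, not model output: Team A wins 6 of 10 scenarios by a single run, while Team B wins its 4 by a wide margin. Team A is the favorite by win percentage, yet averages 2.2 runs to Team B's 4.4. The last few lines sketch one possible version of the "likely winning score" idea, assuming we simply average over the scenarios the favorite wins. That is only one option and not something the current model does.

```python
from statistics import mean

# Hypothetical simulated scores as (runs_a, runs_b), chosen by hand.
games = [(3, 2)] * 6 + [(1, 8)] * 4

# Team A wins more often...
wins_a = sum(a > b for a, b in games)
win_pct_a = 100 * wins_a / len(games)

# ...but Team B has the higher average score across all scenarios.
avg_a = mean(a for a, _ in games)
avg_b = mean(b for _, b in games)

# One possible "likely winning score": average only the scenarios
# that the favorite (Team A) actually wins.
favorite_wins = [(a, b) for a, b in games if a > b]
likely_a = mean(a for a, _ in favorite_wins)
likely_b = mean(b for _, b in favorite_wins)

print(f"Team A wins {win_pct_a:.0f}% of scenarios")
print(f"Average score: A {avg_a:.1f}, B {avg_b:.1f}")
print(f"Score when A wins: A {likely_a:.1f}, B {likely_b:.1f}")
```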
Given that these results are produced purely from historical game results and stats, this appears to be a generally good model. That said, this is the first round, which is generally easier to predict, so we are holding off on judging its overall validity. We will come back as the tournament goes on to provide a critique of the results and places for improvement.
Because it is based on stats, the model also appears to overvalue some teams that the eye test would tell you are not as good as their opponents (refer back to the upsets in the previous section). RPI, which is a pretty good measure of opponents' strength, was considered in our data exploration but was found to be highly correlated with other statistics already included. We imagine some refinement in the model could help to eliminate what appear to be some odd predicted results.
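For readers curious what a correlation check like the RPI one looks like, here is a minimal sketch using a hand-written Pearson correlation. The five RPI and win-percentage values are invented for illustration, and pairing RPI with win percentage specifically is an assumption made for the example. A value close to 1 means the two features carry largely the same information, which is the usual reason to leave one of them out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for five teams, NOT real NCAA data.
rpi = [0.70, 0.64, 0.58, 0.55, 0.49]
win_pct = [0.85, 0.74, 0.66, 0.60, 0.48]

r = pearson(rpi, win_pct)
print(f"Correlation between RPI and win percentage: {r:.3f}")
```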
There are also some matchups the model seems to read well: Oregon St. and Ohio St., Auburn and ULL, Georgia Tech and Wisconsin, and Minnesota and TAMU. We expect each of these to be a good, evenly matched game, and so does the model.
Overall, we are pretty happy with the first iteration of this game prediction model. The model has been trained entirely on team stats that we were able to scrape off of the NCAA website. There are no individual stats or game performance stats (i.e. individual plays in game) included in this model. Where could that be an issue? Think about a staff with one ace pitcher, like Georgina Corrick from USF, in a win-or-go-home situation. Those kinds of insights are lost in the team-aggregated data, where we know she performs at a much higher level relative to the rest of the USF pitchers. We saw this phenomenon occur last year when Keely Rochard almost took out UCLA in the super regionals to make it to the WCWS.
So what next? Keep up to date on the predictions as we blog our way through the exciting NCAA Tournament and find out what new adaptations are in store for future versions of the game prediction model.