Are You Not Entertained?
It's June 15th, 2021 and Oklahoma has just won the 2021 Women's College World Series (WCWS) as the #1 seed in the tournament. This is only the 6th time a #1 seed has won the tournament since this format started in 2005 (previous #1 seed champions are the 2005 Michigan Wolverines, 2007 Arizona Wildcats, 2011 Arizona State Sun Devils, 2013 Oklahoma Sooners, and 2015 Florida Gators). Ironically, that was probably the least impressive stat of the tournament. For the first time in tournament history, an unseeded team, James Madison University, made it to the semi-final game after winning game 1 against the eventual champion Oklahoma and beating Oklahoma State in the winner's bracket game. What about the unseeded, never-down-and-out Georgia Bulldogs making it to the WCWS as well? That marks the first time since 2014 that two unseeded teams reached the WCWS. How about an unseeded Virginia Tech, on the arm of Keely Rochard, beating the national seed Arizona State Sun Devils before pushing the previous champion UCLA Bruins to the brink of elimination in the super regional? And for the first time in tournament history, two teams climbed out of the losers' bracket to meet one another in the WCWS Championship game: the Oklahoma Sooners and the Florida State Seminoles. The NCAA Tournament was a wild one in terms of its historical precedents. And ratings were up, up, up again, making this the most watched WCWS and the most watched Championship ever! As the Sooners might say, "Are you not entertained?!"
It's hard to say if this year will stand out as an outlier among future tournaments or if this is a trend. In 2016, we saw the #10 seed Oklahoma Sooners win a national championship (the highest-numbered seed to win to date). Additionally, the average seed to win a national championship has gone up by 2 positions since 2005 and continues to show an upward trend. What could possibly be contributing to the increase in parity? Is there a correlation between the increase in parity and exposure? According to a recent report, softball is the fastest growing sport in the NCAA. We just had the first softball game to be televised on a national network, and the WCWS's Hall of Fame Stadium expanded to 13,000 seats, including an upper deck. And this last WCWS was the most watched of all time! If there is a trickle-down component to that exposure and revenue, then we may be observing a wave of more talented, more prepared female athletes joining the NCAA softball ranks. And with the revenue and money spent on softball coinciding with a time when technology has afforded some of these programs precision training and scouting, the size and complexity of the data being captured is rising. Here at PatternSnobs, we believe we can positively contribute to the softball discussion by building well-conditioned models to leverage that data.
In this post, we will outline our approach to building our most complicated model to date: the game prediction model. Whereas our previous models relied on end-of-the-year statistics to make predictions about the relative strength of teams, their RPI, and a team's WCWS performance, we are now incorporating all statistics available on the NCAA website at the time a game is played to make predictions about the outcome of that game. Because this blog is intended to be not only informative but also instructive, we are going to outline and describe our workflow for the game prediction model below.
Modeling 3.0: The Game Prediction Model
The game prediction model (GPM) is a predictive model that calculates the scores of the two opposing teams and hence decides the winner and margin. The previous model predicted final rankings, including each team's likelihood of placement in the NCAA tournament, but it was restricted to using only final season statistics. To make game predictions that include the regular season, we needed to improve the dataset used to train our next model.
In the previous model, the data included nearly 300 teams and over 7 years of final statistics (roughly 15 parameters). That comes to ~32,000 data points. In our new model, we are including every weekly snapshot of the statistics, generally on the order of 12-15 sets of stats per year for a team schedule of around 50 games. Additionally, we are including 1 more year of data plus the game outcomes and scores. Now we are considering ~2 million data points. That is a huge jump in size and complexity. To handle the model development efficiently, we came up with a workflow that best suits our needs, made up of four components: data source; extract, transform, load (ETL); feature engineering; and model selection.
In the next sections, we are going to describe each component of the workflow.
1) Data Source
The data is going to be scraped from the NCAA website. The data scraped will include:
All snapshots of statistics for D1 schools for 2013-2021 (sans 2020 due to COVID)
Dates of games, individual game scores, and the winner
2) Extract, Transform, Load (ETL)
The process for gathering (extract), organizing (transform), and storing (load) data is the art of data collection. There are a multitude of ways to extract data from a website. In the beginning, we copied and pasted from the NCAA website, which is slow and inefficient. For recent models, the data was scraped using the Python packages BeautifulSoup and Selenium. These packages allowed us to write algorithms that automate scraping from the website and run much faster.
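To give a flavor of the extract step, here is a minimal sketch of pulling a statistics table out of HTML with BeautifulSoup. The HTML snippet, the column names, and the `parse_stat_table` helper are all made up for illustration; the real NCAA pages have their own markup, and in practice Selenium (or another fetcher) would supply the page source.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not real NCAA markup or data.
SAMPLE_HTML = """
<table id="stats">
  <tr><th>Team</th><th>BA</th></tr>
  <tr><td>Team A</td><td>0.312</td></tr>
  <tr><td>Team B</td><td>0.287</td></tr>
</table>
"""

def parse_stat_table(html):
    """Return one dict per data row, keyed by the table's header cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(dict(zip(header, cells)))
    return rows

rows = parse_stat_table(SAMPLE_HTML)
print(rows)
```

The values come back as strings, so converting them to numbers is left to the transform step.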
Once the data was extracted from the NCAA website, it was compiled into two DataFrames: a game DataFrame and a statistics DataFrame. The game DataFrame includes the date of a contest, the names of both teams, the team 1 game score, the team 2 game score, the outcome of the contest, and whether it was a home or away game. This is summarized in the table below. For our own purposes, we added a game ID, a unique identifier for each contest that encodes the team, the date, and the outcome.
Table: Summary of columnar game data
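As a rough illustration of the game DataFrame described above, the sketch below builds two placeholder rows with pandas and derives an outcome and a game ID. The column names, the teams and scores, and the ID format (date, both teams, outcome) are our assumptions for this example, not the exact schema we store.

```python
import pandas as pd

# Placeholder games; teams and scores are invented for illustration.
games = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-05", "2021-03-06"]),
    "team_1": ["Team A", "Team C"],
    "team_2": ["Team B", "Team A"],
    "team_1_score": [5, 2],
    "team_2_score": [3, 4],
    "location": ["home", "away"],  # from team 1's point of view
})

# Outcome from team 1's point of view: W if team 1 outscored team 2.
games["outcome"] = (
    games["team_1_score"] > games["team_2_score"]
).map({True: "W", False: "L"})

# Hypothetical ID format: YYYYMMDD_Team1_Team2_Outcome.
games["game_id"] = (
    games["date"].dt.strftime("%Y%m%d")
    + "_" + games["team_1"].str.replace(" ", "", regex=False)
    + "_" + games["team_2"].str.replace(" ", "", regex=False)
    + "_" + games["outcome"]
)

print(games[["game_id", "team_1_score", "team_2_score"]])
```

An ID built this way cannot tell apart two games between the same teams on the same date, so doubleheaders would need an extra game-number suffix.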
The columns in the statistics DataFrame are the same as in the previous model, but they now include the updated statistics from throughout the softball season.
The data is stored in .csv files on a shared Dropbox folder that is linked to our GitHub repository.
3) Feature Engineering
The single most important part of any data science project is the feature engineering portion. One can create the most well-conditioned, tested, optimized model possible, but if the data you are feeding it is not optimized for the outcome, the results are still going to be subpar. For this reason, a majority of the model development time will be spent in this phase. If done properly, the dataset will maximize the information without overlap while using the fewest data points possible.
Whenever a dataset is compiled, there are always issues with the data, such as missing data points, wrong values, or swapped values. Detecting these errors is the job of the data cleaning phase. Let's start with outliers. Outliers can be detected during the data exploration phase. We can plot all of the variables (here as a function of time) and look for stark changes in the data. If we do see stark changes, we can develop a threshold to detect outliers based on the data we know to be correct. These values can then be flagged, compared with other variables to make sure a column shift didn't occur, and removed.
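The thresholding idea can be sketched in a few lines of pandas. The weekly values below are invented, and both the valid range and the 0.100 tolerance are assumptions chosen for this example rather than the thresholds we actually use.

```python
import pandas as pd

# Invented weekly batting averages for one team; 3.120 mimics a value
# that slid over from another column (it looks more like an ERA).
ba = pd.Series([0.310, 0.305, 0.298, 3.120, 0.301, 0.299], name="BA")

# Rule 1: a batting average must lie between 0 and 1.
in_range = ba.between(0.0, 1.0)

# Rule 2: flag values far from the median of the plausible values.
# The 0.100 tolerance is an assumed threshold for illustration.
median = ba[in_range].median()
far_from_median = (ba - median).abs() > 0.100

flagged = ~in_range | far_from_median
cleaned = ba.mask(flagged)  # flagged entries become NaN for later imputation

print(ba[flagged])
```

Flagged values are set to missing rather than dropped, so the weekly time series keeps its shape for the imputation step.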
In our own experience scraping the NCAA website, we ran into the following:
A given statistic table (e.g. Batting Average) would be inexplicably absent
Certain characters would offset a column and create mismatches
A team would have multiple versions of its name
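The multiple-names problem in particular lends itself to a small lookup table. Here is a minimal sketch using only the standard library; the alias entries and the `normalize_team` helper are hypothetical examples, not our actual mapping.

```python
import re

# Hypothetical alias table: cleaned lowercase variant -> canonical name.
ALIASES = {
    "oklahoma st": "Oklahoma State",
    "okla st": "Oklahoma State",
    "oklahoma state": "Oklahoma State",
    "james madison": "James Madison",
    "jmu": "James Madison",
}

def normalize_team(name):
    """Map a scraped team name to its canonical form when an alias is known."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()          # collapse whitespace
    return ALIASES.get(key, name.strip())           # unknown names pass through

for raw in ["Oklahoma St.", "  OKLAHOMA  STATE ", "JMU", "Georgia"]:
    print(repr(raw), "->", normalize_team(raw))
```

Names that are not in the table pass through unchanged, which makes it easy to list the leftovers and extend the table by hand.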
Imputation is the process of using statistical properties of the present data to replace missing values. This process is a bit of an art, as there are many statistics to pull from. The mean is an often-used value. Because our data is a time series, we could instead use trends to estimate the missing values and better represent the evolution of the statistics. Let me provide an explanation. In Model 2.0, we represented each statistic, for example Batting Average (BA), as the value given at the end of the season, the final season statistic. A simple example of what that equation looks like is:
BA = C,
where C is some constant. In this case, it is appropriate to use a mean value to represent the missing value. However, in Model 3.0 we are using time series data of BA, which looks more like:
BA(i) = f(BA(i-1),delta BA),
where f is some function of the previous state and the difference between the previous state and the new state. As the season progresses, the delta BA should decrease because the number of data points is increasing. Hence, using a mean value across a cumulative time series will not be the best way to represent the missing values. Instead, we can use some representation of the expected delta BA to best replace the values.
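To make the difference concrete, the sketch below compares mean imputation with a trend-based fill on an invented weekly BA series. Linear interpolation between the neighboring snapshots is used here as one simple stand-in for "expected delta BA"; it is an assumption for illustration, not necessarily the method we will settle on.

```python
import pandas as pd

# Invented cumulative batting average by week; week 3 is missing.
weekly_ba = pd.Series(
    [0.400, 0.350, None, 0.330, 0.325],
    index=[1, 2, 3, 4, 5],
    name="BA",
    dtype="float64",
)

# Option 1: fill with the mean of the observed weeks (ignores the trend).
mean_filled = weekly_ba.fillna(weekly_ba.mean())

# Option 2: fill from the neighbors, so the imputed week follows the
# shrinking week-to-week change in the cumulative statistic.
trend_filled = weekly_ba.interpolate(method="linear")

print(pd.DataFrame({"mean": mean_filled, "trend": trend_filled}))
```

In this toy series the mean fill puts week 3 above week 2, reversing the decline, while the interpolated value sits between its neighbors.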
Dimensionality reduction involves identifying variables that do not have a high correlation with the score or that are highly correlated with other predictive variables. One method is to remove a variable altogether, which is appropriate when high collinearity exists between two or more variables. The remaining subset of variables is then fed to the machine learning model. This is called feature selection. An example of two variables in softball that are highly correlated with one another is Batting Average and On Base Percentage.
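A simple version of this correlation-based feature selection can be sketched as follows. The data is synthetic (OBP is generated as BA plus a small offset and noise so that the two are strongly correlated), and the 0.9 cutoff is an assumed threshold for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50

# Synthetic team statistics; OBP is built from BA so the two are collinear.
ba = rng.uniform(0.20, 0.35, n)
stats = pd.DataFrame({
    "BA": ba,
    "OBP": ba + 0.06 + rng.normal(0, 0.005, n),
    "ERA": rng.uniform(1.0, 5.0, n),
})

# Absolute pairwise correlations, keeping only the upper triangle so each
# pair is inspected once and the diagonal is ignored.
corr = stats.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

THRESHOLD = 0.9  # assumed cutoff
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
selected = stats.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(selected.columns))
```

This keeps the first variable of each correlated pair by column order. Which one to keep in practice is a judgment call, for example whichever correlates better with the score.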
Another way to lower the dimensionality is through feature extraction using Principal Component Analysis (PCA) or Factor Analysis of Mixed Data (FAMD). For our dataset, which contains both categorical and numerical values, FAMD is the better choice because PCA would mistreat the encoded categorical values. Both methods, however, build orthogonal components out of the original dataset to extract the important features while reducing the dimensionality of the data. In this way, we perform feature extraction by keeping the most important components and throwing away the ones that explain very little of the variance.
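For the numeric columns alone, the idea behind PCA can be sketched with NumPy's singular value decomposition. The data below is synthetic, the choice of two components is an assumption, and FAMD (which needs a dedicated library to handle the categorical columns) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic numeric statistics; OBP is built from BA so they overlap heavily.
ba = rng.uniform(0.20, 0.35, n)
obp = ba + 0.06 + rng.normal(0, 0.005, n)
era = rng.uniform(1.0, 5.0, n)
X = np.column_stack([ba, obp, era])

# Standardize each column so no statistic dominates because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of Vt are the orthogonal principal directions, and the squared
# singular values are proportional to the variance along each direction.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)

k = 2  # assumed number of components to keep
scores = Z @ Vt[:k].T  # the data expressed in the first k components

print("explained variance ratio:", np.round(explained, 3))
print("reduced shape:", scores.shape)
```

Because BA and OBP carry nearly the same information here, two components capture almost all of the variance in three columns.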
Exploratory Data Analysis
Exploratory data analysis is the process of visually representing the inter-relationships between variables and uncovering insights that the model or its weights alone cannot tell you. This is one of the most important parts of data science because a scatter plot of two variables makes it hard to miss which underlying function best represents their relationship. It is also a good check on how the feature selection process performed, which is why the feature engineering portion of the model development is iterative. A good example of this is explained in our previous model discussion.
4) Model Selection
The model selection is still fluid since we have yet to go through the feature engineering stage.
What we do know about the models:
They will need to take in two sets of data, one set of statistics for each team.
Each model will need to be a multi-output model (Team A score, Team B score).
The datasets will be split into training and validation sets.
Each model will undergo an optimization in which it finds the minimum of the loss function.
The bias of the models will be difficult to judge because it has both a magnitude and a direction, so conditioning will focus on explaining the variance.
Models will be compared using the validation sets and judged by how well they capture the variance in the data.
This means we must use machine learning models that can correctly produce multiple, related outputs. The kinds of multi-output models we will explore are Convolutional Neural Networks, Regression, Decision Trees, and Random Forests. This list is nowhere near complete, but it is a starting point.
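To make the requirements above concrete, here is a minimal sketch of the simplest candidate, a multi-output linear regression fit by ordinary least squares with NumPy. Everything in it is an assumption for illustration: the games are synthetic, each team has three made-up statistics, and the "true" weights are invented. It shows the shape of the problem, not our eventual model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_games, n_stats = 400, 3

# Each row holds Team A's three stats followed by Team B's three stats.
X = rng.normal(size=(n_games, 2 * n_stats))

# Invented weights: column 0 drives Team A's score, column 1 Team B's.
W_true = np.array([
    [1.5, -0.5], [0.8, -0.3], [-0.4, 0.2],   # Team A's stats
    [-0.5, 1.5], [-0.3, 0.8], [0.2, -0.4],   # Team B's stats
])
Y = 4.0 + X @ W_true + rng.normal(scale=1.0, size=(n_games, 2))

# Split into training (80%) and validation (20%) sets.
idx = rng.permutation(n_games)
split = int(0.8 * n_games)
train, valid = idx[:split], idx[split:]

def add_intercept(M):
    """Prepend a column of ones so the fit includes a constant term."""
    return np.column_stack([np.ones(len(M)), M])

# Least squares minimizes the squared-error loss for both outputs at once.
W_hat, *_ = np.linalg.lstsq(add_intercept(X[train]), Y[train], rcond=None)
Y_pred = add_intercept(X[valid]) @ W_hat

# Judge each output by the fraction of validation variance it explains.
ss_res = ((Y[valid] - Y_pred) ** 2).sum(axis=0)
ss_tot = ((Y[valid] - Y[valid].mean(axis=0)) ** 2).sum(axis=0)
r2 = 1 - ss_res / ss_tot

# Winner and margin follow directly from the two predicted scores.
margin = Y_pred[:, 0] - Y_pred[:, 1]

print("validation R^2 (Team A, Team B):", np.round(r2, 3))
```

Predicting both scores and deriving the winner and margin from them keeps a single model responsible for all three quantities. The other model families listed above would slot into the same split-fit-validate loop.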
With the ETL part of the workflow completed, we will move on to the feature selection portion. Future blogs will describe the challenges and successes we encountered while prepping the data for model selection and discuss what parameters might be included in a future model generation.
As an aside, with the completion of this blog, the ratings from the WCWS and the MCWS are in. The result: the women crushed the men in ratings. What does that mean? The steps we are taking are even more relevant and important to complete by the 2021-2022 season. The horizon is bright for women's softball, and we are happy to be at the frontier of that success. We hope that we can say for a long time, with great saliency: Are you not entertained?