Software is deterministic or repeatable since it produces the same output when given the same input no matter how many times it is run. Whereas with Machine learning Models, the outcome of training a model can change significantly based on the changes in the underlying data.
Once the model is operational, we must immediately consider how to keep it running smoothly. After all, it’s now delivering business value. Any disruption in model performance results in a direct business loss. As a result, Models have to be continually monitored to detect performance degradation and model staleness and must be retrained and remodeled often. We must ensure that the model works. Not simply as a piece of software that responds to API requests but also as a machine learning system that we can trust to make judgments.
Machine learning model monitoring refers to how we track and understand our models’ performance in production from both data science and operational standpoint. Inadequate monitoring can result in inaccurate models being left in production unchecked, stale models that no longer add business value, or subtle flaws in models that go undetected over time.
One important parameter that needs to be monitored is drift: This includes data drift, target drift, and concept drift. Different statistics methods can be applied to check the distribution of the features and if that drift is of much importance. In this post, we will be focusing on the two model drift measurement techniques, namely Population Stability Index (PSI) and Characteristics Stability Index (CSI).
Consider a case where you build a model to predict credit card customer dropout rate in healthy economic conditions and then tested it against a sample from recession-hit times; the model may not be able to predict accurately because the population distribution in different income segments may have drastically changed, causing the actual churn rate to regress.
However, now that we understand this, we can check the population distribution shifts between the training time and the current time to see if the model conclusions are reliable or not. PSI and CSI, as crucial monitoring metrics, which help to achieve this.
Population Stability Index and Characteristic Stability Index are two of the essential monitoring tools used in various fields, particularly credit risk. They try to capture the shift in the population distribution.
Let’s try to explore these metrics one by one:
Population Stability Index (PSI):
It’s a statistic for determining how much a variable’s distribution has altered between two samples over time. It’s commonly used to track changes in a population’s features and diagnose model performance issues — it’s often a useful indicator of whether the model has stopped forecasting effectively owing to large changes in the population distribution.
Population Stability Index (PSI) was designed for the credit risk scorecard analytics to track changes in the distribution between production time and the development(training) period samples. It is, however now being used to investigate distributional shifts for model-related features as well as in overall sample populations, including both dependent and independent variables- which is called CSI.
PSI tends to focus on the overall drift in population whereas CSI covers every individual feature fed to the model.
Covered below are a few of the examples to show you why there can be a shift in the population:
- If there is a change in the source of data.
- If there is a change in underlying assumptions
- Impact on social or economic conditions
- Change in the internal policy of an organization
PSI can be used to compare the similarity/dissimilarity of any samples, such as education, wealth, and health status between two or multiple populations in social-demographic studies because a distributional shift does not have to entail a dependent variable.
Until now we have seen the complete theoretical information about PSI, now let’s try to understand it mathematically. Here are the steps you can follow:
- The first step is to decide which sample to use as reference, for model testing, training data would be the reference sample and during performance monitoring testing data would be the reference sample. To explain, I would consider the scenario of model testing
- We try to sort the scoring variable in descending order from the reference sample.
- We split the data into buckets of 10(decile). When you have categorical data, perform this step only when you have more than 10 classes. You can combine a few of them
- Calculate the threshold points for every reference bin.
- We calculate the percentage of elements in each group from training samples, say A,
- Use the same threshold as the reference sample to create bins for the testing variable.
- Calculate the percentage of elements in each group from testing samples, say B
- We then try to find out the difference between the results we obtained in Step 5 and Step 7, say C
- Finally, we take the natural log of the result of A/B, say D
- In the end, we multiple C and D, the result is what we call is PSI. The sum of this column gives us a score which can be interpreted as below.
If PSI < 0.1. Then we do not require any changes
If PSI >=0.1 and PSI <0.2. We need to frequently monitor our model.
If PSI >=0.2, then its time that we need to change, or redeploy, or retrain our model
Now that with PSI, say for example we know that our prediction is drifted, so how do we do the root cause analysis for the same. There are many features, how should I check or find which values or features led this to happen. No worries, friends, you have your warrior CSI to save you time.
Characteristics Stability Index (CSI):
It determines which feature is responsible for a change in population distribution. It compares the distribution of an independent feature in the inference data set to the distribution of an independent feature in a training data set. It detects changes over time in the distributions of input variables that are used for predictions. It will help you to detect which feature led to this drift.
When calculating CSI, we always follow the same processes as PSI. Simply said, decisions are made based on the training sample values of a certain variable, which are binned and established as hard breakpoints. Then, when computing the frequency values for any validation/test sample, use the same thresholds and execute the same algorithm as when calculating PSI.
CSI and PSI are simple to look at metrics that can provide a lot of information. During the exploratory phases of model development, we look at the distributions, basic stats and infer things based on looking at a lot of combinations. CSI and PSI could be used to make things simpler and initial analysis faster. On top of they also show if rework on the model is needed while it is running in production. And also, pointing out which independent variables are causing the shift in the distribution of the scores predicted.
Conclusion
The world is changing rapidly, along with the universe of scoring data that’s being requested from deployed models. It’s risky to arbitrarily schedule times to refresh models, and even riskier to not plan for model refreshing or not monitor models at all. Skew and Drift are the silent killers of your ML models. With the Katonic MLOPs platform, you can detect drift, training-serving skew, and get alerts when model performance changes to ensure the needs of the use case are still being met.
That’s all about these metrics, I hope you enjoy reading the article. Stay tuned more to follow under this monitoring series.
Attaching the reference paper that is the origin of the PSI concept:
Reference: https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf