Did you build and test your prediction model correctly?
August 11, 2015
An advertiser (an e-commerce, OTA, or classifieds company) typically pays us some amount for each purchase attributed to us under a last-click attribution method. Users whose purchases get attributed to us are called converters (RevX converters, to be precise). In this blog, we would like to share our experience with building and testing such complex prediction models.
Justifying multiple measures of model performance
Various performance metrics “test” various aspects of a prediction model. There are two ways to come up with a new model.
By adding new features,
By changing the prediction algorithm, or the parameters of the algorithm.
In either case, we need a set of performance metrics to compare different models. We use metrics such as the AUC score (equivalent to the Wilcoxon-Mann-Whitney statistic), Log Likelihood (or its normalized form, the LogLoss), etc. In our experience, no single metric captures all the qualities of a prediction model, so our approach has been to test a model's statistical performance on multiple metrics.
For example, suppose that during a sale day for an advertiser one decides to flatly multiply the model's probability values by a boost factor (because people have a higher propensity to buy on sale days), and then tests the "boosted" model using the AUC score. One may be surprised to find that the AUC score for that day for that advertiser does not move at all! After some thought, the lack of shift in AUC seems legitimate: AUC depends only on the rank ordering of the scores, and a flat multiplicative boost preserves that ordering (thinking in terms of the Wilcoxon-Mann-Whitney statistic may be helpful). For the boosting heuristic employed in our ad serving, a more appropriate metric to expect a significant shift in is the over-prediction percentage: 100 * (sum(probability_of_purchase) - total_actual_purchases) / total_actual_purchases.
A data scientist may push hard to improve a particular metric using seemingly intuitive heuristics that actually improve some other metric(s), and not necessarily the one being targeted. Therefore, we think that testing the model on multiple performance metrics is a better practice than relying on one traditionally used performance metric.
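The AUC's insensitivity to a flat boost is easy to see in code. The sketch below is illustrative (the labels and probabilities are made up): the AUC, computed in its Wilcoxon-Mann-Whitney form, stays fixed when the scores are boosted, while the over-prediction percentage moves immediately.

```python
# Sketch: AUC depends only on the rank ordering of scores, so a flat
# boost factor cannot move it; the over-prediction percentage can.

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def over_prediction_pct(labels, probs):
    """100 * (sum(predicted probabilities) - actual purchases) / actual purchases."""
    actual = sum(labels)
    return 100.0 * (sum(probs) - actual) / actual

labels = [0, 0, 1, 0, 1, 0, 0, 1]          # 1 = converter
probs  = [0.05, 0.45, 0.40, 0.15, 0.55, 0.08, 0.20, 0.35]
boosted = [min(1.0, 2.0 * p) for p in probs]  # sale-day boost factor of 2

print(auc(labels, probs), auc(labels, boosted))            # identical AUC
print(over_prediction_pct(labels, probs),
      over_prediction_pct(labels, boosted))                # shifts with the boost
```

Because the boost is a monotone transform of the scores, every positive/negative pair keeps its relative order, so the Wilcoxon-Mann-Whitney count is unchanged.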
Did we, by mistake, learn from the “future”?
Let us say we have some high-cardinality ordinal variables in the data that need to be discretized, and we run a supervised discretization algorithm over some data. Then, to build a prediction model, we split the same data into training data and test data. Using the output of the discretization algorithm (a variable bucketization scheme), we bucketize (or discretize) the variables in both the training and test data. We then train the model on the training data, test it on the test data, and report performance metrics. Even though the model was trained on the training data only, some knowledge learnt from the "future" has wrongly gone into it. (Here, the "future" is the test data, which is supposed to emulate the future/unseen data on which the live model will be queried.) By learning the bucketization scheme from the entire data set (training plus test) and applying that scheme to the training data, we have made the model borrow knowledge from the test data! Since the training set is typically much larger than the test set, one may think that the buckets learnt from the entire data won't shift much compared to those learnt from the training data alone. Indeed, this may not make much of a difference in many cases, but when it does, there can be a marked difference between offline and live-traffic test results, and one may be left wondering where the "bug" is. Along the same lines, as emphasized in academia too, the cross-validation data (used to arrive at some parameters of the model) should not overlap with the test data.
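The leak-free workflow is: split first, learn the bucketization scheme from the training split only, then apply the learned scheme to both splits. A minimal sketch, with illustrative helper names (fit_quantile_buckets and apply_buckets are not from any real library) and a simple unsupervised quantile bucketizer standing in for the supervised discretization algorithm:

```python
# Sketch of the leak-free workflow: split FIRST, then learn the
# bucketization scheme from the training data only.

import bisect
import random

def fit_quantile_buckets(values, n_buckets):
    """Learn bucket boundaries (quantile cut points) from data."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]

def apply_buckets(values, boundaries):
    """Map each value to a bucket index using the learned boundaries."""
    return [bisect.bisect_right(boundaries, v) for v in values]

random.seed(0)
data = [random.expovariate(1.0) for _ in range(10_000)]

# WRONG: learning boundaries on ALL the data leaks test information
# into the training features:
#   leaky_bounds = fit_quantile_buckets(data, 10)

# RIGHT: split first, fit on the training portion only.
train, test = data[:8_000], data[8_000:]
bounds = fit_quantile_buckets(train, 10)
train_buckets = apply_buckets(train, bounds)
test_buckets  = apply_buckets(test, bounds)   # test data only sees the learned scheme
```

The same split-first discipline applies to any learned preprocessing step: scaling parameters, target encodings, feature selection, and hyperparameter tuning via cross-validation.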
Did the model learn well enough for a new advertiser?
For a new advertiser, we typically do not deploy our prediction model right away, as we think it is worthwhile to let the model learn enough about the new advertiser before we use it to bid for that advertiser. We do flat bidding (with a high value) for some days to ensure that we can win enough bids and place the advertiser's ad enough times across multiple websites. This is done to ensure that we explore a good number of events to train the model on. The question is: how much data is required for the model to learn well enough about the new advertiser? One may be excited to think in terms of the sample complexity of the prediction model, but we would also like to share a practical approach (or a hack) to "solve" this problem. We track AUC scores (and other metrics) over the last few days for each model (we have multiple models in production), per advertiser, etc.; even for a new advertiser for which the model is not enabled, the AUC score still gets tracked. To answer "has the model learnt enough for this advertiser?", we can look at the trend in the AUC score for the advertiser over the past 15 days (say) and get a sense of whether it has been on the rise or has reached some kind of stability (for example, if the AUC scores for the last two Mondays are not very different, and the story is similar for the other days of the week). If it has reached some kind of stability, it may be about time to enable the prediction model for the new advertiser. Of course, with sparse campaigns (advertisers with less data), we will see a lot of fluctuation in daily AUC scores. In that case, extending the test data from one day to two or more days may be a good way to keep fluctuations in the performance metrics under control.
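The week-over-week stability check described above can be sketched in a few lines. The daily AUC series and the tolerance below are illustrative assumptions, not production values:

```python
# Sketch of the "has the model learnt enough?" heuristic: compare each
# day's AUC with the same weekday one week earlier, and enable the model
# once the week-over-week change stays below a tolerance.

def stable_enough(daily_auc, tolerance=0.01):
    """daily_auc: AUC per day, oldest first, covering at least two weeks."""
    if len(daily_auc) < 14:
        return False
    week_1, week_2 = daily_auc[-14:-7], daily_auc[-7:]
    # Same-weekday comparison: Monday vs Monday, Tuesday vs Tuesday, ...
    return all(abs(a - b) <= tolerance for a, b in zip(week_1, week_2))

rising  = [0.55, 0.58, 0.60, 0.63, 0.65, 0.67, 0.68,
           0.70, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77]
settled = [0.740, 0.750, 0.730, 0.760, 0.740, 0.750, 0.740,
           0.745, 0.752, 0.733, 0.757, 0.741, 0.748, 0.743]

print(stable_enough(rising))   # False: still improving, keep flat bidding
print(stable_enough(settled))  # True: stable, enable the model
```

A rolling window of more than two weeks, or a tolerance scaled to the advertiser's traffic volume, would be natural refinements for sparse campaigns.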
Other hacks, like "wait until 100 purchases (or 100 success events) and then deploy the model", may not be a great thing to do, especially for fast-moving campaigns in which we see 100 conversions (success events) within a day or two.
Are you split-testing with 5% data in one split?
Let's say we have a split test enabled on a few of the advertisers, with only 5% of users in one split. In that case, reporting the margins, revenues, etc. daily might not be a good thing to do, especially if we see a lot of fluctuation. Fluctuations may indicate that the 5% (or even 10%) of data is either not getting stratified-sampled well enough (we have observed that many people believe simple random sampling on large data is good enough to ensure that neither split is unlucky enough to get all the poor-quality users) or is too sparse (say, five conversions per day per advertiser) for daily reporting. In the latter case, instead of looking at daily performance, we should look at overall performance over a longer period, say 20 days, rather than conclude that the model is unstable and reject it completely.
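A small simulation makes the sparsity point concrete. The traffic numbers below are made up for illustration: with a constant underlying conversion rate and only ~5 conversions a day in the small split, daily estimates swing widely while the 20-day aggregate is far steadier.

```python
# Toy illustration: daily conversion-rate estimates from a sparse split
# fluctuate heavily even when the true rate never changes.

import random

random.seed(42)
true_rate = 0.005            # constant underlying conversion rate
daily_users = 1_000          # users landing in the small split each day
n_days = 20

# Simulated conversions per day: each user converts independently.
days = [sum(random.random() < true_rate for _ in range(daily_users))
        for _ in range(n_days)]

daily_rates = [c / daily_users for c in days]
overall_rate = sum(days) / (n_days * daily_users)

print(min(daily_rates), max(daily_rates))  # daily estimates swing widely
print(overall_rate)                        # 20-day aggregate sits near 0.005
```

The spread between the best and worst day here is purely sampling noise; judging the split (or the model behind it) on any single day would be misleading.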
To summarize, we recommend the following:
Use multiple performance metrics to test the model on various aspects.
Make sure that we separate out the training and test data at the very beginning, and then learn the discretization scheme and any hyperparameters of the model from the training data only.
Ensure that before we use a prediction model for a new advertiser the model has learnt well enough about the new advertiser by tracking the stability of the performance metrics.
Think carefully before rejecting a model based on split-testing results.
Our data science team at RevX is working hard to make sure that we do our stuff correctly. We believe that investing our efforts in our prediction system is the key to the success of the entire ecosystem: a better-performing model targets the right purchasers at the right time for our advertisers, which in turn fetches us higher revenues and margins, which in turn enables us to scale further, which in turn, again, fetches us more training data to build an even better model!