Offline Machine Learning
We model the problem of predicting an ad click as a binary classification problem with two possible outcomes: the ad is clicked or it is not. The prediction model returns an estimate of the outcome as a probability between 0 and 1, given the context in which the ad will be served. Starting from initial seed weights for each feature, the model is trained with an optimization algorithm such as gradient descent: the feature weights are updated iteratively based on the gradient of a cost function such as log loss, and training continues until the log loss converges to its minimum. Traditional optimization algorithms use the entire batch of training samples to adjust the feature weights in every iteration and are therefore termed batch learning algorithms.
Online Machine Learning
In practice, however, a large dataset can make computing the cost and gradient very slow, and sometimes intractable on a single machine, especially if the dataset is too big to fit in main memory. Another issue with batch optimization is that it offers no easy way to stream in new data and train the model incrementally. For example, user behavior might change abruptly during special events like a flash sale, and the model must be retrained on the latest data to capture this new trend. Online learning algorithms like Stochastic Gradient Descent (SGD) address both issues: they follow the gradient of the cost function after seeing only a single or a few training examples, so the existing model can be updated incrementally with new batches of data, making training much faster and more resource efficient.
The standard gradient descent algorithm updates the parameters θ of the objective J(θ) as,
θ = θ − α ∇θ E[J(θ)]
where the expectation E[J(θ)] is approximated by evaluating the cost and gradient over the full training set in every iteration. Stochastic Gradient Descent (SGD) instead computes the gradient using only a single training example rather than the full training set.
The new update is given by,
θ = θ − α ∇θ J(θ; x^(i), y^(i))
with a single pair (x^(i), y^(i)) from the training set.
Batch gradient descent takes a single gradient step only after a full pass through the entire training set, whereas stochastic gradient descent takes one gradient step per training sample, which typically leads to much faster progress toward the optimum.
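The per-sample update above can be sketched in NumPy for logistic regression with log loss. This is a minimal illustration on toy data, not our production model; the function names and data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(theta, X, y, alpha=0.1):
    """One SGD pass: update theta after each single (x_i, y_i) pair."""
    for i in range(len(y)):
        p = sigmoid(X[i] @ theta)      # predicted click probability
        grad = (p - y[i]) * X[i]       # log-loss gradient for one sample
        theta = theta - alpha * grad   # parameter update
    return theta

# Toy data: 2 features, label driven by the first feature's sign
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

theta = np.zeros(2)
for _ in range(5):
    theta = sgd_epoch(theta, X, y)
```

Note that the model is updated 200 times per epoch here, versus once per epoch for batch gradient descent on the same data.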
To keep the model close to optimal while using online algorithms, the following precautions need to be taken:
Uneven distribution of data:
In online advertising, events such as impressions, clicks, and conversions can be clustered together and distributed unevenly throughout the day. This can distort online model learning and lead to poor convergence. An effective way to overcome this is to shuffle the delta training data before each epoch of training so that positive and negative samples are randomly distributed.
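Per-epoch shuffling can be sketched as follows; the generator name and data are hypothetical, the point is simply a fresh random permutation before every epoch.

```python
import numpy as np

def shuffled_epochs(X, y, n_epochs, seed=42):
    """Yield an independently shuffled copy of the delta batch per epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(len(y))  # fresh random order every epoch
        yield X[order], y[order]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 1])
for X_ep, y_ep in shuffled_epochs(X, y, n_epochs=3):
    pass  # feed each shuffled epoch to the SGD loop
```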
Choice of batch size:
Batch size determines how many examples the model looks at before making a weight update. A smaller batch size gives a noisier training signal; a larger batch size increases the time needed to compute the gradient at each step. Algorithmically, using mini-batches in SGD reduces the variance of the stochastic gradient updates (by averaging the gradients within the mini-batch), which in turn allows bigger step sizes, so the optimization makes progress faster.
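The mini-batch variant averages the per-sample gradients before each update. A minimal sketch on toy data, with hypothetical names and an illustrative batch size of 32:

```python
import numpy as np

def minibatch_grad(theta, X_batch, y_batch):
    """Average per-sample log-loss gradients over a mini-batch,
    reducing the variance of the stochastic update."""
    p = 1.0 / (1.0 + np.exp(-(X_batch @ theta)))
    return X_batch.T @ (p - y_batch) / len(y_batch)

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

theta, alpha, batch_size = np.zeros(3), 0.5, 32
for epoch in range(20):
    order = rng.permutation(len(y))          # shuffle before each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        theta -= alpha * minibatch_grad(theta, X[idx], y[idx])
```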
Use of learning rate:
A large learning rate speeds up learning but may cause gradient descent to oscillate around the optimum and never converge. A small learning rate converges more reliably but may slow down training severely. A good trade-off is an adaptive learning rate: start with a high learning rate and keep reducing it as training approaches the optimum. We recommend using a separate adaptive learning rate per model feature for the best results.
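Per-feature adaptive learning rates can be sketched with an AdaGrad-style update, where each feature's effective step size shrinks as that feature's squared gradients accumulate. This is an illustrative sketch on a simple quadratic objective, not our production optimizer:

```python
import numpy as np

def adagrad_update(theta, grad, accum, alpha=0.5, eps=1e-8):
    """AdaGrad: each feature gets its own effective learning rate,
    alpha / sqrt(sum of its past squared gradients)."""
    accum += grad ** 2
    theta -= alpha * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Minimize J(theta) = 0.5 * ||theta - target||^2, gradient = theta - target
target = np.array([2.0, -1.0])
theta = np.zeros(2)
accum = np.zeros(2)
for _ in range(500):
    grad = theta - target
    theta, accum = adagrad_update(theta, grad, accum)
```

Features with frequent large gradients see their rates decay quickly, while rare features keep larger steps, which is the behavior we want for sparse ad features.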
Use of L1 regularization:
L1 regularization selects features with high importance and hence reduces model complexity. In online learning, however, feature weights are updated on mini-batches, so features that are important overall might be discarded by L1 because they appear unimportant in a specific mini-batch. We recommend enabling L1 regularization only after testing it thoroughly in a production setup.
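One common way to combine L1 with stochastic updates is a proximal (soft-thresholding) step after each gradient update; the sketch below shows only that step, with hypothetical weights and an illustrative threshold of alpha times lambda:

```python
import numpy as np

def soft_threshold(theta, lam_alpha):
    """Proximal L1 step: shrink every weight toward zero and set weights
    below the threshold to exactly zero, pruning low-importance features."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam_alpha, 0.0)

theta = np.array([0.8, -0.05, 0.3, 0.01])
theta = soft_threshold(theta, 0.1)  # small weights become exactly zero
```

This also illustrates the pitfall noted above: a feature whose weight happens to be small after one mini-batch update can be zeroed out even if it matters overall.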
Before we moved to online learning, we built our prediction models on big Spark clusters, and model builds typically completed in 10-15 minutes. With online learning we are able to build models on a single EC2 instance (15 GB RAM, 4 CPU cores) in less than a minute without losing predictive accuracy. This has helped us roll out frequent model updates to our bidder servers to reflect the most recent changes in market and campaign dynamics.
RevX is an app retargeting platform that powers growth for mobile businesses through dynamic retargeting. The platform is built on integrated and transparent technology combining four key pillars – audience intelligence, programmatic media, personalized ads, and ROI optimization. Mobile marketers across verticals like e-Commerce, travel, lifestyle, hyperlocal and gaming use RevX to enhance user engagement by activating new users, converting existing users and re-activating lapsed users.
If you have any doubts, queries or ideas, do reach out to us at email@example.com