Predicting clicks in mobile advertising: An experiment
Advertising is a multi-billion dollar industry that acts as a bridge between companies and their customers. While most people are conscious of the ads around them, they likely underestimate the power of those ads and the influence of advertising in general. Research suggests that simply making someone aware of products, events, and brands increases the odds of that person actually buying those products, attending those events, or supporting those brands. Further, if an ad captures a person’s attention to the extent that he or she has an immediate, positive reaction to it, those odds of direct product engagement increase even more.
Mobile advertising is a form of advertising that takes place on mobile devices, such as smartphones and tablets. Mobile ads are served via Real Time Bidding (RTB), an auction process that happens in mere milliseconds. A mobile device user, together with the available ad space on his or her device, constitutes a bid request. Advertising companies bid to serve ads to these bid requests, and the winning bid results in that company's client's ad appearing on the mobile device. Such an ad, referred to as an impression, also gives the user the option of obtaining more information by directing them to a website when the ad is touched; this is called a click. Mobile advertising companies use the ratio of clicks to impressions, known as click-through rate (CTR), to gauge the success of their clients' ads. The process is well described by the following graphic from insight.venturebeat.com:
For this capstone project, we - Aaron Owen, Kathryn Bryant, and Paul Ton - served as consultants for the mobile advertising company Ads Anonymous (not their real name). Ads Anonymous’ main objectives are to maximize the number of client impressions served via the RTB process, and more importantly, to maximize client CTR. Our role was to build a machine learning framework that could be used to inform Ads Anonymous' bidding strategies by providing accurate click probabilities for incoming ad-space auctions, which in turn, can help Ads Anonymous more effectively achieve its goals.
Our approach to providing accurate click probabilities was founded on two assumptions. The first assumption is that the click probability of a bid request for a particular ad depends on:
- the specifics of the ad served (relevance, allure, brand recognizability, etcetera)
- the specifics of the viewer/user (age, gender, state of origin, etcetera)
To illustrate these points, consider the following hypothetical scenarios.
- The average click probability across all demographics for a generic ad is 0.15, but the average click probability for only individuals in Texas is 0.08. That is, Texans are less likely than the general population to click on generic ads.
- The average click probability for only Texans is 0.08 for a generic ad and 0.2 for a popular truck brand ad. That is, Texans are more likely to click on this popular truck brand ad than on a generic ad.
- The average click probability across all demographics for a generic ad is 0.15, but the average click probability across all demographics for the popular truck brand ad is 0.2. That is, everyone is more likely to click on a popular truck brand ad than on a generic ad.
These three situations are captured by the following graphic:
Yellow = 0.08, Orange = 0.15, Red = 0.2.
The above assumption led us to our second one, which is that Ads Anonymous' clients would prefer click probability predictions that have been customized for their campaigns (a collection of ads), but only if those customized click probability predictions are more accurate than general click probability predictions!
Thus, we endeavored to determine if campaign-specific models, those that were trained only on data relating to a specific campaign, were more accurate in predicting click probabilities than a general model that was trained using all available data.
The general methodology we employed to investigate this question was to conduct an experiment in which we compared the performance of two campaign-specific models with the performance of a general model on campaign-specific data. This allowed us to make a direct campaign-to-general model comparison. The figure below describes our workflow; each section of the diagram will be discussed in more detail below.
Ads Anonymous provided us with three data sets:
- 36 billion bid requests, which constituted about 30 TB (8 compressed)
- 300 million impressions, the subset of the bid requests that Ads Anonymous bid on and won
- 500,000 clicks, the subset of the impressions that a user clicked
Unfortunately, Ads Anonymous was unable to obtain data for the bid requests on which they bid but did not win. For this reason, we chose to exclude the bid requests data set and used only the impressions and clicks sets to train our models. Because the clicks data set was a subset of the impressions data set, we were able to merge the two together to determine which impressions did or did not lead to a click, effectively leaving us with a single data set. Finally, at Ads Anonymous' suggestion, we filtered the data for only those impressions that occurred in the U.S. and those that belonged to an ad campaign with a non-zero CTR. This reduced, "general" data set of 73 million impressions with a CTR of 0.63% is what we used to train our general model. We created our two campaign-specific data sets from this general data set, but only after cleaning and preparing the general set for modeling.
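A minimal sketch of this merge-and-filter step, assuming small in-memory records rather than the full data set. The field names (`impression_id`, `campaign`, `country`) and the toy records are hypothetical, and the order of the two filters is an assumption, since the text does not say whether campaign CTR was computed before or after the U.S. filter.

```python
from collections import defaultdict

# Hypothetical toy records; the real field names and values are not given in the text.
impressions = [
    {"impression_id": "a1", "campaign": "101", "country": "US"},
    {"impression_id": "a2", "campaign": "101", "country": "US"},
    {"impression_id": "a3", "campaign": "202", "country": "US"},
    {"impression_id": "a4", "campaign": "101", "country": "CA"},
]
clicks = [{"impression_id": "a2"}]


def label_and_filter(impressions, clicks):
    """Label each impression as clicked (1) or not (0), then apply both filters."""
    clicked_ids = {c["impression_id"] for c in clicks}
    # Left-join style labelling: every impression gets a 0/1 target.
    labelled = [
        dict(imp, clicked=int(imp["impression_id"] in clicked_ids))
        for imp in impressions
    ]
    # Keep U.S. impressions only.
    us_only = [imp for imp in labelled if imp["country"] == "US"]
    # Keep campaigns with at least one click, i.e. a non-zero CTR
    # (assumed here to be computed after the U.S. filter).
    clicks_per_campaign = defaultdict(int)
    for imp in us_only:
        clicks_per_campaign[imp["campaign"]] += imp["clicked"]
    return [imp for imp in us_only if clicks_per_campaign[imp["campaign"]] > 0]


general = label_and_filter(impressions, clicks)
ctr = sum(imp["clicked"] for imp in general) / len(general)
```

On the toy records, only the two U.S. impressions of campaign 101 survive, one of which was clicked.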
A caveat to our approach is that because we were unable to analyze lost bid request information, our models were inherently biased towards bid requests that Ads Anonymous bid on and won. Therefore, we made a (nontrivial) assumption that the impressions on which we trained our model were representative of all possible impressions.
Our now single, general data set included several types of information for each impression, including variables relating to the user, the user's device, the ad, the app on which the ad was served, and the bid request from which the impression came. Detailed below are examples of how we treated different variables.
Our data set included some variables that possessed little to no variance, and thus provided limited predictive power. For example, all our data came from impressions that occurred in September 2016, so the variables Month and Year were uniform for all impressions. Our solution was to drop these variables.
The variables Ad and Campaign were labeled with unique numeric identifiers, but these numbers do not have an inherent ordinality. That is, Campaign 234 does not have an ordered relationship with Campaign 235. Thus, the Ad and Campaign variables are actually categorical, rather than numeric, with each unique identifier representing a different class or level. Machine learning algorithms, however, would treat these numbers as having an ordinal relationship. Our solution was to convert these numeric identifiers into strings.
While the variables themselves contained information useful for prediction, often additional information exists "hidden" in the values after manipulating them in some way. To access this hidden information, our solution was to make new variables by either extracting information from a single variable or creating an interaction between two variables.
An example of a feature we extracted was derived from the variable, Day, which included values ranging from 1 to 22 corresponding to the days in September from which the data were collected. We hypothesized that people may have different click behavior depending on different days of the week. Thus, we created the new variable, Weekday, which described the day of the week (e.g., Monday, Tuesday, etc.) the ad was served.
Other examples of extractions are:
- iabFirstCat and iabSecondCat; extracted from iabCategories, a list of tags describing the app or site on which the ad was served
- Lat and Long; extracted from Location, a list of the latitude and longitude
- Hour; extracted from the Timestamp, which included a more refined time estimate (to the millisecond) than we thought useful
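These extractions can be sketched with the standard library alone. The raw formats are assumptions: the text does not say how Timestamp or Location were encoded, so this sketch treats Timestamp as Unix epoch milliseconds in UTC and Location as a (latitude, longitude) pair. The function names are illustrative.

```python
from datetime import date, datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_from_day(day, year=2016, month=9):
    """Map a day of the month (September 2016 by default) to a weekday name."""
    return WEEKDAYS[date(year, month, day).weekday()]


def hour_from_timestamp_ms(timestamp_ms):
    """Hour of day, assuming epoch milliseconds in UTC (an assumption)."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).hour


def split_location(location):
    """Split a (latitude, longitude) pair into two separate features."""
    lat, lon = location
    return lat, lon


weekday = weekday_from_day(1)                 # 1 September 2016
hour = hour_from_timestamp_ms(1472742000123)  # 2016-09-01 15:00:00.123 UTC
lat, lon = split_location((29.76, -95.37))    # hypothetical coordinates
```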
An example of a feature interaction we made was the combination of Weekday and Hour. Similar to our rationale regarding the creation of the Weekday variable, we hypothesized that click behavior might be different both on different days and at different times of different days. People may be less active on their mobile devices during weekdays than on weekends, but this may differ depending on the time of day. That is, click behavior on Friday evenings might be more similar to click behavior on Saturday evenings than to click behavior on Sunday evenings.
Other examples of interactions are:
- Hour and Age; peak device use likely differs among age groups
- Hour and State; we considered this feature an estimate of timezone
- operatingSystem and State; there may be regional differences in device preferences
- Gender and State; there may be regional differences among the genders
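For categorical features, one simple way to build such interactions is to concatenate the two values into a single new level. The sketch below assumes each impression is a dictionary; the helper name `interact` and the underscore separator are illustrative choices, not something specified in the text.

```python
def interact(row, left, right, sep="_"):
    """Combine two categorical values into one interaction level."""
    return f"{row[left]}{sep}{row[right]}"


# Hypothetical impression with already-extracted features.
row = {"Weekday": "Friday", "Hour": 19, "State": "TX", "Gender": "F"}
row["Weekday_Hour"] = interact(row, "Weekday", "Hour")
row["Hour_State"] = interact(row, "Hour", "State")
row["Gender_State"] = interact(row, "Gender", "State")
```

Each distinct pair of values then becomes its own level, so "Friday_19" and "Sunday_19" can receive different coefficients after one-hot encoding.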
A large majority of the variables in the dataset were categorical and several of them contained a large number of levels. Many of these levels, however, contained the same core information, and thus would be more appropriately treated as the same level. Our solution was to reduce levels to their most important, core information.
An example of a variable with messy levels that we reduced was BestVenueName, which describes the app or site on which the ad was served. Seen below on the left are values for BestVenueName for three different impressions. Each observation contains the core information, "Meetme", but if left untreated, a machine learning algorithm would have treated them as separate levels. We cleaned these impressions such that each value was replaced with only the value on the right, "meetme."
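A minimal sketch of this kind of level reduction. The raw strings and the cleaning rule are hypothetical: the actual BestVenueName values and the exact rule used are not reproduced in the text. The rule assumed here keeps the text before the first separator, lowercases it, and strips non-alphanumeric characters.

```python
import re


def core_venue_name(raw):
    """Reduce a messy venue string to a lowercase core name (assumed rule)."""
    # Keep only the text before the first separator such as "-", ":", "|" or "(".
    head = re.split(r"\s*[-:|(]\s*", raw.strip(), maxsplit=1)[0]
    # Lowercase and drop anything that is not a letter or digit.
    return re.sub(r"[^a-z0-9]+", "", head.lower())


# Hypothetical raw values that share the same core information.
raw_values = [
    "MeetMe - Chat & Meet New People",
    "Meetme: Meet & Chat (Android)",
    "MEETME",
]
cleaned = [core_venue_name(v) for v in raw_values]
```

A rule this blunt will mis-handle venue names that legitimately contain a separator, so in practice the output would need spot checks.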
Some categorical variables still contained a large number of levels even after cleaning the messy ones. In several cases, a majority of the impressions only included a small set of these levels whereas other levels were seen only a few times. Our solution was to group very infrequent levels together into an "other" level.
An example of such releveling was the variable, Carrier, which describes the platform from which the device is receiving Internet service. This variable contained more than 350 levels, most of which corresponded to only a small number of impressions (see below top). We releveled Carrier by grouping all levels that constituted less than 0.5% of all impressions into an "other" group (see below bottom).
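The releveling rule can be sketched as follows, using the 0.5% cutoff from the text. The carrier names and counts are made up for illustration.

```python
from collections import Counter


def relevel(values, min_share=0.005, other="other"):
    """Replace levels covering less than min_share of all rows with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other for v in values]


# Hypothetical Carrier column: two common levels and one rare one (0.3%).
carriers = ["verizon"] * 600 + ["att"] * 397 + ["tinycell"] * 3
releveled = Counter(relevel(carriers))
```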
The Location variable, comprised of the latitude and longitude of the user at the time the bid request occurred, was missing for roughly 4% of impressions. This geospatial data is important in understanding the demography of the user, and thus is useful in targeting the appropriate audience for a particular ad. Machine learning algorithms do not handle missing data well, so our solution was to remove those impressions whose Location variable was missing.
When preparing datasets for modeling, variables with high amounts of missingness are often either dropped, losing any information in the variable, or the missing values are imputed in a systematic way, artificially populating the variable. While both strategies have advantages and disadvantages, the one employed is often based on the perceived importance of the variable in question.
For our data, almost half of the impressions were missing the variable, Gender.
Ad research suggests that gender is an important factor in determining a person's response to advertising. In a meta-analysis of more than 30 years of research, a scientist found that women will purchase products marketed towards both genders, whereas men will only purchase products marketed towards men. Based on this information, we considered gender to be an important factor in predicting clicks, so we opted to impute the missing values with machine learning.
To execute our imputation we used a random forest to predict the missing values. We only included variables that were descriptive of the user and their device, and excluded information pertaining to the ad or click. After a coarse cross-validation process, our best random forest model included 40 trees, a max depth of 20, and included four variables at each split. The model resulted in an 85% accuracy in predicting gender.
Once our data had been cleaned and all feature engineering was complete, we created our two campaign-specific data sets. For the general model, we ended up using ~73 million impressions (with clicked/unclicked labels), and from this, we sectioned off all impressions related to company Hair Care and all impressions related to company Sports Bar. Hair Care included 3.2 million impressions and a 1.45% CTR, and the other campaign, Sports Bar, was comprised of 1.9 million impressions and a 0.55% CTR. These campaigns were chosen for their manageable size, their duration, and for click-through rates that straddled the general data set's CTR (i.e., 0.63%).
For each of the campaign-specific datasets, we had to further split it into training, validation and test sets. To approximate a production setting in which we would be using past data to predict on future data, we opted to do our train/validation/test splitting by time. Our data spanned from 9/1 to 9/22 in 2016, and we used impressions from 9/1 - 9/17 as our training set and impressions from 9/17 - 9/22 as our test set; this meant that our training set constituted the beginning 80% of the campaign-specific data and the test set constituted the last 20%.
To obtain a validation set, we further split the above training set. For this, we used a random split such that our new, smaller training set constituted roughly 60% of the overall campaign-specific data and the validation set constituted roughly 20%. In hindsight, it would have been more consistent to also use time for this split. However, our approach was motivated by wishing to replace the cross-validation step on the 80% training set with a single validation step on the 20% validation set, and we believed that using a random subset of that 80% set was still justified.
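The two-stage split can be sketched as below: an ordered split by time for the test set, then a random split of the remainder into training and validation sets. This version cuts at the 80% mark of the time-ordered rows rather than at a fixed calendar date, which is an assumption; the column name `Day` and the toy rows are illustrative.

```python
import random


def split_by_time_then_random(rows, time_key="Day", train_share=0.8,
                              val_share_of_train=0.25, seed=42):
    """Return (train, validation, test) with the latest rows held out as test."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = int(len(ordered) * train_share)
    train_val, test = ordered[:cut], ordered[cut:]
    # Random split of the earlier 80%: 25% of it is 20% of the whole.
    shuffled = train_val[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_share_of_train)
    return shuffled[n_val:], shuffled[:n_val], test


# Hypothetical rows spread over days 1 to 22.
rows = [{"Day": 1 + i * 22 // 100, "id": i} for i in range(100)]
train, validation, test = split_by_time_then_random(rows)
```

Because the validation rows are drawn at random from the same period as the training rows, validation scores can look better than scores on the later test period.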
Predicting click-through rate is a binary classification problem, which we decided to approach using logistic regression with stochastic gradient descent. The advantages of using logistic regression on this particular data set are many:
- Robustness to high dimensionality: With multiple nominal categorical features with large numbers (> 1000) of levels, one hot encoding would lead to a final data set with thousands of columns. Logistic regression is more robust to high dimensionality than are, say, tree-based methods.
- Speed of training: Given the constraints of the project in time and computing power for such large data, the speed of training logistic regression models relative to more advanced models like neural networks was significant; it allowed us to run our models many times, both when tuning and when experimenting with different degrees of undersampling.
- Interpretability: The output of a logistic regression model is a probability of "success" for a given input, where success is simply an event outcome of interest. The underlying theory of logistic regression is highly transparent in that it only employs a combination of basic functions (sigmoid, linear) and basic probability rules (independence of events), making it superior in interpretability compared to more sophisticated models.
- Appropriateness: Due to the number of observations, support vector machines were inappropriate and due to the presence of rare combinations between features, linear discriminant analysis and Naive Bayes approaches were thrown out as well.
- Possible robustness to imbalanced classes: The two levels of our 'Clicked' variable were 'clicked' and 'not clicked', and 'clicked' constituted an extremely minor class. From the data only 1 out of 160 impressions yields a click, making the proportion of 'clicked' roughly 0.006. Rumor has it that although logistic regression (like all other machine learning algorithms) struggles to predict extremely minor classes well, there are relatively easy ways to address this issue; altering the intercept or scaling the prediction threshold are two such approaches.
Information surrounding logistic regression and its tolerance to imbalanced classes is both limited and mixed, so we decided to run a sub-experiment. One common way of addressing class imbalance is through under-sampling the majority class, that is, by throwing out some percentage of the majority class observations in order to diminish the degree of class imbalance in the model training set. The risk of under-sampling too much is that a model can be left with too little data to train on, which can lead to under-fitting.
To see how under-sampling affected our models, we trained models with different degrees of under-sampling. We trained each model three times: once on the entire training set (no under-sampling), once on a training set where we under-sampled the majority class to 10%, and once on a training set where we under-sampled the majority class to 1% (reaching near parity between clicked and not clicked). The results of these different under-samplings on our models are given below in Predictions and Evaluation.
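A minimal sketch of majority-class under-sampling, assuming rows are dictionaries with a 0/1 `clicked` label (the key name and toy data are illustrative). Every clicked row is kept and each unclicked row is kept with probability `keep_fraction`.

```python
import random


def undersample_majority(rows, keep_fraction, label_key="clicked", seed=0):
    """Keep all positives and a random keep_fraction of the negatives."""
    rng = random.Random(seed)
    return [
        r for r in rows
        if r[label_key] == 1 or rng.random() < keep_fraction
    ]


# Hypothetical training set: 10 clicks among 10,010 impressions.
rows = [{"clicked": 1}] * 10 + [{"clicked": 0}] * 10000
full = undersample_majority(rows, keep_fraction=1.0)    # no under-sampling
tenth = undersample_majority(rows, keep_fraction=0.10)  # majority class to 10%
```

Only the training set should be under-sampled; the validation and test sets keep the original class balance so that scores reflect real traffic.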
The other side of an under-fitting problem is an over-fitting problem, which happens if model complexity is too high and the model fits to noise in the underlying data. To prevent this, Spark's logistic regression has built-in regularization in the form of Elastic-Net; it has an alpha parameter that controls the mixture of Ridge and Lasso (L2 vs L1 penalty to the cost function) and a lambda parameter that controls the size of the penalty term.
Since we were working with big data, we had to be deliberate in any choice that could result in long running times. As such, we chose to reduce the number of parameters we tuned from two to one. We used pure Lasso regularization (alpha = 1) and only tuned the lambda penalty parameter.
For our grid search on lambda, we also had to make some practical sacrifices. Rather than run a k-fold cross validation, we used one validation set and we kept our grid search fairly coarse. We evaluated the fit of our models using log-likelihood. As a guide for an appropriate starting lambda, we used MLLib's objective history for the un-regularized logistic regression model because it captures the training log likelihoods. With an idea for a starting lambda value, we then searched 5 values at different orders of magnitude around that.
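The tuning loop can be illustrated with a small, plain-Python stand-in. This is not Spark's implementation: it is a sketch of L1-regularized logistic regression trained by stochastic gradient descent with a soft-thresholding step, scored on one validation set by mean negative log-likelihood. The synthetic data, learning rate, epoch count, and lambda grid are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def log_loss(weights, bias, X, y, eps=1e-12):
    """Mean negative log-likelihood; lower is better."""
    total = 0.0
    for x, yi in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)


def soft_threshold(w, t):
    """Shrink w towards zero by t; this is what produces exact zeros."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic_sgd(X, y, lam, epochs=30, lr=0.1, seed=0):
    rng = random.Random(seed)
    weights, bias = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = predict_proba(weights, bias, X[i]) - y[i]
            bias -= lr * error  # the intercept is not penalized
            # Gradient step followed by the L1 (Lasso) shrinkage step.
            weights = [soft_threshold(w - lr * error * xi, lr * lam)
                       for w, xi in zip(weights, X[i])]
    return weights, bias


def grid_search_lambda(X_train, y_train, X_val, y_val, lambdas):
    scores = {}
    for lam in lambdas:
        weights, bias = fit_l1_logistic_sgd(X_train, y_train, lam)
        scores[lam] = log_loss(weights, bias, X_val, y_val)
    return min(scores, key=scores.get), scores


# Synthetic data: the first feature is informative, the second is noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
y = [1 if rng.random() < sigmoid(2 * x[0]) else 0 for x in X]
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

# Five candidate penalties, one per order of magnitude.
lambdas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best_lambda, scores = grid_search_lambda(X_train, y_train, X_val, y_val, lambdas)
```

In Spark's `LogisticRegression`, the corresponding settings are `elasticNetParam` (alpha) and `regParam` (lambda).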
Predictions and Evaluation - Evaluation Metric Choice
After training our models, we can feed in new bid requests and the model will return the probability that a bid request will result in an impression that is clicked. In order to make a prediction of "clicked" or "not clicked" from the probability, we needed to decide on a decision threshold. To decide on a best threshold, we had to decide how we would evaluate the quality of our predictions.
With any classification model that predicts two classes (here, "clicked" or "not clicked"), there are four categories of outcomes: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). In order to choose an appropriate evaluation metric for comparing models, we needed to understand what these four outcomes would mean in the context of mobile advertising. Consider the following (diagram credit: Wikipedia):
- True positives are impressions (won bid requests) that get clicked. From the business standpoint, true positives can be viewed as “money returned.”
- False positives are impressions (won bid requests) that do not get clicked. From the business standpoint, false positives can be viewed as “money lost.”
- False negatives are lost/un-bid-on requests that got clicked. These can be viewed as "money lost-out on."
- True negatives are lost/un-bid-on requests that did not get clicked. These can be viewed as "money well saved."
Suppose TP denotes the total number of true positive outcomes for a model, FN denotes the total number of false negatives, FP denotes the total number of false positives, and TN denotes the total number of true negatives. These outcomes can be used to compute various basic measures of success for a model.
One obvious, easy measure of success is accuracy, defined as (TP + TN)/(TP + FN + FP + TN). We opted to immediately discard accuracy as a candidate for model evaluation due to the extreme class imbalance in our target variable. With such an imbalance, any model could achieve high accuracy by simply predicting every bid request to correspond to an "unclicked" outcome. But since we care more about predicting the minor class than about predicting the major class, this is a highly inappropriate way to evaluate the success of our models. We also threw out specificity (or true negative rate), defined as TN/(TN + FP), because our focus is on clicks (positives) rather than non-clicks (negatives).
Thus, the two basic measures we wanted to consider were recall (sensitivity) and precision.
Recall is defined as TP/(TP + FN) and also goes by the name of true positive rate. In context, recall answers the question “Of all the clicks to be had, what proportion did you get?”. High recall can be interpreted as taking good advantage of relevant advertising opportunities. Low recall corresponds to taking poor advantage of relevant advertising opportunities, i.e. losing money in an abstract/hypothetical way; the company doing the advertising does not show its ad(s) to as many receptive people as it could have. Note that recall can always be maximized by simply buying every single bid request (thereby making FN = 0), but this is unrealistic for any company with a budget (that is, all companies).
Precision is defined as TP/(TP + FP) and can be thought of as positive prediction accuracy. In context, precision answers the question “Of all the impressions you bought, what proportion yielded clicks?”. High precision can be interpreted as putting money in the right place. Low precision corresponds to putting money in the wrong place, i.e. losing money in a tangible way; the company doing the advertising gets very little return on the money they spent on impressions. Precision can be maximized by only buying bid requests for which the click probability is nearly 1 so that FP is roughly 0 (or in the extreme, not buying any bid requests at all so that FP = 0), but of course this defeats the purpose of mobile advertising.
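The four basic measures discussed above can be computed directly from the confusion counts. The example counts are hypothetical, chosen to mimic a CTR of about 0.63%, and show why accuracy is misleading here. Returning 0.0 for undefined ratios is a convention assumed for this sketch.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, specificity, recall and precision from confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Precision is undefined when nothing is predicted positive; 0.0 is used here.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# A model that predicts "not clicked" for all 10,000 impressions, 63 of which were clicked.
always_negative = confusion_metrics(tp=0, fp=0, fn=63, tn=9937)
```

This do-nothing model reaches an accuracy above 99% while its recall is zero.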
With the goal of making Ads Anonymous' customers happy, we recognized that precision and recall needed to be balanced so that AA's clients maximize both the number of receptive viewers to which they show their ads and the number of clicks that result from the impressions they buy. Two metrics came to mind for model evaluation that included both precision and recall: Fβ -Scores and Area Under Precision-Recall Curve (AUPR). Given the implications of low recall versus low precision, we felt it was more detrimental for a company to lose money in a tangible way than in an abstract way so we wanted a model evaluation metric that we could weight in favor of precision. This led us to Fβ -Scores.
In general, an Fβ -Score is a harmonic mean of precision and recall. A harmonic mean of precision and recall can be thought of as a "pessimistic" measure of center, in that it always lies between the values of precision and recall but it is closer to whichever is smaller. (In contrast, the usual arithmetic mean lies in the exact center between them.) The formula for an Fβ -Score is as follows:
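In its standard form, writing P for precision and R for recall (symbols introduced here for brevity), the score is:

```latex
F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}
```

Setting β = 1 gives 2PR/(P + R), the ordinary harmonic mean of precision and recall.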
As the formula suggests, different choices of β allow for varied weighting of precision or recall. At β = 1, the usual harmonic mean is returned, in which both precision and recall are left as-is so that the smaller of the two measures, whichever that may be, is favored. For 0 ≤ β < 1, precision is made artificially smaller and is therefore weighted more heavily than recall; for β > 1, precision is made artificially larger and recall is therefore weighted more heavily than precision. Given that we wanted to prioritize precision over recall in our models, we opted to evaluate our models using the Fβ -Score with β = 0.5. With this chosen evaluation metric, we tuned our clicked/not clicked decision boundary for each model so as to maximize the F0.5 -Score.
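Threshold tuning against the F0.5 -Score can be sketched as follows, using the standard Fβ definition. The candidate thresholds are simply the distinct predicted probabilities, which is an assumption about the search strategy; the probabilities and labels are toy values.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; returns 0.0 when both inputs are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def f_beta_at_threshold(probs, labels, threshold, beta=0.5):
    """Score the rule 'predict clicked when probability >= threshold'."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_beta(precision, recall, beta)


def best_threshold(probs, labels, beta=0.5):
    """Pick the candidate threshold with the highest F-beta score."""
    candidates = sorted(set(probs))
    return max(candidates,
               key=lambda t: f_beta_at_threshold(probs, labels, t, beta))


# Toy predicted click probabilities and true labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
threshold = best_threshold(probs, labels)
```

With β = 0.5, a model with precision 1.0 and recall 0.5 scores higher than one with precision 0.5 and recall 1.0, which is the precision-favoring behavior described above.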
Predictions and Evaluation
Recall that our experiment involved training three different logistic regression models, one for each degree of undersampling, on each of the Hair Care impressions, the Sports Bar impressions, and the General impressions. The 1% undersampled model trained on Hair Care impressions was used to predict on the Hair Care test data, and from that prediction we obtained an F0.5 -Score for the model; we also used the 1% model trained on General impressions to predict on the Hair Care test data to get an F0.5 -Score for that model as well. We repeated these calculations for each combination of undersampling and campaign. Here are the results:
In bold we see the better F0.5 -Score for the two models being compared. For both campaigns, undersampling the major class down to 1% proved detrimental to the campaign-specific models, likely because the data didn't contain the minimum required sample complexity (not enough data was seen in our training sets to appropriately handle new/test data). Interestingly enough, every single model improved as we undersampled less, suggesting that logistic regression is robust to extreme class imbalances under certain conditions. We suspect that the size of the data and the complexity of it both influence the efficacy of under-/over-sampling as a way of dealing with class imbalance, and we recommend exploring this issue on a case-by-case basis.
The official result of our model-comparison experiment is as follows:
Using our metric of choice, the F0.5 -Score, we conclude that campaign-specific models are more effective at predicting click probabilities than a general model, provided that both models are trained on data having minimum required sample complexity.
Although the campaign-specific models achieved better F0.5 -Scores overall, we realized the potential for a mismatch between our academic results and practical results. Hence, we sought to unpack our results further and dig deeper into the business implications of our findings. In particular, we wanted to compute the following for each model:
- Total spent: Had Ads Anonymous bid on and won all the bid requests predicted by the model to be "clicked" (positive), how much money would their client have spent? Total spent is typically not as important as net profit, but companies with tighter advertising budgets may not be able to maximize net profit if doing so requires them to spend more on advertising than their budget allows.
- Total saved: Had Ads Anonymous not bid on any of the bid requests predicted by the model to be "not clicked" (negative), how much money would their client have saved? Money spent on bid requests that yield clicks is money well spent, but money spent on bid requests that do not yield clicks is not. Total saved is a naive measure of how intelligently money is being used. (It is naive because a company could save all of its money and thereby never "misspend" by buying bid requests that do not yield clicks, but this defeats the purpose of advertising at all.)
- Required "downstream return" per click in order to profit: Had Ads Anonymous bid on and won all the bid requests predicted by the model to be "clicked" and not bid on any of the bid requests predicted to be "not clicked," what would the downstream return per click need to be in order for their client to turn a profit? Different models will have Ads Anonymous bid on different bid requests and will therefore produce different numbers of true positives, true negatives, false positives, and false negatives. Whether bidding more aggressively and increasing the number of both true positives (good) and false positives (bad) is smart depends on how much return a company gets per click, and what this number needs to be in order to profit is dependent on those outcomes.
- Return on Investment (ROI): Had Ads Anonymous bid on and won all the bid requests predicted by the model to be "clicked" and not bid on any of the bid requests predicted to be "not clicked," what is the return on investment for mobile advertising for their client? As mentioned above, a more aggressive bidding strategy will result in more true positives and therefore more raw profit. However, it will also result in more false positives and the ratio of money returned to money spent is a more appropriate way to compare two strategies/models that may have drastically different numbers of true positives and false positives.
For these computations we used TP, FP, FN, and TN outputted by each of the models, as well as two other quantities: average price per 1000 impressions (by campaign) which we denote by 'price', and downstream return per click which we denote by 'x'. The requisite formulae/equations are as follows:
1. Total spent:
3. Required "downstream return" per click in order to profit (must solve for 'x'):
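The four quantities can be sketched from their verbal definitions above. The exact equations are not reproduced here, so each line below is an assumption: every bought impression is taken to cost price/1000, bought impressions are TP + FP, skipped ones are TN + FN, the break-even 'x' solves TP · x = total spent, and ROI is taken as net return divided by total spent. The example counts and prices are hypothetical.

```python
def business_metrics(tp, fp, fn, tn, price, x=None):
    """Spend, savings, break-even return per click and (optionally) ROI.

    price is the average price per 1000 impressions; x is the downstream
    return per click. All formulas are assumed readings of the definitions.
    """
    cost_per_impression = price / 1000
    spent = (tp + fp) * cost_per_impression   # bought: predicted "clicked"
    saved = (tn + fn) * cost_per_impression   # skipped: predicted "not clicked"
    # Smallest return per click at which TP * x covers the spend.
    break_even_x = spent / tp if tp else float("inf")
    result = {"spent": spent, "saved": saved, "break_even_x": break_even_x}
    if x is not None:
        # Assumed ROI definition: net return relative to spend.
        result["roi"] = (tp * x - spent) / spent if spent else 0.0
    return result


# Hypothetical model outputs and prices.
example = business_metrics(tp=10, fp=990, fn=5, tn=8995, price=2.0, x=0.3)
```

If ROI were instead defined as money returned divided by money spent, the value would simply be one higher, so the ranking of two models would not change.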
The values for TP, FN, FP, and TN and the four above computations for each of the best models (in all cases, the non-undersampled models) are given below. Note that the ROI computations are done with x equal to the larger value (of the two models) needed for downstream return per click for profit. This was done under the assumption that a true, single x value exists independent of any models, and, for comparison's sake, we made it a value that would enable both models to profit.
Across the board, campaign-specific models are more conservative/less risky from a business standpoint than the general model. By using a campaign model to inform bidding, a company will spend less, save more, and get a higher return on investment. Furthermore, if the downstream return per click 'x' is unknown, a company is more likely to turn a profit by using a campaign-specific model since the downstream return per click needed to profit is lower for the campaign models than for the general model.
Overall, we see that the findings of this business-based analysis of our various models corroborate the findings of our academic analysis. Specifically, with higher ROIs and lower downstream return per click values in order to profit, the campaign-specific models are better than a general model for predicting click probabilities for individual companies.
Tailoring models for predicting click probabilities to specific companies is worth the effort!
Our project focused on finding out whether company-customized machine learning models for click probability were better than general ones. We were successful in answering this question for two carefully chosen campaigns, but it would be prudent to repeat our experiment for a much larger sample of campaigns.
Although our experiment helped us determine which of a campaign-specific model or a general model was better for predicting click probabilities, it did not produce for us the absolute best model for this task. We could certainly shift our focus from a comparative one to an absolute one and pay attention to whether our results were objectively good rather than just comparatively good. To this end, there are a few avenues to pursue:
- Weighting more recent observations: With our time-based approach to training and testing our models, we could weight more recent observations more heavily in order to increase the accuracy of our future probability predictions.
- Using online learning: Online learning allows for a model to be updated in closer-to-real time. Each new, incoming data point is used to update a model and therefore influence future predictions right away. This process prevents a model from becoming stale.
- Using a hierarchical general model: Recall our first assumption that click probability of a bid request for a particular ad depends on:
- the specifics of the ad served (relevance, allure, brand recognizability, etcetera)
- the specifics of the viewer/user (age, gender, state of origin, etcetera)
Unlike a basic logistic regression model, a hierarchical logistic regression model could take both of these points into account by training on different layers. Our idea is to implement a hierarchical model with two layers, the first of which would consist of campaign-specific logistic models and the second of which would be a multiple linear regression model using demographic data to predict on the first layer. More information can be found here.
*Headline photo source: http://www.mobyaffiliates.com/