Lowering Costs of Bank Marketing Campaigns
Banks often need to run marketing campaigns to sell a product to potential customers, even when data is limited. These campaigns cost time and money, and they inconvenience the people contacted if the product offered is a poor match. Inconvenienced customers can develop negative sentiments towards the bank, leading to reputational damage and to profit losses from diminished long-term customer value. In addition, most customers will say no in the end, which makes identifying those who will say yes particularly challenging.
In the parlance of machine learning, this is known as an imbalanced classification problem. In this project, I solve this problem for the case of a Portuguese bank running telephone campaigns to sell long-term deposits. A long-term deposit is a kind of security deposit granting the lender a higher interest rate than a traditional savings account and granting the bank a guarantee of having the lender's funds for a fixed period (such as 12 months). While this problem is challenging, much headway can be made to help the bank reach the right customers, save tens of thousands of dollars in concrete costs, and protect its reputation.
Findings and Methodology Overview
- The data is imbalanced (more noes than yeses), and I rebalance it to increase model performance on the minority class. In addition, there are no timestamps and the 2008 financial crisis occurred during the data collection period. Time-dependent variables are needed for a good ROC-AUC score, but I train models with and without time-dependent variables to establish the robustness of the findings. Ideally, I would use timestamps to run the model on non-crisis data only.
- Random forests on the data rebalanced 'manually' performed better than random forest or XGBoost on data rebalanced with SMOTE. Since accuracy is not an appropriate metric for imbalanced classification, I use the ROC-AUC score.
- I compare three courses of action for the bank: (i) Using the naive approach and contacting everyone until a desired number of customers subscribe, (ii) Using the recommendations of a random forest with classification threshold set at .5, and (iii) Using the recommendations of a random forest with classification threshold set at .3. Using employee salary data, I provide approximations of cost savings due to machine learning models. In practice, the bank would still need to test the models to determine the most profitable course of action.
- Finally, I deploy an R Shiny app for use by bank employees prior to meetings with customers. The employees can use the app to offer the long-term deposit to customers who are likely to respond positively.
The data for this project was collected between May 2008 and June 2013 by a Portuguese banking institution and is available through UCI here. There are 45211 observations with features on bank client data (age, job type, marital status, education, housing and loan status), information regarding the last contact (including a leaky variable, duration, which is dropped), information pertaining to previous campaigns, and some social or economic context variables (such as the 3-month borrowing rate between banks).
To give some examples of the effect of some of the independent variables on deposits, the mosaic plot below shows that more people with a cellphone said yes to the campaign than expected under the hypothesis of independence (blue) and fewer people said no (red). Similarly, fewer people with a landline telephone said yes than expected (red) and more people said no (blue).
Similarly, more people said yes in March, April, September, October, and December, and fewer people said yes in May. I will discuss the subject of timestamps and the potential effect of the financial crisis on the data below.
Missing Timestamps and Macroeconomic Indicators
A problem with the data is that the financial crisis occurred during the data collection period. Separating the crisis period from the rest of the data would require timestamps, which were not available. In addition, the timestamps cannot be unambiguously inferred from other variables that have a time component, such as the European 3-month inter-bank borrowing rate. As an example, I analyze the effect of the financial crisis on this variable below; some of the other macroeconomic variables show a similar pattern.
The European 3-month inter-bank borrowing rate is a proxy for the overall state of the economy. Intuitively, and as the feature importances discussed below indicate, this interest rate is an important predictor. For example, the graph below indicates that there were more yeses when the European 3-month borrowing rates were lower and more noes when they were higher. This appears counterintuitive, yet when we consider the effect of the financial crisis and the time cutoffs of our data collection period, a clear story emerges, as I discuss next.
The interest rates were high but falling at the onset of the crisis, which approximately corresponds to the start of the data collection period, and fewer people were agreeing to the deposit because of the crisis. When the economy emerged out of the crisis, the interest rates were lower, but people were feeling better about investing, and there was more pressure on the bank employees/marketers to sell the long-term deposits. The line graph of European borrowing rates found here corroborates this narrative: notice the plunge in interest rates at the onset of the crisis. If the timestamp had been available, I could train the model only on data points that were not collected during the 2008 financial crisis to build a stronger model.
I could also leave out the time-dependent variables such as the European 3-month inter-bank borrowing rate. This hurts the predictive power of the model and there are already very few strongly predictive variables. While I use the models on the full feature set for this blog, I have tried the variation without time-dependent variables to be confident in my results. Essentially, the feature importances shift towards the demographic variables identified as key in the analysis below. The AUC-ROC score and accuracy on the positive class are about 5% lower. Once again, an argument can be made for using a model with or without time-dependent variables. If I were to get access to the timestamp data (e.g., as a data scientist working for the bank), I would simply separate out the time period corresponding to the financial crisis and train the model on the rest of the data.
In the data, only about 11% of people contacted end up subscribing, making this classification problem highly imbalanced. I've tried several approaches to address the imbalance issue: using the data as is, rebalancing using the SMOTE algorithm, and simple rebalancing based on sampling the minority class at a higher rate. In addition, I've tried XGBoost and random forest classification in R. Finally, accuracy cannot be used as the metric, and I chose the AUC-ROC metric as I will discuss in more detail.
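The simple rebalancing idea can be sketched in a few lines. The project itself was done in R, so the Python below is only an illustration, and it assumes that "sampling the minority class at a higher rate" means drawing minority rows with replacement until the two classes are the same size. The function name `oversample_minority`, the `y` column name, and the toy rows are hypothetical.

```python
import random

def oversample_minority(rows, label_key="y", minority="yes", seed=0):
    """Return a copy of rows in which minority rows are duplicated at random
    (sampling with replacement) until both classes have the same size."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    # Nothing to do if one class is missing or the data is already balanced.
    if not minority_rows or len(minority_rows) >= len(majority_rows):
        return list(rows)
    extra = rng.choices(minority_rows, k=len(majority_rows) - len(minority_rows))
    balanced = majority_rows + minority_rows + extra
    rng.shuffle(balanced)
    return balanced

# Toy data with roughly the 11% / 89% split described above (not real records).
toy = [{"y": "yes"} for _ in range(11)] + [{"y": "no"} for _ in range(89)]
balanced = oversample_minority(toy)
yes_count = sum(r["y"] == "yes" for r in balanced)
no_count = len(balanced) - yes_count
print(yes_count, no_count)
```

Rebalancing of this kind should be applied to the training split only, so that the test data keeps the original class proportions.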
Bank Marketing Campaign Metrics
Accuracy is not an acceptable metric for this problem: Classifying all customers as 'No' customers will achieve 89% accuracy without providing any help identifying the 'Yes' customers more precisely. Since the positive class is the 'Yes' class, false positives would amount to predicting that the person will say yes when they will say no. I would like to avoid these noes to minimize inconvenience to our customers, the resulting reputational damage to the bank, and time lost on contacting the no customers.
However, I would tolerate some false positives to get more yeses and would not maximize precision, which is the ratio of true positives to true positives plus false positives, per se. False negatives occur when one predicts that the customer will say no when they would in fact say yes. The cost of this prediction is not getting a client, something one would wish to avoid in a sales situation. The ratio of true positives to true positives plus false negatives is known as recall, and it is of greater importance for this problem.
Nonetheless, I would like to strike a balance between precision and recall, using the AUC-ROC metric. This metric balances the goals of keeping both the false positive rate and the false negative rate low and, given the other tools used to solve this problem, led to the highest recall that could be achieved.
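To make these metrics concrete, here is a minimal sketch on toy labels with the same 11% yes rate as the data. The helper names and the toy vectors are illustrative assumptions, not project code. It shows why a classifier that predicts 'No' for everyone scores 89% accuracy while finding no yes customers at all.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes with 1 = 'yes' (positive class) and 0 = 'no'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted yes, said no
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted no, said yes
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # By convention, report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Toy labels: 11 yes customers out of 100, and a model that always says no.
y_true = [1] * 11 + [0] * 89
all_no = [0] * 100
baseline = metrics(y_true, all_no)
print(baseline)
```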
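One way to read the ROC-AUC score is as the probability that a randomly chosen yes customer receives a higher model score than a randomly chosen no customer. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy scores are assumptions for illustration, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of (yes, no) pairs in which the yes customer
    has the higher score; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one yes and one no")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy scores, not model output from the project.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7]
auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric stays informative even when 89% of the labels are no.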
Training, Final Model Selection, and Feature Importances
The final selection of rebalancing, model, and metric that I made is simple rebalancing, a random forest model, and the ROC-AUC metric. XGBoost was particularly prone to overfitting on this data, SMOTE seemed to introduce too much extra noise, and other metrics (such as maximizing recall directly) did not work as well as maximizing ROC-AUC. I addressed the overfitting issue by ensuring that the nodes have a reasonable minimum number of observations (in this case, at least 40 observations in each final node).
The final model achieved an ROC-AUC score of .775 and identified a group of clients particularly likely to respond positively (3 of 8 clients identified would say yes). The group that can be reached by following the recommendations of this ML model corresponds to 63% of all the people who would say yes. The concern is, of course, that this would not be enough for the bank. I addressed this concern by lowering the classification threshold to .3 instead of the default .5, allowing 80% of the yes customers to be reached at the cost of contacting somewhat more noes.
The top three predictors were macroeconomic: the European 3-month borrowing rate (the most important feature), the number of workers employed in the economy, and the employment variation rate. The next two were the number of days that passed since the person was last contacted during this campaign and whether the month was May. Finally, the person's age, the outcome of the previous campaign for that customer, and the number of contacts performed for this campaign and for this client came next. Among the other variables are whether the person was contacted via a landline or cell (a proxy for wealth at the time of data collection?), the person's job type, and whether they have defaulted on a loan.
It is not uncommon for the most important variables to be general macroeconomic indicators, as the state of the economy is highly correlated with a person's willingness to invest their money for a longer period. The importance of the month of May, however, could possibly be attributable to the financial crisis and its aftereffects, and it would be helpful to have the timestamp to tease this apart. The next set of variables consists of 'historic' variables for the given individual, describing how that person responded to a previous campaign, how long it has been since they were last contacted, etc. Finally come the variables that characterize a given client and could give some intuition as to which clients are more likely to say yes based on their demographic data. In the next section, I will address the business value of these models.
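The effect of lowering the classification threshold can be illustrated with a small sketch. The probabilities below are made-up toy values, not output of the project's random forest, and `recall_precision_at` is a hypothetical helper. The point is only the direction of the trade-off: a lower threshold reaches more yes customers but flags more no customers as well.

```python
def recall_precision_at(y_true, probs, threshold):
    """Classify as yes when the predicted probability is above the threshold,
    then return (recall, precision) for the yes class."""
    pred = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Toy data: 5 yes customers followed by 7 no customers.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.8, 0.6, 0.55, 0.4, 0.2, 0.7, 0.35, 0.32, 0.1, 0.1, 0.05, 0.45]

recall_50, precision_50 = recall_precision_at(y_true, probs, 0.5)
recall_30, precision_30 = recall_precision_at(y_true, probs, 0.3)
print(recall_50, precision_50)
print(recall_30, precision_30)
```

On this toy data, recall rises when the threshold drops from .5 to .3 while precision falls, which mirrors the trade-off described above.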
Business Value and Related Questions
Suppose the bank obtains 100,000 records of potential customers and would like to determine which of these people to contact. Assuming customers likely to say yes are uniformly distributed within this data, the following table summarizes three possible approaches of contacting customers along with their corresponding costs.
Note that in both the ML and No ML cases, the goal is to reach 63% of all the yeses in the data. Once this target is reached, the bank's agents/telemarketers stop calling the potential customers. Since the ML strategy gives the bank information helpful for reaching the right customers, the bank can save the money and avoid the intangible costs that would otherwise be spent on reaching the noes unnecessarily. The lower bound calculations assume a rate of $10.00 per hour (converted from euros) for telemarketers' time, and the upper bound calculations assume $20.00 per hour for bank employees' time.
Hourly Salaries and Cost Calculations
The actual hourly salaries of each of these worker groups are a little lower, $8.08 and $16.00, respectively, but I'm assuming workers need some time between the yes/no calls (perhaps for non-responding customers or data lookup/entry) and base the rates off time spent on call. A natural concern is that 63% is not good enough to meet the bank's objectives. In that case, by lowering the classification threshold for a yes to .30, 80% of all the yeses can be reached, albeit at a higher cost.
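The arithmetic behind this kind of comparison can be sketched as follows. The inputs taken from the text are the 100,000 records, the roughly 11% yes rate, the 3-in-8 hit rate of the model at threshold .5, the 63% target, and the $10 and $20 hourly rates. The call length is not stated in this write-up, so `ASSUMED_MINUTES_PER_CALL` is a placeholder assumption, and the resulting dollar amounts are illustrative rather than the author's figures. The function names are hypothetical.

```python
import math
from fractions import Fraction

RECORDS = 100_000
BASE_YES_RATE = Fraction(11, 100)   # about 11% of contacted people subscribe
MODEL_PRECISION = Fraction(3, 8)    # 3 of 8 clients flagged by the model say yes
TARGET_SHARE = Fraction(63, 100)    # share of all yeses the bank wants to reach

# ASSUMPTION: placeholder call length, not a number from the project.
ASSUMED_MINUTES_PER_CALL = 5

def calls_to_reach(target_yeses, yes_rate_among_called):
    """Calls needed to collect target_yeses when a fixed share of calls succeed."""
    return math.ceil(target_yeses / yes_rate_among_called)

def campaign_cost(calls, hourly_rate, minutes_per_call=ASSUMED_MINUTES_PER_CALL):
    """Labor cost of the calls in dollars."""
    return calls * minutes_per_call / 60 * hourly_rate

target_yeses = int(RECORDS * BASE_YES_RATE * TARGET_SHARE)
naive_calls = calls_to_reach(target_yeses, BASE_YES_RATE)   # call everyone in turn
ml_calls = calls_to_reach(target_yeses, MODEL_PRECISION)    # call only flagged clients

for rate in (10.0, 20.0):
    saved = campaign_cost(naive_calls, rate) - campaign_cost(ml_calls, rate)
    print(f"${rate:.2f}/hour: about ${saved:,.0f} saved under the assumed call length")
```

Under these inputs, reaching the 6,930 target yeses takes 63,000 calls with the naive approach and 18,480 calls when following the model, so the savings scale with however long a call actually takes.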
After reaching the likely responders, I would suggest that the bank use the extra time it saves through the use of one of these strategies to target a different product to the customers unlikely to respond with a yes. It could also be the case that the bank determines that long-term deposits are the most profitable product that it can offer its customers. In that case, the bank could decide to simply call all of its potential customers and accept the higher costs. In the end, this is as far as machine learning can take us, and the bank would need to test each of the three strategies in production before deploying the best one on all of its potential clients.
R Shiny App and Future Steps
R Shiny App
I developed an R Shiny app to help bank employees determine if they should offer a long-term deposit to a customer. The context is that an employee may have an in-bank meeting or a telephone call with a customer regarding a different issue, but they could enter the customer's information into the app to determine if they should also pitch the long-term security deposit. The app provides the probability that the customer will say yes, then suggests that the agent offer the product if the probability of the customer accepting is above .5 for a conservative agent, or above .3 for an agent willing to take a bigger risk.
I believe that had the data been collected around 2022, there would be more features one could use to build a much stronger predictive model. For example, the bank could perform NLP analysis to dissect agent/client interactions to determine which of the agent's actions lead to higher customer conversion rates. Apart from this example, in the days of expanding data collection, there are almost certainly other features one could obtain to build an even stronger model, and the bank should consult experts in this regard.
As briefly mentioned above, it is imperative to test machine learning models before using them in production and to carefully monitor for data/model drift once the models are deployed. While a conventional recommendation would be to run A/B tests, this may be challenging since the time frames of the long-term deposits are months or even years. In addition, the bank may lack the infrastructure to conduct A/B tests. Alternatively, the bank can take the no-ML approach as a default for the majority of its customers and test each ML strategy on a customer sample. It can also conduct retrospective analyses using its historic data to see if its profits improve with the recommendations of an ML model. While this approach does not solve the time frame challenge, it does address the lack of infrastructure. Once the bank stakeholders and analysts determine the best model to use, they would deploy it on more of the bank's customers.
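The decision rule the app applies to the predicted probability can be sketched as below. This is not the app's code (the app is written in R Shiny); it is a minimal Python illustration of the two thresholds described above, and the names `recommend_offer` and `risk_tolerant` are hypothetical.

```python
CONSERVATIVE_THRESHOLD = 0.5   # threshold for a conservative agent
RISK_TOLERANT_THRESHOLD = 0.3  # threshold for an agent willing to take a bigger risk

def recommend_offer(probability_yes, risk_tolerant=False):
    """Return True when the agent should pitch the long-term deposit."""
    if not 0.0 <= probability_yes <= 1.0:
        raise ValueError("probability_yes must be between 0 and 1")
    threshold = RISK_TOLERANT_THRESHOLD if risk_tolerant else CONSERVATIVE_THRESHOLD
    return probability_yes > threshold

# A customer with a predicted 40% chance of saying yes:
cautious = recommend_offer(0.4)                     # conservative agent
bold = recommend_offer(0.4, risk_tolerant=True)     # risk-tolerant agent
print(cautious, bold)
```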
For more examples of my work in marketing, please see my project involving customer segmentation for marketing.
Salary information for bank employees and call center employees in Portugal:
salaryexplorer.com and https://www.erieri.com/salary/job/call-center-agent/Portugal