NYC Data Science Academy| Blog

Lowering Costs of Bank Marketing Campaigns

Dmitriy Popov-Velasco
Posted on Aug 29, 2022


Introduction

Banks often need to run marketing campaigns to sell a product to potential customers, even when the data is limited. These campaigns cost time and money, and they inconvenience the people contacted if the product offered is a poor match. Inconvenienced customers can develop negative sentiments towards the bank, leading to reputational damage and to profit losses from diminished long-term customer value. In addition, most customers will say no in the end, which makes identifying those who will say yes particularly challenging.

In the parlance of machine learning, this is known as an imbalanced classification problem. In this project, I solve this problem for the case of a Portuguese bank running telephone campaigns to sell long-term deposits. A long-term deposit is a kind of security deposit granting the lender a higher interest rate than a traditional savings account and granting the bank a guarantee of having the lender's funds for a fixed period (such as 12 months). While this problem is challenging, much headway can be made to help the bank reach the right customers, save tens of thousands of dollars in concrete costs, and protect its reputation.

Findings and Methodology Overview

  • The data is imbalanced (more noes than yeses), and I rebalance it to increase model performance on the minority class.  In addition, there are no timestamps and the 2008 financial crisis occurred during the data collection period.  Time-dependent variables are needed for a good ROC-AUC score, but I train models with and without time-dependent variables to establish the robustness of the findings.  Ideally, I would use timestamps to run the model on non-crisis data only.
  • Random forests on the data rebalanced 'manually' performed better than random forest or XGBoost on data rebalanced with SMOTE.  Since accuracy is not an appropriate metric for imbalanced classification, I use the ROC-AUC score.
  • I compare three courses of action for the bank: (i) Using the naive approach and contacting everyone until a desired number of customers subscribe, (ii) Using the recommendations of a random forest with classification threshold set at .5, and (iii) Using the recommendations of a random forest with classification threshold set at .3.  Using employee salary data, I provide approximations of cost savings due to machine learning models.  In practice, the bank would still need to test the models to determine the most profitable course of action.
  • Finally, I deploy an R Shiny app for use by bank employees prior to meetings with customers.  The employees can use the app to offer the long-term deposit to customers who are likely to respond positively.

Machine Learning

Introduction

The data for this project was collected between May 2008 and June 2013 by a Portuguese banking institution and is available through the UCI Machine Learning Repository (see the data source under References). There are 45,211 observations with features covering bank client data (age, job type, marital status, education, housing and loan status), information regarding the last contact (including the leaky variable duration, which is dropped because it is only known after a call has taken place), information pertaining to previous campaigns, and some social and economic context variables (such as the 3-month borrowing rate between banks).

To give some examples of the effect of some of the independent variables on deposits, the mosaic plot below shows that more people with a cellphone said yes to the campaign than expected under the hypothesis of independence (blue) and fewer people said no (red).  Similarly, fewer people with a landline telephone said yes than expected (red) and more people said no (blue).

Similarly, more people said yes in March, April, September, October, and December, and fewer people said yes in May.  I will discuss the subject of timestamps and the potential effect of the financial crisis on the data below.
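The colors in the mosaic plots above come from comparing observed counts with the counts expected if the two variables were independent. The sketch below illustrates that comparison with Pearson residuals. It is written in Python purely for illustration (the project itself was done in R), the function name is hypothetical, and the counts are made up rather than taken from the bank data.

```python
def pearson_residuals(table):
    """Return Pearson residuals, (observed - expected) / sqrt(expected),
    for a two-way table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            res_row.append((observed - expected) / expected ** 0.5)
        residuals.append(res_row)
    return residuals


# Hypothetical counts, NOT the bank data.
# Rows: contact type; columns: (said no, said yes).
counts = [
    [850, 150],  # cellphone
    [950, 50],   # landline
]
residuals = pearson_residuals(counts)

for name, (res_no, res_yes) in zip(["cellphone", "landline"], residuals):
    print(f"{name}: no={res_no:+.2f}, yes={res_yes:+.2f}")
```

A positive residual means more observations than expected under independence (shaded blue in the plots described above), and a negative residual means fewer (shaded red).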

Missing Timestamps and Macroeconomic Indicators

A problem with the data is that the financial crisis occurred during the data collection period. Separating the crisis period from the rest of the data would require timestamps, and they were not available. In addition, the timestamps cannot be unambiguously inferred from other variables that have a time component, such as the European 3-month inter-bank borrowing rate. As an example, I analyze the effect of the financial crisis on this variable below; some of the other macroeconomic variables show a similar pattern.

The European 3-month inter-bank borrowing rate is a proxy for the overall state of the economy. Intuitively, and as the feature importances discussed below indicate, this interest rate is an important predictor. For example, the graph below indicates that there were more yeses when the European 3-month borrowing rate was lower and more noes when it was higher. This appears counterintuitive, yet when we consider the effect of the financial crisis and the time cutoffs of the data collection period, there is a clear story, as I discuss next.

Interest rates were high but falling at the onset of the crisis, which approximately corresponds to the start of the data collection period, and fewer people were agreeing to the deposit because of the crisis. When the economy emerged from the crisis, interest rates were lower, but people were feeling better about investing, and there was more pressure on the bank employees/marketers to sell the long-term deposits. A line graph of European borrowing rates over this period corroborates this narrative: it shows a plunge in interest rates at the onset of the crisis. If the timestamps had been available, I could have trained the model only on data points that were not collected during the 2008 financial crisis to build a stronger model.

I could also leave out the time-dependent variables such as the European 3-month inter-bank borrowing rate. This hurts the predictive power of the model, and there are already very few strongly predictive variables. While I use the models on the full feature set for this blog, I have tried the variation without time-dependent variables to be confident in my results. Essentially, the feature importances shift towards the demographic variables identified as key in the analysis below. The ROC-AUC score and the accuracy on the positive class are about 5% lower. Once again, an argument can be made for using a model with or without time-dependent variables. If I were to get access to the timestamp data (e.g., as a data scientist working for the bank), I would simply separate out the time period corresponding to the financial crisis and train the model on the rest of the data.

Class Imbalance

In the data, only about 11% of people contacted end up subscribing, making this classification problem highly imbalanced. I tried several approaches to address the imbalance: using the data as is, rebalancing with the SMOTE algorithm, and simple rebalancing based on sampling the minority class at a higher rate. In addition, I tried both XGBoost and random forest classification in R. Finally, accuracy cannot be used as the metric, and I chose the ROC-AUC metric, as I discuss in more detail below.
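Simple rebalancing can be sketched as resampling the minority class with replacement until the two classes are the same size. The example below is a minimal Python illustration on synthetic labels. The function name, the 1:1 target ratio, and the seed are assumptions, since the exact sampling scheme used in R is not spelled out here.

```python
import random


def oversample_minority(rows, label_of, minority_label, seed=0):
    """Duplicate randomly chosen minority rows (sampling with replacement)
    until both classes have the same number of rows."""
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == minority_label]
    majority = [r for r in rows if label_of(r) != minority_label]
    n_extra = len(majority) - len(minority)
    extra = rng.choices(minority, k=n_extra) if n_extra > 0 else []
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Synthetic rows with the roughly 11% "yes" rate mentioned above
rows = [{"id": i, "y": "yes" if i < 11 else "no"} for i in range(100)]
balanced = oversample_minority(rows, lambda r: r["y"], "yes")

n_yes = sum(r["y"] == "yes" for r in balanced)
n_no = sum(r["y"] == "no" for r in balanced)
print(n_yes, n_no)
```

Rebalancing of this kind is normally applied to the training split only, so that the held-out data keeps the true class ratio.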

Bank Marketing Campaign Metrics

Accuracy is not an acceptable metric for this problem: Classifying all customers as 'No' customers will achieve 89% accuracy without providing any help identifying the 'Yes' customers more precisely. Since the positive class is the 'Yes' class, false positives would amount to predicting that the person will say yes when they will say no. I would like to avoid these noes to minimize inconvenience to our customers, the resulting reputational damage to the bank, and time lost on contacting the no customers.

However, I would tolerate some false positives to get more yeses, so I would not maximize precision (the ratio of true positives to true positives plus false positives) per se. False negatives occur when one predicts that the customer will say no when they would say yes. The cost of this error is not getting a client, something one would wish to avoid in a sales situation. The ratio of true positives to true positives plus false negatives is known as recall, and it is of greater importance for this problem.
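To make these definitions concrete, here is a small Python sketch that computes accuracy, precision, and recall from confusion-matrix counts. The counts are illustrative only: they describe 1,000 imaginary customers with an 11% yes rate and are not results from the project.

```python
def precision(tp, fp):
    """Share of predicted yeses that are actual yeses."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual yeses that the model finds."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Baseline that predicts "no" for everyone: 890 true negatives,
# 110 false negatives, and no positives at all.
baseline_accuracy = accuracy(tp=0, tn=890, fp=0, fn=110)
baseline_recall = recall(tp=0, fn=110)

# Hypothetical model that flags 184 customers, 69 of whom say yes.
model_precision = precision(tp=69, fp=115)
model_recall = recall(tp=69, fn=41)
model_accuracy = accuracy(tp=69, tn=775, fp=115, fn=41)

print(f"baseline: accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
print(f"model:    accuracy={model_accuracy:.3f}, "
      f"precision={model_precision:.3f}, recall={model_recall:.3f}")
```

In this toy example the do-nothing baseline has the higher accuracy even though it finds no yes customers, which is why accuracy is set aside here.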

Nonetheless, I would like to strike a balance between precision and recall, so I use the ROC-AUC metric. This metric balances the goals of keeping both the false positive rate and the false negative rate low and, given the other tools used to solve this problem, led to the highest recall that could be achieved.
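One way to see what the ROC-AUC score measures is its pairwise interpretation: it is the probability that a randomly chosen yes customer receives a higher score than a randomly chosen no customer. The Python sketch below computes it that way on toy data. It is meant as an illustration of the metric, not as the implementation used in the project, and the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels (1 = yes, 0 = no) and model scores, invented for illustration
labels = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.35, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")
```

Because the score depends only on how customers are ranked, it does not change with the classification threshold, which leaves the threshold free to be tuned separately.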

Training, Final Model Selection, and Feature Importances

The final selection of rebalancing method, model, and metric is simple rebalancing, a random forest model, and the ROC-AUC metric. XGBoost was particularly prone to overfitting on this data, SMOTE seemed to introduce too much extra noise, and other metrics (such as maximizing recall directly) did not work as well as maximizing ROC-AUC. I addressed the overfitting issue by requiring a reasonable minimum number of observations in each terminal node (in this case, at least 40).

The final model achieved a ROC-AUC score of .775 and identified a group of clients particularly likely to respond positively (3 of 8 clients identified would say yes). The group that can be reached by following the recommendations of this ML model corresponds to 63% of all the people who would say yes. The concern is, of course, that this would not be enough for the bank. I addressed this concern by lowering the classification threshold to .3 instead of the default .5, allowing 80% of the yes customers to be reached at the cost of contacting somewhat more noes.
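The effect of lowering the classification threshold can be sketched on toy data: more customers are flagged, so recall rises while precision falls. The labels and probabilities below are invented for illustration and do not come from the model, and the function name is hypothetical.

```python
def recall_and_precision_at(threshold, labels, probs):
    """Flag every customer whose predicted probability is at least
    `threshold`, then return (recall, precision) for the flagged group."""
    flagged = [y for y, p in zip(labels, probs) if p >= threshold]
    true_positives = sum(flagged)
    recall = true_positives / sum(labels)
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Toy data: 1 = would say yes, 0 = would say no
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.55, 0.4, 0.2, 0.6, 0.45, 0.35, 0.25, 0.2, 0.1, 0.05]

recall_50, precision_50 = recall_and_precision_at(0.5, labels, probs)
recall_30, precision_30 = recall_and_precision_at(0.3, labels, probs)

print(f"threshold .5: recall={recall_50:.2f}, precision={precision_50:.2f}")
print(f"threshold .3: recall={recall_30:.2f}, precision={precision_30:.2f}")
```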

The top three predictors were macroeconomic: the European 3-month borrowing rate (the most important feature), the number of workers employed in the economy, and the employment variation rate. The next two were the number of days that passed since the person was last contacted during this campaign and whether the month was May. Next came the person's age, the outcome of the previous campaign for that customer, and the number of contacts performed during this campaign for this client. Other variables include whether the person was contacted via a landline or cellphone (a proxy for wealth at the time of data collection?), the person's job type, and whether they have defaulted on a loan.

It is not uncommon for the most important variables to be general macroeconomic indicators, as the state of the economy is highly correlated with a person's willingness to invest their money for a longer period. The importance of the month of May, however, could possibly be attributable to the financial crisis and its aftereffects, and it would be helpful to have the timestamps to tease this apart. The next set of variables consists of 'historic' variables for the given individual, describing how that person responded to a previous campaign, how long it has been since they were last contacted, etc. Finally come the variables that characterize a given client and could give some intuition as to which clients are more likely to say yes based on their demographic data. In the next section, I will address the business value of these models.

Business Value and Related Questions

Suppose the bank obtains 100,000 records of potential customers and would like to determine which of these people to contact. Assuming customers likely to say yes are uniformly distributed within this data, the following table summarizes three possible approaches of contacting customers along with their corresponding costs.

Note that in both the ML and No ML cases, the goal is to reach 63% of all the yeses in the data. Once this target is reached, the bank's agents/telemarketers stop calling potential customers. Since the ML strategy gives the bank information helpful for reaching the right customers, the bank can save the money and intangible costs that would otherwise be spent on reaching the noes unnecessarily. The lower bound calculations assume a rate of $10.00 per hour (converted from euros) for telemarketers' time, and the upper bound calculations assume $20.00 per hour for bank employees' time.

Hourly Salaries and Cost Calculations

The actual hourly salaries of each of these worker groups are a little lower, $8.08 and $16.00, respectively, but I'm assuming workers need some time between the yes/no calls (perhaps for non-responding customers or data lookup/entry) and base the rates on time spent on calls. A natural concern is that 63% is not good enough to meet the bank's objectives. In that case, by lowering the classification threshold for a yes to .30, 80% of all the yeses can be reached, albeit at a higher cost.
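The arithmetic behind such a comparison can be sketched as follows, using figures from this post: 100,000 records, an 11% yes rate, a target of 63% of all yeses, and a model for which 3 of 8 flagged customers say yes. The time spent per call is not stated here, so `MINUTES_PER_CALL` is a purely hypothetical placeholder, and the resulting dollar amounts should not be read as the figures from the cost table. The function names are likewise made up for this Python illustration.

```python
def calls_needed(n_records, yes_rate, target_share, hit_rate):
    """Calls required to reach `target_share` of all yes customers when a
    fraction `hit_rate` of the people called say yes."""
    yeses_to_reach = n_records * yes_rate * target_share
    return round(yeses_to_reach / hit_rate)


def campaign_cost(n_calls, minutes_per_call, hourly_rate):
    """Labor cost of making `n_calls` calls at the given hourly rate."""
    return n_calls * minutes_per_call / 60 * hourly_rate


N_RECORDS = 100_000
YES_RATE = 0.11        # about 11% of contacted people subscribe
TARGET_SHARE = 0.63    # stop once 63% of all yeses are reached
MINUTES_PER_CALL = 5   # HYPOTHETICAL: not given in the post

# No ML: yes customers are spread uniformly, so the hit rate is the base rate.
naive_calls = calls_needed(N_RECORDS, YES_RATE, TARGET_SHARE, hit_rate=YES_RATE)
# ML at threshold .5: 3 of every 8 flagged customers say yes.
ml_calls = calls_needed(N_RECORDS, YES_RATE, TARGET_SHARE, hit_rate=3 / 8)

print(naive_calls, ml_calls)  # 63000 18480

calls_saved = naive_calls - ml_calls
for hourly_rate in (10.00, 20.00):  # lower and upper bound rates from the text
    saving = campaign_cost(calls_saved, MINUTES_PER_CALL, hourly_rate)
    print(f"${hourly_rate:.2f}/hour: saving of about ${saving:,.0f}")
```

The sketch ignores unanswered calls and any cost other than staff time, so it only shows how the number of calls, and therefore the cost, scales with the model's hit rate.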

After reaching the likely responders, I would suggest that the bank use the extra time it saves through one of these strategies to target a different product to the customers unlikely to respond with a yes. It could also be the case that the bank determines that long-term deposits are the most profitable product that it can offer its customers. In such a case, the bank could decide to simply call all of its potential customers and accept the higher costs. In the end, this is as far as machine learning can take us, and the bank would need to test each of the three strategies in production before deploying the best one on all of its potential clients.

R Shiny App and Future Steps

R Shiny App

I developed an R Shiny app to help bank employees determine if they should offer a long-term deposit to a customer. The context is that an employee may have an in-bank meeting or a telephone call with a customer regarding a different issue, but they could enter the customer's information into the app to determine if they should also pitch the long-term security deposit. The app provides the probability that the customer will say yes, then suggests that the agent offer the product if that probability is above .5 (if the agent is conservative) or above .3 (if the agent is willing to take a bigger risk).

Future Steps

I believe that had the data been collected around 2022, there would be more features one could use to build a much stronger predictive model. For example, the bank could perform NLP analysis to dissect agent/client interactions and determine which of the agent's actions lead to higher customer conversion rates. Apart from this example, in the days of expanding data collection, there are almost certainly other features one could obtain to build an even stronger model, and the bank should consult experts in this regard.

As briefly mentioned above, it is imperative to test machine learning models before using them in production and to carefully monitor for data/model drift once the models are deployed. While a conventional recommendation would be to run A/B tests, this may be challenging since the time frames of the long-term deposits are months or even years. In addition, the bank may lack the infrastructure to conduct A/B tests. Alternatively, the bank can take the no-ML approach as a default for the majority of its customers and test each ML strategy on a customer sample. It can conduct retrospective analyses using its historic data to see if its profits improve with the recommendations of an ML model. While this approach does not solve the time frame challenge, it addresses the lack of infrastructure. Once the bank stakeholders and analysts determine the best model to use, they would deploy it on more of the bank's customers.

For more examples of my work in marketing, please see my project involving customer segmentation for marketing.

References

Image source:
https://commons.wikimedia.org/wiki/File:Zarco_%26_Bank_of_Portugal_(Funchal)_(38044349796).jpg
Data source:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
Portugal bank employees and call center employees salary information:
salaryexplorer.com and https://www.erieri.com/salary/job/call-center-agent/Portugal

About Author

Dmitriy Popov-Velasco

I'm a certified data scientist with 4+ years of experience using machine learning and data analysis. In addition to taking numerous statistics courses, I taught discussions in hypothesis testing, linear regression, and statistical tests at UC Davis for...