Analyzing Data to Predict Credit Card Churner
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Credit card churners mean lost money for the credit card company. In this analysis, I used a data set of existing and past customers to find what the churners had in common. With that information, I could find the group of people within the existing client base that is most likely to churn their credit card.
This dataset consists of customer information, with a total of 21 variables and 10,127 observations. On the dataset's kaggle page, churning is defined simply as cancelling or attriting the credit card. The definitions of the variable names are as follows:
|Clientnum||Num||Client number. Unique identifier for the customer holding the account|
|Attrition_Flag||char||Internal event (customer activity) variable - if the account is closed then 1 else 0|
|Customer_Age||Num||Demographic variable - Customer's Age in Years|
|Gender||Char||Demographic variable - M=Male, F=Female|
|Dependent_count||Num||Demographic variable - Number of dependents|
|Education_Level||Char||Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)|
|Marital_Status||Char||Demographic variable - Married, Single, Unknown|
|Income_Category||Char||Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown)|
|Card_Category||Char||Product Variable - Type of Card (Blue, Silver, Gold, Platinum)|
|Months_on_book||Num||Months on book (Time of Relationship)|
|Total_Relationship_Count||Num||Total no. of products held by the customer|
|Months_Inactive_12_mon||Num||No. of months inactive in the last 12 months|
|Contacts_Count_12_mon||Num||No. of Contacts in the last 12 months|
|Credit_Limit||Num||Credit Limit on the Credit Card|
|Total_Revolving_Bal||Num||Total Revolving Balance on the Credit Card|
|Avg_Open_To_Buy||Num||Open to Buy Credit Line (Average of last 12 months)|
|Total_Amt_Chng_Q4_Q1||Num||Change in Transaction Amount (Q4 over Q1)|
|Total_Trans_Amt||Num||Total Transaction Amount (Last 12 months)|
|Total_Trans_Ct||Num||Total Transaction Count (Last 12 months)|
|Total_Ct_Chng_Q4_Q1||Num||Change in Transaction Count (Q4 over Q1)|
|Avg_Utilization_Ratio||Num||Average Card Utilization Ratio|
Data Exploration & Cleaning
Before beginning any kind of data exploration or analysis, I changed all the "Unknown" values to be recognized as null values in python. I also removed 2 columns of irrelevant data. After that, I split the data into 2 tables, attrited customers and existing customers using the "Attrition_Flag" variable. The attrited customer table and existing customer table had 1,627 and 8,500 observations, respectively.
I created histograms for the 11 categorical variables and 8 numerical variables in the 2 new tables and compared the peaks and trends.
The data visualizations showed me that the categorical data did not vary between the attrited and existing customers considerably. However, some histograms of the numerical variables followed different trends and peaked at different points between the 2 tables. These variables were considered the dependent variables of becoming an attrited customer. Those variables and their peaks are as follows:
To confirm that the attrited customers would fall at these values, I created scatterplots between each variable.
Results & Conclusion
With confirmation that the peaks showed the points of the attrited customers, I determined that existing customers that fell within the same areas on the scatterplots would also have a high likelihood of attriting. I filtered out the existing customers that fell within a standard deviation of all the peak values. The result was a table of 202 customers that were the most likely to attrit their credit cards.
This table allows the credit card company to use their resources more efficiently. By targeting these 202 card holders, instead of going through the entire list of 8,500 existing customers, the company has a higher chance of contacting customers before they attrit and retain them.
Using this method, a larger or smaller scope of customers can be found if I filter by more or less than the standard deviation. This can increase efficiency even more, by targeting the customers with the highest likelihood of attriting first and contacting more relevant customers as the company deems necessary.
Challenges & Future Work
Originally, I was going to find the credit card churners based on a stricter definition. They would be customers who apply for credit cards to only get the rewards and bonuses. Once they receive and use them, the customer would cancel the credit card. In an effort to spend as little money as possible, the customer would also have a zero or near zero total revolving balance.
A group of credit card churners would cost the company a lot of money. If I could compare the dataset of credit card churners to the other customer dataset, I could predict who in the existing customer group is a churner or even who is a churner among the credit card applicants. Unfortunately, the dataset does not provide sufficient information to find this group. I would need the total revolving balance for the entire relationship with the bank and all data on rewards and bonuses.
Moving forward, I would gather the required data and do a more focused credit card churner prediction.