US Consumer Finance Complaints Classification - NLP

Oluwole Alowolodu
Posted on Jun 7, 2019

Every week, the Consumer Financial Protection Bureau sends thousands of consumers' complaints about financial products and services to companies for response. The goal of this NLP project is to build a model that can accurately classify those complaints into the product category they belong to, using only the text of the complaint. The data is sourced from Kaggle.

Exploratory Data Analysis

The dataset contains 18 features and 555,957 observations. Of the 18 features, only 'product' and 'consumer_complaint_narrative' are explored and used for modeling.

Missingness -

The complaint column has 489,151 missing values, while the product column has none. Below is the heatmap that shows the missingness in the dataframe.

The black regions have values and the light-brown regions mark missing values. I removed all rows with missing values, since free-text complaints cannot be meaningfully imputed.
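A minimal sketch of that missingness check and row removal, assuming the data sits in a pandas DataFrame. The toy rows below are invented for illustration; only the two column names come from the dataset, and restricting the drop to the narrative column is an assumption.

```python
import pandas as pd

# Toy stand-in for the Kaggle data; the rows are invented for illustration.
df = pd.DataFrame({
    "product": ["Mortgage", "Debt collection", "Credit reporting", "Mortgage"],
    "consumer_complaint_narrative": [
        "My escrow payment was misapplied.",
        None,
        "An account I never opened is on my report.",
        None,
    ],
})

# Count missing values per column (the heatmap shows the same thing visually).
missing_counts = df[["product", "consumer_complaint_narrative"]].isna().sum()

# Assumption: keep only rows that actually have a complaint narrative.
complaints = df.dropna(subset=["consumer_complaint_narrative"]).reset_index(drop=True)

print(missing_counts)
print(len(complaints), "rows remain")
```

Applied to the real data in this way, 555,957 - 489,151 = 66,806 rows would remain.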

Visualization of data-

The graph below shows how the complaints in the remaining dataset are distributed across product categories.

Word cloud analysis of the top 3 products/complaint category:

Before carrying out the word cloud analysis, further preprocessing was needed to capture the most important words and phrases. The process included:

Applying BeautifulSoup to remove all HTML tags, using a regular expression to remove non-alphanumeric characters, tokenizing the text into its component words, and removing stopwords using NLTK.
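The steps above can be sketched roughly as follows. To keep the example dependency-free, it swaps in stand-ins for the tools named in the post: a crude regular expression instead of BeautifulSoup, a whitespace split instead of a tokenizer, and a tiny hand-picked stopword set instead of NLTK's English list. The function name preprocess is hypothetical.

```python
import html
import re

# Small illustrative stopword set; NLTK's English list is much larger.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "to", "of", "on", "in", "my", "i", "it", "for"}

TAG_RE = re.compile(r"<[^>]+>")            # crude stand-in for BeautifulSoup tag removal
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # anything that is not a letter, digit, or whitespace


def preprocess(text):
    """Strip HTML, drop non-alphanumeric characters, tokenize, and remove stopwords."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text).lower()
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "<p>My mortgage payment was <b>misapplied</b> &amp; I was charged a late fee!</p>"
print(preprocess(sample))
```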

Word cloud image of 'credit reporting' category:

credit reporting word cloud - 1

The image shows that some confidential consumer information has been masked as 'xxxx' or 'xx'. I ran a second analysis to visualize the most frequent words in the complaints with these masked tokens removed.
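One way to filter out the masked tokens before building the second word cloud is sketched below. The post does not say how this was done, so the pattern and the helper name drop_masked are assumptions.

```python
import re

# Matches tokens made up only of two or more x's, e.g. "xx" or "xxxx" (case-insensitive).
MASK_RE = re.compile(r"x{2,}", re.IGNORECASE)


def drop_masked(tokens):
    """Remove redaction placeholders such as 'xxxx' from a token list."""
    return [tok for tok in tokens if not MASK_RE.fullmatch(tok)]


tokens = ["xxxx", "reported", "debt", "xx", "xxxx", "2015", "bureau"]
print(drop_masked(tokens))
```

Using fullmatch keeps ordinary words that merely contain an x, such as "xerox".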

credit reporting word cloud - 2

Word cloud analysis of Mortgage complaints:

Mortgage wordcloud

Word cloud analysis of Debt collection complaints:

Debt collection wordcloud

Word Embedding and Modeling

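The text was converted to feature vectors using Term Frequency-Inverse Document Frequency (TF-IDF), a vectorizing technique that weights each word by how often it appears in a document, discounted by how many documents contain it, so that rarer words receive proportionally more weight.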
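To make the weighting concrete, here is a small from-scratch sketch of the textbook formula, tf * ln(N / df), on three invented token lists. It is only an illustration of the idea, not the code used in the project; the function name tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document, using tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["credit", "report", "error"],
    ["credit", "card", "fee"],
    ["credit", "mortgage", "escrow"],
]
for doc_weights in tfidf(docs):
    print(doc_weights)
```

Here "credit" appears in every document and gets a weight of zero, while words unique to one document get the highest weight. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of the idf term and normalize each row by default, so their numbers will differ from this sketch.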

While several machine learning models can be trained to identify the classes in the data, I opted for logistic regression to build the multiclass classifier needed for this project.

The two hyperparameters I tuned were solver and multi_class; the others were left at their defaults.
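A minimal sketch of such a classifier, assuming scikit-learn (the post does not name the library, but solver and multi_class match its LogisticRegression parameters). The toy complaints and the solver value "lbfgs" are assumptions for illustration, not the tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy complaints; the real model was trained on the CFPB narratives.
texts = [
    "incorrect account on my credit report",
    "credit bureau will not fix error on report",
    "mortgage escrow payment was misapplied",
    "loan servicer lost my mortgage modification paperwork",
    "debt collector keeps calling about a debt i do not owe",
    "collection agency threatened me over an old debt",
]
labels = [
    "Credit reporting", "Credit reporting",
    "Mortgage", "Mortgage",
    "Debt collection", "Debt collection",
]

# solver="lbfgs" is an assumed value. multi_class is deliberately left unset:
# recent scikit-learn releases deprecate it and fit a multinomial model
# with this solver when there are three or more classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="lbfgs"))
model.fit(texts, labels)

# Classify a new, unseen complaint.
new_complaint = "there is an error on my credit report"
prediction = model.predict([new_complaint])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in one pipeline means raw text can be passed straight to predict, which is convenient for one-off predictions on new complaints.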

Result

Model performance was measured on a 20% test set, where it achieved 85% accuracy.
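The evaluation could look roughly like the sketch below, again assuming scikit-learn. The stratified split, the random seed, the toy data, and the helper name holdout_accuracy are all assumptions; the post only states the 20% test size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def holdout_accuracy(texts, labels, seed=0):
    """Train on 80% of the data and return accuracy on the remaining 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Invented toy data: 10 complaints per class, so the test set holds 6 of the 30 rows.
texts = (
    [f"error on my credit report number {i}" for i in range(10)]
    + [f"mortgage escrow payment problem number {i}" for i in range(10)]
    + [f"debt collector called me again number {i}" for i in range(10)]
)
labels = ["Credit reporting"] * 10 + ["Mortgage"] * 10 + ["Debt collection"] * 10

accuracy = holdout_accuracy(texts, labels)
print(accuracy)
```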

I went further to measure the performance of the model on each product category, inspecting the precision, recall and f1-score.

The model's precision and recall are decent for all product categories apart from 'Payday loan'. This can be attributed to the very small number of training samples available for that category.
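To show what these per-category numbers mean, the sketch below computes precision, recall, F1, and support from two label lists in plain Python. The labels are invented so that the rare class has a single example that the model misses; the function name per_class_metrics is hypothetical.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} computed from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an example of the true class
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1, actual)
    return metrics


# Invented labels for illustration only.
y_true = ["Mortgage", "Mortgage", "Mortgage", "Debt collection", "Debt collection", "Payday loan"]
y_pred = ["Mortgage", "Mortgage", "Debt collection", "Debt collection", "Debt collection", "Debt collection"]

for label, (p, r, f1, support) in per_class_metrics(y_true, y_pred).items():
    print(f"{label:16s} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

With only one 'Payday loan' example, a single miss drives that category's recall to zero, which mirrors the small-sample effect described above. scikit-learn's classification_report prints the same four columns.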

I had the model predict the category of a random complaint copied from the web, to demonstrate how the algorithm can identify which product category a complaint should go to:

Another prediction by the model:

Future work will include applying word embedding techniques such as word2vec for feature vectorization and building the model on a neural network, which may be more sensitive to product categories with small sample sizes.

 

Github link

About Author

Oluwole Alowolodu

Recent graduate of Biotechnology - MS. Data science fellow and AI enthusiast.