Complaints Classification - NLP US Consumer Finance

Oluwole Alowolodu

Posted on Jun 7, 2019

Project GitHub | LinkedIn: Niki Moritz Hao-Wei Matthew Oren

The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Each week, the Consumer Financial Protection Bureau sends thousands of consumer complaints about financial products and services to companies for a response. The goal of this NLP project is to build a model that accurately classifies each complaint into its product category using only the text of the complaint. The data is sourced from Kaggle.

Exploratory Data Analysis

The dataset contains 18 features and 555957 observations. Of the 18 features, only 'product' and 'consumer_complaint_narrative' are explored and used for modeling.

Screen-Shot-2019-06-06-at-6.35.58-PM | Data Science Blog

Missingness -

The complaint column has 489151 missing values, while the product column has none. The heatmap below shows the missingness in the dataframe.

Figure: heatmap of missing values in the dataframe

The black regions mark values that are present and the light-brown regions mark missing ones. I removed all rows with missing values, since no form of imputation can sensibly be applied to free-text complaints.
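A minimal sketch of the missingness check and row removal, assuming the data sits in a pandas DataFrame. The toy rows are invented for illustration, and dropping on the narrative column alone is an assumption; only the two column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the complaints data; the rows are invented for illustration.
df = pd.DataFrame({
    "product": ["Mortgage", "Debt collection", "Credit reporting", "Mortgage"],
    "consumer_complaint_narrative": [
        "My loan servicer lost my payment.",
        None,
        "There is an error on my credit report.",
        None,
    ],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Keep only the rows that actually have a complaint narrative.
clean = df.dropna(subset=["consumer_complaint_narrative"]).reset_index(drop=True)

print(missing_counts)
print(len(clean), "rows remain")
```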

Visualization of data-

The graph below shows how the complaints in the remaining dataset are distributed across product categories.

Word cloud analysis of the top 3 product categories:

Before carrying out the word cloud analysis, further preprocessing was needed to capture the most informative words and phrases. The process includes:

Applying BeautifulSoup to remove all HTML tags, using a regular expression to remove non-alphanumeric characters, tokenizing the text into its component words, and removing stopwords using NLTK.
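The steps above can be sketched with the standard library alone. This is not the original code: the regex tag stripper is a crude stand-in for BeautifulSoup, the short `STOPWORDS` set is a hypothetical stand-in for NLTK's English stopword list, and the sample sentence is invented.

```python
import re
from html import unescape

# Small illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "the", "i", "my", "to", "of", "is", "was", "on", "in", "it"}

TAG_RE = re.compile(r"<[^>]+>")           # crude stand-in for BeautifulSoup tag removal
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]+")  # anything that is not a letter, digit, or whitespace


def preprocess(text):
    """Return the cleaned tokens of one complaint narrative."""
    text = unescape(TAG_RE.sub(" ", text)).lower()  # strip tags, decode entities, lowercase
    text = NON_ALNUM_RE.sub(" ", text)              # drop punctuation and symbols
    tokens = text.split()                           # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


sample = "<p>I disputed an error on my credit report &amp; the bureau ignored it!</p>"
tokens = preprocess(sample)
print(tokens)
```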

Word cloud image of 'credit reporting' category:

Figure: credit reporting word cloud 1

The image reveals that some confidential consumer information has been masked as 'xxxx' or 'xx'. A second analysis was conducted to visualize the most frequent words in the complaints without the masked tokens.

Figure: credit reporting word cloud 2
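A word cloud sizes each word by its frequency, so the effect of dropping the masked tokens can be illustrated with a simple count. The token list below is invented, and the mask pattern (two or more 'x' characters) is an assumption based on the 'xx' and 'xxxx' placeholders seen in the first image.

```python
import re
from collections import Counter

# Hypothetical, already-cleaned tokens from a few complaints.
tokens = ["xxxx", "credit", "report", "xx", "xx", "xxxx",
          "account", "credit", "xxxx", "report", "credit"]

MASK_RE = re.compile(r"x{2,}")  # tokens made only of two or more 'x' characters

with_masks = Counter(tokens)
without_masks = Counter(t for t in tokens if not MASK_RE.fullmatch(t))

print(with_masks.most_common(3))
print(without_masks.most_common(3))
```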

Word cloud analysis of Mortgage complaints:

Figure: Mortgage word cloud

Word cloud analysis of Debt collection complaints:

Figure: Debt collection word cloud

Word Embedding and Modeling

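The text was converted to feature vectors using Term Frequency-Inverse Document Frequency (TF-IDF), a vectorization technique that weights each word by how often it appears in a complaint and scales that weight down for words that appear in many complaints, so rarer words receive proportionally more weight.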
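A small sketch of how rarity affects the weights, assuming scikit-learn's `TfidfVectorizer` (the post does not name the library used). The three toy documents are invented; a word that appears in every document gets the lowest inverse document frequency, and a word that appears in only one gets the highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: 'credit' is in every document, 'error' in only one.
docs = [
    "credit report error",
    "credit report dispute",
    "credit card fee",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Map each term to its learned inverse document frequency.
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

print(X.shape)
for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:8s} {weight:.3f}")
```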

While several machine learning models can be trained to identify the classes in the data, I opted for logistic regression to build the multiclass classifier needed for this project.

Screen-Shot-2019-06-07-at-12.00.20-AM | Data Science Blog

I tuned two hyperparameters, solver and multi_class; the others were left at their defaults.

Result

Model performance was measured on a held-out 20% test set, where it returned 85% accuracy.
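The training and evaluation setup can be sketched as follows, assuming scikit-learn. The complaints are invented, `solver="lbfgs"` is only a placeholder since the tuned value is not stated in the text, and `multi_class` is left out because recent scikit-learn releases deprecate that parameter. The accuracy on such a tiny toy set says nothing about the 85% reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy complaints, five per product category.
complaints = {
    "Credit reporting": [
        "incorrect late payment on my credit report",
        "credit bureau refuses to fix error on report",
        "my credit report shows an account i never opened",
        "disputed inaccurate information on credit report",
        "credit score dropped because of a reporting error",
    ],
    "Debt collection": [
        "debt collector calls me every day",
        "collection agency demands payment for debt i do not owe",
        "collector threatened me over an old debt",
        "received collection letter for a debt already paid",
        "debt collector contacted my employer",
    ],
    "Mortgage": [
        "mortgage servicer misapplied my loan payment",
        "bank denied my mortgage loan modification",
        "escrow error raised my mortgage payment",
        "foreclosure started despite mortgage modification request",
        "mortgage company lost my refinance documents",
    ],
}
texts = [t for narratives in complaints.values() for t in narratives]
labels = [product for product, narratives in complaints.items() for _ in narratives]

# Hold out 20% of the complaints, keeping the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# TF-IDF features feeding a multiclass logistic regression.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"train={len(X_train)} test={len(X_test)} accuracy={accuracy:.2f}")
```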

I went further to measure the performance of the model on each product category, inspecting the precision, recall and f1-score.

Figure: precision, recall and f1-score for each product category

The model's precision and recall are decent for all product categories apart from 'Payday loan'. This can be attributed to the very small number of training samples available for that category.
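A per-category breakdown like the one described above can be produced with scikit-learn's metrics module (an assumption about tooling). The labels and predictions below are invented to show how a category with few samples can end up with low recall; they are not the project's results.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

categories = ["Credit reporting", "Debt collection", "Payday loan"]

# Invented ground truth and predictions; 'Payday loan' has only two samples.
y_true = ["Credit reporting"] * 4 + ["Debt collection"] * 4 + ["Payday loan"] * 2
y_pred = (
    ["Credit reporting"] * 3 + ["Debt collection"]   # one credit complaint misrouted
    + ["Debt collection"] * 4                         # all debt complaints correct
    + ["Debt collection", "Payday loan"]              # half of the payday complaints missed
)

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=categories, zero_division=0
)

print(classification_report(y_true, y_pred, labels=categories, zero_division=0))
```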

I then used the model to predict the category of a random complaint copied from the web, to demonstrate how the algorithm can identify which product category a complaint should be routed to:

Figure: model prediction for a sample complaint

Another prediction by the model:

Figure: model prediction for a second sample complaint
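Predicting the category of an unseen complaint might look like the sketch below, assuming a scikit-learn pipeline. The training sentences and the new complaint are invented, and a real model would be fit on the full cleaned dataset rather than six examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy training data, two complaints per category.
train_texts = [
    "debt collector keeps calling about a debt",
    "collection agency demands payment on old debt",
    "mortgage servicer misapplied my escrow payment",
    "foreclosure notice despite mortgage modification",
    "error on my credit report was never corrected",
    "credit bureau lists an account that is not mine",
]
train_labels = (
    ["Debt collection"] * 2 + ["Mortgage"] * 2 + ["Credit reporting"] * 2
)

clf = LogisticRegression(max_iter=1000)
model = make_pipeline(TfidfVectorizer(), clf)
model.fit(train_texts, train_labels)

# The pipeline applies the same TF-IDF transform to raw text before classifying.
new_complaint = "A collector phones me daily about a debt I already paid."
predicted = model.predict([new_complaint])[0]
probabilities = dict(zip(clf.classes_, model.predict_proba([new_complaint])[0]))

print(predicted)
for category, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{category:18s} {p:.2f}")
```

Printing the class probabilities alongside the label gives a rough sense of how confident the model is about where the complaint belongs.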

Future work will include applying embedding techniques like word2vec for feature vectorization and building the model on a neural network that may be more sensitive to product categories with small sample sizes.

