Complaints Classification - NLP US Consumer Finance

Posted on Jun 7, 2019

Project GitHub | LinkedIn:   Niki   Moritz   Hao-Wei   Matthew   Oren


Each week, the Consumer Financial Protection Bureau sends thousands of consumers' complaints about financial products and services to companies for response. The goal of this NLP project is to build a model that accurately classifies those complaints into the product category they belong to, using only the text of the complaint. The data is sourced from Kaggle.

Exploratory Data Analysis

The dataset contains 18 features and 555,957 observations. Of the 18 features, only 'product' and 'consumer_complaint_narrative' are explored and used for modeling.

Missingness

The complaint column has 489,151 missing values, while the product column has none. The heatmap below shows the missingness in the dataframe.

The black regions hold values and the light-brown regions mark missing entries. I removed every row with a missing value, since a free-text complaint narrative cannot be meaningfully imputed.
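A minimal sketch of that filtering step with pandas, on a toy table rather than the real file. The two column names come from the post; the rows are invented for illustration. If only the narrative column drives the filter, 555,957 - 489,151 = 66,806 rows would be left in the real data.

```python
import pandas as pd

# Toy stand-in for the complaints table; the real file has 18 columns.
df = pd.DataFrame({
    "product": ["Mortgage", "Debt collection", "Credit reporting", "Mortgage"],
    "consumer_complaint_narrative": [
        "My escrow payment was applied to the wrong account.",
        None,
        "An account I never opened appears on my report.",
        None,
    ],
})

# Keep only the two columns used in the post.
df = df[["product", "consumer_complaint_narrative"]]

# Count missing values per column before dropping anything.
missing_per_column = df.isna().sum()

# Drop rows whose narrative is missing.
clean = df.dropna(subset=["consumer_complaint_narrative"]).reset_index(drop=True)

print(missing_per_column)
print(len(clean), "rows remain")
```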

Visualization of data

The graph below shows how the complaints in the remaining dataset are distributed across product categories.

Word cloud analysis of the top 3 product categories:

Before the word cloud analysis, further preprocessing was needed to capture the most informative words and phrases. The process includes:

Applying BeautifulSoup to remove all HTML tags, using a regular expression to remove non-alphanumeric characters, tokenizing the text into its component words, and removing stopwords with NLTK.
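The four steps can be sketched roughly as follows. To keep the example runnable without extra installs or downloads, it uses standard-library stand-ins: `html.parser` plays the role of BeautifulSoup, and a tiny hand-written `STOPWORDS` set (hypothetical, far shorter than NLTK's English list) plays the role of the NLTK stopwords. The function names are my own, not the project's.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "on", "my", "i",
             "is", "was", "it", "in", "for"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def preprocess(raw):
    text = strip_html(raw)                        # 1. remove HTML tags
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)    # 2. drop non-alphanumeric characters
    tokens = text.lower().split()                 # 3. tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS]  # 4. remove stopwords


print(preprocess("<p>The bank reported my <b>mortgage</b> payment late, twice!</p>"))
```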

Word cloud image of 'credit reporting' category:

credit reporting word cloud - 1

The image reveals that some confidential consumer information has been masked as 'xxxx' or 'xx'. A second analysis was conducted to visualize the most frequent words in the complaints with those masked tokens removed.
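One way to drop the masked tokens before counting word frequencies is a small regular-expression filter. This is a sketch on an invented token list; the `MASK` pattern is my assumption about what counts as a mask (any token made only of two or more x's), not necessarily the rule used in the project.

```python
import re
from collections import Counter

# Matches tokens made only of the letter x, two or more long (e.g. "xx", "xxxx").
MASK = re.compile(r"^x{2,}$")

tokens = ["xxxx", "credit", "report", "xx", "xxxx", "credit", "account", "xxxx"]

# Frequencies with and without the masked tokens.
with_masks = Counter(tokens)
without_masks = Counter(t for t in tokens if not MASK.match(t))

print(with_masks.most_common(1))
print(without_masks.most_common(1))
```

With the masks left in, 'xxxx' dominates the counts; once they are filtered out, the most frequent real word comes first.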

credit reporting word cloud - 2

Word cloud analysis of Mortgage complaints:

Mortgage wordcloud

Word cloud analysis of Debt collection complaints:

Debt collection wordcloud

Word Embedding and Modeling

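The text was converted to feature vectors using Term Frequency-Inverse Document Frequency (TF-IDF), a vectorizing technique that weights each word by how often it appears in a document, discounted by how many documents contain it, so that rarer words receive proportionally more weight.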
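To make the weighting concrete, here is a small worked example of the textbook formula, tf(t, d) × ln(N / df(t)), on three invented documents. This is an illustration only: library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant with normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    ["credit", "report", "error"],
    ["credit", "card", "fee"],
    ["mortgage", "escrow", "error"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times ln(N / document frequency)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = tfidf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

In the first document, 'report' appears in only one of the three documents, so it gets a higher weight than 'credit' and 'error', which each appear in two.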

While several machine learning models can be trained to identify the classes in the data, I opted for logistic regression to build the multiclass classifier needed for this project.

The two hyperparameters I tuned were solver and multi_class; the others were left at their defaults.
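A rough sketch of such a model in scikit-learn, chaining a TF-IDF vectorizer and a logistic regression. The training texts are invented, and the settings shown (`solver="lbfgs"`, `max_iter=1000`, English stopwords) are my assumptions, not the values used in the project. The `multi_class` argument is left out on purpose because newer scikit-learn releases deprecate it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; the real model was trained on the complaint narratives.
texts = [
    "wrong balance on my credit report",
    "credit report shows an account I never opened",
    "collector keeps calling about a debt I already paid",
    "debt collector threatened me over an old debt",
    "mortgage servicer misapplied my escrow payment",
    "mortgage modification request was denied",
]
labels = [
    "Credit reporting", "Credit reporting",
    "Debt collection", "Debt collection",
    "Mortgage", "Mortgage",
]

# The solver is set explicitly; multi_class is omitted (deprecated in
# newer scikit-learn releases).
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(texts, labels)

# Classify a new, unseen complaint.
prediction = model.predict(["a debt collector called my employer"])
print(prediction)
```

Wrapping both steps in one pipeline means a raw complaint string can be passed straight to `predict`, with the same vectorization applied as during training.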

Result

Model performance was measured on a held-out 20% test set, where it reached 85% accuracy.
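For reference, an 80/20 split and an accuracy score can be produced along these lines. The data is a toy placeholder, and `stratify` and `random_state` are my additions for a balanced, reproducible split; the post does not say whether the original split used them.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, two balanced classes.
texts = [f"complaint {i}" for i in range(10)]
labels = ["Mortgage", "Debt collection"] * 5

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))

# Accuracy is the share of test labels predicted correctly (4 of 5 here).
y_true = ["Mortgage", "Mortgage", "Debt collection", "Mortgage", "Debt collection"]
y_pred = ["Mortgage", "Mortgage", "Debt collection", "Debt collection", "Debt collection"]
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
```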

I went further to measure the performance of the model on each product category, inspecting the precision, recall and f1-score.

The model's precision and recall are decent for all product categories except 'Payday loan'. This can be attributed to the very small number of training samples available for that category.
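As a reminder of what those per-category numbers mean, the sketch below computes precision, recall and F1 for each class from a pair of invented label lists. The function name `per_class_scores` is hypothetical; scikit-learn's `classification_report` produces the same kind of table.

```python
from collections import Counter

y_true = ["Mortgage", "Mortgage", "Mortgage", "Payday loan", "Payday loan", "Debt collection"]
y_pred = ["Mortgage", "Mortgage", "Debt collection", "Mortgage", "Payday loan", "Debt collection"]


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label:16s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A class with few samples, like 'Payday loan' in this toy data, can swing sharply on a single misclassification, which is one reason small categories tend to score poorly.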

To demonstrate how the model can route a complaint to the right product category, I used it to classify a random complaint copied from the web:

Another prediction by the model:

Future work will include applying neural embedding techniques such as word2vec for feature vectorization and building the model on a neural network that is more sensitive to product categories with small sample sizes.

 


About Author

Oluwole Alowolodu

Recent graduate of Biotechnology - MS. Data science fellow and AI enthusiast.