US Consumer Finance Complaints Classification - NLP
The Consumer Financial Protection Bureau (CFPB) sends thousands of consumer complaints about financial products and services to companies for response every week. The goal of this NLP project is to build a model that accurately classifies each complaint into its product category using the text of the complaint. The data is sourced from Kaggle.
Exploratory Data Analysis
The dataset contains 18 features and 555,957 observations. Of the 18 features, only 'product' and 'consumer_complaint_narrative' are explored and used for modeling.
The complaint narrative column has 489,151 missing values, while the product column has none. The heatmap below shows the missingness in the dataframe.
The black regions hold values and the light-brown regions mark missing entries. I removed all rows with missing values, since a free-text narrative cannot be meaningfully imputed.
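The missing-value check and row removal described above can be sketched with pandas. This is a minimal illustration on a made-up four-row frame, not the project's actual code; dropping on the narrative column only is an assumption, and the sample texts are invented.

```python
import pandas as pd

# Tiny invented stand-in for the complaints table (the real data has 18 columns).
df = pd.DataFrame({
    "product": ["Mortgage", "Debt collection", "Credit reporting", "Mortgage"],
    "consumer_complaint_narrative": [
        "My escrow payment was misapplied.",
        None,
        "An account I never opened appears on my report.",
        None,
    ],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Keep only rows that have a narrative to learn from (assumed drop rule).
clean = df.dropna(subset=["consumer_complaint_narrative"]).reset_index(drop=True)

print(missing_counts)
print(len(clean), "rows remain")
```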
Visualization of data
The graph below shows how complaints are distributed across product categories in the remaining dataset.
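The same distribution can also be tabulated rather than plotted. The sketch below uses an invented six-item series in place of the real 'product' column, so the counts are illustrative only.

```python
import pandas as pd

# Invented sample standing in for the 'product' column.
products = pd.Series(
    ["Debt collection", "Mortgage", "Debt collection",
     "Credit reporting", "Mortgage", "Debt collection"],
    name="product",
)

# Absolute counts per category, sorted from most to least frequent.
counts = products.value_counts()

# Relative share of each category, useful for spotting class imbalance.
shares = products.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```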
Word cloud analysis of the top 3 product categories:
Before carrying out the word cloud analysis, further preprocessing was needed to capture the most informative words and phrases. The process included:
Applying BeautifulSoup to remove HTML tags, using a regular expression to remove non-alphanumeric characters, tokenizing the text into words, and removing stopwords with NLTK.
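The preprocessing steps above can be sketched as a single function. To keep the example dependency-free, it swaps in the standard library's `html.parser` for BeautifulSoup and a tiny hand-written stopword set for NLTK's English list; both are stand-ins, and the lowercasing step and the `preprocess` name are assumptions not stated in the write-up.

```python
import re
from html.parser import HTMLParser

# Stand-in for NLTK's English stopword list; deliberately tiny.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "on", "in",
             "is", "was", "my", "i", "it", "for"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def preprocess(raw):
    """Strip HTML, drop non-alphanumerics, tokenize, remove stopwords."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    text = " ".join(extractor.parts)
    # Replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text).lower()
    # Whitespace tokenization, then stopword filtering.
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = preprocess("<p>I disputed the charge on my <b>credit report</b>!</p>")
print(tokens)
```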
Word cloud image of 'credit reporting' category:
The image reveals that confidential consumer information has been masked as 'xxxx' or 'xx'. A second analysis was conducted to visualize the most frequent words in the complaints with the masked tokens removed.
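Filtering the masked tokens before counting word frequencies might look like the sketch below. The rule that any token made up entirely of two or more 'x' characters is a mask is an assumption, and the token list is invented.

```python
import re
from collections import Counter

# Assumed mask pattern: a token consisting only of two or more 'x' characters.
MASK = re.compile(r"x{2,}", re.IGNORECASE)


def drop_masked(tokens):
    """Remove redaction placeholders such as 'xx' or 'xxxx'."""
    return [tok for tok in tokens if not MASK.fullmatch(tok)]


# Invented tokens standing in for a preprocessed complaint.
tokens = ["xxxx", "credit", "report", "xx", "xxxx", "credit", "dispute", "XXXX"]

visible = drop_masked(tokens)
top_words = Counter(visible).most_common(2)
print(top_words)
```

The resulting frequencies are the kind of input a word cloud is built from.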
Word cloud analysis of Mortgage complaints:
Word cloud analysis of Debt collection complaints:
Word Embedding and Modeling
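Text was converted to feature vectors using Term Frequency-Inverse Document Frequency (TF-IDF), a vectorization technique that weights each word by how often it occurs in a document and down-weights words that are common across the corpus, so rarer words receive proportionally more weight.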
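To make the weighting concrete, here is a small worked example using the textbook formulation, tf × log(N / df). It is only an illustration on three invented documents. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed, normalized variant, so their numbers will differ.

```python
import math
from collections import Counter

# Three invented, already-tokenized documents.
docs = [
    ["credit", "report", "error"],
    ["credit", "card", "fee"],
    ["mortgage", "payment", "credit"],
]

n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(n_docs / doc_freq[t]) for t in tf}


weights = tfidf(docs[0])
print(weights)
```

Here 'credit' appears in every document, so its weight is zero, while 'report' and 'error' each appear in only one document and receive a positive weight.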
While several machine learning models could be trained to identify the classes in the data, I opted for logistic regression to build the multiclass classifier needed for this project.
The solver and multi_class hyperparameters were tuned; all others were left at their defaults.
Performance was measured on a held-out 20% test set, where the model reached 85% accuracy.
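A minimal sketch of this setup with scikit-learn follows, assuming a TF-IDF vectorizer feeding a logistic regression inside one pipeline. The ten-sentence corpus is invented. The `solver="lbfgs"`, `max_iter`, `random_state`, and stratified split values are assumptions, since the write-up does not state the tuned settings. The `multi_class` argument is left out because recent scikit-learn releases deprecate it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for the cleaned complaint narratives.
texts = [
    "my mortgage payment was applied to the wrong month",
    "the bank lost my mortgage modification paperwork",
    "escrow on my mortgage was miscalculated again",
    "mortgage servicer refused to correct the interest rate",
    "foreclosure started while my mortgage review was pending",
    "a debt collector calls my workplace every day",
    "they are trying to collect a debt i already paid",
    "the debt collection agency never validated the debt",
    "collector threatened to sue over a debt that is not mine",
    "received a debt notice for an account i never opened",
]
labels = ["Mortgage"] * 5 + ["Debt collection"] * 5

# Hold out 20% of the data for testing, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorize and classify in one pipeline so the same TF-IDF vocabulary
# is reused at prediction time.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="lbfgs", max_iter=1000),  # assumed settings
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))

# Classify a new, unseen complaint.
prediction = model.predict(["the collector keeps calling me about a debt"])[0]
print(accuracy, prediction)
```

With a corpus this small the accuracy figure is meaningless; the point is only the shape of the workflow.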
I went further to measure the performance of the model on each product category, inspecting the precision, recall and f1-score.
The model's precision and recall are decent for all product categories except 'Payday loan'. This can be attributed to the very small number of training samples in that category.
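Per-category metrics of this kind can be obtained with scikit-learn's `classification_report`. The labels below are invented to mimic the pattern described, where a rare class is never predicted; they are not the project's results.

```python
from sklearn.metrics import classification_report

# Invented true and predicted labels; 'Payday loan' is rare and always missed.
y_true = ["Mortgage", "Mortgage", "Mortgage",
          "Debt collection", "Debt collection", "Payday loan"]
y_pred = ["Mortgage", "Mortgage", "Debt collection",
          "Debt collection", "Debt collection", "Debt collection"]

# output_dict=True returns nested dicts instead of a formatted string;
# zero_division=0 reports 0 for a class that is never predicted.
report = classification_report(
    y_true, y_pred, output_dict=True, zero_division=0
)

for category in ("Mortgage", "Debt collection", "Payday loan"):
    scores = report[category]
    print(category, scores["precision"], scores["recall"], scores["f1-score"])
```

Overall accuracy can hide a class like this, which is why the per-category breakdown matters when the categories are imbalanced.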
I used the model to predict the category of a random complaint copied from the internet, to demonstrate how the algorithm can identify which product category a complaint belongs to:
Another prediction by the model:
Future work will include applying word embedding techniques like word2vec for feature vectorization and building the model on a neural network that is more sensitive to product categories with small sample sizes.