Credit Card Fraud Detection with QDA, LR, SVM models

Posted on Feb 5, 2018

Project Description:

This blog is based on my work on Kaggle. Link:

This kernel uses the Kaggle Credit Card Fraud transactions dataset to build classification models with QDA (Quadratic Discriminant Analysis), LR (Logistic Regression), and SVM (Support Vector Machine) machine learning algorithms to help detect fraudulent credit card transactions.

The provided dataset contains 492 frauds out of 284,807 transactions, so the positive class (frauds) accounts for only 0.172% of all transactions. In addition, the dataset is high-dimensional, with 30 features: 28 PCA-transformed features (V1, V2, ... V28) plus the 'Amount' and 'Time' features. More details about the dataset can be found at the link above.

In this exercise, I focused on the model's Recall score - the fraction of actual Fraud transactions (the positive class) that the model detects; however, I also considered the trade-off with the Precision score - the fraction of transactions flagged as Fraud that really are fraudulent. A high fraud-detection rate helps prevent the business/bank from losing money, but Precision matters too: flagging legitimate transactions as Fraud hurts the Credit Card customer's experience, and the Customer Service department ends up receiving more calls from clients whose transactions were blocked because they were misclassified as Fraud.

In addition, because the dataset is highly unbalanced, I applied resampling before model fitting - under-sampling the majority class and over-sampling the minority class - prior to the training process. Recursive feature elimination with cross-validation is used for feature selection.

The packages used in the kernel:
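The kernel's original import cell is not reproduced here; a typical, assumed set of packages for this kind of analysis would look like:

```python
# Assumed package list - the kernel's actual imports are not shown in this post
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
```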

1/ Exploring the Dataset:

  • The dataset does not have any missing data:

  • The dataset is highly unbalanced: the positive class (Fraud transactions) accounts for 0.172% of all transactions:
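These two checks can be sketched in pandas as follows (a synthetic stand-in frame is used here, since the real creditcard.csv has 284,807 rows; the `Time`, `V…`, `Amount`, and `Class` column names come from the dataset description):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the Kaggle creditcard.csv file
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Time": np.arange(n, dtype=float),
    "V1": rng.normal(size=n),
    "Amount": rng.exponential(50.0, size=n),
    "Class": (rng.random(n) < 0.002).astype(int),  # ~0.2% frauds, like the real data
})

missing = df.isnull().sum().sum()      # total count of missing values
fraud_rate = df["Class"].mean() * 100  # positive-class percentage
print(f"missing values: {missing}")
print(f"fraud rate: {fraud_rate:.3f}%")
```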

2/ Selecting Features:

  • Feature ranking with recursive feature elimination and cross-validated selection of the best number of features:

  • The algorithm above, illustrated with the chart, suggested the 11 most effective features.
  • The list of selected features is:
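The RFECV step can be sketched like this (run on a small made-up dataset so the example stays self-contained; the kernel ran it on the 30 credit-card features, and its estimator and scoring choices are not shown, so the ones below are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data standing in for the credit-card features
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

# Recursive feature elimination with cross-validated selection of
# the best number of features
selector = RFECV(LogisticRegression(max_iter=1000),
                 step=1, cv=StratifiedKFold(5), scoring="recall")
selector.fit(X, y)
print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```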

3/ Preparing Training and Testing Datasets:

We're going to split the dataset into Training and Testing sets at an 80:20 ratio. We need to perform the following steps:

  • Split the data into Normal and Fraud datasets
  • Shuffle the data before sampling to make sure the data from each class is randomly selected
  • Eliminate the features/predictors that provide little or no support for the classification

  • The Training dataset now contains only features that support the classification process:
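The steps above can be sketched with scikit-learn's stratified split, which shuffles and preserves each class's proportion in both splits (the frame and the selected-feature list below are synthetic stand-ins, not the kernel's actual output):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
df["Class"] = (rng.random(1000) < 0.05).astype(int)

selected = ["V1", "Amount"]  # hypothetical result of the feature-selection step
X_train, X_test, y_train, y_test = train_test_split(
    df[selected], df["Class"],
    test_size=0.20,        # the 80:20 split described above
    shuffle=True,          # shuffle before taking the sample
    stratify=df["Class"],  # keep Normal/Fraud proportions in both splits
    random_state=42)
print(len(X_train), len(X_test))
```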

4/ Resampling data:

Transforming Training Dataset:

The 0.17% ratio between the Fraud and Normal classes shows a strongly unbalanced dataset in favor of the Normal class. Resampling is used to transform the Training dataset: we under-sample the Normal class so that the dataset is balanced between the classes, which prevents the fitted model from overfitting to the majority class.

The Cluster Centroids technique is used to transform the Training dataset. It performs under-sampling by replacing clusters of majority-class samples with centroids produced by a KMeans algorithm: to keep N majority samples, KMeans is fit with N clusters on the majority class, and the coordinates of the N cluster centroids become the new majority samples.

Examining the original Training dataset and its under-sampled version after the transformation:

  • Code:
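A minimal sketch of the centroid under-sampling idea, implemented directly with scikit-learn's KMeans on synthetic data (the imbalanced-learn library's ClusterCentroids sampler wraps this same idea, but is not required for the sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-class data: a large "Normal" cloud and a small "Fraud" cloud
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(500, 2))  # majority (Normal) class
X_minor = rng.normal(3.0, 1.0, size=(25, 2))   # minority (Fraud) class

# Shrink the majority class down to the minority size: fit KMeans with
# n_keep clusters and use the cluster centroids as the new majority samples
n_keep = len(X_minor)
km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(X_major)
X_major_resampled = km.cluster_centers_

X_bal = np.vstack([X_major_resampled, X_minor])
y_bal = np.array([0] * n_keep + [1] * len(X_minor))
print(X_bal.shape, np.bincount(y_bal))
```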

5/ Fitting models

i/ QDA model:

a/ Fitting QDA model:
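A sketch of the QDA fit on a balanced synthetic training set standing in for the resampled credit-card data (scores below will differ from the kernel's):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import precision_score, recall_score

# Balanced synthetic data, mimicking the under-sampled training set
X, y = make_classification(n_samples=400, n_features=6,
                           weights=[0.5, 0.5], random_state=0)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
y_pred = qda.predict(X)
print("train precision:", precision_score(y, y_pred))
print("train recall:", recall_score(y, y_pred))
```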

b/ QDA model Score on the Training Dataset:

Training Scoring:

Precision: 94% - fraction of transactions flagged as Fraud that are truly fraudulent

Recall: 88% - fraction of actual Fraud transactions detected

c/ QDA model Score on the Testing Dataset:

Testing Scoring:

Precision: 98% - fraction of transactions flagged as Fraud that are truly fraudulent

Recall: 90% - fraction of actual Fraud transactions detected

d/ Confusion plot on the Testing Dataset:

Plot code:
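The kernel's own plot code is not reproduced here; a sketch with matplotlib's ConfusionMatrixDisplay, using made-up predictions in place of the actual QDA test-set output, would be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Illustrative (made-up) labels and predictions: 90 Normal, 10 Fraud
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 88 + [1] * 2 + [1] * 9 + [0] * 1)

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=["Normal", "Fraud"])
disp.plot(cmap="Blues")
plt.title("Confusion matrix (illustrative data)")
plt.savefig("confusion_plot.png")
plt.close("all")
```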

Confusion plot:

e/ A glance at how the QDA Classification works on the Test Dataset:

Plot code:


ii/ Logistic Regression model

a/ Fitting Logistic Regression model:
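A sketch of the Logistic Regression fit, again on synthetic stand-in data (the kernel's hyperparameters are not shown, so scikit-learn defaults are assumed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic stand-in for the resampled training set
X, y = make_classification(n_samples=400, n_features=6, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X, y)
print(classification_report(y, lr.predict(X)))
```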

b/ LR model Score on the Training Dataset:

Training Scoring:

Precision: 96% - fraction of transactions flagged as Fraud that are truly fraudulent

Recall: 89% - fraction of actual Fraud transactions detected

c/ LR model Score on the Testing Dataset:

Testing Scoring:

Precision: 98% - fraction of transactions flagged as Fraud that are truly fraudulent

Recall: 90% - fraction of actual Fraud transactions detected

d/ Confusion plot on the Testing Dataset:

e/ Selecting a probability threshold for the Logistic Regression model classification:

The classification scores are examined further to find the optimal point where the model achieves not only a high fraud-detection rate (Recall score) but also a high rate of correctly classifying Normal transactions (Precision score), for the following reasons:

A high fraud-classification rate (Recall score) will obviously help keep the business/bank from losing money.

A high Normal-classification rate (Precision score) will help improve customer experience and satisfaction.

We will select the optimal threshold that achieves the goals for both the Recall score (fraud-detection rate) and the Precision score. In this case, a probability threshold of 0.5 can be selected, giving a 98% Precision score and a 90% Recall score.
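A sketch of such a threshold sweep using predict_proba on synthetic data (the numbers printed will differ from the 98%/90% reported above, since the kernel's data and model are not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in data and model
X, y = make_classification(n_samples=600, n_features=6, random_state=2)
lr = LogisticRegression(max_iter=1000).fit(X, y)
proba = lr.predict_proba(X)[:, 1]  # estimated P(class = Fraud)

# Raising the threshold trades recall for precision
for t in (0.3, 0.5, 0.7):
    pred = (proba >= t).astype(int)
    print(f"t={t}: precision={precision_score(y, pred):.2f} "
          f"recall={recall_score(y, pred):.2f}")
```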

iii/ Support Vector Machine model

a/ Fitting SVM model:

b/ Selecting best SVM parameters:
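Parameter selection for an SVM is typically done with a cross-validated grid search; the kernel's actual grid is not shown, so the C and gamma values below are assumptions, and the data is a small synthetic stand-in to keep the sketch fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for the resampled training set
X, y = make_classification(n_samples=200, n_features=6, random_state=3)

# Cross-validated search over an assumed C/gamma grid, scored on recall
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                    scoring="recall", cv=3)
grid.fit(X, y)
print("best params:", grid.best_params_)
```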

c/ SVM model Score on the Testing Dataset:

Testing Scoring:

Precision: 99% - fraction of transactions flagged as Fraud that are truly fraudulent

Recall: 89% - fraction of actual Fraud transactions detected

d/ Confusion plot on the Testing Dataset:



For feedback or comments, please write to [email protected].
