Credit Card Fraud Detection with QDA, LR, and SVM Models
Project Description:
This blog is based on my work on Kaggle. Link: https://www.kaggle.com/huyqtran/qda-lr-svm-for-fraud-detection
This kernel uses the Credit Card Fraud transactions dataset to build classification models with the QDA (Quadratic Discriminant Analysis), LR (Logistic Regression), and SVM (Support Vector Machine) machine learning algorithms to help detect fraudulent credit card transactions.
The provided dataset contains 492 frauds out of 284,807 transactions, so the positive class (frauds) accounts for 0.172% of all transactions. In addition, the dataset is high-dimensional, with 30 features: 28 PCA-transformed features (V1, V2, ..., V28) plus the 'Amount' and 'Time' features. More details about the dataset can be found at the link above.
In this exercise, I focused on the model's Recall score - the fraud detection rate, i.e. the fraction of actual Fraud transactions that are correctly classified; however, I also considered the trade-off with the Precision score - how reliably the model avoids flagging Normal transactions as Fraud. A high fraud detection rate helps prevent the business/bank from losing money; however, it's also very important to consider the Precision score - the model's performance in classifying Normal transactions. Misclassifying Normal transactions hurts the credit card customer's experience, and the Customer Service department will end up receiving more calls from clients whose transactions are flagged as suspicious after being misclassified as Fraud.
In addition, because the dataset is highly unbalanced, I applied a resampling technique before model fitting to under-sample the majority class and over-sample the minority class before the training process. Recursive feature elimination with cross-validation is used for feature selection.
The packages used in the kernel:
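The original import cell is not reproduced here. A representative set, assuming the usual scikit-learn stack for this kind of kernel (the exact package list in the original may differ), would be:

```python
# Core data-handling and plotting packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn: models, feature selection, splitting, metrics
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# The Cluster Centroids under-sampler described later lives in the
# imbalanced-learn package (assumption: `pip install imbalanced-learn`):
# from imblearn.under_sampling import ClusterCentroids
```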
1/ Exploring the Dataset:
- The dataset does not have any missing data:
- The dataset is highly unbalanced: the positive class (Fraud transactions) accounts for 0.172% of all transactions:
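The two checks above can be sketched as follows. The real kernel runs them on the full `creditcard.csv` DataFrame; a tiny synthetic frame is used here so the sketch runs standalone (the column names `V1`, `Amount`, `Class` follow the dataset's schema):

```python
import pandas as pd

# Tiny stand-in for the real creditcard.csv DataFrame (Class: 1 = Fraud)
df = pd.DataFrame({
    "V1": [0.1, -1.2, 0.3, 2.0, -0.5, 0.7],
    "Amount": [10.0, 250.0, 3.5, 99.0, 12.0, 5.0],
    "Class": [0, 0, 0, 1, 0, 0],
})

# 1) Check for missing data: every column should report zero nulls
print(df.isnull().sum())

# 2) Check the class balance: frauds are a tiny fraction of all rows
counts = df["Class"].value_counts()
fraud_ratio = counts.get(1, 0) / len(df)
print(f"Fraud ratio: {fraud_ratio:.3%}")
```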
2/ Selecting Features:
- Feature ranking with recursive feature elimination and cross-validated selection of the best number of features:
- The algorithm above, illustrated with the chart, suggests keeping the 11 most effective features.
- The list of selected features is:
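The selection code itself is not shown above. A hedged sketch of how scikit-learn's `RFECV` performs this ranking, run on a synthetic stand-in for the 30-feature dataset, looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 30 features, only some informative
# (mirroring V1..V28 + Amount + Time; not the real data)
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=0)

# Recursively eliminate features, scoring each subset by cross-validated recall
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              step=1, cv=StratifiedKFold(5), scoring="recall")
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
```

Plotting the cross-validated score against the number of selected features gives the chart referenced above; the peak marks the suggested feature count.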
3/ Preparing Training and Testing Datasets:
We're going to split the dataset in an 80:20 ratio into Training and Testing datasets. We need to perform the following steps:
- Split the data into Normal and Fraud datasets
- Shuffle the data before sampling to make sure the data from each class is randomly selected
- Eliminate the features/predictors that give little or no support to the classification
- The Training dataset now contains only the features supporting the classification process:
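The steps above can be sketched with a stratified split, which shuffles and then samples each class proportionally in one call (synthetic data here; the real kernel splits the Normal and Fraud frames explicitly):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 1000 transactions, ~2% Fraud, 5 selected features
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.02).astype(int)

# Stratified 80:20 split: shuffles, then samples each class proportionally,
# covering the "split per class / shuffle / sample" steps in one call
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print("Train size:", len(X_train), "Test size:", len(X_test))
```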
4/ Resampling data:
Transforming Training Dataset:
The 0.17% ratio between the Fraud and Normal classes shows strongly unbalanced data in favor of the Normal class. Resampling is used to transform the Training dataset: we under-sample the Normal class to balance the dataset between the classes, which prevents the fitted model from overfitting to the majority class.
The Cluster Centroids technique is used to transform the Training dataset: it under-samples the majority class by replacing clusters of majority samples with the centroids produced by a KMeans algorithm. The algorithm keeps N majority samples by fitting KMeans with N clusters to the majority class and using the coordinates of the N cluster centroids as the new majority samples.
Examining the original Training dataset and its under-sampled version after the transformation:
- Code:
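The kernel likely uses `ClusterCentroids` from imbalanced-learn for this step; the equivalent logic, implemented directly with scikit-learn's `KMeans` exactly as described above, looks like this (synthetic majority/minority samples stand in for the real data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Majority (Normal) and minority (Fraud) training samples, 2 features each
X_majority = rng.normal(size=(500, 2))
X_minority = rng.normal(loc=3.0, size=(25, 2))

# Keep N majority samples, where N = size of the minority class:
# fit KMeans with N clusters on the majority class, then use the
# N cluster centroids as the new, balanced majority samples
n_keep = len(X_minority)
km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(X_majority)
X_majority_resampled = km.cluster_centers_

X_balanced = np.vstack([X_majority_resampled, X_minority])
y_balanced = np.array([0] * n_keep + [1] * len(X_minority))

print("Balanced class counts:", np.bincount(y_balanced))
```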
5/ Fitting models
i/ QDA model:
a/ Fitting QDA model:
b/ QDA model Score on the Training Dataset:
Training Scoring:
Precision: 94% - Normal transaction classification (few false fraud alarms)
Recall: 88% - Fraud transaction classification (fraud detection rate)
c/ QDA model Score on the Testing Dataset:
Testing Scoring:
Precision: 98% - Normal transaction classification (few false fraud alarms)
Recall: 90% - Fraud transaction classification (fraud detection rate)
d/ Confusion plot on the Testing Dataset:
Plot code:
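The original plotting cell is not shown; a hedged sketch using scikit-learn's `ConfusionMatrixDisplay` (with stand-in labels and predictions in place of the real test-set results) is:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Stand-in test labels and predictions (0 = Normal, 1 = Fraud)
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=["Normal", "Fraud"])
disp.plot(cmap="Blues")
plt.title("Confusion matrix on the Testing Dataset")
plt.savefig("confusion_qda.png")
print(cm)
```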
Confusion plot:
e/ A glance at how the QDA Classification works on the Test Dataset:
Plot code:
Plots:
ii/ Logistic Regression model
a/ Fitting Logistic Regression model:
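A minimal sketch of this step (synthetic data; the real kernel fits on the resampled training set, and any solver settings here are assumptions rather than the kernel's actual choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a minority positive (Fraud) class
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           weights=[0.9, 0.1], random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2)

# Fit logistic regression; max_iter raised so the solver converges cleanly
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print("Test accuracy:", lr.score(X_test, y_test))
```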
b/ LR model Score on the Training Dataset:
Training Scoring:
Precision: 96% - Normal transaction classification (few false fraud alarms)
Recall: 89% - Fraud transaction classification (fraud detection rate)
c/ LR model Score on the Testing Dataset:
Testing Scoring:
Precision: 98% - Normal transaction classification (few false fraud alarms)
Recall: 90% - Fraud transaction classification (fraud detection rate)
d/ Confusion plot on the Testing Dataset:
e/ Selecting a probability threshold for the Logistic Regression model classification:
The classification scoring is examined further to find the optimal point where the model achieves not only a high score in detecting Fraud (Recall score) but also a high score in correctly classifying Normal transactions (Precision score), for the following reasons:
A high Fraud classification (Recall score) rate will obviously help keep the business/bank from losing money
A high Normal classification (Precision score) rate will help improve the customer experience/satisfaction
We will select the optimal threshold that achieves the goal for both the Recall (fraud detection rate) and Precision scores. In this case, the probability threshold 0.5 can be selected, with a 98% Precision score and a 90% Recall score.
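One way to examine this trade-off is to sweep thresholds over the predicted fraud probabilities, e.g. with `precision_recall_curve`; a sketch on synthetic data (the real kernel evaluates this on its own test set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=11, n_informative=6,
                           weights=[0.9, 0.1], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = lr.predict_proba(X_test)[:, 1]  # P(Fraud) for each transaction

# Precision and recall at every candidate probability threshold;
# plotting these against `thresholds` reveals the operating point
precision, recall, thresholds = precision_recall_curve(y_test, proba)
for t in (0.3, 0.5, 0.7):
    y_hat = (proba >= t).astype(int)
    print(f"threshold={t}: flagged {y_hat.sum()} of {len(y_hat)} as fraud")
```

Raising the threshold trades recall for precision; the chosen threshold is wherever both scores meet the business goal.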
iii/ Support Vector Machine model
a/ Fitting SVM model:
b/ Selecting best SVM parameters:
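A hedged sketch of both steps together, using `GridSearchCV` over an assumed parameter grid (the kernel's actual grid and scoring choice are not shown above) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=11, n_informative=6,
                           weights=[0.85, 0.15], random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=4)

# Grid-search C and gamma for an RBF-kernel SVM, scoring by recall
# (the fraud-detection rate this kernel optimizes for)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test score:", search.score(X_test, y_test))
```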
c/ SVM model Score on the Testing Dataset:
Testing Scoring:
Precision: 99% - Normal transaction classification (few false fraud alarms)
Recall: 89% - Fraud transaction classification (fraud detection rate)
d/ Confusion plot on the Testing Dataset:
THANK YOU FOR READING!
For feedback or comments, please write to huytquoc@gmail.com.