Fraud Detection Competition Features and Variables
As a group we completed the IEEE-CIS (Institute of Electrical and Electronics Engineers, Computational Intelligence Society) Fraud Detection competition on Kaggle. The dataset of credit card transactions was provided by Vesta Corporation, described as the world's leading payment service company. The dataset includes identity and transaction CSV files for both the train and test sets. The training dataset is 590,540 rows by 433 columns, of which 20,663 transactions are labeled as fraud. The target variable for this competition is 'isFraud'.
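As a rough sketch of how these files fit together (using the competition's standard file names and the TransactionID key from the data description), the transaction and identity tables can be joined like this:

```python
import pandas as pd

# Load the transaction and identity tables (standard Kaggle file names)
train_transaction = pd.read_csv("train_transaction.csv")
train_identity = pd.read_csv("train_identity.csv")

# Left join: not every transaction has a matching identity record
train = train_transaction.merge(train_identity, how="left", on="TransactionID")

print(train.shape)             # (rows, columns) of the merged training frame
print(train["isFraud"].sum())  # number of transactions labeled as fraud
```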
Here are some of the most important features and variables:
- TransactionDT: timedelta from a given reference datetime (not a timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1-card6: payment card information, such as card type
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domains
- C1-C14: counting features, such as how many addresses are associated with the payment card; the actual meaning is masked
- D1-D15: timedeltas, such as days since the previous transaction, etc
- M1-M9: match, such as names on card and address, etc
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations
Here are some of the most important things we discovered during our exploratory data analysis (EDA):
- One of the first things we noticed when conducting our EDA was the sparsity of the dataset
- Only ~3.5% of the total transactions are labeled as fraud ('isFraud' = 1), so the data is highly imbalanced
- Plotting 'TransactionDT' for the train and test sets shows a gap between them, confirming that 'TransactionDT' is a timedelta rather than a timestamp
We also explored the timedelta in more detail, although there was a lot of noise.
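A minimal sketch of both checks on the merged train frame from above, treating the TransactionDT unit as seconds (an assumption; the column names are from the competition data):

```python
# Class imbalance: fraction of transactions labeled as fraud (~3.5%)
print(f"Fraud rate: {train['isFraud'].mean():.3%}")

# Treating the TransactionDT timedelta as seconds, bucket transactions into days
train["TransactionDay"] = train["TransactionDT"] // (24 * 60 * 60)

# Daily fraud rate over the covered period (noisy, as noted above)
daily_fraud = train.groupby("TransactionDay")["isFraud"].mean()
print(daily_fraud.describe())
```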
Another interesting thing we found during our EDA:
- The target variable 'isFraud' is more prevalent for the mobile 'DeviceType', and also more prevalent for transactions flagged as 'IP_PROXY:ANONYMOUS' (a value of the 'id_23' proxy column)
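A hedged sketch of how these prevalence checks can be reproduced with simple group-bys, assuming the identity columns are present in the merged frame:

```python
# Fraud rate by device type (DeviceType comes from the identity file)
print(train.groupby("DeviceType")["isFraud"].mean())

# Fraud rate by proxy flag; 'IP_PROXY:ANONYMOUS' is one value of id_23
print(train.groupby("id_23")["isFraud"].mean())
```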
Next we tackled missingness and imputation and found:
- The dataset has a very high percentage of missing values, especially in the V columns
- The anonymized columns not only had a high amount of missing data, but their values were also far from normally distributed
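A small sketch of how the per-column missingness can be measured and split into the buckets used by the plans below (the variable names are illustrative):

```python
# Percentage of missing values per column, sorted from most to least missing
missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head(20))   # the V columns dominate the top of this list

# Column buckets referenced by Plan A below
drop_candidates = missing_pct[missing_pct > 80].index
simple_impute_candidates = missing_pct[missing_pct < 20].index
```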
We had two main plans for dealing with missingness and imputation, Plan A and Plan B, as described below:
- Plan A:
- Drop columns with over 80% missing values
- Impute columns with less than 20% missing values using the mean within each row's product code (ProductCD) group
- Use a machine learning model, with the fully observed columns as input variables, to predict the remaining missing values
- Precise but time consuming
- Hard to impute for anonymized data
- Plan B:
- Impute all missing values with -999 first, which is very fast; the model can still find some pattern in the missingness instead of losing information by dropping columns (see the sketch after this list)
- Do more complex imputation afterwards if we had more time
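A minimal sketch of Plan B's constant fill; the -999 value is from the plan above, while filling non-numeric columns with a 'missing' placeholder is our assumption:

```python
# Plan B: constant-fill every missing value so tree models can treat
# "missing" as its own signal instead of dropping columns
numeric_cols = train.select_dtypes(include="number").columns
object_cols = train.select_dtypes(exclude="number").columns

train[numeric_cols] = train[numeric_cols].fillna(-999)
train[object_cols] = train[object_cols].fillna("missing")  # placeholder for non-numeric columns (assumption)
```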
Another one of the most important parts of this project was feature engineering: given the sparsity and anonymity of our data, it was a central focus. The most intuitive way to start was engineering on the known features, namely TransactionDT and TransactionAmt.
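The sketch below shows the kind of features meant here; the specific choices (hour of day, day of week, log amount, cents part) are illustrative rather than our exact final set, and it again treats TransactionDT as seconds:

```python
import numpy as np

# Time-of-day and day-of-week style features from the TransactionDT timedelta
train["Transaction_hour"] = (train["TransactionDT"] // 3600) % 24
train["Transaction_dow"] = (train["TransactionDT"] // (3600 * 24)) % 7

# Amount-based features: log scale plus the cents portion of the amount
train["TransactionAmt_log"] = np.log1p(train["TransactionAmt"])
train["TransactionAmt_cents"] = train["TransactionAmt"] - np.floor(train["TransactionAmt"])
```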
Engineering on the anonymized V-columns was another step in this process. Considering that the V-columns occupied most of our data and were engineered by Vesta themselves, we relied on the trends in those values to guide imputation.
Another very important part of our project was dimensionality reduction. In our efforts towards this we attempted the following:
- PCA
- Lasso
- Sparse PCA
It was essential to run PCA before balancing, so that the synthetic values created by oversampling wouldn't be influenced by uninformative features.
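A minimal sketch of the PCA step on the V columns, assuming the Plan B fill has already made them fully numeric; the standard scaling and the 95% explained-variance target are illustrative choices:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Restrict PCA to the anonymized V columns, which dominate the feature count
v_cols = [c for c in train.columns if c.startswith("V")]

# PCA is scale-sensitive, so standardize first (assumes missing values already filled)
v_scaled = StandardScaler().fit_transform(train[v_cols])

# Keep enough components to explain ~95% of the variance (illustrative threshold)
pca = PCA(n_components=0.95)
v_pca = pca.fit_transform(v_scaled)
print(v_pca.shape)
```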
We also implemented balancing techniques to address the class imbalance shown earlier, namely random oversampling of the minority class and SMOTE. SMOTE, which stands for Synthetic Minority Oversampling Technique, creates synthetic minority samples by interpolating between a data point and its nearest neighbors, judged by Euclidean distance. A visual example of both techniques is shown below:
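In code, both techniques can be sketched with the imbalanced-learn package; X_train and y_train stand for the post-PCA training split and its 'isFraud' labels (names assumed for illustration):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling: duplicate minority (fraud) rows until classes are balanced
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# SMOTE interpolates between a minority point and its nearest neighbors
# (Euclidean distance) to create synthetic fraud examples
smote = SMOTE(k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
```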
After this we ran our models, XGBoost and LightGBM. Here are our scores for both of the models:
- XGBoost
- Accuracy: 0.98
- Precision: 0.75
- Recall: 0.45
- LightGBM
- AUC: 0.972328
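A hedged sketch of how the two models can be trained and scored; the hyperparameters are illustrative rather than our tuned values, and X_valid / y_valid denote an assumed held-out validation split:

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# XGBoost on the balanced training data, scored on the held-out validation split
xgb_model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, eval_metric="auc")
xgb_model.fit(X_smote, y_smote)
xgb_pred = xgb_model.predict(X_valid)
print(accuracy_score(y_valid, xgb_pred),
      precision_score(y_valid, xgb_pred),
      recall_score(y_valid, xgb_pred))

# LightGBM, scored by AUC on predicted fraud probabilities
lgb_model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
lgb_model.fit(X_smote, y_smote)
print(roc_auc_score(y_valid, lgb_model.predict_proba(X_valid)[:, 1]))
```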
Submissions were evaluated on the area under the ROC curve between the predicted probability and the observed target. A graphical example made on sample data is shown below:
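A sketch of how such a ROC plot can be produced from predicted probabilities, reusing the LightGBM model from the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Predicted fraud probabilities on the validation split
proba = lgb_model.predict_proba(X_valid)[:, 1]

fpr, tpr, _ = roc_curve(y_valid, proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_valid, proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```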
Future work, meaning what we would have pursued if we had more time to work on this project before this presentation:
- We would attempt to utilize cloud computing services:
- Before oversampling and feature engineering, our workspace could only handle so much before running into memory errors
- The amount of data also made grid-searching impossible unless only considering a very limited range of hyperparameters
- We even made a memory reduction function just to do our exploratory data analysis (see the sketch after this list)
- Further feature engineering to help deal with the sparsity of our data
- Proper optimization techniques
- GridSearchCV, bayes_opt, GPyOpt, stratified KFold
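The memory reduction function mentioned above followed the common Kaggle downcasting pattern; this is a sketch of that pattern rather than our exact implementation:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their value range."""
    start_mb = df.memory_usage().sum() / 1024 ** 2
    for col in df.select_dtypes(include="number").columns:
        col_min, col_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            for dtype in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(dtype).min <= col_min and col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # float16 can lose precision on engineered features; float32 is a safer default
            df[col] = df[col].astype(np.float32)
    end_mb = df.memory_usage().sum() / 1024 ** 2
    print(f"Memory reduced from {start_mb:.1f} MB to {end_mb:.1f} MB")
    return df
```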
In conclusion, this was a very fun and challenging project that we believe we did very well on given our time constraints. We learned a lot about fraud detection especially when dealing with credit card transactions. We enjoyed this competition and its valuable experience will help us in future projects.