Kaggle Fraud Detection

As a group we completed the IEEE-CIS (Institute of Electrical and Electronics Engineers – Computational Intelligence Society) Fraud Detection competition on Kaggle. The dataset of credit card transactions was provided by Vesta Corporation, described as the world's leading payment service company. It includes identity and transaction CSV files for both the test and train sets. The training dataset has 590,540 rows and 433 columns, with 20,663 fraudulent transactions. The target variable for this competition is 'isFraud'.
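For concreteness, here is a minimal loading sketch, assuming the competition's standard file names (identity rows exist only for a subset of transactions, hence the left join):

```python
import pandas as pd

# The competition ships transaction and identity tables keyed on TransactionID.
train_transaction = pd.read_csv("train_transaction.csv", index_col="TransactionID")
train_identity = pd.read_csv("train_identity.csv", index_col="TransactionID")

# Left join: not every transaction has identity information.
train = train_transaction.merge(train_identity, how="left",
                                left_index=True, right_index=True)
print(train.shape)  # (590540, 433)
```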

Here are some of the most important features and variables:

  • TransactionDT: timedelta from a given reference datetime (not a timestamp)
  • TransactionAmt: transaction payment amount in USD
  • ProductCD: product code, the product for each transaction
  • card1-card6: payment card information, such as card type
  • addr: address
  • dist: distance
  • P_emaildomain / R_emaildomain: purchaser and recipient email domain
  • C1-C14: counts, such as how many addresses are associated with the payment card; actual meaning masked
  • D1-D15: timedeltas, such as days since the previous transaction, etc.
  • M1-M9: match flags, such as whether the name on the card matches the address, etc.
  • Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations

Here are some of the most important things we discovered during our exploratory data analysis (EDA):

  • One of the first things we noticed when conducting our EDA was the sparsity of the dataset
  • Only ~3.5% of transactions were labeled as fraud ('isFraud' = 1), so the data is also heavily imbalanced

  • Another observation that was immediately apparent is the gap in 'TransactionDT' between the train and test sets, which shows that 'TransactionDT' is a timedelta from a reference point, not a timestamp (a quick check of both observations is sketched below)
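A quick sketch of both checks, assuming `train` is the merged DataFrame from above and `test` was built the same way from the test files:

```python
import matplotlib.pyplot as plt

# Class imbalance: roughly 96.5% legitimate vs ~3.5% fraud.
print(train["isFraud"].value_counts(normalize=True))

# TransactionDT for train vs test: the two ranges do not overlap,
# consistent with a timedelta split rather than raw timestamps.
plt.hist(train["TransactionDT"], bins=100, alpha=0.5, label="train")
plt.hist(test["TransactionDT"], bins=100, alpha=0.5, label="test")
plt.legend()
plt.xlabel("TransactionDT")
plt.show()
```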

We also explored the timedelta in more detail, although there was a lot of noise.

Another interesting thing we found during our EDA:

  • The target variable 'isFraud' is more prevalent for the mobile 'DeviceType', and also more prevalent for 'IP_PROXY:ANONYMOUS' based on 'id_31'

Next we tackled missingness and imputation and found:

  • The dataset has a very high percentage of missing values, especially in the V columns

  • The anonymized columns not only had a large amount of missing data, but their distributions were also far from normal

We had two main plans for dealing with missingness and imputation, Plan A and Plan B, as described below (a brief code sketch of both follows the list):

  • Plan A: 
    • Drop columns with over 80% missing values
    • Impute columns with less than 20% missing values using the mean within each transaction's product group (ProductCD)
    • Use a machine learning model, with the fully observed columns as input variables, to predict the remaining missing values
    • Precise but time consuming
    • Hard to impute for anonymized data
  • Plan B:
    • First impute all missing values with -999, which is very fast; the model can still find some pattern in the sentinel instead of losing information by dropping the data
    • Do more complex imputation afterwards if we had more time
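A brief sketch of both plans; the ProductCD grouping is our reading of "each row's product ID", and the thresholds come straight from the list above:

```python
# Plan A, steps 1-2: drop very sparse columns, then group-mean impute.
missing_frac = train.isna().mean()
train_a = train.drop(columns=missing_frac[missing_frac > 0.80].index)

low_missing = [c for c in train_a.columns if missing_frac[c] < 0.20]
numeric = train_a[low_missing].select_dtypes("number").columns
train_a[numeric] = train_a.groupby("ProductCD")[numeric].transform(
    lambda s: s.fillna(s.mean())
)

# Plan B: a single sentinel value; tree models can isolate it in one split.
train_b = train.fillna(-999)
```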

Given the sparsity and anonymity of our data, feature engineering was a central focus of the project. The most intuitive way to start was to engineer on the known features, namely TransactionDT and TransactionAmt.
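A sketch of the kind of features we mean; since the reference datetime is not given, the hour-of-day and day-of-week encodings below are standard assumptions rather than exact calendar values:

```python
import numpy as np

# TransactionDT is seconds from an unknown reference point, so cyclic
# features are safer than absolute dates.
train["hour"] = (train["TransactionDT"] // 3600) % 24
train["dayofweek"] = (train["TransactionDT"] // (3600 * 24)) % 7

# TransactionAmt is heavily right-skewed; a log transform tames it,
# and the cents part alone can be a useful fraud signal.
train["TransactionAmt_log"] = np.log1p(train["TransactionAmt"])
train["TransactionAmt_cents"] = train["TransactionAmt"] % 1
```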

Engineering on the anonymized V-columns was another step in this process. Since the V-columns make up most of the data and were engineered by Vesta themselves, we leaned on the trends within those values to guide our imputation.
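One hedged sketch of what that can look like: V-columns sharing a missingness pattern were likely engineered together, so each pattern group can be treated as a block (the NaN-count fingerprint below is a simplification of that idea, not our exact procedure):

```python
# Group V-columns by how many values they are missing; equal counts are a
# cheap proxy for "engineered together".
v_cols = [c for c in train.columns if c.startswith("V")]
nan_counts = train[v_cols].isna().sum()
blocks = nan_counts.groupby(nan_counts).groups  # {NaN count: column names}
print(len(blocks), "candidate blocks of V-columns")
```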

Another very important part of our project was dimensionality reduction. In our efforts towards this we attempted the following:

  • PCA
  • Lasso
  • Sparse PCA

It was essential to run PCA before balancing, so that the synthetic values created by oversampling wouldn't be influenced by uninformative features.
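A minimal PCA sketch on the sentinel-filled V-columns; the number of components is illustrative, not the value we settled on:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale first so no single V-column dominates the principal components.
scaled = StandardScaler().fit_transform(train[v_cols].fillna(-999))
pca = PCA(n_components=30, random_state=0)
v_reduced = pca.fit_transform(scaled)
print("explained variance:", pca.explained_variance_ratio_.sum())
```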

We also implemented balancing techniques to address the class imbalance shown earlier, using random oversampling of the minority class along with SMOTE. SMOTE (Synthetic Minority Oversampling Technique) creates synthetic minority-class samples by interpolating between a data point and its nearest neighbors, judged by Euclidean distance. A minimal sketch of both techniques is shown below:
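Here `X_train` and `y_train` stand in for the reduced training matrices from the previous step; both samplers come from the imbalanced-learn package:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling: duplicate minority-class rows until classes balance.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# SMOTE: interpolate between each minority point and one of its k nearest
# neighbors (Euclidean distance) to create synthetic fraud examples.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_train, y_train)
```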

After this we ran our models, XGBoost and LightGBM (a minimal training sketch follows the scores). Here are our scores for both models:

  • XGBoost
    • Accuracy: 0.98
    • Precision: 0.75
    • Recall: 0.45
  • LightGBM
    • AUC: 0.972328
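A minimal training sketch for both libraries; the hyperparameters shown are illustrative defaults, not the values we tuned:

```python
import lightgbm as lgb
import xgboost as xgb

# Fit both gradient-boosted tree models on the balanced training data.
xgb_clf = xgb.XGBClassifier(n_estimators=500, max_depth=9,
                            learning_rate=0.05, eval_metric="auc")
xgb_clf.fit(X_sm, y_sm)

lgb_clf = lgb.LGBMClassifier(n_estimators=500, num_leaves=256,
                             learning_rate=0.05)
lgb_clf.fit(X_sm, y_sm)
```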

Our submissions were evaluated on the area under the ROC curve (AUC) between the predicted probability and the observed target. A sketch of computing and plotting it on a validation split is shown below:
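(`X_valid` and `y_valid` denote an assumed held-out split; the write-up does not name one.)

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Competition metric: AUC on the predicted probability of fraud.
proba = lgb_clf.predict_proba(X_valid)[:, 1]
print("AUC:", roc_auc_score(y_valid, proba))

# The ROC curve behind that number.
fpr, tpr, _ = roc_curve(y_valid, proba)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```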

Future work, meaning what we would pursue if we had more time on this project before the presentation:

  • We would attempt to utilize cloud computing services:
    • Even before oversampling and feature engineering, our workspace could only handle so much before running into memory errors
    • The amount of data also made grid-searching impossible unless only considering a very limited range of hyperparameters
    • We even made a memory reduction function just to get through our exploratory data analysis (see the sketch after this list)
  • Further feature engineering to help deal with the sparsity of our data
  • Proper optimization techniques
    • GridSearchCV, bayes_opt, GPyOpt, stratified KFold
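Our memory reduction function was close in spirit to the standard Kaggle downcasting helper; the version below is a reconstruction, not our exact code:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their range."""
    for col in df.select_dtypes("number").columns:
        c_min, c_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            for dtype in (np.int8, np.int16, np.int32):
                if np.iinfo(dtype).min <= c_min and c_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # float16 loses too much precision for some columns; stop at float32.
            if (np.finfo(np.float32).min <= c_min
                    and c_max <= np.finfo(np.float32).max):
                df[col] = df[col].astype(np.float32)
    return df

train = reduce_mem_usage(train)
```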

In conclusion, this was a fun and challenging project that we believe we did well on given our time constraints. We learned a lot about fraud detection, especially for credit card transactions, and the experience from this competition will serve us well in future projects.
