Fraud Detection Competition Features and Variables

As a group we completed the IEEE-CIS (Institute of Electrical and Electronics Engineers - Computational Intelligence Society) Fraud Detection competition on Kaggle. The dataset of credit card transactions was provided by Vesta Corporation, described as the world's leading payment service company. The dataset includes identity and transaction CSV files for both the test and train sets. The training dataset is 590,540 rows by 433 columns, with 20,663 fraudulent transactions. The target variable for this competition is 'isFraud'.
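
To make the data layout concrete, here is a minimal loading sketch (file names as distributed by Kaggle; the merge key is TransactionID):

```python
import pandas as pd

# Load the transaction and identity tables for the training set
train_transaction = pd.read_csv("train_transaction.csv")
train_identity = pd.read_csv("train_identity.csv")

# Identity information exists only for a subset of transactions,
# so left-join on the shared TransactionID key
train = train_transaction.merge(train_identity, how="left", on="TransactionID")
print(train.shape)  # (rows, columns) of the merged training set
```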

Here are some of the most important features and variables:

  • TransactionDT: timedelta from a given reference datetime (not a timestamp)
  • TransactionAmt: transaction payment amount in USD
  • ProductCD: product code, the product for each transaction
  • card1-card6: payment card information, such as card type
  • addr: address
  • dist: distance
  • P_emaildomain and R_emaildomain: purchaser and recipient email domains
  • C1-C14: counting features, such as how many addresses are associated with the payment card; actual meaning masked
  • D1-D15: timedeltas, such as days since the previous transaction, etc.
  • M1-M9: match indicators, such as whether the name on the card matches the address, etc.
  • Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations

Here are some of the most important things we discovered during our exploratory data analysis (EDA):

  • One of the first things we noticed when conducting our EDA was the sparsity of the dataset
  • Only ~3.5% of the total transactions were labeled as fraudulent under the 'isFraud' target, which makes the classes heavily imbalanced

  • Another observation that was immediately apparent came from plotting 'TransactionDT' for the train and test sets: the two ranges do not overlap, which shows that 'TransactionDT' is a timedelta with a gap between the sets, not a timestamp
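
Both observations are easy to verify with a couple of lines of pandas; a quick sketch using the merged train DataFrame from the loading example above:

```python
# Class imbalance: fraudulent transactions as a share of the total
print(f"Fraud rate: {train['isFraud'].mean():.2%}")  # roughly 3.5%

# TransactionDT looks like seconds from a reference point; dividing by
# 86400 expresses it in days, and running the same summary on the test
# set reveals the non-overlapping ranges
print(train["TransactionDT"].div(86400).agg(["min", "max"]))
```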

We also explored the timedelta features in more detail, although they contained a lot of noise.

Another interesting thing we found during our EDA:

  • The target variable 'isFraud' is more prevalent for the mobile 'DeviceType', and also more prevalent for 'IP_PROXY:ANONYMOUS' entries in 'id_23'
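
A quick groupby reproduces this comparison (a sketch, not our exact plotting code; both columns come from the identity table):

```python
# Average fraud rate per device type; mobile comes out higher
print(train.groupby("DeviceType")["isFraud"].mean())

# Average fraud rate per IP proxy category
print(train.groupby("id_23")["isFraud"].mean())
```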

Next we tackled missingness and imputation and found:

  • The dataset has a very high percentage of missing values, especially among the V columns

  • The anonymized columns not only had a large amount of missing data, but their values were also far from normally distributed
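
Measuring the missingness per column takes one line of pandas; a sketch:

```python
# Share of missing values per column, worst first
missing_pct = train.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.head(20))  # dominated by the V and id columns
```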

We had two main plans for dealing with missingness and imputation, Plan A and Plan B, as described below:

  • Plan A: 
    • Drop columns with over 80% missing values
    • Impute columns with less than 20% missing values using the mean within each row's product code ('ProductCD') group
    • Use a machine learning model, with the fully observed columns as input variables, to predict the remaining missing values
    • Precise but time consuming
    • Hard to impute for anonymized data
  • Plan B:
    • Impute all missing values with -999 first; this is very fast, and the model can still find patterns in the missingness instead of losing information by dropping those columns
    • Do more complex imputation afterwards if time allowed
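
A minimal sketch of both plans, reusing the missing_pct Series from the earlier sketch (thresholds as stated above):

```python
# Plan A, step 1: drop columns that are more than 80% missing
too_sparse = missing_pct[missing_pct > 80].index
train_plan_a = train.drop(columns=too_sparse)

# Plan B: fill every gap with a sentinel value; tree-based models can
# split on -999 and treat "missing" as a signal of its own
train_plan_b = train.fillna(-999)
```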

Another one of the most important parts of this project was feature engineering. Given the sparsity and anonymity of our data, it was a central focus. The most intuitive way to start was engineering on the known features, namely TransactionDT and TransactionAmt.
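
The post doesn't list the exact features we derived, but common choices for these two columns in this competition look like the following sketch:

```python
import numpy as np

# TransactionDT is in seconds, so integer arithmetic yields time-of-day
# and day-of-week features even without knowing the reference datetime
train["hour"] = (train["TransactionDT"] // 3600) % 24
train["weekday"] = (train["TransactionDT"] // 86400) % 7

# TransactionAmt is heavily right-skewed, so a log transform helps,
# and the cents portion of the amount carries signal of its own
train["TransactionAmt_log"] = np.log1p(train["TransactionAmt"])
train["TransactionAmt_cents"] = train["TransactionAmt"] % 1
```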

Engineering on the anonymized V-columns was another step in this process. Considering that the V-columns make up most of the dataset and were engineered by Vesta themselves, we leaned on the trends in those values to guide imputation.

Another very important part of our project was dimensionality reduction. In our efforts towards this we attempted the following:

  • PCA
  • Lasso
  • Sparse PCA

It was essential to run PCA before balancing so that the values created by oversampling would not be influenced by uninformative features.
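
A sketch of that ordering with scikit-learn, assuming the sentinel-filled frame from Plan B and keeping enough components for 95% of the variance:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric features only; scale first so no single column dominates
X = train_plan_b.drop(columns=["isFraud"]).select_dtypes("number")
y = train_plan_b["isFraud"]

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep 95% of the explained variance
X_reduced = pca.fit_transform(X_scaled)
```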

We also implemented balancing techniques to address the class imbalance shown earlier, using oversampling of the minority class along with SMOTE (Synthetic Minority Oversampling Technique), which judges nearest neighbors by the Euclidean distance between data points. Here is a visual example of both techniques below:
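
A sketch with imbalanced-learn's SMOTE, holding out the validation split first so synthetic points never leak into evaluation:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split before resampling: only the training fold receives synthetic points
X_tr, X_val, y_tr, y_val = train_test_split(
    X_reduced, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE interpolates between each minority sample and its nearest
# minority-class neighbors (Euclidean distance) to create new points
X_tr_bal, y_tr_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
```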

After this we ran our models, XGBoost and LightGBM. Here are our scores for both models (a minimal training sketch follows the list):

  • XGBoost
    • Accuracy: 0.98
    • Precision: 0.75
    • Recall: 0.45
  • LightGBM
    • AUC: 0.972328
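
For reference, a training sketch for both libraries on the balanced fold from above (hyperparameters here are illustrative, not the ones we actually used):

```python
import lightgbm as lgb
import xgboost as xgb

# Gradient-boosted trees on the SMOTE-balanced training fold
lgbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=42)
lgbm.fit(X_tr_bal, y_tr_bal)

xgbm = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, random_state=42)
xgbm.fit(X_tr_bal, y_tr_bal)
```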

Scoring of our submissions was evaluated on the area under the ROC curve between the predicted probability and the observed target. A graphical example made on sample data is shown below:
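
Computing the same metric locally is a one-liner with scikit-learn, using the held-out validation fold:

```python
from sklearn.metrics import roc_auc_score

# AUC between the predicted fraud probability and the observed labels
val_prob = lgbm.predict_proba(X_val)[:, 1]
print(f"Validation AUC: {roc_auc_score(y_val, val_prob):.4f}")
```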

Future work, meaning what we would pursue if we had more time before this presentation:

  • We would attempt to utilize cloud computing services:
    • Before oversampling and feature engineering, our workspace could only handle so much before running into memory errors
    • The amount of data also made grid-searching impossible unless only considering a very limited range of hyperparameters
    • We even made a memory reduction function just to do our exploratory data analysis (a simplified sketch follows this list)
  • Further feature engineering to help deal with the sparsity of our data
  • Proper optimization techniques
    • GridSearchCV, bayes_opt, GPyOpt, stratified KFold
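
Our actual memory reduction function isn't reproduced in the post; a simplified sketch of the usual approach (downcast each numeric column to the smallest dtype that can hold its range) looks like this:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to shrink the DataFrame's memory footprint."""
    for col in df.select_dtypes("number").columns:
        col_min, col_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            # Pick the narrowest integer type that covers the column's range
            for dtype in (np.int8, np.int16, np.int32):
                if np.iinfo(dtype).min <= col_min and col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # float32 halves memory versus float64 at acceptable precision loss
            df[col] = df[col].astype(np.float32)
    return df
```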

In conclusion, this was a fun and challenging project that we believe we did well on given our time constraints. We learned a lot about fraud detection, especially in the context of credit card transactions. We enjoyed this competition, and the valuable experience will help us in future projects.

About Authors

Fred (Lefan) Cheng - 程乐帆

Fred Cheng is a certified data scientist working as a data science consultant at Zenon. He holds a Master's degree in Management and Systems from New York University with a bachelor's in business management from The...
