Fraud Detection Competition Features and Variables
As a group we completed the IEEE-CIS (Institute of Electrical and Electronics Engineers, Computational Intelligence Society) Fraud Detection competition on Kaggle. The dataset of credit card transactions was provided by Vesta Corporation, described as the world's leading payment service company. The dataset includes identity and transaction CSV files for both the train and test sets. The training dataset is 590,540 rows by 433 columns, of which 20,663 transactions are labeled as fraud. The target variable for this competition is 'isFraud'.
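As a rough sketch of how these files fit together (using the competition's standard file names and the TransactionID key from the data description), the transaction and identity tables can be joined like this:

```python
import pandas as pd

# Load the transaction and identity tables (standard Kaggle file names)
train_transaction = pd.read_csv("train_transaction.csv")
train_identity = pd.read_csv("train_identity.csv")

# Left join: not every transaction has a matching identity record
train = train_transaction.merge(train_identity, how="left", on="TransactionID")

print(train.shape)             # (rows, columns) of the merged training frame
print(train["isFraud"].sum())  # number of transactions labeled as fraud
```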
Here are some of the most important features and variables:
- TransactionDT: timedelta from a given reference datetime (not a timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1-card6: payment card information, such as card type
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domains
- C1-C14: counting features, such as how many addresses are associated with the payment card; the actual meaning is masked
- D1-D15: timedeltas, such as days since the previous transaction, etc
- M1-M9: match, such as names on card and address, etc
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations
Here are some of the most important things we discovered during our exploratory data analysis (EDA):
- One of the first things we noticed when conducting our EDA was the sparsity of the dataset
- Only ~3.5% of the total transactions are labeled as fraud ('isFraud' = 1), so the data is highly imbalanced
- Plotting 'TransactionDT' for the train and test sets shows a gap between them, confirming that 'TransactionDT' is a timedelta rather than a timestamp
We also explored the timedelta in more detail, although there was a lot of noise.
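A minimal sketch of both checks on the merged train frame from above, treating the TransactionDT unit as seconds (an assumption; the column names are from the competition data):

```python
# Class imbalance: fraction of transactions labeled as fraud (~3.5%)
print(f"Fraud rate: {train['isFraud'].mean():.3%}")

# Treating the TransactionDT timedelta as seconds, bucket transactions into days
train["TransactionDay"] = train["TransactionDT"] // (24 * 60 * 60)

# Daily fraud rate over the covered period (noisy, as noted above)
daily_fraud = train.groupby("TransactionDay")["isFraud"].mean()
print(daily_fraud.describe())
```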
Another interesting thing we found during our EDA:
- The target variable 'isFraud' is more prevalent for the mobile 'DeviceType', and also more prevalent for transactions flagged as 'IP_PROXY:ANONYMOUS' (a value of the 'id_23' proxy column)
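A hedged sketch of how these prevalence checks can be reproduced with simple group-bys, assuming the identity columns are present in the merged frame:

```python
# Fraud rate by device type (DeviceType comes from the identity file)
print(train.groupby("DeviceType")["isFraud"].mean())

# Fraud rate by proxy flag; 'IP_PROXY:ANONYMOUS' is one value of id_23
print(train.groupby("id_23")["isFraud"].mean())
```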
Next we tackled missingness and imputation and found:
- The dataset has a very high percentage of missing values, especially in the V columns
- The anonymized columns not only had a high amount of missing data, but their values were also far from normally distributed
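A small sketch of how the per-column missingness can be measured and split into the buckets used by the plans below (the variable names are illustrative):

```python
# Percentage of missing values per column, sorted from most to least missing
missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head(20))   # the V columns dominate the top of this list

# Column buckets referenced by Plan A below
drop_candidates = missing_pct[missing_pct > 80].index
simple_impute_candidates = missing_pct[missing_pct < 20].index
```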
We had two main plans for dealing with missingness and imputation, Plan A and Plan B, as described below:
- Plan A:
- Drop columns with over 80% missing values
- Impute columns with less than 20% missing values using the mean within each row's product code (ProductCD) group
- Use a machine learning model, with the fully observed columns as input variables, to predict the remaining missing values
- Precise but time consuming
- Hard to impute for anonymized data
- Plan B:
- Impute all missing values with -999 first, which is very fast; the model can still find some pattern in the missingness instead of losing information by dropping columns (see the sketch after this list)
- Do more complex imputation afterwards if we had more time
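A minimal sketch of Plan B's constant fill; the -999 value is from the plan above, while filling non-numeric columns with a 'missing' placeholder is our assumption:

```python
# Plan B: constant-fill every missing value so tree models can treat
# "missing" as its own signal instead of dropping columns
numeric_cols = train.select_dtypes(include="number").columns
object_cols = train.select_dtypes(exclude="number").columns

train[numeric_cols] = train[numeric_cols].fillna(-999)
train[object_cols] = train[object_cols].fillna("missing")  # placeholder for non-numeric columns (assumption)
```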
Another one of the most important parts of this project was feature engineering: given the sparsity and anonymity of our data, it was a central focus. The most intuitive way to start was engineering on the known features, namely TransactionDT and TransactionAmt.
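The sketch below shows the kind of features meant here; the specific choices (hour of day, day of week, log amount, cents part) are illustrative rather than our exact final set, and it again treats TransactionDT as seconds:

```python
import numpy as np

# Time-of-day and day-of-week style features from the TransactionDT timedelta
train["Transaction_hour"] = (train["TransactionDT"] // 3600) % 24
train["Transaction_dow"] = (train["TransactionDT"] // (3600 * 24)) % 7

# Amount-based features: log scale plus the cents portion of the amount
train["TransactionAmt_log"] = np.log1p(train["TransactionAmt"])
train["TransactionAmt_cents"] = train["TransactionAmt"] - np.floor(train["TransactionAmt"])
```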
Engineering on the anonymized V-columns was another step in this process. Considering that the V-columns occupied most of our data and were engineered by Vesta themselves, we relied on the trends in those values to guide imputation.
Another very important part of our project was dimensionality reduction. In our efforts towards this we attempted the following:
- PCA
- Lasso
- Sparse PCA
It was essential to run PCA before balancing, so that the synthetic values created by oversampling wouldn't be influenced by uninformative features.
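A minimal sketch of the PCA step on the V columns, assuming the Plan B fill has already made them fully numeric; the standard scaling and the 95% explained-variance target are illustrative choices:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Restrict PCA to the anonymized V columns, which dominate the feature count
v_cols = [c for c in train.columns if c.startswith("V")]

# PCA is scale-sensitive, so standardize first (assumes missing values already filled)
v_scaled = StandardScaler().fit_transform(train[v_cols])

# Keep enough components to explain ~95% of the variance (illustrative threshold)
pca = PCA(n_components=0.95)
v_pca = pca.fit_transform(v_scaled)
print(v_pca.shape)
```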
We also implemented balancing techniques to address the class imbalance shown earlier, namely random oversampling of the minority class and SMOTE. SMOTE, which stands for Synthetic Minority Oversampling Technique, creates synthetic minority samples by interpolating between a data point and its nearest neighbors, judged by Euclidean distance. A visual example of both techniques is shown below:
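In code, both techniques can be sketched with the imbalanced-learn package; X_train and y_train stand for the post-PCA training split and its 'isFraud' labels (names assumed for illustration):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling: duplicate minority (fraud) rows until classes are balanced
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# SMOTE interpolates between a minority point and its nearest neighbors
# (Euclidean distance) to create synthetic fraud examples
smote = SMOTE(k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
```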
After this we ran our models, XGBoost and LightGBM. Here are our scores for both of the models:
- XGBoost
- Accuracy: 0.98
- Precision: 0.75
- Recall: 0.45
- LightGBM
- AUC: 0.972328
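A hedged sketch of how the two models can be trained and scored; the hyperparameters are illustrative rather than our tuned values, and X_valid / y_valid denote an assumed held-out validation split:

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# XGBoost on the balanced training data, scored on the held-out validation split
xgb_model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, eval_metric="auc")
xgb_model.fit(X_smote, y_smote)
xgb_pred = xgb_model.predict(X_valid)
print(accuracy_score(y_valid, xgb_pred),
      precision_score(y_valid, xgb_pred),
      recall_score(y_valid, xgb_pred))

# LightGBM, scored by AUC on predicted fraud probabilities
lgb_model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
lgb_model.fit(X_smote, y_smote)
print(roc_auc_score(y_valid, lgb_model.predict_proba(X_valid)[:, 1]))
```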
Submissions were evaluated on the area under the ROC curve between the predicted probability and the observed target. A graphical example made on sample data is shown below:
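A sketch of how such a ROC plot can be produced from predicted probabilities, reusing the LightGBM model from the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Predicted fraud probabilities on the validation split
proba = lgb_model.predict_proba(X_valid)[:, 1]

fpr, tpr, _ = roc_curve(y_valid, proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_valid, proba):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```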
Future work, meaning what we would have pursued if we had more time to work on this project before this presentation:
- We would attempt to utilize cloud computing services:
- Before oversampling and feature engineering, our workspace could only handle so much before running into memory errors
- The amount of data also made grid-searching impossible unless only considering a very limited range of hyperparameters
- We even made a memory reduction function just to do our exploratory data analysis (see the sketch after this list)
- Further feature engineering to help deal with the sparsity of our data
- Proper optimization techniques
- GridSearchCV, bayes_opt, GPyOpt, stratified KFold
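The memory reduction function mentioned above followed the common Kaggle downcasting pattern; this is a sketch of that pattern rather than our exact implementation:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their value range."""
    start_mb = df.memory_usage().sum() / 1024 ** 2
    for col in df.select_dtypes(include="number").columns:
        col_min, col_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            for dtype in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(dtype).min <= col_min and col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # float16 can lose precision on engineered features; float32 is a safer default
            df[col] = df[col].astype(np.float32)
    end_mb = df.memory_usage().sum() / 1024 ** 2
    print(f"Memory reduced from {start_mb:.1f} MB to {end_mb:.1f} MB")
    return df
```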
In conclusion, this was a very fun and challenging project that we believe we did very well on given our time constraints. We learned a lot about fraud detection especially when dealing with credit card transactions. We enjoyed this competition and its valuable experience will help us in future projects.