NYC Data Science Academy | Blog

Data Visualization on Fraud Detection with Vesta Corporation

Justin L. Ng
Posted on Sep 3, 2019
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Data shows payment fraud costs merchants eight percent of revenue annually. In businesses such as retail where margins are razor-thin, fraud can be a real killer and, with the proliferation of digital commerce, grows perniciously more sophisticated over time.

Card fraud is typically addressed via anomaly detection, which relies on user profiling to determine baseline behaviors. With all the details of a consumer's spending habits, card issuers can identify patterns by grouping transactions, for instance, by locale, type, and frequency. On finding an anomaly, the bank can intervene, but only if a threshold is crossed. The anomaly may not be sufficiently exigent, after all, if the expected costs of intervention outweigh the expected gains. This is best seen in the light of market segmentation:

A small card transaction followed shortly by a large one commonly triggers an alert. The issuing bank sees that a municipal one-dollar parking meter is charged just before the customer walks into a store to buy a $300 pair of shoes. Nevertheless, operating under conditions of relative information opacity, the bank must suspect the pattern could be a fraudster pinging the card for usability prior to extracting $300.

Intervention is decided easily if the charge was on a mundane credit card product by a consumer who infrequently purchases shoes. The card is frozen and a text message is sent to the card's owner. Alternatively, for a premium card product, the risk of inconveniencing a high-spender who expects a seamless buying experience is enough for the bank to swallow the potential loss.

Out of this story of consumer profiling comes an exciting, recent direction for fraud detection, based on digitised records and social media: network analysis. By tracking relational links between known individuals and the communities they traffic in, banks and businesses can identify patterns of fraud as those patterns evolve.

Objective

The dataset prepared by Vesta Corporation for this project, unfortunately, permits none of these techniques. It has been completely anonymised for privacy. As the principal provider of this data, only Vesta possesses the information to reconcile our numerical results with operational anti-fraud measures. We must content ourselves with the purely computational models we are limited to building.

So we proceed.

 

Data Set

Vesta presents a supervised learning question: given a training dataset of transactions that have been pre-labelled as genuine or fraudulent, how well can we build a model to predict fraud on a designated test set?

We are given a training set split into two CSV files. The first contains 590,000 observations of 394 transactional features; the second, 144,000 observations of 41 meta-identity features.

The transactional features span a broad gamut. Individual columns include the transaction amount, the time of transaction from some unspecified reference date, the product code for that transaction, and the purchaser and seller email domains.

Grouped features are divided into: 6 columns of unspecified card information, 14 direct counts of unspecified meaning, 15 unspecified time deltas, 9 matching indicators of unknown but purportedly corresponding transaction details, and 339 meta-features engineered by Vesta directly.

The identity features refer broadly to digital signatures and network details collected from the transactions. Some of these details are evident, as where mobile device types are listed; but most are unclear.

The test set is similarly split into transaction and identity files of similar size to the training files, omitting only the "answers" column specifying fraud. Predicting that column is our task.

 

Notes on Exploratory Data Analysis

Given the degree to which these features have been anonymised, we begin by treating each observation as its own independent transaction by its own individual purchaser. 

Examining our target variable, we see a clearly imbalanced distribution. 


Fraud makes up fewer than 3.5% of our total transactions. This disproportion is typical in fraud detection, and is one of the difficulties of building fraud detection models, as the scarcity of targets makes fraud harder to characterise definitively.

Logarithmically transforming our transaction values gives a nearly normal distribution.

This is a nice thing to have and reflects the way in which our original data clusters around low valued-transactions, with high values in the thousands being clear outliers.
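The transform above can be sketched in a few lines of pandas; the toy amounts below are illustrative, not drawn from the Vesta data:

```python
import numpy as np
import pandas as pd

# Toy transaction amounts clustered at low values, with an outlier in the thousands.
amounts = pd.Series([5.0, 20.0, 35.0, 50.0, 75.0, 120.0, 3000.0])

# log1p handles zero amounts safely and compresses the heavy right tail,
# pulling high-valued outliers in toward the bulk of the distribution.
log_amounts = np.log1p(amounts)

print(log_amounts.round(2).tolist())
```

After the transform the $3,000 outlier sits within a single order of magnitude of the typical transaction, which is what makes the distribution look nearly normal.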

The rest of our features are meaninglessly opaque. A brute-force approach would scope through the distributions of each regardless, but we opt instead to look for meta-patterns within our meta-features. Three arenas seem promising.

1) High-Risk Values Per Feature

Some values turn out to be highly fraudulent. A transparent sample is the merchant-side email domain 'protonmail.com'. Since this is an end-to-end encrypted email service, it is no surprise that 95% of transactions to this domain prove fraudulent.

A brief search across our feature space yields 232 columns, or more than 50% of our total features, containing at least one highly fraudulent value. A further breakdown:

From this graphic, two prominent feature typologies come to mind. First, it is eminently sensible for fraudsters to repeat tactics within features known to be "safe" or essential to consumer welfare. We therefore want to identify features with large numbers of fraudulent values that might make up relatively low percentages of all possible values.

If we restrict ourselves to >500 high-risk values per feature:

We obtain a list of features that fit that profile. Some have relatively higher levels of risk, but the nearly 9,000 noted values in id_02, for instance, make up fewer than 7% of 115,000 total unique values.

High Risk Values

Second, features dominated by their high-risk values should be naturally suspect. Consider features whose unique values are majority high-risk:

These features all have relatively few unique values, typically fewer than 100.

Altogether we generate a list of features that we suspect to have greater relevance, on which we may perform specific analyses.
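A scan of this kind can be sketched with a pandas groupby; the column names below follow the competition's `P_emaildomain`/`isFraud` convention, and the thresholds are illustrative:

```python
import pandas as pd

# Toy frame: one categorical feature plus the fraud label (1 = fraud).
df = pd.DataFrame({
    "P_emaildomain": ["gmail.com"] * 6 + ["protonmail.com"] * 4,
    "isFraud":       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

def high_risk_values(frame, feature, target="isFraud", rate=0.5, min_count=3):
    """Values of `feature` whose fraud rate exceeds `rate`, with enough support."""
    stats = frame.groupby(feature)[target].agg(["mean", "count"])
    flagged = stats[(stats["mean"] > rate) & (stats["count"] >= min_count)]
    return flagged.index.tolist()

print(high_risk_values(df, "P_emaildomain"))
```

Looping this function over every categorical column, and counting rows that return a non-empty list, reproduces the kind of feature census described above.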

2) Time Variation

The presence of a reference time per transaction gives us reason to suspect some temporal variation in at least some features. Which features these are, we cannot say without naively examining the distribution of each feature over time for trend and seasonality. But a cursory glance at fraudulent activity per hour of day, aggregated over our time period of six months, reinforces these suspicions:

Common sense tells us also that fraud should rise and fall with features that reveal vulnerabilities over time, particularly as software is patched or as customer support ceases for outdated software.

3) Null Spaces

The examination of null values is a basic element of any exploratory analysis.

Null values are widespread, particularly as we combine the transaction and identity CSV files. We attribute this to the limitations of Vesta's data collection capabilities.

The distribution of nulls across all 436 features yields a major insight. Reading the graph like a contour plot, our eyes are drawn to the edge of the shape generated by the null values. What is immediately obvious is the incredible uniformity of null spaces across groups of neighboring features.

This raises two questions for continuing investigation: to what extent are features within these groups correlated? Are there, perhaps, powerful interactions among in-group variables (or, alternately, out-group variables) that would boost our predictions? 

Concluding Our Notes

We present three general themes for exploratory analysis, each of which constructs better-informed avenues of pursuit. At the end of the day, however, we are limited by the opacity of our features if we desire intelligent, precise feature-engineering. Our analysis does little to relieve the advantage of sheer computational power in blindly generating, and testing, features for predictive relevance.

 

Building the Model

We require a model that performs reasonably well regardless of null values, imbalanced data, or nonlinearity in its features. Each of these elements, by itself, normally wants formidable pre-processing. A judicious choice of model will largely obviate that need.

We thus implement XGBoost, a tree model that fulfills the above requirements. 

1) Feature Engineering

Without resorting to blind statistical testing, we engineer three simple features: whether buyer and seller email domains match, the hour of day, and the day of week derived from the time-delta reference.
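These three features can be sketched as follows; the column names follow the competition's convention (`TransactionDT` is seconds elapsed since an unspecified reference time), and the toy rows are illustrative:

```python
import pandas as pd

# Toy transactions in the competition's column-naming convention.
df = pd.DataFrame({
    "TransactionDT": [86400, 90000, 172800 + 3600 * 15],
    "P_emaildomain": ["gmail.com", "yahoo.com", "protonmail.com"],
    "R_emaildomain": ["gmail.com", "aol.com",   "protonmail.com"],
})

# 1) Do purchaser and recipient email domains match?
df["email_match"] = (df["P_emaildomain"] == df["R_emaildomain"]).astype(int)

# 2) Hour of day, from the running seconds counter.
df["hour"] = (df["TransactionDT"] // 3600) % 24

# 3) Day of week (0-6), relative to the unspecified reference date.
df["day_of_week"] = (df["TransactionDT"] // 86400) % 7

print(df[["email_match", "hour", "day_of_week"]])
```

Since the reference date is unknown, `day_of_week` captures only a consistent weekly cycle, not calendar weekdays; that is sufficient for a tree model to exploit.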

2) Encoding

To run our model, our inputs must be completely numeric. We convert our string columns into numerics via the following strategy:

Findings

The features with only two unique values, particularly in True/False binary, are easily labelled as 1 or 0. We could label the intermediate features, with more than two values but fewer than ten, as [0, 1, 2, 3, ... ]. But that implicitly introduces an ordering relationship between the values within that feature, as though one value were "greater" or "lesser" than another.

We dare not make that assumption and choose to one-hot encode them instead. Each unique value now has its own column and a binary scheme to indicate its presence or absence in each observation.
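One-hot encoding is a one-liner in pandas; the `card_type` feature below is a hypothetical stand-in for any of our intermediate columns:

```python
import pandas as pd

# A feature with a handful of unordered categories.
df = pd.DataFrame({"card_type": ["visa", "mastercard", "visa", "discover"]})

# One indicator column per unique value; no ordering is implied between values.
encoded = pd.get_dummies(df, columns=["card_type"], dtype=int)
print(encoded)
```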

Whereas our data set of 590,000 observations can handle the expansion of additional columns from those intermediate variables, for our features with dozens or hundreds of unique values, one-hot encoding becomes untenable. It is simply too dangerous to expand the feature space recklessly in relation to the size of the dataset. 

We therefore apply target encoding, which preserves dimensionality by replacing each value with the average of the target across all observations carrying that value. So if there are ten instances of "red", eight belonging to fraudulent transactions and two to genuine ones, the encoding for "red" would be 8 of 10, or 0.8.

The convenience of this procedure is having a richer set of encoded values that are now relationally adjusted to the target. Our model might, however, overfit the training sample and perform poorly on the test. We mitigate this in two ways: by adding an element of Gaussian noise to the encoding procedure and by leaving out the current observation when computing the replacement value.
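A leave-one-out target encoder with noise can be sketched in a few lines of pandas; the toy frame mirrors the "red" example above, and the column names and noise scale are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Ten "red" rows (eight fraudulent) and five "blue" rows (one fraudulent).
df = pd.DataFrame({
    "color":   ["red"] * 10 + ["blue"] * 5,
    "isFraud": [1] * 8 + [0] * 2 + [0] * 4 + [1],
})

# Leave-one-out: for each row, average the target over all *other* rows
# sharing the same value, so a row never sees its own label.
grp = df.groupby("color")["isFraud"]
sums, counts = grp.transform("sum"), grp.transform("count")
loo = (sums - df["isFraud"]) / (counts - 1)

# A dash of Gaussian noise further discourages memorising the encoding.
df["color_enc"] = loo + rng.normal(0.0, 0.01, len(df))
```

Note that a fraudulent "red" row is encoded as 7/9 rather than 0.8, precisely because its own label has been excluded from the average.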

3) Time-Series Cross-Validation

Patterns of fraud in our data are likely to shift over time. Splitting our training data into multiple, equally-sized components, each of which 'rotates' in its turn as the validation set on which our model's parameters are tuned, is highly inappropriate due to temporal mismatch.

We use, instead, a nested cross-validation that creates successively larger windows of training data and sequential validation parts to tune on.
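scikit-learn's `TimeSeriesSplit` implements exactly this expanding-window scheme; the sketch below uses twelve dummy observations in time order (in practice the classifier tuned on each fold would be our XGBoost model):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations ordered by time; each fold trains on an expanding
# window and validates on the block that immediately follows it.
X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Validation indices always come strictly after every training index,
    # so the model never tunes on transactions from its own future.
    print(fold, train_idx.tolist(), val_idx.tolist())
```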

Results of the Model

After tuning our parameters, we achieve an in-sample accuracy of 98.9% on our training dataset. Our baseline model achieves a score of 0.761 ROC-AUC on the test data, where 1.0 indicates perfect classification and 0.5 indicates no better than chance. There is still considerable room for improvement.

Of particular interest to our corporate sponsor will be our model's ranking of features.

A caveat: there is always some element of chance in the ranking of feature importances. To be safe, we assess only the top five features and note that they are all V-features. We can thus report to Vesta with V258, V201, V149, V156, and V257 in hand, each of which they may deconstruct in their top-secret laboratories, and assemble for themselves the appropriate operational anti-fraud measures.

The code for this project can be viewed here.

 

Future Work

There are a few directions in which our model could continue to improve. We suspect, firstly, that our hyper-parameters are not yet optimally tuned and would devote more computational resources to that. We might also devise means of tracking pattern shifts over time or, failing that, manually identify them ourselves.

Although XGBoost performs relatively well with imbalanced data and null values, the present literature suggests it would perform even better if the issues were addressed. We could oversample our minority class of fraudulent values to provide a more balanced dataset. Synthetic Minority Oversampling Technique (SMOTE) is standard for this task.
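SMOTE itself lives in the imbalanced-learn package (`imblearn.over_sampling.SMOTE`); as a library-free illustration of the balancing idea, here is plain random oversampling of the minority class with NumPy. The toy data and 3.5% fraud rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily imbalanced labels: about 4% positives, as in the Vesta data.
y = np.array([0] * 96 + [1] * 4)
X = np.arange(100).reshape(-1, 1)

# Random oversampling: resample minority rows (with replacement) until both
# classes are equal. SMOTE instead synthesises *new* minority points by
# interpolating between a minority sample and its nearest neighbours.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=(y == 0).sum() - len(minority), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print((y_bal == 0).sum(), (y_bal == 1).sum())
```

Oversampling must be applied only inside each training fold, never before the time-series split, or minority duplicates will leak into the validation sets.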

To impute our null values, we might consider variants of SMOTE, which is itself based on K-nearest neighbors. Alternatively, we would look into variational auto-encoders (VAEs). These are neural nets with dimensional bottlenecks that, by learning to reconstruct their inputs from reduced data, discover the underlying characteristics of those inputs.

A generative adversarial network could then be applied, between a VAE that attempts to impute values faithfully to the larger dataset and a discriminative model that attempts to identify whether a value has been artificially imputed or not. Iteratively, this should produce better and better imputations.

Beyond that, we would consider directly augmenting our baseline model. That would open a whole new world, for us, of stacking our XGBoost with neural nets โ€” VAEs first in our line-up.

About Author

Justin L. Ng

The author is an enthusiast of data-driven decision-making. He is a graduate of Rice University, where he studied engineering and economics, and of the Collegiate School.
