Data Visualization on Fraud Detection with Vesta Corporation
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
Payment fraud is estimated to cost merchants roughly eight percent of revenue annually. In businesses such as retail, where margins are razor-thin, fraud can be a real killer and, with the proliferation of digital commerce, it grows perniciously more sophisticated over time.
Card fraud is typically addressed via anomaly detection, which relies on user profiling to determine baseline behaviors. With all the details of a consumer's spending habits, card issuers can identify patterns by grouping transactions, for instance, by locale, type, and frequency. On finding an anomaly, the bank can intervene, but only if a threshold is crossed. The anomaly may not be sufficiently exigent, after all, if the expected costs of intervention outweigh the expected gains. This is best seen in the light of market segmentation:
A small card transaction followed shortly by a large one commonly triggers an alert. The issuing bank sees that a municipal one-dollar parking meter is charged just before the customer walks into a store to buy a $300 pair of shoes. Nevertheless, operating under conditions of relative information opacity, the bank must suspect the pattern could be a fraudster pinging the card for usability prior to extracting $300.
Intervention is decided easily if the charge was on a mundane credit card product by a consumer who infrequently purchases shoes. The card is frozen and a text message is sent to the card's owner. Alternatively, for a premium card product, the risk of inconveniencing a high-spender who expects a seamless buying experience is enough for the bank to swallow the potential loss.
Out of this story of consumer profiling comes an exciting, recent direction for fraud detection, based on digitised records and social media: network analysis. By tracking relational links between known individuals and the communities they traffic in, banks and businesses can identify patterns of fraud as those patterns evolve.
Objective
The dataset prepared by Vesta Corporation for this project, unfortunately, permits none of these techniques. It has been completely anonymised for privacy. As the principal provider of this data, only Vesta possesses the information to reconcile our numerical results with operational anti-fraud measures. We must content ourselves with the purely computational models that we are limited to building.
So we proceed.
Data Set
Vesta presents a supervised learning question: given a training dataset of transactions that have been pre-labelled as genuine or fraudulent, how well can we build a model to predict fraud on a designated test set?
We are given a training set split into two CSV files. The first contains 590,000 observations of 394 transactional features; the second, 144,000 observations of 41 meta-identity features.
The transactional features span a broad gamut. Individual columns include the transaction amount, the time of transaction from some unspecified reference date, the product code for that transaction, and the purchaser and seller email domains.
Grouped features are divided into: 6 columns of unspecified card information, 14 direct counts of unspecified meaning, 15 unspecified time deltas, 9 matching indicators of unknown but purportedly corresponding transaction details, and 339 meta-features engineered by Vesta directly.
The identity features refer broadly to digital signatures and network details collected from the transactions. Some of these details are evident, as where mobile device types are listed; but most are unclear.
The test set is split into transaction and identity files of similar size to the training files, omitting only the "answers" column specifying fraud. Predicting that column is our task.
Notes on Exploratory Data Analysis
Given the degree to which these features have been anonymised, we begin by treating each observation as its own independent transaction by its own individual purchaser.
Examining our target variable, we see a clearly imbalanced distribution.
Fraud makes up fewer than 3.5% of our total transactions. This disproportion is typical of fraud data and is one of the difficulties of building fraud detection models: the scarcity of targets makes fraud harder to characterise definitively.
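As a quick sanity check, the imbalance can be read straight from the label column. A minimal sketch, assuming the Kaggle file name train_transaction.csv and the label column isFraud from the data dictionary:

```python
import pandas as pd

# Load the transactional training file and inspect the class balance.
# File and column names (train_transaction.csv, isFraud) are assumed
# from the Kaggle data dictionary.
train = pd.read_csv("train_transaction.csv")
print(train["isFraud"].value_counts(normalize=True))
# The fraudulent class (1) accounts for well under 5% of the rows.
```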
Logarithmically transforming our transaction values gives a nearly normal distribution.
This is a nice thing to have, and it reflects the way in which our original data clusters around low-valued transactions, with high values in the thousands being clear outliers.
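A sketch of that transformation, assuming the amount column is named TransactionAmt and reusing the `train` DataFrame loaded above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Log-transform the transaction amounts; log1p guards against zero values.
log_amt = np.log1p(train["TransactionAmt"])

plt.hist(log_amt, bins=50)
plt.xlabel("log(1 + TransactionAmt)")
plt.ylabel("count")
plt.show()
```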
The rest of our features are meaninglessly opaque. A brute-force approach would comb through the distributions of each regardless, but we opt instead to look for meta-patterns within our meta-features. Three arenas seem promising.
1) High-Risk Values Per Feature
Some values turn out to be highly fraudulent. A transparent example is the merchant-side email domain 'protonmail.com'. Since this is an end-to-end encrypted email service, it is no surprise that 95% of transactions to this domain turn out to be fraudulent.
A brief search across our feature space yields 232 columns, or more than 50% of our total features, containing at least one highly fraudulent value. A further breakdown:
From this graphic, two prominent feature typologies come to mind. First, it is eminently sensible for fraudsters to repeat tactics within features known to be "safe" or essential to consumer welfare. We therefore want to identify features with large numbers of fraudulent values that might make up relatively low percentages of all possible values.
If we restrict ourselves to features with more than 500 high-risk values each:
We obtain a list of features that fit that profile. Some have relatively higher levels of risk, but the nearly 9,000 noted values in id_02, for instance, make up fewer than 7% of 115,000 total unique values.
Second, features dominated by their high-risk values should be naturally suspect. Consider features whose unique values are majority high-risk:
These features all have relatively few unique values, typically fewer than 100.
Altogether we generate a list of features that we suspect to have greater relevance, on which we may perform specific analyses.
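A sketch of how such a scan might be written, assuming the isFraud label and TransactionID key; the 50% fraud-rate cut-off and the minimum support of 10 observations are illustrative thresholds, not our exact settings:

```python
# Scan every feature for "high-risk" values: unique values whose observed
# fraud rate exceeds a chosen cut-off, reusing the `train` DataFrame.
FRAUD_RATE_CUTOFF = 0.5
MIN_SUPPORT = 10

high_risk = {}
for col in train.columns.drop(["TransactionID", "isFraud"]):
    stats = train.groupby(col)["isFraud"].agg(["mean", "size"])
    risky = stats[(stats["mean"] > FRAUD_RATE_CUTOFF) & (stats["size"] >= MIN_SUPPORT)]
    if len(risky) > 0:
        high_risk[col] = list(risky.index)

print(f"{len(high_risk)} features contain at least one high-risk value")
```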
2) Time Variation
The presence of a reference time per transaction gives us reason to suspect temporal variation in at least some features. Which features these are, we cannot say without naively examining the distribution of each feature over time for trend and seasonality. But a cursory glance at fraudulent activity per hour of day, aggregated over our six-month time period, reinforces these suspicions:
Common sense tells us also that fraud should rise and fall with features that reveal vulnerabilities over time, particularly as software is patched or as customer support ceases for outdated software.
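For reference, a minimal sketch of the hourly aggregation above, assuming TransactionDT is a time delta measured in seconds from the unspecified reference date:

```python
# Derive an hour-of-day from the raw time delta and aggregate the fraud
# rate per hour of day across the whole training period.
train["hour"] = (train["TransactionDT"] // 3600) % 24
print(train.groupby("hour")["isFraud"].mean())
```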
3) Null Spaces
The examination of null values is a basic element of any exploratory analysis.
Null values are widespread, particularly as we combine the transaction and identity CSV files. We attribute this to the limitations of Vesta's data collection capabilities.
The distribution of nulls across all 436 features yields a major insight. Reading the graph like a contour plot, our eyes are drawn to the edge of the shape generated by the null values. What is immediately obvious is the incredible uniformity of null spaces across groups of neighboring features.
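A sketch of how this null survey might be reproduced, assuming the Kaggle file names and the TransactionID join key:

```python
import pandas as pd

# Merge the two training files and survey missingness per column.
identity = pd.read_csv("train_identity.csv")
full = train.merge(identity, on="TransactionID", how="left")

null_frac = full.isnull().mean().sort_values(ascending=False)
print(null_frac.head(20))
# Plotting null_frac over all columns reveals the blocks of neighboring
# features that share nearly identical null patterns.
```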
This raises two questions for continuing investigation: to what extent are features within these groups correlated? Are there, perhaps, powerful interactions among in-group variables (or, alternately, out-group variables) that would boost our predictions?
Concluding Our Notes
We present three general themes for exploratory analysis, each of which constructs better-informed avenues of pursuit. At the end of the day, however, we are limited by the opacity of our features if we desire intelligent, precise feature engineering. Our analysis does little to diminish the advantage of sheer computational power in blindly generating and testing features for predictive relevance.
Building the Model
We require a model that performs reasonably well regardless of null values, imbalanced data, or nonlinearity in its features. Each of these elements, by itself, normally demands formidable pre-processing. A judicious choice of model will largely obviate that need.
We thus implement XGBoost, a tree model that fulfills the above requirements.
1) Feature Engineering
Without resorting to blind statistical testing, we engineer three simple features: whether buyer and seller email domains match, an hourly time-of-day, and a day-of-week according to the time-delta reference.
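A sketch of these three features, assuming the column names P_emaildomain, R_emaildomain, and TransactionDT (in seconds) from the data dictionary, and reusing the merged `full` DataFrame built earlier:

```python
# Engineer the email-match flag and the two calendar features.
full["email_match"] = (full["P_emaildomain"] == full["R_emaildomain"]).astype(int)
full["hour_of_day"] = (full["TransactionDT"] // 3600) % 24
full["day_of_week"] = (full["TransactionDT"] // (3600 * 24)) % 7
```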
2) Encoding
To run our model, our inputs must be completely numeric. We convert our string columns into numerics via the following strategy:
The features with only two unique values, particularly True/False binaries, are easily labelled as 1 or 0. We could label the intermediate features, with more than two values but fewer than ten, as [0, 1, 2, 3, ...]. But that implicitly introduces an ordering relationship between the values within that feature, as though one value were "greater" or "lesser" than another.
We dare not make that assumption and choose to one-hot encode them instead. Each unique value now has its own column and a binary scheme to indicate its presence or absence in each observation.
While our data set of 590,000 observations can absorb the additional columns expanded from those intermediate variables, one-hot encoding becomes untenable for features with dozens or hundreds of unique values. It is simply too dangerous to expand the feature space recklessly relative to the size of the dataset.
We therefore apply target encoding, which preserves dimensionality by replacing each value with the average target outcome observed for that value. So if there are ten instances of "red", eight belonging to fraudulent transactions and two to genuine ones, the encoding for "red" would be 8 out of 10, or 0.8.
The convenience of this procedure is a richer set of encoded values that now carry information about the target. Our model might, however, overfit the training sample and perform poorly on the test. We mitigate this in two ways: by adding an element of Gaussian noise to the encoding procedure and by leaving out the current observation when computing its replacement value.
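A sketch of that leave-one-out encoding with additive noise; the function signature, noise level, and fallback behaviour are illustrative choices rather than our exact implementation:

```python
import numpy as np

def leave_one_out_encode(df, col, target="isFraud", noise_std=0.01, seed=0):
    """Leave-one-out target encoding with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    grp = df.groupby(col)[target]
    sums = grp.transform("sum")
    counts = grp.transform("count")
    # Exclude each row from its own encoding; 0/0 for single-occurrence
    # values yields NaN, which we replace with the global fraud rate.
    loo = (sums - df[target]) / (counts - 1)
    loo = loo.fillna(df[target].mean())
    return loo + rng.normal(0.0, noise_std, size=len(df))

# Example: encode the purchaser email domain on the training set.
full["P_emaildomain_te"] = leave_one_out_encode(full, "P_emaildomain")
```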
3) Time-Series Cross-Validation
Patterns of fraud in our data are likely to shift over time. Splitting our training data into multiple, equally-sized folds, each of which 'rotates' in its turn as the validation set on which our model's parameters are tuned, is inappropriate here: a standard rotation would tune the model on validation data drawn from earlier in time than some of its training data.
We use, instead, a nested cross-validation that creates successively larger windows of training data and sequential validation parts to tune on.
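One way to express this scheme is with scikit-learn's TimeSeriesSplit, which produces exactly such expanding training windows with sequential validation blocks; the five splits below are an illustrative choice:

```python
from sklearn.model_selection import TimeSeriesSplit

# Expanding-window splits over time-ordered data: each fold trains on a
# growing prefix and validates on the block that follows it.
ordered = full.sort_values("TransactionDT").reset_index(drop=True)
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(tscv.split(ordered)):
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, "
          f"validate rows {valid_idx[0]}-{valid_idx[-1]}")
```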
Results of the Model
After tuning our parameters, we achieve an in-sample accuracy of 98.9% on our training dataset. Our baseline model achieves a score of 0.761 ROC-AUC on the test data, where 1.0 indicates perfect classification and 0.5 indicates no classification ability at all. There is still considerable room for improvement.
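For context, a minimal training sketch; the hyper-parameter values are placeholders rather than our tuned settings, and X_train, y_train, X_valid, y_valid stand in for the encoded feature matrices and labels produced by the steps above:

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Fit a baseline XGBoost classifier and score it on a held-out fold.
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="auc",
)
model.fit(X_train, y_train)
valid_pred = model.predict_proba(X_valid)[:, 1]
print("validation ROC-AUC:", roc_auc_score(y_valid, valid_pred))
```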
Of particular interest to our corporate sponsor will be our model's ranking of features.
A caveat: there is always some element of chance in the ranking of feature importances. To be safe, we assess only the top five features and note that they are all V-features. We can thus report to Vesta with V258, V201, V149, V156, and V257 in hand, each of which they may deconstruct in their top-secret laboratories, and assemble for themselves the appropriate operational anti-fraud measures.
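Extracting that ranking from a fitted model is straightforward; a short sketch, reusing the hypothetical `model` and `X_train` from the training sketch above:

```python
import pandas as pd

# Rank features by the importance scores XGBoost assigns during training.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
```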
The code for this project can be viewed here.
Future Work
There are a few directions in which our model could continue to improve. We suspect, firstly, that our hyper-parameters are not yet optimally tuned, and we would devote more computational resources to that. We might also devise means of tracking pattern shifts over time or, failing that, identify them manually ourselves.
Although XGBoost performs relatively well with imbalanced data and null values, the present literature suggests it would perform even better if the issues were addressed. We could oversample our minority class of fraudulent values to provide a more balanced dataset. Synthetic Minority Oversampling Technique (SMOTE) is standard for this task.
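A sketch of how SMOTE might be applied with the imbalanced-learn library; the sampling ratio and the crude -999 fill for nulls (SMOTE requires a fully numeric, non-null matrix) are illustrative choices, with X_train and y_train as above:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample the minority (fraud) class to half the size of the majority.
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_res, y_res = smote.fit_resample(X_train.fillna(-999), y_train)
print(pd.Series(y_res).value_counts(normalize=True))
```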
To impute our null values, we might consider variants of SMOTE, which is itself based on k-nearest neighbors. Alternatively, we could look into variational auto-encoders (VAEs). These are neural nets with dimensional bottlenecks that, by learning to reconstruct their inputs from reduced representations, discover the underlying characteristics of those inputs.
A generative adversarial network could then be applied, between a VAE that attempts to impute values faithfully to the larger dataset and a discriminative model that attempts to identify whether a value has been artificially imputed or not. Iteratively, this should produce better and better imputations.
Beyond that, we would consider directly augmenting our baseline model. That would open a whole new world, for us, of stacking our XGBoost with neural nets, with VAEs first in our line-up.