Machine Learning Techniques for Fraud Analytics, Part 1
Posted February 7, 2018
Fraud analytics is an endless game of cat and mouse, but machine learning just might be the tool to help fraud professionals win this game.
In the financial services world, fraudsters must be faster and smarter than the slowest bank to be “quids in”. And a bank must be better than the fraudster to avoid being a victim.
Analytics and data science play a pivotal role in this. However, a troubling issue that banks often face is the bridge between the data scientist and the fraud analyst: one really understands statistics while the other understands fraud.
The best fraud professionals are those that think like hackers or fraudsters themselves and have the requisite knowledge of machine learning to be able to include the fraud mitigation strategies necessary to stay one step ahead.
Finding the Right Tools for the Job
There are many different machine learning techniques with advantages, disadvantages, and varying degrees of complexity. Let’s examine this from a modus operandi perspective to pick the right tools for the right job.
One of the key challenges in fraud, compared with other risk and marketing applications of machine learning, is the scarcity of ‘bad’ events available. Fraud data scientists need to take special care with sampling and overfitting (where the model only works on the training dataset and fails when put in the wild). Another key point upfront is the maxim ‘rubbish in, rubbish out’: a machine learning model is only as good as its data input and the strength of the bad definition. This feeds into a “best practice” 80/20 rule, which effectively recommends that a data scientist should spend 80 percent of his or her time getting the data ready and 20 percent doing the modelling.
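To make the sampling and overfitting point concrete, here is a minimal sketch using scikit-learn (the library choice and the synthetic data are my assumptions, not from the post). It generates an imbalanced dataset where fraud is rare, then uses a stratified split so the holdout set keeps the same fraud rate and gives a fair overfitting check:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical illustration: fraud ('bad') events are roughly 1% of rows.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=0
)

# A stratified split preserves the rare fraud rate in both sets, so the
# test set remains a fair check against overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_rate = y_train.mean()
test_rate = y_test.mean()
```

Without `stratify=y`, a random split of such imbalanced data can leave the test set with too few fraud cases to evaluate the model meaningfully.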
When the Fraudster is Your Customer
First-party fraud is where the fraudster is also a customer of the bank. Generally, the fraudster lies about his or her situation or elements of his or her identity in order to, for example, maximize a loan agreement before busting out and disappearing.
Another branch of first-party fraud is money mules, who receive money from an online account takeover and exit funds for the fraudster, or use their own accounts to circulate funds for money laundering. Statistical analysis of these two populations suggests they share very similar social demographics, so they can be combined into a single bad definition to build stronger machine learning models.
Machine learning is ideal for this situation as the features used tend to be of the static variety, including time in the country, gender, job, where they live and more. Supervised learning is generally used here: historical data feeds a classification model that then classifies future incidents into “fraud” vs. “good” classes.
Fraud professionals should consider the following machine learning techniques when the fraudster or mule is your actual customer:
Logistic Regression – Once a bank has sufficient fraud data volume, a regression model can be built to predict fraud. The model assigns each customer a score that separates fraudulent customers from genuine ones. Logistic regression is also often used in credit risk to determine bad debt. Regularization techniques such as LASSO or Elastic Net can be used to reduce features and handle correlated variables; of the two, Elastic Net often performs better.
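As a sketch of how this might look in practice (scikit-learn and the synthetic features are my assumptions; real models would use attributes like time in country or employment data), logistic regression with an Elastic Net penalty can be fit and used to score customers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for static customer features.
X, y = make_classification(n_samples=2_000, n_features=8,
                           weights=[0.9, 0.1], random_state=1)

# Elastic Net combines L1 (drops weak features) and L2 (handles correlated
# features); l1_ratio=0.5 weights the two penalties equally.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5_000)
model.fit(X, y)

# Each customer gets a fraud score: the predicted probability of 'bad'.
scores = model.predict_proba(X)[:, 1]
```

The fitted coefficients (`model.coef_`) are directly inspectable, which is what makes this a transparent, explainable technique.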
Decision Trees – These have been used in banks for years for varying fraud problems. Here, the data is partitioned into subsets. The partitioning process starts with a binary split and continues until no further splits can be made. Various branches of variable length are formed. The goal of a decision tree is to encapsulate the training data in the smallest possible tree. There are various methods and techniques to control the depth or “prune the tree”.
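A minimal sketch of depth control and pruning (scikit-learn and the synthetic data are my assumptions): an unconstrained tree keeps splitting until it memorises the training data, while `max_depth` and cost-complexity pruning (`ccp_alpha`) keep the tree small, per the goal stated above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=8,
                           weights=[0.9, 0.1], random_state=2)

# Unconstrained: splits until no further splits can be made.
full_tree = DecisionTreeClassifier(random_state=2).fit(X, y)

# Controlled: limit depth and prune weak branches via cost-complexity pruning.
pruned_tree = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.001,
                                     random_state=2).fit(X, y)
```

The pruned tree is far smaller and generalises better, at the cost of some fit on the training data.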
Random Forest – An ensemble of decision trees can also be used to predict human behavior, including fraud. At each node in a tree, a yes/no decision is made, and this flow can be used for setting strategies. A random forest is built from hundreds of different decision trees, each trained on a random sample of the same data using a technique known as “bootstrapping.” Random forest is a very useful technique for quickly ranking feature importance or extracting the maximum predictive power from your data.
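The quick feature-importance ranking mentioned above might be sketched like this (scikit-learn and the synthetic data are my assumptions): each of the 300 trees is fit on a bootstrap sample, and the forest aggregates how much each feature contributes to its splits:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=8, n_informative=3,
                           weights=[0.9, 0.1], random_state=3)

# Hundreds of trees, each fit on a bootstrap sample of the same data.
forest = RandomForestClassifier(n_estimators=300, bootstrap=True,
                                random_state=3)
forest.fit(X, y)

# Importances sum to 1; argsort gives features from most to least predictive.
ranking = np.argsort(forest.feature_importances_)[::-1]
```

In a fraud setting, this ranking is often used as a fast first pass to decide which candidate features are worth engineering properly.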
The biggest advantage of these machine learning techniques is that they are transparent: the weights and features are visible, so you can see which features contribute to the prediction. These are considered clear-box approaches, and are ideal for explaining results to stakeholders and fraud professionals.
The Rise of The Machines
One of the key issues with machine learning in fraud is how model output is checked and interpreted. Black-box models with binary outcomes give no insight into a subjective area such as first-party fraud. Prejudice can be reinforced in models, sometimes to the detriment of genuine customers.
Consider one example: a human adversary is attacking a bank’s systems and processes – potentially even using the same tools and techniques against your bank (adversarial machine learning). Human analysts take in much more environmental data and can anticipate where motives may shift.
“I’m launching a new product, and this is the fraud risk. Fraudsters did this last year, but new tools are available, they are likely to attack like this…”
Unfortunately, machine learning is a long way from helping with this.
Part 2 will look at third-party fraud and some more advanced techniques.