Posted July 31, 2018
Link analysis is a technique to analyze data by creating networks and investigating the relationships between different entities. It can be used in conjunction with global intelligence from the ThreatMetrix Digital Identity Network to assess true identity and pinpoint fraud that might have otherwise been missed.
At its very core, it relies on linking requests together with their corresponding entities. Requests can include payment transactions, logins, or new account openings while entities represent a person’s digital identity (e.g., IP address or Smart ID).
By linking requests and entities together, we can identify malicious entities and spot any transactions executed by these entities that were not marked as fraud. In many cases, “non-fraudulent” requests are only assigned that label based on an organization’s individual analysis. However, they may in fact be fraudulent.
For example, suppose we are looking at new account openings and one device is associated with multiple fraudulent new account opening requests elsewhere on the global network. Any future new account opening requests from this device are highly likely to be fraudulent, even if the attributes presented at the time of the event look legitimate. The reasoning is that the bad actor in this case is the individual who controls the device, so any request coming from it should be flagged as high-risk.
When performing link analysis at ThreatMetrix, we use Impala, a massively parallel processing SQL query engine, to query data. It is a workhorse that can perform all of the heavy computation needed to find complex links between various transactions. The resulting dataset is condensed enough to be manipulated locally in Python, which offers a much richer set of features for working with data. This step is necessary because the raw query output is hard to interpret on its own. In Python, we can transform it into a more legible form or turn it into a network visualization.
The first step is to write a query that only includes entities that have instances of both fraud and non-fraud (e.g., a device that has fraudulent and non-fraudulent logins). By doing so, we filter out entities that only have fraud and entities that only have non-fraud. These provide no value to us, since we are trying to identify new instances of fraud, namely “non-fraud” linked to fraud. This step typically reduces the amount of data drastically.
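The filter described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Impala query itself: the record layout, the entity names, and the function name `mixed_entities` are all assumptions made for the example. In SQL, the same idea would typically be expressed with a GROUP BY on the entity and a HAVING condition on the fraud and non-fraud counts.

```python
from collections import defaultdict

# Hypothetical request records: (request_id, entity_id, is_fraud).
# The values are made up for illustration only.
requests = [
    ("r1", "device_A", True),
    ("r2", "device_A", False),
    ("r3", "device_B", False),
    ("r4", "device_C", True),
    ("r5", "device_C", True),
    ("r6", "device_D", False),
    ("r7", "device_D", True),
]


def mixed_entities(records):
    """Return entities that have at least one fraud and one non-fraud request."""
    labels_seen = defaultdict(set)
    for _request_id, entity_id, is_fraud in records:
        labels_seen[entity_id].add(is_fraud)
    # Keep only entities where both labels were observed.
    return sorted(e for e, labels in labels_seen.items() if labels == {True, False})


entities_of_interest = mixed_entities(requests)
print(entities_of_interest)  # ['device_A', 'device_D']
```

Here device_B (non-fraud only) and device_C (fraud only) drop out, which is the reduction in data volume the step is meant to achieve.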
The previous step provides us with our entities of interest. Next, we need to match the entities to their associated events, whether they are logins, account openings, or payments. For each entity, some of the events will be fraudulent and some will be non-fraudulent. We can sort the data so that events are grouped by entity, which makes it easy to see at a glance which fraud and non-fraud events are linked together by the same entity. Additionally, we can shade alternating groups of rows, each group representing one entity, to distinguish between them. Finally, all of this can be written out to a spreadsheet file.
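As a rough sketch of this matching and grouping step, the Python below keeps only events tied to the entities of interest, sorts them so each entity's events sit together, and adds a 0/1 `shade` column that alternates per entity group. The field names, the sample events, and the idea of driving the shading from a helper column are assumptions for the example; CSV is used here simply because any spreadsheet application can open it.

```python
import csv
import io

# Hypothetical events and entities of interest, for illustration only.
events = [
    {"event_id": "e1", "entity_id": "device_D", "event_type": "login", "is_fraud": False},
    {"event_id": "e2", "entity_id": "device_A", "event_type": "payment", "is_fraud": True},
    {"event_id": "e3", "entity_id": "device_A", "event_type": "login", "is_fraud": False},
    {"event_id": "e4", "entity_id": "device_B", "event_type": "login", "is_fraud": False},
    {"event_id": "e5", "entity_id": "device_D", "event_type": "account_opening", "is_fraud": True},
    {"event_id": "e6", "entity_id": "device_A", "event_type": "account_opening", "is_fraud": True},
]
entities_of_interest = {"device_A", "device_D"}

FIELDS = ["event_id", "entity_id", "event_type", "is_fraud", "shade"]


def build_rows(events, entities):
    """Keep events for the given entities, grouped by entity, fraud listed first."""
    linked = [e for e in events if e["entity_id"] in entities]
    linked.sort(key=lambda e: (e["entity_id"], not e["is_fraud"], e["event_id"]))
    rows, band, previous = [], -1, None
    for event in linked:
        if event["entity_id"] != previous:
            band += 1  # a new entity group starts, so flip the shading band
            previous = event["entity_id"]
        rows.append({**event, "shade": band % 2})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet application can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(events, entities_of_interest)
print(to_csv(rows))
```

In a spreadsheet, a conditional formatting rule keyed on the `shade` column would reproduce the alternating bands described above.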
Together, the links between requests and entities form large, complicated networks. In these networks, clusters emerge, each containing fraud and non-fraud. They usually center around one or a few malicious entities.
These networks can become very convoluted, especially if many instances of more than one type of entity are introduced. For example, there could be a cluster of various IPs and devices all tangled together.
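One way to make the notion of a cluster concrete is to treat requests and entities as nodes of a graph and look for connected components. The sketch below does this with a breadth-first search in plain Python. The link pairs, the set of fraudulent requests, and the function names are assumptions for the example, and connected components are just one simple way to define a cluster, not necessarily the method used in production.

```python
from collections import defaultdict, deque

# Hypothetical (request_id, entity_id) links mixing IPs and devices.
links = [
    ("r1", "ip_1"), ("r1", "device_A"),
    ("r2", "device_A"),
    ("r3", "ip_2"), ("r3", "device_B"),
    ("r4", "ip_2"),
    ("r5", "device_C"),
]
fraud_requests = {"r1", "r3"}  # assumed labels, for illustration


def find_clusters(links):
    """Return the connected components of the request/entity graph."""
    adjacency = defaultdict(set)
    for request_id, entity_id in links:
        request_node = ("request", request_id)
        entity_node = ("entity", entity_id)
        adjacency[request_node].add(entity_node)
        adjacency[entity_node].add(request_node)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the starting node
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    return clusters


def mixed_clusters(links, fraud_requests):
    """Keep clusters containing both fraudulent and non-fraudulent requests."""
    result = []
    for component in find_clusters(links):
        request_ids = {name for kind, name in component if kind == "request"}
        if request_ids & fraud_requests and request_ids - fraud_requests:
            result.append(component)
    return result


for cluster in mixed_clusters(links, fraud_requests):
    print(sorted(cluster))
```

In this toy data, r2 and r4 are the "non-fraud" requests that end up in the same cluster as a known fraudulent request, which makes them candidates for review.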
We can use Gephi, an open source graph visualization platform, to visualize such networks. Gephi can take various graph file types as inputs. We can construct and output most of these file formats by using Python.
Figure 1: Example of Graph Visualization of the Links (created in Gephi)
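As an illustration of building a graph file in Python, the sketch below writes the links as GraphML, one of the formats Gephi can import, using only the standard library. The sample links, the node id prefixes, and the `kind` node attribute are assumptions for the example. How Gephi displays the attribute (for instance, as a column that can drive node coloring) should be checked against the Gephi version in use.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Hypothetical (request_id, entity_id) links and fraud labels.
links = [
    ("r1", "ip_1"), ("r1", "device_A"),
    ("r2", "device_A"),
    ("r3", "ip_2"), ("r3", "device_B"),
    ("r4", "ip_2"),
    ("r5", "device_C"),
]
fraud_requests = {"r1", "r3"}


def build_graphml(links, fraud_requests):
    """Return GraphML text with one node per request or entity and one edge per link."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare a string attribute named "kind" that is attached to nodes.
    ET.SubElement(root, "key", {
        "id": "kind", "for": "node", "attr.name": "kind", "attr.type": "string",
    })
    graph = ET.SubElement(root, "graph", {"id": "links", "edgedefault": "undirected"})

    added, edges = set(), []

    def add_node(node_id, kind):
        if node_id in added:
            return
        added.add(node_id)
        node = ET.SubElement(graph, "node", {"id": node_id})
        ET.SubElement(node, "data", {"key": "kind"}).text = kind

    for request_id, entity_id in links:
        request_node = f"request:{request_id}"
        entity_node = f"entity:{entity_id}"
        kind = "fraud_request" if request_id in fraud_requests else "non_fraud_request"
        add_node(request_node, kind)
        add_node(entity_node, "entity")
        edges.append((request_node, entity_node))

    # Emit edges after all nodes so every edge refers to an already declared node.
    for source, target in edges:
        ET.SubElement(graph, "edge", {"source": source, "target": target})

    return ET.tostring(root, encoding="unicode")


graphml_text = build_graphml(links, fraud_requests)
with open("links.graphml", "w", encoding="utf-8") as handle:
    handle.write(graphml_text)
```

The resulting `links.graphml` file (a file name chosen for this example) can then be opened from Gephi's file menu.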
We have given a brief overview of link analysis and how it can be done in an Impala environment. However, this process is manual, and its results need to be relayed to our clients periodically. Furthermore, it requires someone with the technical knowledge to perform it.
We can circumvent this by leveraging the power of ThreatMetrix Smart Rules, which delivers real-time pattern recognition through advanced behavioral analytics. This technology creates a variable representing the number of times an entity was seen with a fraudulent request. A reason code can then be created to see if this number is greater than 0. In this way, the link analysis will be done automatically and be incorporated into the policy score.
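The logic of such a variable and reason code can be illustrated with a small Python sketch. This is not the Smart Rules API or its configuration syntax. The class name, the method names, and the reason code label are all hypothetical, and the sketch only mirrors the idea of counting fraudulent sightings per entity and firing when the count is greater than 0.

```python
from collections import Counter


class FraudLinkCounter:
    """Hypothetical tracker of how often each entity was seen with a fraudulent request."""

    def __init__(self):
        self._fraud_counts = Counter()

    def record_fraud(self, entity_id):
        """Register one fraudulent request associated with the entity."""
        self._fraud_counts[entity_id] += 1

    def fraud_count(self, entity_id):
        """Return the variable's value; entities never seen with fraud yield 0."""
        return self._fraud_counts[entity_id]

    def reason_codes(self, entity_ids):
        """Return a reason code for each entity whose fraud count is greater than 0."""
        return [
            f"linked_to_fraud:{entity_id}"
            for entity_id in entity_ids
            if self.fraud_count(entity_id) > 0
        ]


counter = FraudLinkCounter()
counter.record_fraud("device_A")
counter.record_fraud("device_A")

# A new request arrives carrying a known-bad device and a previously unseen IP.
print(counter.reason_codes(["device_A", "ip_9"]))  # ['linked_to_fraud:device_A']
```

In a real policy, a fired reason code would contribute to the policy score. How much weight it carries is a policy design decision outside the scope of this sketch.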
This is the future of link analysis: by moving away from heavy reliance on manual processes spread across multiple software tools, fraud modeling experts can be supported by real-time behavioral analytics to reduce fraud rates and streamline policy development.