In recent years, there has been a huge surge in the usage of online platforms like Reddit, Yelp, and Amazon. While these platforms have positively influenced our daily lives, fraudulent activity has grown amid the rapid increase in digital data volume and the anonymity of online networks. Fraudulent behaviors are executed and hidden in plain sight within this vast amount of online records. Enhancing fraud detection is therefore crucial to protect users from fraudsters, harmful scams, and even criminal activity.
Many techniques exist to combat fraud; however, they often fail to capture the imbalanced class structure of data involving fraudulent activities. Addressing this imbalance is essential for correctly predicting anomalies. So the question remains: how can we effectively detect and mitigate fraudulent activities, especially when faced with imbalanced datasets? Our research contributes to this problem with a model that harnesses the strengths of several existing techniques from different domains.
Our proposed solution combines two models: GraphSMOTE and SparseGAD. GraphSMOTE, although not commonly used for fraud detection, is an oversampling technique that identifies minority-class nodes and creates new nodes resembling those minority-class data points. Since fraud-detection datasets are hugely imbalanced, GraphSMOTE helps balance the class distribution in the graph. We then feed the output of GraphSMOTE into SparseGAD, a model known for anomaly detection through sparsity constraints. These constraints highlight significant connections, and anything that deviates substantially from them is examined further for fraud. After applying these techniques, we use Graph Neural Networks (GNNs) for our model.
Our model uses three datasets: Amazon, Yelp, and Reddit. These datasets are obtained in graph format from the Deep Graph Library (DGL) and PyGOD. Each dataset exhibits class imbalance:
We feed our imbalanced datasets into GraphSMOTE, which balances them by adding synthetic anomalous nodes to the graph, amplifying the minority class. These synthetic nodes maintain connections with existing nodes while preserving heterogeneity among links.
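GraphSMOTE's node synthesis follows the classic SMOTE recipe: interpolate a new feature vector between a minority-class node and one of its nearest minority-class neighbours. The following is a minimal stand-alone sketch of the feature-interpolation step only (edge generation for the synthetic nodes is omitted); `smote_synthesize` is our own illustrative name, not a function from the GraphSMOTE codebase.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_synthesize(minority_feats, k=2, rng=None):
    """Generate one synthetic node per minority node by interpolating
    toward a randomly chosen one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for i, x in enumerate(minority_feats):
        # k nearest minority-class neighbours of x (excluding itself)
        nbrs = sorted((j for j in range(len(minority_feats)) if j != i),
                      key=lambda j: euclidean(x, minority_feats[j]))[:k]
        nn = minority_feats[rng.choice(nbrs)]
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic
```

Because each synthetic vector is a convex combination of two real minority nodes, it lies inside the minority class's feature region rather than duplicating an existing point.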
GraphSMOTE alone does not fully capture the dissimilarity between anomalous nodes and the users they connect to. To address this, SparseGAD introduces sparsity through a learnable adjacency matrix (the "homey" matrix) derived from cosine-similarity calculations. This matrix indicates whether connected nodes are similar or dissimilar, enhancing the model's ability to detect anomalous nodes that exhibit heterophilic behavior.
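The similarity signal behind this step can be sketched as plain cosine similarity scored per edge; low or negative values flag heterophilic connections. This is a simplified illustration of the idea (the learned "homey" matrix itself involves trainable parameters not shown here), and `edge_similarities` is our own illustrative helper name.

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def edge_similarities(feats, edges):
    """Cosine similarity for each edge (u, v); low or negative values
    suggest a heterophilic (dissimilar) connection worth inspecting."""
    return {(u, v): cosine(feats[u], feats[v]) for u, v in edges}
```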
After applying these techniques, we feed the output into our GNNs to obtain a result. The three GNNs we use are Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE. We apply different preprocessing techniques, or "paths," before these GNN models. The three main paths, along with descriptions of the three GNN models, are discussed in detail in the Implementation section.
Finally, we use the trained GNN model to classify nodes as either fraudulent or non-fraudulent based on their features and neighborhood structure. We evaluate the model’s performance using the ROC-AUC score based on its ability to correctly identify anomalies in the dataset.
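For reference, the ROC-AUC metric used here equals the probability that a randomly chosen anomaly is scored higher than a randomly chosen normal node. A minimal pairwise implementation (equivalent to the Mann-Whitney formulation; in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive (anomalous)
    node outscores a random negative (normal) node; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means perfect ranking of anomalies above normal nodes, while 0.5 is no better than chance.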
In the synthetic node generation process, the SMOTE method is used to amplify the minority class, typically anomalous users. Synthetic nodes are created by replicating existing nodes and their connections, maintaining heterogeneity in links. The "homey" adjacency matrix replaces the original adjacency matrix, enabling the calculation of cosine similarity between adjacent node pairs. This diagram summarizes GraphSMOTE:
Sparsification techniques further refine the graph by filtering unnecessary connections using a threshold δ and limiting the number of connections through KNN. Graph Anomaly Detection (GAD)-oriented regularization is applied to prevent faulty links, enhancing the model's accuracy in distinguishing anomalous users. This diagram summarizes SparseGAD:
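The two pruning rules (similarity threshold δ plus a top-k neighbour cap) can be sketched over a dense similarity matrix as follows. This is a simplified, non-learned illustration: `sparsify` is our own name, the result is not forced to be symmetric, and the real model applies these rules to its learned adjacency.

```python
def sparsify(sim, delta=0.2, k=2):
    """For each node, keep at most k neighbours whose similarity is at
    least delta; everything else is pruned from the graph."""
    kept = {}
    for i, row in enumerate(sim):
        # candidate neighbours passing the threshold (self excluded)
        cands = [(s, j) for j, s in enumerate(row) if j != i and s >= delta]
        cands.sort(reverse=True)           # strongest connections first
        kept[i] = [j for _, j in cands[:k]]
    return kept
```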
Our approach to detecting fraud involves three main paths, each of which applies a different preprocessing technique before the GNN models (GCN, GAT, or GraphSAGE). The following visualizes the workflow:
As the table shows, the GraphSAGE model performs best on the Yelp and Reddit datasets, and GraphSAGE with Modified GraphSMOTE performs best on the Amazon dataset.
We first aim to explain why GraphSAGE outperforms GCN and GAT on the Yelp and Reddit datasets, while achieving a similar ROC-AUC score on the Amazon dataset. Our intuition is that this discrepancy arises from the structural differences between the datasets. It's important to note that both the Yelp and Reddit datasets exhibit relatively sparser structures, whereas the Amazon dataset presents a more condensed structure. This condensed structure in the Amazon dataset allows the weights in the GAT model on each neighbor’s features to have greater significance compared to those in the Yelp and Reddit datasets. On the other hand, the GraphSAGE model samples and aggregates from the neighboring nodes. On the Yelp and Reddit datasets, the simplicity of the GraphSAGE method facilitates better model tuning and results in improved convergence, thereby enabling the GraphSAGE model to outperform the GAT model on these datasets.
The main reasons why we believe our GraphSAGE performs well on the Yelp and Reddit datasets, and GraphSAGE with Modified GraphSMOTE on the Amazon dataset, are related to the degrees of connectivity and the number of connected nodes. Below, we present the degree distribution of the datasets.
In the Yelp and Reddit datasets, the median node degree is lower (6 and 1, respectively) than in Amazon (42). This discrepancy may reflect variations in the connectivity and structural complexity of the datasets, which directly influence the performance of fraud detection models. Furthermore, the Yelp and Reddit datasets have more nodes with only self-loops (106 and 625, respectively) compared to Amazon (6). This lack of connectivity in the Yelp and Reddit datasets limits the potential for synthetic nodes generated by the SMOTE method to establish meaningful connections within the graph. As a result, the SMOTE method's ability to balance the class distribution while maintaining the dataset's structure may be constrained on Yelp and Reddit, whereas it flourishes on Amazon. Thus, the GraphSMOTE and Modified GraphSMOTE models were unable to surpass the performance of the baseline GraphSAGE model on those two datasets.
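The two statistics driving this analysis (median degree and the count of nodes whose only edge is a self-loop) are straightforward to compute from an edge list. A small sketch with our own illustrative function name, treating edges as undirected pairs:

```python
from statistics import median


def degree_stats(num_nodes, edges):
    """Return (median degree, number of nodes whose only edge is a
    self-loop) for an undirected graph given as (u, v) pairs."""
    deg = [0] * num_nodes
    has_real = [False] * num_nodes   # node touches a non-loop edge
    has_loop = [False] * num_nodes   # node has a self-loop
    for u, v in edges:
        if u == v:
            deg[u] += 1
            has_loop[u] = True
        else:
            deg[u] += 1
            deg[v] += 1
            has_real[u] = has_real[v] = True
    loop_only = sum(1 for i in range(num_nodes)
                    if has_loop[i] and not has_real[i])
    return median(deg), loop_only
```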
This disparity in performance highlights the importance of tailoring the model to the characteristics of the dataset.
Fraud detection is more relevant now than ever due to the massive surge in the usage of online platforms. Our research contributes to this problem with a model that harnesses the benefits of several existing models. We proposed a solution that combines oversampling techniques and sparsity constraints to balance fraud datasets and predict fraudulent activity.
In the future, we would like to refine our model to accommodate variability in datasets, such as degrees of connectivity and the number of connected nodes. We would also like to seek partnerships for deployment and cybersecurity collaboration. As mentioned, online fraud can be catastrophic for many and leaves a negative impression of the internet. Our goal is to help bridge the gap in cybersecurity fraud detection.
The fusion of GraphSMOTE and SparseGAD showed promising results, most notably on the Amazon dataset. However, many other models exist, and finding the combination that is most powerful at detecting fraud remains open. Leveraging subject-matter expertise and domain knowledge will be an invaluable asset in this line of research.