In this post, we will build an autoencoder in Deeplearning4j to detect fraudulent credit card transactions. We will also learn how to deal with an unbalanced dataset, which is very common in anomaly detection applications.
Anomaly is a broad term denoting something that deviates from what is standard or expected. Identifying an anomalous event at an early stage is often challenging, because the conditions for recognizing one are vague or unclear.
Hence, anomaly detection modelling with deep learning normally starts by spotting data samples with unusual trends and patterns, marking them as the interesting ones, and inspecting them further in later stages to verify whether they are anomalous events.
Another distinctive feature of the anomaly detection problem is that the dataset is inherently unbalanced, with far fewer positive labelled samples than negative labelled samples. This scenario is common in real-world problems, where anomalous events happen far less frequently than normal ones.
When selecting a neural network architecture, the highly unbalanced dataset needs to be factored in. A vanilla multi-layer perceptron would not be appropriate, as the model would be heavily biased towards the class with more samples and would rarely predict the minority class.
The autoencoder is commonly used for unbalanced datasets and is good at anomaly detection tasks such as fraud detection, industrial damage detection, and network intrusion detection. This is elaborated further in the next section.
Introduction to Autoencoder
An autoencoder learns in an unsupervised manner to create a general representation of the dataset. As seen in Figure 1 below, the network has two important modules, an encoder and a decoder. The encoder performs dimensionality reduction by learning the underlying patterns of the dataset in a compressed form. It is followed by a decoder, which does the reverse by producing an uncompressed version of the data.
Figure 1. Illustration of an autoencoder.
Essentially, the bottleneck layer sandwiched between the encoder and decoder is the key attribute of the network design. It restricts the amount of information traversing from encoder to decoder, forcing the network to learn the low dimensional representation of the input data. With the output layer having the same number of units as the input layer, the network learns to develop a compressed and efficient representation of the input data through minimizing the reconstruction error.
In this example, we illustrate anomaly detection with the use case of credit card fraud detection. The dataset used in this example can be retrieved from Kaggle.
In this example, we sample 108000 non-fraud data points and 490 fraud data points from the original dataset. The resulting dataset is highly unbalanced, with frauds making up 490 of the 108490 samples (about 0.45% of all transactions). The distribution of the data is shown in the figure below, where you can see that the amount of fraud data is heavily disproportionate to the normal data.
Figure 2. Distribution of normal to fraud data.
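To make the imbalance concrete, the short Java sketch below computes the fraud rate from the two counts quoted above, together with the accuracy a trivial model would reach by always predicting non-fraud. The class and method names are hypothetical and are not part of the example program.

```java
// Illustrative arithmetic only; ImbalanceDemo and its methods are hypothetical names.
final class ImbalanceDemo {

    /** Fraction of all samples that belong to the positive (fraud) class. */
    static double positiveRate(long positives, long negatives) {
        return (double) positives / (positives + negatives);
    }

    /** Accuracy of a trivial model that always predicts the negative class. */
    static double majorityBaselineAccuracy(long positives, long negatives) {
        return (double) negatives / (positives + negatives);
    }

    public static void main(String[] args) {
        long fraud = 490;         // fraud samples, as quoted in the text
        long nonFraud = 108_000;  // non-fraud samples, as quoted in the text
        System.out.println("fraud rate: " + positiveRate(fraud, nonFraud));
        System.out.println("always-non-fraud accuracy: "
                + majorityBaselineAccuracy(fraud, nonFraud));
    }
}
```

With these counts, the trivial model is right more than 99% of the time while catching no fraud at all, which is why plain accuracy is a poor guide for this kind of dataset.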
The highly unbalanced dataset makes it suitable for modelling with an autoencoder, where the model learns the general representation of the data and singles out data with uncommon patterns. An LSTM autoencoder is used in this example to capture the time-series nature of the transaction data. If you want to learn more about LSTM, here’s a really good link.
The program can be found on the GitHub Repository. The readme in the subfolder shows how to run the program. Note: the time needed to run the program varies depending on the computation backend. Choose the CUDA backend with cuDNN for faster computation.
Snippets of code are demonstrated below for explanation purposes.
The original dataset is a file in which most transaction features are principal components obtained through Principal Component Analysis (PCA). Two further features, ‘Time’ and ‘Amount’, have not been transformed. The last column of the dataset denotes the ground truth of ‘fraud’ or ‘non-fraud’.
To proceed with our analysis, the ‘Time’ column is removed because it is not used as a feature here, and the ‘Amount’ column is normalized to the range of 0 to 1. Each data point is then written to its own file, stored in either the fraud or the non-fraud directory depending on its ground truth label. The label of the sample is stored in a separate label directory under the same file name.
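The text does not say which scaling formula is applied to ‘Amount’, so the following is a minimal sketch that assumes min-max scaling, (x - min) / (max - min). The helper names and the sample amounts are made up for illustration and do not come from the example program.

```java
import java.util.Arrays;

// Hypothetical helpers sketching the two preprocessing steps described above.
final class AmountPreprocessing {

    /** Returns a copy of the row without the column at dropIndex (for example 'Time'). */
    static double[] dropColumn(double[] row, int dropIndex) {
        double[] out = new double[row.length - 1];
        for (int i = 0, j = 0; i < row.length; i++) {
            if (i != dropIndex) {
                out[j++] = row[i];
            }
        }
        return out;
    }

    /** Min-max scales values into [0, 1]; a constant column maps to all zeros. */
    static double[] minMaxScale(double[] values) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = range == 0.0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] amounts = {0.0, 25.0, 100.0};  // made-up transaction amounts
        System.out.println(Arrays.toString(minMaxScale(amounts)));  // prints [0.0, 0.25, 1.0]
    }
}
```

The minimum and maximum would normally be computed once over the whole ‘Amount’ column and then reused for every sample, so that training and testing data share the same scale.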
For training, we set aside a large portion of the non-fraud data (100 000 samples) to be modeled by the autoencoder. The rest of the non-fraud data (8000 samples) and the fraud data (490 samples) are used during the testing and validation phases.
As a result, we have four data directories: train_data_feature, train_data_label, test_data_feature, and test_data_label. For example, 0.csv in the directory test_data_feature stores one sample data point of a credit card transaction, while 0.csv in the directory test_data_label stores the label (fraud/non-fraud) of the corresponding transaction.
Fig 3. File directory paths for training and testing dataset
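The one-file-per-sample layout can be sketched as follows. This is not the code from the repository: the class name, the feature values, and the label encoding (1 for fraud, 0 for non-fraud) are assumptions made for illustration. Only the directory names and the pairing of files by name come from the description above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical writer that pairs a feature file and a label file by file name.
final class SampleFileWriter {

    /**
     * Writes sample number `index` as <featureDir>/<index>.csv and its label
     * as <labelDir>/<index>.csv, creating the directories if needed.
     */
    static void writeSample(Path featureDir, Path labelDir, int index,
                            double[] features, int label) throws IOException {
        Files.createDirectories(featureDir);
        Files.createDirectories(labelDir);

        // Join the feature values into a single comma-separated line.
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < features.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(features[i]);
        }

        String fileName = index + ".csv";
        Files.writeString(featureDir.resolve(fileName), line + "\n");
        Files.writeString(labelDir.resolve(fileName), label + "\n");
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("fraud-demo");
        // Made-up feature values; label 1 is assumed to mean fraud.
        writeSample(root.resolve("test_data_feature"), root.resolve("test_data_label"),
                0, new double[] {0.12, -1.5, 0.03}, 1);
        System.out.print(Files.readString(root.resolve("test_data_feature").resolve("0.csv")));
    }
}
```

Keeping features and labels in separate directories under matching file names lets a reader load both by index, which is the pairing the record readers in the next step rely on.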
Note: Autoencoders are commonly used for unsupervised learning, where the model is trained on data without labels. In this example, the autoencoder is used in a semi-supervised manner because labels are available. We train the autoencoder only on the “non-fraud” data, on the assumption that this leads to a better general representation of normal transactions. During testing, data with a high reconstruction error is categorized as having a high risk of fraud.
The data stored in CSV files is then read in using CSVSequenceRecordReader and SequenceRecordReaderDataSetIterator, as shown in Figure 4. The training dataset is grouped into minibatches of 284 samples, while the testing dataset is read one data point at a time to suit the evaluation step later. Note that the training data in this example contains only non-fraud data.
Fig 4. Vectorization of data
Next, the LSTM autoencoder is built as shown in Fig 5. Each layer is an LSTM layer, so that the network can learn the correlations in time-series data. Note that the mean squared error (MSE) loss function is used in the output layer, because an autoencoder is trained to reconstruct its input and the MSE measures the reconstruction error.
Fig 5. LSTM Autoencoder Architecture
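As a reminder of what the reconstruction error measures, here is a minimal sketch of the mean squared error between one input vector and its reconstruction. It is plain Java with a hypothetical class name, not the DL4J loss implementation, and the way DL4J aggregates the loss over time steps and minibatches may differ from this per-vector version.

```java
// Hypothetical helper; computes MSE for a single input/reconstruction pair.
final class ReconstructionError {

    /** Mean of the squared element-wise differences between the two vectors. */
    static double meanSquaredError(double[] input, double[] reconstruction) {
        if (input.length == 0 || input.length != reconstruction.length) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < input.length; i++) {
            double diff = input[i] - reconstruction[i];
            sumOfSquares += diff * diff;
        }
        return sumOfSquares / input.length;
    }

    public static void main(String[] args) {
        double[] input = {1.0, 2.0, 3.0};           // made-up input features
        double[] reconstruction = {1.0, 2.0, 5.0};  // made-up network output
        // Only the last element differs (by 2), so the error is 4 / 3.
        System.out.println(meanSquaredError(input, reconstruction));
    }
}
```

A perfect reconstruction gives an error of 0, and the error grows with the square of each deviation, so samples the network has not learned to reproduce stand out with large values.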
The network starts with an LSTM layer of 29 units, which is the number of features in the input data. In the encoder, the number of nodes is then reduced to 16 and eventually to 8 in the subsequent layers. This is the bottleneck of the autoencoder, which captures the essential representation of the data. The decoder mirrors the encoder: it is constructed with 8 nodes followed by 16 nodes, in an attempt to reconstruct the input data. The network is completed with an output layer of 29 units, identical to the number of nodes in the input layer. This information is summarized in Fig 6.
Fig 6. LSTM Autoencoder Network Configuration
After the configuration is set up, the network is trained for 1 epoch, with the minibatches fed in one after another. The features of the data are used as both the input and the target output, because an autoencoder reconstructs its input while minimizing the reconstruction error.
Fig 7. Training of the LSTM Autoencoder Network
The DL4J training user interface shows the loss value decreasing as training progresses, which indicates that the network is converging.
Fig 8. Deeplearning4j Training User Interface.
After training, the network is evaluated using the testing dataset. Each testing data point is fed into the network, and a data point with a high reconstruction error is labeled as predicted fraud. The threshold value set for this example is 2.5.
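The decision rule can be sketched in a few lines of Java. The threshold of 2.5 is the value quoted above. The class name, the strict greater-than comparison, and the example scores are assumptions for illustration, since the text does not say how a score exactly equal to the threshold is treated.

```java
// Hypothetical decision rule based on the reconstruction error of one sample.
final class ThresholdClassifier {

    static final double THRESHOLD = 2.5;  // threshold value quoted in the text

    /** Flags a sample as predicted fraud when its error exceeds the threshold. */
    static boolean isPredictedFraud(double reconstructionError, double threshold) {
        return reconstructionError > threshold;  // strict comparison is an assumption
    }

    public static void main(String[] args) {
        double[] scores = {0.4, 2.5, 3.1};  // made-up reconstruction errors
        for (double score : scores) {
            String label = isPredictedFraud(score, THRESHOLD) ? "fraud" : "non-fraud";
            System.out.println(score + " -> " + label);
        }
    }
}
```

Raising the threshold flags fewer transactions and so misses more frauds, while lowering it catches more frauds at the cost of more false alarms, so the value is a trade-off rather than a fixed constant.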
The network correctly identifies 415 frauds out of a total of 490 cases (about 84.7% of all frauds). It also correctly recognizes 27479 normal transactions out of a total of 28431 cases (about 96.7%).
Fig 9. Evaluation Results.
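The reported counts can be turned into the usual rates with a little arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes that every case not counted as correctly identified was misclassified, which gives 490 - 415 = 75 missed frauds and 28431 - 27479 = 952 false alarms.

```java
// Hypothetical helpers that derive rates from the counts reported in the text.
final class FraudMetrics {

    /** Share of actual frauds that were flagged: TP / (TP + FN). */
    static double recall(long truePositives, long falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    /** Share of normal transactions that were left unflagged: TN / (TN + FP). */
    static double specificity(long trueNegatives, long falsePositives) {
        return (double) trueNegatives / (trueNegatives + falsePositives);
    }

    /** Share of flagged transactions that were actual frauds: TP / (TP + FP). */
    static double precision(long truePositives, long falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    public static void main(String[] args) {
        long truePositives = 415;                     // frauds correctly flagged
        long falseNegatives = 490 - truePositives;    // derived: frauds missed
        long trueNegatives = 27479;                   // normal transactions recognized
        long falsePositives = 28431 - trueNegatives;  // derived: false alarms

        System.out.println("recall:      " + recall(truePositives, falseNegatives));
        System.out.println("specificity: " + specificity(trueNegatives, falsePositives));
        System.out.println("precision:   " + precision(truePositives, falsePositives));
    }
}
```

Under that assumption, the precision comes out at roughly 30%, meaning most flagged transactions are false alarms even though recall is high. This is typical for heavily unbalanced data and is why flagged samples are inspected further rather than treated as confirmed fraud.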