Fig 1. The workflow of Spam Filtering Implementation
The workflow of spam filtering is shown in Fig 1. It starts with data cleansing and restructuring, which put the data into a format that is ready for training.
The original dataset can be retrieved from here. In the uncompressed folder, the whole dataset sits in a single file, SMSSpamCollection.txt. Each entry in the file carries a label, “spam” or “non-spam”, marking it as spam or non-spam data.
Figure 2. Distribution of Dataset
The distribution of the dataset is displayed in the chart above. The dataset is unbalanced, with far more non-spam data points (4827 samples) than spam data points (747 samples). In this example, the unbalanced data is fed to the classification model as is. The caveat is that the network might adapt to the patterns of non-spam mail better simply because there is more of it. Performance may improve with a balanced dataset.
The data processing step starts by reading in the original text file, SMSSpamCollection.txt. The file contains multiple lines of text, where each line holds the content of a single mail. Each line is retrieved and saved into its own text file. Figure 3 shows the end result of this separation. The dataset is then split into training and testing datasets, stored in separate subfolders.
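To put numbers on the imbalance, here is a minimal sketch that computes the class shares from the counts above, along with inverse-frequency class weights. The weighting scheme and the class name `ClassBalance` are my own illustration of one common way to compensate for imbalance; the original program does not use them.

```java
// Sketch: quantify the class imbalance reported above.
// The inverse-frequency weighting is an illustrative assumption,
// not part of the original SpamMailFiltering program.
class ClassBalance {

    /** Fraction of all samples that belong to a class with `count` members. */
    static double share(int count, int total) {
        return (double) count / total;
    }

    /** Inverse-frequency weight: total / (numClasses * count). */
    static double inverseFrequencyWeight(int count, int total, int numClasses) {
        return (double) total / (numClasses * (double) count);
    }

    public static void main(String[] args) {
        int nonSpam = 4827; // counts taken from the article
        int spam = 747;
        int total = nonSpam + spam;

        // A model that always answers "non-spam" already reaches the
        // non-spam share as its accuracy, so accuracy alone says little here.
        System.out.printf("non-spam share: %.3f%n", share(nonSpam, total));
        System.out.printf("spam share:     %.3f%n", share(spam, total));
        System.out.printf("non-spam weight: %.3f%n", inverseFrequencyWeight(nonSpam, total, 2));
        System.out.printf("spam weight:     %.3f%n", inverseFrequencyWeight(spam, total, 2));
    }
}
```

Since roughly 87% of the samples are non-spam, a classifier that never flags anything would still look accurate, which is why the spam-specific results matter more than overall accuracy for this dataset.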
Figure 3. Illustration of a text of an email in a separate txt file.
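The separation step described above can be sketched roughly as follows. This is not the code from the repository. It assumes each line has the form label, tab, message and that the file is UTF-8 encoded. The `data` output folder, the numeric file names, and the 80/20 split rule are all hypothetical choices for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Sketch: write every message of the dataset file into its own text file,
// grouped as <outputRoot>/<train|test>/<label>/<index>.txt.
class DatasetSplitter {

    /** Splits "label<TAB>message" into {label, message}; returns null for malformed lines. */
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0 || tab == line.length() - 1) {
            return null;
        }
        return new String[] { line.substring(0, tab).trim(), line.substring(tab + 1).trim() };
    }

    /** Every fifth message goes to "test", the rest to "train" (an assumed 80/20 split). */
    static String subsetFor(int index) {
        return index % 5 == 4 ? "test" : "train";
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args.length > 0 ? args[0] : "SMSSpamCollection.txt");
        Path outputRoot = Paths.get(args.length > 1 ? args[1] : "data"); // hypothetical folder
        List<String> lines = Files.readAllLines(source, StandardCharsets.UTF_8);

        int index = 0;
        for (String line : lines) {
            String[] record = parseLine(line);
            if (record == null) {
                continue; // skip blank or malformed lines
            }
            // The label becomes the subfolder name, e.g. data/train/spam/42.txt
            Path dir = outputRoot.resolve(subsetFor(index)).resolve(record[0]);
            Files.createDirectories(dir);
            Files.write(dir.resolve(index + ".txt"), record[1].getBytes(StandardCharsets.UTF_8));
            index++;
        }
        System.out.println("Wrote " + index + " message files under " + outputRoot);
    }
}
```

A deterministic split like this keeps the example short. Shuffling the lines before splitting would be the more usual choice if the file happened to be ordered.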
Get the Codebase
The program for this example is stored in the GitHub repository below.
Refer to the directory SpamMailFiltering for this use case example. The codebase will be further explained below for a better understanding.
The program is based on an open-source Java-based deep learning framework — DeepLearning4J (DL4J). If you are new to DL4J, you can refer to another article of mine here for an introduction and installation of it.
Loading of Pre-Trained Word2Vec Model
Before running the main program by executing the file SpamMailFiltering.java, you will need to download the pre-trained embeddings for text. To train a neural network on text input, the text first has to be converted into embeddings. In this example, a pre-trained model is used for that conversion. If you want to understand more about Word2Vec, here’s a good link for it.
The example starts by loading a Word2Vec model pre-trained on the Google News corpus. This model is trained with 3 billion running words and outputs 300-dimensional English word vectors. Download it from here and change WORD_VECTORS_PATH in SpamMailFiltering.java to point to the location where you saved the file.
Figure 4. Assign WORD_VECTORS_PATH to local directory path of GoogleNews-vectors-negative300.bin.gz
To Run the Code
After setting WORD_VECTORS_PATH, run the code by executing SpamMailFiltering.java. While the neural network is training, open http://localhost:9000 to visualize the training progress.
Figure 5. Visualization of model training.
The program might take quite some time to run on the CPU backend, and loading the large Google News word vectors is time consuming. Alternatively, switching to the CUDA backend should shorten the execution time. Doing so is as simple as changing a line in pom.xml. Meanwhile, you can let the program run, take a break and go grab a cup of coffee ☕.
The description below provides a more detailed walkthrough of the process.
The training and testing data is stored in directories structured as illustrated below. There are train and test folders, each containing spam and non-spam subfolders.
Figure 6. Data Directory Structure
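With this layout, the label of a mail can be read straight from its path. The sketch below shows the idea in plain Java. The class `LabelFromPath` and the root folder name `data` are hypothetical, and the actual SpamMailDataSetIterator may resolve labels differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: derive the subset and the label of each mail from
// a layout of the form <root>/<train|test>/<label>/<file>.txt.
class LabelFromPath {

    /** The class label is the name of the folder that directly contains the file. */
    static String labelOf(Path file) {
        return file.getParent().getFileName().toString();
    }

    /** The subset ("train" or "test") is the folder one level above the label folder. */
    static String subsetOf(Path file) {
        return file.getParent().getParent().getFileName().toString();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "data"); // hypothetical root folder
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 // keep only files exactly three levels below the root
                 .filter(p -> root.relativize(p).getNameCount() == 3)
                 .filter(p -> p.toString().endsWith(".txt"))
                 .forEach(p -> System.out.println(
                         subsetOf(p) + " / " + labelOf(p) + " / " + p.getFileName()));
        }
    }
}
```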
Figure 7. SpamMailDataSetIterator Initialization
Next, a Long Short-Term Memory (LSTM) model is configured to model the data. LSTM is commonly used to model sequential data due to its ability to capture long-term dependencies. Check out this link to learn more about LSTM.
Fig 8. LSTM Neural Network Architecture
As shown in Figure 8, the network starts with an LSTM layer of 300 units, which matches the dimension of the pre-trained word embeddings. Each mail text is truncated to a preset maximum length if the original text is longer than that. The network ends with an output layer of 2 classes, the spam and non-spam labels.
Evaluation on the testing dataset after 1 epoch shows a promising result, as illustrated in Figure 9. Among the 150 samples of spam mail, 103 are correctly identified as spam while 47 are wrongly labeled as non-spam (false negatives).
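To illustrate the truncation step, here is a minimal sketch of how a tokenized mail could be turned into a fixed-size input matrix. It is plain Java, not DL4J code. The tiny embedding table in `main` stands in for the 300-dimensional Word2Vec vectors, and dropping tokens that have no vector is my assumption about how unknown words are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: truncate a token sequence and embed it into a padded matrix.
class SequenceTruncation {

    /** Keeps tokens that have a vector, stopping once maxLength tokens are kept. */
    static List<String> truncate(List<String> tokens, Map<String, float[]> vectors, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (kept.size() >= maxLength) {
                break; // anything past the preset length is cut off
            }
            if (vectors.containsKey(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /**
     * Builds a [vectorSize][maxLength] matrix from already truncated tokens.
     * Time steps beyond the end of the mail stay zero (padding).
     */
    static float[][] toFeatures(List<String> tokens, Map<String, float[]> vectors,
                                int vectorSize, int maxLength) {
        float[][] features = new float[vectorSize][maxLength];
        for (int t = 0; t < tokens.size(); t++) {
            float[] vector = vectors.get(tokens.get(t));
            for (int d = 0; d < vectorSize; d++) {
                features[d][t] = vector[d];
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional embeddings, for illustration only.
        Map<String, float[]> vectors = new HashMap<>();
        vectors.put("claim", new float[] {0.1f, 0.9f});
        vectors.put("cash", new float[] {0.8f, 0.2f});
        vectors.put("now", new float[] {0.5f, 0.5f});

        List<String> tokens = Arrays.asList("claim", "your", "cash", "now");
        List<String> kept = truncate(tokens, vectors, 2);
        float[][] features = toFeatures(kept, vectors, 2, 2);

        System.out.println("kept tokens: " + kept);
        System.out.println("features: " + Arrays.deepToString(features));
    }
}
```

A real iterator would typically also produce a mask that marks which time steps are padding, so the network can ignore them.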
Fig 9. Evaluation Results.
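The spam figures quoted above translate into a recall value, which is worked out in the short sketch below. Only the true positives (103) and false negatives (47) are given in the text, so precision cannot be derived from them. The class name `SpamRecall` is just for illustration.

```java
// Sketch: recall for the spam class from the counts quoted in the article.
class SpamRecall {

    /** Recall = TP / (TP + FN): the share of actual spam that was caught. */
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        int truePositives = 103; // spam mails identified as spam
        int falseNegatives = 47; // spam mails labeled as non-spam
        System.out.printf("spam recall: %.3f%n", recall(truePositives, falseNegatives));
    }
}
```

In other words, a little over two thirds of the spam in the test set is caught after one epoch, which leaves room for improvement with more training or a more balanced dataset.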
I also tested the model on a sample spam mail to evaluate the output. Given the text “Congratulations! Call FREEFONE 08006344447 to claim your guaranteed £2000 CASH or £5000 gift. Redeem it now!”, which is clearly a prize scam, the model outputs a spam probability of 98%. This suggests that the model is able to tell spam and non-spam mail apart with confidence.
Fig 10. Probabilities Output for A Sample Spam Mail.
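For readers wondering where such a probability comes from, a two-class output layer usually applies a softmax to two raw scores. The sketch below assumes the output layer uses softmax, which the article does not state explicitly, and the scores in `main` are made-up values, not the ones the trained model produced.

```java
// Sketch: turning two raw class scores into probabilities with softmax.
class TwoClassSoftmax {

    /** Numerically stable softmax over an array of raw scores. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        double[] result = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            result[i] = Math.exp(scores[i] - max); // subtracting max avoids overflow
            sum += result[i];
        }
        for (int i = 0; i < result.length; i++) {
            result[i] /= sum;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical scores: index 0 = non-spam, index 1 = spam.
        double[] probabilities = softmax(new double[] {-2.0, 2.0});
        System.out.printf("non-spam: %.3f, spam: %.3f%n", probabilities[0], probabilities[1]);
    }
}
```

A gap of 4 between the two scores is already enough for a spam probability of about 98%, so a high percentage reflects how far apart the scores are and not a guarantee that the prediction is right.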