Applied text classification on Email Spam Filtering [part 1]

Since last few months, I’ve started working on online Machine Learning Specialization provided by the University of Washington. The first course was about ML foundations, the second about Linear Regression, and the third course on which I’m currently on track is about Classification. I liked the courses almost in every aspect as they teach ML algorithms implementation from Scratch. That was my goal when I decided to discover the field in more depth. But, honestly, I was feeling that there is a gap somehow because many questions were left without answers along the way. Then, after reading about how to start with Machine Learning, I found that most articles emphasized the importance of combining courses with practical projects in order to apply what’s learned and better assimilate it.. and it’s so true! You just have to try to combine both and you will soon notice the difference!

So, here I am in my first practice application! 😀 I chose to try Email Spam Filtering as it’s a very common topic in applied classification. It can be understood easily because we are experiencing spam filtering in our e-mails every day.

I followed a simple starting tutorial: Email Spam Filtering: A Python implementation with Scikit-learn. Soon after finishing it, my brain started analyzing the steps and a bunch of questions bombarded my head!

Why “equal number of spam and non-spam emails”? What’s stemming? Are there other methods to clean data other than removing stop words and lemmatization? How is the partition between the training set and test set done? Why no validation set? Why Naive Bayes classifier and SVM (Support Vector Machines) were specifically used? What makes Naive Bayes so popular for document classification problem? etc..

As William.S.Burroughs said, “Your mind will answer most questions if you learn to relax and wait for the answer.”

I took a breath and started answering question by question by doing sometimes search on the net, experimenting some changes in code and analyzing the output. And I’m happily sharing the results:

1) The data we need

– how many emails we’ve seen (will be used in train-test sets)
– how many emails go in each label (used to detect if there is imbalanced data)
– how often a word is associated with each label (used to calculate the probability of an email being a spam or ham (class 0 or class 1))

2) Cleaning data

Why cleaning the words list? Cleaning data is essential in order to reduce the probability of getting wrong results because some words have no influence on the classification (they can neither be associated with spam class and nor with ham class) and there are words that can be normalized in order to group same-meaning words and reduce redundancy. By acting on the quality of the training data, we can change what is called the accuracy of the classifier. So removing stop words, stemming, and lemmatization help in improving the results of Machine Learning algorithms.

3) Naive Bayes

Why was Naive Bayes used? Naive Bayes has highly effective learning and prediction, it’s often used to compare with more sophisticated methods because it’s fast and highly scalable (works well with high-dimensional data) and as Andrew Ng suggests when dealing with an ML problem start by trying with a simple quick and dirty algorithm and then expand from that point.

How is Naive Bayes simple and easy? Naive Bayes is based on “Bayes” theorem and it’s called “Naive” because it assumes that features are independent of each other given the class(no/little correlation between features), which is not realistic. Thus, Naive Bayes can learn individual features importance but can’t determine the relationship among features. Besides, the training time with Naive Bayes is significantly smaller as opposed to alternative methods and it doesn’t require much training data.

Why Multinomial Naive Bayes? What about other models like Gaussian Naive Bayes or Bernoulli Naive Bayes?

Well, Multinomial NB considers the frequency count (occurrences) of the features (words in our case) while Bernoulli NB cares only about the presence or absence of a particular feature (word) in the document. The latter is adequate for features that are binary-valued (Bernoulli, boolean). Whereas, with Gaussian NB, features are real-valued or continuous and their distribution is Gaussian, the Iris Flower dataset is an example with continuous features.

4) Support Vector Machines (SVM)

Why was SVM used? I didn’t find a specific reason to that, but what I learned is that SVM delivers high accuracy results because it uses an optimization procedure. SVM builds a classifier by searching for a separating hyperplane (optimal hyperplane) which is optimal and maximises the margin that separates the categories (in our case spam and ham). Thus, SVM has the advantage of robustness in general and effectiveness when the number of dimensions is greater than the number of samples.

Unlike Naive Bayes, SVM is a non-probabilistic algorithm.

What’s the difference between LinearSVC and SVC (Scikit-learn)? The difference is that they don’t accept the same parameters. For example, LinearSVC does not accept kernel parameter as it’s supposed linear. SVC supports more parameters(C, gamma,..) since it holds all possible kernel functions (linear, polynomial, rbf or radial basis function, sigmoid).

How can tuning SVM parameters be done? Tuning SVM parameters improve the performance of the algorithm. Some of them have a higher impact:

-Kernel: Kernal is like a similarity function. It’s a way of computing the dot product of two vectors in possibly a high dimensional feature space using data transformations based on some provided constraints into a more complex space. Kernel functions are sometimes called “generalized dot product”.

-Gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Higher the value of gamma will try to exactly fit as per training data set i.e. generalization error and cause an over-fitting problem.

-C: The factor C in (3.15) is a parameter that allows one to trade off training error vs. model complexity. A small value for C will increase the number of training errors, while a large C will lead to a behavior similar to that of a hard-margin SVM.” Joachims (2002), page 40

5) Analyzing output in different cases

What if I vary dictionary size?

Varying dictionary size means changing the number of features (words). So, I wanted to explore the impact of having more features and what’s the limit of a good result based on confusion matrix result.

I tested on size= {3000,5000,6000,7000} and discovered that at size = 7000, SVM classification starts slightly dropping (false identification) while Naive Bayes delivered same results despite the size variation.

I think that at that point maybe target classes started overlapping or training data overfitting. I’m not yet sure about the explanation of the result.

What if I try Gaussian and Bernoulli?

Obviously, introducing Bernoulli won’t help because as I explained above, it doesn’t provide enough information in our case, we need the count of words, not the presence/absence of it.

Multinomial NB:
[[129   1]
[ 9 121]]
Gaussian NB:
[[129   1]
[ 11 119]]
Bernoulli NB:
[[130   0]
[ 53 77]]

As we can see, Multinomial NB outperformed both Gaussian NB and Bernoulli NB.

What if I try GridSearch on SVM in order to tune SVM parameters?
Params GridSearch: param_grid = {‘C’:[0.1,1,10,100,1000],’gamma’:[1,0.1,0.01,0.001,0.0001]}
Best params found: Gamma: 0.0001 ; C: 100

Linear SVM:
[[126   4]
[ 5 125]]
Multinomial NB:
[[129   1]
[ 9 121]]
SVM:
[[129   1]
[ 62 68]]
GridSearch on SVM:
[[126   4]
[ 2 128]]

SVM tuned parameters using GridSearch allowed to get better results.

Conclusion

So, that was about my first steps in Email-Spam Filtering application! I hope that it’s helpful if you are thinking about starting a text-classification project! And I will continue sharing various reflections and experiments along the way. For next time, I will explore more the improvement/change of training data and features.

The project is on Github too.

Tell me about your experience with text-classification? In what did you apply it? What methods do you suggest to apply? What are the challenges?

Helpful Resources:

[1] Naive Bayes and Text Classification.
[2]Naive Bayes by Example.
[3] Andrew Ng explanation of Naive Bayes video 1 and video 2
[4] Please explain SVM like I am 5 years old.
[5] Understanding Support Vector Machines from examples.