### Applied text classification on Email Spam Filtering [part 1]

For the last few months, I’ve been working through the online Machine Learning Specialization offered by the University of Washington. The first course was about ML foundations, the second about Linear Regression, and the third, which I’m currently taking, is about Classification. I liked the courses in almost every aspect, as they teach how to implement ML algorithms from scratch. That was my goal when I decided to explore the field in more depth. But, honestly, I felt there was a gap somewhere, because many questions were left unanswered along the way. Then, after reading about how to get started with Machine Learning, I found that most articles emphasized the importance of combining courses with practical projects in order to apply what’s learned and better assimilate it, and it’s so true! Just try combining both and you will soon notice the difference!

So, here I am in my first practice application! 😀 I chose to try Email Spam Filtering as it’s a very common topic in applied classification. It can be understood easily because we are experiencing spam filtering in our e-mails every day.

I followed a simple starting tutorial: Email Spam Filtering: A Python implementation with Scikit-learn. Soon after finishing it, my brain started analyzing the steps and a bunch of questions bombarded my head!

Why an “equal number of spam and non-spam emails”? What’s stemming? Are there methods to clean data other than removing stop words and lemmatization? How is the partition between the training set and the test set done? Why is there no validation set? Why were the Naive Bayes classifier and SVM (Support Vector Machines) specifically used? What makes Naive Bayes so popular for document classification problems? And so on.

As William S. Burroughs said, “Your mind will answer most questions if you learn to relax and wait for the answer.”

I took a breath and started answering the questions one by one, sometimes by searching the net, sometimes by experimenting with changes in the code and analyzing the output. And I’m happily sharing the results:

### 1) The data we need

– how many emails we’ve seen (used to build the training and test sets)
– how many emails fall under each label (used to detect imbalanced data)
– how often a word is associated with each label (used to calculate the probability of an email being spam or ham (class 0 or class 1))
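To make these three quantities concrete, here is a minimal sketch on a tiny made-up corpus. The emails, the variable names and the choice of 1 for spam and 0 for ham are my own assumptions for illustration, not taken from the tutorial.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus of (text, label) pairs.
# Assumed convention for this sketch: 1 = spam, 0 = ham.
emails = [
    ("win money now", 1),
    ("cheap money offer", 1),
    ("meeting at noon", 0),
    ("lunch meeting tomorrow", 0),
    ("project update", 0),
]

# 1) How many emails we have seen.
total_emails = len(emails)

# 2) How many emails fall under each label (reveals imbalance).
emails_per_label = Counter(label for _, label in emails)

# 3) How often each word appears under each label.
word_counts = defaultdict(Counter)
for text, label in emails:
    word_counts[label].update(text.split())

print(total_emails)
print(dict(emails_per_label))
print(word_counts[1]["money"], word_counts[0]["meeting"])
```

With three ham emails against two spam emails, the second counter already shows a slight imbalance.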

### 2) Cleaning data

Why clean the word list? Cleaning data is essential to reduce the chance of getting wrong results: some words have no influence on the classification (they are associated with neither the spam class nor the ham class), and other words can be normalized in order to group same-meaning words and reduce redundancy. By improving the quality of the training data, we can improve the accuracy of the classifier. So removing stop words, stemming, and lemmatization all help improve the results of Machine Learning algorithms.
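As a rough illustration of what this cleaning does, here is a small sketch using only the standard library. The stop-word list is a tiny hand-picked sample and `crude_stem` is a deliberately naive stand-in for a real stemmer (both are my assumptions); in a real project you would use a proper NLP library instead.

```python
import re

# Small illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "you"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Lowercase, keep alphabetic tokens, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

print(clean("The offers offered and the offering of a prize"))
```

The three variants of "offer" collapse into a single feature, which is exactly the redundancy reduction described above.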

### 3) Naive Bayes

Why was Naive Bayes used? Naive Bayes learns and predicts very efficiently, and it’s often used as a baseline to compare against more sophisticated methods because it’s fast and highly scalable (it works well with high-dimensional data). Also, as Andrew Ng suggests, when dealing with an ML problem, start with a simple quick-and-dirty algorithm and then expand from that point.

How is Naive Bayes simple and easy? Naive Bayes is based on Bayes’ theorem, and it’s called “Naive” because it assumes that features are independent of each other given the class (no or little correlation between features), which is not realistic. Thus, Naive Bayes can learn the importance of individual features but can’t capture the relationships among features. Besides, training time with Naive Bayes is significantly shorter than with alternative methods, and it doesn’t require much training data.
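To see how little machinery is involved, here is a from-scratch sketch of a word-count Naive Bayes classifier. The toy documents and function names are hypothetical, and I add add-one (Laplace) smoothing so that an unseen word does not zero out a whole class; treat it as an illustration of the idea rather than the tutorial’s implementation.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (list_of_words, label). Returns the counts the model needs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return label_counts, word_counts, vocab

def predict(words, label_counts, word_counts, vocab):
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # log P(label) + sum of log P(word | label):
        # summing per-word terms is the "naive" independence assumption.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing keeps unseen words from giving probability zero.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (made up for this sketch).
docs = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["project", "meeting", "update"], "ham"),
]
model = train(docs)
print(predict(["money", "offer"], *model))
print(predict(["meeting", "update"], *model))
```

Training is just counting, which is why it is so fast compared with methods that need an iterative optimization.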

Why Multinomial Naive Bayes? What about other models like Gaussian Naive Bayes or Bernoulli Naive Bayes?

Well, Multinomial NB considers the frequency count (occurrences) of the features (words in our case), while Bernoulli NB cares only about the presence or absence of a particular feature (word) in the document. The latter is adequate for binary-valued (Bernoulli, boolean) features. With Gaussian NB, on the other hand, features are real-valued or continuous and are assumed to follow a Gaussian distribution; the Iris Flower dataset is an example with continuous features.
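The difference between the multinomial and Bernoulli views is easy to see on a single email. In this small sketch, the three-word vocabulary and the helper names are hypothetical:

```python
from collections import Counter

vocabulary = ["free", "money", "meeting"]  # hypothetical 3-word dictionary

def multinomial_features(words):
    """Word counts: what Multinomial NB works with."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

def bernoulli_features(words):
    """Presence/absence flags: what Bernoulli NB works with."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]

email = "free free free money".split()
print(multinomial_features(email))  # [3, 1, 0]
print(bernoulli_features(email))    # [1, 1, 0]
```

The Bernoulli vector cannot tell that "free" was repeated three times, so that piece of information is lost.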

### 4) Support Vector Machines (SVM)

Why was SVM used? I didn’t find a specific reason for that, but what I learned is that SVM delivers highly accurate results because it uses an optimization procedure. SVM builds a classifier by searching for the optimal separating hyperplane, the one that maximises the margin between the categories (in our case spam and ham). Thus, SVM has the advantage of being robust in general and effective when the number of dimensions is greater than the number of samples.

Unlike Naive Bayes, SVM is a non-probabilistic algorithm.

What’s the difference between LinearSVC and SVC (Scikit-learn)? The difference is that they don’t accept the same parameters. For example, LinearSVC does not accept a kernel parameter, as the kernel is assumed to be linear. SVC accepts kernel-related parameters such as kernel and gamma in addition to C, since it supports several kernel functions (linear, polynomial, rbf or radial basis function, sigmoid).

How can SVM parameters be tuned? Tuning SVM parameters can improve the performance of the algorithm. Some of them have a higher impact:

-Kernel: A kernel is like a similarity function. It’s a way of computing the dot product of two vectors in a possibly high-dimensional feature space without explicitly transforming the data into that more complex space. Kernel functions are sometimes called “generalized dot products”.

-Gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. A higher value of gamma makes the model try to fit the training data set exactly, which increases the generalization error and causes an over-fitting problem.

-C: “The factor C in (3.15) is a parameter that allows one to trade off training error vs. model complexity. A small value for C will increase the number of training errors, while a large C will lead to a behavior similar to that of a hard-margin SVM.” Joachims (2002), page 40
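To get a feel for the kernel as a similarity function and for what gamma does, here is a small sketch of the RBF kernel, exp(-gamma * ||x - y||^2), written with the standard library only. The two sample vectors are arbitrary values I picked for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """Similarity between two vectors: 1.0 when identical, near 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = [1.0, 2.0], [2.0, 4.0]  # arbitrary sample points
for gamma in (0.0001, 0.01, 1):
    print(gamma, rbf_kernel(x, y, gamma))
```

With a small gamma the two points still look very similar, while with gamma = 1 their similarity is almost zero. A large gamma therefore means each training point only influences its immediate neighbourhood, which is how the model ends up fitting the training set too closely.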

### 5) Analyzing output in different cases

What if I vary dictionary size?

Varying the dictionary size means changing the number of features (words). So, I wanted to explore the impact of having more features, and find the point up to which results stay good, judging by the confusion matrix.
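Here is a minimal sketch of why the dictionary size is the number of features, assuming the dictionary is simply the N most frequent words of the training corpus. The documents and helper names are made up for illustration.

```python
from collections import Counter

def build_dictionary(documents, size):
    """Keep the `size` most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(size)]

def to_features(document, dictionary):
    """One count per dictionary word, so the vector length equals the dictionary size."""
    counts = Counter(document.split())
    return [counts[word] for word in dictionary]

documents = ["free money free offer", "meeting money tomorrow", "free meeting"]
for size in (2, 4):
    dictionary = build_dictionary(documents, size)
    print(size, dictionary, to_features(documents[0], dictionary))
```

A larger dictionary brings in rarer words, which gives the classifier more features but also more noise to deal with.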

I tested with size = {3000, 5000, 6000, 7000} and discovered that at size = 7000, SVM classification starts dropping slightly (more misclassifications), while Naive Bayes delivered the same results despite the size variation.

I think that at that point maybe the target classes started overlapping, or the model started overfitting the training data. I’m not yet sure about the explanation of this result.

What if I try Gaussian and Bernoulli?

Obviously, introducing Bernoulli won’t help because, as I explained above, it doesn’t provide enough information in our case: we need the count of each word, not just its presence or absence.

Multinomial NB:
[[129   1]
[  9 121]]
Gaussian NB:
[[129   1]
[ 11 119]]
Bernoulli NB:
[[130   0]
[ 53  77]]

As we can see, Multinomial NB outperformed both Gaussian NB and Bernoulli NB.
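To put a single number on each matrix, here is a short sketch that computes accuracy (correct predictions on the diagonal divided by all predictions) from the three confusion matrices above. The `accuracy` helper is my own; the numbers are the ones reported.

```python
def accuracy(cm):
    """Share of correct predictions: diagonal of the 2x2 matrix over its total."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

results = {
    "Multinomial NB": [[129, 1], [9, 121]],
    "Gaussian NB": [[129, 1], [11, 119]],
    "Bernoulli NB": [[130, 0], [53, 77]],
}
for name, cm in results.items():
    print(f"{name}: accuracy = {accuracy(cm):.3f}")
```

This gives roughly 0.962 for Multinomial NB, 0.954 for Gaussian NB and 0.796 for Bernoulli NB on the 260 test emails.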

What if I try GridSearch on SVM in order to tune the SVM parameters?

Params GridSearch: param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

Best params found: Gamma: 0.0001 ; C: 100
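Conceptually, a grid search just tries every combination in the grid (here 5 x 5 = 25) and keeps the best-scoring one; scikit-learn’s GridSearchCV does this with cross-validation. The sketch below shows only the loop. `grid_search` is my own helper and `toy_score` is a made-up stand-in for "train an SVM with these parameters and return its validation accuracy", rigged to peak at the values reported above so the loop has something to find.

```python
from itertools import product

param_grid = {"C": [0.1, 1, 10, 100, 1000], "gamma": [1, 0.1, 0.01, 0.001, 0.0001]}

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Made-up scoring function (NOT a real SVM evaluation); it only exercises the loop.
def toy_score(C, gamma):
    return -abs(C - 100) - abs(gamma - 0.0001)

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

In a real run, each of the 25 combinations costs one or more full SVM trainings, which is why grid search gets slow as the grid grows.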

Linear SVM:
[[126   4]
[  5 125]]
Multinomial NB:
[[129   1]
[  9 121]]
SVM:
[[129   1]
[ 62  68]]
GridSearch on SVM:
[[126   4]
[  2 128]]

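Tuning the SVM parameters with GridSearch gave the best results: 6 misclassified emails out of 260, against 9 for Linear SVM, 10 for Multinomial NB and 63 for the untuned SVM.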

#### Conclusion

So, that was about my first steps in an Email Spam Filtering application! I hope it’s helpful if you are thinking about starting a text-classification project! I will continue sharing various reflections and experiments along the way. Next time, I will explore further how improving or changing the training data and features affects the results.

The project is on GitHub too.

Tell me about your experience with text classification! What did you apply it to? What methods do you suggest? What are the challenges?

[1] Naive Bayes and Text Classification.
[2] Naive Bayes by Example.
[3] Andrew Ng’s explanation of Naive Bayes, video 1 and video 2.
[4] Please explain SVM like I am 5 years old.
[5] Understanding Support Vector Machines from examples.

### Starting with OpenCV and Tesseract OCR on Visual Studio 2017 [Challenge 1]

I have recently started working on a freelance project where I need to do scene text recognition using the OpenCV and Tesseract libraries. I was so motivated to enter the world of computer vision combined with machine learning, and to gain experience developing applications in the field, that I welcomed the challenges that come with it!
Here I’ll be talking about the first challenge and how I tackled it.

I found myself with multiple new things to prepare before I could start coding, not to mention that it had been a long time since I last coded in C++ (back in my university days)!

At first, I was asked to use OpenCV 3.0 and Tesseract 3.02 in order to run the part of the project that was already available. So I installed OpenCV 3.0 and Tesseract 3.02 with the Leptonica library, following the documentation provided in this link on how to build applications with OpenCV on Windows in Visual Studio. Then, I tried to run the project in Visual Studio 2017. I got more than 800 errors!!! Most of them were LINK errors of type:

mismatch detected for '_MSC_VER': value '1700' doesn't match value '1900' in <filename.obj> 

and

error LNK2038: mismatch detected for 'RuntimeLibrary': value 'MT_StaticDebug' doesn't match value 'MD_DynamicDebug' in file.obj. 

The first error was caused by the fact that the objects were compiled by different versions of the compiler, which is not supported because different versions of the standard library are binary incompatible. Specifically, the numbers 1700 and 1900 mean that binaries built with v11 (Visual Studio 2012) are not compatible with binaries built with v14 (Visual Studio 2015).[1][2]

As for the second one, the reason is that both the library and your project must be linked with the same C Runtime Library settings, whether in Release or Debug mode [3]. Here is the solution in case you need to use static libs; otherwise, you can deactivate the static CRT when building with CMake by setting:

-DBUILD_WITH_STATIC_CRT=OFF

So I decided to change the version of Tesseract so that it would be compatible with a newer version of Visual Studio and an x64 target project. As compiling Tesseract 3.05.1 with the cppan client was not possible for me (cppan crashed just after launching it), I decided to go for VS2015_Tesseract. I followed these steps, using the v140 toolset: Visual Studio 2017 ships with the v141 toolset, but it also supports v140.

Besides, there were some other points to consider. Unfortunately, I got errors again after configuring my project exactly as in tesseract\tesseract.vcxproj (extracted from VS2015_Tesseract), and I found out that the reason was having a space in the Tesseract directory path.

One last thing to consider when using this Tesseract version is initialization: the TESSDATA_PREFIX environment variable is required to locate the Tesseract data (tesseract/tessdata) folder. You can also set it via api->Init(<data path here>, <language>).

I also wanted a newer version of OpenCV and went for OpenCV 3.2. Fortunately, that was easier than the Tesseract installation, but of course not without some difficulties. I followed this tutorial in order to install it. In these steps, you have to pay attention to the version of Visual Studio you choose when configuring CMake, as there is one generator for a win64 (or x64) target project and one for win32 (or x86). I chose “Visual Studio 15 2017 Win64”.

And yes! It worked! I tested OpenCV 3.2 on a sample project and Tesseract on a sample project with OpenCV! I was relieved! Finally, I could start coding!…

After launching the project “End to end text recognition”, I discovered that I wasn’t so close! There was a problem with the opencv_text module. I got the error: Tesseract not found at runtime. So, I asked for help on the OpenCV forum and StackOverflow and got a suggested solution. In the OpenCV configuration step using CMake, besides providing the Lept_LIBRARY path, the Tesseract_INCLUDE_DIR path and tesseract_LIBRARY, check the Build_examples checkbox and uncheck Build_tests and Build_perf_tests (you don’t need tests when you are a beginner, and they take a lot of space, as far as I know). Also, be sure that:

– the tesseract305.dll file is in the tesseract-3.05.01\build\bin\Release directory;

– the leptonica-1.74.4.dll file is in leptonica-1.74.4/build/bin/Release;

– both files are copied to opencv\build\bin\Release (if you will use Release mode, or bin\Debug if you will use Debug mode).

After that, you have to open INSTALL.vcxproj from the OpenCV build directory and build it in Debug (if you will need Debug mode later) and then in Release. Be sure to select the right platform, x64 (if you chose x64 in CMake configure) or x86 (if you chose Win32 in CMake configure), or you can get errors of type:

“fatal error LNK1112: module machine type ‘x64’ conflicts with target machine type ‘X86’” [4]

That’s all! Now you can start creating projects in Visual Studio. Don’t forget to configure the project correctly: