{"id":1073,"date":"2017-09-01T17:47:28","date_gmt":"2017-09-01T15:47:28","guid":{"rendered":"http:\/\/www.sarahmestiri.com\/?p=1073"},"modified":"2017-09-02T11:31:22","modified_gmt":"2017-09-02T09:31:22","slug":"applied-text-classification-on-email-spam-filtering-part-1","status":"publish","type":"post","link":"https:\/\/www.sarahmestiri.com\/index.php\/2017\/09\/01\/applied-text-classification-on-email-spam-filtering-part-1\/","title":{"rendered":"Applied text classification on Email Spam Filtering [part 1]"},"content":{"rendered":"<p>Over the last few months, I&#8217;ve been working through the online <a href=\"https:\/\/www.coursera.org\/specializations\/machine-learning\">Machine Learning Specialization<\/a> provided by the University of Washington. The first course was about <em>ML foundations<\/em>, the second about <em>Linear Regression<\/em>, and the third course, which I&#8217;m currently taking, is about <em>Classification<\/em>. I liked the courses in almost every aspect, as they teach how to implement ML algorithms from scratch. That was my goal when I decided to explore the field in more depth. But, honestly, I felt there was a gap, because many questions were left unanswered along the way. Then, after reading about how to start with Machine Learning, I found that most articles emphasized the importance of <strong>combining courses<\/strong> with <strong>practical projects<\/strong> in order to apply what&#8217;s learned and better assimilate it, and it&#8217;s so true! Try combining both and you will soon notice the difference!<\/p>\n<p>So, here I am with my first practical application! \ud83d\ude00 I chose to try <strong><em>Email Spam Filtering<\/em><\/strong> as it&#8217;s a very common topic in applied classification. 
It&#8217;s easy to relate to, because we experience spam filtering in our e-mail every day.<\/p>\n<p>I followed a simple starting tutorial: <a href=\"https:\/\/appliedmachinelearning.wordpress.com\/2017\/01\/23\/email-spam-filter-python-scikit-learn\/\">Email Spam Filtering: A Python implementation with Scikit-learn<\/a>. Soon after finishing it, my brain started analyzing the steps, and a bunch of questions bombarded my head!<\/p>\n<p>Why an &#8220;equal number of spam and non-spam emails&#8221;? What&#8217;s stemming? Are there methods to clean data other than removing stop words and lemmatization? How is the partition between the training set and the test set done? Why no validation set? Why were the Naive Bayes classifier and SVM (Support Vector Machines) specifically used? What makes Naive Bayes so popular for document classification problems? And so on.<\/p>\n<blockquote><p>As William S. Burroughs said, &#8220;Your mind will answer most questions if you learn to relax and wait for the answer.&#8221;<\/p><\/blockquote>\n<p>I took a breath and started answering the questions one by one, sometimes by searching the web, sometimes by experimenting with changes in the code and analyzing the output. 
And I&#8217;m happy to share the results:<\/p>\n<h3><span style=\"color: blue;\">1) The data we need<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">&#8211; how many emails we\u2019ve seen (will be used in the train-test sets)<\/span><br \/>\n<span style=\"font-weight: 400;\">&#8211; how many emails go in each label (used to detect if there is imbalanced data)<\/span><br \/>\n<span style=\"font-weight: 400;\">&#8211; how often a word is associated with each label (used to calculate the probability of an email being spam or ham (class 0 or class 1))<\/span><\/p>\n<h3><span style=\"color: blue;\">2) Cleaning data<\/span><\/h3>\n<p><strong>Why clean the word list?<\/strong> Cleaning the data is essential to reduce the chance of wrong results: some words have no influence on the classification (they are associated with neither the spam class nor the ham class), and other words can be normalized to group same-meaning words and reduce redundancy. By acting on the quality of the training data, we can change what is called the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Accuracy_and_precision\">accuracy<\/a> of the classifier. 
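As a toy sketch (my own illustration, not the tutorial's code), a tiny hand-written stop-word list and a crude suffix stripper can stand in for the real stop-word removal and stemming/lemmatization steps; a real project would use something like NLTK's stop-word corpus and PorterStemmer:

```python
# Hand-rolled illustrative cleaning: stop-word removal + crude "stemming".
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

def crude_stem(word):
    # Very rough normalization: strip a few common suffixes so that
    # e.g. "prizes" and "prize" count as the same feature.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(clean("The prizes are waiting and winning is easy"))
```

The crude stemmer produces non-words (e.g. "winn" from "winning"); a proper stemmer or lemmatizer avoids that, but the grouping effect is the same.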
So <em>stop-word removal<\/em>, <em>stemming<\/em>, and <em>lemmatization<\/em> help improve the results of Machine Learning algorithms.<\/p>\n<h3><span style=\"color: blue;\">3) Naive Bayes<\/span><\/h3>\n<p><strong>Why was Naive Bayes used?<\/strong> <span style=\"font-weight: 400;\">Naive Bayes offers highly effective learning and prediction. It&#8217;s often used as a baseline for comparison with more sophisticated methods because it&#8217;s fast and highly scalable (it works well with high-dimensional data), and, as Andrew Ng suggests, when dealing with an ML problem it&#8217;s best to start with a simple, quick-and-dirty algorithm and then expand from that point.<\/span><\/p>\n<p><strong>How is Naive Bayes simple and easy? <\/strong>Naive Bayes is based on Bayes&#8217; theorem, and it&#8217;s called &#8220;naive&#8221; because it <span style=\"font-weight: 400;\">assumes that <em>features are independent<\/em> of each other given the class (no\/little correlation between features), which is not realistic. Thus, <\/span><span style=\"font-weight: 400;\">Naive Bayes can learn the importance of individual features but can\u2019t determine the relationships among features. Besides, the training time with Naive Bayes is significantly smaller than with alternative methods, and it doesn&#8217;t require much training data.<\/span><\/p>\n<p><strong>Why Multinomial Naive Bayes? What about other models like Gaussian Naive Bayes or Bernoulli Naive Bayes?<\/strong><\/p>\n<p>Well, <strong><em>Multinomial NB<\/em><\/strong> considers the <strong><em>frequency count<\/em><\/strong> (occurrences) <em>of the features<\/em> (words in our case), while <strong>Bernoulli NB<\/strong> cares only about the <strong>presence or absence<\/strong> of a particular feature (word) in the document. The latter is adequate for features that are <em>binary-valued<\/em> (Bernoulli, boolean). 
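The counts-vs-presence distinction can be sketched with scikit-learn on a toy corpus (my own made-up example, not the post's dataset): Multinomial NB is fed word counts, Bernoulli NB a binarized version of the same matrix.

```python
# Same tiny corpus vectorized two ways: counts for MultinomialNB,
# binary presence/absence for BernoulliNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

emails = [
    "win money now win prize",      # spam-like
    "meeting schedule for monday",  # ham-like
    "win free money",               # spam-like
    "project meeting notes",        # ham-like
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Multinomial NB works on word occurrence counts.
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(emails), labels)

# Bernoulli NB only needs presence/absence, so counts are binarized.
bin_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(bin_vec.fit_transform(emails), labels)

print(mnb.predict(count_vec.transform(["free money prize"])))
print(bnb.predict(bin_vec.transform(["monday project meeting"])))
```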
With <strong>Gaussian NB<\/strong>, on the other hand, features are <em>real-valued<\/em> or <em>continuous<\/em> and their distribution is assumed to be Gaussian; the Iris Flower dataset is an example with continuous features.<\/p>\n<h3><span style=\"color: blue;\">4) Support Vector Machines (SVM)<\/span><\/h3>\n<p><strong>Why was SVM used? <\/strong>I didn&#8217;t find a specific reason for that, but what I learned is that SVM delivers high-accuracy results because it uses <span style=\"font-weight: 400;\">an optimization procedure. SVM builds a classifier by searching for the <\/span><b>optimal hyperplane<\/b><span style=\"font-weight: 400;\">, the separating hyperplane that <\/span><b>maximises the margin<\/b> separating the categories (in our case, spam and ham)<span style=\"font-weight: 400;\">. Thus, SVM has the advantage of robustness in general and of effectiveness when the number of dimensions is greater than the number of samples.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Unlike Naive Bayes, SVM is a non-probabilistic algorithm.<\/span><\/p>\n<p><strong>What&#8217;s the difference between LinearSVC and SVC (Scikit-learn)? <\/strong>They don&#8217;t accept the same parameters. For example, LinearSVC does not accept a kernel parameter, as the kernel is assumed to be linear. SVC supports more parameters (C, gamma, ...) since it supports all the possible <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/svm.html#svm-kernels\">kernel functions<\/a> (linear, polynomial, rbf or radial basis function, sigmoid).<\/p>\n<p><strong>How can tuning SVM parameters be done? <\/strong>Tuning SVM parameters improves the performance of the algorithm. Some of them have a larger impact than others:<\/p>\n<p><strong>-Kernel<\/strong>: A kernel is like a similarity function. 
It&#8217;s a way of computing the dot product of two vectors in a possibly high-dimensional feature space, by transforming the data into a more complex space. Kernel functions are sometimes called &#8220;generalized dot products&#8221;.<\/p>\n<p><strong>-Gamma:<\/strong> Kernel coefficient for \u2018rbf\u2019, \u2018poly\u2019 and \u2018sigmoid\u2019. A higher value of gamma makes the model try to fit the training data set exactly, which can cause an over-fitting problem.<\/p>\n<p><strong>-C<\/strong>: &#8220;The factor <i>C<\/i> in (3.15) is a parameter that allows one to trade off training error vs. model complexity. A small value for <i>C<\/i> will increase the number of training errors, while a large <i>C<\/i> will lead to a behavior similar to that of a hard-margin SVM.&#8221; Joachims (2002), page 40<\/p>\n<h3><span style=\"color: blue;\">5) Analyzing output in different cases<\/span><\/h3>\n<p><strong>What if I vary the dictionary size?<\/strong><\/p>\n<p>Varying the dictionary size means changing the number of features (words). So, I wanted to explore the impact of having more features and where good results stop, based on the <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.metrics.confusion_matrix.html\">confusion matrix<\/a>.<\/p>\n<p>I tested on size = {3000, 5000, 6000, 7000} and discovered that at size = 7000, SVM classification starts to drop slightly (more false identifications), while Naive Bayes delivered the same results despite the size variation.<\/p>\n<p>I think that at that point the target classes may have started overlapping, or the model started overfitting the training data. 
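The dictionary-size experiment can be sketched like this (a toy corpus and toy sizes of my own, standing in for the post's dataset and its 3000&#8211;7000 range): cap the vocabulary with CountVectorizer's max_features and compare the resulting confusion matrices.

```python
# Vary the vocabulary cap and compare confusion matrices on a held-out set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win cash prize now", "claim free money today", "win free prize money",
    "lunch meeting tomorrow", "project status report", "team meeting notes",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham
test_texts = ["free cash prize", "status meeting tomorrow"]
test_labels = [1, 0]

results = {}
for size in (3, 20):  # toy stand-ins for the post's 3000..7000
    vec = CountVectorizer(max_features=size)  # keep only the top `size` words
    clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
    cm = confusion_matrix(test_labels, clf.predict(vec.transform(test_texts)))
    results[size] = cm
    print(size, cm.ravel())  # tn, fp, fn, tp
```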
I&#8217;m not yet sure about the explanation of the result.<\/p>\n<p><strong>What if I try Gaussian and Bernoulli?<\/strong><\/p>\n<p>Obviously, introducing Bernoulli won&#8217;t help because, as explained above, it doesn&#8217;t provide enough information in our case: we need the counts of words, not just their presence or absence.<\/p>\n<p><span style=\"font-weight: 400;\">Multinomial NB: <\/span><br \/>\n<span style=\"font-weight: 400;\">[[129 \u00a0\u00a01]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ \u00a09 121]]<\/span><br \/>\n<span style=\"font-weight: 400;\">Gaussian NB:<\/span><br \/>\n<span style=\"font-weight: 400;\">[[129 \u00a0\u00a01]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ 11 119]]<\/span><br \/>\n<span style=\"font-weight: 400;\">Bernoulli NB:<\/span><br \/>\n<span style=\"font-weight: 400;\">[[130 \u00a0\u00a00]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ 53 \u00a077]]<\/span><\/p>\n<p>As we can see, Multinomial NB outperformed both Gaussian NB and Bernoulli NB.<\/p>\n<p><strong>What if I try <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/grid_search.html\">GridSearch<\/a> on SVM in order to tune SVM parameters?<\/strong><br \/>\nParams GridSearch: param_grid = {&#8216;C&#8217;:[0.1,1,10,100,1000],&#8216;gamma&#8217;:[1,0.1,0.01,0.001,0.0001]}<br \/>\nBest params found: Gamma: 0.0001 ; C: 100<\/p>\n<p><span style=\"font-weight: 400;\">Linear SVM:<\/span><br \/>\n<span style=\"font-weight: 400;\">[[126 \u00a0\u00a04]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ \u00a05 125]]<\/span><br \/>\n<span style=\"font-weight: 400;\">Multinomial NB:<\/span><br \/>\n<span style=\"font-weight: 400;\">[[129 \u00a0\u00a01]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ \u00a09 121]]<\/span><br \/>\n<span style=\"font-weight: 400;\">SVM:<\/span><br \/>\n<span style=\"font-weight: 400;\">[[129 \u00a0\u00a01]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ 62 \u00a068]]<\/span><br \/>\n<span style=\"font-weight: 
400;\">GridSearch on SVM:<\/span><br \/>\n<span style=\"font-weight: 400;\">[[126 \u00a0\u00a04]<\/span><br \/>\n<span style=\"font-weight: 400;\"> [ \u00a02 128]]<\/span><\/p>\n<p>Tuning the SVM parameters with GridSearch gave better results.<\/p>\n<h4><span style=\"color: blue;\">Conclusion<\/span><\/h4>\n<p>So, those were my first steps in an Email Spam Filtering application! I hope it&#8217;s helpful if you are thinking about starting a text-classification project! I will continue sharing various reflections and experiments along the way. Next time, I will explore improving and changing the training data and features.<\/p>\n<p>The project is on <a href=\"https:\/\/github.com\/SarahMestiri\/MachineLearning\/tree\/master\/Email-Spam-Filter\">Github<\/a> too.<\/p>\n<p><strong class=\"markup--strong markup--p-strong\"><em class=\"markup--em markup--p-em\">Tell me about your experience with text classification! What did you apply it to? What methods do you suggest? 
What are the challenges?<\/em><\/strong><\/p>\n<p><strong>Helpful Resources:<\/strong><\/p>\n<p>[1] <a href=\"http:\/\/sebastianraschka.com\/Articles\/2014_naive_bayes_1.html\">Naive Bayes and Text Classification<\/a>.<br \/>\n[2] <a href=\"http:\/\/dataaspirant.com\/2017\/02\/06\/naive-bayes-classifier-machine-learning\/\">Naive Bayes by Example<\/a>.<br \/>\n[3] Andrew Ng&#8217;s explanation of Naive Bayes: <a href=\"https:\/\/www.youtube.com\/watch?v=z5UQyCESW64\">video 1<\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=NFd0ZQk5bR4\">video 2<\/a>.<br \/>\n[4] <a href=\"https:\/\/www.reddit.com\/r\/MachineLearning\/comments\/15zrpp\/please_explain_support_vector_machines_svm_like_i\/\">Please explain SVM like I am 5 years old<\/a>.<br \/>\n[5] <a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2015\/10\/understaing-support-vector-machine-example-code\/\">Understanding Support Vector Machines from examples<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Over the last few months, I&#8217;ve been working through the online Machine Learning Specialization provided by the University of Washington. The first course was about ML foundations, the second about Linear Regression, and the third course, which I&#8217;m currently taking, is about Classification. 
I liked the courses almost in every aspect as they\u00a0teach ML algorithms implementation [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1093,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"sfsi_plus_gutenberg_text_before_share":"","sfsi_plus_gutenberg_show_text_before_share":"","sfsi_plus_gutenberg_icon_type":"","sfsi_plus_gutenberg_icon_alignemt":"","sfsi_plus_gutenburg_max_per_row":"","footnotes":""},"categories":[11,23,13],"tags":[18,38,20,37,36],"class_list":["post-1073","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-my-activities","category-projects","category-technology","tag-machine-learning","tag-naive-bayes","tag-python","tag-svm","tag-text-classification"],"_links":{"self":[{"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/posts\/1073","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/comments?post=1073"}],"version-history":[{"count":22,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/posts\/1073\/revisions"}],"predecessor-version":[{"id":1085,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/posts\/1073\/revisions\/1085"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/media\/1093"}],"wp:attachment":[{"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/media?parent=1073"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/categories?post=1073"},{"taxonomy":"post_tag",
"embeddable":true,"href":"https:\/\/www.sarahmestiri.com\/index.php\/wp-json\/wp\/v2\/tags?post=1073"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}