Spam Email Filtering

fat8348
Dec 7, 2021
5 min read

Updated: Dec 8, 2021

GitHub link: https://github.com/Fagun13/DataMinig-Final

Kaggle link: https://www.kaggle.com/fagunthakkar/notebook10f1bdaf80/data

YouTube link: https://www.youtube.com/watch?v=Z3QjRj0limM

Project page link: https://fat8348.wixsite.com/website/single-project

Project download link (in the project section of the homepage): https://fat8348.wixsite.com/website

Problem Explanation:

With the heavy use of the internet, security is becoming a major issue in today's world.

Most social media platforms are building security features into their applications.

Email is a common way to communicate in office culture, and it is an important part of business.

Spam emails are one of the threats to our email systems.

Spam emails contain junk data that can affect your system and may even be used to hack it.

Spam emails also put the credentials stored on your system at high risk.

The main goal of this project is to identify spam emails so they can be moved to a separate folder.

Here, I am using 4 different dataset files.

The link below contains 1 file.

The file contains 3 columns: Email content, Email ID, and Classification.

https://www.kaggle.com/team-ai/spam-text-message-classification?select=SPAM+text+message+20170820+-+Data.csv

The link below contains 3 CSV files of the dataset.

Each file contains the Subject and the Classification.

https://www.kaggle.com/team-ai/spam-text-message-classification?

Step 1: Load data and preprocess it

In this step, I load the content from the CSV files and remove special characters using regex.

Next, I remove stop words using NLTK.

Next, I divide the data into train, dev, and test sets.

The ratio is 60/20/20 respectively.

Train - used for training the model.

Dev - used for experiments and for finding the optimal hyperparameters.

Here I try different algorithms on the dev dataset and settle on a final solution.

Test - used for final testing with the parameters found on dev.
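The preprocessing and the 60/20/20 split described above could look roughly like the sketch below. The function names, the seed, and the tiny STOP_WORDS set are assumptions for illustration; the project itself uses the NLTK stop-word list.

```python
import random
import re

# Stand-in stop-word list (assumption); the post uses NLTK's English list,
# e.g. set(nltk.corpus.stopwords.words("english")) after downloading it.
STOP_WORDS = {"you", "are", "the", "of", "a", "is", "to", "and", "in"}

def preprocess(text):
    """Lowercase, strip special characters, and drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]

def split_data(rows, seed=0):
    """Shuffle rows and split them 60/20/20 into train, dev, and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(0.6 * len(rows))
    n_dev = int(0.2 * len(rows))
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]
    return train, dev, test

print(preprocess("Congrats!! You are the winner of 1000 amount"))
# ['congrats', 'winner', '1000', 'amount']
```

Shuffling before splitting keeps spam and ham mixed across the three parts, which matters if the CSV files are ordered by class.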

Step 2: Build a classification model

Here I build a Naive Bayes (NB) classifier for text classification.

The work in this part is:

Count how many spam documents contain a particular word, and add the word to a set of unique words along with its count.

Count how many ham documents contain a particular word, and add the word to a set of unique words along with its count.

Count how many spam sentences/documents and how many ham documents there are.
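The three counts listed above can be collected in a single pass over the training data. This is a minimal sketch, not the project's code: train_nb and the toy documents are made-up names and data.

```python
from collections import Counter

def train_nb(documents):
    """documents: iterable of (list_of_words, label), label is 'spam' or 'ham'."""
    doc_counts = Counter()                       # documents per class
    word_doc_counts = {"spam": Counter(), "ham": Counter()}
    for words, label in documents:
        doc_counts[label] += 1
        # set() so a word repeated within one document is counted once
        word_doc_counts[label].update(set(words))
    return doc_counts, word_doc_counts

# Toy training data (illustrative only).
toy = [
    (["congrats", "winner", "prize"], "spam"),
    (["winner", "winner", "claim"], "spam"),
    (["meeting", "at", "noon"], "ham"),
]
doc_counts, word_doc_counts = train_nb(toy)
print(doc_counts["spam"], word_doc_counts["spam"]["winner"])  # 2 2
```

Counting each word once per document matches the definition used here, where a word's probability is the share of documents in a class that contain it.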

Step 3: Test the built model

Here, each sentence in the test data is broken into words, and for each word I find the probability of it appearing in spam and in ham documents.

The probability for the sentence is found by multiplying the probabilities of its words together with the spam class probability and the ham class probability respectively.

NB classifier algorithm explained with an example:

Text: Congrats!! You are the winner of 1000 amount

Preprocessing the text

After preprocessing, the content is: congrats winner 1000 amount

1) Calculating the probability of each word being in spam

Spam email probability calculation

Probability of text being in spam = probability of each word being in spam * spam class probability

Probability of word being in spam = number of spam documents containing the word / number of spam documents

Spam class probability = number of spam documents / total number of documents

p(spam class)= 3386/14533=0.23298699511456686

p(congrats/spam)=2.1240035768220234e-0

p(winner/spam)=3.1579738360469243e-09

p(1000/spam)=1.3616321878246288e-12

p(amount/spam)=5.813144390883573e-1

Final Probability of content being in spam

p(s/spam)=p(congrats/spam) * p(winner/spam) * p(1000/spam) * p(amount/spam) * p(spam class)

p(s/spam)=1.8832179849497723e-14

Find the probability for ham in the same way and compare both probabilities.

Assign whichever class has the higher probability.
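The comparison above can be sketched as follows. Two details are assumptions rather than the post's confirmed method: the additive form of the alpha smoothing (the post only says alpha values 0 and 1 were tried), and summing logarithms instead of multiplying raw probabilities, which gives the same winner but avoids underflow with very small numbers. All names and counts are made up.

```python
import math

def nb_log_score(words, label, doc_counts, word_doc_counts, alpha=1.0):
    """Log of: class probability * product of per-word probabilities."""
    total_docs = sum(doc_counts.values())
    n_class_docs = doc_counts[label]
    score = math.log(n_class_docs / total_docs)
    for word in words:
        count = word_doc_counts[label].get(word, 0)
        # Additive smoothing (assumed form): alpha=0 gives the raw ratio
        # count / n_class_docs, which is zero for unseen words.
        prob = (count + alpha) / (n_class_docs + 2 * alpha)
        if prob == 0:
            return float("-inf")
        score += math.log(prob)
    return score

def classify(words, doc_counts, word_doc_counts, alpha=1.0):
    spam = nb_log_score(words, "spam", doc_counts, word_doc_counts, alpha)
    ham = nb_log_score(words, "ham", doc_counts, word_doc_counts, alpha)
    return "spam" if spam > ham else "ham"

# Toy counts (made up for illustration, not the post's data).
doc_counts = {"spam": 2, "ham": 3}
word_doc_counts = {
    "spam": {"congrats": 2, "winner": 2, "amount": 1},
    "ham": {"meeting": 2, "amount": 1},
}
print(classify(["congrats", "winner", "1000", "amount"], doc_counts, word_doc_counts))
# spam
```

With alpha=0, a single word that never appeared in a class drives that class's probability to zero, which is why a non-zero alpha is worth trying on the dev set.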

Result of the NB classifier on the dev dataset with alpha values 0 and 1

Step 4: k-NN classifier

Algorithm Explanation

Find the distance from the test document to each document in the training dataset.

Find the class (spam or ham) of the K nearest documents.

Here the value of K is 3.

Count the spam and ham labels among them and assign whichever is more frequent.

An odd number is better, because if 2 documents are spam and 2 are ham,

then it is difficult to choose which class to assign.

So an odd value of K is better.

Here I have used the concept of Jaccard distance.

Jaccard similarity = number of matching 1s / total number of 1s (attributes that are 1 in either document); the Jaccard distance is 1 minus this value.

Jaccard calculation

# Jaccard similarity = number of 1-1 matches / number of non-zero attributes
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}

# Words shared by both documents
intersec = words_doc1.intersection(words_doc2)
# Words appearing in either document
union = words_doc1.union(words_doc2)

# Similarity of the two documents; 1 minus this value is the Jaccard distance
print(len(intersec)/len(union))
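Putting the k-NN steps and the Jaccard measure together, a minimal sketch could look like this. The function names and the four training documents are hypothetical, and the distance is taken as 1 minus the Jaccard similarity.

```python
from collections import Counter

def jaccard_distance(a, b):
    """1 - (shared words / all words) for two sets of words."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def knn_predict(query_words, training_docs, k=3):
    """training_docs: list of (set_of_words, label)."""
    query = set(query_words)
    # Sort training documents from nearest to farthest.
    ranked = sorted(training_docs, key=lambda doc: jaccard_distance(query, doc[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    # Majority vote among the k nearest labels.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training data (illustrative only).
training_docs = [
    ({"congrats", "winner", "prize"}, "spam"),
    ({"winner", "claim", "amount"}, "spam"),
    ({"meeting", "at", "noon"}, "ham"),
    ({"lunch", "at", "noon"}, "ham"),
]
print(knn_predict(["congrats", "winner", "1000", "amount"], training_docs))
# spam
```

Every prediction sorts the whole training set, so the cost grows with the number of training documents multiplied by the number of test documents.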

The accuracy of this model is 26%.

Step 5: K-means Algorithm

Grouping algorithm

Algorithm Explanation

It groups spam and ham emails according to their distance from the centers.

The distance concept is the same as above.

The value of k is 2 because there are 2 classes, spam and ham.

Using Jaccard distance:

1) Randomly select 2 centers (2 centers because k=2).

2) Find the distance of each data point to the centers and assign it the class of the center at the smaller distance.

3) Group again and find the centers again using the new class assigned to each data point.

4) Repeat the process until the centers stop changing.

Here I find the distance between each center and every data point in the dataset.

Each data point is assigned the class of the center with the higher Jaccard value (matches / total, so a higher value means a closer center), and then the centers are found again.
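The four steps can be sketched as below. One detail is an assumption: the post does not say how a center is recomputed for sets of words, so this sketch picks the member of each group with the smallest total distance to the others (a medoid, which strictly makes it k-medoids rather than classic k-means). All names and the toy documents are hypothetical.

```python
import random

def jaccard_distance(a, b):
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def nearest_center(doc, centers):
    """Index of the center at the smallest Jaccard distance from doc."""
    return min(range(len(centers)), key=lambda i: jaccard_distance(doc, centers[i]))

def medoid(cluster):
    """Member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda c: sum(jaccard_distance(c, d) for d in cluster))

def cluster_documents(docs, k=2, seed=0, max_iters=20):
    centers = random.Random(seed).sample(docs, k)   # step 1: random centers
    for _ in range(max_iters):
        # Step 2: assign every document to its nearest center.
        assignment = [nearest_center(doc, centers) for doc in docs]
        # Step 3: recompute each center from the documents assigned to it.
        new_centers = []
        for i in range(k):
            members = [d for d, a in zip(docs, assignment) if a == i]
            new_centers.append(medoid(members) if members else centers[i])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return assignment, centers

# Toy documents (illustrative only): two about prizes, two about meetings.
docs = [
    {"win", "prize", "money"},
    {"win", "prize", "now"},
    {"team", "meeting", "today"},
    {"team", "meeting", "notes"},
]
assignment, centers = cluster_documents(docs)
print(assignment)
```

The group numbers 0 and 1 carry no meaning by themselves; deciding which group is spam and which is ham still needs the labels, for example by taking the majority label in each group.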

The accuracy of the above K-means model is 42%.

Step 6: Find the best algorithm

As you can see in the graph, the NB classifier performs better than k-NN and k-means.

The reason is that k-NN and k-means are distance-based algorithms.

Computing distances between texts takes time; moreover, these algorithms do not work well with missing values.

So the NB classifier performs best.

Step 7:

Now that we know the NB classifier is the best algorithm, I want to make changes to that approach as well.

Here I am adding the email and reply columns.

I store the domain of each email in spam_domain_set and ham_domain_set.

Why am I including these 2 extra columns for classification?

I used to get fake job emails from one specific domain. I replied to them once by chance, and now I still get emails from that domain in Gmail, even though I have marked that email as spam.

This is just because I replied once by mistake; I then realized it was fake because they were asking me to pay 9000 Rs to get a job at Infosys.

So to solve this kind of issue I came up with a new solution.

Note: I have not added an email address for every record in the file.

First Algorithm:

I use only the subject, email, and classification columns from the dataset.

While training, I store the domain of each email in the spam or ham set respectively, along with a count for each domain.

While calculating probabilities, I also calculate the probability of the domain in spam and in ham respectively.

At test time, the model finds these probabilities.

Here I calculate the probability of the email domain as well, if it is present in the data.
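A rough sketch of the domain counting and the domain probability is shown below. The helper names, the example addresses, and the smoothing form are assumptions; the post's spam_domain_set and ham_domain_set are represented here as one dictionary keyed by class.

```python
from collections import Counter

def email_domain(address):
    """Part after the last '@', lowercased; None if there is no address."""
    if not address or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()

def count_domains(rows):
    """rows: iterable of (email_address_or_None, label)."""
    domain_counts = {"spam": Counter(), "ham": Counter()}
    for address, label in rows:
        domain = email_domain(address)
        if domain is not None:          # not every record has an email
            domain_counts[label][domain] += 1
    return domain_counts

def domain_probability(domain, label, domain_counts, n_class_docs, alpha=1.0):
    """Smoothed share of the class's documents sent from this domain."""
    return (domain_counts[label][domain] + alpha) / (n_class_docs + 2 * alpha)

# Hypothetical addresses for illustration.
rows = [
    ("jobs@fake-offers.example", "spam"),
    ("hr@fake-offers.example", "spam"),
    ("friend@mail.example", "ham"),
    (None, "ham"),
]
domain_counts = count_domains(rows)
print(domain_probability("fake-offers.example", "spam", domain_counts, 2))  # 0.75
print(domain_probability("fake-offers.example", "ham", domain_counts, 2))   # 0.25
```

The domain probability can then be multiplied into the spam and ham scores like one more word, and skipped when a record has no email.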

Taking the reply column into account

Algorithm

This is a hybrid algorithm.

Here I add a condition after calculating the probabilities for spam and ham.

If the reply count is greater than 0, the email is always classified as ham.

Reply count: how many times I have sent a reply to a particular sender.

I implemented this because I feel this is what my Gmail is doing.

To improve on this, my solution is to use the domain approach above.

If we get an email from a domain that is already in spam, then that email is added to spam.
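The two rules can be layered on top of the NB result as in the sketch below. The order in which the rules are checked is an assumption, since the post does not state it; here the spam-domain rule goes first so that one accidental reply cannot keep a known spam domain in the inbox. The function and parameter names are hypothetical.

```python
def hybrid_classify(nb_label, sender_domain, reply_count, spam_domains):
    """Combine the NB result with the domain and reply-count rules.

    The order of the two rules is an assumption: the spam-domain rule is
    checked first so that one accidental reply cannot whitelist a domain
    that is already known to send spam.
    """
    if sender_domain is not None and sender_domain in spam_domains:
        return "spam"
    if reply_count > 0:
        return "ham"
    return nb_label

spam_domains = {"fake-offers.example"}   # hypothetical domain
# Replied once by mistake, but the domain is already marked as spam.
print(hybrid_classify("ham", "fake-offers.example", 1, spam_domains))  # spam
# Replied to a sender from an unknown domain: treated as ham.
print(hybrid_classify("spam", "mail.example", 2, spam_domains))        # ham
# No reply and unknown domain: the NB result is kept.
print(hybrid_classify("spam", None, 0, spam_domains))                  # spam
```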

Changes I have made.

Final Accuracy of the model

60.17937219730942%

Contribution:

1) NB classifier with the old approach and alpha value - built manually from the theory

2) k-NN classifier with Jaccard distance - built manually from the theory

3) K-means, grouping spam and ham emails - built manually from the theory

4) NB classifier with subject, email, and reply fields - built manually from the theory

5) Compared NB, k-means, and k-NN to arrive at the best algorithm

Challenges:

The main challenge in this project was computing distances for a large amount of data.

Train data size: 14533

Test data size: 4846

For each test record, the distance to every train record has to be found.

Computing 14533 * 4846 (about 70 million) distances was very difficult.

My computer ran the code for this task for around 8 hours (maybe an issue with my memory).

So I drastically reduced the size of the data and used a small amount of it for train and test in the k-NN part.

In other words, I used sampling.
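A small sketch of such sampling is shown below. The sample sizes 1000 and 200 are hypothetical, as the post does not say how many records were kept, and the integer lists only stand in for the real records.

```python
import random

def sample_rows(rows, n, seed=0):
    """Return n rows chosen at random without replacement (all rows if fewer)."""
    rows = list(rows)
    if n >= len(rows):
        return rows
    return random.Random(seed).sample(rows, n)

train = list(range(14533))     # stand-ins for the 14533 train records
test = list(range(4846))       # stand-ins for the 4846 test records
small_train = sample_rows(train, 1000)   # 1000 and 200 are hypothetical sizes
small_test = sample_rows(test, 200)

# Number of distance computations before and after sampling.
print(len(train) * len(test), len(small_train) * len(small_test))
# 70426918 200000
```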

