GitHub link: https://github.com/Fagun13/DataMinig-Final
Project page link: https://fat8348.wixsite.com/website/single-project
Project download link (in the project section of the homepage): https://fat8348.wixsite.com/website
Problem Explanation:
With the heavy use of the internet, security is becoming a major issue in today's world.
Most social media platforms are implementing security in their applications.
Email is a standard medium of communication in office culture and an important part of business.
Spam emails are one of the threats to our email systems.
Spam emails contain junk data that can affect your system or may be used to hack it.
Spam emails are a high risk to the credentials stored on your system.
The main goal of this project is to move such mail to a separate folder.
Here, I am using 4 different dataset files.
The link below contains 1 file.
The file contains 3 columns: Email content, Email ID, Classification.
The link below contains 3 CSV files of the dataset.
Each file contains the Subject and Classification.
Step 1: Load data and preprocess it.
In this step, I load the content from the CSV files and remove special characters using regex.
Next, I remove stop words using NLTK.
Next, I divide the data into train, dev, and test sets
in a 60/20/20 ratio, respectively.
Train - used for training the model.
Dev - used for various tests and to come up with optimal hyperparameters.
Here I try different models and algorithms on the dev dataset and come up with the final solution.
Test - used for final testing with the parameters found on the dev set.
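A minimal sketch of this step, under stated assumptions: the tiny STOP_WORDS set stands in for the NLTK stop-word list so the snippet runs without downloads, and the function names preprocess and split_60_20_20 are hypothetical, not taken from the project code.

```python
import random
import re

# Assumption: a tiny hand-written stop-word list so the sketch runs offline;
# the project itself uses the NLTK stop-word list.
STOP_WORDS = {"you", "are", "the", "of", "a", "is", "to", "and", "in"}

def preprocess(text):
    """Lowercase, replace special characters with spaces, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def split_60_20_20(rows, seed=0):
    """Shuffle the rows and cut them into train/dev/test at 60/20/20."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 60 // 100
    n_dev = len(rows) * 20 // 100
    return rows[:n_train], rows[n_train:n_train + n_dev], rows[n_train + n_dev:]

tokens = preprocess("Congrats!! You are the winner of 1000 amount")
print(tokens)

train, dev, test = split_60_20_20(range(10))
print(len(train), len(dev), len(test))
```

A fixed seed keeps the split reproducible, so dev results stay comparable between runs.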
Step 2: Build a classification model
Here I build a Naive Bayes (NB) classifier for text classification.
Work in this part:
Count how many spam documents contain a particular word, and add the word to a set of unique words with its count.
Count how many ham documents contain a particular word, and add the word to a set of unique words with its count.
Count how many spam sentences/documents there are and how many ham documents there are.
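The three counts above could be collected as in the following sketch. The function name train_nb and the toy documents are hypothetical; only the counting logic follows the description.

```python
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label) pairs, label is 'spam' or 'ham'.

    Returns per-class counts of documents containing each word,
    and the number of documents per class.
    """
    word_doc_counts = {"spam": Counter(), "ham": Counter()}
    doc_totals = Counter()
    for tokens, label in docs:
        doc_totals[label] += 1
        # set() so a word repeated inside one document is counted once
        word_doc_counts[label].update(set(tokens))
    return word_doc_counts, doc_totals

# Hypothetical toy data, only to show the shape of the result
toy_docs = [
    (["congrats", "winner", "amount"], "spam"),
    (["winner", "winner", "prize"], "spam"),
    (["meeting", "amount", "invoice"], "ham"),
]
word_doc_counts, doc_totals = train_nb(toy_docs)
print(word_doc_counts["spam"]["winner"])
print(doc_totals["spam"], doc_totals["ham"])
```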
Step 3: Test the built model
Here, each sentence in the test data is broken into words, and the probability of each word occurring in spam and in ham documents is found.
The probability for the sentence is found by multiplying the probabilities of its words together with the spam class probability and the ham class probability, respectively.
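A minimal sketch of this scoring step. The smoothing form (count + alpha) / (documents + 2 * alpha) is an assumption, since only the alpha values 0 and 1 are mentioned here and not the formula; the names class_score and classify and the counts are hypothetical.

```python
def class_score(tokens, word_counts, n_class_docs, n_all_docs, alpha=1.0):
    """Class probability times the product of the per-word probabilities.

    Assumption: add-alpha smoothing, (count + alpha) / (n_class_docs + 2 * alpha).
    With alpha=0 this is the plain ratio count / n_class_docs.
    """
    score = n_class_docs / n_all_docs
    for word in tokens:
        score *= (word_counts.get(word, 0) + alpha) / (n_class_docs + 2 * alpha)
    return score

def classify(tokens, counts, totals, alpha=1.0):
    """Score both classes and return the label with the higher score."""
    n_all = sum(totals.values())
    scores = {
        label: class_score(tokens, counts[label], totals[label], n_all, alpha)
        for label in totals
    }
    return max(scores, key=scores.get), scores

# Hypothetical counts: number of documents per class containing each word
counts = {"spam": {"congrats": 2, "winner": 3}, "ham": {"meeting": 4, "winner": 1}}
totals = {"spam": 4, "ham": 6}

label, scores = classify(["congrats", "winner"], counts, totals, alpha=1.0)
print(label, scores)

# With alpha=0, a word never seen in a class drives that class score to zero
_, unsmoothed = classify(["congrats", "winner"], counts, totals, alpha=0.0)
print(unsmoothed)
```

The second call shows why a non-zero alpha can matter: a single unseen word removes a class from consideration when alpha is 0.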
NB classifier Algorithm Explanation using Example:
Text: Congrats!! You are the winner of 1000 amount
Preprocessing the text
After preprocessing, the content is: congrats winner 1000 amount
1) Calculating the probability of each word being in spam
Spam email probability calculation
Probability of text being in spam = product of the probabilities of each word being in spam * spam class probability
Probability of a word being in spam = number of spam documents that contain the word / number of spam documents
Spam class probability = number of spam documents / total number of documents
p(spam class)= 3386/14533=0.23298699511456686
p(congrats/spam)=2.1240035768220234e-0
p(winner/spam)=3.1579738360469243e-09
p(1000/spam)=1.3616321878246288e-12
p(amount/spam)=5.813144390883573e-1
Final Probability of content being in spam
p(s/spam)=p(congrats/spam) * p(winner/spam) * p(1000/spam) * p(amount/spam) * p(spam class)
p(s/spam)=1.8832179849497723e-14
In the same way, find the probability for ham and compare both probabilities.
Assign whichever class has the higher probability.
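The calculation above can be written compactly as follows. The symbols are notation introduced for this summary: n_c(w) is the number of documents of class c that contain word w, N_c is the number of documents of class c, and N is the total number of documents (here N_spam = 3386 and N = 14533).

```latex
P(w \mid c) = \frac{n_c(w)}{N_c}, \qquad
P(c) = \frac{N_c}{N}, \qquad
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}} \; P(c) \prod_{i=1}^{m} P(w_i \mid c)
```

Here w_1, ..., w_m are the words left after preprocessing, for example congrats, winner, 1000, amount.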
Results of the NB classifier on the dev dataset with alpha values 0 and 1
Step 4: K-nn classifier
Algorithm Explanation
Find the distance to each document in the training dataset.
Find the class (spam or ham) of the K nearest documents.
Here the value of K is 3.
Count the spam and ham labels and assign whichever is more frequent.
An odd number is better because with an even K, for example 2 spam documents and 2 ham documents,
it is difficult to choose which class to assign.
So an odd value of K is better.
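Here I have used the concept of Jaccard distance.
Jaccard value = matching 1s / total 1s. Strictly speaking, this ratio is the Jaccard similarity (the Jaccard distance is 1 minus it), so a higher value means the two documents are closer.
Jaccard calculation:
# Jaccard = number of 1-1 matches / number of non-zero attributes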
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
intersec = words_doc1.intersection(words_doc2)
union = words_doc1.union(words_doc2)
print(len(intersec)/len(union))
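Putting the pieces together, a k-nn prediction with k=3 could look like the sketch below. The names jaccard_similarity and knn_predict and the toy training set are hypothetical; the ranking uses the intersection-over-union ratio shown above, where a higher value means a nearer document.

```python
from collections import Counter

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(query, train, k=3):
    """train: list of (word_set, label). Returns the majority label of the
    k training documents most similar to the query."""
    ranked = sorted(
        train,
        key=lambda item: jaccard_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set
train = [
    ({"congrats", "winner", "prize"}, "spam"),
    ({"winner", "amount", "claim"}, "spam"),
    ({"meeting", "agenda", "monday"}, "ham"),
    ({"invoice", "amount", "attached"}, "ham"),
]
print(knn_predict({"congrats", "winner", "amount"}, train, k=3))
```

Every prediction compares the query with all training documents, so the cost grows with train size times test size.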
The accuracy of this model is 26%.
Step 5: K-means Algorithm
Grouping algorithm
Algorithm Explanation
It groups spam and ham emails according to their distance from a center.
The distance concept is the same as above (Jaccard).
The value of k is 2 because there are 2 classes, spam and ham.
1) Randomly select 2 centers (2 centers because k=2).
2) Find the distance of each data point to both centers and assign it the class of the closer center.
3) Group again and find the centers again using the new class assigned to each data point.
4) Repeat the process until the centers stop changing.
Here I find the Jaccard value between each center and each data point in the dataset.
Because this value is matches divided by total (higher means more similar), each data point is assigned the class of the center with the higher value, and then the centers are found again.
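The loop above can be sketched as follows. How a center of word sets is recomputed is not specified here, so this sketch substitutes a medoid-style center (the cluster member most similar to the rest of its cluster); that choice, the function names, and the toy documents are all assumptions.

```python
import random

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_central(cluster):
    """Assumption: the center is the member with the highest total
    similarity to all members of its cluster (a medoid, not a mean)."""
    return max(cluster, key=lambda d: sum(jaccard_similarity(d, o) for o in cluster))

def two_means(docs, seed=0, max_iter=20):
    """Group docs (a list of frozensets of words) around k=2 centers."""
    centers = random.Random(seed).sample(docs, 2)
    clusters = ([], [])
    for _ in range(max_iter):
        clusters = ([], [])
        for doc in docs:
            sims = [jaccard_similarity(doc, c) for c in centers]
            # higher similarity means closer; ties go to the first center
            clusters[0 if sims[0] >= sims[1] else 1].append(doc)
        new_centers = [
            most_central(cl) if cl else centers[i] for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped changing
            break
        centers = new_centers
    return clusters, centers

# Hypothetical toy documents
docs = [
    frozenset({"congrats", "winner", "prize"}),
    frozenset({"winner", "prize", "claim"}),
    frozenset({"meeting", "agenda", "monday"}),
    frozenset({"meeting", "agenda", "notes"}),
]
clusters, centers = two_means(docs)
print(len(clusters[0]), len(clusters[1]))
```

The grouping depends on the random initial centers, so different seeds can give different clusters.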
The accuracy of the above K-means model is 42%.
Step 6: Find the best algorithm
As you can see in the graph, the NB classifier performs better than k-nn and k-means.
The reason is that k-nn and k-means are distance-based algorithms.
Finding the distance between texts takes time; moreover, it does not work well with missing values.
So, the performance of the NB classifier is higher.
Step 7:
Now that we know the NB classifier is the best algorithm, I want to make changes to that approach as well.
Here I am adding the email and reply columns for use.
I am storing the domain of each email in spam_domain_set and ham_domain_set.
Why am I including these 2 extra columns for classification?
I used to get fake job emails from one specific domain. I replied to them once by chance, and now I still get email from that domain in Gmail, even though I have marked that email as spam.
This is just because I replied one time by mistake; I then found out that it was fake because they were asking me to pay 9000 Rs to get a job at Infosys.
So to solve this kind of issue I came up with a new solution.
Note: I have not added an email for every record in the file.
First algorithm:
I am just using the subject, email, and classification columns from the dataset.
While training, I store the domain of each email in the spam or ham set, respectively, along with a count for each domain.
While calculating the probability, I also calculate the probability of the domain in spam and in ham, respectively.
At test time, the model finds these probabilities.
Here I calculate the probability of the email domain as well, if it is present in the data.
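A minimal sketch of the domain feature. The names spam_domain_set and ham_domain_set come from the description above; the helper functions, the smoothing form, the choice to return 1.0 when a record has no email, and the example addresses are assumptions.

```python
from collections import Counter

def email_domain(address):
    """Return the part after '@' in lower case, or None if there is no address."""
    if not address or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()

def count_domains(rows):
    """rows: (email, label) pairs. Counts how often each domain occurs per class."""
    spam_domain_set, ham_domain_set = Counter(), Counter()
    for email, label in rows:
        domain = email_domain(email)
        if domain is not None:
            target = spam_domain_set if label == "spam" else ham_domain_set
            target[domain] += 1
    return spam_domain_set, ham_domain_set

def domain_probability(domain, domain_counts, n_class_docs, alpha=1.0):
    """Assumption: add-alpha smoothing; 1.0 (no effect) when the email is missing."""
    if domain is None:
        return 1.0
    return (domain_counts.get(domain, 0) + alpha) / (n_class_docs + 2 * alpha)

# Hypothetical rows; not every record has an email
rows = [
    ("hr@fakejobs.example", "spam"),
    ("offers@FakeJobs.example", "spam"),
    ("boss@company.example", "ham"),
    (None, "ham"),
]
spam_domain_set, ham_domain_set = count_domains(rows)
print(domain_probability("fakejobs.example", spam_domain_set, n_class_docs=2))
print(domain_probability("fakejobs.example", ham_domain_set, n_class_docs=2))
```

The returned value can be multiplied into the class score in the same way as a word probability.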
Taking the reply column into account
Algorithm
This is a hybrid algorithm.
Here I add a condition after calculating the probabilities for spam and ham:
if the reply count is greater than 0, then the email is classified as ham for sure.
Reply count: how many times I have sent a reply to a particular sender.
I implemented this because I feel this is what my Gmail is doing.
To improve on this, my solution is to use the approach above:
if we get an email from a domain that is already in spam, then add the email to spam.
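The two rules can be layered on top of the NB scores as in this sketch. Which rule wins when they disagree is not stated, so checking the spam-domain rule first is an assumption, as are the function name hybrid_label and the example values.

```python
def hybrid_label(spam_score, ham_score, reply_count=0, domain=None,
                 spam_domains=frozenset()):
    """Combine the NB scores with the reply rule and the spam-domain rule.

    Assumption: the spam-domain rule is checked before the reply rule.
    """
    if domain is not None and domain in spam_domains:
        return "spam"   # sender domain is already known as spam
    if reply_count > 0:
        return "ham"    # I have replied to this sender before
    return "spam" if spam_score > ham_score else "ham"

known_spam_domains = {"fakejobs.example"}  # hypothetical
print(hybrid_label(0.9, 0.1, reply_count=1))
print(hybrid_label(0.1, 0.9, reply_count=1, domain="fakejobs.example",
                   spam_domains=known_spam_domains))
print(hybrid_label(0.9, 0.1))
```

With this order, one accidental reply to a known spam domain does not turn later mail from that domain into ham.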
Changes I have made.
Final accuracy of the model: 60.17937219730942%
Contribution:
1) NB classifier with the old approach and alpha value - built manually from the theory
2) K-nn classifier with Jaccard - built manually from the theory
3) K-means - grouping spam and ham emails - built manually from the theory
4) NB classifier with subject, email, and reply fields - built manually from the theory
5) Identified the best algorithm among NB, k-means, and k-nn
Challenges:
The main challenge in this project was finding distances for a large amount of data.
Train data size: 14533
Test data size: 4846
For each test record, the distance to each train record has to be found.
Finding distances for 14533 * 4846 pairs (about 70 million) was very difficult.
My computer ran the code for this task for around 8 hours (maybe an issue with my memory).
So I drastically reduced the size of the data and used a small amount of data for train and test in the k-nn part.
That is, I used the concept of sampling.
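A minimal sketch of that sampling step. The sample sizes 1000 and 200 and the function name sample_rows are hypothetical; only the train and test sizes come from the figures above.

```python
import random

def sample_rows(rows, n, seed=0):
    """Return n rows drawn without replacement (all rows if n >= len(rows))."""
    rows = list(rows)
    if n >= len(rows):
        return rows
    return random.Random(seed).sample(rows, n)

train_size, test_size = 14533, 4846      # sizes reported above
print(train_size * test_size)            # pairwise distances at full size

# Hypothetical reduced sizes for the k-nn experiment
small_train = sample_rows(range(train_size), 1000)
small_test = sample_rows(range(test_size), 200)
print(len(small_train) * len(small_test))  # pairwise distances after sampling
```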