Problem Explanation
The task here is to find the emotion, or sentiment, of a sentence: the machine tries to predict the sentiment the user wants to convey through that sentence. I am using data from Kaggle, where the data comes in 3 forms. I am using imbd_labeled.txt, which has 1000 rows, each with an associated sentiment. There are two sentiment classes: positive and negative.
To solve this I am going to use a Naive Bayes classifier, which is built on probability theory. Its main assumption is that the features (here, the words of a sentence) are conditionally independent of each other given the class.
The root of this classifier is Bayes' theorem, stated below: P(A|B) = P(B|A) P(A) / P(B).
The main advantage of this classification method is that it gives a probability for each class. Moreover, it can also handle null or missing values by using smoothing.
Many classification algorithms rely on a notion of distance, and in those cases handling missing values is difficult. In a text classification task there can be many missing values, so the NB classifier is a good choice for text classification.
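To make Bayes' theorem concrete, here is a small worked example. The numbers and the function name bayes_posterior are made up for illustration and do not come from the dataset; the example only shows how the formula combines a class prior with word likelihoods.

```python
from fractions import Fraction

def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: half of the sentences are positive, and a given word
# appears in 30% of positive sentences and 10% of negative sentences.
p_pos = Fraction(1, 2)
p_word_given_pos = Fraction(3, 10)
p_word_given_neg = Fraction(1, 10)

# Total probability of seeing the word in any sentence.
p_word = p_word_given_pos * p_pos + p_word_given_neg * (1 - p_pos)

# Probability that a sentence is positive given that it contains the word.
posterior = bayes_posterior(p_word_given_pos, p_pos, p_word)
print(posterior)
```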
Solution
First, I preprocess the sentences.
This step reads the file line by line. The sentiment of each sentence is stored at the last index of the line, so I first separate the sentiment from the sentence and then preprocess the sentence: I remove special characters and stop words and convert the whole sentence to lower case. I created an array of special characters and remove each one found in a sentence. The list of stop words comes from nltk, and any stop word found in a sentence is removed.
I used the nltk library both for removing stop words and for tokenizing sentences into arrays of words.
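The preprocessing steps above could be sketched roughly as follows. This is an illustration, not the original code: the function names parse_line and preprocess are hypothetical, the small STOP_WORDS set stands in for the nltk stop word list so the example runs without downloads, whitespace splitting stands in for the nltk tokenizer, and the label is assumed to be the last character of each line.

```python
import string

# Assumption: a tiny stand-in for the nltk English stop word list.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "of", "and", "to", "i"}

def parse_line(line):
    """Split a raw line into (sentence, label); the label is assumed to be the last character."""
    line = line.strip()
    return line[:-1].strip(), int(line[-1])

def preprocess(sentence):
    """Lowercase, replace special characters with spaces, tokenize, drop stop words."""
    lowered = sentence.lower()
    cleaned = "".join(" " if ch in string.punctuation else ch for ch in lowered)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

text, label = parse_line("This movie was GREAT, a real gem!\t1\n")
tokens = preprocess(text)
print(tokens, label)
```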
The second task is to split the data into train, dev, and test sets.
The split ratio for train, dev, and test is 60:20:20.
The training dataset is for building the model.
The dev dataset is for trying out various hyperparameter values and selecting the optimal ones for testing.
When testing on the test dataset, we use the optimal parameters found on the dev dataset.
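A 60:20:20 split like the one described could look like the sketch below. The function name split_dataset, the shuffling step, and the fixed seed are my assumptions for illustration; the text above does not say whether the rows are shuffled before splitting.

```python
import random

def split_dataset(rows, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle the rows and cut them into train, dev, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is reproducible
    n_train = round(len(rows) * train_frac)
    n_dev = round(len(rows) * dev_frac)
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]      # whatever is left goes to test
    return train, dev, test

# 1000 row indices stand in for the 1000 labeled sentences.
train, dev, test = split_dataset(range(1000))
print(len(train), len(dev), len(test))
```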
The third task is to build the model from the training dataset.
I have created a function for training the model. It can be reused by simply passing in whichever dataset the model should be built from.
The function takes the dataset as input and returns the count of each word in all documents, in positive documents, and in negative documents.
The task here is just to build a vocabulary, or dictionary, of words with their counts; this is the model.
After that, we use these counts to find the probability of each sentence being positive and being negative. Whichever is higher is assigned to the sentence.
First, tokenize each sentence and add each word to a set. Also count the number of positive sentences and negative sentences among all sentences.
trListAll = list of all words in all documents, with repetition
trListPos = list of all words in all positive sentences (documents), with repetition
trListNeg = list of all words in all negative sentences (documents), with repetition
trSetAll = set of all words in all documents, without repetition, i.e. all unique words
trSetPos = set of all words in positive documents (sentences), without repetition, i.e. all unique positive words
trSetNeg = set of all words in negative documents (sentences), without repetition, i.e. all unique negative words
posDocCount = total number of positive sentences in the list of sentences
negDocCount = total number of negative sentences in the list of sentences
trSetAllCount = holds every unique word in the current dataset together with the number of documents containing that word
trSetPosCount = holds every unique word in the positive sentences of the current dataset together with the number of positive documents containing that word
trSetNegCount = holds every unique word in the negative sentences of the current dataset together with the number of negative documents containing that word
These will be used in the next steps to calculate the probabilities.
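The counting step could be sketched as below. The names posDocCount, negDocCount, trSetAllCount, trSetPosCount, and trSetNegCount follow the list above; the function name train_model, the use of a dictionary of Counter objects, and the label encoding (1 = positive, 0 = negative) are assumptions made for this illustration.

```python
from collections import Counter

def train_model(dataset):
    """Count, for every word, how many documents of each kind contain it.

    dataset: list of (tokens, label) pairs, with label 1 = positive, 0 = negative.
    """
    model = {
        "posDocCount": 0,
        "negDocCount": 0,
        "trSetAllCount": Counter(),
        "trSetPosCount": Counter(),
        "trSetNegCount": Counter(),
    }
    for tokens, label in dataset:
        unique = set(tokens)  # a word counts once per document, however often it repeats
        model["trSetAllCount"].update(unique)
        if label == 1:
            model["posDocCount"] += 1
            model["trSetPosCount"].update(unique)
        else:
            model["negDocCount"] += 1
            model["trSetNegCount"].update(unique)
    return model

toy = [(["good", "movie"], 1), (["good", "good", "plot"], 1), (["bad", "movie"], 0)]
model = train_model(toy)
print(model["trSetPosCount"]["good"], model["trSetAllCount"]["movie"])
```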
The fourth task is to define testing. I have created a testing function, which can be reused whenever one wants to measure accuracy on a new dataset.
It breaks each sentence of the test dataset into words. For each word, it checks whether the word is present in the positive set of words. If it is, it takes the number of positive sentences containing that word and divides it by the number of positive sentences in the dataset, which gives p(word|pos). If the word is not found in the positive set of words, the probability is zero. The same is done for the negative class. The word probabilities are then multiplied by the probability of the positive and the negative class respectively. Next, the two results are compared and the class is assigned accordingly: if the positive probability is greater than the negative probability, the sentence is assigned the positive class, otherwise the negative class. This assignment is then checked against the actual sentiment class, and if they match, the count of accurate predictions is incremented. Finally, the accuracy percentage is the number of accurate predictions multiplied by 100 and divided by the total number of sentences.
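The testing procedure above could be sketched like this. The function names predict and accuracy and the layout of the model dictionary are hypothetical; the toy counts are made up. The sketch assumes both classes have at least one training sentence, otherwise the divisions would fail.

```python
def predict(tokens, model):
    """Return 1 (positive) or 0 (negative) for a tokenized sentence, without smoothing."""
    total = model["posDocCount"] + model["negDocCount"]
    pos_score = model["posDocCount"] / total  # class prior P(pos)
    neg_score = model["negDocCount"] / total  # class prior P(neg)
    for word in tokens:
        # Fraction of class documents containing the word; zero if never seen in that class.
        pos_score *= model["trSetPosCount"].get(word, 0) / model["posDocCount"]
        neg_score *= model["trSetNegCount"].get(word, 0) / model["negDocCount"]
    # Positive only if strictly greater, so ties fall to negative.
    return 1 if pos_score > neg_score else 0

def accuracy(dataset, model):
    """Percentage of sentences whose predicted class matches the actual class."""
    correct = sum(1 for tokens, label in dataset if predict(tokens, model) == label)
    return 100.0 * correct / len(dataset)

# Hypothetical counts for illustration.
model = {
    "posDocCount": 2, "negDocCount": 2,
    "trSetPosCount": {"good": 2, "movie": 1},
    "trSetNegCount": {"bad": 2, "movie": 1},
}
data = [(["good", "movie"], 1), (["bad"], 0), (["unseen"], 1)]
print(accuracy(data, model))
```

Note that in this sketch a single word that was never seen in training drives both scores to zero, and the tie is then resolved as negative.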
After this, I measured the accuracy of the model built from the training dataset, on both the training dataset and the test dataset.
Accuracy on the training dataset was 98% and accuracy on the test dataset was 49%.
The fifth task is to divide the dev dataset for 5-fold validation and perform training and testing on the folds. Here 5-fold means we divide the dev dataset into 5 parts and use one part for testing and the other 4 parts for training. This process is repeated until every part has been used as test data. So we perform this action 5 times, each time with a different train and test dataset, and we train and test each time.
With 5-fold validation the dev dataset accuracy was 40%, and without 5-fold validation it was 43%.
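One possible way to produce the 5 train/test pairs is sketched below. The function name k_fold_splits and the choice to build folds by taking every fifth row are assumptions for illustration; contiguous slices would work as well.

```python
def k_fold_splits(rows, k=5):
    """Yield (training, held_out) pairs so that each fold is held out exactly once."""
    rows = list(rows)
    folds = [rows[i::k] for i in range(k)]  # fold i takes rows i, i + k, i + 2k, ...
    for i in range(k):
        held_out = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, held_out

# 200 row indices stand in for the dev dataset (20% of 1000 rows).
splits = list(k_fold_splits(range(200), k=5))
for training, held_out in splits:
    print(len(training), len(held_out))
```

In each round, the model would be trained on training and its accuracy measured on held_out; averaging the 5 accuracies gives one number for the dev dataset.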
The sixth task is to apply Laplace smoothing to the data. Why smoothing? We use it to solve the problem of missing values. For example, if a word appears in the test dataset that is not in the training dataset, the normal process gives it zero probability. To avoid this zero probability, or missing value, we use Laplace smoothing. In Laplace smoothing, we add alpha to the numerator and add alpha multiplied by the number of classes to the denominator. As a result we no longer get a zero probability.
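For this task, I have created a new method to find the probability of each word belonging to a particular sentiment (positive/negative). The formula for Laplace smoothing is
P(word|class) = (count(word, class) + alpha) / (N_class + alpha * d)
Here count(word, class) is the number of sentences of the class that contain the word, N_class is the number of sentences of the class, and d is the number of classes for classification. For this example the value of d is 2, because there are two sentiments to classify: positive and negative.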
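The smoothed probability described above (alpha added to the numerator, alpha times d added to the denominator) could be written as a small function. The name smoothed_probability and its parameter names are hypothetical, and the counts are made up.

```python
def smoothed_probability(word_doc_count, class_doc_count, alpha=1.0, d=2):
    """Laplace-smoothed P(word | class).

    word_doc_count: number of class documents containing the word
    class_doc_count: number of documents in the class
    alpha: smoothing strength, d: number of classes
    """
    return (word_doc_count + alpha) / (class_doc_count + alpha * d)

# A word never seen in 300 positive sentences no longer gets probability zero.
unseen = smoothed_probability(0, 300, alpha=1)
# A word seen in 3 of 8 sentences: (3 + 1) / (8 + 1 * 2).
seen = smoothed_probability(3, 8, alpha=1)
print(unseen, seen)
```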
The value of alpha is passed to the function so that its effect on the result can be checked.
Using this function again, we measure accuracy on the dev dataset for different values of alpha.
As alpha's value increases, accuracy also increases.
So the optimal value of alpha is 100 (the hyperparameter we choose from the dev dataset).
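Choosing alpha on the dev dataset could be sketched as a simple sweep. The function names, the model layout, and the toy counts below are all hypothetical, and the toy numbers are not meant to reproduce the accuracies reported in this post.

```python
def predict_smoothed(tokens, model, alpha, d=2):
    """Classify a tokenized sentence using Laplace-smoothed word probabilities."""
    total = model["posDocCount"] + model["negDocCount"]
    pos = model["posDocCount"] / total
    neg = model["negDocCount"] / total
    for word in tokens:
        pos *= (model["trSetPosCount"].get(word, 0) + alpha) / (model["posDocCount"] + alpha * d)
        neg *= (model["trSetNegCount"].get(word, 0) + alpha) / (model["negDocCount"] + alpha * d)
    return 1 if pos > neg else 0

def dev_accuracy(dev, model, alpha):
    """Accuracy in percent on the dev data for one value of alpha."""
    correct = sum(1 for tokens, label in dev if predict_smoothed(tokens, model, alpha) == label)
    return 100.0 * correct / len(dev)

def pick_alpha(dev, model, alphas):
    """Return the alpha with the highest dev accuracy, plus all scores."""
    scores = {alpha: dev_accuracy(dev, model, alpha) for alpha in alphas}
    best = max(scores, key=scores.get)  # on ties, the first alpha tried wins
    return best, scores

model = {
    "posDocCount": 3, "negDocCount": 3,
    "trSetPosCount": {"good": 3, "fun": 2},
    "trSetNegCount": {"bad": 3, "dull": 2},
}
dev = [(["good", "unseen"], 1), (["bad", "unseen"], 0), (["fun"], 1)]
best, scores = pick_alpha(dev, model, [0, 1, 10])
print(best, scores)
```

With alpha = 0 the sketch behaves like the unsmoothed classifier, so the sentence containing an unseen word is misclassified; any positive alpha fixes that in this toy case.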
The final task is to apply the same procedure to the test dataset.
Here I applied Laplace smoothing with an alpha value of 100 and found the final accuracy of the model.
Last but not least,
find the 10 positive and the 10 negative words with the highest probability.
P(pos|word) = P(pos and word) / P(word), and the same way for negative.
From this, we take the 10 highest probabilities for each emotion.
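Ranking words by P(class|word) could be sketched as follows. The function name top_words, the alphabetical tie-breaking, and the toy counts are assumptions for illustration; probabilities are estimated from document counts as in the earlier steps.

```python
def top_words(class_counts, all_counts, total_docs, n=10):
    """Return the n words with the highest P(class | word) = P(class and word) / P(word)."""
    scores = {}
    for word, count in all_counts.items():
        p_word = count / total_docs                                # P(word)
        p_class_and_word = class_counts.get(word, 0) / total_docs  # P(class and word)
        scores[word] = p_class_and_word / p_word
    # Highest probability first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical document counts over 8 sentences.
all_counts = {"good": 4, "movie": 4, "bad": 3}
pos_counts = {"good": 4, "movie": 2}
neg_counts = {"movie": 2, "bad": 3}

top_pos = top_words(pos_counts, all_counts, total_docs=8, n=2)
top_neg = top_words(neg_counts, all_counts, total_docs=8, n=2)
print(top_pos, top_neg)
```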