Sentiment Analysis On IMDB Dataset

Problem statement:

The main objective of this project is to predict the sentiment of movie reviews obtained from the Internet Movie Database (IMDb). The dataset contains 50,000 movie reviews that have been pre-labeled with “positive” and “negative” sentiment class labels based on the review content.

The dataset can be obtained from here, courtesy of Stanford University and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts.
The data is provided both as raw text and in a pre-processed bag-of-words format. We will use only the raw labeled movie reviews for our analysis. Our task is therefore to predict the sentiment of 15,000 labeled movie reviews, using the remaining 35,000 reviews to train our supervised models.

What is Sentiment analysis?

Sentiment analysis is the contextual mining of text to identify and extract subjective information from source material. It helps a business understand the social sentiment around its brand, product, or service while monitoring online conversations. Sentiment analysis helps data analysts within large enterprises gauge public opinion, conduct nuanced market research, monitor brand and product reputation, and understand customer experiences. For instance, sentiment analysis may be performed on Twitter to determine the overall opinion on a particular trending topic. Companies and brands often use sentiment analysis to monitor brand reputation across social media platforms or across the web as a whole.

What we will do:

We will cover two kinds of techniques for analysing sentiment:
  • Unsupervised lexicon-based models 
  • Traditional supervised Machine Learning models

Step 1: Loading Our Data

Fortunately, we have a CSV file containing all the data, which makes loading easy. Using pandas' read_csv() function, we can load the dataset in a single line.
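A minimal sketch of the loading step. The column names `review` and `sentiment` and the file name in the comment are assumptions, not taken from the original notebook. An in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real file so the sketch runs anywhere. With the actual
# dataset you would pass its path instead, e.g. pd.read_csv("imdb_reviews.csv")
# (hypothetical file name).
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and wooden acting.",negative\n'
)

df = pd.read_csv(sample_csv)  # the one-line load described above

print(df.shape)                        # (rows, columns)
print(df["sentiment"].value_counts())  # quick class-balance check
```

Checking the shape and the label counts right after loading is a cheap way to confirm that the file was parsed as expected.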

Step 2: Text Preprocessing

As we know, machine learning works with numerical data. Whenever we have textual data, we therefore need to apply several pre-processing steps that clean the text and prepare it to be transformed into numerical features, which are then fed to the machine learning algorithms. The right pre-processing steps depend mainly on the domain and the problem itself, so we don’t need to apply every step to every problem. Here we will use a few of them.

1) Tokenization

Tokenization separates a piece of text into smaller units called tokens. For example, given the text "Hello, World!", the tokenized output is a list of the words and punctuation marks in that text: [ 'Hello', ',', 'World', '!' ].
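A minimal regex-based sketch of tokenization using only the standard library. It is an illustration, not necessarily the tokenizer used in the project. NLTK's `word_tokenize` is a common alternative that handles more edge cases.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Hello, World!")
print(tokens)  # ['Hello', ',', 'World', '!']
```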

2) Removing HTML tags

Data collected from the web, as ours is, often contains HTML tags. These tags carry no information for our sentiment analysis task, so it is better to remove them.
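A small sketch of tag removal with a regular expression. The sample review text is made up. For messy or deeply nested markup, a real HTML parser is usually more robust than a regex.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags and decode entities such as &amp;."""
    no_tags = TAG_RE.sub(" ", text)   # replace each tag with a space
    decoded = html.unescape(no_tags)  # &amp; -> &, &quot; -> "
    return re.sub(r"\s+", " ", decoded).strip()  # collapse leftover whitespace


cleaned = strip_html("Great movie!<br /><br />Loved the ending.")
print(cleaned)  # Great movie! Loved the ending.
```

Replacing tags with a space rather than an empty string keeps words on either side of a `<br />` from being glued together.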

3) Removing accented characters

We need all of the text to be in a single standardized format such as ASCII, so accented characters are converted to their closest unaccented equivalents. A simple example would be converting é to e.
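One common way to do this with the standard library is Unicode decomposition followed by an ASCII encode. This is a sketch of that approach, not necessarily the exact code used in the project.

```python
import unicodedata


def remove_accents(text):
    """Map accented characters to their closest ASCII equivalents."""
    # NFKD splits 'é' into 'e' plus a combining accent mark;
    # encoding to ASCII with errors="ignore" then drops the mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(remove_accents("café résumé"))  # cafe resume
```

Characters that have no ASCII decomposition (for example ø) are dropped entirely by this approach, which is usually acceptable for English reviews.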

4) Expanding contractions

In the English language, contractions are shortened versions of words or phrases, created by removing specific letters and sounds. We expand them back to their full forms, for example can’t to can not.
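A minimal dictionary-based sketch. The mapping below is a tiny illustrative assumption. A realistic mapping has well over a hundred entries.

```python
import re

# Tiny illustrative mapping (an assumption), not a complete list.
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes first
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I can\u2019t say it's bad"))  # I can not say it is bad
```

Note that the replacement is always lowercase in this sketch, so "Can't" becomes "can not". That is harmless if the text is lowercased later anyway.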

5) Removing Special characters

We will remove special characters such as '@', '!', and '&'. A regular expression is a convenient way to do this.
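A short regex sketch of this step. Whether to keep digits is a judgment call, so it is exposed here as a hypothetical `keep_digits` parameter.

```python
import re


def remove_special_characters(text, keep_digits=True):
    """Drop every character that is not a letter, an optional digit, or whitespace."""
    pattern = r"[^a-zA-Z0-9\s]" if keep_digits else r"[^a-zA-Z\s]"
    stripped = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", stripped).strip()  # tidy up doubled spaces


print(remove_special_characters("Great movie @home & cinema!"))
# Great movie home cinema
```

This step should run after contractions are expanded, since it would otherwise turn "can't" into "cant".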

6) Removing stopwords

Stopwords are common English words that do not add much meaning to a sentence. They can usually be ignored without sacrificing the meaning of the sentence. Examples are the, he, and have.
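A minimal sketch of stopword filtering. The stopword set below is a small illustrative assumption. In practice a longer list, such as the English list shipped with NLTK, would be used.

```python
# Small illustrative stopword set (an assumption), not a complete list.
STOPWORDS = {"the", "he", "have", "a", "an", "is", "was", "and", "of", "to", "in"}


def remove_stopwords(tokens):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in STOPWORDS]


print(remove_stopwords(["He", "was", "in", "the", "best", "movie"]))
# ['best', 'movie']
```

For sentiment analysis it is worth checking whether the chosen list contains negations such as "not" and "no", because removing them can flip the meaning of a review.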

7) Stemming

Stemming is the process of reducing a word to its stem. The stem is the part of the word to which affixes (-ed, -ize, -s, de-, etc.) are added, so a stemmer works by stripping those affixes, in practice mostly suffixes. Because the rules are purely mechanical, the result is not always an actual word. For example, NLTK's Porter stemmer reduces "removed" to "remov".
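A short sketch using NLTK's `PorterStemmer`. The choice of the Porter algorithm is an assumption, as the post does not say which stemmer was used. It needs the `nltk` package but no corpus download.

```python
from nltk.stem import PorterStemmer  # rule-based, no corpus data required

stemmer = PorterStemmer()

for word in ["removed", "running"]:
    print(word, "->", stemmer.stem(word))
# removed -> remov
# running -> run
```

The first result shows why stems are not guaranteed to be dictionary words.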

8) Lemmatization

Like stemming, lemmatization converts a word to its root form. The difference is that lemmatization ensures the root is a meaningful, valid word (the lemma). In NLTK, we use the WordNetLemmatizer to get the lemmas of words. The lemmatizer also needs context to pick the right lemma, so we pass the part of speech as a parameter.

*Full Code for pre-processing is available here.

Step 3: Training a Model 

1st Approach: Sentiment Analysis using Unsupervised Lexicon-Based Models

Using Afinn

AFINN is a lexicon of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labeled by Finn Årup Nielsen in 2009-2011. To try afinn, click here.
Here is a small example of how to use afinn for sentiment analysis:
>>> from afinn import Afinn
>>> afinn = Afinn()
>>> afinn.score('This was an excellent trick')
3.0

As the example above shows, we pass some text to the afinn object and it returns a sentiment score for that text. We pass every review in our dataset through it, one by one, and store the returned scores. Then we compare those scores with the actual sentiments in the dataset to compute the accuracy.
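The score-then-compare loop described above can be sketched as follows. The decision threshold of 0 and the toy scorer are assumptions made so the snippet runs without the afinn package. In practice `score_fn` would be `Afinn().score`.

```python
def predict_label(score, threshold=0.0):
    """Map a lexicon score to a label (the threshold of 0 is an assumption)."""
    return "positive" if score >= threshold else "negative"


def accuracy(score_fn, reviews, labels):
    """Score each review, convert the score to a label, compare with the truth."""
    predictions = [predict_label(score_fn(review)) for review in reviews]
    correct = sum(pred == true for pred, true in zip(predictions, labels))
    return correct / len(labels)


# Toy stand-in lexicon and scorer, for illustration only.
TOY_LEXICON = {"excellent": 3, "good": 3, "boring": -3, "awful": -3}


def toy_score(text):
    return float(sum(TOY_LEXICON.get(word, 0) for word in text.lower().split()))


reviews = ["An excellent film", "Boring and awful", "Not good at all"]
labels = ["positive", "negative", "negative"]
print(accuracy(toy_score, reviews, labels))  # 2 of the 3 toy reviews are correct
```

The third toy review is misclassified because a word-level lexicon sees "good" and ignores the preceding "not", a typical weakness of this family of methods.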
Here is the result:


Sentiment Analysis Using SentiWordNet

SentiWordNet is another lexical resource for opinion mining. It assigns three sentiment scores to each synset of WordNet: positivity, negativity, and objectivity.
Example:
>>> import nltk
>>> from nltk.corpus import sentiwordnet as swn
>>> list(swn.senti_synsets('Good'))[0].pos_score()
0.5
>>> list(swn.senti_synsets('Good'))[0].neg_score()
0.0

In this case we get two sentiment scores rather than one, so we can tell both how positive and how negative a word sense is. We access these scores with the pos_score() and neg_score() methods.
We apply this to all of the reviews in the IMDB dataset and compare the predictions with the actual sentiments. Here are the results:


Sentiment Analysis Using Vader

Vader is yet another lexicon-based sentiment analyzer. Read more here.
Here is a simple example of how to use Vader:
>>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
>>> analyzer = SentimentIntensityAnalyzer()
>>> scores = analyzer.polarity_scores("This movie was not that bad")
>>> print(scores)
{'neg': 0.0, 'neu': 0.637, 'pos': 0.363, 'compound': 0.431}

We do this for all of the reviews in our dataset and keep the scores, then compare the resulting predictions with the actual sentiment labels to measure performance.


Conclusion from Lexicon-Based Models

Looking at the performance, the clear winner is SentiWordNet. It has the fewest false negatives, and its recall is better than that of the other two.
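For readers unfamiliar with the two metrics mentioned here, this sketch shows how false negatives and recall are computed from predictions. The label lists are made-up toy values, not results from this project.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn


def recall(y_true, y_pred, positive="positive"):
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels for illustration only.
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "positive"]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 1)
print(recall(y_true, y_pred))            # 0.6666666666666666
```

A false negative here is a positive review predicted as negative, so fewer false negatives directly means higher recall for the positive class.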

2nd Approach: Sentiment Analysis using Supervised models

Supervised models are more advanced techniques that use machine learning to learn sentiment from labeled sentences. Given a new sentence, the model predicts its sentiment based on the data it was trained on. I will not explain every algorithm here, since that would take far too long. Instead, we will discuss and compare the performance of various models trained on our dataset.

Logistic Regression with Bag of Words representation

First, we do some feature engineering by creating a bag-of-words representation of our dataset. Machine learning models are nothing but numerical functions that return an output for a given input. We cannot pass raw text directly to such a model, which is why we convert it into a numerical vector, in this case a bag-of-words vector.
Then we apply logistic regression to these vectors. The labels are not converted into bag-of-words vectors, but they do need to be encoded numerically, for example with a label encoder or a one-hot encoder. After training the logistic regression model, we predict the sentiments on the test set and compare the results.
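A minimal scikit-learn sketch of this pipeline. The toy reviews, the variable names, and the `max_iter` setting are assumptions for illustration. The original notebook may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the preprocessed IMDB reviews.
train_reviews = [
    "great film loved it",
    "wonderful acting great story",
    "terrible plot boring film",
    "awful boring waste",
]
train_sentiments = ["positive", "positive", "negative", "negative"]

# Encode string labels as integers: negative -> 0, positive -> 1.
encoder = LabelEncoder()
y_train = encoder.fit_transform(train_sentiments)

# CountVectorizer builds the bag-of-words matrix; LogisticRegression fits on it.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_reviews, y_train)

predicted = model.predict(["boring story", "loved the acting"])
print(encoder.inverse_transform(predicted))  # back to string labels
```

Wrapping both steps in one pipeline ensures the vocabulary is learned from the training reviews only and then reused unchanged on the test reviews.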


Logistic Regression with TF-IDF vector representation

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
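A worked sketch of the weighting on a three-document toy corpus, using the plain textbook form tf × log(N / df). Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ from this sketch.

```python
import math
from collections import Counter

# Toy corpus of three tokenized documents (illustration only).
docs = [
    "good movie good cast".split(),
    "bad movie".split(),
    "good plot".split(),
]


def tf_idf(term, doc, corpus):
    """Raw term count in doc times log(N / number of docs containing term)."""
    tf = Counter(doc)[term]
    df = sum(term in d for d in corpus)
    return tf * math.log(len(corpus) / df) if df else 0.0


print(tf_idf("good", docs[0], docs))   # 2 * ln(3/2): frequent here, but common
print(tf_idf("cast", docs[0], docs))   # 1 * ln(3/1): rare across the corpus
print(tf_idf("movie", docs[0], docs))  # 1 * ln(3/2)
```

Although "cast" appears only once in the first document, it outweighs "good", which appears twice, because "cast" occurs in no other document.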


Random Forest with Bag of Words representation

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean/average prediction of the individual trees.
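The aggregation step for classification, taking the mode of the trees' predictions, can be sketched in a few lines. The votes below are hypothetical. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for a single review.
votes = ["positive", "negative", "positive", "positive", "negative"]
print(majority_vote(votes))  # positive
```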


Random Forest with TF-IDF vector representation

Here are the results of random forest model with TF-IDF representation of movie reviews:

* Full notebook containing the code is available here.

Final Conclusion:

Sentiment analysis of customer reviews can help inform business decisions. We discussed different sentiment analysis approaches, both lexicon-based and supervised. In comparison, the supervised algorithms tend to perform better, which is to be expected because they learn from labeled examples.
Logistic regression is the overall winner here. We are doing binary classification, predicting whether the sentiment is positive or negative, and logistic regression works very well for such tasks. It is also faster to train.

References:

Check out Dereck's blog on sentiment analysis. 
