Using Data to Analyze Movie Reviews - NLP
NLP (Natural Language Processing) is a subfield of computer science that deals with how computers are programmed to analyze human language. The goal of this project is sentiment analysis of movie reviews, based on the Kaggle competition titled Bag of Words Meets Bags of Popcorn. The models are trained to distinguish positive reviews from negative ones.
The data was sourced from Kaggle. It consists of 50,000 movie reviews: the 25,000 reviews in the train data have a labeled 'sentiment', and the remaining 25,000 reviews are in the test data, which contains only the 'id' and the 'review' for each review.
Data preprocessing
Data Preprocessing from unstructured text to bag-of-words:
The first step in data preprocessing is to clean the reviews, since they contain HTML tags and symbols that are not useful for prediction (noise). The text was extracted from each review with BeautifulSoup and regular expressions.
Specifically, BeautifulSoup was used to remove HTML tags, and a regular expression was applied to remove punctuation marks. The resulting text was tokenized, i.e. converted to lowercase and split into words, and stop words were removed using the Python library NLTK (Natural Language Toolkit).
Stop words are removed because they make the data unnecessarily bulky and contribute to the curse of dimensionality while carrying little statistical weight in the prediction. Just by visually comparing image1 and image2 above, it is clear that there is a significant reduction in the content of the review, and the removed data are all noise.
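The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration using only the standard library: the small `_TextExtractor` class stands in for BeautifulSoup, and the tiny `STOP_WORDS` set is a hypothetical stand-in for NLTK's English stop word list, so the names and the word list are assumptions and not the code used in the project.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word set; the project used NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "in", "i"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_review(raw_review, stop_words=STOP_WORDS):
    # 1. Drop HTML tags, keeping the text between them.
    extractor = _TextExtractor()
    extractor.feed(raw_review)
    extractor.close()
    text = " ".join(extractor.parts)
    # 2. Replace everything that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and split into tokens.
    tokens = letters_only.lower().split()
    # 4. Remove stop words and join the remaining tokens back into one string.
    return " ".join(t for t in tokens if t not in stop_words)


print(clean_review("This movie was <br /><b>great</b>, and I loved it!"))
```

Returning a single space-joined string keeps the output convenient for a vectorizer that expects one document per string.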
Using raw text like this as input to a classification algorithm, or any other algorithm, requires converting the text to feature vectors (word embedding). There are several ways to do this, such as Word2Vec and Doc2Vec, which are prediction-based approaches. I applied one of the oldest conversion methods, bag-of-words, which is a frequency-based approach.
Using scikit-learn's CountVectorizer, the reviews were converted to numeric representations called vectors, and then to arrays for easier handling.
The resulting bag-of-words has 5000 features, a limit I set through the max_features argument of CountVectorizer.
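A minimal sketch of this vectorization step, assuming the cleaned reviews are available as a list of strings. The three sample reviews and the variable names are hypothetical; only `CountVectorizer` and the 5000-feature limit come from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned reviews standing in for the 25,000 training reviews.
clean_reviews = [
    "movie great loved",
    "movie terrible boring",
    "great acting great story",
]

# max_features keeps only the most frequent terms; the project used 5000.
vectorizer = CountVectorizer(analyzer="word", max_features=5000)

# fit_transform learns the vocabulary and returns a sparse count matrix;
# toarray() converts it to a dense NumPy array.
features = vectorizer.fit_transform(clean_reviews).toarray()

# vocabulary_ maps each term to its column index in the feature matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(features)
```

Each row is one review and each column is the count of one vocabulary term. With only seven distinct words in this toy sample, the 5000 limit never takes effect here.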
Modeling:
1 - Random Forest
2 - TensorFlow Binary Classification
Random Forest
Having converted the review text to vectors, the next step was modeling with scikit-learn's random forest classifier. First I tuned the hyperparameters, except for n_estimators:
Using the grid search result, I trained a random forest classifier with 100 trees (n_estimators = 100).
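The tuning step could look roughly like the sketch below. The parameter grid, the toy data, and the choice of 3-fold cross-validation are all assumptions for illustration, since the parameters I actually searched are not listed here; only the fixed `n_estimators=100` comes from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the bag-of-words features and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 20))
y = (X[:, 0] + X[:, 1] > 2).astype(int)

# Hypothetical grid; the parameters searched in the project are not listed.
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# n_estimators is held at 100 rather than searched (an assumption about the setup).
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
# best_estimator_ is refit on all of X, y with the best parameter combination.
model = search.best_estimator_
```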
TensorFlow
My actual target was to train a TensorFlow model to identify positive and negative reviews. To achieve this, I set the hyperparameters as follows:
The next step was creating placeholders for my dependent and independent variables, defining the model parameters, i.e. weight and bias, and applying the sigmoid function for a binary classification model.
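To make the model definition concrete, here is a minimal NumPy sketch of the same computation: a weight matrix, a bias, and a sigmoid applied to the linear output. It is not the TensorFlow graph itself. The zero initialisation and the random input batch are assumptions, and `predict_proba` is a hypothetical helper name.

```python
import numpy as np

n_features, n_classes = 5000, 2  # 5000 bag-of-words features, two sentiment classes


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Model parameters, initialised to zero (an assumption for this sketch).
W = np.zeros((n_features, n_classes))
b = np.zeros(n_classes)


def predict_proba(X, W, b):
    # Linear combination of the inputs followed by an element-wise sigmoid.
    return sigmoid(X @ W + b)


# Hypothetical batch of 4 reviews with random word counts.
X_batch = np.random.default_rng(0).integers(0, 3, size=(4, n_features)).astype(float)
probs = predict_proba(X_batch, W, b)
print(probs.shape)
```

With all parameters at zero, every output is 0.5, which is the starting point before any training step.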
Upon opening the training session and initializing the variables, the first issue I ran into was the shape of my Y-placeholder. TensorFlow could not train the model because the shape of the y_train fed into feed_dict differed from that of the placeholder, (?, 2).
I resolved this by reshaping the Y-placeholder and encoding it with TensorFlow's one-hot encoder.
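The shape mismatch is easy to see on a small example. The sketch below uses NumPy to show what one-hot encoding does to a vector of 0/1 labels. The sample labels are hypothetical, and this illustrates the encoding itself, not the exact TensorFlow call I used (`tf.one_hot` with a depth of 2 produces the same layout).

```python
import numpy as np

# Hypothetical sentiment labels: shape (4,), which does not match (?, 2).
y_train = np.array([1, 0, 0, 1])

# Row i of the 2x2 identity matrix is the one-hot vector for class i,
# so indexing with the labels yields an array of shape (4, 2).
y_one_hot = np.eye(2, dtype=np.float32)[y_train]

print(y_train.shape, "->", y_one_hot.shape)
print(y_one_hot)
```

Column 0 marks negative reviews and column 1 marks positive reviews, matching the two-unit output of the model.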
The encoded placeholder was passed into the cost function and also used in the estimation of accuracy. Another major change I made to the algorithm was to train the model in batches in order to save time. I saved the resulting variables from training the model using TensorFlow's train saver and restored them to make predictions on my test data.
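Batch training can be sketched as follows. This is a NumPy stand-in for the TensorFlow training loop, not the original code. The toy data, the learning rate, the batch size, the number of epochs, and the helper name `iterate_minibatches` are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def iterate_minibatches(X, Y, batch_size, rng):
    # Shuffle once per epoch, then yield consecutive slices of the shuffled data.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]


rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 10 features, one-hot labels.
X = rng.normal(size=(200, 10))
labels = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[labels]

W = np.zeros((10, 2))
b = np.zeros(2)
learning_rate, batch_size, epochs = 0.5, 32, 20  # hypothetical values

for epoch in range(epochs):
    for X_batch, Y_batch in iterate_minibatches(X, Y, batch_size, rng):
        probs = sigmoid(X_batch @ W + b)
        # Gradient of the mean sigmoid cross-entropy with respect to the logits.
        grad_logits = (probs - Y_batch) / len(X_batch)
        W -= learning_rate * (X_batch.T @ grad_logits)
        b -= learning_rate * grad_logits.sum(axis=0)

accuracy = np.mean(np.argmax(sigmoid(X @ W + b), axis=1) == labels)
print(f"training accuracy: {accuracy:.2f}")
```

Each parameter update touches only one batch of rows, so a pass over the data consists of many cheap updates instead of one expensive one.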
Result
For the Random Forest classifier, I ran cross-validation to obtain the root mean squared error score of my model.
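It helps to see how root mean squared error relates to accuracy for a classifier. With 0/1 labels and 0/1 predictions, every squared error is either 0 or 1, so the mean squared error equals the misclassification rate and the RMSE is its square root. The small example below uses hypothetical labels and predictions to show this.

```python
import math


def rmse(y_true, y_pred):
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions; 2 of the 8 predictions are wrong.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(error_rate)            # fraction of misclassified reviews
print(rmse(y_true, y_pred))  # square root of that fraction
```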
The Kaggle score obtained from the model prediction is 0.84760 at position 344.
The model predicted a close tie between negative and positive reviews by viewers; however, there are more negative than positive:
Class 0 represents the negative reviews and 1 represents the positives. Numerically, of the 25000 reviews, 12270 are negative, while 12730 are positive.
For TensorFlow, I obtained an accuracy of 100% on the test conducted on the split train data.
The Kaggle score obtained from the model prediction is 0.86196 at position 289.
Further analysis of the TensorFlow predictions also shows a close tie between the two classes; however, unlike the random forest model, there are slightly more positive reviews than negative ones.
Here, there are 12643 positive reviews and 12357 negative ones.
Future work
Future work will involve tuning the TensorFlow hyperparameters, such as the learning rate and batch size, and applying a different word embedding technique such as Word2Vec or Doc2Vec. Having 100% accuracy locally on my TensorFlow model and 86% accuracy on Kaggle suggests the model is overfitting; this will form part of what has to be investigated and corrected.