Movie Review Analysis - NLP

Posted on May 20, 2019

Natural Language Processing (NLP) is a subfield of computer science and data science that deals with how computers are programmed to analyze human language. The goal of this project is to conduct sentiment analysis of movie reviews, based on the Kaggle competition titled Bag of Words Meets Bags of Popcorn. The models are trained to distinguish positive reviews from negative ones.

The data was sourced from Kaggle. It consists of 50,000 movie reviews: 25,000 reviews in the training data, each labeled with a 'sentiment', and the remaining 25,000 in the test data, which contains only the 'id' and the 'review' for each entry.

Data preprocessing

Data Preprocessing from unstructured text to bag-of-words:

The first step in the data preprocessing is to clean each review, since the raw text contains HTML tags and assorted symbols that are noise and not useful for prediction. The text was extracted from each review with BeautifulSoup and regular expressions.

Image1: Raw review

Specifically, BeautifulSoup was used to remove HTML tags, and a regular expression was applied to remove punctuation marks. The resulting text was converted to lowercase and tokenized, i.e. split into individual words, and stop words were removed using NLTK (Natural Language Toolkit), a Python library.

Image2: Processed review

Stop words are removed because they make the data unnecessarily bulky and contribute to the curse of dimensionality while carrying little statistical weight in the prediction. Just by visually comparing Image1 and Image2 above, it is clear that the content of the review is significantly reduced, and the removed data are all noise.
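The cleaning steps can be sketched roughly as follows. To keep the sketch runnable without extra downloads, it uses only the standard library: the `TextExtractor` class stands in for BeautifulSoup's `get_text()`, and the small `STOP_WORDS` set stands in for NLTK's English stop-word list. Both names, and the `review_to_words` helper, are illustrative assumptions and not the code used for this project.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word set (assumption); the post used NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "this", "of", "to", "in", "it"}


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def review_to_words(raw_review):
    # 1. Drop HTML tags (stand-in for BeautifulSoup(raw_review, "html.parser").get_text()).
    extractor = TextExtractor()
    extractor.feed(raw_review)
    extractor.close()
    text = " ".join(extractor.chunks)
    # 2. Replace everything that is not a letter (punctuation, digits) with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and split into words.
    words = letters_only.lower().split()
    # 4. Remove stop words.
    return [w for w in words if w not in STOP_WORDS]


cleaned = review_to_words("<br />This movie was <b>GREAT</b>, and the plot is not boring!")
print(cleaned)  # ['movie', 'great', 'plot', 'not', 'boring']
```

Joining the returned words with spaces gives one cleaned string per review, which is the form a vectorizer expects.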

Before raw text like this can serve as input to a classification algorithm, or to any other algorithm, it has to be converted to numeric feature vectors. There are several ways to do this, such as Word2Vec and Doc2Vec, which are prediction-based word-embedding approaches. Here, I applied one of the oldest conversion methods, bag-of-words, which is a frequency-based approach.

Using scikit-learn's CountVectorizer, the reviews were converted to a numeric representation (vectors of word counts), and then to arrays for faster handling.

The resulting bag-of-words has 5000 features, a limit I set through an argument to CountVectorizer.
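As a small illustration of this step, here is a sketch using CountVectorizer on three toy reviews. The toy reviews and the limit of 3 features are assumptions made so the output is easy to read; in the project the limit was 5000, and I am assuming here that it was set through the `max_features` argument.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the cleaned reviews (assumption, not the Kaggle data).
clean_reviews = [
    "movie great plot great",
    "movie boring plot",
    "great acting",
]

# max_features keeps only the most frequent terms across the corpus.
vectorizer = CountVectorizer(analyzer="word", max_features=3)
features = vectorizer.fit_transform(clean_reviews).toarray()

vocabulary = sorted(vectorizer.vocabulary_)  # terms in column order
print(vocabulary)  # ['great', 'movie', 'plot']
print(features)    # one row per review, one column per term
```

Each row counts how often each retained term appears in one review, so the first review becomes [2, 1, 1]. The less frequent words "boring" and "acting" are dropped by the feature limit.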


Two models were trained on these features:

1 - Random Forest

2 - TensorFlow Binary Classification


Random Forest

With the review text converted to vectors, the next step was modeling with scikit-learn's random forest classifier. First I tuned the hyperparameters, except for n_estimators.

Using the grid search result, I trained a random forest classifier with 100 trees (n_estimators = 100).
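A grid search of this kind might look like the sketch below. The parameter grid, the number of folds and the toy data are all assumptions for illustration, since the actual grid is not listed here; only the fixed `n_estimators=100` comes from the text above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy word-count matrix and labels standing in for the bag-of-words data.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20))
y = (X[:, 0] > X[:, 1]).astype(int)

# Hypothetical grid: n_estimators is held fixed, the other settings are searched.
param_grid = {"max_depth": [5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
forest = search.best_estimator_   # refit on all of X, y with the best settings
predictions = forest.predict(X)
```

By default GridSearchCV refits the best combination on the full training data, so `best_estimator_` can be used directly for prediction.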



TensorFlow Binary Classification

Applying TensorFlow to train a model that identifies positive and negative reviews was my actual target. To achieve this, I first set the hyperparameters.

The next step was creating placeholders for my dependent and independent variables, defining the model parameters, i.e. weights and bias, and applying the sigmoid function for a binary classification model.

Upon opening a training session and initializing the variables, the first issue I ran into was the shape of my Y-placeholder. TensorFlow could not train the model because the shape of the y_train fed into feed_dict differed from that of the placeholder, (?, 2).

I resolved this by reshaping the Y-placeholder and encoding it with TensorFlow's one-hot encoder.
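To make the shape mismatch concrete, here is a NumPy sketch of what one-hot encoding does to the labels. It is a stand-in for TensorFlow's `tf.one_hot(labels, depth=2)`, and the four toy labels are an assumption.

```python
import numpy as np

# Toy sentiment labels: a flat vector of shape (4,), which does not match (?, 2).
y_train = np.array([1, 0, 0, 1])

# Row i of the identity matrix is the one-hot vector for class i.
y_onehot = np.eye(2)[y_train]

print(y_train.shape)   # (4,)
print(y_onehot.shape)  # (4, 2)
print(y_onehot)
```

Each label becomes a row with a 1 in the column of its class, so negative (0) maps to [1, 0] and positive (1) maps to [0, 1], which matches a placeholder with two columns.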

The encoded placeholder was passed into the cost function and also used in the accuracy estimate. Another major change I made to the algorithm was to train the model in batches in order to save time. I saved the variables resulting from training with TensorFlow's tf.train.Saver and restored them to make predictions on my test data.
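The idea of batch training with a sigmoid output and one-hot labels can be sketched in plain NumPy. This is not the TensorFlow code from the project: the function names, the hyperparameter values and the random toy data are assumptions chosen only to show the mechanics.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_in_batches(X, y, n_classes=2, learning_rate=0.1, epochs=50,
                     batch_size=32, seed=0):
    """Mini-batch gradient descent for a sigmoid classifier on one-hot labels."""
    rng = np.random.default_rng(seed)
    Y = np.eye(n_classes)[y]                  # one-hot labels, shape (n, n_classes)
    W = np.zeros((X.shape[1], n_classes))     # weights
    b = np.zeros(n_classes)                   # bias
    for _ in range(epochs):
        order = rng.permutation(len(X))       # reshuffle the rows every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            pred = sigmoid(X[idx] @ W + b)
            error = pred - Y[idx]             # cross-entropy gradient w.r.t. the logits
            W -= learning_rate * X[idx].T @ error / len(idx)
            b -= learning_rate * error.mean(axis=0)
    return W, b


def accuracy(X, y, W, b):
    pred = sigmoid(X @ W + b)
    return float(np.mean(pred.argmax(axis=1) == y))


# Toy, linearly separable data standing in for the bag-of-words features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

W, b = train_in_batches(X, y)
print(accuracy(X, y, W, b))
```

Each parameter update only touches `batch_size` rows, which is what makes batch training cheaper per step than passing the whole training set through the model at once.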


For the random forest classifier, I ran cross-validation to obtain the root mean squared error of my model.
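A cross-validated RMSE can be computed along these lines. The toy data and the choice of 5 folds are assumptions; the sketch uses scikit-learn's `neg_mean_squared_error` scorer and takes the square root afterwards.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy word-count matrix and labels standing in for the bag-of-words data.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20))
y = (X[:, 0] > X[:, 1]).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# scikit-learn reports the negated MSE so that larger is always better.
neg_mse = cross_val_score(forest, X, y, cv=5, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)

print(rmse_per_fold)
print(rmse_per_fold.mean())
```

With hard 0/1 predictions, the squared error of a review is 1 for a mistake and 0 otherwise, so the RMSE of a fold is simply the square root of its misclassification rate.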

The Kaggle score obtained from the model's predictions is 0.84760, at position 344.

The model predicted a close tie between negative and positive reviews; however, there are more negative than positive:

Class 0 represents the negative reviews and 1 represents the positives. Numerically, of the 25000 reviews, 12270 are negative, while 12730 are positive.

For TensorFlow, I obtained an accuracy of 100% on the test conducted on a held-out split of the training data.

The Kaggle score obtained from the model's predictions is 0.86196, at position 289.

Further analysis of the TensorFlow predictions also shows a close tie between the two classes; however, unlike the random forest model, there are slightly more positive reviews than negative ones.

Here, there are 12643 positive reviews and 12357 negative ones.



Future work will include tuning the TensorFlow hyperparameters, such as the learning rate and batch size, and applying a different word embedding technique such as Word2Vec or Doc2Vec. Having 100% accuracy locally on my TensorFlow model but only about 86% on Kaggle suggests the model is overfitting; this is part of what has to be investigated and corrected.

About Author

Oluwole Alowolodu

Recent graduate of Biotechnology - MS. Data science fellow and AI enthusiast.