Using Data to Analyze Movie Reviews - NLP

Posted on May 20, 2019


Natural Language Processing (NLP) is a subfield of data science and computer science that deals with how computers are programmed to analyze human language. The goal of this NLP project is to conduct sentiment analysis of movie reviews for the Kaggle competition titled "Bag of Words Meets Bags of Popcorn". The models are trained to classify reviews as positive or negative.

The data was sourced from Kaggle. It consists of 50000 movie reviews: 25000 reviews in the train data carry a labeled 'sentiment', and the remaining 25000 reviews are in the test data, which contains only the 'id' and the 'review' for each entry.

Data preprocessing

Data Preprocessing from unstructured text to bag-of-words:

 


The first step in data preprocessing is to clean each review, since it contains HTML tags and various symbols that are noise and not useful for prediction. The plain text was extracted from each review with BeautifulSoup and regular expressions.

Image1: Raw review

Specifically, BeautifulSoup was used to remove HTML tags and a regular expression was applied to remove punctuation marks. The resulting text was converted to lowercase and tokenized (split into words), and stop words were removed using the Python library NLTK (Natural Language Toolkit).

Image2: Processed review

Stop words are removed because they make the data unnecessarily bulky and contribute to the curse of dimensionality while carrying little statistical importance for the prediction. Just by visually comparing Image1 and Image2 above, it is clear that there is a significant reduction in the content of the review, and the removed data are noise.
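The cleaning steps described above can be sketched as follows. This is a minimal illustration that uses only the standard library: a regular expression stands in for BeautifulSoup's tag removal, and a tiny hand-written stop-word set stands in for NLTK's English list. The function name `clean_review` and the sample text are hypothetical, and keeping letters only (which also drops digits) is an assumption about how punctuation was stripped.

```python
import re
from html import unescape

# Tiny illustrative stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}


def clean_review(raw_html):
    """Turn a raw HTML review into a list of lowercase, non-stop-word tokens."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # drop HTML tags (stand-in for BeautifulSoup)
    text = unescape(text)                      # decode entities such as &amp;
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only (assumption)
    tokens = text.lower().split()              # lowercase, then split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]


sample = "This movie was <br /><br />GREAT, and the acting is superb!"
print(clean_review(sample))
```

On a real dataset the same function would be applied to every review before vectorization.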

Using raw text like this as input to a classification algorithm, or to any other kind of algorithm, requires that the text be converted to feature vectors (word embedding, in the broad sense). There are several ways to do this, such as Word2Vec and Doc2Vec, which are prediction-based approaches. I applied one of the oldest conversion methods, called bag-of-words, which is a frequency-based approach.

Using scikit-learn's CountVectorizer, the reviews were converted to a numeric representation (vectors), and then to arrays for faster handling.

The resulting bag-of-words has 5000 features, a limit I set through the max_features argument of CountVectorizer.
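To make the frequency-based idea concrete, here is a small standard-library sketch of what a bag-of-words with a feature limit does. It is not CountVectorizer itself: the helper names `build_vocabulary` and `vectorize`, the toy documents, and the limit of 3 features (instead of 5000) are all assumptions chosen for illustration.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, max_features):
    """Keep the max_features most frequent words across all documents."""
    totals = Counter(word for doc in tokenized_docs for word in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(word for word, _ in ranked[:max_features])


def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
    ["great", "plot"],
]
vocab = build_vocabulary(docs, max_features=3)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each review becomes a fixed-length row of counts, so rare words outside the limit are simply dropped.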

Modeling:

1 - Random Forest

2 - TensorFlow Binary Classification

 

Random Forest

Having converted the review text to vectors, the next step was modeling with scikit-learn's random forest classifier. First I tuned the hyperparameters, except for n_estimators.


Using the grid search result, I trained a random forest classifier with 100 trees (n_estimators = 100).
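A grid search of this kind might look like the sketch below, which uses scikit-learn's GridSearchCV with n_estimators held fixed at 100. The parameter grid is hypothetical, since the values actually searched are not listed here, and synthetic data from make_classification stands in for the bag-of-words matrix and sentiment labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the bag-of-words features and sentiment labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid: these candidate values are assumptions for illustration.
param_grid = {
    "max_depth": [5, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),  # n_estimators not tuned
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# By default GridSearchCV refits the best combination on all of X, y.
best_forest = search.best_estimator_
print(search.best_params_)
```

Fixing n_estimators keeps the search small; the tuned forest can then be used directly for prediction.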

 

TensorFlow

Training a TensorFlow model to identify positive and negative reviews was my actual target. To achieve this, I first set the hyperparameters.


The next step was creating placeholders for my dependent and independent variables, defining the model parameters (weights and bias), and applying the sigmoid function for a binary classification model.
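The role of the weights, bias, and sigmoid can be shown without TensorFlow. The plain-Python sketch below computes the predicted probability for a single review vector; the function names and the toy numbers are assumptions, and this is not the TensorFlow graph used in the project.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


review_vector = [1, 2, 1]  # toy bag-of-words counts
untrained = predict_proba(review_vector, [0.0, 0.0, 0.0], 0.0)
leaning_positive = predict_proba(review_vector, [0.5, 1.0, -0.5], 0.0)
print(untrained, leaning_positive)
```

With all-zero parameters the model is undecided, and training moves the weights so that the output shifts toward 0 or 1.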


Upon opening the training session and initializing the variables, the first issue I ran into was the shape of my Y placeholder. TensorFlow could not train the model because the shape of the y_train fed into feed_dict differed from that of the placeholder, (?, 2).

I resolved this by reshaping the Y placeholder and encoding it with TensorFlow's one-hot encoder.
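The shape mismatch is easier to see on a toy example. The sketch below is a plain-Python stand-in for one-hot encoding (TensorFlow provides `tf.one_hot(indices, depth)` for this, which is presumably the encoder meant here): a flat list of 0/1 labels becomes one row of width 2 per label, matching a (?, 2) shape. The helper name and sample labels are hypothetical.

```python
def one_hot(labels, depth):
    """Return one row per label with 1.0 at the label's index and 0.0 elsewhere."""
    return [[1.0 if i == label else 0.0 for i in range(depth)] for label in labels]


y_train = [1, 0, 1]                   # flat labels: shape (3,)
encoded = one_hot(y_train, depth=2)   # one row per label: shape (3, 2)
print(encoded)
```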


The encoded placeholder was passed into the cost function and was also used in the estimation of accuracy. Another major change I made to the algorithm was to train the model in batches in order to save time. I saved the resulting variables from training with TensorFlow's tf.train.Saver and restored them to make predictions on my test data.
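Batch training boils down to feeding the model one slice of the data at a time. A minimal sketch of such a batch iterator is shown below; the function name `iterate_batches`, the toy data, and the batch size of 4 are assumptions, and in a TensorFlow session each yielded pair would be what goes into feed_dict.

```python
def iterate_batches(features, labels, batch_size):
    """Yield consecutive (features, labels) slices of at most batch_size rows."""
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        yield features[start:end], labels[start:end]


X = [[i, i + 1] for i in range(10)]   # 10 toy feature rows
y = [i % 2 for i in range(10)]        # 10 toy labels
batches = list(iterate_batches(X, y, batch_size=4))
print([len(batch_x) for batch_x, _ in batches])
```

The last batch is smaller when the data size is not a multiple of the batch size, which is why a placeholder with an unspecified first dimension is convenient.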


Result

For the random forest classifier, I ran cross-validation to obtain the root mean squared error score of my model.
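For a binary classifier, the root mean squared error has a direct reading in terms of accuracy. Assuming the predictions are hard class labels (0 or 1) rather than probabilities, each squared error is 1 for a misclassified review and 0 otherwise, so:

```latex
\mathrm{RMSE}
  = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
  = \sqrt{\frac{\#\{\, i : y_i \neq \hat{y}_i \,\}}{n}}
  = \sqrt{1 - \mathrm{accuracy}},
  \qquad y_i, \hat{y}_i \in \{0, 1\}
```

As a purely hypothetical illustration, an accuracy of 0.84 would correspond to an RMSE of sqrt(0.16) = 0.4.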


The Kaggle score obtained from the model's predictions is 0.84760, at position 344.


The model predicted a close tie between negative and positive reviews by viewers; however, there are more negative than positive.


Class 0 represents the negative reviews and 1 represents the positives. Numerically, of the 25000 reviews, 12270 are negative, while 12730 are positive.

 

For TensorFlow, I obtained an accuracy of 100% on the test conducted on the split train data.


The Kaggle score obtained from the model prediction is 0.86196 at position 289.

Further analysis of the TensorFlow predictions also shows a close tie between the two classes; however, unlike the random forest model, there are slightly more positive reviews than negative ones.


Here, there are 12643 positive reviews and 12357 negative ones.

 

 

Future work

Future work will include tuning the TensorFlow hyperparameters, such as the learning rate and batch size, and applying a different word embedding technique such as Word2Vec or Doc2Vec. Having 100% accuracy locally on my TensorFlow model and 86% accuracy on Kaggle suggests that the model is overfitting; this will form part of what has to be investigated and corrected.

About Author

Oluwole Alowolodu

Recent graduate of Biotechnology - MS. Data science fellow and AI enthusiast.