NLP Data Analysis: Movie Reviews from Rotten Tomatoes
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Tingting Chang. She is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from January 9th to March 31st, 2017. This post is based on her second class project, Web Scraping (due in the 6th week of the program).
About Tingting Chang: GitHub | LinkedIn
Introduction
Data shows that it has become very common for people to look at reviews before they go to the theater to see a movie. Some websites provide rankings, ratings, and reviews for upcoming movies as well as older ones.
Rotten Tomatoes is one of the best known, and it aggregates reviews from other movie critic websites. This raises a question: are these rankings, ratings, and reviews reliable? Are they related to one another? I scraped all the movie reviews, rankings, and ratings for the top 100 movies of 2016. After using the NLTK package to tokenize the data and remove stop words and punctuation, I was left with about 13,467 reviews containing only words for all 100 movies. From there, I used sentiment analysis, TF-IDF, and a topic model (LDA) to analyze the reviews.
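The cleaning step can be sketched roughly as follows. The post used NLTK for tokenizing and stop-word removal; to keep this sketch runnable without downloading NLTK data, it swaps in a regular-expression tokenizer and a tiny hand-written stop list. The STOP_WORDS set, the clean_review helper, and the sample sentence are all illustrative assumptions, not the author's code.

```python
import re

# Tiny illustrative stop list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}

def clean_review(text):
    """Lowercase a review, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # also strips punctuation and digits
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up review text, for illustration only.
print(" ".join(clean_review("It is one of the best movies of the past years!")))
```

The result is a review reduced to its content words, which is the form the later sentiment, TF-IDF, and topic-model steps work on.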
Data
This is what my data looks like:
The bar plots show the number of reviews for each movie. I picked the 20 movies with the most reviews and the 20 movies with the fewest.
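Picking the most-reviewed and least-reviewed movies could look something like the sketch below. It assumes the scraped data is available as (movie, review) pairs; the rows, the movie names, and the most_and_least_reviewed helper are hypothetical, and the post would use n=20 rather than n=1.

```python
from collections import Counter

# Hypothetical (movie, review) rows standing in for the scraped data.
rows = [
    ("Movie A", "great"), ("Movie A", "fun"), ("Movie A", "slow"),
    ("Movie B", "fine"), ("Movie B", "dull"),
    ("Movie C", "lovely"),
]

review_counts = Counter(movie for movie, _ in rows)

def most_and_least_reviewed(counts, n):
    """Return the n most-reviewed and n least-reviewed (movie, count) pairs."""
    ranked = counts.most_common()          # sorted by count, descending
    return ranked[:n], ranked[:-n - 1:-1]  # second slice walks up from the bottom

top, bottom = most_and_least_reviewed(review_counts, 1)
print(top, bottom)
```

The two lists can then be passed to any plotting library to draw the bar charts.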
I used the sentiment analysis function built into the NLTK package to calculate a sentiment score for every review:
For example, the first review, "one disney best movies past years", has a score of 0.6369, which is quite positive. By contrast, the review "animal antics display movie also gently broaches human issues stereotyping prejudice racism something" has a negative score of -0.4588 because of negative words such as "stereotyping", "prejudice", and "racism".
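The two scores quoted above are consistent with the compound score of NLTK's VADER analyzer (nltk.sentiment.vader.SentimentIntensityAnalyzer), although the post does not name the function it used. VADER sums the valence of the words in a text and squashes the sum into the range -1 to 1 with x / sqrt(x^2 + 15). The sketch below reproduces only that final normalization step. The summed valences 3.2 and -2.0 are assumptions chosen because they reproduce the quoted scores; they are not taken from the author's output.

```python
import math

def normalize(valence_sum, alpha=15):
    """Map a summed valence onto (-1, 1), as VADER does for its compound score."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Assumed summed valences (illustrative), chosen to match the quoted scores.
print(round(normalize(3.2), 4))
print(round(normalize(-2.0), 4))
```

This shows why a single strongly positive word can push a short review well above zero: the normalization rises quickly for small sums and flattens out as the sum grows.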
I also computed the average sentiment score for every movie:
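Averaging per movie amounts to grouping the review scores by title and taking the mean of each group. A minimal sketch, assuming the scores are held as (movie, compound score) pairs; the data and the mean_score_by_movie helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (movie, compound score) pairs; real values come from the sentiment step.
scored_reviews = [
    ("Movie A", 0.6369), ("Movie A", -0.4588),
    ("Movie B", 0.25), ("Movie B", 0.75),
]

def mean_score_by_movie(pairs):
    """Group compound scores by movie title and average each group."""
    grouped = defaultdict(list)
    for movie, score in pairs:
        grouped[movie].append(score)
    return {movie: mean(scores) for movie, scores in grouped.items()}

averages = mean_score_by_movie(scored_reviews)
print(averages)
```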
Then I used box plots to look for a relationship between the movie rating and the mean compound sentiment score.
However, there is no clear relationship between the mean sentiment score and the movie rating, except that most movies have a positive sentiment score.
In addition, I used another box plot for movie ranking and mean compound sentiment score. I found that rank 2.0 has a negative sentiment score, which was pretty interesting. I went to the website to try to find out why.
For this movie, Things to Come, Rotten Tomatoes gives a 100% rating and ranks it 11th. However, you can see from the website that it has only a 73% audience score. I did some research and learned that some companies hire reviewers to write fake reviews for a movie. Moreover, prominent reviewers' reviews have a larger effect on a movie's ranking. Nevertheless, their taste cannot fully represent the audience's taste. Therefore, it is not surprising when a movie's ranking and rating do not match the sentiment score of its reviews.
I also did some exploratory data analysis (EDA) to find out whether there was any relationship between movie ranking, rating, review count, and sentiment score. Unfortunately, there is no obvious relationship between them.
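One simple way to put a number on "no obvious relationship" is a Pearson correlation coefficient, where values near 0 indicate little linear relationship. This is a sketch of one possible check, not necessarily the one used in the project; the pearson helper and the per-movie values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-movie values, for illustration only.
ratings = [100, 98, 97, 95, 93]
mean_sentiment = [0.10, 0.30, 0.05, 0.25, 0.15]
print(pearson(ratings, mean_sentiment))
```

The same function can be applied to any pair of columns, such as ranking against review count.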
TF-IDF
TF, or term frequency, measures how often each word occurs in a document. Using the TF-IDF functions in the NLTK package, I computed the frequency of every word in each document:
The next step of TF-IDF is the IDF, or inverse document frequency. It targets words that appear only in certain documents rather than in all of them, and gives them a higher weight. Now we have a weight for every token in every document, and we can separate the important words from the unimportant ones.
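The two parts of the weighting can be sketched in a few lines. This is one common variant (relative term frequency times the log of the inverse document frequency), written in plain Python rather than with NLTK; the exact weighting used in the project may differ, and the documents and helper names here are hypothetical.

```python
import math

# Hypothetical tokenized reviews; each inner list is one document.
docs = [
    ["great", "movie"],
    ["bad", "movie"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of (documents / documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`, relative to the whole collection."""
    return tf(term, doc) * idf(term, docs)

print(tf_idf("movie", docs[0], docs))  # "movie" is in every document
print(tf_idf("great", docs[0], docs))  # "great" is in only one document
```

A word that appears in every document gets a weight of zero, while a word confined to a few documents keeps a positive weight, which is the behavior described above.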
Topic Model
Topic modeling, as the name suggests, is a process for uncovering hidden patterns in a text corpus. I built 9 topics from all the reviews and printed the 20 most frequent words for each topic:
For better visualization, I used the WordCloud package in Python to visualize the 500 most frequent words in each topic.
From the plots above, we can see that the word "fan" shows up in three topics. I have not yet found out why this happens.