Web scraping project

Posted on Oct 30, 2017

This project was an exercise in scraping information from the web, cleaning it, and gathering insights from it through visualization or machine learning techniques where appropriate. The website scraped for the project was the movie review website www.empireonline.com. The information scraped for each movie included the title, duration, release date, review rating, name of the reviewer, certificate, and the URL of the page containing the complete review and a short preview of the movie.

When making plans to watch a movie, most people rely on movie review sites for an opinion, or on friends and family who might already have watched it. Most of the time, reliable ratings only become available long after the movie has been released, and in most cases they are full of spoilers.

The original intention of the project was to scrape information that is publicly available on the net before a movie's release, such as its name, production studio and budget, the salaries of the main actors and actresses, its director and its trailer script, and to use this information alone to come up with a rating for the movie before its release. This would help movie watchers decide whether to watch the movie without having to wait a few weeks or months for reliable ratings and picking up many spoilers along the way.

The original scope of the project outlined above had to be drastically reduced because of time constraints, the difficulties encountered in scraping the review site, and the lack of publicly available datasets for most of the desired variables.

The final product delivered was a much simplified approach to rating movies pre-release, based on just the few variables I was able to scrape, namely title, duration, release month and certificate. This can be extended in the future to include more meaningful variables such as the trailer, budget, and cast and studio ratings.

The empireonline movie review website contained 475 pages with 24 movie links per page, making a total of approximately 11,400 movies. The movie rating, review and the other variables can be scraped by navigating to the individual movie page. At first glance the site structure seemed ideal for scraping with Scrapy, but for some unexplained reason the website could not be accessed from the scrapy shell (only an HTTP 504 error was returned). Selenium had to be used instead, and this turned out to be an exercise in patience. The scraping process was plagued by numerous computer freezes, restarts and inconsistent data being written to disk, with the 6 GB laptop running at almost 100 % CPU utilization. The freezes and corrupt data were reduced by firing GET requests for only 250 URLs at a time, with each batch taking between 40 and 70 minutes to complete. Fortunately my scraping was not blocked, but the website seemed to be applying some throttling mechanism, which increased the scraping time the more I scraped.
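The batching idea described above can be sketched as follows. This is a minimal illustration and not the original scraper: `fetch` and `save` are hypothetical stand-ins for the Selenium page load and the write to disk, and the URLs are placeholders because the post does not give the real movie-page URL pattern.

```python
from typing import Callable, Iterator, List

PAGES = 475           # listing pages reported in the post
LINKS_PER_PAGE = 24   # movie links per listing page
BATCH_SIZE = 250      # URLs requested per run, as in the post


def batches(urls: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each at most `size` long."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def scrape_in_batches(
    urls: List[str],
    fetch: Callable[[str], dict],
    save: Callable[[List[dict]], None],
    size: int = BATCH_SIZE,
) -> int:
    """Fetch one batch at a time and persist it before starting the next."""
    done = 0
    for batch in batches(urls, size):
        rows = [fetch(url) for url in batch]
        save(rows)  # saving per batch means a crash costs at most one batch
        done += len(rows)
    return done


# Placeholder URLs only (assumption): the real URL pattern is not in the post.
urls = [f"https://example.invalid/movie/{i}" for i in range(PAGES * LINKS_PER_PAGE)]

saved: List[List[dict]] = []
# Dummy fetch that just echoes the URL; a real one would drive a browser.
total = scrape_in_batches(urls, fetch=lambda url: {"url": url}, save=saved.append)
print(total, len(saved), len(saved[-1]))
```

With 11,400 URLs and batches of 250, this gives 46 batches, the last one holding 150 URLs. Persisting each batch before moving on is one way to limit the damage from the freezes described above.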

Because of the throttling, not all of the URLs could be scraped. After removing the corrupt observations at the beginning and the NAs (since they were missing completely at random), the cleaned dataset for the analyses contained 3109 complete observations with 13 variables.
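Dropping incomplete rows, as described above, might look like the sketch below. The rows and their values are made up for illustration, the field names are only a subset of the variables mentioned in the post, and the set of markers treated as missing is an assumption.

```python
from typing import Any, Dict, List

# Markers treated as missing (assumption; the post only mentions NAs).
MISSING = {None, "", "NA"}


def is_missing(value: Any) -> bool:
    """Return True if a scraped field is empty or an explicit NA marker."""
    if isinstance(value, str):
        value = value.strip()
    return value in MISSING


def complete_cases(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only rows in which every field has a value (listwise deletion)."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Toy data, invented for this example.
toy_rows = [
    {"title": "Film A", "duration": "101", "rating": "3", "certificate": "15"},
    {"title": "Film B", "duration": "NA", "rating": "4", "certificate": "12A"},
    {"title": "Film C", "duration": "88", "rating": "", "certificate": "PG"},
    {"title": "Film D", "duration": "125", "rating": "5", "certificate": None},
    {"title": "Film E", "duration": "94", "rating": "2", "certificate": "18"},
]

cleaned = complete_cases(toy_rows)
print([row["title"] for row in cleaned])
```

Listwise deletion of this kind is reasonable here only because the missing values are taken to be missing completely at random, as the post states.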

As part of the preprocessing, a sentiment analysis of the movie titles was performed using the textblob package in Python, which is an interface to the Natural Language Toolkit (NLTK). It uses the NLTK corpora and comes equipped with a classifier, a translator, automatic language detection and many more features. The sentiments generated from the film title text and their numerical values were added to the dataset as new variables. Sentiment analysis on titles alone cannot be very effective since the text is too short, but as noted earlier, working with film titles was just an oversimplified approach that ultimately needs to be replaced with more meaningful text such as the film trailers.
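A minimal sketch of this step, assuming the numerical value is TextBlob's polarity score (which ranges from -1 to 1) and that the sentiment label is derived from its sign. The post does not say how the labels were produced, so `label_from_polarity` and its threshold are hypothetical.

```python
from typing import Tuple


def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score to a coarse label (assumed rule, not from the post)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def title_sentiment(title: str) -> Tuple[float, str]:
    """Return (polarity, label) for a film title using TextBlob."""
    # Imported here so the labelling helper works without textblob installed.
    from textblob import TextBlob

    polarity = TextBlob(title).sentiment.polarity
    return polarity, label_from_polarity(polarity)


print(label_from_polarity(0.5), label_from_polarity(-0.25), label_from_polarity(0.0))
```

Calling `title_sentiment` on each title would yield the two new columns. Many titles contain no sentiment-bearing words and score exactly 0, which is one reason titles alone are a weak signal.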

An exploratory visual analysis of the dataset was then carried out and is presented in the plots below:

A plot of the ratings shows that two thirds of the films had a rating of 3 or above. A similar proportion appears in the sentiment distribution in the next plot. How positive sentiments map to ratings of 3 or higher has not been quantified here.

A look at the plot of the release years below suggests that film production exploded from 1990 onward and dipped sharply between 1999 and 2004. The dip between 1999 and 2004 is most plausibly an artifact of the incomplete dataset rather than a real drop in production.

Plotting the length of the films against the release dates and years below shows great variation in film length up to 2005, with many short films of less than 30 minutes. From 2005 onward, film lengths settle into a range of roughly 70 to 140 minutes. The increase in film length was probably linked to the technological improvements in digital storage media in the 1990s and 2000s.


The plot below shows that most films in the dataset had a certificate of 15 or higher, meaning that most are suitable for viewers aged 15 and over. Unfortunately, the certification scheme is not the same in all film-producing countries, so caution is needed when generalising.

A Shiny app, which can be accessed on this link (to be inserted), was developed for the graphical visualization of the dataset.

The prediction model shown below was built in Microsoft Azure Machine Learning Studio to predict the movie ratings based on the title sentiment scores alone. A multiclass decision jungle algorithm was used since the prediction entails classifying the movies into the 5 rating classes.

From the confusion matrix below, it can be seen that the model correctly classified 90.8 % of movies with a rating of 3 but performed poorly on the other rating classes, with the second-best class reaching just 8.5 % correct. Nevertheless, since classes 3 and 4 contained the most movies, the model achieved an overall accuracy of 37.4 % across the five classes. This is not impressive, but it is still better than the 20 % accuracy expected from guessing uniformly at random among the five ratings.
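The relationship between the per-class figures and the overall accuracy can be illustrated with a small sketch. The confusion matrix here is invented for the example and is not the matrix from the Azure experiment; rows are taken to be the actual rating and columns the predicted rating.

```python
from typing import List

Matrix = List[List[int]]


def per_class_recall(matrix: Matrix) -> List[float]:
    """Share of each actual class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def overall_accuracy(matrix: Matrix) -> float:
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def uniform_guess_accuracy(n_classes: int) -> float:
    """Expected accuracy when guessing each class with equal probability."""
    return 1 / n_classes


# Made-up counts for ratings 1 to 5; the model predicts mostly class 3.
toy = [
    [0, 0, 10, 0, 0],
    [0, 1, 18, 1, 0],
    [0, 1, 36, 3, 0],
    [0, 0, 18, 2, 0],
    [0, 0, 9, 1, 0],
]

print(per_class_recall(toy))
print(overall_accuracy(toy))
print(uniform_guess_accuracy(len(toy)))
```

In this toy case the model gets 90 % of class 3 right and almost nothing else, yet overall accuracy is 39 % because class 3 is the largest class. The same mechanism explains how a high score on one class and poor scores on the rest can still beat the 20 % random-guess level.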

As indicated above, more meaningful pre-release film ratings can be investigated using the movie trailers, budget, salaries of the main actors and actresses, and the studio and director ratings. This work simply demonstrates that such a rating tool is possible and leaves the door open for further investigation.
