Web scraping project
This project was an exercise in scraping information from the web, cleaning it and gathering insights from it through visualization or, where appropriate, machine learning techniques. The website scraped for the project was the movie review website www.empireonline.com. The information scraped for each movie included the title, duration, release date, review rating, name of the reviewer, certificate and the URL of the page containing the complete review and a short preview of the movie.
When making plans to watch a movie, most people rely on movie review sites, or on friends and family who have already watched it, to get an opinion of the movie. Most of the time, reliable ratings only become available long after the movie has been released, and in most cases they are full of spoilers.
The original intention of the project was to scrape information that is publicly available on the net about a movie before its release, such as the name of the movie, its production studio and budget, the salaries of the main actors and actresses, its director and its trailer script, and to use this information alone to come up with a rating for the movie before its release. This would help movie watchers decide whether or not to watch the movie without having to wait a few weeks or months for reliable ratings and picking up many spoilers along the way.
The original scope of the project outlined above had to be drastically reduced because of time constraints, the difficulties encountered in scraping the review site and the lack of publicly available datasets for most of the desired variables.
The final product delivered was a much simplified approach to rating movies pre-release, based on just the few variables which I was able to scrape, namely title, duration, release month and certificate. This can be extended in the future to include more meaningful variables like the trailer, budget, and cast and studio ratings.
The empireonline movie review website contained 475 pages with 24 movie links per page, making a total of approximately 11400 movies. The movie rating, review and the other variables can be scraped by navigating to the individual movie page. At first glance the structure of the website seemed ideal for scraping with scrapy shell, but for some unexplained reason the website could not be accessed using scrapy shell (only an HTTP 504 error was returned). Selenium had to be used instead, and this turned out to be an exercise in patience. The scraping process was plagued by numerous computer freezes, restarts and inconsistent data being written to disk, with the 6 GB laptop running at almost 100 % CPU utilization. The freezes and corrupt data were reduced by firing get requests for only 250 URLs at a time, with each batch taking between 40 and 70 minutes to complete. Fortunately my scraping was not blocked, but the website seemed to apply some throttling mechanism which increased the scraping time the more I scraped.
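The batching idea described above can be sketched as follows. This is only a minimal illustration and not the code used in the project: the `batches` helper and the placeholder URLs are assumptions, and the actual selenium fetching of each batch is left out.

```python
from typing import Iterator, List


def batches(urls: List[str], size: int = 250) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical placeholder URLs; the real review links are not reproduced here.
PAGES, LINKS_PER_PAGE = 475, 24
urls = [f"https://example.invalid/movie/{i}" for i in range(PAGES * LINKS_PER_PAGE)]

chunks = list(batches(urls))

# 475 * 24 = 11400 links, split into 45 full batches plus one batch of 150.
print(len(urls), len(chunks), len(chunks[-1]))  # prints: 11400 46 150
```

Writing the results of each batch to disk before starting the next one means that a freeze or restart costs at most one batch of work.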
Because of the throttling, not all the URLs could be scraped. After removing the corrupt observations at the beginning and the NAs (since they were missing completely at random), the cleaned dataset for the analysis contained 3109 complete observations with 13 variables.
As part of the preprocessing, a sentiment analysis of the movie titles was performed using the textblob package in Python, which is built on top of the Natural Language Toolkit (NLTK). It uses the NLTK corpora and comes equipped with a classifier, translator, automatic language detection and many more features. The sentiments generated from the film title text and their numerical values were added to the dataset as new variables. Sentiment analysis on titles alone cannot be very effective since the text is too short but, as noted earlier, the analysis of film titles was just an oversimplified approach which ultimately needs to be replaced with more meaningful text like the film trailers.
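A title can be scored with textblob roughly as shown below. This is a sketch under assumptions rather than the project's actual preprocessing code: the function names are hypothetical, and the rule that turns the numerical polarity into a positive, negative or neutral label (a cut at zero) is an assumption, since the thresholds used in the project are not stated.

```python
from typing import Tuple


def label_from_polarity(polarity: float, neutral_band: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    The cut at zero (neutral_band=0.0) is an assumption for illustration.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def title_sentiment(title: str) -> Tuple[float, str]:
    """Return (polarity, label) for a film title using textblob."""
    # Imported here so the labelling helper above works without textblob.
    from textblob import TextBlob  # third-party: pip install textblob

    polarity = TextBlob(title).sentiment.polarity
    return polarity, label_from_polarity(polarity)
```

Calling `title_sentiment` on each title gives the two new variables (the numerical score and its label). Titles made up of words that are not in the sentiment lexicon get a polarity of 0.0, which is one reason why such short texts are of limited use.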
An exploratory visual analysis of the dataset was then carried out and is presented in the plots below:
A plot of the ratings shows that two thirds of the films had a rating of 3 or above. This ratio was also seen in the sentiment distribution in the next plot. The mapping of positive sentiments to a rating of 3 or higher has not been quantified here.
A look at the plot of the release years below reveals that film production exploded from 1990 onward and saw a major dip between 1999 and 2004. This dip can only be explained by the incompleteness of the dataset.
Plotting the length of the films against the release dates and years below shows a great variation in the length of films up to 2005, with many short films of less than 30 minutes. From 2005 onward, the length of films more or less standardizes to between 70 and 140 minutes. The increase in film length was probably linked to the technological improvements in digital storage media in the 1990s and 2000s.
The plot below shows that most films in the dataset had a certificate of 15 or above, meaning that most of them are rated as suitable only for viewers aged 15 or older. Unfortunately, the certification scheme is not the same in all film-producing countries, so generalisations should be made with caution.
A Shiny app, which can be accessed on this link (to be inserted), was developed for the graphical visualization of the dataset.
The prediction model shown below was built in Microsoft Azure Machine Learning Studio to predict the movie ratings based on the title sentiment scores alone. A multiclass decision jungle algorithm was used since the prediction entails classifying the movies into the 5 rating classes.
From the confusion matrix below, it can be seen that the model was able to correctly classify 90.8 % of the movies with a rating of 3 but performed very poorly on the other rating classes, with the second best class being classified correctly only 8.5 % of the time. Nevertheless, since classes 3 and 4 had the highest number of movies, the model was able to achieve an overall accuracy of 37.4 % across the five classes. This is not impressive but is still better than the 20 % accuracy expected from randomly guessing among the five ratings.
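The figures quoted above are of two different kinds: the per-class percentages are the share of each true class that was classified correctly, while the overall accuracy is the share of all movies classified correctly. A short sketch of both calculations follows. The function names are hypothetical and the small 3 x 3 matrix is made-up toy data used only to exercise the code; it is not the project's confusion matrix.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[int]]  # rows = true class, columns = predicted class


def overall_accuracy(cm: Matrix) -> float:
    """Share of all observations that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_recall(cm: Matrix) -> List[float]:
    """For each true class, the share of its observations predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def uniform_guess_baseline(n_classes: int) -> float:
    """Expected accuracy when guessing each class with equal probability."""
    return 1.0 / n_classes


# Made-up toy counts, NOT the results of the project.
toy = [
    [5, 5, 0],
    [1, 8, 1],
    [0, 6, 4],
]

print(per_class_recall(toy))      # recall for each true class
print(overall_accuracy(toy))      # 17 correct out of 30
print(uniform_guess_baseline(5))  # 0.2 for five rating classes
```

A model that predicts a dominant class most of the time scores well on that class and poorly on the rest, which is the pattern described above for the rating of 3.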
As indicated above, more meaningful pre-release film ratings can be investigated using the movie trailers, budget, salaries of the main actors and actresses, and the studio and director ratings. This work simply showed that such a rating tool is possible and leaves room for further investigation.