Scraping Box Office Mojo

Christopher Redino

Posted on Mar 19, 2016

Contributed by Christopher Redino. Christopher is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from January 11th to April 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).

Before joining the NYC Data Science Academy, I took on a small project of my own to play around with some of Python's data science modules. I scraped every movie page on Box Office Mojo using Beautiful Soup and urllib2, and then did some simple analysis using NumPy, pandas and scikit-learn. The full code is available on my GitHub.

Box Office Mojo is a good site to practice scraping because it is (more or less) well organized so that a simple procedure can be repeated to scrape many pages with few exceptions.

I started by generating the list of movies I wanted to scrape, working from the alphabetical index of movies on the site.

The Python script, bom.py, generates a list of links, each leading to a different movie's page. Once all the movies on a given page have been added to the list, the script goes to the next page for that letter (if it exists). Once all pages for a given letter are exhausted, the scraper moves to the next letter.
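
The letter-by-letter, page-by-page walk can be sketched roughly as follows. This is a minimal illustration, not the code in bom.py: it uses the standard library's html.parser instead of Beautiful Soup so it runs without extra installs, and the names `LinkCollector`, `collect_movie_links` and `fetch_page`, the `/movies/?id=` link marker and the `max_pages` cap are all assumptions made for the example.

```python
import string
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose href contains `marker`."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.links.append(href)


def collect_movie_links(fetch_page, letters=string.ascii_uppercase,
                        marker="/movies/?id=", max_pages=50):
    """Walk the index letter by letter, page by page.

    `fetch_page(letter, page)` is a hypothetical callable that returns the
    HTML of one index page, or None when that page does not exist.
    """
    links = []
    for letter in letters:
        for page in range(1, max_pages + 1):
            html = fetch_page(letter, page)
            if html is None:
                break  # no more pages for this letter
            collector = LinkCollector(marker)
            collector.feed(html)
            if not collector.links:
                break  # an empty page also ends this letter
            links.extend(collector.links)
    return links
```

Passing the page fetcher in as an argument keeps the traversal logic separate from the network code, which makes it easy to try out against a few saved pages.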

The "urlopen_with_retry" part is just a small modification to urlopen from urllib2 that adds retrying behavior; we don't want the whole scraper to fail if one of the many pages we navigate to is a little slow to load.
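
A retrying wrapper of this kind might look like the sketch below. The original was written against Python 2's urllib2; this version assumes Python 3's urllib.request, and the `retries`, `delay`, `timeout` and `opener` parameters are illustrative choices rather than the ones used in bom.py.

```python
import time
import urllib.request


def urlopen_with_retry(url, retries=3, delay=1.0, timeout=10,
                       opener=urllib.request.urlopen):
    """Open `url`, retrying on network errors before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return opener(url, timeout=timeout)
        except OSError:
            # URLError and socket timeouts are both OSError subclasses.
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
```

The `opener` argument exists only so the function can be exercised with a stand-in instead of a real network call.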

The if __name__ == "__main__" part specifies which process will generate the list of movies. This matters because the scraping can be sped up a little by dividing it among several processes for parallel processing, using Python's multiprocessing module.
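
As a rough sketch of how the main guard and multiprocessing fit together (the worker `scrape_movie`, the helper `scrape_all` and the two example links are placeholders invented for illustration, not the code from the project):

```python
from multiprocessing import Pool


def scrape_movie(link):
    """Placeholder worker: a real version would fetch and parse the page."""
    return {"link": link, "id": link.rsplit("=", 1)[-1]}


def scrape_all(links, workers=4):
    """Spread the links across worker processes and collect the rows."""
    with Pool(processes=workers) as pool:
        return pool.map(scrape_movie, links)


if __name__ == "__main__":
    # Only the parent process runs this block. Worker processes that
    # re-import the module skip it, so the list is built exactly once.
    movie_links = ["/movies/?id=avatar.htm", "/movies/?id=titanic.htm"]
    for row in scrape_all(movie_links, workers=2):
        print(row)
```

Without the guard, platforms that start workers by re-importing the script would have each worker try to rebuild the list and start its own pool.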

Once we have the list of all movie links, the script runs through the whole list and for each movie page I scrape just a subset of the data on the page that I am interested in.

[Figure: example movie page]

From each movie page I scrape the title, domestic and worldwide box office, the budget, the genre, a list of actors, directors, writers, producers and composers, as well as the rating, distributor, release date and runtime. Some of this code is straightforward, but the records on Box Office Mojo aren't perfectly uniform, and some of the exceptions need to be handled here.
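
One example of the kind of exception handling involved is turning dollar strings into numbers. The sketch below assumes, purely for illustration, that amounts may appear either in full ("$760,507,625"), abbreviated ("$237 million") or as a missing-value marker such as "N/A"; `parse_money` is a hypothetical helper, not a function from the project.

```python
import re

_MONEY = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)\s*(million|billion)?",
                    re.IGNORECASE)
_SCALE = {None: 1, "million": 1_000_000, "billion": 1_000_000_000}


def parse_money(text):
    """Return a dollar amount as an int, or None if no amount is present."""
    if not text:
        return None
    match = _MONEY.search(text)
    if match is None:
        return None  # e.g. "N/A" or an empty cell
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower() if match.group(2) else None
    return int(round(number * _SCALE[unit]))
```

Returning None for anything unparseable lets later cleaning steps decide whether to drop the row or keep it with a gap.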

Finally, each movie's data is saved as a row in a CSV. Alternatively this can be saved to an SQL database instead; both versions of the code are available on my GitHub.
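
Writing one row per movie can be done with the standard csv module, as in this sketch. The column names in `FIELDS` and the example row are assumptions for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

FIELDS = ["title", "domestic", "worldwide", "budget", "genre", "release_date"]


def write_rows(rows, handle):
    """Write one CSV row per movie; missing keys become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_rows([{"title": "Example Movie", "domestic": 1000, "genre": "Comedy"}],
           buffer)
```

For a real file, the handle would be opened with `open(path, "w", newline="")` so the csv module controls the line endings.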

When everything is finished, I have a CSV with data on about 13,000 movies. There are many things I could do with this data, but I just do two simple explorations here.

In movie_project1.py, I look at the relationship between a movie's domestic release date and its average domestic box office, and then go a little further and break this down by genre. The first step is to load the CSV data that I scraped into a pandas DataFrame, as this is a bit easier to manipulate. The next few steps are some simple data cleaning (removing foreign-only releases, forcing dates and dollar amounts into usable formats).
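
The cleaning steps can be illustrated in plain Python (used here instead of pandas to keep the sketch dependency-free). It assumes that a foreign-only release is one with no domestic gross and that dates look like "December 18, 2009"; both are assumptions for the example, as is the `clean_rows` helper itself.

```python
from datetime import datetime


def clean_rows(rows, date_format="%B %d, %Y"):
    """Drop rows without a domestic gross and parse release dates."""
    cleaned = []
    for row in rows:
        if row.get("domestic") is None:
            continue  # treated here as a foreign-only release
        try:
            released = datetime.strptime(row["release_date"],
                                         date_format).date()
        except (KeyError, TypeError, ValueError):
            continue  # missing or unparseable date
        cleaned.append({**row, "release_date": released})
    return cleaned
```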

After these basic cleaning steps, I still need to clean up the genres a little. There is a lot of overlap in the genres as listed by Box Office Mojo (e.g. romantic comedy, sci-fi horror). My approach was to count a movie with overlapping genres multiple times, so if a movie is considered both comedy and animation, it will show up twice: once in comedy and once in animation.
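
Counting a movie once per overlapping genre amounts to duplicating its row for each base genre found in the site's label. A minimal sketch, where the `BASE_GENRES` keyword table and both helper names are illustrative assumptions:

```python
BASE_GENRES = {
    "comedy": ("comedy",),
    "romance": ("romance", "romantic"),
    "horror": ("horror",),
    "sci-fi": ("sci-fi",),
    "animation": ("animation",),
    "fantasy": ("fantasy",),
}


def base_genres(label):
    """Return every base genre whose keyword appears in the site label."""
    lowered = label.lower()
    return [genre for genre, keywords in BASE_GENRES.items()
            if any(keyword in lowered for keyword in keywords)]


def explode_by_genre(rows):
    """Emit one copy of each row per matching base genre."""
    exploded = []
    for row in rows:
        for genre in base_genres(row["genre"]):
            exploded.append({**row, "genre": genre})
    return exploded
```

Note that in this sketch a row whose label matches no keyword is dropped, so the keyword table would need to cover every genre of interest.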

After sorting out the genres, I group the films by their weekend of release, calculate the average box office for each of these groups, and then rescale these averages so that I can show several genres on the same graph at the same scale.
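
The grouping and rescaling can be sketched as below. Two details are assumptions, since the post does not spell them out: the ISO week number stands in for the weekend of release, and rescaling is done by dividing each series by its own peak.

```python
from collections import defaultdict
from datetime import date


def weekly_average(rows):
    """Average domestic gross per ISO week of release (1 to 53)."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        week = row["release_date"].isocalendar()[1]
        totals[week][0] += row["domestic"]
        totals[week][1] += 1
    return {week: total / count for week, (total, count) in totals.items()}


def rescale(averages):
    """Divide by the largest value so every series peaks at 1.0."""
    peak = max(averages.values())
    return {week: value / peak for week, value in averages.items()}


example = [
    {"release_date": date(2015, 7, 3), "domestic": 100},
    {"release_date": date(2015, 7, 4), "domestic": 200},
    {"release_date": date(2015, 12, 25), "domestic": 50},
]
scaled = rescale(weekly_average(example))
```

Applying `rescale` to each genre separately puts small and large genres on a common 0 to 1 axis, at the cost of hiding their absolute sizes.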

After these steps I plot the results for just a few genres and an all-inclusive category (I don't want to clutter my graphic). There are many options for visualization and plotting in Python, and for this visualization I decided to go with visvis to make a nice little 3D plot of my results.

Looking at the blue bars (all domestic films), the results are mostly what I would expect from my layman's knowledge of the film industry. We see a big lump in the summer months, when blockbusters are released and more people are seeing movies, and then another sharp spike around the holidays, Thanksgiving and Christmas. Breaking it down by genre we see mostly the same pattern, but the data is a little more sparse. There are a few small surprises by genre: fantasy has a very large spike mid-summer, animation has a large spike that is absent in the other genres, and romance spikes in the holiday season more strongly than other genres, but there is no additional spike on Valentine's Day, which I might have expected.

My other small analysis was with some of the film maker data. In movie_project2.py I look at how the choice of director(s) and writer(s) can affect the domestic box office of a movie.

Again this requires some cleaning steps, as in movie_project1.py.

For each writer and director I assigned a score based on the scaled average domestic box office of their entire body of work. Each movie from Box Office Mojo can then be assigned a director score and a writer score. In cases where a film had more than one writer or more than one director, I used the average of their scores.
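
The scoring scheme can be sketched in a few lines. The row layout (a list of names under a role key plus a `scaled_gross` value) and the function names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def person_scores(movies, role):
    """Score each person by the mean scaled gross of all their films."""
    grosses = defaultdict(list)
    for movie in movies:
        for person in movie[role]:
            grosses[person].append(movie["scaled_gross"])
    return {person: mean(values) for person, values in grosses.items()}


def movie_score(movie, role, scores):
    """Average the scores of everyone credited in `role` on one film."""
    return mean(scores[person] for person in movie[role])


movies = [
    {"title": "First", "directors": ["A"], "scaled_gross": 1.0},
    {"title": "Second", "directors": ["A", "B"], "scaled_gross": 0.5},
]
director_scores = person_scores(movies, "directors")
```

Here director A scores 0.75 (the mean of 1.0 and 0.5), B scores 0.5, and the second film's director score is their average, 0.625. A film with nobody credited in the role would need separate handling, since `mean` raises an error on empty input.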

I can then make a 3D scatter plot of all the movies I scraped, showing how the two scores relate to the average box office. We can go a bit further and fit a plane to this scatter plot with multiple linear regression.
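
Fitting a plane z = a + b*x + c*y by multiple linear regression comes down to solving the least-squares normal equations. The sketch below does this in plain Python rather than with a library such as scikit-learn, purely to keep the example self-contained; `fit_plane` is a hypothetical helper, and it assumes the points are not all collinear (otherwise the system is singular and the division fails).

```python
def fit_plane(xs, ys, zs):
    """Least-squares fit of z = a + b*x + c*y; returns (a, b, c)."""
    design = [(1.0, x, y) for x, y in zip(xs, ys)]
    # Build the augmented normal equations [X^T X | X^T z].
    m = [[sum(row[i] * row[j] for row in design) for j in range(3)]
         + [sum(row[i] * z for row, z in zip(design, zs))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(3):
            if k != col:
                factor = m[k][col] / m[col][col]
                m[k] = [u - factor * v for u, v in zip(m[k], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))
```

In the setting above, x and y would be the director and writer scores and z the box office figure, so b and c indicate how strongly each score is associated with the return.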

[Figure: regression surface]

Intuitively it is obvious that better writers and directors will result in a larger box office return (the score is just based on their previous box office track record, after all), but what is surprising is how much spread there is in the box office returns of films with a high writer/director score. If I come back to this later to do something more involved, I would want to look at box office scaled by budget instead, as I think this would tighten up the points around the plane and make a model of the writer's and director's impact on the box office more meaningful. The very low points with a large writer/director score can easily be explained: these are not necessarily bad films, but perhaps more artistic, smaller-budget films that successful writers and directors take on.

About Author

Christopher Redino

The common thread through all of Christopher's endeavors is his love of problem solving, with his usual methods being analytical and computational in nature. Having learned coding at an early age, Christopher picks up new programming languages quickly...
