Scraping Box Office Mojo

Posted on Mar 19, 2016
Contributed by Christopher Redino. Christopher is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program running from January 11th to April 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).

Before joining the NYC Data Science Academy, I took on a small project of my own to play around with some of Python's more data science related modules. I scraped every movie page on Box Office Mojo using Beautiful Soup and urllib2, and then did some simple analysis using NumPy, Pandas and scikit-learn. The full code is available on my GitHub.

Box Office Mojo is a good site to practice scraping because it is (more or less) well organized so that a simple procedure can be repeated to scrape many pages with few exceptions.

I started by generating the list of movies I wanted to scrape, working from the alphabetical index of movies on the site.

alpha_index

The Python script, bom.py, generates a list of links, each leading to a different movie's page. Once all the movies on a given index page have been added to the list, the script goes to the next page for that letter (if it exists). Once all pages for a given letter are exhausted, the scraper moves on to the next letter.
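The letter-by-letter, page-by-page loop described above can be sketched roughly as follows. This is not the code from bom.py: it uses the standard library's html.parser in place of Beautiful Soup to stay dependency-free, and the site root, the "/movies/?id=" link pattern, and the fetch_page callback are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.boxofficemojo.com"  # assumed site root


class MovieLinkParser(HTMLParser):
    """Collect hrefs that look like movie pages (assumed pattern: /movies/?id=...)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "/movies/?id=" in href:
            self.links.append(urljoin(BASE_URL, href))


def extract_movie_links(html):
    """Return the movie links found in one index page, in order, without duplicates."""
    parser = MovieLinkParser()
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def collect_all_links(fetch_page, letters):
    """Walk every page of every letter until a page yields no movie links.

    fetch_page(letter, page) is a hypothetical callback that returns the HTML
    of one index page, or None when that page does not exist.
    """
    all_links = []
    for letter in letters:
        page = 1
        while True:
            html = fetch_page(letter, page)
            links = extract_movie_links(html) if html else []
            if not links:
                break  # this letter is exhausted; move to the next one
            all_links.extend(links)
            page += 1
    return all_links
```

Passing the page fetcher in as a callback keeps the loop itself easy to try out on saved HTML before pointing it at the live site.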

scrape1

The "urlopen_with_retry" part is just a small modification to urlopen from urllib2 that adds retrying behavior; we don't want the whole scraper to fail if one of the many pages we navigate to is a little slow to load.
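A retry wrapper of this kind might look like the sketch below. It is written for Python 3's urllib.request rather than urllib2, and the parameter names (tries, delay, backoff, opener) and their defaults are my own assumptions, not the values used in the original script.

```python
import socket
import time
from urllib.error import URLError
from urllib.request import urlopen


def urlopen_with_retry(url, tries=4, delay=3.0, backoff=2.0, opener=urlopen):
    """Call opener(url), retrying on URL errors and timeouts.

    The wait between attempts starts at `delay` seconds and is multiplied by
    `backoff` after each failure. The last failure is re-raised.
    """
    for attempt in range(1, tries + 1):
        try:
            return opener(url)
        except (URLError, socket.timeout):
            if attempt == tries:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
            delay *= backoff
```

The opener argument exists only so the retry logic can be exercised with a stand-in function instead of a real network call.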

The if __name__ == "__main__" guard ensures that only the main process generates the list of movies. This matters because the scraping can be sped up a little by dividing it among several worker processes for parallel processing, using Python's multiprocessing module.
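As a rough illustration of that pattern (not the author's code), the sketch below splits a list of links into chunks and hands each chunk to a worker process. The function names and the placeholder worker are hypothetical; a real worker would download and parse each page.

```python
import sys
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Deal items round-robin into n_chunks lists of near-equal length."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def scrape_chunk(links):
    """Hypothetical worker: a real one would download and parse each link."""
    return [{"url": link} for link in links]


def scrape_in_parallel(links, processes=4):
    """Scrape each chunk in its own worker process and merge the rows."""
    with Pool(processes=processes) as pool:
        per_chunk = pool.map(scrape_chunk, split_into_chunks(links, processes))
    return [row for chunk in per_chunk for row in chunk]


if __name__ == "__main__":
    # Worker processes that re-import this module skip this block,
    # so only the parent builds the link list and starts the pool.
    movie_links = sys.argv[1:]  # demo: links given on the command line
    if movie_links:
        print(scrape_in_parallel(movie_links, processes=2))
```

Note that the merged rows come back grouped by chunk, so they are not in the original link order.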

parallel.jpg

Once we have the list of all movie links, the script runs through the whole list and, from each movie page, scrapes just the subset of the data that I am interested in.

movie_example

From each movie page I scrape the title, domestic and worldwide box office, the budget, the genre, a list of actors, directors, writers, producers and composers, as well as the rating, distributor, release date and runtime. Some of this code is straightforward, but the records on Box Office Mojo aren't perfectly uniform, and some of the exceptions need to be handled here.
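To give a flavor of the exception handling involved, here is a small sketch of two field parsers. The input formats ("$1,234,567", "$12 million", "1 hrs. 45 min.", "N/A") are assumptions about how values appeared on the site, and both function names are hypothetical.

```python
import re


def parse_money(text):
    """Turn '$1,234,567' or '$12 million' into whole dollars; return None if absent."""
    if not text:
        return None
    match = re.search(r"\$\s*([\d,]+(?:\.\d+)?)\s*(million)?", text.strip().lower())
    if not match:
        return None  # e.g. 'N/A' when no figure is listed
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):
        amount *= 1_000_000
    return int(round(amount))


def parse_runtime(text):
    """Turn '1 hrs. 45 min.' into total minutes; return None if unparseable."""
    if not text:
        return None
    lowered = text.lower()
    hours = re.search(r"(\d+)\s*hrs?", lowered)
    minutes = re.search(r"(\d+)\s*min", lowered)
    if not hours and not minutes:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total
```

Returning None for missing values, rather than raising, lets the scraper record an incomplete movie and keep going.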

scrape2.jpg

Finally, each movie's data is saved as a row in a CSV. Alternatively this can be saved to an SQL database instead; both versions of the code are available on my GitHub.
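A minimal sketch of the CSV step using the standard csv module is shown below. The column names are assumptions based on the fields listed earlier, and joining list-valued fields with "; " is one possible choice rather than the format the original script used.

```python
import csv

# Assumed column names, following the fields described in the post
FIELDNAMES = [
    "title", "domestic", "worldwide", "budget", "genre",
    "actors", "directors", "writers", "producers", "composers",
    "rating", "distributor", "release_date", "runtime",
]


def write_movies(rows, out_file):
    """Write one CSV row per movie dict; missing fields are left blank."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Join list-valued fields (e.g. actors) so each movie stays on one row
        flat = {
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)
```

When writing to disk, the file would be opened with open(path, "w", newline="") so the csv module controls the line endings.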

writefil.jpg

When everything is finished I have a CSV with data on about 13,000 movies. There are many things I could do with this data, but I do just two simple explorations here.

In movie_project1.py, I look at the relationship between a movie's domestic release date and average domestic box office, and then go a little further and break this down by genre. The first step is to load the CSV data that I scraped into a pandas DataFrame, as this is a bit easier to manipulate. The next few steps are some simple data cleaning: removing foreign-only releases and forcing dates and dollar amounts into usable formats.
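These cleaning steps might look something like the following sketch. The column names ("domestic", "release_date") are assumptions, and treating a row with no parseable domestic gross or release date as a foreign-only release is my simplification, not necessarily the rule used in movie_project1.py.

```python
import pandas as pd


def clean_movies(df):
    """Coerce dollar amounts and dates to usable types and drop unusable rows."""
    df = df.copy()
    # Strip "$" and "," so dollar amounts become numbers; anything else becomes NaN
    df["domestic"] = pd.to_numeric(
        df["domestic"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    # Unparseable dates become NaT instead of raising
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    # Assumption: no domestic gross or no domestic date means a foreign-only release
    return df.dropna(subset=["domestic", "release_date"]).reset_index(drop=True)
```

In practice the input frame would come from pd.read_csv on the scraped file.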

cleaning1

After these basic cleaning steps, I still need to clean up the genres a little. There is a lot of overlap in the genres as listed by Box Office Mojo (e.g. romantic comedy, sci-fi horror). My approach was to count a movie with overlapping genres multiple times, so if a movie is considered both comedy and animation, it shows up twice, once in comedy and once in animation.
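One way to express this multiple counting in pandas is to map each genre label to a list of base genres and then use explode, as in the sketch below. The keyword map is hypothetical and deliberately short; the real genre list and matching rules in the original script may differ.

```python
import pandas as pd

# Hypothetical keyword map: base genre -> substrings that signal it
GENRE_KEYWORDS = {
    "Comedy": ["comedy"],
    "Romance": ["romance", "romantic"],
    "Horror": ["horror"],
    "Sci-Fi": ["sci-fi"],
    "Animation": ["animation"],
}


def base_genres(label):
    """Return every base genre whose keyword appears in the site's genre label."""
    lowered = str(label).lower()
    return [
        genre
        for genre, keywords in GENRE_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def expand_genres(df):
    """Repeat each movie once per matching base genre."""
    out = df.copy()
    out["base_genre"] = out["genre"].apply(base_genres)
    # explode() emits one row per list entry; movies with no match get NaN and are dropped
    return out.explode("base_genre").dropna(subset=["base_genre"]).reset_index(drop=True)
```

With this map, a "Romantic Comedy" row becomes two rows, one tagged Comedy and one tagged Romance.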

genre_clean

After sorting out the genres, I group the films by their weekend of release, calculate the average box office for each of these groups, and then rescale these averages so that I can show several genres on the same graph at the same scale.
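A sketch of this aggregation is below. It assumes a frame with "release_date", "domestic" and "base_genre" columns (all hypothetical names), groups by ISO week of the year, and rescales each genre by its own peak week. The post does not say which rescaling was used, so dividing by the maximum is only one plausible choice.

```python
import pandas as pd


def weekly_profile(df):
    """Average domestic gross per release week and genre, scaled so each genre peaks at 1."""
    weeks = df["release_date"].dt.isocalendar().week.astype(int)
    weekly = (
        df.assign(week=weeks)
        .groupby(["base_genre", "week"])["domestic"]
        .mean()
        .unstack("base_genre")  # rows: week of year, columns: genre
    )
    # Divide each genre column by its own maximum weekly average
    return weekly / weekly.max()
```

Weeks in which a genre had no releases show up as NaN, which is where the sparseness in a per-genre breakdown comes from.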

aggregate

After these steps I plot the results for just a few genres and an inclusive category (I don't want to clutter my graphic). There are many options for visualization and plotting in Python, and for this visualization I decided to go with visvis to make a nice little 3D plot of my results.

Genres3D.png

Looking at the blue bars (all domestic films), the results are mostly what I would expect from my layman's knowledge of the film industry. We see a big lump in the summer months, when blockbusters are released and more people are seeing movies, and then another sharp spike around the holidays, Thanksgiving and Christmas. Breaking it down by genre we see mostly the same pattern, but the data is a little more sparse. There are a few small surprises when looking at it by genre: fantasy has a very large spike mid-summer, animation has a large spike that is absent in the other genres, and romance spikes in the holiday season more strongly than other genres, but there is no additional spike around Valentine's Day, which I might have expected.

My other small analysis used some of the filmmaker data. In movie_project2.py I look at how the choice of director(s) and writer(s) can affect the domestic box office of a movie.

Again this requires some cleaning steps, as in movie_project1.py.

For each writer/director I assigned a score based on the scaled average domestic box office of their entire body of work. Each movie from Box Office Mojo can then be assigned a director score and a writer score. In cases where a film had more than one writer or more than one director, I used the average of their scores.
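The scoring can be sketched in plain Python as follows. The movie dictionaries, the key names, and the choice to scale by the highest average are assumptions for illustration; the post does not specify how the averages were scaled.

```python
from collections import defaultdict
from statistics import mean


def person_scores(movies, role):
    """Score each person by their average domestic gross, scaled so the top average is 1.

    Assumes at least one person has a positive average gross.
    """
    grosses = defaultdict(list)
    for movie in movies:
        for person in movie[role]:
            grosses[person].append(movie["domestic"])
    averages = {person: mean(values) for person, values in grosses.items()}
    top = max(averages.values())
    return {person: average / top for person, average in averages.items()}


def movie_score(movie, scores, role):
    """Average the scores of everyone credited in `role`; None if nobody is credited."""
    people = movie[role]
    if not people:
        return None
    return mean(scores[person] for person in people)
```

For example, person_scores(movies, "directors") followed by movie_score(movie, scores, "directors") gives one film's director score, and the same two calls with "writers" give its writer score.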

I can then make a 3D scatter plot of all the movies I scraped, showing how the two scores relate to the average box office. We can go a bit further and fit a plane to this scatter plot with multiple linear regression.
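Fitting the plane amounts to ordinary least squares on z = a*x + b*y + c, where x is the director score, y is the writer score and z is the box office. The sketch below uses numpy.linalg.lstsq; the original script may well have used scikit-learn's LinearRegression instead, which solves the same problem. The function name is hypothetical.

```python
import numpy as np


def fit_plane(director_scores, writer_scores, box_office):
    """Least-squares fit of box_office = a * director + b * writer + c."""
    x = np.asarray(director_scores, dtype=float)
    y = np.asarray(writer_scores, dtype=float)
    z = np.asarray(box_office, dtype=float)
    # Design matrix: one column per score plus a column of ones for the intercept
    design = np.column_stack([x, y, np.ones_like(x)])
    coefficients, *_ = np.linalg.lstsq(design, z, rcond=None)
    a, b, c = coefficients
    return a, b, c
```

The returned a and b are the slopes of the plane along the director and writer axes, and c is its height where both scores are zero.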

RegressionSurface

Intuitively it is obvious that writers/directors with higher scores will be associated with a larger box office return (the score is based on their box office track record, after all), but what is surprising is how much spread there is in the box office returns of films with a high writer/director score. If I come back to this later to do something more involved, I think I would want to look at box office scaled by budget instead, as I think this would tighten up the points around the plane and make a model of the writer's/director's impact on the box office more meaningful. The very low points with a large writer/director score can plausibly be explained as not necessarily bad films, but perhaps more artistic, smaller-budget films that successful writers and directors take on.

