Data Visualizing Hollywood BoxOffice Revenue
Contributed by Sricharan Maddineni. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from January 11th to April 1st, 2016. This post is based on his first class project - R visualization (due in the 2nd week of the program).
My goal was to analyze the accuracy of news headlines relating to Hollywood, including but not limited to the changes in domestic versus overseas BoxOffice revenue and the marketability of different genres overseas. Specifically, I focused on articles that had few or no visualizations but drew clear conclusions from general trends in the data.
Outline
- Data
- Headlines
- Visualizations
- Conclusion
Data
I used data from two websites for my analysis: IMDB and BoxOfficeMojo. The IMDB datasets were used to aggregate movie ratings, and the BoxOfficeMojo dataset was used for movie finance analysis. These datasets were cleaned and joined to create a single dataset containing movie ratings and revenues.
There was an alternative IMDB dataset that contained aggregate movie ratings and finances, but the IMDB datasets I chose covered 668 IMDB users who all reviewed the same subset of movies. I considered this the more robust choice because ratings from a fixed group of users are more comparable than those in the alternative dataset, where the number of users rating each movie varied.
First IMDB dataset containing movie ratings by movieID.
Cleaning second IMDB dataset and then Joining.
BoxOfficeMojo Data Set

It was important to clean the years in both datasets because multiple movies share the same title, and joining on title alone would match them incorrectly. Joining on both year and title avoids this.
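The original project was done in R and its code is not shown here. As a rough Python sketch of the idea, the join can be keyed on the pair (title, year) rather than on title alone. The field names, the `normalize_title` helper, and the sample rows below are all hypothetical and only illustrate the technique.

```python
import re

def normalize_title(title):
    """Lowercase a title and strip a trailing '(YYYY)' so variants match."""
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)
    return " ".join(title.lower().split())

def join_on_title_and_year(ratings, revenues):
    """Inner-join two lists of dicts on the (normalized title, year) pair."""
    revenue_by_key = {
        (normalize_title(row["title"]), row["year"]): row for row in revenues
    }
    joined = []
    for row in ratings:
        key = (normalize_title(row["title"]), row["year"])
        match = revenue_by_key.get(key)
        if match is not None:
            joined.append({**row, "domestic_gross": match["domestic_gross"]})
    return joined

# Hypothetical rows: two different films share a title, so only the
# year tells them apart.
ratings = [
    {"title": "Example Film (1990)", "year": 1990, "rating": 3.9},
    {"title": "Example Film (2012)", "year": 2012, "rating": 3.1},
]
revenues = [
    {"title": "Example Film", "year": 1990, "domestic_gross": 119.4},
    {"title": "Example Film", "year": 2012, "domestic_gross": 58.9},
]
movies = join_on_title_and_year(ratings, revenues)
```

With a title-only key, the second revenue row would overwrite the first and both ratings would be paired with the same revenue figure.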
Final Movie Data Set
Headline #1
"Where Americans once were the only game in town for Hollywood, U.S. audiences are taking a back seat to moviegoers across the globe - particularly in Asia."
"And foreign markets are getting the industry's highest-profile films first. Battleship opened in Asia and Europe more than a month before it reached the USA last May."
Data Visualization
Conclusion: Foreign revenue has accounted for an increasing percentage of total revenue every year since 1992 as shown by the increasing slope values for the regression lines.
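To make the quantity behind this conclusion concrete, here is a minimal Python sketch (not the R code used for the plot) that computes the foreign share of total revenue per year and fits an ordinary least-squares slope to it. The yearly totals are invented placeholders, not figures from the dataset.

```python
def foreign_share(domestic, foreign):
    """Fraction of total revenue that came from outside the domestic market."""
    return foreign / (domestic + foreign)

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical yearly totals: (year, domestic revenue, foreign revenue).
yearly = [(2000, 60.0, 40.0), (2001, 55.0, 45.0), (2002, 50.0, 50.0)]
years = [year for year, _, _ in yearly]
shares = [foreign_share(dom, frn) for _, dom, frn in yearly]
slope = least_squares_slope(years, shares)  # change in share per year
```

A positive slope means the foreign share is rising over the period covered by the points.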
Headline #2
"Big noisy spectacle travels best. Jason Statham, the close-cropped star of many a mindlessly violent film, is a particular Russian favourite. Films based on well-known literature (including cartoon books) and myths may also fare well."
"Comedy travels badly: Will Ferrell and Adam Sandler provoke guffaws at home but incomprehension abroad"
Data Visualization
Conclusion: The overseas density plot confirms that the Drama and Comedy genres perform worse overseas than domestically. A majority of Drama and Comedy movies generate less than 50 million overseas, while the Action and Animation genres show a much wider distribution of overseas revenues.
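A density plot answers the question visually; a numeric companion is the fraction of each genre's movies that fall below a revenue threshold. The Python sketch below assumes a list of (genre, overseas gross in millions) pairs. The sample values are made up for illustration.

```python
from collections import defaultdict

def share_below(movies, threshold):
    """Per genre, the fraction of movies whose overseas gross is below threshold."""
    totals = defaultdict(int)
    below = defaultdict(int)
    for genre, overseas_gross in movies:
        totals[genre] += 1
        if overseas_gross < threshold:
            below[genre] += 1
    return {genre: below[genre] / totals[genre] for genre in totals}

# Hypothetical (genre, overseas gross in millions) pairs.
sample = [
    ("Comedy", 20), ("Comedy", 35), ("Comedy", 80),
    ("Action", 40), ("Action", 150), ("Action", 300), ("Action", 90),
]
result = share_below(sample, threshold=50)
```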
Headline #3
"...little effort is being made to deliver sophisticated storytelling ... movies are crafted mainly to provoke visceral - as opposed to intellectual response"
bbc.com/culture/story/20130620-is-china-hollywoods-future
Visualization
Conclusion: The rating sweet spot that generates the most revenue is between 3.5 and 3.8. Movies that score higher than 4 show a sharp decline in revenues. This could be because the average moviegoer more easily appreciates an average movie (cough, cough: Michael Bay movies).
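One way to look for such a sweet spot is to group movies into rating bins and compare the mean revenue per bin. This Python sketch assumes (rating, revenue) pairs and a bin width of 0.5. Both the bin width and the sample values are illustrative choices, not taken from the project.

```python
import math
from collections import defaultdict

def mean_revenue_by_rating_bin(movies, bin_width=0.5):
    """Map each bin's lower edge to the mean revenue of the movies in it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rating, revenue in movies:
        bin_start = math.floor(rating / bin_width) * bin_width
        sums[bin_start] += revenue
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Hypothetical (average rating, revenue in millions) pairs.
sample = [(3.6, 200.0), (3.9, 100.0), (4.2, 60.0), (2.7, 30.0)]
by_bin = mean_revenue_by_rating_bin(sample)
```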
Visualization
Conclusion: There seems to be a linear trend between the number of movies a studio produces and its total domestic revenue. This doesn't have to be the case; for example, Lionsgate produced 5 of the top 100 movies in 2015 and could have generated 100 million in revenue (~50 million actual). This leads me to believe that movie studios do about equally well at selecting which movies to produce.
Data Visualization
Conclusion: There seems to be a linear trend between how much revenue a movie made on its opening weekend and its lifetime domestic revenue. Hollywood considers opening weekend numbers as a good predictor of how well the movie will perform and this plot supports that theory.
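The strength of a linear relationship like this one can be summarized with the Pearson correlation coefficient. The Python sketch below computes it by hand for opening-weekend and lifetime figures. The numbers are placeholders, not values from the BoxOfficeMojo data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical opening-weekend and lifetime domestic grosses, in millions.
opening = [10.0, 20.0, 30.0, 40.0]
lifetime = [30.0, 55.0, 95.0, 120.0]
r = pearson_r(opening, lifetime)
```

Values of `r` near 1 indicate a strong positive linear relationship; they do not by themselves show that opening weekend causes the lifetime total.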
Data Visualization
Conclusion: The highest-grossing movie of each year accounted for a decreasing percentage of total BoxOffice revenue. This could suggest that studios are making more money per movie, producing more movies, or a combination of the two. Further analysis with different datasets is required.
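The percentage in question is the top movie's gross divided by the total gross of all movies released that year. A minimal Python sketch, assuming (year, gross) pairs with invented values:

```python
from collections import defaultdict

def top_movie_share_by_year(movies):
    """Per year, the highest single gross as a fraction of that year's total."""
    totals = defaultdict(float)
    top = defaultdict(float)
    for year, gross in movies:
        totals[year] += gross
        top[year] = max(top[year], gross)
    return {year: top[year] / totals[year] for year in sorted(totals)}

# Hypothetical (year, gross in millions) pairs.
sample = [
    (2000, 50.0), (2000, 30.0), (2000, 20.0),
    (2001, 80.0), (2001, 60.0), (2001, 60.0),
]
top_share = top_movie_share_by_year(sample)
```

In this toy data the share falls from one year to the next even though the top movie earned more, because the yearly total grew faster.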
Final Thoughts
My data visualizations confirmed many of the conclusions drawn in the news articles. What I found most interesting was how good a predictor opening weekend turns out to be for overall performance and that movie studios are evenly matched in terms of how well they select movies. Since audience reception is such a complex factor to predict, it's surprising that the studios are consistently able to make good decisions.