Data Visualizing Hollywood BoxOffice Revenue

Posted on Feb 1, 2016
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Sricharan Maddineni. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which runs from January 11th to April 1st, 2016. This post is based on his first class project, R visualization (due in the 2nd week of the program).

My goal was to analyze the accuracy of news headlines about Hollywood, including but not limited to the changes in domestic versus overseas BoxOffice revenue and the marketability of different genres overseas. Specifically, I focused on articles that had few or no visualizations but drew clear conclusions based on general trends in the data.


  1. Data
  2. Headlines
  3. Visualizations
  4. Conclusion


I used two websites for my analysis: IMDB and BoxOfficeMojo. The IMDB datasets were used to aggregate movie ratings, and the BoxOfficeMojo dataset was used for movie finance analysis. These datasets were cleaned and joined to create a single dataset containing movie ratings and revenues.

There was an alternative IMDB dataset that contained aggregate movie ratings and finances, but the IMDB datasets I chose covered a subset of 668 IMDB users who all reviewed the same subset of movies. I considered this the more robust choice, since ratings from the same 668 users on the same set of movies are more comparable than ratings in the alternative dataset, where the number of users rating each movie varied.

First IMDB dataset containing movie ratings by movieID.

[Images: the dataset before and after]

Cleaning second IMDB dataset and then Joining.




final cleaned IMDB dataset

BoxOfficeMojo Data Set





It was important to clean the years in both datasets because multiple movies had the same name and incorrect matching would occur without joining by year and title.
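The effect of joining on both title and year can be shown with a minimal sketch. It is written in Python rather than the R used for this project, and the rows, field names, and the `join_on_title_year` helper are all hypothetical, chosen only to illustrate the idea.

```python
def normalize(title):
    """Lower-case and collapse whitespace so titles compare reliably."""
    return " ".join(title.lower().split())


def join_on_title_year(ratings, finances):
    """Inner-join two lists of dicts on the (title, year) pair."""
    by_key = {(normalize(f["title"]), int(f["year"])): f for f in finances}
    joined = []
    for r in ratings:
        key = (normalize(r["title"]), int(r["year"]))
        if key in by_key:
            # Fields from the ratings row win when both rows share a name.
            joined.append({**by_key[key], **r})
    return joined


# Hypothetical rows: two different films share the same title.
ratings = [
    {"title": "Example Title", "year": 1950, "rating": 3.6},
    {"title": "Example Title", "year": 2015, "rating": 3.4},
]
finances = [
    {"title": "example  title", "year": "2015", "domestic": 200.0},
]

movies = join_on_title_year(ratings, finances)
```

Joining on title alone would attach the 2015 revenue to the 1950 film as well; with the year in the key, only the 2015 row matches.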

Final Movie Data Set


Headline #1


“Where Americans once were the only game in town for Hollywood, U.S. audiences are taking a back seat to moviegoers across the globe — particularly in Asia.”

“And foreign markets are getting the industry's highest-profile films first. Battleship opened in Asia and Europe more than a month before it reached the USA last May.”

Data Visualization


Conclusion: Foreign revenue has accounted for an increasing percentage of total revenue every year since 1992 as shown by the increasing slope values for the regression lines.
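One simple way to check a trend like this is to compute the foreign share of total revenue per year and fit a least-squares line through it. The sketch below is in Python, uses made-up yearly totals, and is not necessarily how the plot above was built; `foreign_share` and `least_squares_slope` are hypothetical helper names.

```python
def foreign_share(domestic, foreign):
    """Fraction of total revenue earned outside the domestic market."""
    return foreign / (domestic + foreign)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up yearly totals as (domestic, foreign); NOT the data from this post.
yearly_totals = {
    1992: (60.0, 40.0),
    2002: (50.0, 50.0),
    2012: (40.0, 60.0),
}

years = sorted(yearly_totals)
shares = [foreign_share(*yearly_totals[y]) for y in years]
slope = least_squares_slope(years, shares)
```

A positive `slope` means the foreign share grows over time; its value is the change in share per year.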

Headline #2


“Big noisy spectacle travels best. Jason Statham, the close-cropped star of many a mindlessly violent film, is a particular Russian favourite. Films based on well-known literature (including cartoon books) and myths may also fare well.”

“Comedy travels badly: Will Ferrell and Adam Sandler provoke guffaws at home but incomprehension abroad”

Data Visualization


Conclusion: The overseas density plot confirms that the Drama and Comedy genres perform worse overseas than they do domestically. A majority of Drama and Comedy movies generate less than 50 million overseas, while the Action and Animation genres show a much wider distribution of overseas revenues.
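The "majority under 50 million" reading of a density plot can be cross-checked with a plain count. This Python sketch assumes a list of rows with hypothetical `genre` and `overseas` fields (revenue in millions) and uses made-up values.

```python
from collections import defaultdict


def share_below(movies, threshold):
    """Per genre, the fraction of movies with overseas revenue below threshold."""
    totals = defaultdict(int)
    below = defaultdict(int)
    for m in movies:
        totals[m["genre"]] += 1
        if m["overseas"] < threshold:
            below[m["genre"]] += 1
    return {genre: below[genre] / totals[genre] for genre in totals}


# Made-up rows, overseas revenue in millions; NOT the data from this post.
sample = [
    {"genre": "Comedy", "overseas": 12.0},
    {"genre": "Comedy", "overseas": 30.0},
    {"genre": "Comedy", "overseas": 75.0},
    {"genre": "Action", "overseas": 40.0},
    {"genre": "Action", "overseas": 210.0},
]

fractions = share_below(sample, threshold=50.0)
```

A genre whose fraction is above 0.5 has most of its movies under the threshold.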

Headline #3


“...little effort is being made to deliver sophisticated storytelling ... movies are crafted mainly to provoke visceral - as opposed to intellectual response”



Conclusion: The rating sweet spot that generates the most revenue is between 3.5 and 3.8. Movies that score above 4 show a sharp decline in revenue. This could be because the average moviegoer more easily appreciates an average movie (cough, cough, Michael Bay movies).
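A sweet spot like this can be looked for by grouping movies into rating bins and averaging revenue within each bin. The Python sketch below assumes hypothetical `rating` and `revenue` fields, a bin width of 0.5, and made-up values.

```python
from collections import defaultdict
from statistics import mean


def mean_revenue_by_rating_bin(movies, width=0.5):
    """Average revenue per rating bin, keyed by each bin's lower edge."""
    bins = defaultdict(list)
    for m in movies:
        lower_edge = (m["rating"] // width) * width
        bins[lower_edge].append(m["revenue"])
    return {edge: mean(values) for edge, values in sorted(bins.items())}


# Made-up rows, revenue in millions; NOT the data from this post.
sample = [
    {"rating": 2.4, "revenue": 30.0},
    {"rating": 3.6, "revenue": 100.0},
    {"rating": 3.9, "revenue": 200.0},
    {"rating": 4.2, "revenue": 50.0},
]

by_bin = mean_revenue_by_rating_bin(sample)
```

The bin with the highest average marks the sweet spot at the chosen bin width; a narrower `width` gives a finer view but fewer movies per bin.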



Conclusion: There seems to be a linear trend between the number of movies a studio produces and its total domestic revenue. This doesn’t have to be the case; for example, Lionsgate produced 5 of the top 100 movies in 2015 and could have generated 100 million in revenue (~50 million actual). This leads me to believe that movie studios do equally well at selecting which movies to produce.

Data Visualization


Conclusion: There seems to be a linear trend between how much revenue a movie made on its opening weekend and its lifetime domestic revenue. Hollywood considers opening weekend numbers a good predictor of how well a movie will perform, and this plot supports that theory.
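The strength of a linear relationship like this one can be summarized with the Pearson correlation coefficient. The Python sketch below computes it by hand on made-up numbers; `pearson_r` is a hypothetical helper, not something taken from the original analysis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up revenues in millions; NOT the data from this post.
opening_weekend = [10.0, 20.0, 30.0, 40.0]
lifetime_domestic = [28.0, 65.0, 85.0, 130.0]

r = pearson_r(opening_weekend, lifetime_domestic)
```

Values of `r` close to 1 indicate that lifetime revenue rises almost linearly with opening weekend revenue.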

Data Visualization


Conclusion: The highest-grossing movie of each year accounted for a decreasing percentage of total BoxOffice revenue. This could suggest that studios are making more money per movie, producing more movies, or a combination of the two. Further analysis with different datasets is required.
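The quantity behind this conclusion is the top movie's revenue divided by the year's total. A minimal Python sketch, assuming hypothetical `year` and `revenue` fields and made-up values:

```python
from collections import defaultdict


def top_movie_share(movies):
    """Per year, the top-grossing movie's share of that year's total revenue."""
    by_year = defaultdict(list)
    for m in movies:
        by_year[m["year"]].append(m["revenue"])
    return {year: max(revs) / sum(revs) for year, revs in sorted(by_year.items())}


# Made-up rows, revenue in millions; NOT the data from this post.
sample = [
    {"year": 2000, "revenue": 50.0},
    {"year": 2000, "revenue": 30.0},
    {"year": 2000, "revenue": 20.0},
    {"year": 2010, "revenue": 40.0},
    {"year": 2010, "revenue": 30.0},
    {"year": 2010, "revenue": 20.0},
    {"year": 2010, "revenue": 10.0},
]

top_share = top_movie_share(sample)
```

In this toy data the share falls from 2000 to 2010 because more movies split the total, which is one of the two explanations suggested above.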

Final Thoughts

My data visualizations confirmed many of the conclusions drawn in the news articles. What I found most interesting was how good a predictor opening weekend turns out to be for overall performance and that movie studios are evenly matched in terms of how well they select movies. Since audience reception is such a complex factor to predict, it's surprising that the studios are consistently able to make good decisions.

About Author

Sricharan Maddineni

Sricharan Maddineni was a Neuroscience undergrad at Rutgers University. He is a professional music producer turned Data Scientist who has worked with major artists like Kid Ink, DJ Mustard, BMG and garnered over 18 million plays. He has...