Studying Data to Predict Movie Stocks Sentiment

Posted on Jan 19, 2017


The following data analysis seeks to shed some light on the opaque world of movie producers and distributors. In particular, it aims to answer the following questions about the movie industry and its players:

  • Are cinema ticket prices a useful variable to extract macroeconomic insights?
  • Is the Opening weekend helpful to predict a studio/company performance?
  • What kind of market is the movie industry (monopoly, oligopoly, perfect competition, etc)?
  • Which studios have been winners (losers) over the last years? What are the main reasons behind their rise (fall)?
  • Does box office success in terms of gross revenue guarantee profitability?
  • Are weekly box office figures useful to predict stock price sentiment?

Scraping Data

Python was used for web scraping, with both the Beautiful Soup (top-down analysis) and Scrapy (company-specific information) libraries. Python is a popular choice for web scraping for two main reasons: i) it makes string manipulation very easy and ii) it has many scraping libraries that make web harvesting easier and more intuitive.

Other data science languages such as R also have web scraping packages. Both R and Python are normally run as interpreted languages, so the preference for Python here has less to do with the execution model than with its string handling and its range of scraping libraries.

The information was scraped from multiple websites in order to obtain the most comprehensive data set possible:

  • Federal Reserve Economic Data is a database maintained by the Research division of the Federal Reserve Bank of St. Louis that has more than 421,000 economic time series from 81 sources.
  • Box Office Mojo is the leading online box-office reporting service. Box Office Mojo is owned and operated by IMDb.
  • A third website, operated by Nash Information Services, LLC, a provider of movie industry data and research services to major financial institutions, media companies, investors, data analysis companies and production companies. This website has been used as a backup to fill in budget data not available in Box Office Mojo.
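As a rough illustration of the table-extraction step, the sketch below pulls rows out of an HTML table and converts the gross figures to integers. It is not the author's scraper: the original work used Beautiful Soup and Scrapy, whereas this version uses only the standard library's `html.parser` so it runs without extra dependencies. The HTML fragment, the column names and the `TableParser` and `parse_gross` helpers are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a scraped box-office table.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Studio</th><th>Gross</th></tr>
  <tr><td>Movie A</td><td>Studio X</td><td>$1,250,000</td></tr>
  <tr><td>Movie B</td><td>Studio Y</td><td>$830,500</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_gross(text):
    """Turn a string such as '$1,250,000' into an integer number of dollars."""
    return int(text.replace("$", "").replace(",", ""))


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
records = [
    {"title": title, "studio": studio, "gross_usd": parse_gross(gross)}
    for title, studio, gross in body
]
print(records)
```

In a real scraper the fragment would come from an HTTP response, and the cleaning step would have to cope with missing or malformed cells.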

The Movie Industry: A Brief Intro

The global film industry is a 38 bn USD market as of 2016, with global box office revenue expected to reach 50 bn USD by 2020, tantamount to 7.1% annualized growth. China and India are the two largest markets in terms of tickets sold per year, while the United States is in third position. Nevertheless, the US ranks first, with more than half of the market, when considering total revenue in US dollars. As will be shown later in the analysis, there are four particular trends impacting the industry:

  1. China and India overtaking the US: China is expected to become the world's biggest film market in 2017 with growth figures surpassing 50% YoY (Year-over-Year). Major film studios are forming new partnerships with Chinese companies to gain better access to the world's fastest-growing film market with hits like Furious 7 and Jurassic World being the result of co-production efforts.
  2. Movie franchises: Film universes have become pivotal in the long term strategy of top studios such as Disney or Time Warner. The expansion of comic-book universes (e.g. Marvel), animated films (e.g. Frozen) and old-school franchises (e.g. Star Wars) with 5-year release plans including spin-offs and tie-ins has become essential to sustain top studio theatrical earnings.
  3. Premium formats: The number of big budget productions has increased due to Chinese companies getting more aggressive and signing co-production deals with western partners. This bigger focus on blockbuster films seeks to benefit premium formats (e.g. 3D, IMAX, etc) and boost the premium sales mix, particularly in Western Europe and Asian territories.
  4. VOD (Video-On-Demand): A staggering 53 percent of American adults prefer watching movies at home. Surprisingly, the main victim of home on-demand viewing has not been theatrical box office revenue but the US cable-TV industry, i.e. the "cord-cutting" effect. Moreover, small-budget pictures are having a harder time standing out at multiplexes but are finding ample new avenues through online distribution (HBO, Netflix and Amazon Prime).
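The 7.1% annualized growth figure quoted in the introduction to this section can be checked from the two market-size numbers given there. The sketch below assumes a four-year horizon (2016 to 2020) and uses a hypothetical helper named `cagr`.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two levels."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in bn USD, as quoted in the text: 38 in 2016, 50 expected in 2020.
growth = cagr(38.0, 50.0, 2020 - 2016)
print(f"{growth:.1%}")
```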

Are cinema ticket prices a useful variable to extract macroeconomic insights?

Ticket inflation was a good leading indicator of overall consumer inflation before the 1980s, particularly at predicting turning points in inflation. Consumption of non-staple goods and services - such as films and other leisure activities - is the first to be trimmed from a consumer's budget when consumer sentiment weakens. Hence, at a time when the cinema industry had not experienced any significant disruption on its supply side, ticket price shifts had more to do with supply-demand dynamics:


Inflation Gap

Nevertheless, disruptive changes have been the norm in the industry since the 1980s. Supply-side shocks like VTRs (videotape recorders) or VOD (Video-on-Demand) have detached industry incumbents' pricing actions from the traditional supply-demand interplay. This has rendered ticket prices useless as a tool to predict overall consumer inflation.



In particular, three key periods stand out in terms of the gap between ticket prices and overall inflation:

  1. Oil Crisis (1973-1983): the oil crisis fuelled inflation above 10% while ticket prices only rose approximately 3%. The advent of VTRs (videotape recorders) did not have an immediate effect until the mid 1980s, when one format (VHS) came to dominate.
  2. VTRs (1988-1994): VHS turned out to be the clear winner, and VTR manufacturers' focus on VHS provoked a sudden fall in device prices, boosting the videotape rental industry and bringing down cinema ticket prices.
  3. Premium Formats (2000-2010): cinemas and movie producers developed new ways of experiencing movies (IMAX, 3D, etc) that allowed them to charge cinema-goers a substantial markup.

The most interesting period seems to be the one we have been living through since 2006, as reflected by the two tables below. The cinema-CPI inflation gap has been systematically above 1% for the period 2005-2015. Most importantly, this spread has remained above 1% in the most recent 2013-2015 period, which is quite impressive, as industry incumbents have been able to retain such bargaining power over consumers even during deflationary times.
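A minimal sketch of how such an inflation gap can be computed is given below: year-over-year ticket price inflation minus year-over-year CPI inflation, in percentage points. The yearly levels are hypothetical placeholders rather than the scraped series, and the function names are illustrative.

```python
def yoy_changes(levels):
    """Year-over-year percentage change of a list of yearly levels."""
    return [(curr / prev - 1) * 100 for prev, curr in zip(levels, levels[1:])]


def inflation_gap(ticket_prices, cpi_levels):
    """Ticket price inflation minus CPI inflation, in percentage points."""
    ticket = yoy_changes(ticket_prices)
    cpi = yoy_changes(cpi_levels)
    return [t - c for t, c in zip(ticket, cpi)]


# Hypothetical yearly levels: average ticket price (USD) and a CPI index.
ticket_prices = [8.00, 8.32, 8.57]
cpi_levels = [100.0, 102.0, 103.02]

gaps = inflation_gap(ticket_prices, cpi_levels)
print([round(g, 2) for g in gaps])
```

A gap that stays above 1 percentage point year after year is the pattern the text describes for 2005-2015.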


The next question is whether or not this is a sustainable advantage, which makes it necessary to check bottom-up figures for each top studio and to understand more thoroughly how competitive the movie industry is.

Is the Opening weekend helpful to predict a studio/company performance?

The plot below points out the remarkable impact a film's opening weekend has in determining its failure or success: the opening weekend's share of total box office gross has sky-rocketed to more than 30%, from around 15% during the 1990s. The first days of a movie release nowadays give more clues than ever about its ultimate economic performance. Key trends such as blockbuster-budget productions, more intense marketing efforts, the introduction of new premium formats and a shortage of screens to sustain a larger-than-ever supply of releases are among the forces driving studios to bet heavily on the opening weekend and hope for the best.



Another paradoxical effect is that of the average opening weekend measured in million dollars. Although the market has grown significantly in terms of revenue, and few new competitors have entered successfully since 2000, the average opening weekend per title has been very volatile and roughly flat since the turn of the century. Once more, the top studios' new strategy based on franchise expansion, blockbusters and premium formats is behind this trend, as the next example underscores:

Imagine a market at two different points in time: as of 2000, a 5-player industry with roughly the same revenue per player (40 mill USD) and an aggregate size of 200 mill USD. Later, in 2016, the market has grown to 240 mill USD and attracted an additional 2 players.

Nonetheless, the market's average gross revenue is now lower, at 34.3 mill USD compared to the initial 40 mill USD. What has happened? The blockbuster effect enters here: the two biggest players (the top studios) have grown dramatically, yet the remaining players (mid-tier and lower-tier producers) have seen their importance shrink. This gives an idea of what has been going on in the movie industry over the last 15 years.
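The arithmetic of this stylized example can be reproduced in a few lines. The 2000 figures and the 2016 total come from the text; the way the 240 mill USD is split across the seven players in 2016 is a hypothetical assumption, chosen only to illustrate two large players alongside five small ones.

```python
from statistics import mean, median

# 2000: five players with 40 mill USD each (from the text).
revenue_2000 = [40, 40, 40, 40, 40]

# 2016: seven players sharing 240 mill USD. The split is hypothetical.
revenue_2016 = [80, 80, 16, 16, 16, 16, 16]

avg_2000 = mean(revenue_2000)
avg_2016 = mean(revenue_2016)
top2_share = sum(sorted(revenue_2016, reverse=True)[:2]) / sum(revenue_2016)

print(f"Average 2000: {avg_2000:.1f}  Average 2016: {avg_2016:.1f}")
print(f"Median 2016: {median(revenue_2016)}  Top-2 share 2016: {top2_share:.0%}")
```

Under this split the average falls even though the market has grown, while the median player ends up far below its 2000 level.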


What kind of market is the movie industry (monopoly, oligopoly, perfect competition, etc)?

The 4-firm concentration ratio measures the percentage of market share in an industry held by its largest 4 firms. A concentration ratio below 50% is typical of industries with high levels of competition, whereas a ratio above 80% is a symptom of oligopoly, or even monopoly when the ratio goes well above 95%.
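A minimal sketch of the calculation follows, using hypothetical revenue figures rather than the scraped data. The function name `concentration_ratio` is illustrative.

```python
def concentration_ratio(revenues, n=4):
    """Percentage of total revenue held by the n largest firms."""
    total = sum(revenues)
    top = sorted(revenues, reverse=True)[:n]
    return 100 * sum(top) / total


# Hypothetical gross revenue per parent company, in arbitrary units.
revenues = [25, 20, 15, 10, 10, 8, 7, 5]

cr4 = concentration_ratio(revenues)
print(f"CR4 = {cr4:.1f}%")
```

With these numbers the ratio falls between the 50% and 80% thresholds mentioned above.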

The left chart below shows how the industry has experienced a 4-firm concentration ratio consistently above 50% since 2000 and even earlier. Although competitive levels surged dramatically for the period 2010-2013, a sharp reversal has occurred over the last 3 years as Disney, Time Warner and Comcast have imposed a relentless high-budget blockbuster model that is difficult for peers to follow.



Another useful measure of an industry's competitive level is the Herfindahl-Hirschman index (HHI). It is a commonly accepted measure of market concentration, obtained by simply summing the squared market share of each firm competing in the market. The U.S. DOJ (Department of Justice) uses the HHI for evaluating corporate mergers and acquisitions.

The U.S. DOJ considers a market with an HHI of less than 1,500 to be a competitive marketplace, an HHI of 1,500 to 2,500 to be a moderately concentrated marketplace, and an HHI of 2,500 or greater to be a highly concentrated marketplace. The HHI chart provides similar readings to the 4-firm concentration ratio: the movie industry has become a moderately concentrated market (HHI above 1,500) in 2016. In fact, this change in the degree of competitiveness is a clear symptom of the industry trends mentioned in the introduction and explains why studios have been able to raise prices well above overall inflation levels.
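The HHI calculation and the DOJ bands described above can be sketched as follows. Market shares are expressed in percentage points, so the index ranges up to 10,000. The revenue figures are hypothetical and the function names are illustrative.

```python
def hhi(revenues):
    """Herfindahl-Hirschman index: sum of squared market shares (in %)."""
    total = sum(revenues)
    return sum((100 * r / total) ** 2 for r in revenues)


def doj_band(index):
    """Classify an HHI value using the thresholds quoted in the text."""
    if index < 1500:
        return "competitive"
    if index < 2500:
        return "moderately concentrated"
    return "highly concentrated"


# Hypothetical gross revenue per parent company, in arbitrary units.
revenues = [25, 20, 15, 10, 10, 8, 7, 5]

index = hhi(revenues)
print(f"HHI = {index:.0f} ({doj_band(index)})")
```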

What studios have been the winners (losers) over the last years? What are the main reasons behind their rise (fall)?

The first plot below displays parent company ranking positions by box office gross revenue. The main takeaway is that almost all the players in the industry move between an occasional number one spot and some position around the top 5. Barriers to entry are high, which is why only one underdog has managed to rise since 2000: Lionsgate. The company has been able to monetize several licenses such as The Twilight Saga or The Hunger Games, moving from 14th position in 2000 to 7th in 2016.

On the other hand, Viacom-owned Paramount Pictures has been the main victim, as the studio has not been able to build sustainable cinematic universes like Disney or Time Warner despite its success with some franchises like the Star Trek reboot. The second plot summarizes market share information for the top-6 parent companies: the dominance of Disney, Time Warner and Comcast is indisputable, adding up to more than 60% of the total movie industry gross box office as of 2016.


Is there a clear link between box office success and production budget?

An open debate in the industry is whether or not a big production budget and an aggressive marketing campaign guarantee box office success. When regressing the overall box office revenue figures against budget and marketing expenses, the initial regression model outcome is rather spurious and statistically non-significant. After several tests, two dummy variables are entered into the model:

  • Genre: Genre plays a pivotal role in a movie's success in terms of gross box office numbers and profitability. The table below shows the top 5 best and worst box office genres. The dummy variable inserted in the model takes value 1 when a movie's genre ranks above the average in terms of box office success and 0 otherwise.
  • Season: another factor that logically plays an important role is seasonality. It therefore makes sense to include a second dummy variable with value 1 when a movie's release date falls in traditionally low season months like September, January, February and March, and 0 otherwise.
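The two dummy variables described above can be built along these lines. This is a sketch under stated assumptions rather than the author's code: the movie records are hypothetical, and "above the average" is interpreted here as a genre whose mean gross exceeds the mean gross of all movies in the sample.

```python
from statistics import mean

# Low season months named in the text: January, February, March, September.
LOW_SEASON_MONTHS = {1, 2, 3, 9}


def season_dummy(release_month):
    """1 if the release month is a traditionally low season month, else 0."""
    return 1 if release_month in LOW_SEASON_MONTHS else 0


def top_genres(movies):
    """Genres whose mean gross is above the overall mean (an assumption)."""
    overall = mean(m["gross"] for m in movies)
    by_genre = {}
    for m in movies:
        by_genre.setdefault(m["genre"], []).append(m["gross"])
    return {genre for genre, grosses in by_genre.items() if mean(grosses) > overall}


# Hypothetical sample: gross in mill USD, month as 1-12.
movies = [
    {"title": "A", "genre": "Adventure", "month": 7, "gross": 300.0},
    {"title": "B", "genre": "Adventure", "month": 12, "gross": 200.0},
    {"title": "C", "genre": "Documentary", "month": 2, "gross": 10.0},
    {"title": "D", "genre": "Drama", "month": 9, "gross": 50.0},
]

winners = top_genres(movies)
features = [
    {
        "title": m["title"],
        "genre_dummy": 1 if m["genre"] in winners else 0,
        "season_dummy": season_dummy(m["month"]),
    }
    for m in movies
]
print(features)
```

The resulting columns would then enter the regression next to the budget and marketing variables.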


The final regression results displayed below were still disappointing: the model minimizes AIC and BIC relative to the other models tested, but its coefficient of determination is low (0.478), so it explains slightly less than half of the variance of the response variable. The coefficient estimates are all statistically significant.

Contrary to expectations, the marketing coefficient is negatively related to box office success. In other words, marketing intensity appears to have risen so far above optimal levels in recent years that it is no longer a positive contributor. Official marketing figures are hard to find due to company secrecy, yet franchises like Transformers made this item public a couple of years ago, allowing a sort of "control group" analysis: 100 mill in 2014 (Transformers 3), 175 mill USD in 2012 (Transformers 2) and 150 mill USD in 2007 (Transformers 1) tantamount to a 30% increase in marketing costs.


Parent Company

The economics of each studio are very different with some finding more economies of scale and scope in particular areas that may make them excel in extracting value added from their production and marketing budgets. For this reason, the same regression equation was run for every parent company in our sample. Nevertheless, the results are not very encouraging either as the plot below showcases: the adjusted coefficients of determination are well below the minimum required 0.8 threshold and vary significantly from one company to another. In other words, it is not advisable to apply this model across the board.


Does box office success in terms of gross revenue guarantee profitability?

Modelling profitability metrics for these companies is a challenging task. While box office figures are publicly available, they are only the top-line part of the equation. Regrettably, big studios are opaque about other important factors such as marketing and promotion costs or the ticket rebate rates negotiated with theater companies. After consulting multiple industry sources, the following assumptions are considered an imperfect but realistic framework to obtain a set of profitability figures per movie and studio for the period 2010-2016:

  • Production Budget: official figures are used, obtained by scraping websites such as Box Office Mojo and IMDb.
  • Marketing Expenses: industry research shows the following relationship between production budgets and marketing costs: 150% of the production budget for movies with budgets of up to 5 mill USD, 100% for productions with budgets above 5 mill USD but below 40 mill, and 50% for production budgets above 40 mill USD.
  • Theater tickets rebate: Studios normally arrange a 55-65% rebate with theaters. On a major blockbuster the studio can take 65% of the total gross throughout the run. On smaller movies, the scale starts high and goes down as the film ages. The model takes an estimate of approximately 65% for productions above 40 mill USD budget, 60% for movies above 5 mill USD and 55% for small productions.
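These assumptions translate into a simple per-movie profit estimate, sketched below. The tier percentages come from the list above. The function names, the treatment of budgets exactly at the 5 and 40 mill USD boundaries, and the example movie are assumptions made for illustration.

```python
def marketing_cost(budget):
    """Estimated marketing spend (mill USD) from the production budget tiers."""
    if budget > 40:
        return 0.5 * budget
    if budget > 5:
        return 1.0 * budget
    return 1.5 * budget  # assumption: budgets of 5 mill USD or less


def studio_share(budget):
    """Estimated fraction of the box office gross kept by the studio."""
    if budget > 40:
        return 0.65
    if budget > 5:
        return 0.60
    return 0.55


def net_profit(gross, budget):
    """Studio share of the gross minus production and marketing costs."""
    return gross * studio_share(budget) - budget - marketing_cost(budget)


# Hypothetical blockbuster: 400 mill USD gross on a 100 mill USD budget.
print(round(net_profit(400.0, 100.0), 1))
```

The estimate ignores every revenue stream other than the theatrical run, which matches the scope of the assumptions listed above.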

Annualized Return on Capital

The chart below displays annualized return on capital (ROC) - calculated as net profit per movie divided by marketing costs and production budget - for the studios' parent companies for the period 2010-2016. The chart simply adds up all inflows and outflows per year for each parent company and calculates an annualized ROC number, which provides a good starting point to study the profitability of the film segments of the firms in our sample. There appears to be a correlation between top box office success (market share) and overall profitability:


Nevertheless, parent company aggregate figures could be masking the truth. A studio might have been lucky with one title whose success has more than offset the mediocrity of the rest of its pipeline. To differentiate between top-down and bottom-up profitability, the next box-plot displays the ROC per movie for each of the studios in the sample. A fascinating insight is that the distribution of ROCs across studios seems to be skewed to the right (positive values) and to have significant kurtosis due to the "blockbuster" effect.

Average ROC per Movie

Another observation is that the aggregate profitability of studios like Fox, Comcast (Universal) or Lionsgate is heavily dependent on a few movies to save the season, since the median profitability of their movies is in the red. Disney, Viacom (Paramount) and Time Warner are the only parent companies in positive territory at both the aggregate and the per-movie level.
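The contrast between aggregate and per-movie profitability can be illustrated with a small, entirely hypothetical slate in which one blockbuster carries four loss-making titles. ROC is taken as net profit divided by production budget plus marketing costs, as defined earlier.

```python
from statistics import median

# Hypothetical slate: (net profit, capital employed) per movie, in mill USD.
# Capital employed = production budget + marketing costs.
slate = [
    (300.0, 150.0),  # one blockbuster
    (-20.0, 60.0),
    (-15.0, 50.0),
    (-10.0, 40.0),
    (-5.0, 30.0),
]

# Top-down view: pool all profits and all capital for the studio.
aggregate_roc = sum(p for p, _ in slate) / sum(c for _, c in slate)

# Bottom-up view: ROC of each movie, then the median across the slate.
per_movie_roc = [p / c for p, c in slate]
median_roc = median(per_movie_roc)

print(f"Aggregate ROC: {aggregate_roc:.1%}")
print(f"Median ROC per movie: {median_roc:.1%}")
```

The aggregate figure is positive while the median movie loses money, which is the pattern the text attributes to studios that depend on a few hits.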



Are weekly box office figures useful to predict stock price sentiment?

The parent companies owning the movie studios considered in this analysis for the period 2010-2016 depend on their theatrical business to different degrees.

The plot underneath, built from each company's SEC 10-K annual report, shows how radically revenue segmentation differs across companies with regard to movie-related inflows. Small and less diversified players such as Lionsgate or The Weinstein Co, with more than 80% of revenues linked to the film/theatrical industry, are expected to be more sensitive to box office releases than big conglomerates like Sony: the Japanese conglomerate has an important market share in the pictures industry, but this segment only accounts for 5% of the company's total revenues.



Furthermore, it is critical to consider that big guns like Time Warner, Disney or Comcast (Universal) do not obtain revenue from the theatrical release alone. Their most important strength is their ability to leverage their intellectual property: for instance, Disney is able to monetize its Marvel Comics Universe and Star Wars Universe via both cinema tickets and, most importantly, by licensing these intangible assets to toy manufacturers like Hasbro and video-game developers like Capcom or Activision. This movie "echo" effect spreads across all the other segments of the income statement, but it is very difficult to quantify from box office numbers.

A box office profitability index is constructed using weekly box office figures and the aforementioned industry assumptions (marketing costs, theater rebate rates, etc). The main objective is to get a preliminary visual idea of whether this weekly box office index adds information that allows us to predict abnormal returns (stock price return over benchmark) for our parent company stocks.


Lionsgate is the first company tested, with encouraging results: positive signals (box office index rising above its 8-week MA) are correlated with positive future returns, while negative signals (index breaking below its 8-week MA) have also been a good leading indicator of stock corrections. Hence, the preliminary thinking for Lionsgate is that weekly box office numbers are useful in helping to predict stock performance.
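One way to turn the moving-average rule into explicit signals is sketched below. It assumes a trailing 8-week simple moving average that includes the current week, and it flags a signal only in the week where the index crosses the average. The index values are hypothetical and the function names are illustrative; the author's exact rule may differ.

```python
def moving_average(values, window=8):
    """Trailing simple moving average; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def crossover_signals(index, window=8):
    """+1 when the index crosses above its MA, -1 when it crosses below."""
    ma = moving_average(index, window)
    signals = [0] * len(index)
    for i in range(1, len(index)):
        if ma[i] is None or ma[i - 1] is None:
            continue
        was_above = index[i - 1] > ma[i - 1]
        is_above = index[i] > ma[i]
        if is_above and not was_above:
            signals[i] = 1
        elif was_above and not is_above:
            signals[i] = -1
    return signals


# Hypothetical weekly box office index values.
weekly_index = [10, 10, 10, 10, 10, 10, 10, 10, 12, 13, 9, 8]

signals = crossover_signals(weekly_index)
print(signals)
```

Each signal would then be compared with the stock's subsequent abnormal return to judge whether the index has any predictive value.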



The plot below replicates the same analysis with Disney, yet the results are less encouraging this time. Our box office index seems to get the long-term trend right; however, the stock experienced several corrections that were not captured by negative signals in our indicator. Disney has been very successful in its movie business over the last few years, yet other businesses such as its cable network ESPN have been under stress as subscribers cancelled their cable subscriptions (aka "cord cutting") and embraced VOD (Video-on-demand) options. As a result, Disney's shares were hit by a factor that has nothing to do with its box office success.


The initial findings suggest our box office index contains alternative information that could be a useful tool to predict stock sentiment for companies that generate the majority of their revenues from the movie production/distribution business, i.e. Lionsgate.


However, this is a very preliminary analysis and more in-depth quantitative backtesting is required to optimize its implementation. The weekly box office index may also be useful, but not pivotal, when analyzing other movie-related but more diversified media firms such as Disney or Time Warner. In the latter case, our box office index could be important as part of a set of KPIs (Key Performance Indicators) to generate more accurate EPS (earnings per share) estimates in the period between quarterly earnings reports.

Click here to check code in Github

About Author

Carlos Salas Najera

Carlos is a passionate individual for investments and technology with a long/short equity analysis and portfolio management experience approach that combines his fundamental, quantitative and data science skills in order to deliver superior returns. His core strength lies...

