And The Oscar Goes To ...

Kelly Mejia Breton
Posted on Jul 2, 2016

Contributed by Kelly Mejia Breton. She took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which ran from April 11th to July 1st, 2016. This post is based on her final class project, the Capstone, due in the 12th week of the program.

Have you ever seen a marketing ad for a movie and thought, wow, I have to see that? Then you go see it: it’s a great film, the acting is amazing, in your book it has won an Oscar, and yet it’s not even nominated. What makes viewers rank a movie in the top ten? Is there an underlying marketing strategy within the details of the movie, such as length, genre, and rating? Is there a model that fits well enough to predict whether a top ten film, as ranked by viewers, will win an Oscar?

My client is a fashion designer who provides clothing for celebrities and has hired me to answer these questions. This year, for the Academy Awards, they would like to dress only those who have a high likelihood of winning an Oscar, given that the film was ranked in the top 10 by viewers. Their goal is for their designs to live long after the red carpet, in the photos and video clips that follow for years to come whenever the award is mentioned.

The Dataset

The data is a combination of the Blockbuster Database and the Oscar Demographics dataset provided by Open Data Soft. It contains the top ten annual films for the past 40 years, as ranked by IMDb (Internet Movie Database) viewers. I used the Oscar dataset to create a classification column recording whether the film, or anyone or anything affiliated with the film, won an Oscar. The original dataset contained 398 observations and 21 variables.
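As a rough illustration of how such a classification column could be built, here is a minimal Python sketch. The rows, the field names (`title`, `film`, `year`, `award`) and the matching rule (normalized title plus year) are all assumptions made for the example; the real datasets may be keyed differently.

```python
# Hypothetical rows standing in for the two source tables; the real
# column names in the Blockbuster and Oscar datasets may differ.
films = [
    {"title": "Film A", "year": 1994},
    {"title": "Film B", "year": 1994},
    {"title": "Film C", "year": 1995},
]
oscar_wins = [
    {"film": "film a ", "year": 1994, "award": "Best Picture"},
    {"film": "Film A", "year": 1994, "award": "Best Director"},
    {"film": "Film C", "year": 1995, "award": "Best Actress"},
]


def normalize(title):
    """Lower-case and strip a title so small formatting differences still match."""
    return title.strip().lower()


def add_oscar_flag(films, oscar_wins):
    """Return copies of the film rows with a binary won_oscar column."""
    # Any win tied to the film (picture, cast, crew) counts once.
    winners = {(normalize(w["film"]), w["year"]) for w in oscar_wins}
    flagged = []
    for film in films:
        row = dict(film)
        key = (normalize(film["title"]), film["year"])
        row["won_oscar"] = int(key in winners)
        flagged.append(row)
    return flagged


flagged_films = add_oscar_flag(films, oscar_wins)
for row in flagged_films:
    print(row["title"], row["year"], row["won_oscar"])
```

Collapsing all awards for a film into a single 0/1 flag mirrors the "film or anyone affiliated with it" definition used here; it deliberately ignores how many Oscars a film won.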

Exploratory Data Analysis


Exploring the distribution of the MPAA ratings, we see that PG-13 films have the highest box office receipts, followed by G (General Audience) films, then PG, and finally R-rated films.


Now let’s look at the proportion of films that won Oscars, by rating. Here we see the opposite: films rated R have won the most Academy Awards, while films rated G have not won any Oscars in this dataset. Interesting: there seems to be a trade-off between box office tickets and Oscars.
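The proportion behind a chart like this is a simple grouped mean of the 0/1 Oscar flag. A minimal sketch follows, using made-up `(rating, won_oscar)` pairs rather than values from the actual dataset.

```python
from collections import defaultdict

# Hypothetical (rating, won_oscar) pairs; not values from the real data.
films = [
    ("R", 1), ("R", 1), ("R", 0),
    ("PG-13", 1), ("PG-13", 0), ("PG-13", 0), ("PG-13", 0),
    ("PG", 1), ("PG", 0),
    ("G", 0),
]


def win_rate_by_group(rows):
    """Share of films in each group whose won_oscar flag is 1."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for group, won in rows:
        totals[group] += 1
        wins[group] += won
    return {group: wins[group] / totals[group] for group in totals}


rates = win_rate_by_group(films)
for rating, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{rating}: {rate:.0%}")
```

The same function works for genre by passing `(genre, won_oscar)` pairs instead.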


By genre, box office sales are highest for Adventure films, followed by Action films.


Among Oscar winners, Drama is the most common genre, followed by Romance. Just from these EDA charts, I would advise my client, as a general rule, to pick an R-rated film in the Drama genre, which would increase the likelihood of an Oscar win.

Unsupervised Learning: K-Means


I wondered whether there are groups within the dataset that give us additional information. Running K-means with 4 clusters on the length of the movie and box office tickets alone, we are unable to tell the rating of the movie, although if you look very closely you can see that one rating group tends to have longer films with higher box office sales. The other groups are not as well defined. This could definitely be looked into further, and might help cut down some of the data collection time.


After fitting different numbers of clusters and looking at a scree plot to help determine the number of clusters, we see that K-means may not be a good option for the movies data, since there isn’t a strongly defined elbow, the point beyond which the within-cluster variance no longer decreases substantially.
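The elbow check described above can be sketched in plain Python. This is not the code used for the project: the data points are invented `(length in minutes, tickets in millions)` pairs, the standardization step is an assumption (the two variables are on very different scales), and the algorithm is a basic Lloyd's K-means with a fixed random seed.

```python
import random
from statistics import mean, pstdev


def standardize(points):
    """Z-score each column so minutes and ticket counts share one scale."""
    columns = list(zip(*points))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return [
        tuple((value - c) / s for value, c, s in zip(point, centers, spreads))
        for point in points
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(mean(col) for col in zip(*cluster))


def kmeans_wcss(points, k, n_iter=100, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical (length in minutes, tickets in millions) pairs.
raw = [
    (95, 20), (100, 25), (105, 22), (120, 45), (125, 50), (118, 48),
    (150, 80), (160, 90), (155, 85), (110, 30), (140, 60), (98, 55),
]
scaled = standardize(raw)
wcss_by_k = {k: kmeans_wcss(scaled, k) for k in range(1, 7)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

Plotting `wcss_by_k` against `k` gives the scree plot; a clear elbow would show up as a sharp drop followed by a flat tail, and its absence is what suggests K-means is a weak fit for this data.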

Machine Learning – Logistic Regression

                     Estimate    Std. Error   z value   Pr(>|z|)
Box Office Tickets   -2.26E-09   1.34E-09     -1.688    0.09139   .
IMDb Rating           2.83E+00   9.52E-01      2.975    0.00293   **
Rank in Year         -2.80E-01   1.37E-01     -2.044    0.04093   *
Romance               2.43E+00   8.36E-01      2.908    0.00364   **
Drama                 1.56E+00   7.97E-01      1.962    0.04972   *
Adventure            -3.35E+00   1.36E+00     -2.466    0.01366   *
Western               2.64E+00   1.30E+00      2.029    0.04241   *
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
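One way to read the estimates in the table above is on the odds scale: in a logistic regression, a coefficient b multiplies the odds of the outcome by exp(b) for each one-unit increase in that variable, holding the others fixed. The sketch below applies that to the reported estimates; the per-million rescaling for tickets is an added convenience, not something from the original analysis.

```python
import math

# Estimates copied from the regression table above.
coefficients = {
    "Box Office Tickets": -2.26e-09,
    "IMDb Rating": 2.83,
    "Rank in Year": -0.280,
    "Romance": 2.43,
    "Drama": 1.56,
    "Adventure": -3.35,
    "Western": 2.64,
}

# exp(b) is the multiplicative change in the odds per one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

# A single ticket is a tiny unit, so also express the effect per million tickets.
tickets_per_million = math.exp(coefficients["Box Office Tickets"] * 1_000_000)

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.4g}")
print(f"Per million tickets: odds multiplied by {tickets_per_million:.4f}")
```

Ratios above 1 (IMDb Rating, Romance, Drama, Western) raise the odds of a win, while ratios below 1 (Adventure, Rank in Year, Box Office Tickets) lower them.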

I fit a logistic regression and found significant coefficients at the 0.05 level for IMDb Rating, Rank in Year, Romance, Drama, Adventure, and Western, with Box Office Tickets significant only at the 0.1 level. The model returned a true negative rate of 98.63% and a true positive rate of 81.48%.

Not bad. After fitting a reduced model with only the significant coefficients, however, my type II error increased and my true positive rate dropped by over 30 percentage points, to 47%.
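For reference, the true positive rate, true negative rate, and type II error rate quoted above all come from the confusion matrix of actual versus predicted labels. A minimal sketch with invented labels (not the model's real predictions):

```python
def confusion_rates(actual, predicted):
    """Compute rates from 0/1 labels, where 1 means the film won an Oscar."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return {
        "true_positive_rate": tp / (tp + fn),   # sensitivity
        "true_negative_rate": tn / (tn + fp),   # specificity
        "type_ii_error_rate": fn / (tp + fn),   # missed winners
    }


# Hypothetical labels for ten films.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

rates = confusion_rates(actual, predicted)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

The type II error rate is one minus the true positive rate, which is why the two moved together when the reduced model was fit.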

Machine Learning – Random Forest

Let’s try fitting a Random Forest.

With 8 variables selected at random and 500 trees, the training set returned a true negative rate of 70% and a true positive rate of 55%. The test set returned a true negative rate of 97.14%, slightly lower than the training set as expected, and a true positive rate of 33%.


I looked at the error as a function of the number of trees to see whether I could tune the number of trees. Using fewer trees increased the % of variance explained, which is great, but ultimately lowered my true positive and true negative rates.

Next Steps

Gather more data on the movies that win Oscars, to improve the algorithm.

Research whether there is a pattern within movie titles: do the titles tend to have positive, negative, or neutral words?

Look at the production studios to see if certain studios tend to receive more Oscars; if so, add the studio to the model.

Look at the area under the curve and adjust my threshold to get better sensitivity and specificity.

Finalize my Shiny app.
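The threshold adjustment mentioned in the next steps can be sketched as follows. The scores and labels are invented for illustration, and the AUC is computed with the rank-based (Mann-Whitney) definition: the share of winner/non-winner pairs in which the winner gets the higher predicted probability, with ties counted as half.

```python
def sensitivity_specificity(actual, scores, threshold):
    """Classify score >= threshold as a predicted win and return both rates."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true outcomes for eight films.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.35]

print("AUC:", auc(actual, scores))
for threshold in (0.5, 0.35):
    sens, spec = sensitivity_specificity(actual, scores, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy data, lowering the threshold from 0.5 to 0.35 catches more winners without giving up specificity; on real predictions the two rates usually trade off, which is what the ROC curve makes visible.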

About Author

Kelly Mejia Breton

Kelly Mejia Breton is a driven and determined Senior Analyst with nearly 15 years of proven data analytics expertise. Most recently focused on forecasting short-term and long-term global crude oil and product prices for PIRA Energy Group. Previously...