Analyses of YouTube Statistics in the U.S.A.

Posted on Sep 3, 2019

"The joy of YouTube is that you can create content about anything you feel passionate about, however silly the subject matter."

Zoe Sugg


Introduction:

Whether you are trying to learn a new language, pick up a new skill, or showcase a personal talent, YouTube tends to be the first option people gravitate toward. The ability to go viral and be recognized from anywhere in the world is something I believe has been underappreciated in this age of rapid technological advancement. YouTube has revolutionized the world since it emerged, empowering the "average Joe" by providing a platform for entrepreneurship and success in ways that would be improbable without it.

That said, global recognition can in many cases be traced to a single video. Among the videos that have gone viral, is there a common denominator?

Background Information:

As a frequent YouTube user, I wanted to know more about trending YouTube videos. The data set I used came from https://www.kaggle.com/datasnaek/youtube-new, which provides daily trending video statistics, with up to 200 trending videos per day. More than 40,000 observations were used in the analysis, with the features listed below:

  • Video Id  (unique identifier for each uploaded video)
  • Trending_date
  • Channel Title
  • Category_id 
  • Video Description
  • Publish Time
  • Tags
  • Views, likes, dislikes & comments
  • thumbnail link
  • video_error_or_removed
  • Ratings Disabled & Comments Disabled

I also did some additional feature engineering that I felt was necessary in analyzing the data. Some of these additional features include:

  • Like Percentage (Proportion of likes)
  • Trending Diff (number of days before trending)
  • Category (a new column of category names derived from category_id)
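
The three derived features above can be sketched in a few lines. This is not the project's R code; it is a minimal Python illustration under several assumptions: that "Like Percentage" means likes divided by likes plus dislikes, that publish_time is an ISO 8601 timestamp, that trending_date is stored as yy.dd.mm, and that the column names match the CSV headers. The function names and the small category table are hypothetical.

```python
from datetime import datetime

# Small illustrative subset of category ids (assumed to follow the YouTube
# Data API numbering); the Kaggle data set ships the full mapping as JSON.
CATEGORY_NAMES = {10: "Music", 17: "Sports", 24: "Entertainment"}


def like_percentage(likes, dislikes):
    """Likes as a percentage of all ratings; 0.0 when there are no ratings."""
    total = likes + dislikes
    return 100.0 * likes / total if total else 0.0


def trending_diff(publish_time, trending_date):
    """Whole days between the upload date and the trending date.

    Assumes publish_time looks like 2017-11-13T17:13:01.000Z and
    trending_date looks like 17.14.11 (yy.dd.mm).
    """
    published = datetime.strptime(publish_time[:10], "%Y-%m-%d").date()
    trended = datetime.strptime(trending_date, "%y.%d.%m").date()
    return (trended - published).days


def add_features(row):
    """Return a copy of one CSV row (a dict of strings) with the new columns."""
    enriched = dict(row)
    enriched["like_percentage"] = like_percentage(
        int(row["likes"]), int(row["dislikes"])
    )
    enriched["trending_diff"] = trending_diff(
        row["publish_time"], row["trending_date"]
    )
    enriched["category"] = CATEGORY_NAMES.get(int(row["category_id"]), "Unknown")
    return enriched
```

Each row read with csv.DictReader could be passed through add_features before any aggregation.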

This project was built with R Shiny, an R package for building interactive web apps, which makes it easy for anyone to explore my findings. The source code is available in my Github repository.

 

Figure 1: Views per Category.

The first thing I looked at was the distribution of views per category. It was not surprising to see that approximately 63% of the total views among trending videos were in the Music and Entertainment categories.
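
A share like this can be computed by summing views per category and dividing by the grand total. The sketch below is a Python illustration, not the project's R code. It assumes each row carries a "category" name and a "views" count, and it sums rows exactly as they appear, so a video that trends on several days is counted once per day. The function name is hypothetical.

```python
from collections import defaultdict


def view_share_by_category(rows):
    """Map each category to its percentage of the total views in rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["category"]] += int(row["views"])

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {
        category: 100.0 * views / grand_total
        for category, views in totals.items()
    }
```

Adding the Music and Entertainment entries of the result gives the combined share for those two categories.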

Figure 2: Word Cloud

I then investigated the most frequently used words in YouTube video titles to see whether any keywords in a video's title were associated with success. I found that "official", "trailer", and "video" were the most frequently used words in video titles. I did this with Natural Language Processing, removing stopwords and miscellaneous characters to reveal the relevant keywords.
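
The word counts behind a word cloud like this can be approximated as follows. This is a Python sketch rather than the project's R code, and the stopword list is a tiny hypothetical stand-in for the fuller lists that NLP libraries provide. Titles are lowercased, split into alphabetic tokens (which drops punctuation and digits), and filtered against the stopwords.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "on", "with", "is"}


def title_word_counts(titles, stopwords=STOPWORDS):
    """Count non-stopword tokens across an iterable of video titles."""
    counts = Counter()
    for title in titles:
        # Keep runs of letters and apostrophes; everything else is a separator.
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in stopwords and len(w) > 1)
    return counts
```

Calling most_common(n) on the returned Counter lists the n most frequent words, which is the input a word cloud needs.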

Figure 3: Channels with the most Trending Videos.

Looking more closely at the channels with the most trending videos, I noticed that ESPN outpaced its peers. This result was expected, as ESPN not only caters to fans of many different sports but is also known for its contentious "Hot Takes" debates.
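
One way to tally trending videos per channel is sketched below in Python (not the project's R code). It counts distinct video ids per channel, which is an assumption about how the tally was made; counting raw rows instead would measure how many days a channel's videos spent on the trending list. The function name is hypothetical.

```python
from collections import defaultdict


def trending_videos_per_channel(rows):
    """Return (channel, distinct video count) pairs, highest count first."""
    videos = defaultdict(set)
    for row in rows:
        videos[row["channel_title"]].add(row["video_id"])

    counts = {channel: len(ids) for channel, ids in videos.items()}
    # Sort by count descending, then by channel name to make ties stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```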

Figure 4: Number of Days Before Trending

Lastly, I wanted to examine how videos trend over time. I figured that the number of days between a video's upload and the day it started trending would be a good indicator of how trending videos are distributed over time (in days).

It was fascinating to see that the bulk of trending videos started trending 2 to 7 days after upload. What does this really mean? My intuition is that if you expect your video to be a "one-hit wonder", you had better hope it starts trending within a week. It is important to note that a video can still be successful even if it does not trend within a week, because videos generally accumulate views over time and can still be very profitable.
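
The distribution described here can be summarized with a simple histogram plus the share of videos that fall inside a window. The Python sketch below is an illustration, not the project's R code; it assumes a list of day counts (one per video) has already been computed, and both function names are hypothetical.

```python
from collections import Counter


def days_histogram(diffs):
    """Count how many videos took each number of days to start trending."""
    return Counter(diffs)


def share_within(diffs, low=2, high=7):
    """Percentage of videos whose days-to-trend lies in [low, high]."""
    if not diffs:
        return 0.0
    inside = sum(1 for days in diffs if low <= days <= high)
    return 100.0 * inside / len(diffs)
```

The defaults mirror the 2 to 7 day window discussed above; other windows can be checked by passing different low and high values.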

Conclusion & Future Work:

In summary, my findings are:

  • 61% of views among trending YouTube videos come from the Music & Entertainment categories.
  • ESPN has the highest number of trending videos.
  • Most trending videos start trending 2 to 7 days after upload.
  • The words in "Official Music Video" are the most popular keywords in YouTube video titles.

To reach a final verdict on whether a true pattern exists among trending YouTube videos, this project will have to branch out into international YouTube statistics. One useful analysis would be to investigate which categories trend on other continents and to see whether a correlation between culture and YouTube popularity exists.

I would also look at the tags used in these popular videos. Exploring tags alongside frequently used title words could be very useful, as tags are typically used for search engine optimization. It would be interesting to see what tag strategies the top videos on YouTube employ.

 


About Author

Precious Chima

Precious Chima is a Data Scientist, Solutions Architect, Technical Consultant, and Inventor working at IBM. Precious has extensive experience designing, architecting, implementing, and executing novel, cutting-edge solutions to various industries. To learn more about his passion projects, check...