An Analysis of YouTube Statistics in the U.S.A.

Precious Chima
Posted on Sep 3, 2019

"The joy of YouTube is that you can create content about anything you feel passionate about, however silly the subject matter."

Zoe Sugg

Introduction:

Whether you want to learn a new language, pick up a new skill, or showcase a personal talent, YouTube tends to be the first place people turn. The chance to go viral and be recognized from anywhere in the world is, I believe, under-appreciated in this age of rapid technological advancement. Since its emergence, YouTube has given the "average Joe" a platform for entrepreneurship and success that would be hard to imagine without it. In many cases, global recognition can be traced to a single video. If we look at the videos that have successfully gone viral, is there a common denominator among them?

Background Information:

As a frequent YouTube user, I wanted to know more about trending YouTube videos. The data set I used came from https://www.kaggle.com/datasnaek/youtube-new, which provides daily trending video statistics, with up to 200 trending videos per day. More than 40,000 observations were used in the analysis, with the features listed below:

  • Video Id  (unique identifier for each uploaded video)
  • Trending_date
  • Channel Title
  • Category_id 
  • Video Description
  • Publish Time
  • Tags
  • Views, likes, dislikes & comments
  • thumbnail link
  • video_error_or_removed
  • Ratings Disabled & Comments Disabled

I also engineered a few additional features that I felt were necessary for analyzing the data:

  • Like Percentage (Proportion of likes)
  • Trending Diff (number of days before trending)
  • Category (derived from category_id; a new column holding the category names)
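
The three engineered features above can be sketched in a few lines. The original project was written in R; this is only an illustrative Python version built on assumptions that the post does not state: that Like Percentage means likes divided by likes plus dislikes, that trending_date is stored as yy.dd.mm, that publish_time is an ISO 8601 timestamp, and that the small CATEGORY_NAMES lookup stands in for the full category mapping. The function name engineer_features is hypothetical.

```python
from datetime import datetime

# Illustrative subset of category ids; a stand-in for the full mapping.
CATEGORY_NAMES = {10: "Music", 17: "Sports", 24: "Entertainment"}


def engineer_features(row):
    """Return a copy of row with like_percentage, trending_diff and category added."""
    likes = int(row["likes"])
    dislikes = int(row["dislikes"])
    total_ratings = likes + dislikes
    # Assumed definition: likes as a share of all ratings (None if there are no ratings).
    like_percentage = 100.0 * likes / total_ratings if total_ratings else None

    # Assumed formats: trending_date as yy.dd.mm, publish_time as ISO 8601.
    trending = datetime.strptime(row["trending_date"], "%y.%d.%m").date()
    published = datetime.strptime(row["publish_time"][:10], "%Y-%m-%d").date()
    trending_diff = (trending - published).days  # days between upload and trending

    category = CATEGORY_NAMES.get(int(row["category_id"]), "Unknown")
    return {
        **row,
        "like_percentage": like_percentage,
        "trending_diff": trending_diff,
        "category": category,
    }


# Made-up example row, not taken from the data set.
sample = {
    "likes": "900",
    "dislikes": "100",
    "trending_date": "17.14.11",
    "publish_time": "2017-11-10T17:00:00.000Z",
    "category_id": "10",
}
result = engineer_features(sample)
```

Returning a new dictionary rather than modifying the row in place keeps the raw data untouched, which makes it easier to recompute a feature if its definition changes.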

This project was built with Shiny, an R package for building interactive web applications, which makes it easy for anyone to explore my findings. The source code is available in my GitHub repository.

 

Figure 1: Views per Category.

The first thing I looked at was the distribution of views per category. It was not surprising to see that approximately 63% of the total views among trending videos were in the Music and Entertainment categories.
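
A share like this comes from summing views per category and dividing by the grand total. The snippet below is a minimal Python sketch of that calculation, not the code behind Figure 1; the function name view_share_by_category and the toy rows are hypothetical, and each row is assumed to already carry a category name.

```python
from collections import defaultdict


def view_share_by_category(rows):
    """Return each category's percentage share of the total views."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["category"]] += int(row["views"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: 100.0 * views / grand_total for name, views in totals.items()}


# Made-up numbers chosen for easy arithmetic, not real results.
toy_rows = [
    {"category": "Music", "views": 400},
    {"category": "Music", "views": 200},
    {"category": "Entertainment", "views": 300},
    {"category": "Sports", "views": 100},
]
shares = view_share_by_category(toy_rows)
```

Because the data set records trending videos daily, a video that trends on several days may appear in several rows. Whether to count every row or only one row per video is a choice the analysis has to make explicitly.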

Figure 2: Word Cloud

I then investigated the most frequently used words in YouTube video titles to see whether any keywords in a video's title were associated with success. I found that "official", "trailer", and "video" were the most frequently used words in video titles. I did this with basic Natural Language Processing: removing stopwords and miscellaneous characters so that the relevant keywords stand out.
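
The word counts behind a word cloud can be approximated as follows. This is a hedged Python sketch rather than the original R code: the STOPWORDS set is a tiny hypothetical stand-in for a full stopword list, the tokenizer simply keeps runs of letters, and the titles are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "with", "is"}


def title_word_counts(titles, stopwords=STOPWORDS):
    """Count words across titles after lowercasing and dropping stopwords."""
    counts = Counter()
    for title in titles:
        # Keep runs of letters only, which discards punctuation and digits.
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in stopwords and len(w) > 1)
    return counts


# Invented titles for illustration.
titles = [
    "Artist - Song (Official Video)",
    "The Movie | Official Trailer",
    "Official Music Video",
]
counts = title_word_counts(titles)
top_word, top_count = counts.most_common(1)[0]
```

A word cloud then scales each word by its count, so the quality of the stopword list largely determines which words dominate the picture.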

Figure 3: Channels with the most Trending Videos.

Looking more closely at the channels with the highest tally of trending videos, I noticed that ESPN outpaced its peers. This result was expected, as ESPN not only caters to fans of many different sports but is also known for its contentious "Hot Takes" debates.

Figure 4: Number of Days Before Trending

Lastly, I wanted to examine how quickly videos start trending. I figured that the number of days between a video's upload and its appearance on the trending list would be a good indicator of how trending videos are distributed over time. It was fascinating to see that the bulk of trending videos started trending between 2 and 7 days after upload. What does this really mean? My intuition is that if you expect your video to be a "one hit wonder", you had better hope it starts trending within a week. It is important to note that a video can still be successful even if it does not trend within a week, because videos generally accumulate views over time and can still be very profitable.
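
To make "the bulk of trending videos" concrete, one could tally the days-before-trending values and measure the share that falls inside the 2 to 7 day window. The Python sketch below assumes those values have already been computed; the function name share_within_window and the sample numbers are hypothetical and do not reproduce Figure 4.

```python
from collections import Counter


def share_within_window(trending_diffs, low=2, high=7):
    """Return the percentage of values that fall between low and high days, inclusive."""
    diffs = [d for d in trending_diffs if d is not None]
    if not diffs:
        return 0.0
    inside = sum(1 for d in diffs if low <= d <= high)
    return 100.0 * inside / len(diffs)


# Made-up days-before-trending values, not real data.
diffs = [1, 2, 3, 3, 5, 7, 9, 30]
histogram = Counter(diffs)           # how many videos took each number of days
share = share_within_window(diffs)   # share of videos inside the 2 to 7 day window
```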

Conclusion & Future Work:

In summary, my findings are:

  • Just over 60% of views among trending videos come from the Music and Entertainment categories.
  • ESPN has the highest number of trending videos.
  • Trending videos generally start trending between 2 and 7 days after upload.
  • "Official", "trailer", and "video" are the most frequently used words in trending video titles.

To reach a final verdict on whether a true pattern exists among trending YouTube videos, this project will have to branch out into international YouTube statistics. One useful analysis would be to investigate which categories trend on other continents and whether there is a correlation between culture and YouTube popularity. I would also look at the tags chosen for these popular videos. Exploring tags alongside frequently used title words could be very useful, as tags are typically used for search engine optimization. It would be interesting to see what tag strategies the top videos on YouTube employ.

 

 

About Author

Precious Chima

Precious Chima is an NYC Data Science Fellow with a bachelor's degree in Applied Mathematics & Statistics from Stony Brook University. Prior to enrolling in the NYCDSA, he worked in the Oil & Gas industry, specializing in optimizing drilling...