Scraping TED Talks: Trends in Global Issues, Science, and Technology

You-Sun Nam
Posted on Jul 16, 2020

"Scraping TED Talks" is a longitudinal examination of trendiness of TED Talks on global issues, science, and technology. After extracting and transforming unstructured data from multimedia content, different methods and different measures of trendiness were used to inform analysis. Taken together, both methods reveal different sides of the story behind the numbers, as well as the evolution of trends. A composite measure of trendiness was constructed to gain a deeper understanding of the overall trending landscape. 

Author: You-Sun Nam, Data Science Fellow
Quick Links: GitHub | Primary Data | Portfolio


Executive Summary

Length: 2 min

After scraping, transforming, and analyzing unstructured data from TED Talks in global issues/technology and science/technology, the following insights emerge:

  1. Technology informs the future orientation and practical application of TED Talk content when cross-referenced with global issues or science. This is unsurprising given the nature of technology, but including this category has had the same effect in both categories: boosting long-term or future-oriented issues while downgrading past or short-term current issues.
  2. Analyzing TED Talk tags with and without controlling for tag frequency tells different sides of the story behind trends. Without controlling for tag frequency, total count provides a broad, macro understanding of overall trends. Controlling for tag frequency (hereafter abbreviated as "tag frequency"), on the other hand, reveals rising trends within this broad context that were previously obscured by total count.
  3. Both methods (total count and tag frequency) may inform the evolution of trends. Total count identifies current, mainstream trends that are most likely industry-driven, whereas tag frequency identifies up-and-coming trends that have yet to be mainstreamed but are gaining popularity among the audience.
  4. Distinguishing among the different measures of trendiness (hits, audience engagement, and worldwide appeal) is irrelevant under total count but relevant after controlling for tag frequency. Further research is needed to examine if and how specific trends in each measure identified by tag frequency are interrelated or manifest thematically.

Technical Notes

Demonstrated skills, language(s), and tools:

  • Web scraping: Selenium with Python
  • Data cleaning: R
  • Data visualization: R

Business Case

Length: 2 min

Introduction

If a picture is worth a thousand words, then how much is a video worth? Outside of Excel spreadsheets and existing databases, there is an extraordinary amount of unstructured data to explore. Let's take a look at a sample TED Talk as an example:


Figure 2.1.1 A sample TED Talk

How many data points can you spot in the above screenshot? Here are some data points to start things off: title, speaker, summary, date, location, number of views, transcript, reading list, footnotes, number of comments, the type of comments...

Remember, these are data points from one video. Imagine all the data points from hundreds and thousands of related videos. What does all this data mean, individually and collectively?


Figure 2.1.2 A collection of TED Talks, filtered by category

More importantly, which data points are worth extracting? The answer to this question is: It depends on your research question.

Purpose

What are some trends in global issues, science, and technology that companies can tap into?

After scraping and transforming unstructured data from TED Talks on global issues, science, and technology, I conduct a longitudinal analysis of industry trends for global development organizations, government agencies, tech companies, and marketers. With scraped variables serving as a proxy for trends or a different aspect of trendiness, I demonstrate how different methods and different measures can provide a wide range of business insights.

With each scraped numerical variable representing a different aspect of trendiness, I then aggregate relevant variables to construct a multivariate indicator. This multivariate indicator clarifies and deepens the understanding of trends in global issues, science, and technology on a macro level.

Methodology

Length: 1 min

Scraped Variables

Within the scope of global issues, science, and technology, I scraped the following data points using Selenium with Python:

Textual data    Numerical data
Tags            Number of views
Title           Number of comments
Speaker         Number of translations
Summary         Year
Transcript


Tags are the main focus of study, as a proxy for trends.

Number of views, number of comments, and number of translations are numerical variables measuring trendiness. Each numerical variable serves as a proxy indicator for an aspect of trendiness, which is detailed in the next subsection.

The first part of analysis involves examining tags by each numerical measure of trendiness. Two methods, total count and tag frequency, were applied to each measure.

Constructing Composite Measure of Trendiness

The numerical variables were aggregated to create a multivariate indicator of trendiness. The measures of trendiness are as follows, with each numerical variable as a proxy indicator for some aspect of trendiness:

  • Number of views as hits
  • Number of comments as audience engagement
  • Number of translations as worldwide appeal

Each numerical measure was weighted equally.

The second part of the analysis involves examining tags by this composite measure of trendiness.
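
As a rough illustration of an equally weighted composite, here is a minimal Python sketch. It is not the project's actual code: the talk records and their numbers are made up, and the min-max rescaling step is an assumption, since the write-up only states that each measure was weighted equally and does not say how the measures were put on a common scale.

```python
from statistics import mean

# Hypothetical per-talk records; the numbers are made up for illustration only.
talks = [
    {"title": "Talk A", "views": 1_000_000, "comments": 200, "translations": 30},
    {"title": "Talk B", "views": 500_000, "comments": 400, "translations": 10},
    {"title": "Talk C", "views": 100_000, "comments": 100, "translations": 20},
]

MEASURES = ("views", "comments", "translations")


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(records, measures=MEASURES):
    """Average the rescaled measures with equal weights.

    Assumption: min-max scaling is used so that views (in the millions)
    do not swamp comments and translations.
    """
    scaled = {m: min_max_scale([r[m] for r in records]) for m in measures}
    return [mean(scaled[m][i] for m in measures) for i in range(len(records))]


scores = composite_scores(talks)
for talk, score in zip(talks, scores):
    print(f"{talk['title']}: {score:.3f}")
```

Without some form of rescaling, an equal-weight average would be dominated by whichever measure has the largest raw values, which is why the sketch normalizes each measure before averaging.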

Trends by Total Count vs Tag Frequency

Total Length: 8.5 min

There are multiple ways to measure trendiness by tags (as opposed to overall trendiness, which is what the previous measures indicate). In this project, we use two methods: first by total tag count, and second by tag frequency. Afterwards, in the "Total Count vs Tag Frequency: Which is Right?" section, the different stories told by each method are analyzed and used to inform which method provides more value.

Total Count

Length: 2 min

The figures below illustrate the Top 20 trends in each category (Global Issues and Technology, and Science and Technology) according to total number of views (hits), number of comments (audience engagement), and number of translations (worldwide appeal).

Global Issues and Technology

Note that aside from a few exceptions, the top trends remain consistent across all measures and are fairly interchangeable in ranking.


Figure 4.1.1.1 Top 20 Trends in Global Issues and Technology by Number of Views


Figure 4.1.1.2 Top 20 Trends in Global Issues and Technology by Number of Comments


Figure 4.1.1.3 Top 20 Trends in Global Issues and Technology by Number of Translations

For easy comparison, here are the top five trends across all three measures: total number of views, total number of comments, and total number of translations.

Rank  Total Views     Total Comments  Total Translations
1     Climate change  Business        Culture
2     Future          Culture         Business
3     Culture         Design          Design
4     Business        Politics        Future
5     Environment     Climate change  Climate change


The trends seem to hold across each numerical measure of trendiness, with only minor reshuffling in rankings. Let's see if the same pattern of consistency occurs with a different category of videos.

Science and Technology

Having carried out the same analysis with a different category (this time, science and technology), we can see the same pattern of consistency and ranking interchangeability occur.


Figure 4.1.2.1 Top 20 Trends in Science and Technology by Number of Views


Figure 4.1.2.2 Top 20 Trends in Science and Technology by Number of Comments


Figure 4.1.2.3 Top 20 Trends in Science and Technology by Number of Translations

For easy comparison, here are the top six trends across all three measures: total number of views, total number of comments, and total number of translations.

Rank  Total Views  Total Comments  Total Translations
1     Innovation   Innovation      Innovation
2     Future       Future          Future
3     Invention    Engineering     Biology
4     Engineering  Invention       Invention
5     Design       Biology         Medicine
6     Biology      Medicine        Design


What can we conclude so far? Because the same pattern of consistency and ranking interchangeability appears in both categories, the top trends are likely driven by how general and how frequently used the tags are, rather than revealing anything meaningful.

Tag Frequency

Length: 2.5 min

This time, let's carry out the same analysis, but controlling for frequency of tags. This should also weed out the issue with broad, generalizable tags.
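
To make the difference between the two methods concrete, here is a minimal Python sketch. The talk records and view counts are made up, and `views_by_tag` is a hypothetical helper rather than the project's code (the project's cleaning and visualization were done in R). It assumes "controlling for tag frequency" means dividing a tag's total views by the number of talks carrying that tag, in line with the "Per Tag Count" wording in the figure captions below.

```python
from collections import defaultdict

# Hypothetical talks (made-up numbers): each has a list of tags and a view count.
talks = [
    {"tags": ["future", "climate change"], "views": 900},
    {"tags": ["future", "business"], "views": 500},
    {"tags": ["future"], "views": 300},
    {"tags": ["mars"], "views": 1500},
]


def views_by_tag(records):
    """Return {tag: (total_views, tag_count, views_per_tag_count)}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        for tag in record["tags"]:
            totals[tag] += record["views"]
            counts[tag] += 1
    return {tag: (totals[tag], counts[tag], totals[tag] / counts[tag]) for tag in totals}


stats = views_by_tag(talks)

# Total count: frequently used tags rise to the top.
by_total = sorted(stats, key=lambda t: stats[t][0], reverse=True)
# Tag frequency: views are divided by how often the tag is used.
by_frequency = sorted(stats, key=lambda t: stats[t][2], reverse=True)

print("Ranked by total views:        ", by_total)
print("Ranked by views per tag count:", by_frequency)
```

In this toy data, the broad tag "future" leads by total views only because it appears on three talks; once views are divided by tag count, the rarely used tag "mars" moves to the top.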

Global Issues and Technology

After controlling for tag frequency, note that the pattern of consistency and ranking interchangeability has pretty much disappeared.


Figure 4.2.1.1 Top 20 Trends in Global Issues and Technology by Number of Views Per Tag Count


Figure 4.2.1.2 Top 20 Trends in Global Issues and Technology by Number of Comments Per Tag Count


Figure 4.2.1.3 Top 20 Trends in Global Issues and Technology by Number of Translations Per Tag Count

For easy comparison, here are the top five trends across all three measures: number of views per tag frequency, number of comments per tag frequency, and number of translations per tag frequency.

Rank  Views Per Tag Frequency  Comments Per Tag Frequency  Translations Per Tag Frequency
1     Rocket Science           Iraq                        Vaccines
2     Mars                     Europe                      Plastic
3     Industrial Design        Online video                Iraq
4     Life                     Military                    Interview
5     Religion                 Demo                        Library


After controlling for tag frequency, number of views, number of comments, and number of translations no longer correspond to each other. Here we can see several interesting patterns that differ by hits, audience engagement, and worldwide appeal.

Science and Technology

After controlling for tag frequency for science and technology TED Talks, the pattern of consistency and ranking interchangeability is dramatically reduced.


Figure 4.2.2.1 Top 20 Trends in Science and Technology by Number of Views Per Tag Count


Figure 4.2.2.2 Top 20 Trends in Science and Technology by Number of Comments Per Tag Count


Figure 4.2.2.3 Top 20 Trends in Science and Technology by Number of Translations Per Tag Count

For easy comparison, here are the top five trends across all three measures: number of views per tag frequency, number of comments per tag frequency, and number of translations per tag frequency.

Rank  Views Per Tag Frequency  Comments Per Tag Frequency  Translations Per Tag Frequency
1     Manufacturing            Social media                Toy
2     Social media             Gaming                      Personality
3     Gaming                   Compassion                  Language
4     Compassion               Body language               Introvert
5     Body language            Birds                       Evolutionary Psychology


After controlling for tag frequency, we can see number of views and number of comments tend to correspond for the top 5 trends, with more variation in later rankings. On the other hand, there is little relationship between number of translations and the other two measures, at least for the top five trends.
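
One simple way to put a number on this correspondence is the overlap between top-five lists. The sketch below is an illustrative check rather than part of the original analysis: it uses the Jaccard index (an assumed choice of overlap measure) on the top-five tags copied from the Science and Technology table above.

```python
from itertools import combinations

# Top five tags per measure, copied from the Science and Technology table above.
top_five = {
    "views": ["Manufacturing", "Social media", "Gaming", "Compassion", "Body language"],
    "comments": ["Social media", "Gaming", "Compassion", "Body language", "Birds"],
    "translations": ["Toy", "Personality", "Language", "Introvert", "Evolutionary Psychology"],
}


def jaccard(a, b):
    """Share of tags common to both lists, out of all distinct tags in either."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# Compare every pair of measures.
overlap = {
    (m1, m2): jaccard(top_five[m1], top_five[m2])
    for m1, m2 in combinations(top_five, 2)
}

for (m1, m2), score in overlap.items():
    print(f"{m1} vs {m2}: {score:.2f}")
```

Views and comments share four of their top five tags, while translations share none with either. Note that the Jaccard index ignores rank order, so it cannot show that the shared tags sit one rank apart in the two lists.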

Total Count vs Tag Frequency: Which is Right?

Length: 3.5 min

Different methods (total count and tag frequency) tell different stories. Which story is "right"? Which method should be used to measure trendiness? The short answer is "both," in that both stories are "right" and both methods should be used to measure trendiness. So if both methods are right, how do we account for the different conclusions?

Total count paints a broad picture of trends in global issues, science, and technology. It is primarily useful for gaining an overall understanding that contextualizes more specific findings. Tag frequency, on the other hand, gives us more meaningful insight into the rising trends obscured by total count.

Business Value

To illustrate the business value of using both methods to tell a different aspect of the story, let's compare the top five global issues and technology trends identified by total count...

Rank  Total Views     Total Comments  Total Translations
1     Climate change  Business        Culture
2     Future          Culture         Business
3     Culture         Design          Design
4     Business        Politics        Future
5     Environment     Climate change  Climate change


...to the top five global issues and technology trends identified after controlling for tag frequency.

Rank  Views Per Tag Frequency  Comments Per Tag Frequency  Translations Per Tag Frequency
1     Rocket Science           Iraq                        Vaccines
2     Mars                     Europe                      Plastic
3     Industrial Design        Online video                Iraq
4     Life                     Military                    Interview
5     Religion                 Demo                        Library


Assume you're a marketing analyst at a multinational technology firm that is interested in expanding its presence in international affairs. Your task is to identify current and future trends that can inform the direction of the firm's tech products and services, which the firm's clients would use to address global issues. Afterwards, you are to present your analysis to upper management. How will you reconcile the different results?

If total count provides insight into the broad, overall trends and controlling for tag frequency uncovers trends obscured by this broader context, then here is how you might distinguish the trends identified by each method:

  • Total count: Broadly speaking, the top five trending TED Talks in global issues and technology revolve around current societal problems that have continuity into the future. Environmental issues, such as climate change, are one prominent example. Measured by total count, these are current trends: because we are not taking tag frequency into account, the popularity of these trends is influenced by the sheer number of tagged talks. In turn, this suggests these trends are thoroughly mainstreamed and driven by the industry as a whole.
  • Tag frequency: After controlling for tag frequency, we see that the top five trending TED Talks are far more specific in topic and scope. Given the different results across each numerical measure of trendiness, there appears to be little to no correspondence between hits, audience engagement, and worldwide appeal. If total count provides insight into current trends, then tag frequency identifies up-and-coming trends. Because the popularity of these trends is not influenced by sheer number, these trends are most likely not yet mainstreamed and are driven by key individual players.
  • Business recommendation: First, improve and refine the current line of technological products and services for current, future-impacting issues such as climate change, but expect some market saturation. Second, the identified up-and-coming trends inform the direction R&D should take when designing and targeting future technological products and services. However, more research on the identified up-and-coming trends is needed beforehand.

Trends by Composite Measure

Length: 3 min

Recall from the "Methodology" section that a composite measure of trendiness was constructed by equally weighting each individual numerical measure. Each numerical measure served as a proxy for the following (a.k.a. what we wanted to measure):

  • Number of views as hits
  • Number of comments as audience engagement
  • Number of translations as worldwide appeal

Going back to total count, let's broaden our understanding of the overall context using this composite measure of trendiness.

Global Issues and Technology

Below is a lollipop graph depicting the most trending and least trending TED Talks in global issues and technology:


Figure 5.1 Top 10 and Bottom 10 Trends in Global Issues and Technology by Average Composite Measure of Trendiness

At this point in the analysis, the top 10 trends shouldn't come as a surprise. From this graph, we can see that the most trending TED Talks focus on practical applications ("business," "design," "invention," "communication," "collaboration") to current problems that will continue to affect society in the future ("future," "climate change"/"environment," and "politics"/"culture"). In comparison to the bottom 10 trends, the top 10 trends are globally applicable and broad in scope.

The least trending TED Talks in global issues and technology, on the other hand, are less interconnected. Unsurprisingly, the bottom 10 trends are also less broad in scope than the top 10 trends. We also know from the first part of analysis that these topics are not necessarily unpopular because of tag frequency. (Even after controlling for tag frequency, none of these topics appear in the top 20 trends.) Taken together, all of these points suggest that these trends are fairly niche, appealing to a minority of the TED audience that browses global issues and technology.

Science and Technology

Let's take a look at the most trending and least trending TED Talks in science and technology:


Figure 5.2 Top 10 and Bottom 10 Trends in Science and Technology by Average Composite Measure of Trendiness

The most trending TED Talks in science and technology focus on practical applications ("innovation," "invention," "engineering," "medicine," "design," "biotech") to biological issues ("biology," "health," "brain") with implications for the future ("future"). In comparison to the bottom 10 trends ("Middle East," "South America"), the top 10 trends are international and broad in scope. These results are fairly similar in theme to the most trending talks in global issues and technology, which undoubtedly reflects the influence of the "technology" category.

The least trending TED Talks in science and technology, on the other hand, are less interconnected. Unsurprisingly, the bottom 10 trends are also less broad in scope than the top 10 trends. We also know from the first part of analysis that these topics are not necessarily unpopular because of tag frequency. (Even after controlling for tag frequency, none of these topics appear in the top 20 trends.) Taken together, all of these points suggest that these trends are fairly niche, appealing to a minority of the TED audience that browses science and technology. They may, however, appeal to a different segment of the TED audience.

Future Updates

Length: 27 seconds

  1. Replicate project with a larger sample size, i.e. similar videos outside of TED Talks
  2. Examine the popularity of speakers as a variable. How does a speaker's popularity and reputation affect these measures of trendiness?
  3. Generate meaningful subcategories by analyzing textual data using Topic Modeling
  4. Analyze case studies: Conduct sentiment analysis using NLP on comments left on most popular TED Talks in global issues, science, and technology

Appendix

Length: 46 seconds

The first and second parts of analysis were longitudinal in nature, focused on identifying trends over time. A shorter longitudinal study or a cross-sectional analysis restricted to a specific year can also be conducted. A brief analysis was carried out to understand how time might affect the trendiness of TED Talks in global issues, science, and technology. Each numerical measure of trendiness displayed different trends in both global issues/technology and science/technology.

Exploratory data analysis suggests that more investigation is needed into dramatic spikes in specific years, especially the year(s) that overlap across each measure of trendiness. Speaker should also be taken into account when conducting cross-sectional analysis.


Contact

If you have any questions or comments, please feel free to reach out to me on LinkedIn or GitHub.


About Author

You-Sun Nam

You-Sun Nam is a Data Science Fellow at the NYC Data Science Academy. With programming fluency in Python, R, and SQL and an academic background in statistics, she specializes in executing data science projects backed with rigorous research...