NYC Data Science Academy | Blog

Reddit Controversy Sentiment Analysis

Hadar Zeigerson
Posted on Nov 11, 2021

GitHub RShiny

Who cares about controversies on Reddit?

The benefit of controversies is that they challenge both the actors and the witnesses to clarify their values. The two hot-topic US controversies highlighted by national news this month, Dave Chappelle's 'The Closer' and Joe Manchin's rejection of climate action policy in Congress, have been no exception.

Though many US citizens have been unaware of or silent about these topics, a substantial number have been speaking out. People have been voicing their values in the form of street protests, legal actions, dissent in workplaces, phonebanking to voters in relevant states, and social media organizing. Others have written blogs, news articles, and lengthy social media posts. Still others have voiced their opinions via online platforms for civil (and often uncivil) discourse. One such platform is Reddit, which Wikipedia calls "an American social news aggregation and discussion website."

This project sought to use sentiment analysis via natural language processing (NLP) to explore:

How do Reddit users feel about the selected controversies?
Do Reddit users generally care more about one than the other?

These questions were explored through an RShiny app, a prototype of a tool that could give stakeholders interactive insight into how text creators (here, Reddit users) feel about a given topic.

Stakeholders of this study include: anyone invested in public opinion of policies passed by Congress; anyone who would like to know public opinion of Netflix and its products; Joe Manchin and Dave Chappelle, whose reputations are discussed on this platform; US citizens who are curious to know how their opinion is reflected by Reddit users collectively.

The Controversies

- On October 5th, 2021, Netflix premiered Dave Chappelle's hour-long stand-up comedy special 'The Closer' (2021) on its streaming platform. Within a few days of its release, viewers of all backgrounds became vocally critical of the harmful jokes Chappelle made in the show, jokes expressing anti-trans/LGBTQ, anti-Asian, antisemitic, racist, and misogynist sentiments. In an escalating battle to get Netflix to stop streaming Chappelle's special, Netflix employees have been protesting with political action in various forms. These include staging a walkout, leaking profitability data to Bloomberg News, and filing a federal labor charge.

- Over the past month, senators in Congress have been negotiating an infrastructure bill that is a key part of President Joe Biden's 'Build Back Better' agenda. One of the key actors in these negotiations has been Joe Manchin, who effectively blocked ambitious climate- and social-action policy from being passed by Congress. Manchin's private shares in the coal brokerage Enersystems and the large donations he has received from coal, oil, and gas corporations have called into question his motives as a public servant in Congress, making him a controversial figure.

What is sentiment analysis?

Sentiment analysis is an in-depth investigation of emotion in text. It is often used to assess public opinion of a company or its product via reviews, news, social media, or other sources. A plethora of tools exist to collect the text and derive emotion-related information (in the NLP world called "valence"). These tools span a wide range of algorithmic complexity, from hand-built lexicons (dictionaries of words manually assigned sentiment-valence values) to sophisticated machine learning models that derive domain-specific word embeddings for a given niche of language; and from binary measures of positive and negative valence to expanded emotional palettes that include "disgust", "surprise", "fear", and "trust".

Put simply, in sentiment analysis each word is assigned a numerical value representing the valence (direction and strength of emotion) it conveys, and these numbers are then aggregated to suggest a gestalt emotion for the text.
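The idea of scoring words and aggregating them can be sketched in a few lines. This is a minimal illustration in Python (not the R code used in this project), and the tiny lexicon and its values are invented for the example rather than taken from any real sentiment dictionary.

```python
# Toy illustration of lexicon-based sentiment scoring.
# TOY_LEXICON and its valence values are invented for this example.
TOY_LEXICON = {
    "love": 3.0,
    "great": 2.5,
    "funny": 1.5,
    "harmful": -2.5,
    "hate": -3.0,
    "awful": -2.8,
}


def score_text(text, lexicon=TOY_LEXICON):
    """Return (total valence, mean valence of the words found in the lexicon)."""
    # Lowercase each token and strip surrounding punctuation.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    valences = [lexicon[w] for w in words if w in lexicon]
    total = sum(valences)
    mean = total / len(valences) if valences else 0.0
    return total, mean


total, mean = score_text("I love this, it is great!")
print(total, mean)
```

Real tools add a great deal on top of this (negation, intensifiers, punctuation, emoji), but the word-level lookup followed by aggregation is the common core.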

Text Mining and Sentiment Analysis Tools

For this study, RedditExtractoR, a simple-to-use R package designed specifically for scraping topic-specific data from Reddit, was used to collect Reddit threads (conversation boards) related to the selected controversies. RedditExtractoR gathered text data from the thread title, thread content (here called "post"), and comments. It also gathered relevant information about each thread and comment, such as the subreddit (topic-specific forum within Reddit) in which the thread was posted, the number of upvotes or downvotes the thread or comment received, the date of posting, and the relationship between comments (whether a comment was a response to the comment before it).
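One way to picture the collected data is as thread records that each hold a list of comment records. The sketch below is in Python rather than the project's R, and every class and field name in it is hypothetical; it does not reproduce RedditExtractoR's actual output columns.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Comment:
    # All field names are illustrative, not RedditExtractoR column names.
    comment_id: str
    text: str
    score: int                       # upvotes minus downvotes
    posted: date
    parent_id: Optional[str] = None  # None for a top-level comment


@dataclass
class Thread:
    title: str
    subreddit: str
    score: int
    posted: date
    post_text: str = ""              # optional body text ("post")
    comments: List[Comment] = field(default_factory=list)

    def replies_to(self, comment_id):
        """Return the comments written in direct response to comment_id."""
        return [c for c in self.comments if c.parent_id == comment_id]


thread = Thread("Example title", "news", 120, date(2021, 10, 7))
thread.comments.append(Comment("c1", "First comment", 5, date(2021, 10, 7)))
thread.comments.append(Comment("c2", "A reply", 2, date(2021, 10, 7), parent_id="c1"))
print(len(thread.replies_to("c1")))
```

Keeping the parent link on each comment is what makes it possible to reconstruct the reply relationships described above.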

Text items were analyzed using VADER (Valence Aware Dictionary and sEntiment Reasoner), a sentiment analysis tool tuned for social media text. This package reports scores for the positivity, negativity, and neutrality of a text, as well as a compound score that summarizes the overall sentiment as a single normalized value between -1 and 1. The tool was simple to use and accounted for most of the slang words and expressions common on social media.
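To give a feel for how a compound score can be bounded between -1 and 1, here is a small Python sketch (not the R code used in this project) of the normalization step as it appears in VADER's open-source Python implementation, where a summed valence x is mapped to x / sqrt(x² + 15). The input sum is assumed to have been computed already; VADER's rule-based adjustments for negation, capitalization, and punctuation are omitted. The ±0.05 cut-offs are the thresholds commonly suggested in the VADER documentation, not values taken from this project.

```python
import math


def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, score))


def label(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for valence_sum in (-6.0, 0.0, 1.0, 4.0):
    c = normalize_compound(valence_sum)
    print(valence_sum, round(c, 3), label(c))
```

Because of the square root in the denominator, large sums saturate toward ±1 instead of growing without bound, which keeps long and short texts on a comparable scale.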

Pre-processing: Word Clouds

A popular and relatable product of NLP is the word cloud plot, where frequently used words in the analyzed text are clustered together. Font size and color are determined by how often each word appears.

Dave Chappelle Controversy Word Cloud | Joe Manchin Controversy Word Cloud
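The counting step behind a word cloud can be sketched as follows. This is an illustrative Python version (the project itself was built in R); the stopword list, the sample titles, and the linear font-size scaling are all assumptions made for the example.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on", "for"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count how often each non-stopword appears across all texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


def font_sizes(counts, min_size=10, max_size=48):
    """Scale each word's count linearly to a font size."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - low) * (max_size - min_size) / span
        for word, count in counts.items()
    }


titles = ["The special is on Netflix", "Netflix employees walkout", "Netflix special"]
counts = word_frequencies(titles)
print(counts.most_common(2))
print(font_sizes(counts))
```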

Exploratory Data Analyses (EDA)

The results of this query were sparse due to the amount of time dedicated to exploring tools for NLP and the creation of an engaging RShiny app. This prototype is a living, breathing dashboard that will be completed shortly. The plots below contrast the levels of positivity and negativity of text data relating to each controversy.

Title texts

The histograms below suggest that a little over half of the titles for the Joe Manchin controversy showed no positive valence and some negative valence. This is in contrast with the Dave Chappelle controversy, for which over half of the titles had positive valence as well as negative.
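Statements like "a little over half of the titles showed no positive valence" come from summarizing the score distribution. A minimal Python sketch of that summary is shown here (the project's plots were made in R); the function names are hypothetical and the sample scores are invented, not the project's data.

```python
def share_with_zero(scores):
    """Fraction of texts whose score is exactly zero."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == 0.0) / len(scores)


def histogram(scores, n_bins=10):
    """Count scores in n_bins equal-width bins over the range [0, 1]."""
    bins = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 goes into the last bin.
        index = min(int(s * n_bins), n_bins - 1)
        bins[index] += 1
    return bins


# Invented positive-valence scores for five thread titles.
positive_scores = [0.0, 0.0, 0.0, 0.25, 1.0]
print(share_with_zero(positive_scores))
print(histogram(positive_scores))
```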

As seen in the scatterplot, a highly neutral thread title about the Joe Manchin controversy received more than twice as many upvotes as titles about the Dave Chappelle controversy. Conversely, the Dave Chappelle controversy had a few highly upvoted thread titles that cumulatively drew much more discussion than the Joe Manchin controversy threads (as measured by the number of comments, visualized with data point size). The thread with the greatest number of comments (from the Dave Chappelle controversy) also had stronger valence, both positive (~0.31) and negative (~0.4). Further analysis is required to determine the relationship between emotional valence and text popularity.

 
Comparative EDA plots for title texts:
Distribution of sentiment is shown by histograms (left);
Scatter plots depict the relationship between thread popularity and title text valence (right)

The time series plots below illustrate the emotional valence of the thread titles over the past month. The lines are a product of the range of sentiment expressed on a given day. When the titles published on a given day span a range of positive or negative sentiment, the plot shows a vertical line for that day.
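The per-day range that produces those vertical lines can be computed by grouping scores by date and taking the minimum and maximum. The Python sketch below illustrates this (the project's plots were made in R); the function name is hypothetical and the sample values are invented.

```python
from collections import defaultdict
from datetime import date


def daily_valence_range(records):
    """Group (day, valence) pairs by day and return {day: (min, max)} in date order."""
    by_day = defaultdict(list)
    for day, valence in records:
        by_day[day].append(valence)
    return {day: (min(values), max(values)) for day, values in sorted(by_day.items())}


# Invented negative-valence scores for titles posted on two days.
records = [
    (date(2021, 10, 8), 0.2),
    (date(2021, 10, 7), 0.1),
    (date(2021, 10, 7), 0.4),
]
for day, (low, high) in daily_valence_range(records).items():
    print(day, low, high)
```

A day with a single title collapses to a point (minimum equals maximum), while a day with several differently scored titles is drawn as a vertical segment between the two extremes.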

The bottom chart suggests a slight decrease in the negative valence of titles for both controversies over the two weeks before the data was collected. Both charts reflect the delayed onset of titles relating to the Dave Chappelle controversy, which first appeared on October 7th, whereas titles about the Joe Manchin controversy were present from the beginning of the month.

 
Time series plots show the evolution of sentiment in texts over time.

Post texts

In contrast with the title texts, post texts demonstrated much higher valence generally, both positive and negative. There were also drastically fewer post texts, which created a comically sparse time series plot (below). The reason is that a title is required to open a thread, whereas post text is not. However, post text is often where the author of the thread elaborates on their opinion about the subject of the thread, which may explain the high levels of valence in this type of text.

 
Comparative EDA plots for post texts:
Distribution of sentiment is shown by histograms (left);
Scatter plots depict the relationship between thread popularity and post text valence (right)

While posts from both the Dave Chappelle and Joe Manchin controversies showed similar distributions of positive valence, the distribution of negative valence was broader and more extreme for Dave Chappelle controversy posts than for Joe Manchin posts. The time series below reflects this pattern as well.

The scatter plots show that of the threads with post text, two Dave Chappelle controversy-related threads received significantly higher attention, as measured by both upvotes and number of comments. These popular posts both had similar ratios of positive to negative valence, a pattern that might be worth investigating in the context of other controversies.

 
Time series plots show the evolution of sentiment in texts over time.

Comment texts

Comment texts are muddied by the many argumentative and off-topic discussions that eddy out from the thread. The sheer quantity of text produces distinct patterns in the distributions. Again, text popularity and valence seem unrelated.

 
Comparative EDA plots for comment texts:
Distribution of sentiment is shown by histograms (left);
Scatter plots depict the relationship between thread popularity and comment text valence (right)

One pattern worth noting is the momentary disappearance of valence in the Joe Manchin controversy at the start of the Dave Chappelle controversy. It's possible that these trends are related (for example, if commenters on the Joe Manchin controversy were temporarily prioritizing conversations about Dave Chappelle). However, there is not enough information to draw any conclusions.

Time series plots show the evolution of sentiment in texts over time.

General Conclusions

Both controversies had comparable counts of all texts. The Dave Chappelle controversy generally had slightly more, at 59 threads and circa 4,600 comments, compared to the Joe Manchin controversy's 40 threads and around 3,500 comments. A few interesting patterns and phenomena were revealed, but deeper analysis is needed before any impactful conclusions can be drawn.

Future Work and Final Remarks

Due to the niche use of language within individual subreddits, sentiment evaluations in this project are subject to a potentially large margin of error.

Luckily, solutions to this issue exist. A project by William L. Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky, Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora (2016), contains code that could be adapted to create unsupervised machine learning models that automatically generate and update an individualized lexicon for every subreddit scraped. This would both improve the accuracy of the data presented and open the possibility of a self-updating app that relays the progression of sentiments over time.

Additionally, these models could incorporate algorithms inspired by packages like syuzhet (among others) that currently have limited valence sensitivity but offer a broader range of emotional information (surprise, disgust, fear, anger, etc.) or the ability to pull up the topics associated with high-valence language.

It should also be noted that even if we had collected all the online data available in its various forms, and analyzed it using the most sophisticated domain-specific word embeddings and lexicons produced by NLP machine learning algorithms, the resulting data would still not account for the offline conversations and actions that might powerfully sway our understanding of gestalt collective opinion.

That said, this project provides a promising prototype of a tool that can be used to connect its users to the sentiments of a given text-source by providing an interactive data dashboard that visually reflects a sample of public opinion regarding a controversial subject. 

Sample memes of controversies:
(Dave Chappelle controversy (left), Joe Manchin controversy (right))

About Author

Hadar Zeigerson

Hadar graduated from Colorado College in 2015 with a STEM-heavy degree in Psychology. After graduating, they spent five years working and volunteering in community-level regenerative agriculture and two years in big picture learning education. This professional journey led...