Majority Illusion Effect on News Through Social Media

Celina Sprague, Raul Vallejo, and Aamash Haroon
Posted on Dec 12, 2018

Highlights section:

News reporting is not a good indicator of socially relevant topics

Networks talking about globally relevant topics are highly prone to the majority illusion



Our project explores the prevalence of the majority illusion effect in news shared through social media platforms. We were inspired by the work of Abigail C. Salvania and Jaderick Pabico from the University of the Philippines and Fernando Flores from the University of Amsterdam, who identified three phases in the flow of information. Social networks are generally large-scale networks comprising millions or billions of nodes. One way to analyze the properties of these nodes is to understand their group behavior (community properties).

The three phases are:

  • Expansion Phase: Occurs when information is first published and the article spreads from the source (patient zero) to other members of the social network who are separated from the source by at least a couple of degrees
  • Front-Page Phase: Occurs when a news aggregator or a verified or popular figure promotes the information (for example, by retweeting it). This boosts the spread of the article beyond the connected components of the source, so it spreads faster
  • Saturation Phase: Occurs when the information ages and its popularity among members of the social network dies down

Our data were all collected using APIs, and we conducted our exploratory data analysis using machine learning techniques, e.g., NLP and LDA, and advanced data visualizations in ggplot, Plotly, and D3.


The news dataset originated from a news web scraping company and was accessed with an API registration key. We filtered articles on the following criteria:

  • English language
  • Published since December 2017
  • Popularity index of 5 out of 10 (largest popularity)
  • Crawled within the last 30 days (the maximum possible)
  • News articles

Each API request returned 100 dictionaries, with each dictionary representing an individual article. We made about 400 requests, which gave us roughly 40,000 articles.

Each news article JSON file contained basic information such as author, title, publish date, crawled date, text and social media share metrics. Fortunately, the format for every article was identical, so we could develop a loop to consolidate the articles.

We decided to use Twitter because its ease of use and low barriers to access made it a good simulation of an internet-mediated conversation forum. Twitter data were obtained through the Twitter API service by applying for a token key; each user can obtain at most about 4,500 tweets every 15 minutes. We were able to obtain data on combinations of Twitter usernames, screen names, retweets, and text.
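To get a feel for what this rate limit means in practice, the waiting time for a large pull can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope illustration, not the collection code used in the project; the function name and the 200,000-tweet target are hypothetical, and it optimistically assumes every 15-minute window returns the full quota.

```python
import math

def estimated_wait_minutes(target_tweets, tweets_per_window=4500, window_minutes=15):
    """Estimate the minimum waiting time needed to pull `target_tweets` tweets.

    Assumes each window yields the full quota (an optimistic assumption).
    """
    windows = math.ceil(target_tweets / tweets_per_window)
    # The first window is available immediately; each later one costs a full wait.
    return max(windows - 1, 0) * window_minutes

# Hypothetical target size, chosen only for illustration.
minutes = estimated_wait_minutes(200_000)
print(f"{minutes} minutes (about {minutes / 60:.1f} hours)")
```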



We started by writing a “for” loop to export each API response as a JSON file, which gave us about 400 JSON files, each containing about 100 dictionaries. A second loop exported each dictionary as its own JSON file, so that each file represents a single news article. Our final data processing loop imported each JSON file and combined them all into a Pandas dataframe. The dataframe was constructed by doing the following:

  • Flatten nested dictionaries
  • Insert a dummy value for any empty list or string
  • Unpack any list or string containing exactly one element
  • Load the result into the dataframe
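The steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the helper names, the dotted-key convention, the "none" dummy value, and the article fields are all assumptions made for the example.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def normalize(value, dummy="none"):
    """Replace empty lists/strings with a dummy value; unpack one-element lists."""
    if value == [] or value == "":
        return dummy
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value

def article_to_row(article):
    """Turn one article dictionary into a flat row ready for a dataframe."""
    return {key: normalize(value) for key, value in flatten(article).items()}

# Hypothetical article; the field names are illustrative, not the provider's schema.
article = {
    "title": "Example headline",
    "author": "",
    "thread": {"social": {"facebook": {"shares": 12}}, "country": "US"},
    "entities": {"persons": [], "locations": ["New York"]},
}
row = article_to_row(article)
print(row)
```

A list of such rows can then be handed to `pandas.DataFrame(rows)` to build the combined dataframe.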

From the resulting dataframe, we converted any null values to “none” and removed any columns containing the same value in every row. The dataframe was then ready for an LDA model, which produces topics and keywords. We chose the number of topics and the lambda value by running a grid search on our initial LDA model, using perplexity and log likelihood as criteria; the grid search suggested 30 topics with a lambda value of 0.5.
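The null handling and constant-column removal described here can be illustrated with a small standard-library sketch. It works on a list of row dictionaries rather than a Pandas dataframe so that it stays self-contained; the function name and the sample rows are hypothetical.

```python
def clean_rows(rows, fill="none"):
    """Fill missing values and drop columns that hold one value in every row."""
    columns = sorted({key for row in rows for key in row})
    filled = [
        {col: (fill if row.get(col) is None else row[col]) for col in columns}
        for row in rows
    ]
    # repr() allows unhashable values such as lists to be compared as well.
    varying = [
        col for col in columns
        if len({repr(row[col]) for row in filled}) > 1
    ]
    return [{col: row[col] for col in varying} for row in filled]

# Made-up rows for illustration only.
rows = [
    {"title": "A", "language": "english", "author": None},
    {"title": "B", "language": "english", "author": "Jane Doe"},
    {"title": "C", "language": "english"},
]
cleaned = clean_rows(rows)
print(cleaned)
```

Here the `language` column is dropped because every row holds the same value, while the missing authors become "none". In Pandas, `DataFrame.fillna` and `DataFrame.nunique` would be natural starting points for the same two steps.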

With the 30 topics and top 19 keywords, we decided on a name for each topic as well as a topic bucket. For example, if we found a topic to be about elephant poaching, then the topic bucket might be animal rights. The benefit of the topic bucket is that the term is broad enough to capture sufficient Twitter activity.


To analyze how information spreads through a social media network like Twitter, the data must be aggregated by user, with each user treated as an individual node. Nodes are connected to each other by edges. In this case, an edge represents a retweet (RT). When the network of connected Twitter users is visualized, node clusters or communities emerge. User communities form when users are talking about or seeing the same information.
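A minimal sketch of this node-and-edge view is shown below. It assumes the retweets are available as (retweeter, original author) pairs, and it uses connected components as a crude stand-in for community detection; real community detection algorithms are more refined, and the usernames are invented.

```python
from collections import defaultdict

def build_retweet_graph(retweets):
    """Build an undirected adjacency map from (retweeter, author) pairs."""
    graph = defaultdict(set)
    for retweeter, author in retweets:
        graph[retweeter].add(author)
        graph[author].add(retweeter)
    return graph

def connected_components(graph):
    """Group users into connected components (a simple proxy for communities)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            user = stack.pop()
            if user in seen:
                continue
            seen.add(user)
            component.add(user)
            stack.extend(graph[user] - seen)
        components.append(component)
    return components

# Invented retweet pairs for illustration.
retweets = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "carol"),
    ("erin", "frank"),
]
graph = build_retweet_graph(retweets)
communities = connected_components(graph)
print([sorted(c) for c in communities])
```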

A helpful analogy for the spread of information in social media is the spread of a disease. For a disease to start spreading, the infection must originate from a source commonly referred to as patient zero. Once patient zero is identified, the range or width of the network can be calculated. The range of the network can also be interpreted as the degree of separation of any given node from patient zero. These local dynamics, along with the concepts of scale and speed, are better visualized in the following diagram from Yang & Counts’ paper “Predicting the Speed, Scale and Range of Information Diffusion in Twitter” (2010):

Results & Discussion


The purpose of performing LDA and NLP analysis on news articles was to get a sense of what people are reading, not necessarily what people are discussing; the Twitter API provided the latter. Comparing the count of articles by topic bucket with the total social media shares, we found that some topic buckets switched places in popularity. For example, data privacy had the highest number of articles, ahead of police brutality; by social media shares, however, police brutality ranked highest, ahead of data privacy. This might indicate that what people read is not necessarily what people talk about, and that some topics are more widespread than others.
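The comparison between article counts and share totals can be sketched as two rankings over the same topic buckets. The example below is illustrative only: the field names are assumptions and the numbers are made up, not the project's data.

```python
from collections import Counter

def rank_buckets(articles):
    """Rank topic buckets by article count and, separately, by total shares."""
    article_counts, share_totals = Counter(), Counter()
    for article in articles:
        article_counts[article["bucket"]] += 1
        share_totals[article["bucket"]] += article["shares"]
    by_articles = [bucket for bucket, _ in article_counts.most_common()]
    by_shares = [bucket for bucket, _ in share_totals.most_common()]
    return by_articles, by_shares

# Made-up numbers purely for illustration.
articles = [
    {"bucket": "data privacy", "shares": 10},
    {"bucket": "data privacy", "shares": 5},
    {"bucket": "data privacy", "shares": 8},
    {"bucket": "police brutality", "shares": 120},
    {"bucket": "police brutality", "shares": 60},
]
by_articles, by_shares = rank_buckets(articles)
print("By article count:", by_articles)
print("By total shares: ", by_shares)
```

A bucket that sits at a different position in the two lists is one where reading volume and sharing volume diverge.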

Despite the large number of news articles we were able to obtain, we faced some limitations, the largest being crawl date. The provider charges for access to articles crawled more than 30 days ago, so to avoid being charged per file and request, we decided to obtain only articles crawled within the last 30 days. As a consequence, our overall dataset covered a shorter time span, which could affect our Twitter pull. Additionally, we had a limited number of API requests per month, which restricted our ability to revise searches and to run certain searches because of the size of the response.

The image above shows the resulting topic buckets and the number of articles in each. “N/A” represents articles that are either not in English or too specific to be fed into the Twitter API.

The graph above shows the topic buckets we decided to input into the Twitter API, chosen after an initial network analysis that gauged how large a sample we would obtain through the actual API.

Having cleared up important concepts from network theory, it is time to have a quick look at the data collected from Twitter.

The plot of the rate of tweets per hour shows many types of behavior over time. Ideally, this visualization aids in understanding the lifecycle of the spread of information. The three phases of information spread are easily distinguishable after zooming in to a single day. However, the plot also shows the main limitation of doing this for every topic: having enough history of data. Some headlines are so broad that even after running a query and gathering more than 200,000 tweets, the data didn’t span even a day.
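A tweets-per-hour series of this kind can be derived by truncating each tweet's timestamp to the hour and counting. The sketch below assumes the timestamps have already been parsed into `datetime` objects; the function name and sample times are hypothetical.

```python
from collections import Counter
from datetime import datetime

def tweets_per_hour(timestamps):
    """Count tweets in each clock hour, returned in chronological order."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(counts.items())

# Invented timestamps for illustration.
timestamps = [
    datetime(2018, 12, 1, 9, 5),
    datetime(2018, 12, 1, 9, 40),
    datetime(2018, 12, 1, 10, 15),
    datetime(2018, 12, 1, 12, 0),
]
hourly = tweets_per_hour(timestamps)
for hour, count in hourly:
    print(hour.isoformat(), count)

# How much history the pull actually covers, in hours.
span_hours = (max(timestamps) - min(timestamps)).total_seconds() / 3600
print(f"Data spans {span_hours:.1f} hours")
```

Hours with no tweets are simply absent from the result, and checking the span first shows whether a topic has enough history to reveal all three phases.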

Given the understanding of other relevant metrics in network theory, there are still other dimensions of the data to explore like range and scale of information spread.

This plot essentially shows three things:

  1. Broadly relevant topics have a wide extent of spread of information
  2. Local news headlines have a highly contained network
  3. Most topics rapidly spread to large scale communities

In this graph, topics are ordered by their maximum degree of separation from patient zero (horizontal axis). Points represent communities, laid out according to their degrees of separation from patient zero (horizontal axis) and the scale or size of the community (vertical axis).

This means that topics are ranked by how widespread they are beyond any given node’s immediate network. This is crucial to understanding why a person’s immediate surroundings are not necessarily representative of the general truth.
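The degree of separation from patient zero can be computed with a breadth-first search over the retweet network. The following is a small sketch under stated assumptions: the network is given as an adjacency map, patient zero is already known, and the toy graph is invented for the example.

```python
from collections import deque

def degrees_of_separation(graph, patient_zero):
    """Breadth-first search: hop count from patient zero to each reachable user."""
    distance = {patient_zero: 0}
    queue = deque([patient_zero])
    while queue:
        user = queue.popleft()
        for neighbor in graph.get(user, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[user] + 1
                queue.append(neighbor)
    return distance

# Toy network for illustration.
graph = {
    "source": {"a", "b"},
    "a": {"source", "c"},
    "b": {"source"},
    "c": {"a", "d"},
    "d": {"c"},
}
distance = degrees_of_separation(graph, "source")
topic_range = max(distance.values())  # farthest hop reached from patient zero
users_reached = len(distance)         # users reached, including the source
print(distance, topic_range, users_reached)
```

Ordering topics by `topic_range` corresponds to ranking them by how far the information travels beyond the source's immediate neighbors.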

Furthermore, consider what the network researcher Kristina Lerman described as “the majority illusion”:

It’s our flawed perception of some networks that rely on a logical proverb: We just don’t know what we don’t know.

And in some networks, the information we do have, our sliver of local knowledge, can lead us to the false conclusion.
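One simple way to make the idea concrete is to count how many users see a topic among more than half of their neighbors, and compare that with how common the topic is across the whole network. The sketch below uses this measure as an assumption for illustration; the function name and the toy star-shaped network are invented.

```python
def majority_illusion_fraction(graph, active):
    """Fraction of users for whom more than half of their neighbors are active."""
    fooled = sum(
        1
        for neighbors in graph.values()
        if neighbors and sum(n in active for n in neighbors) > len(neighbors) / 2
    )
    return fooled / len(graph)

# Toy network: one well-connected hub and five users who only follow the hub.
graph = {
    "hub": {"a", "b", "c", "d", "e"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
    "e": {"hub"},
}
active = {"hub"}  # only the hub is talking about the topic

global_share = len(active) / len(graph)
local_share = majority_illusion_fraction(graph, active)
print(f"Globally active: {global_share:.0%}")
print(f"Users who see a local majority: {local_share:.0%}")
```

In this toy case only one user in six is active, yet five of the six users see the topic in their entire neighborhood, because the single active user is the one everybody is connected to.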

Hopefully, this study will eventually delve deeper into exploring this social media network phenomenon.


As we discussed the applicability of our project, we found a variety of applications with a common theme of relatability: the more relatable something is, the faster it will spread and the longer it will remain relevant. We decided to focus on marketing and product creation. From a marketing perspective, we would recommend targeting large communities and communities that have a longer saturation phase, i.e., where it takes longer for the “hype” to “go away.” This way, a marketer’s idea or product has a better chance of remaining relevant and becoming a trend.

Likewise, for product creation, we would recommend targeting topics that have remained relevant, i.e., topics with a steady flow of news articles, and communities that are large and interconnected with other communities. For example, prior to Hulu’s “The Handmaid's Tale,” there was an abundance of articles about feminism, political chaos, and gender equality. Moreover, there were also many discussions about each of those topics on Twitter, found through trending hashtags and from the API. The steady flow of articles, in tandem with the large quantity of discussions and opinions, made an ideal environment and timeframe for Hulu to launch its TV series. Within a day of the launch, there were articles about the series and discussions occurring on Twitter and other social media platforms. More importantly, the series did very well.


We explored the majority illusion effect on news through social media and found that the topics people read about are not always the ones they discuss; we also examined the speed at which information reaches communities. After data processing and preparation, we focused on specific topics and explored whether the opinions of peers or neighbors were creating an illusion of popularity (the majority illusion) or whether the topics were truly popular. Through exploring popularity, we discovered relationships between topics, communities, and the rate at which information spreads throughout communities. Our results appeared to support the three phases of information flow in social networks (expansion, front-page, and saturation), and there is an opportunity to build a model to predict behavior. For now, the visualizations and the workflow we created serve as a platform for various strategy applications, such as targeting marketing products or campaigns to communities and timing product creations and launches.

About Authors

Celina Sprague

Celina Sprague completed her BA at Barnard College. Her work experience has been primarily in the finance industry where she worked in primarily fixed-income research and managed over 40 economic models. As a past dancer/overall athlete and painter,...

Raul Vallejo

Actuary, statistician and certified Data Scientist. Experienced in building risk models and integrating them into a company-wide modelling strategy. Leader of new multi-department initiatives to create data-driven culture. Raúl Vallejo completed his BA in Actuarial Science at Instituto...

Aamash Haroon

Aamash Haroon completed his Bachelor of Civil Engineering from India with a Masters in Finance and Real Estate from Fordham University, New York. Coupled with experience in the construction, finance, and real estate space, he is also the...