The Facebook Effect
On Wednesday, February 21st, social media enthusiast Kylie Jenner tweeted:
"sooo does anyone else not open Snapchat anymore? Or is it just me... ugh this is so sad."
Within the next 24 hours, Snapchat's stock dropped 6 percent, or roughly 1.6 billion dollars in market value.
Kylie Jenner has nearly 25 million followers on Twitter, and this single tweet was "liked" 350 thousand times and "retweeted" 71 thousand times. Her tweet about Snapchat is a microcosm of how social media has engulfed the way businesses operate, and of how it can be either a major benefit or, in the case of Snapchat, a detriment to success.
So for my web-scraping project, I seek to examine two questions: What is the effect of something trending on social media? And, subsequently, how can one increase the likelihood of something trending? I chose to examine the largest social media company of them all: Facebook.
Part I: What's Trending?
For the first part of my analysis, I wanted to focus on the effect of Facebook exposure. Companies are constantly writing content, sharing links, and commenting on posts in hopes that increased exposure on Facebook results in greater publicity. But what is the actual effect of increased exposure on Facebook?
What better way to examine this than to look at Facebook's "trending" list, literally one of the most viewed items on Facebook? In 2014, Facebook added a "trending" feature to its homepage that gives viewers a list of popular topics being discussed and shared on Facebook. The actual algorithm by which this list is constructed is beside the point for this project. What cannot be denied is that when something is deemed "trending" by Facebook, it is shown to all Facebook users. All 200 million (at least in the United States).
So if we can find a way to measure the effect of things trending on Facebook, we can begin to unpack how much exposure on Facebook really matters. But that remains a big if. Without being employed by Facebook or paying some type of fee, access to things like the number of clicks on a post is impossible. What I need is information that is both publicly available and directly related to Facebook trends. So I turned to Wikipedia.
One of the most common patterns after something goes trending on Facebook is for someone to seek more information on that topic. For example, an actor or actress may go trending on Facebook because they got married or were involved in a scandal. One of my first reactions is to get more information. Where did I see that name before? What movie were they in? Wait, weren't they already married!? And as Google can attest, when a person is searched, 9 out of 10 times the first item returned is their Wikipedia page. So if I could link Facebook trends to their Wikipedia viewership at a granular level, I would have a great way to examine the relative effect of something trending on Facebook. And what do you know, Wikipedia publishes the hourly page view count of EVERY WIKIPEDIA PAGE.
1. I record Facebook trending data for the week of January 30th to February 5th. This information is obtained at 8am (ET) daily, one of the most active times on Facebook.
2. I download data from Wikipedia on the number of views each page receives hourly over the same time period.
The bulk of the work then involved accurately combining these sources of data. Each Wikipedia file contained the hourly page view counts for every single Wikipedia page (slightly over 5 million). With seven days of data and 24 Wikipedia files for each of those days, downloading and subsetting the data took a substantial amount of time and computing power. The work involved:
- writing a script to automate the downloading of all 168 files
- using regular expressions to transform Facebook trends into unique strings that would match Wikipedia pages. For example, Mark Ruffalo had a main Wikipedia page (Mark_Ruffalo), a film page (Mark_Ruffalo_films), a page in Spanish (Mark_Ruffalo_sp), and other variations.
- creating a loop to download each file, subset it based on that day's Facebook trends, and then combine the files in an organized hierarchical format.
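The steps above can be sketched roughly as follows. This is a minimal illustration, not the script used for the project: the year, the timestamp format, and the helper names (`hourly_stamps`, `trend_pattern`, `subset_rows`) are all assumptions, and the rows are made-up stand-ins for the (page title, views) pairs found in an hourly file.

```python
import re
from datetime import datetime, timedelta


def hourly_stamps(start, days=7):
    """Return one timestamp string per hour (hypothetical file-name format)."""
    return [
        (start + timedelta(hours=h)).strftime("%Y%m%d-%H0000")
        for h in range(days * 24)
    ]


def trend_pattern(trend):
    """Build a regex matching a trend's main page title and its variants."""
    # "Mark Ruffalo" -> "Mark_Ruffalo", optionally followed by "_<suffix>"
    title = re.escape(trend.strip().replace(" ", "_"))
    return re.compile(rf"^{title}(_.*)?$", re.IGNORECASE)


def subset_rows(rows, trends):
    """Keep only (page_title, views) rows whose title matches some trend."""
    patterns = [trend_pattern(t) for t in trends]
    return [(p, v) for p, v in rows if any(pat.match(p) for pat in patterns)]


# Assumed year; the post only gives the week of January 30th to February 5th.
stamps = hourly_stamps(datetime(2018, 1, 30))
print(len(stamps))  # 7 days x 24 hours = 168 files

# Toy rows standing in for one hourly file.
rows = [("Mark_Ruffalo", 120), ("Mark_Ruffalo_films", 15), ("Markup_language", 90)]
print(subset_rows(rows, ["Mark Ruffalo"]))
```

Anchoring the pattern at both ends keeps unrelated titles that merely share a prefix (such as Markup_language) out of the subset, while still catching suffixed variants of the main page.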
In total, for one week of Facebook trending data, I had over twelve hundred observations.
So does something trending on Facebook affect its Wikipedia page view count?
The graph above suggests that it does. The y axis is the number of views, the x axis is the time of day in three-hour intervals (each tick represents the total number of views since the preceding tick), the blue line marks when something goes trending on Facebook, and the black line represents the average for all Facebook trending topics. As you can see, after something goes trending, the number of times it is viewed increases sharply.
Aggregation can obscure important individual distinctions, so the graph above shows the Wikipedia page view count for every Facebook trending topic that week (views are logged so all graphs can share the same y axis scale). Overall, we can see a general pattern: after something goes trending on Facebook, it receives a sharp increase in Wikipedia page views for several hours before tailing off. Some trends that stick out in particular include
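To make the three-hour ticks concrete, here is a small sketch of how 24 hourly counts could be summed into eight three-hour totals. The function name `three_hour_totals` and the numbers are invented for illustration and are not taken from the project data.

```python
def three_hour_totals(hourly_views):
    """Sum consecutive hourly counts into 3-hour totals."""
    if len(hourly_views) % 3 != 0:
        raise ValueError("expected a multiple of 3 hourly counts")
    return [sum(hourly_views[i:i + 3]) for i in range(0, len(hourly_views), 3)]


# Made-up day: quiet until 8am, a burst for six hours, then a slow tail.
hourly = [10] * 8 + [50] * 6 + [20] * 10
totals = three_hour_totals(hourly)
print(totals)  # [30, 30, 70, 150, 120, 60, 60, 60]
```

Each total covers the three hours ending at that tick, which is why the 8am burst first shows up in the third bin.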
Analysis of how these effects vary by group reveals even more interesting insights.
When we group Facebook trends into three categories (politics, pop culture, and sports), we see the same overall pattern of growth in Wikipedia views. However, sports-related trending topics differ from the others: they receive both the largest uptick in Wikipedia views immediately after going trending and the sharpest dip when the views begin tailing off. This suggests that Facebook trends have different impacts depending on the topic.
But are these differences significant? To answer this question, I run two analyses. First, I create a box plot of the number of hourly views for a topic before and after it goes trending on Facebook. Second, I run a t-test comparing the mean number of hourly Wikipedia views before and after trending on Facebook.
The box plot demonstrates that the median and overall distribution of the number of views are higher in the post-trending group than in the pre-trending group. Furthermore, with a p-value less than .001, the t-test suggests that the mean number of hourly Wikipedia views in the post-trending group is significantly higher than in the pre-trending group.
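For readers who want to see the mechanics of the comparison, here is a minimal sketch of a two-sample t statistic. It assumes Welch's unequal-variance form, which may not be the exact variant used for the results above, and the view counts are invented. Turning the statistic into a p-value additionally requires the t distribution (for example via a statistics library), which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(before, after):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n1, n2 = len(before), len(after)
    # Squared standard error contributed by each group
    s1, s2 = variance(before) / n1, variance(after) / n2
    t = (mean(after) - mean(before)) / math.sqrt(s1 + s2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (s1 + s2) ** 2 / (s1 ** 2 / (n1 - 1) + s2 ** 2 / (n2 - 1))
    return t, df


# Invented hourly view counts for one topic, before and after trending.
before = [100, 110, 90, 105, 95]
after = [150, 140, 160, 155, 145]

t, df = welch_t(before, after)
pct_change = 100 * (mean(after) - mean(before)) / mean(before)
print(t, df, pct_change)
```

The percent change is computed from the two means, which is the same kind of before-versus-after summary as the percentage increase discussed in this post.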
So can we argue with confidence that something going trending on Facebook increases its Wikipedia page view count by nearly 70%? Not necessarily. The t-test does show that mean hourly Wikipedia page views differ significantly between the pre-trending and post-trending periods. However, it does not prove that going trending on Facebook is solely responsible for that increase in page views. Indeed, a reasonable counter-argument would be that something goes trending on Facebook because people are talking about it, and that in and of itself is enough to increase Wikipedia views.
Indeed, with the way the study is currently constructed, I cannot rule out the counter-argument that the apparent effect of going trending on Facebook is confounded, because both "Facebook Trending" and Wikipedia page views are driven by real-world events.
In order to truly find a "causal" relationship between something going trending on Facebook and Wikipedia page views, I need to isolate the effect of going trending on Facebook from other potentially confounding effects. Taking a page from experimental methods, this can be thought of as having a treatment group and a control group and then measuring the difference in the outcome.
As a thought experiment, what would that look like? The treatment group would be what we have already observed in the data: a real-world event occurring and then going trending on Facebook. The control group would be that same real-world event occurring but NOT going trending on Facebook. When working with observational data, where treatment and control cannot be manipulated, this is an impossible task.
Yet this does not mean nothing more can be done to better understand this relationship. Although manipulating the treatment (trending on Facebook) is not feasible, another method is to find a comparison group that is similar to the treatment group in every way except the treatment.
In order to do this, I turn to historical data on trending Twitter topics. Twitter trends, particularly historical Twitter trends, offer an interesting comparison to Facebook trending topics. Something that is trending on Twitter is likely trending for the same reason something is trending on Facebook: it was initiated by some real-world event. That satisfies the condition of finding a group (or phenomenon, in this case) that is extremely similar to the Facebook treatment group. What remains is for that group not to receive the treatment, going trending on Facebook. However, if something goes trending on Twitter, it is also likely to go trending on Facebook. And where it does not, an argument for selection bias could be made, specifically that the topics trending on Twitter but not on Facebook are qualitatively different.
That is why historical Twitter data is best suited as a point of comparison. The functionality of Facebook trends changed dramatically in January of 2017. Prior to that point in time, Facebook personalized what users saw in their trends section according to the preferences of each person. Although it is impossible to fully know the black box behind Facebook's trending algorithm, after significant pushback from the media and the general public, Facebook reported that it would no longer personalize Facebook trends to reflect users' personal interests. Instead, "Everyone in the same region will see the same topics".
Therefore, if we could gather Twitter trends from before Facebook changed its algorithm, we would theoretically have a group of real-world events that occurred and would have been trending on Facebook (had its current trending format existed) but were not.
At last, a control group.
Therefore, I go back and obtain additional data for my study. First, I scrape data from trendogate.com, which keeps track of Twitter trends as far back as 2015. I scrape data from the week of January 18, 2016, nearly two years prior to my Facebook data. I do this to make sure I have Twitter data from before the Facebook trending algorithm change, as well as to make this data as comparable as possible to the Facebook data.
Second, mimicking the earlier data collection process, I obtain data from Wikipedia on hourly page view counts for times corresponding to the Twitter data.
This graph shows the relationship between when something is trending on Twitter and the number of hourly views that topic receives on Wikipedia. It is important to note here that the green line, which marks when something is trending on Twitter, is at 2pm. This is both because trendogate.com updates its website roughly in the late afternoon and because, unlike Facebook, Twitter tends to have its highest rate of activity in the early to late afternoon. Whether this is a fair assumption is a valid concern, and one I definitely wrestled with. However, given the limitations of the data, I decided this was the best course of action.
Again, I run a t-test, in addition to graphing a box plot, to examine whether the differences in hourly Wikipedia page view counts pre-trending and post-trending on Twitter are significant. The results show that, at the .05 significance level, the means do not differ significantly.
The data from the Facebook analysis show that the number of views a Wikipedia page gets is significantly higher after it goes trending on Facebook. A supplementary analysis uses trending topics on Twitter from 2016 as a comparison group in order to isolate the effect of Facebook on hourly Wikipedia page views. The results from the Twitter analysis show that trending on Twitter does not significantly change a topic's page view count, lending support to the idea that increased exposure on Facebook does result in substantial increases in publicity, as seen in the large increase in Wikipedia views.
So back to the original question: is Facebook exposure important? Yes. In my data, a topic's Wikipedia page views rose by roughly 70 percent after it went trending on Facebook.
Part II: How To Get Something Trending?
We established trending on Facebook does wonders for exposure. Great. So now everyone who wants to increase their publicity should go trending on Facebook. Easier said than done.
So in part one, we identified an area of need (exposure on Facebook); now I want to analyze how to increase exposure. Getting something trending on Facebook is nearly impossible, but we can borrow the underlying principles of Facebook's trending section. How can you maximize content exposure on Facebook?
Contrary to what many believe, it is not by flooding Facebook with endless posts and images. In what has been described as "Zombie Scroll Syndrome", Facebook users constantly scroll through content, ignoring posts, images, and especially advertisements, until something catches their eye. Indeed, marketing companies today are actively trying to find ways to create content that breaks this zombie-like scrolling and brings meaningful attention to their information.
One of the best examples of this is how media outlets share their news on their Facebook accounts. In stark contrast to users who visit a media outlet's website to look at articles and posts, users on Facebook are not there specifically to see what's going on in politics, sports, or entertainment. Therefore, how they interact with information is qualitatively different. Correspondingly, most major media outlets do not simply post their stories verbatim or share a link to their website. Instead, they craft posts intended to pique the interest of Facebook users and steer them to their websites.
To understand how posts and content can best be crafted to increase Facebook exposure, I conduct a case study of one of the largest media outlets in America: The New York Times.
I use data collected on all the Facebook posts made by The New York Times from 2012 to 2016. This data contains the text of each Facebook post, the title of the article or video shared, its underlying description, and the number of likes. Initially, in order to continue building my web scraping skills, I scraped the archives of The New York Times to collect additional information on how they shared their posts on their website. But after several scrapes, I realized that the information being collected from the website was exactly the same as the information under the title and description sections of the Facebook data. Having shown that I could scrape the website, I felt my time was better spent analyzing the data.
The first observation that sticks out is that, compared to the descriptions on their website, The New York Times's Facebook posts are on average longer.
From the dashed lines on both graphs (which represent the means), we can see that when The New York Times posts on Facebook, they use both more words (three more on average) and more characters (twenty more on average), with both differences significant at p-values less than .001.
Indeed, a scatterplot of the number of words in a Facebook post versus the number of likes indicates a positive correlation between the number of words in a post and the number of likes that post receives.
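As a rough illustration of the length comparison, the sketch below counts words and characters per post and computes a Pearson correlation between word count and likes. The posts and like counts are invented, and `word_count` and `pearson_r` are hypothetical helper names rather than anything from the original analysis.

```python
import math


def word_count(text):
    """Count whitespace-separated tokens in a post."""
    return len(text.split())


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up (post text, likes) pairs.
posts = [
    ("Short post.", 120),
    ("A slightly longer post here.", 300),
    ("This post uses quite a few more words than the others.", 450),
]

words = [word_count(text) for text, _ in posts]
chars = [len(text) for text, _ in posts]
likes = [n for _, n in posts]

print(words, chars)
print(pearson_r(words, likes))  # positive for this toy data
```

A positive coefficient only says that longer posts tend to come with more likes in the sample; it says nothing about whether length causes the extra likes.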
Besides using more words and characters, how does the language used on Facebook differ?
Facebook post | Website description
Above are word clouds of the Facebook posts (left) and website descriptions (right). As you would expect, they look fairly similar for the most part. To detect more nuanced differences, other methods of analysis are needed.
The figure above shows a simple text analysis (via Google's Cloud Natural Language) of a sample post on Facebook compared to the corresponding article description on The New York Times website.
We can see here that when posting on Facebook, The New York Times mentioned Gandhi but did not in its website description. While this is just a simple example from one post, the pattern is also evident in the overall corpus of text. When posting on Facebook, The New York Times was significantly more likely to mention names like "Obama", "Romney", and "Mitt Romney" than when posting article descriptions on its website. Additionally, when posting on Facebook, The New York Times is 5 times more likely to pose a question or use quotes.
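One simple way to measure patterns like these is to compute, for each corpus, the share of texts that contain a question mark, a quoted span, or a given name. The sketch below does that with invented example texts; the helper names and the quote-detection regex are assumptions, and a real analysis would need more careful handling of punctuation and name variants.

```python
import re

# Straight or curly double-quoted spans, e.g. "like this"
QUOTE_RE = re.compile(r'"[^"]+"|“[^”]+”')


def has_question(text):
    return "?" in text


def has_quote(text):
    return QUOTE_RE.search(text) is not None


def share(texts, predicate):
    """Fraction of texts for which predicate(text) is true."""
    return sum(1 for t in texts if predicate(t)) / len(texts)


def mentions(texts, name):
    """Number of texts mentioning name as a whole word (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
    return sum(1 for t in texts if pattern.search(t))


# Invented examples, not real New York Times text.
facebook_posts = [
    "What would you do in her place?",
    '"I never expected this," she said.',
    "Obama and Romney meet for the final debate.",
    "Is this the end of the road for Romney?",
]
website_descriptions = [
    "The candidates met for the final debate.",
    "A profile of a first-time voter.",
    "Markets fell amid uncertainty.",
    "Why did the economy slow?",
]

print(share(facebook_posts, has_question), share(website_descriptions, has_question))
print(share(facebook_posts, has_quote), share(website_descriptions, has_quote))
print(mentions(facebook_posts, "Romney"), mentions(website_descriptions, "Romney"))
```

Dividing the Facebook share by the website share gives a "times more likely" figure of the kind quoted above, provided the website share is not zero.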
Exploratory analysis of The New York Times provides insight into how The New York Times uses language in their Facebook posts to maximize their social media exposure. This includes:
- Having longer and more descriptive posts compared to the descriptions they use on their website
- Using more charged words like "President" and explicitly referencing people of interest like Obama and Mitt Romney
- Posing questions to readers and using quotes that provide "shock" value
After Kylie Jenner's tweet, Snapchat was estimated to have lost 1.6 billion dollars in value. Was her post responsible for the entirety of that loss? We may never know, but we can say that Jenner's strong presence on social media did significantly affect the outlook for Snapchat.
In this project, I sought to unpack the effects of content exposure on social media, and on Facebook in particular. I found that when something goes trending on Facebook, its Wikipedia page views increase by roughly 70 percent. Using Wikipedia views as a proxy for increased awareness, we can see that Facebook exposure has an enormous impact on awareness. Furthermore, to check that I was isolating the effects of something trending on Facebook and not other potentially confounding effects, I compared the effects of something trending on Facebook to those of trending topics on Twitter. I found that the effect of trending on Twitter was not significant, lending support to the hypothesis that, in my data, Facebook was largely responsible for the 70% increase in Wikipedia views.
I then conducted a case study of how The New York Times posted information on Facebook compared to its website, to gather insights on how major companies use social media to increase their presence. I found that posts on Facebook are longer, include names of important figures (e.g. Obama, Gandhi), and use more questions and quotes.
While my study is the first to examine both how Facebook trends impact Wikipedia views and how Facebook posts compare to website posts for large media companies, it remains exploratory in nature. Future work that may contribute to the robustness of my findings includes:
- Writing a script to continually scrape Facebook for information on trending topics in order to have comparisons at different time points
- Increasing my sample size in order to run fixed effects models where within-group change (i.e. before and after a topic goes trending) can be leveraged to estimate the effects of something trending on Facebook
- Using more robust NLP tools like latent Dirichlet allocation for a more in-depth analysis of textual differences between posting on Facebook and posting on a website
Thanks for taking time to read a long post. Please reach out if you have any questions or suggestions!