Finding Influencers on Twitter

Posted on Oct 17, 2016

Have you been followed on Twitter or Instagram by someone you don't know? I get this a lot, and to avoid seeming rude, I follow back. Eventually, I got tired of following back when I realized that some of these accounts don't really do anything but collect followers. Now, why would anyone go through all the trouble of following people in the hopes of being followed back? Why would anyone spend so much time on the internet for this?

I eventually realized the answer when I saw that most of these accounts were not personal. A lot of the accounts I encountered were about food, some were about beach vacations, and, on occasion, some had risqué content.

Advertising has infiltrated the social network. It used to be just ads on banners, but now companies hire personalities on social media to spread the word about their product or event. Companies spend big bucks on celebrities in an effort to publicize their brand and attract a celebrity's fan base. A sponsored tweet could net as much as $13,000, as was the case for Khloé Kardashian in 2013.

Celebrities have multitudes of followers and get paid big bucks by sponsors. So people may have thought that creating accounts and amassing followers would eventually get them sponsorship deals with advertisers. In this exercise, we see that sponsors might be looking for things other than the number of followers.

In a social network, a link could represent a relationship as in Facebook or the passing of a tweet as in Twitter. These links determine the flow of information and are therefore a good indicator of a user's influence. I will be presenting two methods of finding potential influencers in a network. One would be by extracting a user's influence measures and the other is by using network graphs.

The data came from a large database containing a stream of tweets related to NASDAQ 100 stocks, extracted from Twitter over 79 days, from March 28 to June 15, 2016. It was selected because it has a good mix of accounts representing organizations and personalities. The database also records how many times a tweet was passed along and who the original tweet came from. This act, more popularly known as retweeting, can be identified in the stream by tweets that have 'RT @user' or 'via @user' at the beginning. The stream also contains information about mentions. On Twitter, a mention is a public conversation between users: a user calls the attention of another user by mentioning them in a tweet. Mentions are identified by tweets beginning with '@user'.
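As a rough illustration of the markers just described, the sketch below classifies a tweet by its opening text. The regular expressions and the function name `classify_tweet` are assumptions made for this example; the original analysis may have parsed the stream differently.

```python
import re

# Assumed patterns, based only on the markers described in the text:
# a retweet starts with "RT @user" or "via @user"; a mention starts with "@user".
RETWEET_RE = re.compile(r"^\s*(?:RT|via)\s*@\s*(\w+)", re.IGNORECASE)
MENTION_RE = re.compile(r"^\s*@(\w+)")


def classify_tweet(text):
    """Return (kind, user): kind is 'retweet', 'mention', or 'original'."""
    match = RETWEET_RE.match(text)
    if match:
        return "retweet", match.group(1)  # user whose tweet was passed along
    match = MENTION_RE.match(text)
    if match:
        return "mention", match.group(1)  # user being addressed
    return "original", None
```

The retweet pattern tolerates an optional space after the '@' sign, which is an assumption about how loosely the stream was formatted.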

The influence measures extracted from the stream were the following: indegree, retweet, and mentions. These measures were selected because of how they affect the flow of information in the network. Indegree measures the user's popularity. This was easily extracted from the database by the number of followers a user has. The number of followers shows us the size of the user's audience base. Retweet influence represents a user's ability to create content which other users find worthy of sharing. When a tweet is shared by another user, a bigger network of users is exposed to the tweet. From the stream, this was extracted by counting the number of retweeted messages for each user. The third measure, mention influence, was extracted by counting the number of mentions containing the user's name. This influence measure indicates the ability of the user to engage others in a conversation. This represents the top-of-mind value of the user's name.
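The three counts can be sketched as follows. The record layout `(author, followers, text)` and the function name `influence_counts` are hypothetical, chosen only for illustration; the actual database schema is not described here.

```python
import re
from collections import Counter

# Assumed markers: "RT @user" / "via @user" for retweets, leading "@user" for mentions.
RETWEET_RE = re.compile(r"^\s*(?:RT|via)\s*@\s*(\w+)", re.IGNORECASE)
MENTION_RE = re.compile(r"^\s*@(\w+)")


def influence_counts(tweets):
    """tweets: iterable of (author, followers, text) tuples (hypothetical layout).

    Returns (indegree, retweets, mentions), each keyed by user name.
    """
    indegree = {}
    retweets = Counter()
    mentions = Counter()
    for author, followers, text in tweets:
        indegree[author] = followers  # audience size of the tweeting user
        retweet = RETWEET_RE.match(text)
        if retweet:
            retweets[retweet.group(1)] += 1  # credit goes to the original author
            continue
        mention = MENTION_RE.match(text)
        if mention:
            mentions[mention.group(1)] += 1  # credit goes to the mentioned user
    return indegree, retweets, mentions
```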

A total of 96,613 users tweeted about NASDAQ 100 stocks during the timeframe. Between them, over 680 thousand tweets were broadcast. A word cloud of the NASDAQ symbols most often mentioned shows that Apple, represented by AAPL, was the most tweeted stock among the group. 


Figure 1. Stock symbol word cloud.


Users were most active on April 27, when they broadcast over 20,800 tweets. This coincides with the day AAPL stock slumped following speculation that iPhone sales may decline by as much as 60 million units compared to the same quarter a year earlier. The slump in Apple shares dragged the tech-heavy NASDAQ into the red by the day's end.


Figure 2. Frequency plot of tweets.

Activity on this day was concentrated during market trading hours, which are 13:30 to 20:30 UTC.


Figure 3. Frequency plot of 27-April-2016.

Each user's ranking in the three influence categories was assigned using fractional ranking. For example, in assigning the indegree ranking, a rank of 1 was given to the user with the most followers. Users with the same number of followers receive the same rank, which is the mean of the ranks they would have under ordinal ranking. Table 1 shows the top 30 users across the three influence measures. Notice that there is minimal overlap across the influence ranks. The first user to show up across all three measures of influence was "WSJ".
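A minimal sketch of fractional ranking, assuming scores are held in a plain dictionary. The function name `fractional_rank` is illustrative and not taken from the original analysis.

```python
def fractional_rank(scores):
    """scores: dict of user -> value. Returns dict of user -> rank (1 = highest).

    Tied users share the mean of the ordinal ranks they would otherwise occupy.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        # Extend the group while the next user has the same score.
        while end + 1 < len(ordered) and scores[ordered[end + 1]] == scores[ordered[start]]:
            end += 1
        mean_rank = ((start + 1) + (end + 1)) / 2
        for user in ordered[start:end + 1]:
            ranks[user] = mean_rank
        start = end + 1
    return ranks


followers = {"a": 500, "b": 300, "c": 300, "d": 10}
print(fractional_rank(followers))  # {'a': 1.0, 'b': 2.5, 'c': 2.5, 'd': 4.0}
```

Users "b" and "c" would hold ordinal ranks 2 and 3, so both receive 2.5.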

Table 1. Top influentials based on indegree, retweets, and mentions

Rank TopIndegree TopRT TopMentions
1 cnnbrk philstockworld jimcramer
2 nytimes StocksHighAlert CNBC
3 CNN ValaAfshar AlertTrade
4 Reuters YahooFinance WSJ
5 WSJ BK_Stocks Benzinga
6 Forbes businessinsider YahooFinance
7 AP StockTwits CNBCFastMoney
8 DRJAMESCABOT timothysykes TheStreet
9 GMA CNNMoney autumnalcity87
10 MarketWatch devonshiretech carlquintanilla
11 JohnLegere CNBCnow HalftimeReport
12 USATODAY ppprophet markbspiegel
13 CNBC carlquintanilla RiskReversal
14 ForbesTech Stockology101 petenajarian
15 FortuneMagazine Benzinga TTtradertwit
16 timoreilly OpenOutcrier GerberKawasaki
17 rsAnakin_FBGx20 marketexclusive barronsonline
18 philstockworld DayTradersGroup ReformedBroker
19 dickc theflynews Reuters
20 ReutersBiz TakeFlightSales GuyAdami
21 businessinsider WrigleyTom StockTwits
22 om TweakTown RedDogT3
23 Yahoo markbspiegel jonnajarian
24 SAI SAI cek_cpa
25 globeandmail WSJ JustinPulitzer
26 Variety CenterTrading ryanwallace198
27 VH1 CBOE DougKass
28 CNNMoney davidmoadel technology
29 WSJbusiness GerberKawasaki BossHoggHazzard
30 CNET options_answers SquawkStreet

To see how much users overlap across the three categories, a Venn diagram of the top 100 users was derived. Figure 4 shows that among the 239 users in the top list, only 10 users can be seen across all three measures of influence.


Figure 4. Venn diagram of top influentials across measures.
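The counts behind such a Venn diagram reduce to set operations. The helper names below (`top_n`, `overlap_summary`) are assumptions made for this sketch, which only reports the size of the union and of the three-way intersection.

```python
def top_n(ranks, n):
    """Return the n users with the best (lowest) rank; ties follow input order."""
    return set(sorted(ranks, key=ranks.get)[:n])


def overlap_summary(indegree_top, retweet_top, mention_top):
    """Count users in the union and in the three-way intersection of the sets."""
    union = indegree_top | retweet_top | mention_top
    in_all = indegree_top & retweet_top & mention_top
    return {"union": len(union), "all_three": len(in_all)}
```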

Figure 5 below shows a correlation matrix which represents how a user's rank varies across the three different measures of influence. The correlation matrix represents the strength of the association between a pair of rankings. This matrix was derived by comparing the relative influence ranks of all 96,613 users in the database.


Figure 5. Correlation plot across all influence measures.
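The text does not state which correlation coefficient was used. The sketch below assumes a Spearman-style correlation, computed as the Pearson correlation of the rank values; the function names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_correlation_matrix(rankings):
    """rankings: dict of measure -> {user: rank}, all covering the same users.

    Returns a dict keyed by (measure_a, measure_b) with the correlation of ranks.
    """
    measures = list(rankings)
    users = sorted(rankings[measures[0]])
    columns = {m: [rankings[m][u] for u in users] for m in measures}
    return {(a, b): pearson(columns[a], columns[b]) for a in measures for b in measures}
```

Because the inputs are already ranks (including fractional ranks for ties), applying Pearson to them gives the rank correlation directly.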

The users show a strong correlation between their retweet influence and mention influence. The low correlation between the indegree measure and the other two measures shows that indegree ranking may not be related to the other rankings.

A couple of conclusions can be derived from the correlation plot. First, we can say that in most cases, users who are retweeted often are also mentioned often, and vice versa. Another one is that the most followed user may not be the most engaging user in the group. A user's popularity, therefore, is a weak representation of the ability to motivate the spread of information.

Retweets and mentions have direction. A retweet is the path of an idea from User A to User B: User A broadcasts a tweet, User B reads it, thinks it is worth sharing, and retweets it. The retweet will eventually be seen by users not directly accessible to User A. When User A mentions User B, this is again a link from User A to User B. With this in mind, we have enough data to convert our Twitter stream into a directed network graph. Each user is a node in the graph and each directed link is an edge. The igraph library will be used to extract information from the resulting network graph.
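The conversion to a directed, weighted edge list might look like the sketch below. The `(kind, actor, target)` tuple layout is hypothetical; the edge directions follow the description above, with a retweet running from the original author to the retweeter and a mention running from the mentioning user to the mentioned one.

```python
from collections import Counter


def build_edges(interactions):
    """interactions: iterable of (kind, actor, target) tuples (hypothetical layout).

    'retweet': actor retweeted target's tweet, so the idea flows target -> actor.
    'mention': actor mentioned target, so the link is actor -> target.
    Returns a Counter mapping (source, destination) to the number of interactions.
    """
    edges = Counter()
    for kind, actor, target in interactions:
        if kind == "retweet":
            edges[(target, actor)] += 1
        elif kind == "mention":
            edges[(actor, target)] += 1
    return edges
```

The number of keys is the number of unique edges, and each count serves as that edge's weight.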

A quick look at the resulting network graph for the whole stream shows that we were able to create a graph with 96,613 nodes and 168,519 edges. Because of this size, the graph will not be shown: plotting it would take considerable time and computational effort, and the result would most likely be a crowded mess of dots and lines anyway. However, we can still extract some information from the graph object.

The density of a network object is the proportion of present edges from all possible edges in the network. Our present graph has a density of 2.799118e-05. A very low density would mean that there is a very low interaction between our users.
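For a directed graph without self-loops, the number of possible edges is n(n - 1), so density can be sketched as below. This is an assumed formulation shown on a toy graph; it is not guaranteed to match how the figure above was computed (for example, whether repeated edges were counted).

```python
def directed_density(n_nodes, n_edges):
    """Share of possible directed edges (no self-loops) that are present."""
    possible = n_nodes * (n_nodes - 1)
    return n_edges / possible if possible else 0.0


print(directed_density(4, 3))  # 0.25: 3 of the 12 possible directed edges
```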

The diameter of a network graph is the length of the longest shortest path between any two nodes. Considering the direction of the links, the diameter of our network is 14. This means that the two most distant connected users are 14 hops apart, and the path between them passes through 15 users.
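As an illustration, a directed diameter can be found by running a breadth-first search from every node and keeping the longest finite shortest-path length. The adjacency-dictionary input is an assumption of this sketch, and this brute-force approach is only practical for small graphs.

```python
from collections import deque


def directed_diameter(adjacency):
    """adjacency: dict of node -> iterable of successor nodes.

    Returns the longest finite shortest-path length, counted in hops.
    """
    longest = 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for successor in adjacency.get(node, ()):
                if successor not in distance:  # first visit is the shortest path
                    distance[successor] = distance[node] + 1
                    queue.append(successor)
        longest = max(longest, max(distance.values()))
    return longest


chain = {"a": ["b"], "b": ["c"], "c": []}
print(directed_diameter(chain))  # 2 hops across 3 nodes
```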

The hubs and authorities algorithm was developed by Jon Kleinberg to examine the relevance of a web page's content. He categorized pages into hubs and authorities. Hubs, which have more outgoing links, are the internet's catalog. This is similar to the early days of Yahoo, when it touted itself as the internet's yellow pages. Authority pages have more incoming links, presumably because of their high-quality content. Translated to Twitter activity, hub pages fit the description of a user with high retweet influence, and authority pages are similar to a Twitter user who has high mention influence.
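The idea behind hub and authority scores can be sketched with a simple power iteration over unique directed edges. This is an unweighted, illustrative version with an assumed fixed iteration count and L2 normalization; it is not the igraph implementation, whose scaling conventions may differ.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a collection of directed edges."""
    edges = set(edges)  # each distinct link counts once; repeats are ignored
    nodes = {node for edge in edges for node in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)

    def normalize(scores):
        norm = sqrt(sum(value * value for value in scores.values())) or 1.0
        return {node: value / norm for node, value in scores.items()}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of nodes linking in.
        new_auth = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_auth[target] += hub[source]
        auth = normalize(new_auth)
        # Hub score: sum of the authority scores of nodes linked to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = normalize(new_hub)
    return hub, auth
```

Because repeated links collapse into one edge, a user with many distinct links can outscore a user with fewer links that were each activated many times.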

The hub and authority scores of the network graph were derived using a simple igraph function call. The top hub score went to "markbspiegel" while the top authority score went to "Benzinga". This is in contrast to the ranking tables, where the top retweet and mention ranks belong to "philstockworld" and "jimcramer" respectively.

To find out where the discrepancy came from, each node was investigated. Although "markbspiegel" had more unique edges than "philstockworld", if we sum the weights of each unique edge, "philstockworld" still beats "markbspiegel". The same is observed when looking at the edges of "Benzinga" and "jimcramer". The discrepancy is consistent with how web pages are rated, wherein the number of links matters more than the number of times each link was activated. The hub and authority scores also do not take edge weights into account.

To see an actual network graph, we narrow down our selection to the Twitter stream of users tweeting about CA Technologies.

Table 2 shows us the resulting top influentials derived from our ranking method. The first user to cross the three influence categories is "Benzinga". 

Table 2. Top influentials of the CA stream.

Rank TopIndegree TopRT TopMentions
1 CBOE WrigleyTom sam_miller00
2 InvestorIdeas ppprophet diggingplatinum
3 247WallSt TradeZer0 AlertTrade
4 androsForm LMTentarelli LMTentarelli
5 jjjinvesting eWhispers AdaptToReality
6 AlertTrade PersonsPlanet Opinterest
7 DirectorsTalk Boursier_com eWhispers
8 PENNYBUSTER1 bored2tears Le_Revenu
9 scottrade pnoytrader TransitoOK
10 MorningstarInc crosshairtrader DozenStocks
11 Quaikey UTradePH Benzinga
12 PersonsPlanet OpenOutcrier leahanneta
13 Benzinga quack1612 jascapital1
14 airtransat MorningstarInc Jascapitalforex
15 traderstewie SleekMoneycom XFenaux
16 OptionAlert App_sw_ AmericanBanking
17 KimAuclair DividendSheet ConsumerFeed
18 stt2318 InvestirFr SleekMoneycom
19 AltruistWealth iviewmarkets desota
20 MarketCurrents ChinaInvest dailypoliticaln
21 selfmade_harris 1MinuteStock saidjarrah
22 jfahmy ACInvestorBlog Boursier_com
23 daytradingninja Benzinga TickerReport

The network graph of this smaller Twitter stream has 431 nodes and 131 edges.

There is comparatively more interaction between users compared to our initial network object with the density clocking in at 0.0009550531. The diameter is shorter with just 9 hops across 10 nodes.

The resulting hub and authority scores are more consistent with the ranking tables because the actual number of retweets and mentions was low. This time, the number of unique edges was not significantly lower than the total weight of the edges.

Figures 7 and 8 show the network graphs with the nodes adjusted based on the hub and authority scores. The higher the score, the bigger the node size.


Figure 6. CA stream network graph showing the diameter path.



Figure 7. Closeup of network graph with node sizes adjusted based on hub score.



Figure 8. Closeup of network graph with node sizes adjusted based on authority score.


The fractional ranking method was found to be a more realistic measure of a Twitter user's influence. The frequency of interactions between users must be considered in measuring influence, even if those interactions come from the same recurring audience. Repeated interaction simply means that the user is consistent in producing high-quality content that has pass-along value.

For smaller networks, the network graph method may yield additional information that can't be derived from fractional ranking. The key would be to check whether the ratio of the number of edges to the total edge weight is close to 1. The discrepancy between the ranking method and the network graph is expected to be greater when this ratio approaches zero.
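The check described above can be sketched as a one-line ratio. The input layout, a mapping from each unique edge to its interaction count, is an assumption for illustration.

```python
def edge_to_weight_ratio(weighted_edges):
    """weighted_edges: dict of (source, target) -> interaction count.

    Returns unique edges divided by total weight: 1.0 means every link was
    used exactly once; values near 0 mean a few links were used many times.
    """
    total_weight = sum(weighted_edges.values())
    return len(weighted_edges) / total_weight if total_weight else 0.0


print(edge_to_weight_ratio({("a", "b"): 3, ("a", "c"): 1}))  # 0.5
```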




About Author

Oamar Gianan

Oamar Gianan has about 15 years of experience in the information technology industry primarily in cloud computing. He developed a passion for data analysis by working on infrastructure where big data is processed. Before moving to New York,...