Targeting Twitter Influencers Through Social Network Analysis

Posted on Sep 11, 2015


0. Introduction
1. Background
2. Collaboration with Fusion: Data Sources
3. Finding the Superfans
4. Recording Superfans’ Preferences
5. Natural Language Processing and My Recommendation System
6. Limitations


0. Introduction

This project designs a program for charting the influencers and patterns of the Twitter community. My findings will be able to help online media companies identify their most important influencers on Twitter and understand these “superfans’” interests, patterns, and behaviors. Furthermore, recommendation systems can be built based on natural language processing of influencers’ Twitter timelines to suggest content that more compellingly attracts influencers’ attention.

My demo “Twitter Influencers” R Shiny app, shown below, showcases the analysis results for the top 20 retweets in the @thisisfusion timeline:

[Screenshots: the “Twitter Influencers” Shiny app, views (1), (2), and (3)]

1. Background

Recommendation systems are now widely used by news media companies. As evidenced by “Recommended for You” sections and the “Recommendations” plugin on HuffPost Social News, personalized placement of articles on apps and websites now guides users to the posts they are most likely to find interesting.

At the core of news media recommendation systems is the tripartite ability to recommend the right content to the right people at the right time. Understanding who views what, and when, makes real-time recommendation possible. Several families of recommendation algorithms already serve this market--content-based methods and collaborative filtering, for instance.
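To make the collaborative-filtering idea concrete, here is a toy Python 3 sketch (not part of this project's pipeline; the readers, article ids, and similarity choice are all invented for illustration). It recommends the unread articles of the most similar other reader, with similarity measured by Jaccard overlap of reading histories:

```python
def jaccard(a, b):
    """Overlap between two sets of article ids, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target, histories):
    """Suggest articles the target has not read, drawn from the
    reading history of the most similar other user."""
    others = {u: h for u, h in histories.items() if u != target}
    nearest = max(others, key=lambda u: jaccard(histories[target], others[u]))
    return sorted(others[nearest] - histories[target])

histories = {
    "ana":  {"a1", "a2", "a3"},
    "ben":  {"a2", "a3", "a4"},
    "carl": {"a7", "a8"},
}
print(recommend("ana", histories))  # ['a4'] -- ben is ana's closest reader
```

Content-based methods instead compare the text of articles a user has engaged with to the text of candidate articles, which is the direction this project takes in Section 5.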

2. Collaboration with Fusion: Data Sources

Fusion, a joint venture between Univision Communications Inc. and Disney/ABC, is a multi-platform media company that champions a young, diverse, and inclusive world. Social media referrals have contributed over 90% of the web traffic to its online presence.

To better understand Fusion’s social media market segments, two main Fusion data sources were used in this project: (1) the Twitter API (tweet information related to @thisisfusion); (2) the WordPress API (news article information with related tweets). All of this is open data, making the project reproducible. In the future, the model we developed could be modified to analyze Facebook, for instance, as well.

Working with Fusion data sources, my “Twitter Influencers” Shiny app shifts the market focus of recommendation systems from news media outlets to social media networks. The project specifically targets the most influential users of social networks like Twitter--superfans.

3. Finding the Superfans

As is true in real life, Twitter users have varying levels of influence on other people. Different users get different amounts of attention and reaction, even when sending out the same tweet. Observing such differences, I wondered how I might quantify the influence of Twitter users in general. Developing a model for doing so would allow me to identify those users who bear the most influence for Fusion, via @thisisfusion, in particular.


Two sets of metrics were implemented in this project:

(1) Centrality Score (Mining Single Tweets). Based on retweet history over time, and the follow relationships among retweeters, a simple influence graph can be drawn as follows:



shinyServer(function(input, output) {

    # following relationships between retweeters
    # (whether a retweeter follows those who retweeted the same tweet before him)
    rtLinks <- reactive({
        rter_id <- c(fusion$id, rev(intersect(rter_id(), dir(paste(data_path, "friends", sep = "/")))))
        friendShip <- c()
        for (i in 2:length(rter_id)) {
            friend <- intersect(rter_id[1:(i - 1)], read.csv(paste(data_path, "friends", rter_id[i], sep = "/"))$id)
            if (length(friend)) {
                friendShip <- rbind(friendShip, cbind(friend, rep(rter_id[i], length(friend))))
            }
        }
        # map user ids back to screen names; two columns: followed user, follower
        friendShip <- data.frame(matrix(sapply(friendShip, function(x) rters$screenName[rters$id == x]), ncol = 2))
        friendShip
    })

    # centrality score for each retweeter in the influence graph
    alphaCentr <- reactive({
        centra <- sort(alpha_centrality(graph(t(rtLinks()[, c(2, 1)])), alpha = input$alpha), decreasing = TRUE)
        score <- numeric(0)
        for (i in 1:length(names(centra))) {
            score[i] <- influence[influence$screenName == names(centra)[i], ]$score[1]
        }
        centra <- data.frame(Name = names(centra), Centrality = centra, Influence = score)
        centra
    })
})

In the context of a single retweet network, a given user’s Centrality Score indicates how important that user is within the network.
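The app computes this score with igraph's alpha_centrality() in the R snippet above. As a rough, self-contained Python 3 sketch of the underlying idea (illustrative only: the node names, alpha value, and edge direction below are assumptions, not the app's actual data), alpha centrality satisfies the fixed point x[v] = 1 + alpha * sum of x[u] over edges u -> v, which can be solved by simple iteration:

```python
def alpha_centrality(nodes, edges, alpha=0.5, iters=200):
    """Iterate x[v] = 1 + alpha * sum(x[u] for edges u -> v); this
    converges whenever alpha is below 1 / (largest eigenvalue of A)."""
    x = {n: 1.0 for n in nodes}
    for _ in range(iters):
        x = {v: 1.0 + alpha * sum(x[u] for (u, w) in edges if w == v)
             for v in nodes}
    return x

# Toy retweet network: B follows A, C follows B, so each edge points
# from a later retweeter to the earlier retweeter they follow.
scores = alpha_centrality(["A", "B", "C"], [("B", "A"), ("C", "B")])
# scores: A = 1.75, B = 1.5, C = 1.0 -- influence accumulates at A,
# the user whose retweet reached the others
```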

(2) InfluenceFlow Score (Mining Twitter Communities). In the context of a given Twitter community, a particular user’s InfluenceFlow Score indicates how influential that user is within the community.

Not limited to a specific tweet, InfluenceFlow Scores capture overall information flow in particular Twitter communities. Within the Fusion community (signified by @thisisfusion), a given user’s InfluenceFlow Score is calculated as the product of that user’s follower count and the number of times they mentioned @thisisfusion in their most recent 400 tweets.
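The scoring rule itself is a one-liner. Here is a minimal Python 3 sketch of it, assuming the recent timeline is already available as a plain list of tweet texts (the full pipeline below pulls the timeline through tweepy instead):

```python
import re

def influenceflow(follower_count, recent_tweets, handle="thisisfusion"):
    """InfluenceFlow = followers x mentions of the community handle
    (direct mentions plus retweets) in the user's recent timeline."""
    mentions = sum(len(re.findall("@" + handle, text, re.IGNORECASE))
                   for text in recent_tweets)
    return follower_count * mentions

timeline = ["Great story by @thisisfusion!",
            "RT @ThisIsFusion: new piece on immigration",
            "unrelated tweet"]
print(influenceflow(5000, timeline))  # 5000 followers x 2 mentions = 10000
```

Multiplying the two factors means a user scores zero if they never engage with the community, no matter how many followers they have.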


import pandas as pd
import numpy as np
import tweepy
import requests
import re
import time
from tweepy import OAuthHandler
from get_config import get_config

env = get_config()

consumer_key = env.get('CONSUMER_KEY')
consumer_secret = env.get('CONSUMER_SECRET')
access_token = env.get('ACCESS_TOKEN')
access_secret = env.get('ACCESS_TOKEN_SECRET')
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

t = pd.read_csv('fetcher/top100.csv')

# print t.t_id[:3]
# 0    614807708575907840
# 1    618798114825220097
# 2    617840986006401024

def retweet_users_of_a_tweet(tweet_id):
    retweets = api.retweets(tweet_id, 100)
    return [rt.user.id for rt in retweets]  # ids of users who retweeted

# print retweet_users_of_a_tweet(614807708575907840)
# [16877020, ...]

def t_all_tweets(user, n):
    result = []
    count = 0
    for x in range(n):
        tweets = api.user_timeline(id=user, count=200, page=x+1, include_rts=True)
        result += tweets
        count += 1
        if (x+1) % 10 == 0:
            print 'sleep for 90 seconds'
            time.sleep(90)  # back off to respect Twitter API rate limits
        print count, 'of', n, 'pages done'
    return result

def t_mentions(user):
    tweets = t_all_tweets(user, 2) # first 2 pages timeline, 16 pages max
    t_text = ''
    for t in tweets:
        t_text += t.text
    return len(re.findall('(@thisisfusion|@ThisIsFusion)', t_text)) # number of direct mentions + retweets

def t_user_rank(users):
    udic = {}
    count = 0
    for user in users:
        profile = api.get_user(id=user)  # one API call instead of two
        screen_name = profile.screen_name
        follower = profile.followers_count
        mention = t_mentions(user)
        udic[screen_name] = [follower, mention, follower * mention]
        count += 1
        print count, 'of', len(users), 'users added into dictionary'
        if count % 5 == 0:
            print 'sleep for one minute'
            time.sleep(60)  # back off to respect Twitter API rate limits
    return udic

def t_tweets_influencers(n):
    count = 0
    for i in range(n):
        udic = t_user_rank(retweet_users_of_a_tweet(t.t_id[i])) # up to 100 retweeters
        follower = [udic.values()[x][0] for x in range(len(udic))]
        mention = [udic.values()[x][1] for x in range(len(udic))]
        score = [udic.values()[x][2] for x in range(len(udic))]
        keys = udic.keys()
        t_id = [t.t_id[i] for x in range(len(udic))]
        if not i:
            # first tweet: initialize the result columns
            newdic = {'t_id': t_id, 'influencer': keys, 'score': score,
                      'mention': mention, 'follower': follower}
        else:
            # later tweets: append to the existing columns
            newdic['t_id'] += t_id
            newdic['influencer'] += keys
            newdic['score'] += score
            newdic['mention'] += mention
            newdic['follower'] += follower
        count += 1
        print '-------', count, 'of', n, 'tweets analyzed', '-------'
    return newdic

result = t_tweets_influencers(20) # top 20 popular tweets, 100 max
df = pd.DataFrame(result)
df.to_csv('influencers(20 posts).csv', encoding='utf-8')

print 'project is done!'

4. Recording Superfans’ Preferences

Having gathered the InfluenceFlow Scores for all retweet users, one quick way to categorize users is to associate retweets with interests--categorical information consisting of references to sections, classes, categories, etc.--culled from articles and hashtags alike. By grouping users according to their interests, editors can easily find the influencers of any topic or milieu--any category, any community, any interest.

The model for collecting superfan data from every tweet runs as follows:

  • Scores
    • Centrality – influence within one tweet
    • InfluenceFlow – overall influence
  • Interests
    • Twitter Hashtags
    • Article Sections (news/justice/voices…)
    • Article Topics (drugs/transgender/mexico…)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tweepy
import requests
from collections import defaultdict
import re
import time

from tweepy import OAuthHandler
from get_config import get_config

env = get_config()

consumer_key = env.get('CONSUMER_KEY')
consumer_secret = env.get('CONSUMER_SECRET')
access_token = env.get('ACCESS_TOKEN')
access_secret = env.get('ACCESS_TOKEN_SECRET')
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

t = pd.read_csv('fetcher/top100.csv')
u = pd.read_csv('influencers/influencers(20 posts).csv')

t1 = t.loc[:,['t_id','w_tags_section','w_tags_topic','t_hashtags']].iloc[:20,:]

sections = []
for i in t1.w_tags_section:
    if type(i) == str:
        sections.extend(re.sub('"', '', i).split(","))
sections = list(set(sections))

topics = []
for i in t1.w_tags_topic:
    if type(i) == str:
        topics.extend(re.sub('"', '', i).split(","))
topics = list(set(topics))

hashtags = []
for i in t1.t_hashtags:
    if type(i) == str:
        hashtags.extend(i.split(","))
hashtags = list(set(hashtags))

# set influence score threshold as 2000 (about 8% ~ 9% in the top)
u1 = u.loc[u.score>=2000]
u1 = u1.loc[:,['t_id','influencer']]

index = list(set(u1.influencer.values))
users = pd.Series(np.zeros(len(index)),index=index)

# check out how the result mapping looks like in mapping.csv
mapping = dict()
for section in sections:
    mapping[section] = users
for topic in topics:
    mapping[topic] = users
for hashtag in hashtags:
    mapping[hashtag] = users
mapping = pd.DataFrame(mapping)

df = pd.merge(t1,u1)

for row_index, row in df.iterrows():
    features = []
    if type(row['w_tags_section']) == str:
        features.extend(re.sub('"', '', row['w_tags_section']).split(","))
    if type(row['w_tags_topic']) == str:
        features.extend(re.sub('"', '', row['w_tags_topic']).split(","))
    if type(row['t_hashtags']) == str:
        features.extend(row['t_hashtags'].split(","))
    for feature in features:
        mapping.loc[row['influencer'], feature] += 1

print '\n', '------All features extracted from your top 20 retweets-------', '\n'

print 'Website sections:', sections, '\n'
print 'Website topics:', topics, '\n'
print 'Twitter hashtags:', hashtags, '\n'

while True:
    m = raw_input('***** Which one to query? Choose from sections/topics/hashtags:')
    if m == 'sections':
        m = sections
        break
    elif m == 'topics':
        m = topics
        break
    elif m == 'hashtags':
        m = hashtags
        break
    else:
        print 'Wrong format!'

print '\n', '***** Please choose one item from', m, ':'
n = raw_input('')
print '\n', '------Your Superfans who ever participated in that topic-------', '\n'
print mapping[(mapping[n]>0)]
print '\n', '------Influence Rank-------', '\n'

influencer = mapping[(mapping[n]>0)].index
x = pd.DataFrame({'influencer': list(influencer)})
result = pd.merge(x,u).loc[:,['influencer','follower','mention','score']].sort_values(by='score',ascending=False)

print result

5. Natural Language Processing and My Recommendation System

To make my recommendation system more robust, two questions need to be answered properly in the future.

First, is there a better way to predict users’ future interests?

I used users’ past retweets as a basis to understand their interests. But the limited retweet records from one user may mislead the algorithm to give unbalanced weights to different topics. A more robust method would involve analyzing users’ own timelines using natural language processing (NLP) tools.

Analyzing users’ timelines would give our recommendation system the full range of user behavior. The system would be able to get to know a user’s interest from the full set of their past behaviors, as expressed in natural language.

As an experiment toward this end, we used the NLP tool AlchemyAPI to extract both scores and interests from user timelines.


from __future__ import print_function
from alchemyapi import AlchemyAPI
import pandas as pd
import numpy as np
import json

i = pd.read_csv('itext.csv')

# Create the AlchemyAPI Object
alchemyapi = AlchemyAPI()

def get_subtypes(n):
    mining = dict()
    response = alchemyapi.entities('text', i.text[n], {'sentiment': 1})
    if response['status'] == 'OK':
        for entity in response['entities']:
            # each entity carries a relevance weight in [0, 1];
            # accumulate it per disambiguated subtype
            if 'disambiguated' in entity.keys():
                if 'subType' in entity['disambiguated'].keys():
                    for subtype in entity['disambiguated']['subType']:
                        mining[subtype] = mining.get(subtype, 0) + float(entity['relevance'])
    else:
        print('Error in entity extraction call: ', response['statusInfo'])
    return mining

def match_all(num):
    al = pd.DataFrame()
    for n in range(num):
        usern = pd.DataFrame(get_subtypes(n),index=[i.influencer[n]])
        al = al.append(usern)
    return al

al = match_all(len(i.influencer))
print('Project is done!')

Second question: is there an algorithm to suggest personalized recommendations to superfans?

Again, utilizing the prevalence of natural language on Twitter will help. A content-based recommendation system could be built in the following steps:

(1) Mine text on all superfans’ timeline content
(2) Cluster texts on vectors from AlchemyAPI
(3) Observe distributions of superfan scores and interests in different clusters
(4) Match vectorized user timeline text and article text using cosine similarity to give content-based recommendations.
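Step (4) can be sketched in a few lines of Python 3. This toy version substitutes simple bag-of-words term counts for AlchemyAPI's feature vectors, and the timeline and article texts are invented for illustration:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a stand-in for AlchemyAPI vectors)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values())) *
            math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def recommend(timeline_text, articles):
    """Return the article most similar to the superfan's timeline."""
    u = vectorize(timeline_text)
    return max(articles, key=lambda a: cosine(u, vectorize(a)))

articles = ["tonight's sports scores and highlights",
            "new immigration policy at the mexico border"]
print(recommend("tweets about immigration and the mexico border", articles))
# -> "new immigration policy at the mexico border"
```

With real data, the same matching would run over clustered, vectorized timelines rather than raw strings, but the core operation is still this cosine comparison.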

6. Limitations

Because of time and data limitations, results from this project may be subject to certain biases. Within the three-week timeline, I focused mainly on identifying and classifying superfans and did not include time-based analyses. Twitter API rate and query limits, combined with a lack of infrastructural support, meant it took one to two days to gather the basic data for just 20 tweets; I also could not retrieve more than 3,200 tweets per user or 100 retweets per tweet, which limited the flexibility of popularity analyses and caused frequent data consistency issues.

Despite these constraints, I was able to envision frameworks for many recommendation system features. Social media networks continue to provide a rewarding avenue for data science.
