Targeting Twitter Influencers Through Social Network Analysis

Posted on Sep 11, 2015


0. Introduction
1. Background
2. Collaboration with Fusion: Data Sources
3. Finding the Superfans
4. Recording Superfans’ Preferences
5. Natural Language Processing and My Recommendation System
6. Limitations


0. Introduction

This project designs a program for charting the influencers and patterns of the Twitter community. My findings will be able to help online media companies identify their most important influencers on Twitter and understand these “superfans’” interests, patterns, and behaviors. Furthermore, recommendation systems can be built based on natural language processing of influencers’ Twitter timelines to suggest content that more compellingly attracts influencers’ attention.

My demo “Twitter Influencers” R Shiny app, shown below, showcases the analysis results for the top 20 retweets in the @thisisfusion timeline:

[Screenshots: the “Twitter Influencers” Shiny app, views (1), (2), and (3)]

1. Background

Recommendation systems are now widely used by news media companies. As evidenced by “Recommended for You” sections and the “Recommendations” plugin on HuffPost Social News, personalized placement of articles on apps and websites now guides users to the posts they are most likely to find interesting.

At the core of news media recommendation systems is the tripartite ability to recommend the right content to the right people at the right time. Understanding who views what, and when, makes real-time recommendation possible. Several families of recommendation algorithms already serve this market--content-based methods and collaborative filtering, for instance.
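To make the collaborative-filtering idea concrete, here is a toy Python 3 sketch (not part of this project's pipeline; the readers, article ids, and similarity choice are all invented for illustration). It recommends the unread articles of the most similar other reader, with similarity measured by Jaccard overlap of reading histories:

```python
def jaccard(a, b):
    """Overlap between two sets of article ids, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target, histories):
    """Suggest articles the target has not read, drawn from the
    reading history of the most similar other user."""
    others = {u: h for u, h in histories.items() if u != target}
    nearest = max(others, key=lambda u: jaccard(histories[target], others[u]))
    return sorted(others[nearest] - histories[target])

histories = {
    "ana":  {"a1", "a2", "a3"},
    "ben":  {"a2", "a3", "a4"},
    "carl": {"a7", "a8"},
}
print(recommend("ana", histories))  # ['a4'] -- ben is ana's closest reader
```

Content-based methods instead compare the text of articles a user has engaged with to the text of candidate articles, which is the direction this project takes in Section 5.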

2. Collaboration with Fusion: Data Sources

Fusion, a joint venture between Univision Communications Inc. and Disney/ABC, is a multi-platform media company that champions a young, diverse, and inclusive world. Social media referrals have contributed over 90% of the web traffic to its online presence.

To better understand Fusion’s social media market segments, two main Fusion data sources were used in this project: (1) the Twitter API (tweet information related to @thisisfusion); (2) the WordPress API (news article information with related tweets). All of this is open data, making the project reproducible. In the future, the model we developed could be modified to analyze Facebook, for instance, as well.

Working with Fusion data sources, my “Twitter Influencers” Shiny app shifts the market focus of recommendation systems from news media outlets to social media networks. The project specifically targets the most influential users of social networks like Twitter--superfans.

3. Finding the Superfans

As is true in real life, Twitter users have varying levels of influence on other people. Different users get different amounts of attention and reaction, even when sending out the same tweet. Observing such differences, I wondered how I might quantify the influence of Twitter users in general. Developing a model for doing so would allow me to identify those users who bear the most influence for Fusion, via @thisisfusion, in particular.


Two sets of metrics were implemented in this project:

(1) Centrality Score (Mining Single Tweets). Based on retweet history over time, and the follow relationships among retweeters, a simple influence graph can be drawn as follows:



shinyServer(function(input, output) {

    # following relationships between retweeters
    # (whether a retweeter follows those who retweeted the same tweet before him)
    rtLinks <- reactive({
        rter_id <- c(fusion$id, rev(intersect(rter_id(), dir(paste(data_path, "friends", sep = "/")))))
        friendShip <- c()
        for (i in 2:length(rter_id)) {
            friend <- intersect(rter_id[1:(i - 1)], read.csv(paste(data_path, "friends", rter_id[i], sep = "/"))$id)
            if (length(friend)) {
                friendShip <- rbind(friendShip, cbind(friend, rep(rter_id[i], length(friend))))
            }
        }
        # map user ids back to screen names; two columns: followed user, follower
        friendShip <- data.frame(matrix(sapply(friendShip, function(x) rters$screenName[rters$id == x]), ncol = 2))
        friendShip
    })

    # centrality score for each retweeter in the influence graph
    alphaCentr <- reactive({
        centra <- sort(alpha_centrality(graph(t(rtLinks()[, c(2, 1)])), alpha = input$alpha), decreasing = TRUE)
        score <- numeric(0)
        for (i in 1:length(names(centra))) {
            score[i] <- influence[influence$screenName == names(centra)[i], ]$score[1]
        }
        centra <- data.frame(Name = names(centra), Centrality = centra, Influence = score)
        centra
    })
})

In the context of a single retweet network, a given user’s Centrality Score indicates how important that user is within the network.
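The app computes this score with igraph's alpha_centrality() in the R snippet above. As a rough, self-contained Python 3 sketch of the underlying idea (illustrative only: the node names, alpha value, and edge direction below are assumptions, not the app's actual data), alpha centrality satisfies the fixed point x[v] = 1 + alpha * sum of x[u] over edges u -> v, which can be solved by simple iteration:

```python
def alpha_centrality(nodes, edges, alpha=0.5, iters=200):
    """Iterate x[v] = 1 + alpha * sum(x[u] for edges u -> v); this
    converges whenever alpha is below 1 / (largest eigenvalue of A)."""
    x = {n: 1.0 for n in nodes}
    for _ in range(iters):
        x = {v: 1.0 + alpha * sum(x[u] for (u, w) in edges if w == v)
             for v in nodes}
    return x

# Toy retweet network: B follows A, C follows B, so each edge points
# from a later retweeter to the earlier retweeter they follow.
scores = alpha_centrality(["A", "B", "C"], [("B", "A"), ("C", "B")])
# scores: A = 1.75, B = 1.5, C = 1.0 -- influence accumulates at A,
# the user whose retweet reached the others
```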

(2) InfluenceFlow Score (Mining Twitter Communities). In the context of a given Twitter community, a particular user’s InfluenceFlow Score indicates how influential that user is within the community.

Not limited to a specific tweet, InfluenceFlow Scores capture overall information flow in particular Twitter communities. Within the Fusion community (signified by @thisisfusion), a given user’s InfluenceFlow Score is calculated as the product of that user’s follower count and the number of times they mentioned @thisisfusion in their most recent 400 tweets.
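The scoring rule itself is a one-liner. Here is a minimal Python 3 sketch of it, assuming the recent timeline is already available as a plain list of tweet texts (the full pipeline below pulls the timeline through tweepy instead):

```python
import re

def influenceflow(follower_count, recent_tweets, handle="thisisfusion"):
    """InfluenceFlow = followers x mentions of the community handle
    (direct mentions plus retweets) in the user's recent timeline."""
    mentions = sum(len(re.findall("@" + handle, text, re.IGNORECASE))
                   for text in recent_tweets)
    return follower_count * mentions

timeline = ["Great story by @thisisfusion!",
            "RT @ThisIsFusion: new piece on immigration",
            "unrelated tweet"]
print(influenceflow(5000, timeline))  # 5000 followers x 2 mentions = 10000
```

Multiplying the two factors means a user scores zero if they never engage with the community, no matter how many followers they have.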


import pandas as pd
import numpy as np
import tweepy
import requests
import re
import time
from tweepy import OAuthHandler
from get_config import get_config

env = get_config()

consumer_key = env.get('CONSUMER_KEY')
consumer_secret = env.get('CONSUMER_SECRET')
access_token = env.get('ACCESS_TOKEN')
access_secret = env.get('ACCESS_TOKEN_SECRET')
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

t = pd.read_csv('fetcher/top100.csv')

# print t.t_id[:3]
# 0    614807708575907840
# 1    618798114825220097
# 2    617840986006401024

def retweet_users_of_a_tweet(tweet_id):
    retweets = api.retweets(tweet_id, 100)
    return [rt.user.id for rt in retweets]  # ids of users who retweeted

# print retweet_users_of_a_tweet(614807708575907840)
# [16877020, ...]

def t_all_tweets(user, n):
    result = []
    count = 0
    for x in range(n):
        tweets = api.user_timeline(id=user, count=200, page=x+1, include_rts=True)
        result += tweets
        count += 1
        if (x+1) % 10 == 0:
            print 'sleep for 90 seconds'
            time.sleep(90)  # back off to respect Twitter API rate limits
        print count, 'of', n, 'pages done'
    return result

def t_mentions(user):
    tweets = t_all_tweets(user, 2) # first 2 pages timeline, 16 pages max
    t_text = ''
    for t in tweets:
        t_text += t.text
    return len(re.findall('(@thisisfusion|@ThisIsFusion)', t_text)) # number of direct mentions + retweets

def t_user_rank(users):
    udic = {}
    count = 0
    for user in users:
        profile = api.get_user(id=user)  # one API call instead of two
        screen_name = profile.screen_name
        follower = profile.followers_count
        mention = t_mentions(user)
        udic[screen_name] = [follower, mention, follower * mention]
        count += 1
        print count, 'of', len(users), 'users added into dictionary'
        if count % 5 == 0:
            print 'sleep for one minute'
            time.sleep(60)  # back off to respect Twitter API rate limits
    return udic

def t_tweets_influencers(n):
    count = 0
    for i in range(n):
        udic = t_user_rank(retweet_users_of_a_tweet(t.t_id[i])) # up to 100 retweeters
        follower = [udic.values()[x][0] for x in range(len(udic))]
        mention = [udic.values()[x][1] for x in range(len(udic))]
        score = [udic.values()[x][2] for x in range(len(udic))]
        keys = udic.keys()
        t_id = [t.t_id[i] for x in range(len(udic))]
        if not i:
            # first tweet: initialize the result columns
            newdic = {'t_id': t_id, 'influencer': keys, 'score': score,
                      'mention': mention, 'follower': follower}
        else:
            # later tweets: append to the existing columns
            newdic['t_id'] += t_id
            newdic['influencer'] += keys
            newdic['score'] += score
            newdic['mention'] += mention
            newdic['follower'] += follower
        count += 1
        print '-------', count, 'of', n, 'tweets analyzed', '-------'
    return newdic

result = t_tweets_influencers(20) # top 20 popular tweets, 100 max
df = pd.DataFrame(result)
df.to_csv('influencers(20 posts).csv', encoding='utf-8')

print 'project is done!'

4. Recording Superfans’ Preferences

Having gathered the InfluenceFlow Scores for all retweet users, one quick way to categorize users is to associate retweets with interests--categorical information consisting of references to sections, classes, categories, etc.--culled from articles and hashtags alike. By grouping users according to their interests, editors can easily find the influencers of any topic or milieu--any category, any community, any interest.

The model for collecting superfan data from every tweet runs as follows:

  • Scores
    • Centrality – influence within one tweet
    • InfluenceFlow – overall influence
  • Interests
    • Twitter Hashtags
    • Article Sections (news/justice/voices…)
    • Article Topics (drugs/transgender/mexico…)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tweepy
import requests
from collections import defaultdict
import re
import time

from tweepy import OAuthHandler
from get_config import get_config

env = get_config()

consumer_key = env.get('CONSUMER_KEY')
consumer_secret = env.get('CONSUMER_SECRET')
access_token = env.get('ACCESS_TOKEN')
access_secret = env.get('ACCESS_TOKEN_SECRET')
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

t = pd.read_csv('fetcher/top100.csv')
u = pd.read_csv('influencers/influencers(20 posts).csv')

t1 = t.loc[:,['t_id','w_tags_section','w_tags_topic','t_hashtags']].iloc[:20,:]

sections = []
for i in t1.w_tags_section:
    if type(i) == str:
        sections.extend(re.sub('"', '', i).split(","))
sections = list(set(sections))

topics = []
for i in t1.w_tags_topic:
    if type(i) == str:
        topics.extend(re.sub('"', '', i).split(","))
topics = list(set(topics))

hashtags = []
for i in t1.t_hashtags:
    if type(i) == str:
        hashtags.extend(i.split(","))
hashtags = list(set(hashtags))

# set influence score threshold as 2000 (about 8% ~ 9% in the top)
u1 = u.loc[u.score>=2000]
u1 = u1.loc[:,['t_id','influencer']]

index = list(set(u1.influencer.values))
users = pd.Series(np.zeros(len(index)),index=index)

# check out how the result mapping looks like in mapping.csv
mapping = dict()
for section in sections:
    mapping[section] = users
for topic in topics:
    mapping[topic] = users
for hashtag in hashtags:
    mapping[hashtag] = users
mapping = pd.DataFrame(mapping)

df = pd.merge(t1,u1)

for row_index, row in df.iterrows():
    features = []
    if type(row['w_tags_section']) == str:
        features.extend(re.sub('"', '', row['w_tags_section']).split(","))
    if type(row['w_tags_topic']) == str:
        features.extend(re.sub('"', '', row['w_tags_topic']).split(","))
    if type(row['t_hashtags']) == str:
        features.extend(row['t_hashtags'].split(","))
    for feature in features:
        mapping.loc[row['influencer'], feature] += 1

print '\n', '------All features extracted from your top 20 retweets-------', '\n'

print 'Website sections:', sections, '\n'
print 'Website topics:', topics, '\n'
print 'Twitter hashtags:', hashtags, '\n'

while True:
    m = raw_input('***** Which one to query? Choose from sections/topics/hashtags:')
    if m == 'sections':
        m = sections
        break
    elif m == 'topics':
        m = topics
        break
    elif m == 'hashtags':
        m = hashtags
        break
    else:
        print 'Wrong format!'

print '\n', '***** Please choose one item from', m, ':'
n = raw_input('')
print '\n', '------Your Superfans who ever participated in that topic-------', '\n'
print mapping[(mapping[n]>0)]
print '\n', '------Influence Rank-------', '\n'

influencer = mapping[(mapping[n]>0)].index
x = pd.DataFrame({'influencer': list(influencer)})
result = pd.merge(x,u).loc[:,['influencer','follower','mention','score']].sort_values(by='score',ascending=False)

print result

5. Natural Language Processing and My Recommendation System

To make my recommendation system more robust, two questions need to be answered properly in the future.

First, is there a better way to predict users’ future interests?

I used users’ past retweets as a basis to understand their interests. But the limited retweet records from one user may mislead the algorithm to give unbalanced weights to different topics. A more robust method would involve analyzing users’ own timelines using natural language processing (NLP) tools.

Analyzing users’ timelines would give our recommendation system the full range of user behavior. The system would be able to get to know a user’s interest from the full set of their past behaviors, as expressed in natural language.

As an experiment toward this end, we used the NLP tool AlchemyAPI to extract both scores and interests from user timelines.


from __future__ import print_function
from alchemyapi import AlchemyAPI
import pandas as pd
import numpy as np
import json

i = pd.read_csv('itext.csv')

# Create the AlchemyAPI Object
alchemyapi = AlchemyAPI()

def get_subtypes(n):
    mining = dict()
    response = alchemyapi.entities('text', i.text[n], {'sentiment': 1})
    if response['status'] == 'OK':
        for entity in response['entities']:
            # each entity carries a relevance weight in [0, 1];
            # accumulate it per disambiguated subtype
            if 'disambiguated' in entity.keys():
                if 'subType' in entity['disambiguated'].keys():
                    for subtype in entity['disambiguated']['subType']:
                        mining[subtype] = mining.get(subtype, 0) + float(entity['relevance'])
    else:
        print('Error in entity extraction call: ', response['statusInfo'])
    return mining

def match_all(num):
    al = pd.DataFrame()
    for n in range(num):
        usern = pd.DataFrame(get_subtypes(n),index=[i.influencer[n]])
        al = al.append(usern)
    return al

al = match_all(len(i.influencer))
print('Project is done!')

Second question: is there an algorithm to suggest personalized recommendations to superfans?

Again, utilizing the prevalence of natural language on Twitter will help. A content-based recommendation system could be built in the following steps:

(1) Mine text on all superfans’ timeline content
(2) Cluster texts on vectors from AlchemyAPI
(3) Observe distributions of superfan scores and interests in different clusters
(4) Match vectorized user timeline text and article text using cosine similarity to give content-based recommendations.
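Step (4) can be sketched in a few lines of Python 3. This toy version substitutes simple bag-of-words term counts for AlchemyAPI's feature vectors, and the timeline and article texts are invented for illustration:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a stand-in for AlchemyAPI vectors)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values())) *
            math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def recommend(timeline_text, articles):
    """Return the article most similar to the superfan's timeline."""
    u = vectorize(timeline_text)
    return max(articles, key=lambda a: cosine(u, vectorize(a)))

articles = ["tonight's sports scores and highlights",
            "new immigration policy at the mexico border"]
print(recommend("tweets about immigration and the mexico border", articles))
# -> "new immigration policy at the mexico border"
```

With real data, the same matching would run over clustered, vectorized timelines rather than raw strings, but the core operation is still this cosine comparison.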

6. Limitations

Because of time and data limitations, results from this project may be subject to certain biases. Within the three-week timeline, I focused mainly on identifying and classifying superfans and did not include time-based analyses. Twitter API rate and query limits, combined with a lack of infrastructural support, meant it took one to two days to gather the basic data for just 20 tweets; I also could not retrieve more than 3,200 tweets per user or 100 retweets per tweet, which limited the flexibility of popularity analyses and caused frequent data consistency issues.

Despite these constraints, I was able to envision frameworks for many recommendation system features. Social media networks continue to provide a rewarding avenue for data science.
