Most of the data readily available in the real world comes unlabeled. Getting labels often entails manual classification, which can be a tedious and expensive process. When dealing with unlabeled data, we are limited to unsupervised machine learning techniques since we do not have a definite response variable to work with. Unsupervised learning is generally used as an exploratory data analysis tool to understand the data. I decided to work with unlabeled Twitter data in the form of tweets to attempt to understand human thoughts and interactions. A business or organization can benefit a great deal from understanding the individuals in its target audience and then leveraging this to connect with them on a more personal level. Thus, an automated classification system for such unlabeled tweets can serve as a valuable tool for business improvement.
The classification of tweets is a multi-step process. It starts with preprocessing the unlabeled tweets, followed by applying unsupervised machine learning algorithms for clustering and topic modeling. Once the tweet groupings are of the desired quality, one can proceed to manually label them based on similarity in context. These labeled tweets can then be fed into supervised machine learning models, which learn the underlying similarities of tweets belonging to each category. A fully trained tweet classification model can then be applied to unseen data in an automated tweet labeling system. A visualization of the steps involved can be seen below.
The steps outlined in red form the focus of this blog. Only once reproducibility is achieved in the clustering and topic modeling of tweets can one proceed to the supervised machine learning model.
The data for this project were obtained directly using Twitter's Streaming API, which allows access to Twitter's global stream of tweet data. The API gives the user control over what is downloaded by specifying the language as well as keywords used in the tweets. For the purpose of the project, I was interested in English tweets. For some of my data I used the keyword search option to narrow the focus of the tweets being downloaded. The data conveniently downloads in JSON format, which can easily be parsed into a Python dictionary.
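As a rough illustration of the parsing step, here is a minimal sketch that assumes the stream was saved as newline-delimited JSON, one tweet per line. The sample records and the `parse_stream` helper are hypothetical, and only a few fields are shown.

```python
import json

# Hypothetical sample: two newline-delimited JSON records shaped like
# streamed tweets. Real payloads carry many more fields.
raw_stream = """
{"id": 1, "lang": "en", "text": "RT @user: Training for my first marathon! https://t.co/abc"}
{"id": 2, "lang": "en", "text": "Election results are coming in tonight"}
"""

def parse_stream(raw):
    """Parse newline-delimited JSON into a list of dictionaries."""
    tweets = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
    return tweets

tweets = parse_stream(raw_stream)
# Keep only tweets flagged as English (assumes a "lang" field is present).
english = [t for t in tweets if t.get("lang") == "en"]
```

Each element of `tweets` is an ordinary dictionary, so fields such as the text can be read with `tweets[0]["text"]`.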
For analysis, the data was then transformed from dictionary format into a Pandas dataframe. Preprocessing the tweets was a nontrivial process since they tend to differ in structure from standard text documents. There tends to be more use of slang and shorthand to get the message across within the 140-character limit. For that reason, one must be careful to remove unnecessary elements from each tweet while retaining features that contribute to its meaning and context.
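The move from dictionaries to a dataframe could look something like the sketch below. The field names (`id`, `text`, `user.screen_name`) and the sample records are assumptions for illustration, not the exact columns used in this project.

```python
import pandas as pd

# Hypothetical parsed tweets; the field names are assumptions.
tweets = [
    {"id": 1, "text": "Training for my first marathon!",
     "user": {"screen_name": "runner1"}},
    {"id": 2, "text": "Election results tonight",
     "user": {"screen_name": "voter2"}},
]

# Pull out only the fields needed for text analysis, flattening the
# nested user dictionary into its own column.
df = pd.DataFrame(
    {
        "id": [t["id"] for t in tweets],
        "text": [t["text"] for t in tweets],
        "screen_name": [t["user"]["screen_name"] for t in tweets],
    }
)
```

Selecting columns explicitly keeps the dataframe small, since a full tweet record contains many fields that are irrelevant to clustering.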
I used regular expressions to remove the characters "RT" from any tweet because keeping them would lead to retweets being clustered together, which would be at odds with the purpose of my analysis. I also removed user mentions, which can be identified by the "@" preceding the user's name. Similarly, I removed any links that were present. Finally, I used Python's NLTK package to remove stop words such as "is", "the", and "are". Stop words are grammatically essential to sentence structure but contribute very little to the context of a sentence. Once the data was cleaned up, it was ready for machine learning.
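The cleaning steps above can be sketched roughly as follows. The regular expressions are my own approximations rather than the exact patterns used, and the short `STOP_WORDS` set is a stand-in so the example runs without downloading anything; NLTK's English stop word list is much longer.

```python
import re

# Small illustrative stop word set (an assumption for this sketch);
# NLTK's English list would be used in practice.
STOP_WORDS = {"is", "the", "are", "a", "an", "and", "to", "of", "in", "for", "my"}

def clean_tweet(text):
    """Strip links, mentions, the RT marker, and stop words from a tweet."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"\bRT\b", " ", text)        # retweet marker
    # Lowercase and keep word-like tokens; hashtags are kept intact here.
    tokens = re.findall(r"[#\w']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_tweet("RT @user: Training for my first marathon! https://t.co/abc")
# cleaned == "training first marathon"
```

Links are removed first so that the later patterns never see fragments of a URL. Whether to keep hashtags is a judgment call; this sketch keeps them because they often carry the topic of the tweet.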
The unsupervised learning algorithms used for this analysis include Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) for topic modeling, and K-means for clustering of tweets. Conveniently, all three are available in Python's scikit-learn package.
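For readers who want to try this, here is a minimal sketch of fitting all three models with scikit-learn on a toy corpus. The choice of raw counts for LDA and TF-IDF for NMF and K-means is a common convention and an assumption on my part, as are the toy tweets and the number of groups.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for cleaned tweets (hypothetical data).
docs = [
    "trump comey president testimony",
    "president trump election vote",
    "music video concert tickets",
    "new music album video",
    "gym workout fitness health",
    "health fitness diet workout",
]

counts = CountVectorizer().fit_transform(docs)  # term counts for LDA
tfidf = TfidfVectorizer().fit_transform(docs)   # TF-IDF for NMF and K-means

n_groups = 3  # number of topics/clusters, chosen by hand here

lda = LatentDirichletAllocation(n_components=n_groups, random_state=0)
doc_topics_lda = lda.fit_transform(counts)      # shape: (n_docs, n_groups)

nmf = NMF(n_components=n_groups, init="nndsvd", random_state=0, max_iter=500)
doc_topics_nmf = nmf.fit_transform(tfidf)       # shape: (n_docs, n_groups)

kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf)              # one cluster id per tweet
```

In practice the number of topics or clusters has to be tuned by inspecting the results, and a corpus this small is only useful for checking that the pipeline runs.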
The K-means clustering algorithm assigns each tweet to exactly one of the specified number of clusters, which could be problematic if a given tweet fell into more than one category. Despite this limitation, K-means did surprisingly well in grouping most tweets based on similarities. However, there would always be a "catch-all" cluster containing any tweets that the algorithm was unable to properly classify. In this regard, the topic modeling algorithms allowed the flexibility of assigning multiple topics to a tweet, each weighted by how likely the tweet is to belong to that topic. The following sections of this blog will show how these algorithms performed with different sets of tweets.
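The difference between a single cluster label and a set of topic weights can be seen in a small sketch like the one below. The toy tweets are hypothetical, and normalizing the NMF weights so each row sums to one is my own convenience for reading them as shares, not something NMF does by itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for cleaned tweets; the first one mixes politics and travel.
docs = [
    "trump travel ban airport protest",
    "travel beach vacation flight",
    "trump president comey testimony",
    "flight vacation hotel beach",
]
X = TfidfVectorizer().fit_transform(docs)

# K-means: exactly one cluster id per tweet.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# NMF: a non-negative weight for every topic, per tweet.
W = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500).fit_transform(X)

# Rescale each row to sum to 1 (rows of all zeros are left as zeros).
row_sums = W.sum(axis=1, keepdims=True)
weights = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
```

Here `hard` holds one integer per tweet, while each row of `weights` spreads a tweet across both topics, so a tweet that touches two subjects can carry a meaningful weight in each.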
The first set of tweets I tested was obtained using different keyword searches related to health & fitness, travel, sports, news/politics, art, music, video games, lifestyle, dating/relationships, wedding/marriage, and other life-changing events. This was a test to check if the unsupervised learning methods would group the tweets based on the keywords that were used to get them. A word cloud of the hashtags associated with this set of tweets can be seen below.
LDA: Below we have the top words used to determine topic using LDA. Topic 5 is clearly related to US politics as it includes words such as trump, comey, and president. Topic 12 generalizes to entertainment by including the words music, video, sports, and gameplay. Topic 8 seems to be a confused mixture of travel and politics, which could partially be due to the mention of President Trump's travel ban.
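Top words per topic, like those discussed here, can be read off a fitted model's `components_` matrix. The sketch below uses NMF on a toy corpus for brevity; the same `top_words` helper (a hypothetical name) works with a fitted scikit-learn LDA model, since both expose `components_` with one row per topic.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words(model, feature_names, n_top=5):
    """Return the n_top highest-weighted words for each topic."""
    topics = []
    for topic_weights in model.components_:
        top_idx = np.argsort(topic_weights)[::-1][:n_top]  # largest first
        topics.append([feature_names[i] for i in top_idx])
    return topics

# Toy stand-ins for cleaned tweets (hypothetical data).
docs = [
    "trump comey president testimony",
    "president trump election vote",
    "music video concert tickets",
    "new music album video",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Order the vocabulary by column index so names line up with components_.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500).fit(X)
topics = top_words(model, feature_names, n_top=3)
```

Printing `topics` gives one short word list per topic, which is the kind of summary used to judge whether a topic is coherent.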
NMF: For this set of tweets NMF performed substantially better than LDA as can be seen with the topic assignments below. The top words for each topic relate very well to the keywords used in my search. Topics 1, 3, 5, 7, 9, and 13 clearly show relevance to jobs/hiring, wedding, art, sports, health & fitness, and politics respectively. There were some unclear classifications with certain topics, but overall NMF performed very well in this case.
K-means: Clustering using the K-means algorithm also outperformed LDA in coherent grouping of tweets. With the exception of a few "not so good" clusters, grouping based on sports, politics, music, and health can be seen in clusters 5, 8, 11, and 13 respectively.
As mentioned earlier, cluster 9 was the catch-all cluster containing all tweets which did not fall into any of the other clusters.
The next set of tweets for cluster analysis was streamed with the only restriction being that the tweets were in English. Given the absence of intentional underlying categories, it was interesting to see how the algorithms would perform in grouping randomly obtained tweets. Below is a word cloud of the hashtags of these tweets.
LDA: For the open-streaming tweets, LDA performed much better than it did on the previous set of tweets. Topic 2 is related to music entertainment, and topic 3 is specific to tweets about the elections in the United Kingdom. Topic 28 seems to talk about jobs, hiring, and careers. Topics 4 and 7 are somewhat unclear in determining the context of the tweets.
NMF: Once again NMF outperformed LDA in topic assignment, as can be seen from the clear context of most of the topics determined using this algorithm. Topics 0, 2, 7, 8, and 10 clearly relate to entertainment, UK elections, career, sports, and giveaway advertisements, respectively. Topic 11 refers to the tragedy at the Ariana Grande concert in Manchester, UK, and topic 29 speaks of events related to FBI director Comey's testimony regarding President Trump and Russia.
K-means: K-means did a good job of clustering tweets by category, as can be seen in some of the clusters below. Cluster 5 grouped tweets mentioning different genres of music together in a music-related cluster. Clusters 14 and 25 group on the discussion of UK and US politics, respectively.
The catch-all cluster in this case was cluster 22, which contained a good portion of the unclassified tweets.
The final set of tweets I obtained was from users following specific organizations and companies. The purpose of narrowing my focus to user tweets was to detect whether multiple tweets from the same user can provide insights into topics of interest to that user. For this set I obtained about 100 tweets each from 2000 followers of organizations/companies such as Barclays, Fitbit, Tinder, the Democratic and Republican parties, and the Economist. For this blog I will discuss my analysis of tweets from followers of the Democratic or Republican Party.
Using LDA topic modeling, and visualizing some of the tweets associated with the assigned topics, we can see the opposing views from followers of two competing entities. Followers of the Democratic Party seem to be more critical of President Trump in their tweets, while followers of the Republican Party are more supportive of President Trump and more critical of Hillary Clinton and former President Obama.
Future work on this project will involve iterating between cleaning the text of the tweets and rerunning the topic modeling and clustering algorithms discussed in this blog until the groupings are as coherent as possible. Afterwards, supervised learning techniques can be used to train a model to classify new tweets based on pre-existing or newly added categories.
The scope of this project and its methodology can be extended to include image and video analysis, and to cover posts from other forms of social media such as Facebook, Instagram, and Snapchat. The more data that can be added to this model, the better it will serve as a valuable resource for organizations seeking to understand their current and potential customers.