Scraping the Political Divide II - Predicting Partisanship with a Supervised Classification Model & Unbalanced Dataset
In my previous post, Scraping the Partisan Divide: Sentiment, Text, & Network Analysis of an Online Political Forum, I describe my process of scraping ~ 250 thousand user posts from an online political discussion forum, as well as my application of various Python packages to perform analyses examining differences between Conservative and Liberal users of the site.
In this post, I will further utilize the scraped data to build a supervised classification model to predict a given post as Conservative or Liberal. I will detail my process of building and evaluating my models to eventually produce my optimal predictive output.
My model's goal:
Various digital organizations, such as digital advertisers, media organizations, lobbyists and political campaigns are highly interested in understanding the political leanings of users commenting on specific social media/digital publication content.
This metric would explain how content is politically-received and thus where these organizations should direct their resources.
To serve this aim, it is crucial that my model is flexible and considers only text data.
This post is a deep dive and is intended to teach the reader! Here is a general outline of the steps:
- Understanding the Data - pre-processing and reducing extreme frequency user bias.
- Building a Baseline model - Multinomial Naive Bayes and Logistic Regression.
- Establishing Performance Measures necessary to evaluate my models - Accuracy, Precision, Recall, and F1
- Specifying Class Weights to account for non-equal data class sizes - vitally important!
- Visualizing my Predictions to quickly evaluate my models relatively - Precision vs. Recall & the ROC curve
- Applying Stochastic Gradient Descent for potential improvements to my Logistic Regression & SVC models.
- Evaluating my Final Models via performative benchmarks.
- Contextually Filtering Data prior to modeling in search of improvements.
Most creatively, in my final step, I examine utilizing news.api to dynamically tag political words related to an input word based on their appearance in news article descriptions.
Check out my git with my full code in Jupyter Notebook here!
Text Data Pre-Processing
Before I can build my model, I first must clean and categorize the scraped user political identification and user post text data.
I performed this cleaning as part of my original data exploration in my first blog post, and re-list the steps here:
- Remove residual JSON xpath text
- Remove special characters with regex
- Classify the listed political ideologies as liberal/conservative
- Keep only posts from users classified as liberal or conservative
Reducing extreme frequency user bias...
I have several important choices to make in determining my dataset used to train and evaluate my model. I must always consider both the composition of my available data and the ultimate goal of my model.
Of the original ~250k posts scraped, ~113k are labeled along the conservative/liberal spectrum as defined in my previous post. However, these ~113k posts are not divided equally between the two classes: they are highly imbalanced, with ~87% of posts being conservative and only ~13% being liberal. This imbalance stems from a few incredibly active conservative users.
Having a huge portion of my data coming from a few users heavily biases my model toward the views of those specific people. As the goal of my model is to generalize over anonymous posts, I limit the number of posts per conservative user, such that no user has more posts than the most active liberal poster.
- Identify the number of posts of the top liberal user (3410 posts)
- Create a list of all conservative users who have a greater number of posts (8 users)
- Build a for loop taking a random sample of 3410 from the posts of each of those users.
- Append those abridged posts back into the total dataframe.
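The steps above can be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not my original notebook code: the column names `user`, `label`, and `text`, and the helper `cap_posts_per_user`, are hypothetical.

```python
import pandas as pd

def cap_posts_per_user(df, cap, user_col="user", label_col="label",
                       capped_label="conservative", seed=42):
    """Randomly downsample every `capped_label` user with more than `cap` posts."""
    counts = df.loc[df[label_col] == capped_label, user_col].value_counts()
    heavy_users = counts[counts > cap].index
    is_heavy = df[user_col].isin(heavy_users)
    parts = [df[~is_heavy]]  # posts from users already under the cap
    for user in heavy_users:
        # keep a random sample of `cap` posts for each high-frequency user
        parts.append(df[df[user_col] == user].sample(n=cap, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy stand-in for the scraped posts (hypothetical columns)
posts = pd.DataFrame({
    "user": ["a"] * 5 + ["b"] * 2 + ["c"] * 3,
    "label": ["conservative"] * 7 + ["liberal"] * 3,
    "text": [f"post {i}" for i in range(10)],
})

# The cap is the post count of the most active liberal user
cap = posts.loc[posts["label"] == "liberal", "user"].value_counts().max()
capped = cap_posts_per_user(posts, cap)
print(capped["user"].value_counts())
```

On the real data the cap would be 3410, the post count of the top liberal user.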
While conservative users still account for more posts overall, this greatly limits the influence of those few high-frequency users.
While I will perform my modeling with this data, I also recognize that, because this particular forum has an established community of long-time, frequent users, many posts are purely social in nature and removed from politics.
Therefore, I hypothesize that filtering posts by some sort of politics-related threshold would both improve the model and align it more closely with my ultimate goal of predicting partisanship in more anonymous user environments.
In my final step, I will build additional models via a keyword-contextual filtering approach.
Building a Baseline
After cleaning my text data, I must establish a baseline for my data pre-processing as well as model parameterization and scoring techniques. Establishing this baseline will validate the majority of my key modeling decisions so I can then quickly compare other models later on.
Train Test Split
I chose to work with my initial ~113k post dataset for my baseline and, because of the large size of the dataset, decided to isolate 1/3 as my test set using train_test_split() from sklearn.model_selection.
Since my output classes are considerably unequal, I make sure I code my less frequent class (liberal) as the positive outcome (1), as the metrics in sklearn (discussed ahead) are geared toward evaluating the positive class.
I also include the optional 'stratify' parameter to stratify my sample by the output class, ensuring that both classes appear in the training and test sets in the same proportions and thus reducing my model's ultimate bias.
To avoid snooping bias and bias in my final model, and as a best practice, I will from now on isolate my test data until my final evaluation! I will evaluate my training model and its hyper-parameters using cross-validation of the training data.
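As a minimal sketch of the split described above, on toy data: the `texts` and `labels` lists and the `random_state` value are illustrative assumptions, and 1 marks the minority (liberal) class.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data: 1 = liberal (minority, positive class), 0 = conservative
texts = [f"post {i}" for i in range(30)]
labels = [1] * 9 + [0] * 21

# Hold out one third of the data, keeping the class ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=1/3, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```

With `stratify=labels`, the share of positive instances is the same in the training and test sets, which matters most when the positive class is small.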
Quantifying Text Data via Natural Language Processing (NLP)
For me personally, finding insight by quantifying ultimately qualitative human speech is one of the most exciting and fascinating components of data science. Gaining experience modeling with NLP was one of my main motivations in choosing this project.
CountVectorizer creates a sparse matrix, splitting the text of each post into individual words (by default, tokens are runs of two or more alphanumeric characters, with whitespace and punctuation acting as separators), and then counting, for each post, how often each word appears.
The 'Tfidf' of TfidfTransformer stands for "Term Frequencies - Inverse Document Frequencies" and consists of two methods that build upon the work of CountVectorizer.
The first component, "Term Frequencies", divides the frequency of each word in each post by the total number of words within that post. This converts the values of the matrix to percentages and removes CountVectorizer's bias toward longer documents, which naturally have higher word frequency counts.
The second component, "Inverse Document Frequencies", multiplies the Term Frequencies score by a second metric, the inverse of the frequency with which each word appears amongst ALL posts. Words that appear frequently across all posts will have a high frequency, and thus a low inverse frequency, resulting in a lower overall score when multiplied by the Term Frequencies score. More uncommon words are thus given more weight.
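Here is a small sketch of the two steps on three made-up posts. Note that scikit-learn's defaults differ slightly from the simplified description above: TfidfTransformer uses a smoothed inverse document frequency and rescales each row to unit length (L2 norm) rather than dividing by the post's word count.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up posts
posts = [
    "taxes taxes taxes are too high",
    "healthcare is a right",
    "taxes fund healthcare",
]

vect = CountVectorizer()
counts = vect.fit_transform(posts)          # sparse matrix of raw word counts

tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)  # re-weighted by idf, then normalized

taxes = vect.vocabulary_["taxes"]  # column index of each word
high = vect.vocabulary_["high"]

print(counts[0, taxes])                   # raw count of 'taxes' in the first post
print(tfidf_transformer.idf_[taxes])      # 'taxes' appears in 2 of 3 posts
print(tfidf_transformer.idf_[high])       # 'high' appears in 1 of 3 posts, so it weighs more
```

The single-letter word "a" is dropped by the default tokenizer, which is why the vocabulary has 8 entries rather than 9.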
Baseline Supervised Classification Models
As a supervised classification problem, there are a variety of available models to structure my data. While I ultimately want to test against a variety of models, for the sake of efficiency, I will determine my modeling assumptions upon a select baseline, and later compare competing models against that baseline utilizing those same assumptions.
For my baseline, I chose two common models, Multinomial Naive Bayes Classifier and Logistic Regression.
Multinomial Naive Bayes Classifier is a specific instance of the more general Naive Bayes Classifier, one that models the features (here, word counts) with a multinomial distribution. The Naive Bayes Classifier more generally is built upon Bayes' Theorem of conditional probabilities. It is advantageous in that it is relatively simple and naturally suited to counts of text, but disadvantageous in that it relies on strong independence assumptions that are known to be false (it is thus 'naive').
Logistic Regression, in turn, is a linear model whose output is transformed via a sigmoid function to produce a value between 0 and 1 signifying the probability that each instance belongs to the positive class. As a linear model, it is advantageous in that it produces a coefficient for each term and that its log loss cost function is convex, so optimization is guaranteed to find the global minimum. It is disadvantageous in that it assumes the relationship between the features and the log-odds of the outcome is inherently linear.
Sequencing via Pipelines
Scikit-Learn provides a Pipeline class, allowing easy processing of sequential transformation and modeling steps. It takes in a list of name/estimator pairs, requiring that all but the last estimator be transformers (i.e., that they implement fit and transform methods).
Using pipelines is a modeling best practice and I use it to combine my CountVectorizer() and TfidfTransformer() with my chosen models.
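A minimal sketch of such pipelines, fit on four made-up posts. The step names (`vect`, `tfidf`, `clf`), the helper `build_pipeline`, and `max_iter=1000` are illustrative choices, not settings taken from my notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

def build_pipeline(model):
    """Chain word counting, TF-IDF weighting, and a final classifier."""
    return Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", model),
    ])

nb_pipe = build_pipeline(MultinomialNB())
lr_pipe = build_pipeline(LogisticRegression(max_iter=1000))

# Made-up training data: 0 = conservative, 1 = liberal
texts = ["cut taxes now", "lower taxes please",
         "expand healthcare", "healthcare for all"]
labels = [0, 0, 1, 1]

nb_pipe.fit(texts, labels)
lr_pipe.fit(texts, labels)

print(nb_pipe.predict(["taxes"]), lr_pipe.predict(["healthcare"]))
```

Because the vectorizer sits inside the pipeline, cross-validation refits it on each training fold, so no vocabulary leaks in from the held-out fold.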
Establishing Performance Measures
Evaluating a supervised classification model's effectiveness lies in comparing the model's predictions against the actual breakdown of positive and negative classes.
This breakdown is made most explicit via a 2x2 confusion matrix, in which the columns indicate the true output values, and the rows indicate the model's output predictions (always pay attention, as the rows and columns are sometimes switched!). Each cell represents a combination of the two: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
Here is a quick mock-up via a Pandas DataFrame:
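The original mock-up is not reproduced here, so the following is a reconstruction sketch that follows the layout described above: columns hold the actual classes and rows hold the predictions. The label strings are my own.

```python
import pandas as pd

# Rows = model predictions, columns = actual classes (as described in the text)
confusion_mockup = pd.DataFrame(
    [["True Positive (TP)", "False Positive (FP)"],
     ["False Negative (FN)", "True Negative (TN)"]],
    index=["Predicted Positive", "Predicted Negative"],
    columns=["Actual Positive", "Actual Negative"],
)
print(confusion_mockup)
```

Be aware that sklearn.metrics.confusion_matrix uses the opposite orientation: rows are the true classes, columns are the predictions, and labels are sorted, so for 0/1 labels the output reads [[TN, FP], [FN, TP]].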
Evaluative metrics, such as accuracy, precision, recall, and specificity, are composed of various sub-components of the confusion matrix. As my model's original output will show, evaluating these metrics in tandem, as opposed to relying on a single metric, is critical to interpreting the model's success.
Accuracy measures the correctness of all predictions, regardless of class. It is measured as (TP + TN)/(TP + FP + TN + FN).
Precision measures the correctness of the positive predictions. It is measured as TP / (TP + FP).
Recall measures the share of actual positive instances that the model correctly identifies. It is measured as TP / (TP + FN).
Precision and Recall exist in a duality, with a tradeoff in which tuning your model in favor of one generally decreases the other. As the threshold value that distinguishes between a negative and positive prediction is increased, the model makes fewer positive and more negative predictions.
With more negative predictions, there are fewer false positives and more false negatives to be had: Precision generally increases while Recall decreases.
In contrast, lowering the threshold value results in more positive and fewer negative predictions (both accurate and inaccurate): Precision decreases and Recall increases.
The F1 score combines precision and recall using the harmonic mean, which gives greater weight to lower values. Thus to have a high F1 score, a model must have both a high Precision and Recall.
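To make the formulas concrete, here is a small worked sketch using hypothetical confusion-matrix counts (they are not results from my models). It shows how a model can have decent accuracy and precision while recall drags the F1 score down.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 28 actual positives, of which only 10 are found
metrics = classification_metrics(tp=10, fp=2, tn=70, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.800 and precision is 0.833, yet recall is only 0.357, so the harmonic mean pulls F1 down to 0.500.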
Let's evaluate the confusion matrix and scores on the initial two models:
Despite decently high precision, both models have low recall scores (incredibly low for Naive Bayes) which pushes down their F1 score.
As you can see from the confusion matrix, almost all instances are being predicted toward the negative class. This is not an error, but the models following the "best-guess-is-majority-class-if-everything-else-is-equal" directive.
To overcome this, I need to specify the class weights.
Specifying Class Weights
Sklearn provides parameters to specify class weights, and this process works differently depending on the model.
Multinomial Naive Bayes provides a 'class_prior' parameter:
Therefore, I need to manually specify the distribution of the classes to later input into the model.
Logistic Regression provides a 'class_weight' parameter:
Setting the parameter equal to "balanced" will fulfill my purposes.
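A sketch of how these two parameters can be set. The uniform prior `[0.5, 0.5]` passed to `class_prior` is an assumption for illustration (the exact values used in my notebook are not shown in this post), and the toy `y_train` simply mimics an 87/13 split.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mimicking an 87% / 13% class imbalance
y_train = np.array([0] * 87 + [1] * 13)

# Naive Bayes: override the priors learned from the data
# (assumption: a uniform prior over the two classes)
nb_balanced = MultinomialNB(class_prior=[0.5, 0.5])

# Logistic Regression: weight each class inversely to its frequency
lr_balanced = LogisticRegression(class_weight="balanced", max_iter=1000)

# What "balanced" works out to: n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
print(dict(zip([0, 1], weights)))
```

With this split, errors on the minority class are penalized roughly 6.7 times more heavily than errors on the majority class.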
I now rerun my pipelines with these parameters set and examine my new confusion matrices and scores:
As both models are now predicting a far greater number of instances in the positive class, Precision now drops greatly while Recall rises. The harmonic mean F1 scores are much higher.
Visualizing my Predictions:
I will next present three main classification visualizations used to evaluate model outputs:
- Precision and Recall vs. Decision Threshold
- Precision vs. Recall
- the ROC Curve
To build these graphs, I must first obtain the decision scores for each of my models utilizing the 'cross_val_predict()' function.
In sklearn, there are generally two ways that models output these scores (1. method='decision_function' and 2. the second column of method='predict_proba'), and my two baseline models showcase each way.
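A minimal sketch of both approaches on a dozen made-up posts. The data, the pipeline step names, and `cv=3` are illustrative assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def build_pipeline(model):
    return Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", model)])

# Made-up posts: 0 = conservative, 1 = liberal
texts = ["cut taxes", "lower taxes", "taxes are theft",
         "secure the border", "border wall now", "taxes and border",
         "expand healthcare", "healthcare for all", "raise minimum wage",
         "wage equality now", "healthcare and wage", "protect healthcare"]
labels = [0] * 6 + [1] * 6

lr_pipe = build_pipeline(LogisticRegression(class_weight="balanced", max_iter=1000))
nb_pipe = build_pipeline(MultinomialNB(class_prior=[0.5, 0.5]))

# 1. Logistic Regression exposes signed distances via decision_function
lr_scores = cross_val_predict(lr_pipe, texts, labels, cv=3,
                              method="decision_function")

# 2. Naive Bayes exposes probabilities; column 1 is the positive class
nb_scores = cross_val_predict(nb_pipe, texts, labels, cv=3,
                              method="predict_proba")[:, 1]

print(lr_scores.shape, nb_scores.shape)
```

Each score comes from a model that never saw that post during training, so the resulting curves are out-of-fold estimates rather than training-set fits.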
Precision and Recall vs. Decision Threshold
With the y_scores, I can now compute the precision and recall for all possible thresholds of each model.
I can then display both the precision and recall scores at each threshold value for both models:
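As a sketch of this computation, sklearn's precision_recall_curve returns precision and recall at every distinct threshold. The `y_true` and `y_scores` arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and decision scores for eight posts
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_scores = np.array([0.05, 0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# precisions and recalls have one more entry than thresholds:
# the final pair (precision 1, recall 0) has no corresponding threshold
for p, r, t in zip(precisions, recalls, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Plotting `precisions[:-1]` and `recalls[:-1]` against `thresholds` gives the threshold chart, and plotting `precisions` against `recalls` gives the Precision vs. Recall curve.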
As the F1 score demonstrated above, Logistic Regression overall performs much better than Multinomial Naive Bayes, as it can simultaneously support both a higher Precision and Recall at a given threshold.
Precision vs. Recall
This demonstrates the same conclusion as above: Logistic Regression simultaneously maintains higher scores in both Precision and Recall, and is clearly the better model.
The ROC Curve
The ROC (receiver operating characteristic) curve evaluates the model's predictions across all decision thresholds. Its output displays the True Positive Rate (recall) versus the False Positive Rate (which is equal to 1 - specificity, or 1 - the True Negative Rate).
As the true positive rate increases, so does the false positive rate, and the goal is to achieve the lowest false positive rate at the highest true positive rate.
As such, the ROC of a model is measured via the AUC (area under the curve) score. A perfect classifier would have an AUC of 1, while a perfectly random classifier would have an AUC of .5 (represented by the dotted line in the graph).
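A short sketch of computing the curve and its AUC with sklearn, using four made-up scores rather than my model output.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and decision scores
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

print(list(zip(fpr, tpr)))
print(f"AUC = {auc:.2f}")
```

Of the four possible (negative, positive) pairs here, three are ranked correctly, which is why the AUC comes out to 0.75.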
Removing StopWords & Stemming
I am running a text-based model, and removing stopwords and stemming are basic elements of NLP.
To show how they are beneficial, I wanted to run the baseline model first, and then show how removing stopwords and stemming improves their score.
Removing stopwords is easy enough, as 'stop_words' is a parameter of sklearn's CountVectorizer and can be set in the pipeline:
I apply stemming through the NLTK package's snowball stemmer. Applying stemming in my pipeline is a bit more complex, however, as it must be added within the CountVectorizer (which applies stop_words) in its own class (see this stackoverflow post):
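The commonly shared pattern for this looks roughly like the sketch below: subclass CountVectorizer and wrap its analyzer so that every token is stemmed after stopwords are removed. The class name `StemmedCountVectorizer` and the example sentences are illustrative, and this may differ in detail from my notebook code.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")

class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer that stems each token produced by the default analyzer."""

    def build_analyzer(self):
        # The parent analyzer lowercases, tokenizes, and drops stopwords
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]

vect = StemmedCountVectorizer(stop_words="english")
vect.fit(["The senator was running", "She runs again"])
print(sorted(vect.vocabulary_))
```

This class can take the place of CountVectorizer as the first pipeline step. Both "running" and "runs" collapse into the single feature "run", which shrinks the vocabulary at the cost of extra processing time per post.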
Note also that adding these steps makes fitting the models take significantly longer.
The results suggest a small but noticeable improvement to the Multinomial Naive Bayes model and a slight decline for the Logistic Regression model. Since that decline is minor, I will continue to stem and remove stopwords in the additional models.
Additional Models - RandomForest & Support Vector Machine
I next run 2 additional models on the stemmed and stopwords-removed data: RandomForest and Support Vector Machine:
RandomForest is built upon the CART decision tree model, which iteratively minimizes the impurity of the classification via a binary split on the optimal feature. RandomForest performs CART on an ensemble of many trees, each tree having a random subset of all instances and features of the data. This randomness helps to reduce overfitting, a common issue with CART. Both CART and RandomForest also contain a number of related parameters controlling the depth and size of the trees.
Usually, the more complex the model, the more difficult it is to interpret. As a data scientist, navigating this complexity to communicate the features that impact your model most to stakeholders is critical. There are several methods to do so (PCA, Lasso Regression), and RandomForest has its own method, Feature Importance.
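A minimal sketch of reading Feature Importance from a fitted RandomForest and pairing each score with its word. The toy posts and settings (`n_estimators=100`, `random_state=42`) are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up posts: 0 = conservative, 1 = liberal
texts = ["lol cut taxes", "lower taxes lol", "taxes are theft", "lol border wall",
         "expand healthcare", "healthcare for all", "raise the wage", "wage and healthcare"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

vect = CountVectorizer()
X = vect.fit_transform(texts)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, labels)

# Recover the words in column order, then rank them by importance
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
ranked = sorted(zip(terms, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)

for term, importance in ranked[:5]:
    print(f"{term}: {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as that word's share of the forest's total impurity reduction.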
It is interesting that non-politically related words such as 'lol', 'heh', and 'hey' are among the most important. This seems to indicate that attitude, rather than the political words themselves, was the most distinguishing aspect.
Support Vector Classifier differs from my other models in that it does not define its class division using the entire dataset, but rather only the closest data points at the margins (the support vectors) between the two classes. The strictness of that margin depends on the value of the C hyper-parameter. Lowering the strictness helps to reduce overfitting.
Here are the Precision vs. Recall and ROC curves for all the models:
While I ideally would want to tune the hyperparameters for all models before making my final decision, Logistic Regression and Support Vector Classifier clearly perform the best, and I choose to move forward and tune only those two.
Applying Stochastic Gradient Descent
Sklearn's SGDClassifier fits a linear classifier with Stochastic Gradient Descent, a method that minimizes the error equation not with all the data at once, but iteratively, using one random instance at a time. The 'loss' parameter signifies the type of linear classifier (the default 'hinge' for SVC and 'log' for logistic regression; recent scikit-learn releases name the latter 'log_loss').
SGD is highly advantageous for very large datasets, in that it can still learn very quickly. However, the randomness of its iterative approach puts it at risk of missing the global minimum of the error equation.
Since my two best models both have SGD counterparts, I compared the results of SGD applied to Support Vector Classifier and Logistic Regression against the prior models fit without Stochastic Gradient Descent.
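A sketch of the SGD variant in a pipeline, shown with the default 'hinge' loss (a linear SVC). The toy data and parameter values are illustrative assumptions, not my tuned settings.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Made-up posts: 0 = conservative, 1 = liberal
texts = ["cut taxes", "lower taxes", "secure the border", "border wall now",
         "expand healthcare", "healthcare for all", "raise minimum wage", "wage equality now"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

sgd_svc = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # loss="hinge" gives a linear SVC; swapping in the logistic loss
    # ("log_loss" in recent scikit-learn, "log" in older releases)
    # gives logistic regression fit by SGD instead
    ("clf", SGDClassifier(loss="hinge", class_weight="balanced",
                          max_iter=1000, random_state=42)),
])
sgd_svc.fit(texts, labels)

preds = sgd_svc.predict(["taxes and border", "healthcare and wage"])
print(preds)
```

Fixing `random_state` matters more here than for the batch solvers, since the order in which instances are visited changes the fitted weights.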
In both cases, the SGD model does worse than the original model, and thus SGD will not be applied further in my analysis.
Tune Hyperparameters to Avoid Overfitting
Having established Logistic Regression and SVC as my two best performing models on my training data, I now want to appropriately regularize my model so it is neither underfitting nor overfitting the training data.
Both the Logistic Regression and SVC models have the 'C' hyperparameter (equivalent to 1/lambda) which, as it is increased, reduces the regularization of the model.
To do so, I will train my models on a range of values for 'C' and then evaluate each iteration's F1 score on the test set.
While increasing C will generally increase the training data score, there is an optimal test data score, after which it will decrease.
For both models, I tested a range of C values until I found the peak (be aware, dual y axes on the graphs below).
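The search over C can be sketched as a simple loop. As an assumption for this sketch, each value is scored with cross-validated F1 on the training data (in keeping with holding out the test set until the end); the toy data and the grid of C values are made up.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up posts: 0 = conservative, 1 = liberal
texts = ["cut taxes", "lower taxes", "taxes are theft",
         "secure the border", "border wall now", "taxes and border",
         "expand healthcare", "healthcare for all", "raise minimum wage",
         "wage equality now", "healthcare and wage", "protect healthcare"]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

results = {}
for C in [0.01, 0.1, 1, 10, 100]:
    pipe.set_params(clf__C=C)  # larger C means weaker regularization
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1")
    results[C] = scores.mean()

best_C = max(results, key=results.get)
print(results, best_C)
```

The same loop works for the SVC pipeline, since its regularization parameter is also named C.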
Evaluating my Final Models
Evaluation is Relative
Random Guess - ROC Curve
*(It is unclear why the SVC - Test ROC curve has a linear slope. I applied a workaround to get the predict_proba values, which may have caused it.)
The dotted slope = 1 line of the ROC curve represents a completely random guess, and by that comparison, the model is certainly better.
Completely one class
By this metric, if I simply classified every post as conservative, I'd be correct 72.8% of the time. My two models' accuracies of ~74% present only a minuscule improvement. However, this metric is not appropriate for this dataset because I actively defined my models' parameters to even out the class imbalance and would most likely not be running the model on data this imbalanced in production.
The Human Comparison
In many psychological data science studies, the model is ultimately compared against a human's intuitive judgment. If a human accurately assesses a person's personality only 30% of the time, then a model that is accurate 40% of the time is relatively very good. In this case, I would ideally compare the results to those of a human guessing without knowing the person's ideology.
Contextually Filtering Data to create a Dynamic Model
The ultimate goal of my model is to predict partisan bent in comments in a more anonymous context rather than the tight community that LiberalForum.net represents. In the RandomForest Feature Importance, many of the top features were words more social than political.
If I thus filtered my data source to include only distinctly political posts, it is reasonable to hypothesize that the model would better serve this purpose.
In this final portion of my post, I define two systems to filter out only specifically political posts and analyze their efficiency via a logistic regression model.
- Direct filter - only include posts that explicitly contain the specified word
- News.Api - Search an input word and filter based on a list of related words from news article descriptions
- Full (a large variety of political term inputs)
- Inputs: 'barack obama','donald trump','hilary clinton', 'robert mueller','fbi investigation', 'fake news', 'foreign policy', 'bernie sanders', 'health care', '2016 election','charlottesville riots'
- Outputs:'obama', 'barack', 'president', 'school', 'trump', 'elementary', 'richmond', 'virginia', 'new', 'confederate', 'general', 'capital', 'public', 'venture', 'donald', 'post', 'visited', 'washington', 'house', 'named', 'silicon', 'valley', 'firm', 'fu', 'world', 'white', 'stuart', 'leader', 'prominent', 'photo', 'trump', 'donald', 'president', 'images', 'know', 'meme', 'com', 'north', 'korea', 'states', 'watch', 'fake', 'news', 'truth', 'says', 'singapore', 'order', 'hard', 'kim', 'new', 'twitter', 'big', 'funny', 'people', 'week', 'time', 'nuclear', 'post', 'york', 'just', 'trump', 'clinton', 'hilary', 'president', 'donald', 'party', 'justice', 'election', 'vote', 'didn', 'news', 'saelune', 'people', 'voted', 'video', 'breakfastman', 'media', 'way', 'children', 'investigation', 'supreme', 'court', 'said', 'department', 'gaming', 'prosecutor', 'social', 'county', 'fbi', 'series', 'mueller', 'robert', 'counsel', 'special', 'trump', 'president', 'investigation', 'donald', 'russia', 'probe', 'campaign', 'new', 'russian', 'giuliani', 'fbi', 'michael', 'manafort', 'adviser', 'prosecutors', 'federal', 'election', 'says', 'lawyer', 'justice', 'news', 'team', 'rudy', 'interview', '2016', 'possible', 'fbi', 'investigation', 'clinton', 'report', 'general', 'inspector', 'trump', 'director', 'justice', 'department', 'email', 'hillary', 'doj', 'horowitz', 'comey', 'michael', 'james', 'agent', 'president', 'released', '2016', 'wray', 'ig', 'attorney', 'anti', 'agents', 'messages', 'mueller', 'federal', 'christopher', 'news', 'fake', 'media', 'trump', 'post', 'president', 'appeared', 'social', 'new', 'cnn', 'bbc', 'facebook', 'real', 'story', 'stories', 'political', 'wants', 'biggest', 'like', 'sources', 'source', 'chrome', 'fox', 'said', 'americans', 'know', 'election', 'immigration', 'reporting', 'read', 'foreign', 'policy', 'trump', 'president', 'new', 'america', 'administration', 'international', 'donald', 'india', 'summit', 'post', 'secretary', 'minister', 'iran', 
'world', 'key', 'american', 'power', 'diplomacy', 'government', 'white', 'house', 'kim', 'important', 'relations', 'doctrine', 'erdogan', 'tough', 'state', 'sanders', 'bernie', 'sen', 'vermont', 'democratic', 'senator', 'vt', 'immigration', 'son', 'nomination', 'new', 'politics', 'trump', 'left', 'presidential', 'ocasio', 'cortez', 'protesters', 'news', 'post', 'alexandria', 'president', 'va', 'told', 'crowd', 'ready', 'challenge', 'ice', 'customs', 'enforcement', 'health', 'care', 'new', 'costs', 'national', 'company', 'hospitals', 'state', 'services', 'massachusetts', 'insurers', 'patients', 'medical', 'insurance', 'healthcare', 'based', 'business', 'improve', 'house', 'people', 'page', 'administration', 'primary', 'school', 'coalition', 'major', 'million', 'assessments', 'coverage', 'time', 'election', '2016', 'trump', 'presidential', 'department', 'president', 'russian', 'fbi', 'officials', 'russia', 'state', 'campaign', 'intelligence', 'clinton', 'hillary', 'justice', 'report', 'senate', 'donald', 'obama', 'investigation', 'arpaio', 'said', 'phoenix', 'ap', 'joe', 'meddled', 'committee', 'federal', 'news', 'white', 'year', 'trump', 'day', 'virginia', 'right', 'post', 'line', 'charlottesville', 'supremacist', 'death', 'hottest', 'photos', 'divs', 'great', 'american', 'security', 'michael', 'violence', '2020', 'national', 'july', 'bloomberg', 'issues', 'entire', 'summer', 'video', 'riots', 'african', 'man'
- Inputs: 'trump'
- Outputs: 'trump', 'donald', 'president', 'foundation', 'new', 'illegal', 'ivanka', 'york', 'conduct', 'general', 'attorney', 'lawsuit', 'jr', 'baby', 'underwood', 'news', 'eric', 'sues', 'family', 'said', 'times', 'video', 'ny', 'children', '25', 'filed', 'barbara', 'media', 'niro', 'tweeted', 'independence', 'decade', 'directors', 'did', 'people', 'latest', 'washington', 'post', 'london', 'night', 'melania', 'military', 'themed', 'day', 'book', 'left', '11', 'border', 'liberals', 'state'
- Inputs: 'obama'
- Outputs: 'obama', 'president', 'barack', 'trump', 'administration', 'new', 'washington', 'post', 'did', 'house', 'donald', 'meet', 'democratic', 'white', 'said', 'york', 'year', 'russian', 'putin', 'clinton', 'party', 'border', 'policy', 'thursday', 'people', 'immigration', 'crimea', 'american', 'public', 'africa', 'news', 'week', 'fundraiser', 'msnbc', 'democrats', 'saying', 'times', 'office', 'private', 'world', 'say', 'talk', 'years', 'virtually', 'children', 'families', 'vote', 'child', 'summer', 'presidents'
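The related-word filter can be sketched as follows. To keep the example self-contained, the step of fetching article descriptions from the news API is left out and replaced by a hard-coded list. The helper names `related_words` and `filter_posts`, and the sample descriptions and posts, are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def related_words(descriptions, top_n=30):
    """Return the most frequent non-stopword terms across article descriptions."""
    vect = CountVectorizer(stop_words="english")
    counts = np.asarray(vect.fit_transform(descriptions).sum(axis=0)).ravel()
    terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
    ranked = sorted(zip(terms, counts), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:top_n]]

def filter_posts(posts, keywords):
    """Keep only posts containing at least one keyword as a whole word."""
    keywords = set(keywords)
    analyzer = CountVectorizer().build_analyzer()  # lowercases and tokenizes
    return [post for post in posts if keywords & set(analyzer(post))]

# Stand-in for descriptions returned by a news search on 'trump'
descriptions = [
    "President Trump signs border order",
    "Trump lawsuit filed in New York",
    "Border policy debated by president",
]
keywords = related_words(descriptions, top_n=3)

posts = ["lol good one", "Trump is wrong on the border", "hey everyone"]
political_posts = filter_posts(posts, keywords)
print(keywords, political_posts)
```

The direct filter is the special case where the keyword list contains only the input word itself.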
The main downside of this method is that filtering out posts severely reduces the training data.
|Model||# Posts||% Class = 1||Accuracy||F1||AUC|
|Full Data Final LR Model||55,683||27.2%||.725||.551||.700|
|Filter - Obama||4,018||11.6%||.890||.522||.728|
|NewsAPI - Full||30,532||16.0%||.806||.521||.746|
|NewsAPI - Trump||23,207||15.8%||.812||.533||.755|
|NewsAPI - Obama||20,506||17.8%||.796||.537||.744|
From the ROC curve, the filtered models clearly perform better than the original model! The direct-filtered models, despite having the smallest datasets, appear the best.
Next Steps - Modeling Clustered TFIDF with text meta-data
While the goal of my model is to only utilize text data, I am not limited to only the words. To advance my model further, I can combine the TFIDF results with a number of meta features for each post: # of characters, # of words, counts of specific words ('trump', for instance), a sentiment score, etc.
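One way this combination could look, as a rough sketch of a next step rather than something I have run on the forum data: compute a few meta features per post and stack them next to the TFIDF columns. The posts and the choice of three features are illustrative, and TfidfVectorizer is used as shorthand for CountVectorizer followed by TfidfTransformer.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up posts
posts = ["Trump is wrong on the border",
         "lol good one",
         "trump trump trump, that is all anyone talks about"]

tfidf = TfidfVectorizer().fit_transform(posts)

# Meta features per post: # of characters, # of words, count of 'trump'
meta = np.array([
    [len(post), len(post.split()), post.lower().count("trump")]
    for post in posts
])

# Stack the meta columns to the right of the TFIDF columns
X = hstack([tfidf, csr_matrix(meta)]).tocsr()
print(tfidf.shape, meta.shape, X.shape)
```

Because raw character counts are far larger than TFIDF values, the meta columns would likely need scaling before being fed to a regularized linear model.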
Again, check out my full code here!