The Hillary Clinton Email Explorer
Contributed by Jake Lehrhoff, John Montroy, and Chris Neimeth. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on their fourth class project, due in the 9th week of the program.
For the greater part of 2015, Hillary Clinton has been under fire for her near-exclusive use of a personal email account on a non-governmental server during her time as Secretary of State. The press was certainly not the type a presidential candidate hopes for 18 months before an election. Amidst continued investigation into the 2012 attacks on the Benghazi consulate, scandal rolled into scandal. State Department lawyers uncovered the former Secretary of State's potential breach of email protocol while working through documents from the House Select Committee on Benghazi. Investigators and pundits alike questioned Mrs. Clinton's continued counsel from Sidney Blumenthal, a family friend and advisor who held no role in the State Department. As if Mrs. Clinton wasn't already dominating the 24-hour news cycle as the Benghazi deposition loomed, suddenly there was even more to the conversation: rather than questioning Mrs. Clinton's foresight and commitment to defending United States citizens working overseas, news anchors had us questioning our very trust of a POTUS frontrunner.
Months later, after an 11-hour Benghazi hearing that included a representative symbolically ripping up a blank piece of paper and another piling a desk with a stack of emails from the former Secretary of State's private-server emails for dramatic effect, perhaps Bernie Sanders said it best: "Enough of the emails. Let's talk about the real issues facing America."
Even for those who agree with the Vermont Senator's sentiment, it can be difficult to disengage from the discussion, especially now that we have the actual emails. After a series of Freedom of Information Act lawsuits, each passing month sees newly released emails from the original 55,000 pages. Nearly 8,000 of the released emails were cleaned and hosted on Kaggle.com, a website for data science competitions, which invited users to "uncover the political landscape" in these documents. Given how tired this scandal has grown, we decided the only thing left to do was build a tool that lets you conduct your own investigation. No matter what your feelings are about Hillary Clinton, we hope you enjoy The Hillary Clinton Email Explorer.
*The website is currently in beta, so please do let us know if you experience any outages.
Overview
The landing page of the Explorer provides basic exploratory data analysis of the dataset. Many of the 7,945 released emails were redacted, and far more will likely never be released. What was available to us can paint a variety of pictures. Overall, Mrs. Clinton's emails were extremely short, averaging just 19 words, while Sidney Blumenthal's averaged over 600 words. It appears that Mr. Blumenthal sent far more emails than he received, but we know nothing about the content of the over 40,000 unreleased emails.
EPT Counter
The Email-Person-Topic Counter allows you to type in a series of search words and select senders to see the count of emails sent containing those words. Above, you can see that "Benghazi," "Libya," and "attack" appear in surprisingly few emails. One would imagine that the majority of emails concerning those topics remain unreleased or classified. But finer details can be gleaned: though Jake Sullivan and Huma Abedin both held the title of Deputy Chief of Staff, the content of these emails suggests that Huma was less involved in the affairs in Benghazi.
This simple tool can answer so many questions. Who says "please" and "thank you"? Who is concerned with basic administrative tasks? It seems that Mrs. Clinton is quite courteous in her emails (although she does prefer "pls" to "please"). We also see that Huma Abedin is more concerned with administrative tasks than her counterpart or her superior, Cheryl Mills.
Here's another question: how does everyone refer to President Obama? Only Mrs. Clinton uses the shortened POTUS, while Mr. Blumenthal is more likely than the rest to just use the president's first name. Overall, "Obama" is the most popular, while very few use the formal "President Obama." Fascinating!
Please play with it yourself! If you find anything particularly interesting, please post it in the comments below. A few quick instructions: separate your search terms with only a comma, no space (e.g., "Iran,Iraq,Syria"). Also, hold the shift key to select multiple senders. To get you started, why don't you check out who wants tea and who wants coffee? And for a sanity check, search "tea party" as well, to make sure we know what kind of tea everyone is talking about!
Wordcloud Generator
To add to our investigation of the content within the released emails, we developed a wordcloud application. Select the name of one of the top contributors, and see a cloud of the most common words in his or her emails. Size represents the frequency that a given word appeared in the emails; the color is merely cosmetic. A quick scan of Mrs. Clinton's word cloud shows that she is largely concerned with high level administrative tasks. "Thx," "pls," "print," "will," "call," "time," and "know" all appear prominently.
Sidney Blumenthal's wordcloud shows more focused content covering a range of political topics. "Obama," "political," "Israel," and "issue" are all easy to locate. This wordcloud gives a peek into the nature of Mr. Blumenthal's advice for the former Secretary. It's also a relief that even the highest-level political advisors in the country aren't above abbreviating "you" down to a single letter.
Check out the wordclouds for the other top contributors to get a sense of their roles within Mrs. Clinton's State Department.
Sentiment Analysis
The sentiment analysis tab provides the most distinctive view of the data. "Sentiment" runs from -1 to 1, spanning completely negative to completely positive content. Sentiment was determined using TextBlob, a Python package that scores text against its own lexicon of positive and negative words.
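At its core, this kind of lexicon-based scoring looks up each word in a table of pre-scored terms and averages the matches. The toy stand-in below illustrates the mechanics only; the word scores here are invented for the example and are not TextBlob's actual lexicon or algorithm.

```python
# Minimal lexicon-based polarity scorer, illustrating the idea behind
# sentiment packages like TextBlob. The scores below are made up.
LEXICON = {'great': 0.8, 'good': 0.7, 'thanks': 0.5,
           'bad': -0.7, 'terrible': -1.0}

def polarity(text):
    """Average the lexicon scores of the words that appear in the text."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

polarity('thanks that was great')   # (0.5 + 0.8) / 2 = 0.65
polarity('no opinion here')         # 0.0, no lexicon words found
```

Real analyzers add handling for negation, intensifiers, and part-of-speech context, but the averaged-lexicon skeleton is the same.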
Each query returns two graphs. The first shows the density of a given sender's sentiment scores; the highest peak marks the sentiment (x-axis) most common for that sender. Mrs. Clinton's emails were largely positive, with peaks at 0.2 and 0.5. This isn't surprising given all the "please"s and "thank you"s we discovered in her emails.
On the right we see the average sentiment of emails sent to each of the recipients along the x-axis. While Mrs. Clinton is positive with everyone, she is the most positive with Mr. Blumenthal and least positive with Mr. Sullivan.
When selecting a sender besides Mrs. Clinton, it's important to remember that sentiment to any recipient besides the Secretary of State comes from a small sample of emails. However, there is still understanding to be gleaned. Cheryl Mills and Jake Sullivan, though positive toward Hillary Clinton and Huma Abedin, are not so rosy with each other. In fact, Jake Sullivan's sentiment in emails to Cheryl Mills is actually negative.
The Code
All the code necessary to run this website can be found in our github repository.
The Hillary Clinton Email Explorer website was built with Flask, a Python-based web development framework. The design of the website comes from Bootstrap, which puts the stylings of the entire internet at your fingertips. Each page of the website has its own HTML file containing the structure of the page; a series of Bootstrap CSS files decorate the pages while JavaScript files add functionality. While we worked with the HTML to populate the pages with our material, we did not have to touch the CSS or JS files that Bootstrap provides. The following functions are housed in an init.py file.
Get the emails
The data lives in a MySQL database; the function below fetches it and populates a pandas DataFrame with the given column names.
def getEmails():
    con = mysql.connect()
    cur = con.cursor()
    sql = """SELECT * FROM EmailsC"""
    cur.execute(sql)
    Emails = cur.fetchall()
    Emails2 = [tuple(elm) for elm in Emails]
    EmailsFinal = pd.DataFrame(Emails2, columns=[
        u'Id', u'DocNumber', u'MetadataSubject', u'MetadataTo',
        u'MetadataFrom', u'SenderPersonId', u'MetadataDateSent',
        u'MetadataDateReleased', u'MetadataPdfLink', u'MetadataCaseNumber',
        u'MetadataDocumentClass', u'ExtractedSubject', u'ExtractedTo',
        u'ExtractedFrom', u'ExtractedCc', u'ExtractedDateSent',
        u'ExtractedCaseNumber', u'ExtractedDocNumber',
        u'ExtractedDateReleased', u'ExtractedReleaseInPartOrFull',
        u'ExtractedBodyText', u'RawText'])
    return EmailsFinal
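The fetch-then-wrap steps above can also be collapsed into a single call: pandas can run a query against any DB-API connection and name the columns itself. A sketch of the same pattern, using an in-memory SQLite database and a three-column stand-in table in place of our MySQL connection and full schema:

```python
import sqlite3
import pandas as pd

# Stand-in for the MySQL connection; any DB-API connection works the same way.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE EmailsC (Id INTEGER, MetadataFrom TEXT, "
            "ExtractedBodyText TEXT)")
con.execute("INSERT INTO EmailsC VALUES (1, 'H', 'pls print')")
con.commit()

# pandas issues the query, fetches the rows, and names the columns in one call
emails = pd.read_sql("SELECT * FROM EmailsC", con)
```

With `read_sql`, the column names come from the query result rather than a hand-maintained list, which is one less thing to keep in sync with the table.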
Cleaning the Data
The following functions use regular expressions to clean unnecessary or problematic elements from the text. The first eliminates symbols, and the second strips particular phrases from the emails, particularly those labeling an email as part of the House Benghazi committee investigation. While these regular expressions certainly strip out unwanted material, it is feasible that they pull out a few words that were in fact innocuous parts of the true body text. However, as the words in question carry no sentiment, we do not feel we are risking the loss of pertinent data.
def rmNonAlpha(texts):
    """
    Remove non-alphabetic characters (roughly)
    """
    if isinstance(texts, list):
        ctext = [re.sub(r'\s+', ' ', ctext)
                 for ctext in [re.sub(r'[\[\]()<>{}!:,;\-_|\."\'\\]', '', text)
                               for text in texts]]
    elif isinstance(texts, (str, unicode)):
        ctext = re.sub(r'[(){}<>,\.!?;:\'"/\\\_|]', '', texts)
    return ctext


def rmBoring(texts):
    """
    Remove boring stuff.
    Warning: strong assumptions ahead...but we gotta do some chopping.
    """
    # overhead stuff
    ctext = re.sub(r'^From .*\n', '', texts, flags=re.MULTILINE)
    ctext = re.sub(r'^To .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Case No .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Sent .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Doc No .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Subject .*\n', '', ctext, flags=re.MULTILINE)

    # other misc
    ctext = re.sub(r'.*@.*', '', ctext)  # emails
    ctext = re.sub(r'(?i)(monday|tuesday|wednesday|thursday|friday|saturday|sunday).*\d{3,4} [AP]M\n',
                   '', ctext, flags=re.MULTILINE)  # timestamps
    ctext = re.sub(r'Fw .*\n', '', ctext, flags=re.MULTILINE)  # forward line
    ctext = re.sub(r'Cc .*\n', '', ctext, flags=re.MULTILINE)  # Cc line

    # house benghazi committee stuff
    ctext = re.sub(r'B[56(7C)]', '', ctext)
    ctext = re.sub(r'Date 05132015.*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'STATE DEPT .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'SUBJECT TO AGREEMENT.*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'US Department of State.*\n', '', ctext, flags=re.MULTILINE)
    return re.sub(r'\s+', ' ', ctext).lower()
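To see what these two passes accomplish end to end, here is a simplified, self-contained Python 3 version of the same idea applied to a made-up raw email. The patterns are a condensed subset of the ones above, and the sample text is invented for illustration:

```python
import re

def clean_email(raw):
    """Strip header lines, then punctuation, then normalize whitespace/case."""
    # header lines like 'From ...', 'To ...', 'Subject ...' at line starts
    text = re.sub(r'^(From|To|Sent|Subject|Cc) .*\n', '', raw,
                  flags=re.MULTILINE)
    # punctuation and symbols
    text = re.sub(r'[(){}<>,\.!?;:\'"/\\_|]', '', text)
    # collapse whitespace and lowercase the remainder
    return re.sub(r'\s+', ' ', text).strip().lower()

raw = "From H\nTo Huma Abedin\nSubject schedule\nPls print, thx!"
clean_email(raw)   # 'pls print thx'
```

Only the body survives, as a lowercase word stream ready for counting or sentiment scoring.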
Counts by keyword
The following two functions create counts of emails that contain particular topics. The first takes a single person and counts the emails containing each of the given keywords; the second applies that function to all of the selected email senders.
def CountsByKeyword(df, col, person, topics,
                    StartDate='2009-01-01', EndDate='2013-01-01'):
    """
    Returns a dict of total mention counts per keyword for the given person.
    Counts are restricted to the passed-in time frame, defaulting to the
    entire timeframe. The 'col' parameter controls which field you're
    getting counts by.
    """
    if not isinstance(topics, (str, unicode, list)):
        raise TypeError("'topics' parameter must be either str or list")
    person = '(' + person + ')'
    StartDate = datetime.strptime(StartDate, '%Y-%m-%d')
    EndDate = datetime.strptime(EndDate, '%Y-%m-%d')
    # build the dict via comprehension: one filtered count per topic
    return {topic: df[col].loc[
                (df[col].str.contains(person, case=False)) &
                (df['ExtractedBodyText'].str.contains(topic, case=False)) &
                (df['MetadataDateSent'] > StartDate) &
                (df['MetadataDateSent'] < EndDate)].count()
            for topic in topics}


def buildCounterDF(personlist, topiclist):
    PersonThing = list()
    PersonTopic = pd.DataFrame()
    topiclist = topiclist.split(',')
    for person in personlist:
        PersonThing.append(
            (person, CountsByKeyword(Emails, col='MetadataFrom',
                                     person=person, topics=topiclist)))
    for item in PersonThing:
        tdf = pd.DataFrame.from_dict(item[1], orient='index')
        tdf['Person'] = item[0]
        tdf.reset_index(level=0, inplace=True)
        tdf.rename(columns={'index': 'Topic', 0: 'count'}, inplace=True)
        tdf = tdf[['Person', 'Topic', 'count']]
        PersonTopic = PersonTopic.append(tdf)
    return PersonTopic
Make Sentiment
Sentiment is determined with TextBlob. Two functions are necessary: the first creates the data that populates the first sentiment graph, which shows the density of sentiments across a given sender's corpus of emails; the second determines sentiment by recipient. The extra argument, "personlist", is populated with the recipients selected from the dropdown menu in the application.
def GetSentimentPerPerson(df, person):
    text = df[['MetadataFrom', 'ExtractedBodyText']].loc[
        df.MetadataFrom.str.contains('(' + person + ')')]
    text.ExtractedBodyText = text.ExtractedBodyText.apply(
        lambda x: rmBoring(rmNonAlpha(x)).decode('ascii', 'ignore'))
    text['sentiment'] = text['ExtractedBodyText'].apply(
        lambda x: TextBlob(x).polarity)
    return text.loc[text.sentiment != 0]  # only return meaningful


def GetSentimentForPeople(df, target, personlist):
    sentimentlist = list()
    stoplist = set('for a of the and to in on from'.split())
    for person in personlist:
        text = df['ExtractedBodyText'].loc[
            (df.MetadataFrom.str.contains('(' + target + ')')) &
            (df.MetadataTo.str.contains('(' + person + ')'))].values.tolist()
        text = ' '.join([str(word) for word in text if word not in stoplist])
        text = rmBoring(rmNonAlpha(text)).decode('ascii', 'ignore')
        sentimentlist.append((person, TextBlob(text).polarity))
    return sentimentlist
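The per-recipient graph is essentially a group-by in disguise: filter to one sender, pool the text per recipient, then score each pool. A sketch of that shape with pandas (the rows and the `polarity` stub are invented for illustration; the app scores with TextBlob instead):

```python
import pandas as pd

emails = pd.DataFrame({
    'MetadataFrom': ['H', 'H', 'H'],
    'MetadataTo':   ['Abedin, Huma', 'Abedin, Huma', 'Sullivan, Jacob J'],
    'ExtractedBodyText': ['thanks so much', 'great work', 'need this now'],
})

def polarity(text):
    """Stand-in scorer: average of hand-picked word scores (invented)."""
    scores = {'thanks': 1.0, 'great': 1.0, 'now': -1.0}
    hits = [scores[w] for w in text.split() if w in scores]
    return sum(hits) / len(hits) if hits else 0.0

def sentiment_by_recipient(df, sender):
    # pool all body text per recipient, then score each pooled document
    pooled = (df.loc[df['MetadataFrom'] == sender]
                .groupby('MetadataTo')['ExtractedBodyText']
                .apply(' '.join))
    return pooled.apply(polarity)

scores = sentiment_by_recipient(emails, 'H')
```

Pooling before scoring (rather than averaging per-email scores) weights longer emails more heavily, which matches how `GetSentimentForPeople` joins the text first.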
Creating Visualizations
The HTML files running the website contain the Flask template code seen below. The request.args.get function takes the two inputs, "target" and "personlist", and uses them to build "sentplot" and "sentpeopleplot", two plots that are defined back in init.py and will eventually populate the sentiment page.
{% if request.args.get('target') != None and request.args.get('personlist') != None %}
<div class="panel panel-primary">
  <div class="panel-heading">
    <h3 class="panel-title" style="text-align:center;">Graph Results</h3>
  </div>
  <div class="panel-body">
    <span><center>
      <img src="{{ url_for('sentplot', target = target, personlist = personlist) }}"
           alt="Distribution of sentiment"
           style="position:relative;top:-10px;" height=600 width=650>
      <img src="{{ url_for('sentpeopleplot', target = target, personlist = personlist) }}"
           alt="Personal sentiment towards others"
           style="position:relative;top:-10px;" height=600 width=650>
@app.route('/fig/<target>/<personlist>/sentplot.png')
def sentplot(target, personlist):
    target = re.sub(r'\+', ' ', target)
    EmailSnt = GetSentimentPerPerson(Emails, target)
    plt.clf()
    sns.distplot(EmailSnt.sentiment)
    plt.xlim(-1, 1)
    plt.title('Email Sentiment: {}'.format(target), fontsize=16)
    fig = plt.gcf()
    img = StringIO.StringIO()
    fig.savefig(img)
    img.seek(0)
    return send_file(img, mimetype='image/png')


@app.route('/fig/<target>/<personlist>/sentpeopleplot.png')
def sentpeopleplot(target, personlist):
    target = re.sub(r'\+', ' ', target)
    personlist = personlist.split(',')
    s = GetSentimentForPeople(Emails, target, personlist)
    s = pd.DataFrame(s, columns=['Person', 'Sentiment'])
    plt.clf()
    sns.barplot(x='Person', y='Sentiment', data=s)
    plt.ylabel('Sentiment')
    plt.title('How {} feels'.format(target))
    fig = plt.gcf()
    img = StringIO.StringIO()
    fig.savefig(img)
    img.seek(0)
    return send_file(img, mimetype='image/png')
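One portability note: `StringIO.StringIO` is Python 2 only. In Python 3 the in-memory buffer for binary data such as PNG bytes is `io.BytesIO`; the write/seek/`send_file` pattern is otherwise unchanged. A minimal stdlib sketch of the buffer handling (the bytes here are just the PNG file signature, standing in for what `fig.savefig(buf, format='png')` would write):

```python
import io

buf = io.BytesIO()
buf.write(b'\x89PNG\r\n\x1a\n')  # placeholder for fig.savefig(buf, format='png')
buf.seek(0)                      # rewind so the reader starts at byte 0
payload = buf.read()
```

Forgetting the `seek(0)` is the classic bug with this pattern: `send_file` would read from the end of the buffer and serve an empty image.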
Debugging
Conclusion