Sentiment Analysis of Media Coverage of Presidential Candidates
Contributed by David Comfort. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his fourth class project (due in the 8th week of the program).
===================
My overall goal was to determine whether the sentiment of news coverage of US presidential candidates can be detected. Specifically, I extracted articles about the major presidential candidates from the New York Times and attempted to perform sentiment analysis on them.
Outline
- Identifying Articles in the New York Times using their API
- Web scraping tricks and tips
- Content Extraction Algorithms for Python
- Scraping the New York Times and extracting the content
- AlchemyAPI
- Sentiment Analysis
- Demonstrate Indico, TextBlob and other Sentiment Analysis packages
- Comparison of Indico and TextBlob for Movie Reviews
- Conclusions
A screen capture of the presentation is below:
[youtube http://www.youtube.com/watch?v=a2yW7OLC_2s?rel=0&w=560&h=315]
1. Identifying Articles in the New York Times using their API
I used the New York Times API and Python packages to pull article titles for the candidates over the past year. Specifically, I used nytimesarticle, a fully functional Python wrapper for the New York Times Article Search API. I borrowed some of the code below from the excellent tutorial, Scraping New York Times Articles with Python: A Tutorial.
The New York Times API will not return the full text of articles. However, it will return metadata such as subject terms, abstract, and date, as well as the URL, which I will subsequently use to scrape the full text of articles.
To begin, you first need to obtain an API key from the New York Times. See here for more information.
You also need to install the nytimesarticle Python package, which allows you to query the API through Python.
Load the API
Parse the Articles
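The parsing step can be sketched roughly as follows. This is a minimal illustration, not the original code: it assumes each result page is a decoded JSON dictionary with the articles under response and docs, and the sample page is hand-written (its date and subject term are made up for illustration).

```python
def parse_articles(result):
    """Flatten one Article Search result page into a list of plain dicts."""
    parsed = []
    for doc in result.get("response", {}).get("docs", []):
        keywords = doc.get("keywords") or []
        parsed.append({
            "id": doc.get("_id"),
            "headline": (doc.get("headline") or {}).get("main"),
            "abstract": doc.get("abstract"),
            "date": (doc.get("pub_date") or "")[:10],  # keep YYYY-MM-DD only
            "url": doc.get("web_url"),
            # keep only the subject terms from the keyword list
            "subjects": [k["value"] for k in keywords if k.get("name") == "subject"],
        })
    return parsed


# Hypothetical page standing in for a real API response.
sample_page = {
    "response": {
        "docs": [{
            "_id": "example-id",
            "headline": {"main": "Death Penalty Could Provide Debate Fodder"},
            "abstract": None,
            "pub_date": "2015-10-30T00:00:00Z",
            "web_url": "http://www.nytimes.com/politics/first-draft/",
            "keywords": [{"name": "subject", "value": "Capital Punishment"}],
        }]
    }
}

articles = parse_articles(sample_page)
print(articles[0]["headline"], articles[0]["date"])
```

Using get with defaults throughout means that a document with a missing field produces None rather than stopping the whole run.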
Put the Articles in a Dataframe
Now that we have a function to parse results into a clean list, we can easily write another function that collects all articles for a search query in a given year.
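A sketch of such a collecting function is below. It is an illustration under stated assumptions rather than the original code: the search argument is any callable that returns one decoded result page (for example, the search method of an nytimesarticle API object, which is an assumption here), and fake_search is a hypothetical stand-in so the example runs without an API key.

```python
def get_articles(search, query, year, max_pages=100):
    """Collect the documents from every result page for a query in one year."""
    docs = []
    for page in range(max_pages):
        result = search(
            q=query,
            begin_date="%d0101" % year,  # YYYYMMDD, first day of the year
            end_date="%d1231" % year,    # YYYYMMDD, last day of the year
            page=page,
        )
        page_docs = result.get("response", {}).get("docs", [])
        if not page_docs:
            break  # an empty page means there are no more results
        docs.extend(page_docs)
    return docs


def fake_search(**params):
    """Hypothetical search function that serves two small pages of results."""
    pages = {0: [{"web_url": "a"}, {"web_url": "b"}], 1: [{"web_url": "c"}]}
    return {"response": {"docs": pages.get(params["page"], [])}}


sanders_docs = get_articles(fake_search, "Bernie Sanders", 2015)
print(len(sanders_docs))
```

Passing the search function in as an argument keeps the paging logic separate from the API client, so it can be checked without network access.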
Now I can get articles for each candidate. For instance, for Bernie Sanders:
2. Web scraping tricks and tips
In the course of this project, I picked up several web scraping tips and tricks. I learned some of these techniques from the book Web Scraping with Python: Collecting Data from the Modern Web, although I wish the author had gone into a little more detail in places.
Adjusting Headers
HTTP headers are a list of attributes, or preferences, sent by you every time you make a request to a web server. A typical Python scraper using the default urllib library might send the following header:
User-Agent: Python-urllib/3.4
We can utilize the following code to examine what a server would "see" if I tried to scrape their site:
Notice that the USER_AGENT is set to python-requests/2.3.0 CPython/2.7.10, which might alert the site that I am scraping it.
However, we can change the header using the code below so that our web scraper appears "more human":
Now the USER_AGENT is set to Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5), so the website is more likely to treat the request as coming from a human using a browser.
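The same idea can be sketched with only the standard library (Python 3's urllib.request rather than requests). Nothing is sent over the network here: the code just inspects the headers that would be sent. The URL is a placeholder.

```python
import urllib.request

URL = "http://www.example.com/"  # placeholder; no request is actually made

# An opener built with defaults announces itself as a Python script.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # starts with "Python-urllib/"

# Overriding the header makes the request look like it came from a browser.
browser_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)"
request = urllib.request.Request(URL, headers={"User-Agent": browser_agent})

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing this request object to urllib.request.urlopen would send the browser-style header in place of the default one.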
Handling Cookies
Several browser plug-ins can show you how cookies are being set as you visit and move around a site.
To stay logged in across pages, I first send the login parameters to the welcome page, which acts as the processor for the login form. I then retrieve the cookies from the result of that request, print them for verification, and then send them to the profile page by setting the cookies argument.
The session object (retrieved by calling requests.Session()) keeps track of session information, such as cookies, headers, and even information about protocols you might be running on top of HTTP, such as HTTPAdapters.
The first request to the URL will fill the jar. The second request will send the cookies back to the server. The same goes for the standard library's urllib module together with cookielib (renamed http.cookiejar in Python 3).
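The fill-then-send behaviour of a cookie jar can be sketched with only the standard library. This is an illustration, not the original code: the small local server, its /login and /profile paths, and the sessionid cookie are all hypothetical stand-ins for a real site.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login sets a cookie, other paths echo it back."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
            body = b"logged in"
        else:
            body = self.headers.get("Cookie", "no cookie").encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def cookie_round_trip():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),          # ignore any system proxy
        urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
    )
    try:
        opener.open(base + "/login").read()        # first request fills the jar
        names = [cookie.name for cookie in jar]
        echoed = opener.open(base + "/profile").read().decode()  # jar is sent back
    finally:
        server.shutdown()
        server.server_close()
    return names, echoed
```

Calling cookie_round_trip() returns the cookie names held in the jar after the first request and the Cookie header that the server saw on the second.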
There are some excellent Stack Overflow posts which cover how to handle cookies while web scraping:
- requests.Session() load cookies from CookieJar
- Putting a Cookie in a CookieJar
- Using requests module, how to handle 'set-cookie' in request response?
Space out requests
You can also space out requests so you are not overloading somebody else's server:
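One way to do this is to sleep for a short, random interval between requests. The sketch below is illustrative: the function name, the delay range, and the fetch argument (any function that downloads one URL) are assumptions, not part of the original code.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # wait between requests, but not before the first one
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Randomizing the delay avoids hitting the server at a perfectly regular rhythm, and the sleep argument can be swapped out so the loop can be checked without actually waiting.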
Scraping remotely
Tor
The Onion Router network, better known by the acronym Tor, is a network of volunteer servers set up to route and reroute traffic through many layers (hence the onion reference) of different servers in order to obscure its origin. Data is encrypted before it enters the network so that if any particular server is eavesdropped on, the nature of the communication cannot be revealed. In addition, although the inbound and outbound communications of any particular server can be compromised, one would need to know the details of inbound and outbound communication for all the servers along the path of communication in order to decipher the true start and endpoints of a communication, a near-impossible feat.
Having Tor installed and running is a requirement for using Python with Tor, as we will see in the next section. Fortunately, the Tor service is extremely easy to install and start running with. Just go to the Tor downloads page (https://www.torproject.org/download/download) and download, install, open, and connect! Keep in mind that your Internet speed might appear to be slower while using Tor. Be patient: it might be going around the world several times!
PySocks
PySocks is a remarkably simple Python module that is capable of routing traffic through proxy servers and that works fantastically in conjunction with Tor. You can download it from its website or use any number of third-party module managers to install it.
pip install PySocks
Although not much in the way of documentation exists for this module, using it is extremely straightforward. The Tor service must be running on port 9150 (the default port) while running this code:
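A minimal sketch of the PySocks setup is below, assuming Tor is listening on localhost port 9150 as described. The function name is my own, and the socks module is imported inside the function so that the sketch loads even where PySocks is not installed.

```python
import socket

TOR_HOST = "localhost"
TOR_PORT = 9150  # the default port mentioned above


def route_through_tor(host=TOR_HOST, port=TOR_PORT):
    """Send all newly created socket connections through the Tor SOCKS5 proxy."""
    import socks  # provided by the PySocks package

    socks.set_default_proxy(socks.SOCKS5, host, port)
    # Replace the standard socket class so libraries built on it use the proxy.
    socket.socket = socks.socksocket
```

After calling route_through_tor(), connections opened through the standard socket module in the same process go through Tor. Because this replaces socket.socket globally, it is best done once at the start of a scraping script.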
3. Content Extraction Algorithms for Python
There are several packages in Python for extracting content from Web pages. A brief overview of these packages follows:
Goose
Goose was originally an article extractor written in Java.
The aim of the software is to take any news article or article-type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.
Goose will try to extract the following information:
- Main text of an article
- Main image of article
- Any YouTube/Vimeo movies embedded in article
- Meta Description
- Meta tags
We can utilize Goose to extract the content, the title, the meta description, and the top image of an article based upon the URL of an article:
http://www.nytimes.com/politics/first-draft/2015/10/30/death-penalty-could-provide-debate-fodder-for-hillary-clinton-and-bernie-sanders/
Hillary Rodham Clinton, who leads most Democratic polls nationally and in Iowa, has for months moved to her party’s left on a range of issues, from immigration overhaul to criminal justice reform to, more recently, opposition to the Trans-Pacific Partnership trade deal.Yet on Wednesday, Mrs. Clinton bluntly told attendees at a campaign event that she supports the death penalty — in limited use and in limited cases, but she still supports it. And that’s a position that isn’t shared by much of the Democratic primary base.Senator Bernie Sanders of Vermont, her main opponent in the Democratic contest, called for the abolition of the death penalty in a speech on the Senate floor on Thursday, a move that highlighted the issue and the fact that he is to her left on it.“We are all shocked and disgusted by some of the horrific murders that we see in this country, seemingly every week,” Mr. Sanders said. “And that is precisely why we should abolish the death penalty. At a time of rampant violence and murder, the state should not be part of that process.”The other Democratic candidate, Martin O’Malley, also favors abolishing the death penalty. And Mrs. Clinton’s position and her rivals’ disagreement with it virtually guarantees it will be a topic at the next Democratic debate.
u'Death Penalty Could Provide Debate Fodder for Hillary Clinton and Bernie Sanders'
u'Mrs. Clinton\u2019s comments that she supported limited use of the death penalty put her on the opposite side of much of the Democratic base and prompted a response on the Senate floor from Bernie Sanders, all but assuring the topic will be raised at the next debate.'
'http://graphics8.nytimes.com/images/2015/10/29/us/politics/29firstdraftnl-sanders/29firstdraftnl-sanders-tmagArticle.jpg'
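The extraction above can be sketched roughly as follows. This is not the original code: extract_article assumes either the goose3 package (Python 3) or the original python-goose package is installed, and summarize is a small helper of my own that collects the four fields shown above.

```python
from types import SimpleNamespace


def extract_article(url):
    """Fetch a page with Goose and return the parsed article object."""
    try:
        from goose3 import Goose  # Python 3 package
    except ImportError:
        from goose import Goose   # original python-goose package
    return Goose().extract(url=url)


def summarize(article):
    """Collect the main fields of a Goose article into a plain dict."""
    image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "meta_description": article.meta_description,
        "text": article.cleaned_text,
        "top_image": image.src if image is not None else None,
    }


# Hypothetical stand-in object, so the helper can be tried without Goose.
stand_in = SimpleNamespace(
    title="Example title",
    meta_description="Example description",
    cleaned_text="Example body text.",
    top_image=SimpleNamespace(src="http://www.example.com/image.jpg"),
)
print(summarize(stand_in)["title"])
```

With Goose installed, summarize(extract_article(url)) would give the same four fields for a live page.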
BeautifulSoup
I also examined BeautifulSoup, although I found it less easy to use than Goose. In retrospect, BeautifulSoup is good to use if you really need to target specific elements on a web page.
http://www.nytimes.com/politics/first-draft/2015/10/30/death-penalty-could-provide-debate-fodder-for-hillary-clinton-and-bernie-sanders/
Readability
This is a Python port of a Ruby port of arc90’s Readability project. Given an HTML document, Readability pulls out the main body text and cleans it up.
http://www.reuters.com/article/2015/10/29/us-indonesia-elnino-idUSKCN0SM2SK20151029
<html><body><div><span id="articleText"> <span id="midArticle_start"/> <span id="midArticle_0"/><span class="focusParagraph"><p><span class="articleLocation">KARANG JATI, Indonesia</span> On a dry and dusty sports field in central Java, Indonesian men dressed as traditional warriors take turns to battle with wooden staves, while village women crowd around, chanting: "All farmers let us pray that rain comes and washes our sorrow away."</p></span><span id="midArticle_1"/><p>As in many parts of Java, Indonesia's main rice-growing island, seasonal rains are late coming to Karang Jati. A drought caused by the El Nino weather pattern, which scientists say could be the worst on record, means fields are fallow weeks after they would normally be sown. So the villagers have turned to a rain-making ritual to hasten the planting season.</p><span id="midArticle_2"/><p>Crop failures caused by an El Nino drought presage more pain for Southeast Asia's largest economy, which is already growing at its slowest pace in six years, by squeezing incomes, fanning inflation and pushing more people into poverty.</p><span id="midArticle_3"/><p>All this piles pressure on Joko Widodo, Indonesia's first president from humble origins, on top of the haze crisis caused by slash-and-burn forest clearances, who made poverty reduction a priority but has seen it swell across this archipelago of 250 million people since he took office a year ago.</p><span id="midArticle_4"/><p>The number of people officially classed as poor actually rose in the first six months of his presidency to 28.6 million in March from 27.7 million in September 2014.</p><span id="midArticle_5"/><p>Twenty of Indonesia's 34 provinces are currently stricken by severe drought, according to the meteorology agency.</p><span id="midArticle_6"/><p>The World Bank says that if there is a severe El Nino this year, rice production will fall by 2.1 million tonnes, or 2.9 percent, and rice prices will rise by 10.2 percent.</p><span 
id="midArticle_7"/> <span class="first-article-divide"/><p>That price rise will hit the poor hardest because they spend more of their income on food than the well off.</p><span id="midArticle_8"/><p>"Reduced agricultural incomes and higher prices could be devastating for poor households," the Bank said in a report, adding that rice imports may be needed if El Nino intensifies.</p><span id="midArticle_9"/><p/><span id="midArticle_10"/><p>"NO RAIN, NO MONEY"</p><span id="midArticle_11"/> <span class="second-article-divide"/><p>Widodo has provided more funds for cash transfers and social schemes, but so far has refused to sanction rice imports, keen that Indonesia should be self-sufficient in food.</p><span id="midArticle_12"/><p>"We are not talking about imports," Finance Minister Bambang Brodjonegoro told Reuters in a recent interview. "We are trying to make sure the domestic stocks are available and accessible."</p><span id="midArticle_13"/><p>Other countries at risk of an El Nino drought, such as the Philippines, have taken advantage of low global rice prices to boost stocks with foreign imports.</p><span id="midArticle_14"/><p>Such measures at least cap inflation if crops fail, though they mostly benefit people in towns who consume rice, rather than the farmers who produce it - all they can do is pray for the weather to change.</p><span id="midArticle_15"/> <span class="third-article-divide"/><p>"Our paddy fields depend on rainwater, so if there is no rain we suffer," said Darijan, a 60-year-old farmer in central Java who has started selling his soil to brick-makers to make ends meet.</p><span id="midArticle_0"/><p>Agriculture accounts for nearly 14 percent of Indonesia's gross domestic product, the highest among Southeast Asia's five main economies. One-third of the labour force works in farming, and more than half of poor households live off the land.</p><span id="midArticle_1"/><p>"What is very important ... 
to the poverty numbers is rice production and rice prices," Steven Tabor, the Asian Development Bank's head in Indonesia, told a recent conference. "And the beginnings of El Nino seem to suggest that we may be in for rising poverty towards the end of the year."</p><span id="midArticle_2"/><p>As the drought drags on, Karang Jati's farmers such as 70-year-old Rohadi Rustam are anxious. </p><span id="midArticle_3"/><p>"If there's no rain, we have no money," he said, sitting by his sun-cracked fields. "That's how we farmers live."</p><span id="midArticle_4"/><p/><span id="midArticle_5"/><p> (Additional reporting by Heru Asprihanto, Quincy de Neve and Arzia Tivany Wargadiredja in JAKARTA; Editing by <a href="http://blogs.reuters.com/search/journalist.php?edition=us&n=john.chalmers&">John Chalmers</a> and <a href="http://blogs.reuters.com/search/journalist.php?edition=us&n=simon.cameron.moore&">Simon Cameron-Moore</a>)</p><span id="midArticle_6"/></span> </div></body></html>
Beyond haze, El Nino drought poses poverty challenge for Indonesia
Other Packages include:
A recent blog post benchmarked one content extraction algorithm, Dragnet, against a few other Python content extraction repositories in execution speed and accuracy. Their post describes those benchmarks and the recent improvements in Dragnet.
Python Web Scraping Resources
This list contains Python libraries related to web scraping and data processing:
Python Web Scraping
- Network
- Web-scraping Frameworks
- HTML/XML Parsing
- Text processing
- Specific Formats Processing
- Natural Language Processing
- Browser automation and emulation
- Multiprocessing
- Queue
- Cloud Computing
- URL and Network Address Manipulation
- Web Content Extracting
- Asynchronous
- WebSocket
- DNS Resolving
- Computer Vision
- Proxy Server
- Misc
- Other Python Lists
4. Scraping the New York Times and extracting the content
Looping through each candidate and each URL
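The loop can be sketched as below. It is an illustration under assumptions: urls_by_candidate maps each candidate's name to a list of article URLs, and extract_text is any function that returns the article text for a URL (for example, one built on Goose). Both names are hypothetical.

```python
import time


def scrape_candidates(urls_by_candidate, extract_text, delay=1.0, sleep=time.sleep):
    """Extract the text of every article URL, grouped by candidate."""
    rows = []
    for candidate, urls in urls_by_candidate.items():
        for url in urls:
            try:
                text = extract_text(url)
            except Exception:
                text = None  # keep going if a single page fails
            rows.append({"candidate": candidate, "url": url, "text": text})
            sleep(delay)  # space out requests to the same site
    return rows
```

Recording failed pages with a text of None, rather than stopping, lets a long scraping run finish and makes it easy to retry only the failures afterwards.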
5. AlchemyAPI
AlchemyAPI uses machine learning (specifically, deep learning) to do natural language processing (specifically, semantic text analysis, including sentiment analysis) and computer vision (specifically, face detection and recognition). Its text analysis features include:
- Entity Extraction
- Keyword Extraction
- Concept Tagging
- Sentiment Analysis
- Targeted Sentiment Analysis
- Relation Extraction
- Text Categorization
- Taxonomy
Key: ba25019e0990c87c9c4965c4898672b8ddb3f3e1 was written to api_key.txt You are now ready to start using AlchemyAPI. For an example, run: python example.py Mayor Bill de Blasio’s slow, awkward march toward endorsing Hillary Rodham Clinton — the woman who jump-started his political career — reached its widely predicted conclusion on Friday, as the mayor extended his presidential blessing on MSNBC’s “Morning Joe.” “The candidate who I believe can fundamentally address income inequality effectively, the candidate who has the right vision, the right experience and the ability to get the job done, is Hillary Clinton,” Mr. de Blasio said. Mrs. Clinton, a fellow Democrat, did not appear on the program. Nor did she issue a statement about Mr. de Blasio. Instead, the Clinton campaign emailed a note to reporters calling the New York City mayor’s support “a sign of the campaign’s continued momentum.” The note added, “The Clinton campaign will also announce that an additional 85 mayors from across the country will endorse her today.” The circumstances of Mr. de Blasio’s endorsement offered a sign of the reduced interest in his presidential preferences since he initially denied Mrs. Clinton his blessing in April. When the Clinton campaign issued an official announcement about its latest wave of endorsements, Mr. de Blasio’s was given fourth billing, after the mayors of Chicago, Houston and Philadelphia. By this week, Mr. de Blasio’s six-month delay in issuing an endorsement was viewed in political circles as quixotic at best. Virtually all leading Democrats in New York have already thrown their support behind Mrs. Clinton, along with leading liberal Democrats like Senator Sherrod Brown of Ohio. But the mayor’s vow of neutrality was initially taken as a liberal line in the sand, part of Mr. de Blasio’s effort to kick-start a national movement to nudge presidential contenders leftward. 
His public vacillation started in political prime time with a bold declaration during an interview on NBC’s “Meet the Press” — “I want to see a vision,” he said — hours before Mrs. Clinton formally began her candidacy. For months, the mayor’s allies objected to accusations that Mr. de Blasio wanted to play kingmaker, saying that he simply wanted to hear more detail about the policy plans of presidential contenders, particularly in the early stages of the contest. More recently, however, many members of the mayor’s inner circle had grown frustrated with his delay, saying that the uncertainty had become a distraction. The mayor’s liberal advocacy group, the Progressive Agenda Committee, also announced plans to hold a forum for presidential contenders in Iowa in December, complicating Mr. de Blasio’s calculus about whether to remain unaffiliated with a candidate. The “will he or won’t he” soap opera ended shortly after sunrise on Friday, on a low-rated basic-cable talk show amid banter about baseball and jokes about Citi Bike. “Morning Joe” is familiar, relatively friendly territory for Mr. de Blasio, who has visited its set at least six times during his 22-month term. But the mayor still faced skeptical questions from the hosts, who wondered what, exactly, had changed in the past few months to make him want to endorse Mrs. Clinton now. At one point, Mark Halperin, a political analyst, urged Mr. de Blasio to face a camera and explain his decision directly to supporters of Senator Bernie Sanders of Vermont, an independent who has been embraced by many left-leaning Democrats. “I like Bernie a lot,” Mr. de Blasio said. But he quickly added that Mrs. Clinton was “the most capable of executing the vision.” (A spokesman for Mr. Sanders did not return a request for comment.) Katrina vanden Heuvel, the editor and publisher of The Nation and one of Mr. 
de Blasio’s closest supporters, wrote in an email that the mayor’s endorsement “gets de Blasio on the Clinton team,” adding: “That’s good for him, good for New York, good for an urban agenda.” But she also urged Mr. de Blasio to stay vigilant in encouraging Mrs. Clinton to adopt positions supported by the left. “There is still room for her to move, and de Blasio should keep the pressure on for that movement,” Ms. vanden Heuvel wrote. For his part, Mr. de Blasio said on MSNBC that Mrs. Clinton had “a very sharp progressive platform” and “the ability to follow through on it.” He added: “There’s a lot of spine there. There’s a steel there.” Reflecting on his own thought process, Mr. de Blasio suggested that Mrs. Clinton had increasingly fit his criteria for a candidate, saying, “As you’ve seen in each successive speech, Hillary has filled in the blanks forcefully.” At the same time, he urged voters to pay more attention to her accomplishments in the past. “I think what’s missing here in this discussion,” Mr. de Blasio said, “is who Hillary has always been.”
Entity Extraction
An example of using AlchemyAPI for entity extraction is below:
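As a rough sketch of how an entity extraction result might be summarized, the function below works on an already decoded response. The response layout (a status field plus an entities list with text, type, relevance and sentiment, with numbers given as strings) is an assumption modeled on the SDK's example script, and the sample values are invented for illustration. They are not real AlchemyAPI output.

```python
def summarize_entities(response, min_relevance=0.5):
    """Return (text, type, relevance, sentiment type, score) for relevant entities."""
    if response.get("status") != "OK":
        return []
    rows = []
    for entity in response.get("entities", []):
        relevance = float(entity.get("relevance", 0))
        if relevance < min_relevance:
            continue  # skip entities that are only mentioned in passing
        sentiment = entity.get("sentiment", {})
        score = sentiment.get("score")  # assumed absent for neutral sentiment
        rows.append((
            entity["text"],
            entity["type"],
            relevance,
            sentiment.get("type"),
            float(score) if score is not None else None,
        ))
    return rows


# Hypothetical response; the numbers are made up.
sample_response = {
    "status": "OK",
    "entities": [
        {"text": "Hillary Rodham Clinton", "type": "Person", "relevance": "0.9",
         "sentiment": {"type": "positive", "score": "0.3"}},
        {"text": "MSNBC", "type": "Company", "relevance": "0.2",
         "sentiment": {"type": "neutral"}},
    ],
}

for row in summarize_entities(sample_response):
    print(row)
```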
Example code for keyword extraction, concept tagging, relation extraction, text categorization, taxonomy, and sentiment analysis is at the AlchemyAPI GitHub repository.
6. Sentiment Analysis
"Sentiment analysis and opinion mining is the field of study that analyzes people’s opinions, sentiments, evaluations, attitudes, and emotions from written language." - Liu, Bing. "Sentiment analysis and opinion mining." Synthesis Lectures on Human Language Technologies 5.1 (2012): 1-167.
I examined a couple of different packages to implement sentiment analysis.
Indico Sentiment Analysis API
Indico appears to use recurrent neural networks.
'NASHUA, N.H. \xe2\x80\x93 Most presidential candidates don\xe2\x80\x99t talk openly about electoral strategy, preferring to keep the focus on voters\xe2\x80\x99 needs and not their own. But Senator Bernie Sanders of Vermont, in remarks here Friday at one of his campaign offices, made a blunt appeal to New Hampshire voters as he described the state\xe2\x80\x99s February primary as critical to his survival as a candidate.\n\nSaying that he had an \xe2\x80\x9cexcellent chance\xe2\x80\x9d of beating Hillary Rodham Clinton in New Hampshire, and that he could also win the Iowa caucuses a week before, Mr. Sanders said, \xe2\x80\x9cIf we win Iowa and New Hampshire, it opens up for us a path toward victory.\xe2\x80\x9d\n\nMr. Sanders rarely talks about the caucuses and primaries as must-win affairs, but he was seeking to send a sharp message to voters \xe2\x80\x93 through the bank of television cameras and a dozen reporters \xe2\x80\x93 that his campaign will rise or fall in the nominating race\xe2\x80\x99s first two voting states. Mrs. Clinton\xe2\x80\x99s aides, by contrast, are building a political firewall through Southern states to keep her on track to capture the Democratic nomination even if she loses in Iowa and New Hampshire.\n\nAfter an earlier event where he promised to do more than Mrs. Clinton to protect and expand Social Security benefits, Mr. Sanders dwelled largely on campaign dynamics here as he said he faced \xe2\x80\x9ca tough road \xe2\x80\x93 we are taking on establishment politics, we are taking on establishment economics, we will be outspent.\xe2\x80\x9d\n\nMr. Sanders dropped Mrs. 
Clinton\xe2\x80\x99s name as he pointedly noted that unlike him, she has support from a \xe2\x80\x9csuper PAC\xe2\x80\x9d \xe2\x80\x93 the sort of big-money political outfit that he sees as a corrupting influence in government.\n\n\xe2\x80\x9cWhen you have a super PAC, you can sit down in a room with a handful of very wealthy families and then walk out with millions of dollars,\xe2\x80\x9d said Mr. Sanders, who is well aware that many New Hampshire voters strongly support campaign finance reform. \xe2\x80\x9cI\xe2\x80\x99m proud of the way we\xe2\x80\x99ve raised our money.\xe2\x80\x9d\n\nMr. Sanders, who holds a modest lead in New Hampshire polls, is campaigning in the state through Saturday evening, while Mrs. Clinton held events here on Wednesday and Thursday \xe2\x80\x93 a pace that is sure to continue for the next three months.'
{u'Libertarian': 0.13097386225896757, u'Green': 0.011191867696439881, u'Liberal': 0.7681282251289556, u'Conservative': 0.08970604491563688}
Note that Indico's political analysis of this Bernie Sanders article scored highest for "Liberal" (about 0.77).
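Picking out the top label from a result like the one above takes only a couple of lines. The sketch below starts from the scores already shown and does not call Indico.

```python
# Scores copied from the output above.
political = {
    "Libertarian": 0.13097386225896757,
    "Green": 0.011191867696439881,
    "Liberal": 0.7681282251289556,
    "Conservative": 0.08970604491563688,
}

# Label with the highest score.
top_label = max(political, key=political.get)

# All labels, from most to least likely.
ranked = sorted(political.items(), key=lambda item: item[1], reverse=True)

print(top_label)
for label, score in ranked:
    print("%-12s %.3f" % (label, score))
```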
The Bernie Sanders article also scored very positively under both sentiment analysis algorithms offered by Indico. The second of these, sentiment_hq, is intended to be a higher-quality sentiment analysis.