Scraping Reddit
Contributed by Daniel Donohue. Daniel was a student of the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. This post was based on his third in-class project (due at the end of the 6th week of the program).
14 November 2015
Introduction
For our third project here at NYC Data Science, we were tasked with writing a web scraping script in Python. Since I spend (probably too much) time on Reddit, I decided that it would be the basis for my project. For the uninitiated, Reddit is a content aggregator, where users submit text posts or links to thematic subforums (called "subreddits"), and other users vote them up or down and comment on them. With over 36 million registered users and nearly a million subreddits, there is a lot of content to scrape.
Methodology
I selected ten subreddits---five of the top subreddits by number of subscribers and five of my personal favorites---and scraped each post's title, link, date and time, number of votes, and the top-rated comment from its comment page. The ten subreddits were:
- /r/gaming (subreddit devoted to video games)
- /r/movies (all things movies)
- /r/videos (video content of all kinds)
- /r/worldnews (subreddit for major world news; excludes U.S. internal news)
- /r/science (general science)
- /r/seahawks (the subreddit of the Seattle Seahawks NFL team)
- /r/floridaman (subreddit about ridiculous people in Florida doing ridiculous things)
- /r/circlejerk (subreddit for mocking Reddit culture)
- /r/totallynotrobots (people acting like robots trying to act like people)
- /r/uwotm8 (people writing in Cockney accents)
There are many Python packages that would be adequate for this project, but I ended up using Scrapy. It seemed the most versatile of the options, and it provided easy support for exporting scraped data to a database. Once I had the data stored in a database, I wrote the post titles and top comments to text files and used the wordcloud module to generate a word cloud for each of the subreddits.
The Details
When you start a new project, Scrapy creates a directory with a number of files, each of which relies on the others. The first file, items.py, defines containers that will store the scraped data:
```python
from scrapy import Item, Field


class RedditItem(Item):
    subreddit = Field()
    link = Field()
    title = Field()
    date = Field()
    vote = Field()
    top_comment = Field()
```
Once filled, the item essentially acts as a Python dictionary, with the keys being the names of the fields, and the values being the scraped data corresponding to those fields.
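To make that concrete, here is a minimal sketch (with made-up field values, run from the project directory) of how a filled item behaves:

```python
from reddit.items import RedditItem

# Made-up values, purely to illustrate dictionary-style access.
item = RedditItem()
item['subreddit'] = '/r/movies'
item['title'] = 'Some post title'
item['vote'] = 1234

print(item['title'])   # look up a field just like a dict key
print(dict(item))      # convert to a plain dict, e.g. before inserting into MongoDB
```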
The next file is the one that does all the heavy lifting---the file defining a Spider class. A Spider is a Python class that Scrapy uses to define what pages to start at, how to navigate them, and how to parse their contents to extract items. First, we have to import the modules we use in the definition of the Spider class:
```python
import re

from bs4 import BeautifulSoup
from scrapy import Spider, Request

from reddit.items import RedditItem
```
The first two imports are situational: we'll use a regular expression to get the name of the subreddit and BeautifulSoup to extract the text of the top comment. Next, we import Spider, from which our spider class will inherit, and Request, which we'll use to issue requests for the comment pages. Finally, we import the RedditItem class we defined in items.py.
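As a quick illustration of the regex half of that (the comment URL below is made up, but has the same shape as the ones Reddit serves), the pattern we'll use later pulls the subreddit name straight out of a comment link:

```python
import re

# A made-up comment URL with the same structure as a real one.
url = 'https://www.reddit.com/r/movies/comments/abc123/some_post_title/'

# The trailing '8?' is there so that /r/uwotm8 also matches.
print(re.findall('/r/[A-Za-z]*8?', url))   # ['/r/movies']
```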
The next task is to give our Spider a place to start crawling.
```python
class RedditSpider(Spider):
    name = 'reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/circlejerk',
                  'https://www.reddit.com/r/gaming',
                  'https://www.reddit.com/r/floridaman',
                  'https://www.reddit.com/r/movies',
                  'https://www.reddit.com/r/science',
                  'https://www.reddit.com/r/seahawks',
                  'https://www.reddit.com/r/totallynotrobots',
                  'https://www.reddit.com/r/uwotm8',
                  'https://www.reddit.com/r/videos',
                  'https://www.reddit.com/r/worldnews']
```
The attribute allowed_domains limits the domains the Spider is allowed to crawl; start_urls is where the Spider will start crawling. Next, we define a parse method, which will tell the Spider what to do on each of the start_urls. Here is the first part of this method's definition:
```python
    def parse(self, response):
        links = response.xpath('//p[@class="title"]/a[@class="title may-blank "]/@href').extract()
        titles = response.xpath('//p[@class="title"]/a[@class="title may-blank "]/text()').extract()
        dates = response.xpath('//p[@class="tagline"]/time[@class="live-timestamp"]/@title').extract()
        votes = response.xpath('//div[@class="midcol unvoted"]/div[@class="score unvoted"]/text()').extract()
        comments = response.xpath('//div[@id="siteTable"]//a[@class="comments may-blank"]/@href').extract()
```
This uses XPath to select certain parts of the HTML document. In the end, links is a list of the links for each post, titles is a list of all the post titles, etc. Corresponding elements of the first four of these lists will fill some of the fields in each instance of a RedditItem, but the top_comment field needs to be filled on the comment page for that post. One way to approach this is to partially fill an instance of RedditItem, store this partially filled item in the metadata of a Request to a comment page, and then use a second method to fill the top_comment field on the comment page. This part of the parse method's definition achieves this:
```python
        for i, link in enumerate(comments):
            item = RedditItem()
            item['subreddit'] = str(re.findall('/r/[A-Za-z]*8?', link))[3:len(str(re.findall('/r/[A-Za-z]*8?', link))) - 2]
            item['link'] = links[i]
            item['title'] = titles[i]
            item['date'] = dates[i]
            if votes[i] == u'\u2022':
                item['vote'] = 'hidden'
            else:
                item['vote'] = int(votes[i])
            request = Request(link, callback=self.parse_comment_page)
            request.meta['item'] = item
            yield request
```
For the ith link in the list of comment URLs, we create an instance of RedditItem and fill the subreddit field with the name of the subreddit (extracted from the comment URL with a regular expression), the link field with the ith link, the title field with the ith title, and so on. Then we create a request to the comment page, telling Scrapy to pass the response to the method parse_comment_page, and temporarily store the partially filled item in the request's metadata. The method parse_comment_page tells the Spider what to do with it:
```python
    def parse_comment_page(self, response):
        item = response.meta['item']
        top = response.xpath('//div[@class="commentarea"]//div[@class="md"]').extract()[0]
        top_soup = BeautifulSoup(top, 'html.parser')
        item['top_comment'] = top_soup.get_text().replace('\n', ' ')
        yield item
```
Again, XPath specifies the HTML to extract from the comment page, and this time BeautifulSoup strips the HTML tags from the top comment. Finally, we fill the last field of the item with that text and yield the completed item to the next step in the scraping process.
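As a small standalone illustration of that last step (the HTML snippet below is invented, but mimics the markup the XPath pulls out of a comment page), get_text() is what strips the tags:

```python
from bs4 import BeautifulSoup

# Invented comment markup, similar in shape to what the XPath above returns.
top = '<div class="md"><p>This is the top comment,\nwith a <em>bit</em> of formatting.</p></div>'

top_soup = BeautifulSoup(top, 'html.parser')
print(top_soup.get_text().replace('\n', ' '))
# -> This is the top comment, with a bit of formatting.
```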
The next step is to tell Scrapy what to do with the extracted data; this is done in the item pipeline. The item pipeline is responsible for processing scraped items, and storing them in a database is a typical such step. I chose to store the items in MongoDB, a document-oriented database (in contrast with the more traditional table-based relational structure). Strictly speaking, a relational database would have sufficed, but MongoDB has a more flexible data model, which could come in handy if I decide to expand this project in the future. First, we have to specify the database settings in settings.py (another file Scrapy creates when you start a project):
```python
BOT_NAME = 'reddit'

SPIDER_MODULES = ['reddit.spiders']
NEWSPIDER_MODULE = 'reddit.spiders'

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {'reddit.pipelines.DuplicatesPipeline': 300,
                  'reddit.pipelines.MongoDBPipeline': 800, }

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "reddit"
MONGODB_COLLECTION = "post"
```
The delay is there to avoid violating Reddit's terms of service. So now we've set up a Spider to crawl and parse the HTML, and we've specified our database settings; all that's left is to connect the two in pipelines.py:
```python
import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['link'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['link'])
            return item


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item.fields:
            # drop the item if any field is missing or empty
            if item.get(data) in (None, ''):
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item
```
The first class checks whether a post's link has already been seen and drops the item if it has. The second class handles the data persistence: the first method in MongoDBPipeline connects to the database (using the settings we defined in settings.py), and the second validates each item and adds it to the collection.
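In the end, the post collection is filled with one document per scraped post, with one key per Item field. As a purely illustrative way to see their shape, you can pull a document back out with pymongo (the printed values below are invented, not real scraped data):

```python
import pprint

import pymongo

client = pymongo.MongoClient()
doc = client.reddit.post.find_one({"subreddit": "/r/movies"})
pprint.pprint(doc)
# Prints something along these lines (values invented for illustration):
# {u'_id': ObjectId('...'),
#  u'date': u'Sat Nov 14 18:35:12 2015 UTC',
#  u'link': u'https://www.reddit.com/r/movies/comments/...',
#  u'subreddit': u'/r/movies',
#  u'title': u'Some post title',
#  u'top_comment': u'This is the top comment on the post.',
#  u'vote': 1234}
client.close()
```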
Reddit Word Clouds
The real work was in the scraping itself; now we want to use the data to create visualizations of frequently used words across the ten subreddits. The Python module wordcloud does just this: give it raw text and, with very little effort, it generates word clouds like the one shown below. The first step is to write the post titles and top comments to text files.
```python
import sys

import pymongo

reload(sys)
sys.setdefaultencoding('UTF8')

client = pymongo.MongoClient()
db = client.reddit

subreddits = ['/r/circlejerk', '/r/gaming', '/r/FloridaMan', '/r/movies',
              '/r/science', '/r/Seahawks', '/r/totallynotrobots',
              '/r/uwotm8', '/r/videos', '/r/worldnews']

for sub in subreddits:
    cursor = db.post.find({"subreddit": sub})
    for doc in cursor:
        with open("text_files/%s.txt" % sub[3:], 'a') as f:
            f.write(doc['title'])
            f.write('\n\n')
            f.write(doc['top_comment'])
            f.write('\n\n')

client.close()
```
The first two lines after the imports change the default encoding from ASCII to UTF-8 so that emojis (of which there were many in the comments) are written out properly. Finally, we use these text files to generate the word clouds:
```python
import numpy as np
from PIL import Image
from wordcloud import WordCloud

subs = ['circlejerk', 'FloridaMan', 'gaming', 'movies', 'science',
        'Seahawks', 'totallynotrobots', 'uwotm8', 'videos', 'worldnews']

for sub in subs:
    text = open('text_files/%s.txt' % sub).read()
    reddit_mask = np.array(Image.open('reddit_mask.jpg'))
    wc = WordCloud(background_color="black", mask=reddit_mask)
    wc.generate(text)
    wc.to_file('wordclouds/%s.jpg' % sub)
```
The WordCloud object uses reddit_mask.jpg as a canvas: it only fills in words in the black area. Here's an example of what we get (generated from posts on /r/totallynotrobots):
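If you want to tune the clouds further, WordCloud accepts a few other useful options; here is a quick sketch (the settings below are arbitrary examples, not the ones I used for the clouds in this post):

```python
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# Arbitrary example settings, not the ones used for the clouds in this post.
text = open('text_files/totallynotrobots.txt').read()
reddit_mask = np.array(Image.open('reddit_mask.jpg'))

wc = WordCloud(background_color="black",
               mask=reddit_mask,
               max_words=200,           # cap how many words are drawn
               stopwords=STOPWORDS)     # skip common English filler words
wc.generate(text)
wc.to_file('wordclouds/totallynotrobots_tuned.jpg')
```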
After all of this, I am now a big fan of Scrapy and everything it can do, but this project has certainly only scratched the surface of its capabilities.
If you care to see the rest of the word clouds, you can find them here; the code for this project can be found here.