Scraping Reddit

Daniel Donohue

Posted on Nov 15, 2015

Contributed by Daniel Donohue. Daniel was a student of the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. This post was based on his third in-class project (due at the end of the 6th week of the program).

14 November 2015

Introduction

For our third project here at NYC Data Science, we were tasked with writing a web scraping script in Python. Since I spend (probably too much) time on Reddit, I decided that it would be the basis for my project. For the uninitiated, Reddit is a content aggregator, where users submit text posts or links to thematic subforums (called "subreddits"), and other users vote them up or down and comment on them. With over 36 million registered users and nearly a million subreddits, there is a lot of content to scrape.

Methodology

I selected ten subreddits (five of the top subreddits by number of subscribers and five of my personal favorites) and scraped, for each of the top posts, the title, link, date and time of posting, number of votes, and the top-rated comment on that post's comment page. The ten subreddits were:

/r/gaming (subreddit devoted to video games)
/r/movies (all things movies)
/r/videos (video content of all kinds)
/r/worldnews (subreddit for major world news; excludes U.S. internal news)
/r/science (general science)
/r/seahawks (the official subreddit of the Seattle NFL organization)
/r/floridaman (subreddit about ridiculous people in Florida doing ridiculous things)
/r/circlejerk (subreddit for mocking Reddit culture)
/r/totallynotrobots (people acting like robots trying to act like people)
/r/uwotm8 (people writing in Cockney accents)

There are many Python packages that would be adequate for this project, but I ended up using Scrapy. It seemed to be the most versatile among the different options, and it provided easy support for exporting scraped data to a database. Once I had the data stored in a database, I wrote the post titles and top comments to text files, and used the wordcloud module to generate word clouds for each of the subreddits.

The Details

When you start a new project, Scrapy creates a directory with a number of files, and these files depend on one another. The first file, items.py, defines containers that will store the scraped data:

from scrapy import Item, Field


class RedditItem(Item):
    subreddit = Field()
    link = Field()
    title = Field()
    date = Field()
    vote = Field()
    top_comment = Field()

Once filled, the item essentially acts as a Python dictionary, with the keys being the names of the fields, and the values being the scraped data corresponding to those fields.
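To make the dictionary comparison concrete, here is a plain-Python stand-in, not Scrapy's actual implementation. The class name FieldRestrictedDict and the values are assumptions made up for illustration; the one behavior it is meant to mirror is that a Scrapy Item only accepts the fields it declares.

```python
class FieldRestrictedDict(dict):
    """Hypothetical stand-in for RedditItem: a dict limited to declared keys."""

    fields = ('subreddit', 'link', 'title', 'date', 'vote', 'top_comment')

    def __setitem__(self, key, value):
        # Reject keys that were not declared, as a Scrapy Item does.
        if key not in self.fields:
            raise KeyError('unsupported field: %s' % key)
        dict.__setitem__(self, key, value)


item = FieldRestrictedDict()
item['title'] = 'A placeholder title'   # placeholder, not scraped data
item['vote'] = 42

# Reading values and converting to a plain dict work as with any dictionary.
as_dict = dict(item)

try:
    item['author'] = 'someone'           # 'author' was never declared
    rejected = False
except KeyError:
    rejected = True
```

The conversion with dict(item) is the same one the database pipeline later relies on when it stores an item.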

The next file is the one that does all the heavy lifting---the file defining a Spider class. A Spider is a Python class that Scrapy uses to define what pages to start at, how to navigate them, and how to parse their contents to extract items. First, we have to import the modules we use in the definition of the Spider class:

import re

from bs4 import BeautifulSoup

from scrapy import Spider, Request
from reddit.items import RedditItem

The first two imports are situational: we'll use a regular expression to get the name of the subreddit and BeautifulSoup to extract the text of the top comment. Next, we import scrapy.Spider, from which our Spider will inherit, and scrapy.Request, which we'll use to build the HTTP requests for the comment pages. Finally, we import the Item class we defined in items.py.
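As a quick illustration of the regex step, the sketch below applies the spider's subreddit pattern to a couple of comment-page URLs. The helper name subreddit_from_url and the URLs themselves are made up for this example.

```python
import re

# The pattern the spider uses: '/r/', a run of letters, and an optional
# trailing '8' (which is what lets it capture 'uwotm8').
SUBREDDIT_RE = re.compile(r'/r/[A-Za-z]*8?')


def subreddit_from_url(url):
    """Return the '/r/name' part of a comment-page URL, or None if absent."""
    match = SUBREDDIT_RE.search(url)
    return match.group() if match else None


# Hypothetical comment-page URLs (the post IDs and slugs are made up).
urls = [
    'https://www.reddit.com/r/uwotm8/comments/abc123/some_title/',
    'https://www.reddit.com/r/Seahawks/comments/def456/another_title/',
]
names = [subreddit_from_url(u) for u in urls]
print(names)
```

Note that the pattern is tailored to the ten subreddits in this project: a name containing an underscore or any other digit would be cut off at the first such character.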

The next task is to give our Spider a place to start crawling.

class RedditSpider(Spider):
    name = 'reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/circlejerk', 
                'https://www.reddit.com/r/gaming', 
                'https://www.reddit.com/r/floridaman',  
                'https://www.reddit.com/r/movies', 
                'https://www.reddit.com/r/science', 
                'https://www.reddit.com/r/seahawks', 
                'https://www.reddit.com/r/totallynotrobots', 
                'https://www.reddit.com/r/uwotm8', 
                'https://www.reddit.com/r/videos', 
                'https://www.reddit.com/r/worldnews']

The attribute allowed_domains limits the domains the Spider is allowed to crawl; start_urls is where the Spider will start crawling. Next, we define a parse method, which will tell the Spider what to do on each of the start_urls. Here is the first part of this method's definition:

    def parse(self, response):
        links = response.xpath('//p[@class="title"]/a[@class="title may-blank "]/@href').extract()
        titles = response.xpath('//p[@class="title"]/a[@class="title may-blank "]/text()').extract()
        dates = response.xpath('//p[@class="tagline"]/time[@class="live-timestamp"]/@title').extract()
        votes = response.xpath('//div[@class="midcol unvoted"]/div[@class="score unvoted"]/text()').extract()
        comments = response.xpath('//div[@id="siteTable"]//a[@class="comments may-blank"]/@href').extract()

This uses XPath to select certain parts of the HTML document. In the end, links is a list of the links for each post, titles is a list of all the post titles, etc. Corresponding elements of the first four of these lists will fill some of the fields in each instance of a RedditItem, but the top_comment field needs to be filled on the comment page for that post. One way to approach this is to partially fill an instance of RedditItem, store this partially filled item in the metadata of a Request to a comment page, and then use a second method to fill the top_comment field on the comment page. This part of the parse method's definition achieves this:

        for i, link in enumerate(comments):
            item = RedditItem()
            # Pull '/r/<name>' out of the comment url; the optional '8'
            # in the pattern is there to capture 'uwotm8'.
            item['subreddit'] = re.search(r'/r/[A-Za-z]*8?', link).group()
            item['link'] = links[i]
            item['title'] = titles[i]
            item['date'] = dates[i]
            # A bullet character in place of a number means the score is hidden.
            if votes[i] == u'\u2022':
                item['vote'] = 'hidden'
            else:
                item['vote'] = int(votes[i])

            # Carry the partially filled item along to the comment page.
            request = Request(link, callback=self.parse_comment_page)
            request.meta['item'] = item

            yield request

For the ith link in the list of comment urls, we create an instance of RedditItem, fill the subreddit field with the name of the subreddit (extracted from the comment url with the use of regular expressions), the link field with the ith link, the title field with the ith title, etc. Then, we create a request to the comment page with the instruction to send it to the method parse_comment_page, and store the partially filled item temporarily in this request's metadata. The method parse_comment_page tells the Spider what to do with this:

    def parse_comment_page(self, response):
        item = response.meta['item']

        top = response.xpath('//div[@class="commentarea"]//div[@class="md"]').extract()[0]
        top_soup = BeautifulSoup(top, 'html.parser')
        item['top_comment'] = top_soup.get_text().replace('\n', ' ')

        yield item

Again, XPath specifies the HTML to extract from the comment page, and in this case, BeautifulSoup removes HTML tags from the top comment. Then, finally, we fill the last part of the item with this text and yield the filled item to the next step in the scraping process.
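The comment-page step can be tried without Scrapy or BeautifulSoup. The sketch below is a stand-in under stated assumptions: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, and the page is a hypothetical snippet rather than real Reddit HTML. The function name top_comment_text is also made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a comment page (not real Reddit markup).
PAGE = """
<html><body>
  <div class="commentarea">
    <div class="comment">
      <div class="md"><p>First <em>top</em> comment.</p>
<p>Second paragraph.</p></div>
    </div>
    <div class="comment">
      <div class="md"><p>A lower-ranked comment.</p></div>
    </div>
  </div>
</body></html>
"""


def top_comment_text(page):
    """Return the text of the first 'md' block inside the comment area."""
    root = ET.fromstring(page.strip())
    # find() returns the first match in document order, which plays the
    # role of extract()[0] in the spider.
    node = root.find(".//div[@class='commentarea']//div[@class='md']")
    if node is None:
        return None
    # itertext() yields the text with all tags removed, much like
    # BeautifulSoup's get_text(); newlines are then turned into spaces.
    return "".join(node.itertext()).replace("\n", " ").strip()


print(top_comment_text(PAGE))
```

Unlike the spider, this sketch returns None when a page has no comments; in the spider, extract()[0] on an empty result would raise an IndexError instead.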

The next step is to tell Scrapy what to do with the extracted data; this is done in the item pipeline. The item pipeline is responsible for processing the scraped data, and storing items in a database is a typical use. I chose to store the items in MongoDB, a document-oriented database (in contrast with the more traditional table-based relational structure). Strictly speaking, a relational database would have sufficed, but MongoDB has a more flexible data model, which could come in handy if I decide to expand on this project in the future. First, we have to specify the database settings in settings.py (another file initially created by Scrapy):

BOT_NAME = 'reddit'

SPIDER_MODULES = ['reddit.spiders']
NEWSPIDER_MODULE = 'reddit.spiders'

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {
    'reddit.pipelines.DuplicatesPipeline': 300,
    'reddit.pipelines.MongoDBPipeline': 800,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "reddit"
MONGODB_COLLECTION = "post"

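The two-second DOWNLOAD_DELAY is there to avoid violating Reddit's terms of service. The numbers in ITEM_PIPELINES set the order in which the pipeline classes run (lower values first), so duplicates are dropped before anything is written to MongoDB. So, now we've set up a Spider to crawl and parse the HTML, and we've set up our database settings. Now we need to connect the two in pipelines.py: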

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['link'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['link'])
            return item



class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Check the values, not the field names: iterating over the item
        # itself yields only the names, which are never empty.
        for field, value in item.items():
            if value is None or value == '':
                raise DropItem("Missing {0}!".format(field))
        self.collection.insert_one(dict(item))
        log.msg("Added to MongoDB database!",
                level=log.DEBUG, spider=spider)
        return item

The first class checks whether a link has already been seen and, if so, drops the item. The second class handles data persistence: its first method connects to the database (using the settings we've defined in settings.py), and its second method validates each item and adds it to the collection. In the end, the collection holds one document per post, with the six fields defined in RedditItem (plus the _id that MongoDB adds).
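To see the intended effect of the two stages without a running MongoDB server, here is a plain-Python sketch. The function name run_pipelines and the items are hypothetical, a list stands in for the collection, and items are skipped rather than dropped with DropItem.

```python
def run_pipelines(items):
    """Apply the duplicate check, then the missing-field check, in that order."""
    seen_links = set()
    stored = []
    for item in items:
        # Stage 1 (DuplicatesPipeline): skip links that were already seen.
        if item['link'] in seen_links:
            continue
        seen_links.add(item['link'])
        # Stage 2 (MongoDBPipeline): skip items with an empty field.
        if any(value in (None, '') for value in item.values()):
            continue
        stored.append(dict(item))  # stands in for the database insert
    return stored


# Hypothetical items; the values are placeholders, not scraped data.
items = [
    {'link': 'http://example.com/a', 'title': 'First post', 'vote': 10},
    {'link': 'http://example.com/a', 'title': 'First post', 'vote': 10},
    {'link': 'http://example.com/b', 'title': '', 'vote': 3},
    {'link': 'http://example.com/c', 'title': 'Third post', 'vote': 0},
]
stored = run_pipelines(items)
for doc in stored:
    print(doc)
```

The check compares against None and the empty string rather than testing truthiness, so a post with a score of 0 is kept.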

Reddit Word Clouds

The real work was done in actually scraping the data. Now, we want to use it to create visualizations of frequently used words across the ten subreddits. The Python module wordcloud does just this: it takes plain text and generates word clouds like the one you see above, with very little effort. The first step is to write the post titles and top comments to text files.

import sys
import pymongo

reload(sys)
sys.setdefaultencoding('UTF8')

client = pymongo.MongoClient()
db = client.reddit

subreddits = ['/r/circlejerk', '/r/gaming', '/r/FloridaMan', '/r/movies',
                    '/r/science', '/r/Seahawks', '/r/totallynotrobots', 
                    '/r/uwotm8', '/r/videos', '/r/worldnews']

for sub in subreddits:
    cursor = db.post.find({"subreddit": sub})
    for doc in cursor:
        with open("text_files/%s.txt" % sub[3:], 'a') as f:
            f.write(doc['title'])
            f.write('\n\n')
            f.write(doc['top_comment'])
            f.write('\n\n')

client.close()

The first two lines after the imports change Python 2's default encoding from ASCII to UTF-8, so that non-ASCII characters such as emojis (of which there were many in the comments) can be written to the files without raising encoding errors. Finally, we use these text files to generate the word clouds:

import numpy as np

from PIL import Image
from wordcloud import WordCloud


subs = ['circlejerk', 'FloridaMan', 'gaming', 'movies',
            'science', 'Seahawks', 'totallynotrobots', 
            'uwotm8', 'videos', 'worldnews']

# Load the mask once; it is the same image for every subreddit.
reddit_mask = np.array(Image.open('reddit_mask.jpg'))

for sub in subs:
    with open('text_files/%s.txt' % sub) as f:
        text = f.read()
    wc = WordCloud(background_color="black", mask=reddit_mask)
    wc.generate(text)
    wc.to_file('wordclouds/%s.jpg' % sub)
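Word size in a word cloud is driven by how often each word appears. As a rough approximation of that counting step (wordcloud's own tokenizer and stopword list are more elaborate), the sketch below uses only the standard library. The stopword set, the function name word_frequencies, and the sample text are assumptions made up for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; wordcloud ships a much longer one.
STOPWORDS = {'the', 'a', 'an', 'is', 'and', 'of', 'to', 'i', 'am'}


def word_frequencies(text):
    """Count lowercase words in text, ignoring the stopwords above."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


# Placeholder text, not taken from the scraped files.
sample = "I am a human. The human is a fellow human and not a robot."
freqs = word_frequencies(sample)
print(freqs.most_common(1))
```

Running something like this on one of the text files is a quick way to check which words should dominate the image before generating it.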

The WordCloud object uses reddit_mask.jpg as a canvas: it only fills in words in the black area. Here's an example of what we get (generated from posts on /r/totallynotrobots):

After all of this, I am now a big fan of Scrapy and everything it can do, but this project has certainly only scratched the surface of its capabilities.

If you care to see the rest of the word clouds you can find them here; the code for this project can be found here.

About Author

Daniel Donohue

Daniel Donohue (A.B. Mathematics, M.S. Mathematics) spent the last three years as a Ph.D. student in mathematics studying topics in algebraic geometry, but decided a few short months ago that he needed a change in venue and career....


