Scraping Reddit

Daniel Donohue
Posted on Nov 15, 2015

Contributed by Daniel Donohue.  Daniel was a student of the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015.  This post was based on his third in-class project (due at the end of the 6th week of the program).


Introduction

For our third project here at NYC Data Science, we were tasked with writing a web scraping script in Python.  Since I spend (probably too much) time on Reddit, I decided it would be the basis for my project.  For the uninitiated, Reddit is a content aggregator where users submit text posts or links to thematic subforums (called "subreddits"), and other users vote them up or down and comment on them.  With over 36 million registered users and nearly a million subreddits, there is a lot of content to scrape.

Methodology

I selected ten subreddits (five of the top subreddits by subscriber count and five personal favorites) and scraped the top post titles, links, date and time of each post, number of votes, and the top-rated comment on each post's comment page.  The ten subreddits were /r/circlejerk, /r/FloridaMan, /r/gaming, /r/movies, /r/science, /r/Seahawks, /r/totallynotrobots, /r/uwotm8, /r/videos, and /r/worldnews.

There are many Python packages that would have been adequate for this project, but I ended up using Scrapy.  It seemed the most versatile of the options, and it provides easy support for exporting scraped data to a database.  Once I had the data stored in a database, I wrote the post titles and top comments to .txt files, and used the wordcloud module to generate a word cloud for each subreddit.

The Details

When you start a new project, Scrapy creates a directory with a number of files that work in concert.  The first of these, items.py, defines containers that will store the scraped data:

from scrapy import Item, Field


class RedditItem(Item):
    subreddit = Field()
    link = Field()
    title = Field()
    date = Field()
    vote = Field()
    top_comment = Field()

Once filled, the item essentially acts as a Python dictionary, with the keys being the names of the fields, and the values being the scraped data corresponding to those fields.

The next file is the one that does all the heavy lifting---the file defining a Spider class.  A Spider is a Python class that Scrapy uses to define what pages to start at, how to navigate them, and how to parse their contents to extract items.  First, we have to import the modules we use in the definition of the Spider class:

import re

from bs4 import BeautifulSoup

from scrapy import Spider, Request
from reddit.items import RedditItem

The first two imports are situational: we'll use regular expressions to extract the name of the subreddit, and BeautifulSoup to extract the text of the top comment.  Next, we import Spider, from which our spider class will inherit, and Request, which represents the HTTP requests the Spider will issue.  Finally, we import the RedditItem class we defined in items.py.
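To see what the regex import will be doing later, here is a quick sketch on a made-up comment permalink (the trailing 8? in the pattern exists to accommodate /r/uwotm8):

```python
import re

# A made-up comment permalink in the shape Reddit used at the time
url = 'https://www.reddit.com/r/uwotm8/comments/3spfce/example_post/'
matches = re.findall('/r/[A-Za-z]*8?', url)   # ['/r/uwotm8']
```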

The next task is to give our Spider a place to start crawling.

class RedditSpider(Spider):
    name = 'reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/circlejerk',
                  'https://www.reddit.com/r/gaming',
                  'https://www.reddit.com/r/floridaman',
                  'https://www.reddit.com/r/movies',
                  'https://www.reddit.com/r/science',
                  'https://www.reddit.com/r/seahawks',
                  'https://www.reddit.com/r/totallynotrobots',
                  'https://www.reddit.com/r/uwotm8',
                  'https://www.reddit.com/r/videos',
                  'https://www.reddit.com/r/worldnews']

The attribute allowed_domains limits the domains the Spider is allowed to crawl; start_urls is where the Spider will start crawling.  Next, we define a parse method, which will tell the Spider what to do on each of the start_urls.  Here is the first part of this method's definition:

    def parse(self, response):
        links = response.xpath('//p[@class="title"]/a[@class="title may-blank "]/@href').extract()
        titles = response.xpath('//p[@class="title"]/a[@class="title may-blank "]/text()').extract()
        dates = response.xpath('//p[@class="tagline"]/time[@class="live-timestamp"]/@title').extract()
        votes = response.xpath('//div[@class="midcol unvoted"]/div[@class="score unvoted"]/text()').extract()
        comments = response.xpath('//div[@id="siteTable"]//a[@class="comments may-blank"]/@href').extract()
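These selectors are easiest to understand on a toy snippet.  Here is a minimal sketch of the same idea using only the standard library's (limited) XPath support, on markup shaped like Reddit's 2015 listing page; Scrapy's own selectors are more powerful, but the semantics are the same:

```python
import xml.etree.ElementTree as ET

# A made-up fragment shaped like one post entry on the listing page
html = (
    '<div id="siteTable">'
    '<p class="title"><a href="http://example.com/article">Example title</a></p>'
    '</div>'
)
root = ET.fromstring(html)

# //p[@class="title"]/a/@href and .../text(), ElementTree-style
links = [a.get('href') for a in root.findall('.//p[@class="title"]/a')]
titles = [a.text for a in root.findall('.//p[@class="title"]/a')]
```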

These XPath expressions select certain parts of the HTML document.  In the end, links is a list of the links for each post, titles is a list of all the post titles, and so on.  Corresponding elements of the first four of these lists will fill some of the fields in each instance of a RedditItem, but the top_comment field needs to be filled on that post's comment page.  One way to approach this is to partially fill an instance of RedditItem, store this partially filled item in the metadata of a Request to the comment page, and then use a second method to fill the top_comment field there.  This part of the parse method's definition achieves that:

       for i, link in enumerate(comments):
            item = RedditItem()
            item['subreddit'] = re.search('/r/[A-Za-z]*8?', link).group()
            item['link'] = links[i]
            item['title'] = titles[i]
            item['date'] = dates[i]
            if votes[i] == u'\u2022':
                item['vote'] = 'hidden'
            else:
                item['vote'] = int(votes[i])

            request = Request(link, callback=self.parse_comment_page)
            request.meta['item'] = item

            yield request

For the ith link in the list of comment urls, we create an instance of RedditItem, fill the subreddit field with the name of the subreddit (extracted from the comment url with the use of regular expressions), the link field with the ith link, the title field with the ith title, etc.  Then, we create a request to the comment page with the instruction to send it to the method parse_comment_page, and store the partially filled item temporarily in this request's metadata.  The method parse_comment_page tells the Spider what to do with this:

    def parse_comment_page(self, response):
        item = response.meta['item']

        top = response.xpath('//div[@class="commentarea"]//div[@class="md"]').extract()[0]
        top_soup = BeautifulSoup(top, 'html.parser')
        item['top_comment'] = top_soup.get_text().replace('\n', ' ')

        yield item

Again, XPath specifies the HTML to extract from the comment page, and in this case, BeautifulSoup removes HTML tags from the top comment.  Then, finally, we fill the last part of the item with this text and yield the filled item to the next step in the scraping process.
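On a made-up comment fragment, the tag-stripping step looks like this (a sketch; the div class mirrors Reddit's markup):

```python
from bs4 import BeautifulSoup

# A made-up top comment as it might arrive from the XPath extraction
top = '<div class="md"><p>SILLY HUMAN,\nROBOTS DO NOT SLEEP</p></div>'
comment = BeautifulSoup(top, 'html.parser').get_text().replace('\n', ' ')
```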

The next step is to tell Scrapy what to do with the extracted data; this is done in the item pipeline.  The item pipeline is responsible for processing the scraped data, and storing items in a database is a typical such process.  I chose to store the items in MongoDB, a document-oriented database (in contrast with the more traditional table-based relational database structure).  Strictly speaking, a relational database would have sufficed, but MongoDB has a more flexible data model, which could prove useful if I decide to expand on this project in the future.  First, we have to specify the database settings in settings.py (another file initially created by Scrapy):

BOT_NAME = 'reddit'

SPIDER_MODULES = ['reddit.spiders']
NEWSPIDER_MODULE = 'reddit.spiders'

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {'reddit.pipelines.DuplicatesPipeline': 300,
                  'reddit.pipelines.MongoDBPipeline': 800}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "reddit"
MONGODB_COLLECTION = "post"

The delay is there to avoid violating Reddit's terms of service.  So, now we've set up a Spider to crawl and parse the HTML, and we've set up our database settings.  Now we need to connect the two in pipelines.py:

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['link'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['link'])
            return item



class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for field in item:
            if not item.get(field):
                valid = False
                raise DropItem("Missing {0}!".format(field))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item

The first class is used to check if a link has already been added, and skips processing that item if it has.  The second class defines the data persistence.  The first method in MongoDBPipeline actually connects to the database (using the settings we've defined in settings.py), and the second method processes the data and adds it to the collection.  In the end, our collection is filled with documents like this:

(Screenshot: a sample document from the "post" collection.)
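Concretely, each stored document has one key per RedditItem field (plus MongoDB's auto-generated _id).  The values here are invented for illustration:

```python
# A hypothetical document from the "post" collection (values invented)
sample_doc = {
    "subreddit": "/r/science",
    "link": "http://www.example.com/some-study",
    "title": "Scientists discover something interesting",
    "date": "Sat Nov 14 12:00:00 2015 UTC",
    "vote": 4512,
    "top_comment": "Here is some important context for this study.",
}
```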

Reddit Word Clouds

The real work was done in actually scraping the data.  Now, we want to use it to create visualizations of frequently used words across the ten subreddits.  The Python module wordcloud does just this: it takes a plain text file and, with very little effort, generates word clouds like the one shown below.  The first step is to write the post titles and top comments to text files.

import sys
import pymongo

# Python 2 only: change the default string encoding from ASCII to UTF-8
reload(sys)
sys.setdefaultencoding('UTF8')

client = pymongo.MongoClient()
db = client.reddit

subreddits = ['/r/circlejerk', '/r/gaming', '/r/FloridaMan', '/r/movies',
                    '/r/science', '/r/Seahawks', '/r/totallynotrobots', 
                    '/r/uwotm8', '/r/videos', '/r/worldnews']

for sub in subreddits:
    cursor = db.post.find({"subreddit": sub})
    for doc in cursor:
        with open("text_files/%s.txt" % sub[3:], 'a') as f:
            f.write(doc['title'])
            f.write('\n\n')
            f.write(doc['top_comment'])
            f.write('\n\n')

client.close()
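As an aside, the reload(sys)/sys.setdefaultencoding('UTF8') lines above are a Python 2 idiom; a Python 3 sketch of the same fix is simply to pass the encoding explicitly each time a file is opened:

```python
import os
import tempfile

# Python 3: specify UTF-8 per file rather than changing a global default
path = os.path.join(tempfile.gettempdir(), "wordcloud_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("A post title with an emoji \U0001F600\n")
with open(path, encoding="utf-8") as f:
    content = f.read()
os.remove(path)
```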

The reload(sys) and sys.setdefaultencoding lines change the default encoding from ASCII to UTF-8, in order to properly encode emojis (of which there were many in the comments).  Finally, we use these text files to generate the word clouds:

import numpy as np

from PIL import Image
from wordcloud import WordCloud


subs = ['circlejerk', 'FloridaMan', 'gaming', 'movies',
            'science', 'Seahawks', 'totallynotrobots', 
            'uwotm8', 'videos', 'worldnews']

for sub in subs:
    text = open('text_files/%s.txt' % sub).read()
    reddit_mask = np.array(Image.open('reddit_mask.jpg'))
    wc = WordCloud(background_color="black", mask=reddit_mask)
    wc.generate(text)
    wc.to_file('wordclouds/%s.jpg' % sub)

The WordCloud object uses reddit_mask.jpg as a canvas: it only fills in words in the black area.  Here's an example of what we get (generated from posts on /r/totallynotrobots):

(Word cloud generated from /r/totallynotrobots.)
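The mask is just an image array: wordcloud treats pure-white (255) pixels as off-limits and places words everywhere else.  A hypothetical mask built in code, rather than loaded from reddit_mask.jpg, makes that convention concrete:

```python
import numpy as np

# A stand-in mask: a white canvas with a black square where words may go.
# wordcloud skips the white (255) pixels and draws only in the rest.
mask = np.full((200, 200), 255, dtype=np.uint8)
mask[50:150, 50:150] = 0
# Passing mask=mask to WordCloud(...) would confine words to the square.
```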

After all of this, I am now a big fan of Scrapy and everything it can do, but this project has certainly only scratched the surface of its capabilities.

If you care to see the rest of the word clouds you can find them here; the code for this project can be found here.

About Author

Daniel Donohue


Daniel Donohue (A.B. Mathematics, M.S. Mathematics) spent the last three years as a Ph.D. student in mathematics studying topics in algebraic geometry, but decided a few short months ago that he needed a change in venue and career....

