Scraping Used Items on Craigslist.org with Scrapy

Keenan Burke-Pitts
Posted on Jun 11, 2018

Purpose

Why I Chose This Project

As the old adage goes, ‘One man’s trash is another man’s treasure.’ This was my first introduction to Scrapy, and my goal was to scrape used-item listings from Craigslist.org and perform some basic exploratory data analysis (EDA) on the results.

I expanded this project in much more detail in my final project, which you can check out here if you’re interested: https://nycdatascience.com/blog/student-works/capstone/local-used-items-analysis-with-python-and-tableau/.

Questions to Answer

I set out to answer what types of items were available, where they were located, and how many were listed in each popular location. I also wanted to find out how prices were distributed across items in the popular locations.

Process

Where and How I Extracted the Data

Using Scrapy

This spider scraped just the first page of the used-items-for-sale section. That made the dataset a small sample of what was actually available on Craigslist, but I went this route because my research showed that Craigslist will block an IP address if it notices scraping activity it deems incompatible with its terms of service. I scraped the entire used-items-for-sale section in my final project, but that required the paid Crawlera service from Scrapinghub.com. Because this was my first exploration into web scraping, I wanted to focus on the fundamentals of creating a Scrapy spider and then performing EDA on the dataset after cleaning it.
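To give a sense of what that looks like, here is a minimal sketch of a first-page-only Scrapy spider. This is not the code from my repo: the spider name, the city subdomain, and the CSS selectors are assumptions based on Craigslist’s result-page markup around that time, so verify them before running.

```python
import scrapy


class UsedItemsSpider(scrapy.Spider):
    name = "used_items"  # hypothetical spider name
    # "sss" is Craigslist's catch-all for-sale section; the city is illustrative.
    start_urls = ["https://newyork.craigslist.org/search/sss"]

    def parse(self, response):
        # Parse only the first page of results (no pagination follow-up)
        # to keep the request volume low and avoid an IP block.
        for listing in response.css("li.result-row"):
            yield {
                "title": listing.css("a.result-title::text").get(),
                "price": listing.css("span.result-price::text").get(),
                "location": listing.css("span.result-hood::text").get(),
                "url": listing.css("a.result-title::attr(href)").get(),
            }
```

Running it with `scrapy runspider used_items_spider.py -o items.csv` would write the scraped fields to a CSV ready for pandas.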

How I Visualized The Data

Using pandas, matplotlib, and seaborn

Cleaning the data with pandas and then creating visualizations with matplotlib and seaborn proved sufficient to answer the questions I set out with. I created subsets of the items to separate out the high-priced motor vehicles, and grouped the dataset by popular locations to gain insight into the price distribution by location.
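As a rough sketch of those steps, the snippet below assumes the spider’s output was saved as items.csv with title, price, and location columns; the column names and the vehicle keyword list are my own illustrative choices, not the exact ones from the project.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("items.csv")

# Strip the "$" and thousands separators from prices and coerce to numbers.
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
df = df.dropna(subset=["price", "location"])

# Separate out the high-priced motor vehicles so they don't skew the rest.
vehicle_keywords = "car|truck|suv|jeep|honda|toyota|ford"  # illustrative list
is_vehicle = df["title"].str.contains(vehicle_keywords, case=False, na=False)
vehicles, non_vehicles = df[is_vehicle], df[~is_vehicle]

# Price distribution across the ten most popular locations.
top_locations = non_vehicles["location"].value_counts().head(10).index
subset = non_vehicles[non_vehicles["location"].isin(top_locations)]
sns.boxplot(data=subset, x="location", y="price")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```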


Using wordcloud

I decided to add a simple word cloud visualization as well to get one last high-level overview of what types of items were available.
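A word cloud like that takes only a few lines with the wordcloud package. This sketch reuses the cleaned DataFrame from the snippet above; the title column name is, again, an assumption.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join every listing title into one string for the word cloud to tokenize.
text = " ".join(df["title"].dropna())
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```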

Results

Insights Gleaned

Though this dataset was too small a sample to draw confident conclusions about the price distribution by location, I was able to determine that $1,000 would be enough to purchase the majority of the available items that weren’t motor vehicles.

This was my first introduction to data wrangling with Python, and I enjoyed the learning process. You can view my GitHub repo here: https://github.com/Kiwibp/NYC-DSA-Bootcamp--Web-Scraping.

About Author

Keenan Burke-Pitts

Keenan has over 3 years of experience communicating and assisting with software and internet solutions for clients. As an account manager, he was responsible for tracking KPIs, managing projects, and identifying solutions for senior-level stakeholders. Moving forward, Keenan...
