Scraping Used Items on Craigslist.org with Scrapy
Why I Chose This Project
As the old adage goes, ‘One man’s trash is another man’s treasure.’ This was my first introduction to Scrapy, and my goal was to scrape some used items from Craigslist.org and perform some basic exploratory data analysis (EDA) on the data.
I expanded this project in much more detail in my final project, which you can check out if you’re interested: https://nycdatascience.com/blog/student-works/capstone/local-used-items-analysis-with-python-and-tableau/.
Questions to Answer
I set out to answer what types of items were available, where they were located, and how many there were in each popular location. I also wanted to find out what the price distribution was for the items in the popular locations.
Where and How I Extracted the Data
This spider scraped just the first page of the used items for sale section. This made the dataset a small sample of the items actually available on Craigslist, but I chose this route because I knew from my research that Craigslist will block an IP address if it notices scraping activity it deems incompatible with its terms of service. I scraped the entire used items for sale section in my final project, but that required using the paid Crawlera service from Scrapinghub.com. Because this was my first exploration into web scraping, I wanted to focus on the fundamentals of creating a Scrapy spider and then performing EDA on the dataset after I cleaned the data.
How I Visualized The Data
Using pandas, matplotlib, and seaborn
Cleaning the data with pandas and then creating visualizations with matplotlib and seaborn proved sufficient to resolve the questions I set out to answer. I created some subsets of the items to extract the high-priced motor vehicles and also grouped the dataset by popular locations to gain insight into what the price distribution was by location.
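The subsetting and grouping described above can be sketched with pandas roughly like this. The toy rows, column names, and the $1,000 vehicle threshold are all assumptions for illustration, not values from my actual dataset:

```python
import pandas as pd

# Toy rows standing in for the cleaned scrape; column names are assumptions
df = pd.DataFrame({
    "title": ["sofa", "honda civic", "desk", "toyota camry", "lamp", "bike"],
    "price": [150.0, 4500.0, 80.0, 6200.0, 25.0, 300.0],
    "location": ["brooklyn", "queens", "brooklyn", "queens", "brooklyn", "bronx"],
})

# Subset out the high-priced motor vehicles (threshold is an assumption)
vehicles = df[df["price"] > 1000]

# Keep only the most popular locations, then summarize price by location
top_locations = df["location"].value_counts().head(2).index
popular = df[df["location"].isin(top_locations)]
price_by_location = popular.groupby("location")["price"].describe()
```

The `describe()` table (count, mean, quartiles per location) is what a seaborn box plot of price by location would visualize.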
I decided to add a simple word cloud visualization as well to get one last high-level overview of what types of items were available.
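A word cloud boils down to word frequencies over the listing titles. The tokenizing step can be done with the standard library alone, as in this sketch (the sample titles and stopword list are made up for illustration); the commented lines show how the `wordcloud` package would render the resulting counts:

```python
import re
from collections import Counter

# Sample listing titles standing in for the scraped data
titles = ["leather sofa", "office desk chair", "sofa bed", "desk lamp"]

# Tokenize titles and count word frequencies (stopword list is an assumption)
stopwords = {"and", "for", "with"}
words = [
    w
    for title in titles
    for w in re.findall(r"[a-z]+", title.lower())
    if w not in stopwords
]
freqs = Counter(words)

# The frequencies can then feed the wordcloud package, e.g.:
# from wordcloud import WordCloud
# cloud = WordCloud(width=800, height=400).generate_from_frequencies(freqs)
# cloud.to_file("titles_cloud.png")
```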
Though this small sample was too limited to draw confident conclusions about the price distribution by location, I was able to determine that $1,000 would be enough to purchase the majority of the available items that weren’t motor vehicles.
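That kind of claim reduces to a one-line check on the non-vehicle price column. The prices below are illustrative stand-ins, not my actual data:

```python
import pandas as pd

# Toy non-vehicle listing prices (values are illustrative only)
prices = pd.Series([25, 80, 150, 300, 450, 700, 950, 1200])

# Fraction of items priced at or under $1,000
share_under_1000 = (prices <= 1000).mean()

# Median price as another quick sanity check
median_price = prices.median()
```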
This was my first introduction to data wrangling with Python, and I enjoyed the learning process. You can view my GitHub repo here: https://github.com/Kiwibp/NYC-DSA-Bootcamp--Web-Scraping.