Local Used Items Analysis with Python and Tableau

Keenan Burke-Pitts
Posted on Jun 8, 2018

Purpose

Why I Chose This Project

I’ve always been fascinated by the second-hand market, and exchanging used goods locally is easier now than ever.  When I was finishing up my undergraduate degree a while back, I was inspired by Rachel Botsman’s argument in favor of Collaborative Consumption. I’ve always been partly impressed and partly saddened by how much we consume the goods produced around us.  For better or worse, the internet has connected us to the point where the boundaries and obstacles between online and offline are dwindling.  One benefit is the ability to buy and sell used goods with people we’ve never met before, which has unlocked a remarkable amount of dormant assets.  My hope is that this process will keep improving and that norms of hyperconsumption will recalibrate into a more balanced state. This post is a simple exploration of the used items available near where I live in Asheville, NC.

Questions to Answer

I set out to answer where most used items in my area are located, what the central tendencies of item prices were by location, and how many free items were available by location.  I also wanted to extract a price and a summary from each item’s description, and to classify each item’s category using NLP.

Process

Where and How I Extracted the Data

Letgo.com

I used Scrapy to extract the item information, and this proved to be the easiest scrape of the three sites. Letgo.com is JavaScript-heavy, but fortunately a simple JSON scraper was all it took to get what I needed, since I could identify the JSON requests the site was making in the background.
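To give a sense of what that looked like, here is a minimal sketch of the approach. The endpoint URL, query parameters, and field names below are placeholders; the real ones were found by watching the requests in the browser's network tab while browsing letgo.com.

```python
import json

import scrapy


class LetgoItemsSpider(scrapy.Spider):
    """Sketch of a spider that hits the JSON endpoint Letgo loads in the
    background instead of rendering the JavaScript page itself."""

    name = "letgo_items"

    # Placeholder URL and query string -- substitute the request seen
    # in the network tab for your own location and search radius.
    start_urls = [
        "https://api.letgo.com/products?country_code=US&distance_radius=30&num_results=50"
    ]

    def parse(self, response):
        # The endpoint returns JSON, so no CSS or XPath selectors are needed.
        for product in json.loads(response.text):
            yield {
                "name": product.get("name"),
                "description": product.get("description"),
                "price": product.get("price"),
                "city": product.get("city"),
            }
```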

Craigslist.org

Craigslist proved to be the most difficult to scrape at scale, since the site blocks an IP address that pulls a lot of information in a short period of time.  At first, I tried the simple solution of adding a delay between page requests. That stopped working after several hundred items had been scraped, and I began to suspect that Craigslist was checking not only how fast each IP made requests but also how deep it crawled.  In any case, after some research and advice, I created an account on scrapinghub.com and used Crawlera to keep my IP address from being blocked again.
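The relevant changes lived in settings.py. Here is a rough sketch, assuming the scrapy-crawlera plugin is installed; the API key is a placeholder.

```python
# settings.py -- sketch of the anti-blocking configuration,
# assuming `pip install scrapy-crawlera`.

# Polite crawling alone was not enough, but it still helps.
DOWNLOAD_DELAY = 2            # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True   # back off automatically when responses slow down

# Route requests through Crawlera so blocks hit the proxy pool, not my IP.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-crawlera-api-key>"  # placeholder
```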

Facebook.com

Facebook is obviously a robust website that relies heavily on JavaScript; unfortunately, I wasn’t able to easily isolate the JSON requests that load items in Facebook Marketplace.  After some research and advice, I decided the simplest approach would be to use Selenium. The upside of Selenium is that you can script any interaction a user would perform on a website; the downsides are that it is tedious to script every interaction required to navigate to the Marketplace, and it scrapes considerably more slowly than Scrapy does.
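A stripped-down sketch of that Selenium flow is below: log in, open the Marketplace, scroll to trigger lazy loading, then read the item cards. The element selectors and URLs are placeholders; Facebook's markup changes frequently, so the real ones had to be found by inspecting the page.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.facebook.com/")

# Log in (credentials and selectors are placeholders).
driver.find_element(By.ID, "email").send_keys("my_email@example.com")
driver.find_element(By.ID, "pass").send_keys("my_password")
driver.find_element(By.NAME, "login").click()
time.sleep(5)  # crude wait for the login redirect to finish

# Open the local Marketplace feed and scroll to load more listings.
driver.get("https://www.facebook.com/marketplace/asheville")
for _ in range(10):  # each scroll loads another batch of items
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Grab whatever text the item cards expose (title, price, location).
items = [
    card.text
    for card in driver.find_elements(By.CSS_SELECTOR, "div[data-testid='marketplace_feed_item']")
]

driver.quit()
```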

Storing the Data in MongoDB

The data was stored as JSON objects in MongoDB, which required adjustments to the settings.py and pipelines.py scripts in my Scrapy code. I then imported the JSON objects into pandas dataframes, where the majority of my time was spent cleaning the data.  Even though the same primary columns were extracted for every item, each site had its own idiosyncrasies, so a good deal of time and effort was required to clean the dataframes so they could be merged and produce better results during exploratory data analysis.
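The pipeline followed the standard pattern from the Scrapy documentation; a sketch is below, with placeholder project, database, and setting names.

```python
# pipelines.py -- sketch of the MongoDB item pipeline.
import pymongo


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details out of settings.py (placeholder defaults).
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "used_items"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider keeps each site's idiosyncrasies separate.
        self.db[spider.name].insert_one(dict(item))
        return item


# settings.py -- activate the pipeline (placeholder project name):
# ITEM_PIPELINES = {"used_items.pipelines.MongoDBPipeline": 300}
```

From there, pulling a collection into pandas is a one-liner along the lines of pd.DataFrame(list(db["letgo_items"].find())), and the cleaning work happened in the resulting dataframes.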

How I Visualized The Data

EDA With Tableau

If you'd like to view the entire Tableau workbook and storyline, the link is at the end of this post.

EDA With Pandas & Matplotlib

Results

Insights Gleaned

After a couple of unsuccessful attempts at unsupervised NLP with the spaCy and pyLDAvis libraries (inspired by this walkthrough: https://github.com/skipgram/modern-nlp-in-python) and at building a text summarizer with the Keras library (inspired by this walkthrough: https://github.com/llSourcell/How_to_make_a_text_summarizer), I decided to simplify the process and use the MonkeyLearn API to run a text-summarizer model and a price-extractor model.  I also created a custom category-classification model.
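For reference, the calls themselves are short. This sketch assumes the official monkeylearn Python client; the API key and model IDs are placeholders rather than the actual models I used.

```python
from monkeylearn import MonkeyLearn

ml = MonkeyLearn("<your-api-key>")  # placeholder key

descriptions = [
    "Solid wood coffee table, light scratches on top. $40 OBO, pickup in West Asheville.",
]

# Pre-built extractor models: one for a short summary, one for prices.
summaries = ml.extractors.extract("<summarizer-model-id>", descriptions)
prices = ml.extractors.extract("<price-extractor-model-id>", descriptions)

# Custom classifier trained on my own category labels.
categories = ml.classifiers.classify("<custom-classifier-model-id>", descriptions)

print(summaries.body)
print(prices.body)
print(categories.body)
```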

Improvements to be Made

 

I found this project engaging and challenging.  If I were to scrape items from these sites again, I would also scrape the categories they are listed in; that would make for a more interesting analysis of items by category, and I could use the categories as targets for my NLP classification model.  Some of the descriptions for the Facebook items apparently weren’t scraped, and since I wasn’t able to determine why, I would pay more attention to that in the future. The free version of MonkeyLearn only allows 300 queries per month, so I will work on making my custom category classifier more accurate when my query allowance resets next month.  I will also train it on many more items to see if that makes it more accurate.

You can view the Tableau workbook here: https://public.tableau.com/profile/keenan.burke.pitts#!/vizhome/NYCDSAFinalProject_0/LocalUsedItemsAnalysis and my GitHub repo here: https://github.com/Kiwibp/NYC-DSA-Bootcamp--Final-Project.

