Local Used Items Analysis with Python and Tableau
Why I Chose This Project
I’ve always been fascinated with the second-hand market, and exchanging used goods locally is easier than ever. When I was finishing my undergraduate degree a while back, I was inspired by Rachel Botsman’s argument in favor of Collaborative Consumption. I’ve always been partly impressed and partly saddened by how much we consume the produced goods around us. For better or worse, the internet has connected us to the point that the boundaries between online and offline are dwindling. One benefit is the ability to buy and sell used goods with people we’ve never met, which has unlocked a remarkable amount of dormant assets. My hope is that this process will continue to improve and that norms of hyperconsumption will recalibrate into a more balanced state. This is my simple exploration into used items available near where I live in Asheville, NC.
Questions to Answer
I set out to answer where most used items in my area are located, what the central tendencies of item prices were by location, and how many free items were available by location. I also wanted to extract a price and a summary from each item's description and to classify each item's category using NLP.
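Once the listings are in a dataframe, these questions reduce to simple groupby aggregations. A minimal sketch with made-up rows; the `location` and `price` column names are assumptions for illustration:

```python
import pandas as pd

# Made-up listings; column names are assumptions for illustration.
items = pd.DataFrame({
    "location": ["Asheville", "Asheville", "Weaverville", "Weaverville"],
    "price": [40.0, 0.0, 120.0, 0.0],
})

# Item counts, median price, and free-item counts by location.
summary = items.groupby("location").agg(
    item_count=("price", "size"),
    median_price=("price", "median"),
    free_items=("price", lambda p: (p == 0).sum()),
)
print(summary)
```

The same aggregation table feeds directly into a bar chart, whether in Matplotlib or Tableau.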
Where and How I Extracted the Data
Craigslist proved to be the most difficult site to scrape at scale, since it blocks an IP address that extracts a lot of information in a short period of time. At first I tried a simple solution: adding a delay between each page request. That stopped working after several hundred items were scraped, and I began to get the impression that CL was not only checking IPs for request speed but might also be checking for page depth. In any case, after some research and advice, I created an account on scrapinghub.com and used Crawlera to keep my IP address from getting blocked again.
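Both the polite-delay approach and the Crawlera proxy are configured in Scrapy's settings.py. A minimal sketch, assuming the scrapy-crawlera middleware package is installed; the API key is a placeholder:

```python
# settings.py -- a minimal sketch of the anti-blocking settings.
DOWNLOAD_DELAY = 2               # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay so requests look less robotic
AUTOTHROTTLE_ENABLED = True      # back off automatically under server pressure

# Route requests through Crawlera (Scrapinghub's proxy rotation service).
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-crawlera-api-key>"  # placeholder
```

With the proxy rotating IPs, the per-request delay can usually be lowered again.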
Storing the Data in MongoDB
The data was stored as JSON documents in MongoDB. This required adjusting the settings.py and pipelines.py scripts in my Scrapy code. I then imported the JSON documents into pandas dataframes, where the majority of my time was spent cleaning the data. Even though the same primary columns were extracted for all items, each site had its own idiosyncrasies, so a good deal of effort was required to clean the dataframes before they could be merged and produce better results during exploratory data analysis.
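The cleaning-and-merging step can be sketched like this; the per-site column names, sample rows, and the clean_price helper are all made up for illustration:

```python
import pandas as pd

# Hypothetical exports from each site's scrape; schemas differ per site.
craigslist = pd.DataFrame([
    {"title": "Desk ", "price": "$40", "location": "Asheville"},
    {"title": "Lamp", "price": "Free", "location": "Asheville"},
])
facebook = pd.DataFrame([
    {"name": "Bike", "item_price": "120", "city": "Asheville"},
])

# Normalize each site's idiosyncratic schema to shared column names.
facebook = facebook.rename(
    columns={"name": "title", "item_price": "price", "city": "location"}
)

def clean_price(p):
    """Strip currency symbols and commas; treat 'Free' as 0."""
    p = str(p).strip().lstrip("$").replace(",", "")
    return 0.0 if p.lower() == "free" else float(p)

# Merge the cleaned frames into one table for EDA.
items = pd.concat([craigslist, facebook], ignore_index=True)
items["title"] = items["title"].str.strip()
items["price"] = items["price"].map(clean_price)
```

The real cleaning involved more site-specific quirks than this, but the pattern was the same: rename to a shared schema, normalize values, then concatenate.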
How I Visualized The Data
EDA With Tableau
If you'd like to view the entire Tableau workbook and storyline, the link is at the end of this post.
EDA With Pandas & Matplotlib
I first attempted unsupervised NLP with the spaCy and pyLDAvis libraries, inspired by this walkthrough (https://github.com/skipgram/modern-nlp-in-python), and a text summarizer with the Keras library, inspired by this one (https://github.com/llSourcell/How_to_make_a_text_summarizer). After neither attempt panned out, I decided to simplify the process and use the MonkeyLearn API to run a text summarizer model and a price extractor model. I also created a custom category classification model.
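Calling a MonkeyLearn model amounts to an authenticated POST of the item descriptions. A sketch of building such a request against the v3 classifier endpoint as I understand it from MonkeyLearn's public API; the model ID and API key are placeholders:

```python
import json

API_KEY = "<your-monkeylearn-api-key>"   # placeholder
MODEL_ID = "<your-classifier-model-id>"  # placeholder

def build_classify_request(texts, model_id=MODEL_ID, api_key=API_KEY):
    """Build the URL, headers, and JSON body for a MonkeyLearn v3 classify call."""
    url = f"https://api.monkeylearn.com/v3/classifiers/{model_id}/classify/"
    headers = {
        "Authorization": f"Token {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"data": texts})
    return url, headers, body

url, headers, body = build_classify_request(["Solid oak desk, barely used, $40"])
# An actual call would then be: requests.post(url, headers=headers, data=body)
```

The extractor models (price, summary) follow the same pattern under a different endpoint path.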
Improvements to be Made
I found this project engaging and challenging. If I were to scrape items from these sites again, I would also scrape the categories they are listed in; this would make for a more interesting analysis of items by category, and I could use the categories as targets for my NLP classification model. It appeared that some of the descriptions for the Facebook items weren't scraped; as I wasn't able to determine why, I would pay more attention to that in the future. The free version of MonkeyLearn only allows 300 queries per month, so I will try to make my custom category classifier more accurate when my allowable query amount resets next month. I will also train it with many more items to see if that improves its accuracy.
You can view the Tableau workbook here: https://public.tableau.com/profile/keenan.burke.pitts#!/vizhome/NYCDSAFinalProject_0/LocalUsedItemsAnalysis and my github repo here: https://github.com/Kiwibp/NYC-DSA-Bootcamp--Final-Project.