Data Scraping

Posted on May 14, 2017
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


Data show the pet industry has grown three-fold since 1996. Even through the Great Recession, it grew more than 10%. People will cut back on eating out and vacations, but they'll continue to provide for their pets. And more young men and women are "picking pets over people," according to an article in the Washington Post. With the lifetime cost of a child at $233,000, compared to $32,850 for a dog, millennials are delaying parenthood and spending the extra money on their pets.

More and more customers are shopping at online retail stores rather than their brick-and-mortar counterparts. Chewy is an online pet food and supplies company headquartered in Dania Beach, Florida. Its website has a layout typical of many e-stores, with sections divided by pet and item category.

The Data Scraping

To scrape the website, I created a spider using Scrapy in Python. Using XPath expressions, most of the information was easy to pull, as seen in the image below.
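As a rough illustration of pulling fields with XPath, here is a minimal sketch. It does not use Scrapy itself: it relies on the limited XPath support in Python's standard library so it can run on its own. The markup, class names, and field names are all made up for the example and are not the site's actual structure. In a Scrapy spider, the equivalent lookups would go through `response.xpath(...)` inside the spider's `parse` method.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified product markup (an assumption for illustration;
# the real page structure is not shown in this post).
PRODUCT_HTML = """
<div id="product">
  <h1 class="name">Example Dog Food</h1>
  <span class="brand">ExampleBrand</span>
  <span class="price">$19.99</span>
</div>
"""


def parse_product(markup):
    """Extract a few fields from one product fragment using XPath-style paths."""
    root = ET.fromstring(markup)
    return {
        # Each path selects a descendant element by tag and class attribute.
        "name": root.findtext(".//h1[@class='name']"),
        "brand": root.findtext(".//span[@class='brand']"),
        # Strip the currency symbol so the price can be stored as a number.
        "price": float(root.findtext(".//span[@class='price']").lstrip("$")),
    }


item = parse_product(PRODUCT_HTML)
print(item)
```

Fields with distinguishing attributes like these are the easy case; each one maps to a single path expression.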


However, the information became more difficult to scrape as I moved down the page. What looks like a table in the image below was actually an ordered list with no distinguishing attributes in the HTML code. What's more, the categories in the table, and the order they appeared in, changed depending on the item.

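My solution was to grab the list, zip it with itself to create a list of (category, value) tuples, convert that to a dictionary, and then load it into my data frame. Because each value is keyed by its category name, it no longer matters which categories an item has or what order they appear in.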
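The zip-and-dictionary step might look something like the sketch below. It assumes the scraped list alternates between category labels and their values; the post does not show the exact layout, and the sample entries and the helper name `pairs_to_dict` are made up for illustration.

```python
def pairs_to_dict(flat):
    """Pair up alternating label/value entries and return them as a dict."""
    labels = flat[0::2]   # entries 0, 2, 4, ... are assumed to be labels
    values = flat[1::2]   # entries 1, 3, 5, ... are assumed to be values
    return dict(zip(labels, values))


# Two made-up products whose attribute lists differ in content and order.
scraped = [
    ["Item Number", "12345", "Weight", "30 pounds", "Special Diet", "Grain-Free"],
    ["Food Form", "Dry Food", "Item Number", "67890"],
]
rows = [pairs_to_dict(flat) for flat in scraped]

# The union of all keys gives the columns; missing attributes become None.
columns = sorted({key for row in rows for key in row})
table = [{col: row.get(col) for col in columns} for row in rows]

for row in table:
    print(row)
```

A list of dictionaries like `rows` can also be passed directly to `pandas.DataFrame`, which lines the keys up into columns and fills the gaps with missing values.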

The Data: 

My initial dataset contained 31 features and 14,000 observations.  Numeric columns included:

  • sale price
  • original cost
  • number of reviews
  • average rating (out of 5 stars)
  • product customer recommendation percent

Categorical columns included:

  • product name
  • product description
  • brand
  • item category
  • product-specific features

The Data Analysis:

First, I explored what the data looked like when grouped by pet. Not only do dogs and cats have the largest number of foods and supplies to choose from, their items also tend to be more expensive than those for other pets, as shown in the graph below.

Below is a table showing the average cost per item, sale percentage per item, and customer recommendation percent grouped by pet. Small pets include gerbils, hamsters, guinea pigs, rabbits, ferrets, and more.

Not only are dog and cat items more expensive with smaller sale discounts, their owners are pickier about which items they recommend!
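A grouped summary like the one described above could be computed along these lines. This is only a sketch: the rows are made-up values, not figures from the scraped dataset, the column names are hypothetical, and defining the sale percentage as the discount relative to the original cost is an assumption about how that column was calculated.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; not values from the scraped dataset.
items = [
    {"pet": "dog", "original_cost": 40.0, "sale_price": 36.0, "recommend_pct": 90},
    {"pet": "dog", "original_cost": 20.0, "sale_price": 19.0, "recommend_pct": 80},
    {"pet": "fish", "original_cost": 10.0, "sale_price": 8.0, "recommend_pct": 100},
]


def sale_pct(row):
    """Discount as a percentage of the original cost (assumed definition)."""
    return 100 * (row["original_cost"] - row["sale_price"]) / row["original_cost"]


def summarize_by_pet(rows):
    """Average cost, sale percentage, and recommendation percent per pet."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["pet"]].append(row)
    return {
        pet: {
            "avg_cost": mean(r["sale_price"] for r in group),
            "avg_sale_pct": mean(sale_pct(r) for r in group),
            "avg_recommend_pct": mean(r["recommend_pct"] for r in group),
        }
        for pet, group in groups.items()
    }


summary = summarize_by_pet(items)
for pet, stats in summary.items():
    print(pet, stats)
```

With the data in a pandas data frame, the same summary is typically a `groupby` on the pet column followed by `mean`.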

Average Cost by Category

I then looked at the products grouped by item category, such as food, grooming, and toys. The graph below shows the average cost by category. The associated table shows the average rating by category.

Dog food items make up over 40% of the total data, so I took a closer look at that category. As shown in the first graph, more highly recommended brands are more expensive.

The highest-rated brand, Royal Canin with 4.82 stars, sells food for $67 on average. I then dove deeper into the data to see why some brands may be higher-rated and more expensive. I noticed that a lot of pricey items were for a special diet. Selecting for this in the data, I retrieved the following information:

Special diet items not only have higher ratings on average, they also cost about $10 more per item! These findings are consistent with the pet food industry's trend toward more natural diet options.


1. Pet owners care about quality:

  • Prefer natural foods
  • Prefer recommended brands
  • Willing to pay significantly more

2. Chewy should:

  • feature nutrition more significantly on their site
  • look into other areas for high quality products for millennial pet owners (travel, grooming)

Chewy already has all it needs. The website is user-friendly, with a clean design, nutritional information, and even videos for the different products. However, this information is only visible once you scroll down the page. Rearranging the layout of the item page so that it appears higher up could improve sales.

Future Work:

Further analysis can bring insight into customer buying habits. I would be interested in taking a look into:

  • product sales by the numbers
  • trends over time of products, brands, and individual customers
  • textual analysis of reviews
  • other online pet stores


About Author

Sean Kickham

Sean migrated from the Midwest to New York City after graduating with a BS in Mathematics from the University of Notre Dame. He taught middle school math for five years in city schools. Equipped with a Masters in...

Comments

Sean Kickham June 20, 2017
Thanks for the compliment! You ask a very good ethical question. Since data science is still a relatively new field, I am not sure if these topics are being discussed enough. What I have found is that, in general, websites wish for you to use the scraped data in an academic way that does not harm the privacy of their users. I used Python's Scrapy library in my scraping, which comes with pre-written code in the settings file that looks like this:

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

This code communicates with the website you are trying to scrape and figures out what is pre-approved for scraping. Search engines use robots.txt as well when scraping the internet to generate relevant results. Hope this helps! -Sean
saurabh pundir June 17, 2017
Hi Sean, I just happened to land on your project page while googling about web scraping. I want to tell you that you have done very good work. I am a student myself and liked how you approached it. I have a question about scraping. I always get scared that scraping might get me into some trouble with website owners. Could you suggest how to get permission from the website owner? What steps should I follow before I start scraping and while scraping data from the website? Thanks in advance.
