Data Comparison on Running Shoes
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Motivation
There is a rising interest in running, as the COVID-19 pandemic has forced gym closures and people are running more to stay fit, especially as the weather is getting warmer. This web scraping project seeks to give beginner runners some comparison information as they look to buy a new pair of running shoes from Zappos.com, a third-party shoe retail site that offers 22 brands and close to 1,000 shoes.
Though traditional factors such as price, rating and fit are considered, two additional measures (shoe weight and the age of the brand) are also examined. These two measures are not typically considered by beginners, but they do yield some interesting insights.
Dataset
The data was scraped from the Zappos website. The site was chosen because it has a larger running shoe listing than similar sites such as Finish Line and Overstock.com.
The scraped information includes brand, model, star ratings, price, fit measures, weight and comments. The ratings are on a 5-star scale (5 stars = best). The fit measures consist of 3 categories (True to size, True to width, and Arch support), and are graded by the buyers on a scale of 0-100%. The age of each brand was found separately from other sources on the web. From the 22 brands offered by the site, the 11 most popular brands were selected for analysis.
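The brand selection step described above can be sketched in a few lines of Python. This is only an illustration: the write-up does not say how popularity was measured, so the sketch assumes it means the number of listings per brand. The record layout, the helper names `top_brands` and `keep_brands`, and the sample rows are hypothetical placeholders, not the scraped data.

```python
from collections import Counter


def top_brands(records, n=11):
    """Return the n brands with the most listings, most frequent first.

    Assumption: "popularity" is approximated by listing count.
    """
    counts = Counter(r["brand"] for r in records)
    return [brand for brand, _ in counts.most_common(n)]


def keep_brands(records, brands):
    """Keep only the records whose brand is in the selected set."""
    keep = set(brands)
    return [r for r in records if r["brand"] in keep]


# Made-up placeholder rows for illustration only (not scraped data).
sample = [
    {"brand": "A", "model": "a1"},
    {"brand": "A", "model": "a2"},
    {"brand": "A", "model": "a3"},
    {"brand": "B", "model": "b1"},
    {"brand": "B", "model": "b2"},
    {"brand": "C", "model": "c1"},
]

selected = top_brands(sample, n=2)
filtered = keep_brands(sample, selected)
```

With the real dataset, `n=11` would reproduce the cut from 22 brands to 11 under the listing-count assumption.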
Price range Data
Shoe prices are clustered around $70-$120. Most Nike models sell for under $100. This may be because Nike has devoted most of its R&D and marketing to basketball shoes and only recently started to direct energy to running shoes. In addition, Nike's most advanced and higher-priced selections (such as Vaporfly) are listed only on Nike's own website, not on Zappos.
As the shoes in focus are the most popular ones, it's no surprise that the ratings are clustered around 4 and 5 stars. However, one conclusion that can be drawn at this point is that there is little correlation between price and ratings, so one does not have to spend a lot of money to buy a good pair of running shoes.
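One way to quantify a statement like "little correlation between price and ratings" is the Pearson correlation coefficient. The sketch below is an assumption about how such a check could be done, not the author's actual method. The `pearson` helper and the price and rating values are made-up placeholders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values for illustration only (not scraped data).
prices = [70.0, 85.0, 100.0, 120.0]
ratings = [4.5, 4.0, 4.5, 4.0]

r = pearson(prices, ratings)
```

A value of `r` near 0 would support the "little correlation" reading, while values near 1 or -1 would indicate a strong linear relationship. The same helper could be applied to weight and price.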
Data on Rating and shoe fit
Digging a little deeper into the information content of the ratings, we can look at how the ratings relate to shoe fit. As previously indicated, the fit measures consist of 3 categories (True to size, True to width, and Arch support), graded by the buyers on a scale of 0-100%.
While size and width are mostly consistent with the ratings, in that higher ratings reflect a more comfortable fit, the feedback on arch support is more varied. Arch support is harder to pin down because it relates to shoes that provide special cushioning to "correct" biomechanical and pronation issues. As these can be highly individual medical issues, it is no wonder that the spread of the feedback is wider. This indicates that runners looking for shoes with corrective cushioning need to take more care in choosing.
Shoe weight
The median weight of the shoes is mostly between 10 and 12 oz., which is reasonable for most average runners. There are some heavier shoes for those looking for extra cushioning. The takeaway from the relationship between weight and price is that there is little correlation between the two, so a potential buyer may not have to worry about paying a higher price for shoes with extra cushioning.
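A per-brand median weight, as referred to above, could be computed along these lines. This is a minimal sketch which assumes each record carries a brand and a weight in ounces. The field names, the helper `median_weight_by_brand`, and the sample rows are hypothetical placeholders, not the scraped data.

```python
from collections import defaultdict
from statistics import median


def median_weight_by_brand(records):
    """Group shoe weights by brand and return the median weight per brand."""
    weights = defaultdict(list)
    for r in records:
        weights[r["brand"]].append(r["weight_oz"])
    return {brand: median(vals) for brand, vals in weights.items()}


# Made-up placeholder rows for illustration only (not scraped data).
sample = [
    {"brand": "A", "weight_oz": 10.0},
    {"brand": "A", "weight_oz": 11.0},
    {"brand": "A", "weight_oz": 12.0},
    {"brand": "B", "weight_oz": 9.5},
    {"brand": "B", "weight_oz": 10.5},
]

medians = median_weight_by_brand(sample)
```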
Brand Age Data
Age is used as a proxy to capture the intangible measures of freshness and innovation. This particularly relates to On and Hoka One One, the two newest entrants to the running shoe market. Though the two brands are only about 10 years old, they have become very popular with runners: On for the innovative design of its soles, and Hoka for cushioning that does not add much to the overall weight of the shoes.
However, the two brands typically cost more than other, more established brands. It is interesting to note that though the Nike brand is about 70 years old, it has a strong innovation cycle. As Nike directed its vast resources to the running shoe market in recent years, it introduced the highly touted, and controversial, Vaporfly model, which has generated a great deal of buzz. The net effect of what these brands have done is that they have forced other brands to go back to the drawing board to rethink and innovate to keep things fresh.
Future considerations
Future project considerations include comparing women's running shoes. This is particularly significant as women make up about half of the marathon finishers in the U.S. and more than 50% of all casual runners. Other interesting topics that can be explored include the demographic makeup of the brands and the time series of the brand evolution. Also, a deeper analysis can be performed on a richer dataset obtained by scraping the brand sites directly, which, combined with machine learning algorithms, may yield more insight.