Data Exploration and Analysis of Nike Shoes
The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Github Repo | LinkedIn
Nike is an iconic shoe brand that has arguably become one of the most recognizable in its market. It's perceived worldwide as a premium, high-quality brand, an image Nike carefully cultivates.
For this project I'm working from the perspective of a potential competitor who would like to understand Nike's offering. Some of the things we will analyze are its product categories, its pricing structure, the distribution of its product offering, the distribution of reviews, how the products are marketed, what customers are saying, and how different measures relate to customer satisfaction, among others.
To acquire the data I used the Selenium package for Python to scrape information directly from the Nike website. The program parsed through the men's and women's shoe sections to gather the following information:
- Product gender (men or women), title, URL, category, price, and description.
- Number of reviews.
- Average rating for the product based on the reviews.
- Average score for three measures that Nike allows users to input: size, comfort, and durability.
- Individual review's title, score, body, and date.
From this I was able to gather data on 734 out of 739 men's products and 518 out of 523 women's products, which together represent 99.2% of all the shoe products on the website. I also gathered 35,519 reviews.
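The fields listed above could be held in a structure along these lines. This is a minimal sketch and not the scraper's actual code: the class names `Product` and `Review`, the field names, and the `coverage` helper are all assumptions made for illustration. The only figures taken from the text are the product counts used at the end.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Review:
    # One customer review attached to a product (hypothetical field names).
    title: str
    score: int  # star rating, 1 to 5
    body: str
    posted: date


@dataclass
class Product:
    # One shoe listing. Averages default to None on the assumption that
    # some products have no reviews or slider values yet.
    gender: str  # "men" or "women"
    title: str
    url: str
    category: str
    price: float
    description: str
    n_reviews: int = 0
    avg_rating: Optional[float] = None
    avg_size: Optional[float] = None
    avg_comfort: Optional[float] = None
    avg_durability: Optional[float] = None
    reviews: List[Review] = field(default_factory=list)


def coverage(scraped: int, listed: int) -> float:
    """Percentage of listed products that were scraped successfully."""
    return 100.0 * scraped / listed


# Counts reported in the text: 734 of 739 men's and 518 of 523 women's products.
total_coverage = coverage(734 + 518, 739 + 523)
print(round(total_coverage, 1))
```

Keeping reviews nested under their product makes it straightforward to aggregate by category or gender later without a separate join.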
The data needed some cleaning and reorganizing, such as grouping redundant or similar categories, which I will discuss later. For "size", customers could input a value from 0 to 100, where 0 meant the shoe was too small and 100 meant it was too big. To be able to run a correlation between this feature and the others, such as rating, I rescaled it so that 0 is a perfect fit (50 on the original slider) and 50 is a bad fit (either 0 or 100 on the original slider). I chose to do this for the sake of simplicity, although I lost the ability to see whether an incorrect fit was due to the shoe being too large or too small.
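One transformation consistent with those endpoints is the absolute distance from the midpoint of the slider. The sketch below assumes that linear form; the function name `fit_deviation` is made up for illustration, and the original code may have computed the value differently.

```python
def fit_deviation(size_slider: float) -> float:
    """Map the 0-100 size slider to a 0-50 distance from a perfect fit.

    50 on the original slider (perfect fit) becomes 0; both extremes
    (0 = too small, 100 = too big) become 50.
    """
    if not 0 <= size_slider <= 100:
        raise ValueError("size slider value must be between 0 and 100")
    return abs(size_slider - 50)


for original in (0, 35, 50, 100):
    print(original, fit_deviation(original))
```

Because the sign is discarded, a shoe rated 35 (slightly small) and one rated 65 (slightly big) both map to 15, which is exactly the loss of direction mentioned above.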
Questions and Data Analysis
Some of the questions we will be asking in this section are:
- How are the products and reviews distributed by gender and category?
- What's the price distribution of the products for each category and gender?
- How are the reviews distributed, and what are the top reviews categories by gender?
- How have the reviews evolved over time in general and by the top categories?
- Are the numerical variables correlated, and what can we infer from this?
After this section we will do some Natural Language Processing analysis, where we will ask some other questions relating to the scraped text.
Regarding our first question, we will first take a look at the distribution by products available on the website.
Products by Gender Data
We can see that there are slightly more products for men, as they represent 58.6% of all shoes available.
There were over 125 categories, but many were redundant or similar. Consequently, I had to do some data cleaning to group them by use. After grouping, the top 4 of the resulting 29 categories represent over 70% of Nike's shoe products. That suggests those categories are of great importance to Nike's strategic plan.
Our most frequent category, for both men and women, is what Nike refers to as "Lifestyle Shoes." Those are shoes that are purchased mostly for their aesthetic qualities and not for use in a particular activity or sport. Basically, they are everyday shoes targeted to individuals with different tastes and styles.
Some other very popular categories include soccer, basketball, and running shoes. That indicates that Nike does value the sports-centered markets and so produces a large number of shoes for that use.
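Grouping of this kind can be done with a small set of keyword rules. The sketch below is only an illustration: the raw labels, the rules in `GROUP_RULES`, and the function names are hypothetical and are not the mapping actually used in this project.

```python
from collections import Counter

# Hypothetical keyword rules; the first keyword found in the raw label wins.
GROUP_RULES = [
    ("running", "running"),
    ("soccer", "soccer"),
    ("basketball", "basketball"),
    ("training", "training"),
    ("slide", "sandal"),
    ("sandal", "sandal"),
    ("lifestyle", "lifestyle"),
]


def group_category(raw_label: str) -> str:
    """Collapse a raw category label into a broader group by keyword."""
    label = raw_label.lower()
    for keyword, group in GROUP_RULES:
        if keyword in label:
            return group
    return "other"


def top_k_share(groups, k: int) -> float:
    """Fraction of all products that fall in the k most common groups."""
    counts = Counter(groups)
    top = counts.most_common(k)
    return sum(n for _, n in top) / sum(counts.values())


# Made-up raw labels, for illustration only.
raw = ["Men's Road Running Shoes", "Women's Trail Running Shoes",
       "Men's Soccer Cleats", "Women's Slides", "Men's Lifestyle Shoes"]
grouped = [group_category(label) for label in raw]
print(grouped, top_k_share(grouped, 2))
```

Rule order matters here: a label containing two keywords is assigned to whichever rule appears first, so the more specific rules should be listed ahead of the general ones.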
Percentage of Reviews by Gender Data
Next we will take a look at the distribution of their categories and gender by number of reviews to see how they match with the number of products offered and also get an idea of what categories could be in high demand.
We can see an even larger proportion of reviews coming from men's products. The distribution by category also changes a bit. For men, the top four categories are lifestyle (56.0%), running (12.9%), training (8.2%), and sandals (6.1%), which together represent over 80% of all men's reviews. For women, the top four are lifestyle (36.8%), running (21.8%), training (9.2%), and basketball (9.1%), representing over 75% of all women's reviews.
We expect a correlation between the number of reviews and sales. However, other factors are probably involved, such as price and the higher expectations customers have for higher-priced products compared with lower-priced ones. Other factors that may come into play include the consumer’s age and general attitude. Even so, review counts help us understand which categories are larger and can inform us about potential market size. It seems we can conclude that some of the largest categories for Nike are lifestyle, soccer, basketball, running, training, and sandals, since they have such a large concentration of products offered and reviews.
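As a quick check of the arithmetic, the combined share of the top four categories can be added up per gender. The percentages below are the ones quoted in the text and are treated as shares within each gender; the variable and function names are made up for this sketch.

```python
# Top-four review shares (in percent) as quoted in the text.
TOP_FOUR_SHARES = {
    "men": {"lifestyle": 56.0, "running": 12.9, "training": 8.2, "sandal": 6.1},
    "women": {"lifestyle": 36.8, "running": 21.8, "training": 9.2, "basketball": 9.1},
}


def combined_share(shares: dict) -> float:
    """Sum the percentage shares, rounded to one decimal place."""
    return round(sum(shares.values()), 1)


men_total = combined_share(TOP_FOUR_SHARES["men"])
women_total = combined_share(TOP_FOUR_SHARES["women"])
print(men_total, women_total)
```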
Price Distribution Data
Next we will take a look at the histogram of price distribution for the top categories to get a better understanding of their pricing structure.
From the graphs, we can see that depending on the product there are very different distributions. Some, such as lifestyle and basketball, have a more normal distribution while others have different spikes or a more uniform distribution. For example, for the soccer shoes category we find some lower priced products and then a large spike of premium products at around the US $300 mark. This could indicate catering to a particular market segment that is willing to pay a premium price for specialty shoes.
We can also easily get a glimpse into the median and average price points for these categories. For example, we find that soccer and custom shoes have the highest average prices. We also find little difference in average and median prices between men's and women's shoes. The only two categories with some difference are running shoes, which are priced higher for men, and training shoes, which are priced higher for women.
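A per-category, per-gender summary like the one described can be computed with the standard library alone. This is a sketch under assumptions: the `(category, gender, price)` row layout and the sample prices are invented for illustration and are not data from the scrape.

```python
from collections import defaultdict
from statistics import mean, median


def price_summary(rows):
    """Mean and median price for each (category, gender) pair.

    rows: iterable of (category, gender, price) tuples.
    """
    prices = defaultdict(list)
    for category, gender, price in rows:
        prices[(category, gender)].append(price)
    return {
        key: {"mean": mean(values), "median": median(values)}
        for key, values in prices.items()
    }


# Made-up prices, for illustration only.
sample = [
    ("running", "men", 100), ("running", "men", 140), ("running", "men", 180),
    ("running", "women", 120), ("running", "women", 130),
]
summary = price_summary(sample)
print(summary[("running", "men")], summary[("running", "women")])
```

Comparing the mean against the median for each group is a cheap way to spot skew, such as a cluster of premium products pulling the mean upward.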
Next we will take a look at the review scores in general and for different categories, as well as how they have changed over time.
As we might have suspected, there is a heavy skew towards higher ratings. The share of 4 and 5 star ratings is 90.1%. There could be several reasons why this might be happening.
One would be customer behavior. For example, because this is the company's own website, it attracts the most loyal fans. Another reason could be that since Nike has full control of the website, they are only allowing the best reviews. Furthermore, since these are only the products available at the moment, we could be seeing the effect of survivorship bias. That means the worst-performing, and thus lower-rated, products are removed more quickly, leaving products with higher average ratings up for longer.
In any case, this skew in the data will limit our ability to draw conclusions, especially for the lower-rated reviews, which form a smaller and less reliable sample.
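The share of high ratings, and the size of the imbalance it implies, can be measured with a short helper. This is a minimal sketch: the function names and the sample scores are assumptions for illustration, not the project's code or data.

```python
from collections import Counter


def rating_shares(scores):
    """Fraction of reviews at each star level from 1 to 5."""
    counts = Counter(scores)
    total = sum(counts.values())
    return {star: counts.get(star, 0) / total for star in range(1, 6)}


def positive_share(scores, threshold: int = 4) -> float:
    """Fraction of reviews rated at or above the threshold."""
    return sum(1 for score in scores if score >= threshold) / len(scores)


# Made-up scores, for illustration only.
sample_scores = [5, 5, 4, 3, 1]
print(rating_shares(sample_scores), positive_share(sample_scores))
```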
The previous graph shows the top 15 largest categories and their average rating for both men and women. It seems that Nike customers are particularly satisfied with the categories that also represent their largest product offering. This would make sense from a business perspective, as it would be wise to ensure customers perceive the products in their largest markets highly.
For a competitor, this could also signal opportunities where it would be easier to gain market share. For example, we can see that women are particularly unhappy with the golf shoes and boots available to them. More analysis of those reviews could help pinpoint the issues these customers have, which could help create an effective marketing plan to take market share from Nike in this segment.
Average Rating by Month
The previous graph shows the average rating per month for the categories with most reviews. As we can see, further back in time, some categories are missing. We suspect that is because we are taking a single snapshot of the website today. As older products are retired, their reviews disappear, leaving only newer reviews.
The retirement of older reviews does skew the data, but we can still derive some interesting insights. First, we see that some categories, such as training or lifestyle shoes, have products that have been present for a long time, which could indicate that those categories have a longer product life cycle. This information could be valuable because, all else being equal, a product with a longer lifetime is more profitable thanks to savings in development costs.
Furthermore, we can see some periods of a noticeable decline in the average score, which can point towards particular incidents in a product or category. By doing some further research into the reviews in those periods, we could pinpoint the causes and draw some valuable lessons to avoid similar pitfalls in our brand in the future. In addition, we could prepare marketing materials that focus on attracting customers who were affected by this issue to gain market share.
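The monthly averages behind a chart like this can be produced by bucketing reviews by category and month. The sketch below assumes each review is available as a `(category, review_date, score)` tuple; that layout, the function name, and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(reviews):
    """Average score per (category, "YYYY-MM") bucket.

    reviews: iterable of (category, review_date, score) tuples.
    """
    buckets = defaultdict(list)
    for category, review_date, score in reviews:
        buckets[(category, review_date.strftime("%Y-%m"))].append(score)
    return {key: mean(scores) for key, scores in sorted(buckets.items())}


# Made-up reviews, for illustration only.
sample_reviews = [
    ("running", date(2021, 3, 2), 5),
    ("running", date(2021, 3, 20), 4),
    ("running", date(2021, 4, 1), 2),
    ("lifestyle", date(2021, 3, 5), 5),
]
averages = monthly_average(sample_reviews)
print(averages)
```

Months with very few reviews produce noisy averages, so it may be worth also tracking the count per bucket before reading too much into a single dip.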
Correlations in the Data
The following graph showcases the correlation between the numerical features.
One interesting realization is that there is a strong correlation between the number of reviews and the score. This could point to the fact that strong feelings tend to motivate more reviews. Customers only bother to write reviews if they really love their shoes or if they are disappointed in them. A practical lesson from this could be to find ways to motivate customers to review your product to achieve higher average ratings.
Another insight is that of the three sliders that Nike offers (comfort, size, and durability), the one most strongly correlated with the rating is comfort. This could point to customers placing sizable value on this feature, which could prove valuable when designing and marketing our products.
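For reference, a correlation table of this kind can be built from pairwise Pearson coefficients. The sketch below assumes Pearson correlation was the measure used, which the text does not state; the column names and values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns: dict) -> dict:
    """Pairwise Pearson correlations for a dict of named numeric columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a in columns for b in columns
    }


# Made-up per-product values, for illustration only.
data = {
    "rating": [4.8, 4.5, 3.9, 4.2],
    "comfort": [90, 85, 60, 75],
    "fit_deviation": [5, 10, 30, 20],
}
matrix = correlation_matrix(data)
print(round(matrix[("rating", "comfort")], 3))
```

A correlation coefficient only measures linear association, so it supports statements about which features move together with the rating, not about which ones cause it.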
Next we will do some analysis using natural language processing on the description and reviews texts to attempt to get some insights. This will include a closer inspection into the notion of "comfort" to reveal its context and provide a greater understanding of what it entails.
Natural Language Processing
In this section we will use NLP to get information regarding reviews, descriptions, and we will take a deeper look into the running shoes category. Some of the questions we will attempt to answer are:
- What are the best and worst reviews saying?
- Are there any common words used by Nike in the description of running shoes?
- What are the best reviews of running shoes saying?
- What is the context when "comfort" is mentioned in reviews for running shoes?
To do this analysis, we first lemmatized the text body, which is the process of reducing each word to its base dictionary form (its lemma). This groups different forms of the same word together and enables more effective analysis. We also removed stop words and punctuation, which add little value to the analysis and can clutter our graphs. For this we used the "nltk" package for Python.
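The cleaning steps can be sketched as below. To keep the example self-contained, it uses a tiny hand-written stop-word list and leaves lemmatization as a pluggable function that defaults to doing nothing; both are assumptions of this sketch, whereas the project itself relied on nltk for these steps.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one,
# such as the English list that ships with nltk.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "i", "it", "my",
              "to", "of", "for", "in", "on", "this", "that", "they", "very"}


def preprocess(text, lemmatize=lambda word: word, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, lemmatize, and drop stop words."""
    # Keep runs of letters and apostrophes; everything else is a separator.
    raw_tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [token.strip("'") for token in raw_tokens]
    lemmas = (lemmatize(token) for token in tokens if token)
    return [lemma for lemma in lemmas if lemma not in stop_words]


print(preprocess("The shoes are very comfortable, and they fit!"))
```

With nltk installed and its WordNet data downloaded, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument in place of the default.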
Data on Reviews and Description
One thing to note before we start this analysis is that there is a large imbalance between the number of positive reviews (4 and 5 stars) and negative reviews (1 and 2 stars). As mentioned before, this will limit our ability to draw conclusions from the smaller set of negative reviews.
When looking at the bigrams of the positive reviews, we see some things we expect, such as positive feelings about the purchase ("loved", "great", etc.). This is not that insightful, but we can also gather some other information.
For example, we get a lot of "comfortable" mentions, which reinforces the idea we got from our correlation analysis in the previous section that comfort is highly valued by customers. We also note mentions of "son" which indicates that many of the purchasers are parents. Also, we see several "fit" and "look" statements that highlight the importance of these characteristics. Finally, a lot of people are eager to recommend the product.
When it comes to negative reviews, we see many of the same themes appear as points of dissatisfaction. We see terms such as "comfort", "fit", "wide foot", "small", "bigger", and "falling apart". Many people also comment on how long ago they bought the product, probably to add credibility to their reviews by showing they have used it for some time.
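Counting bigrams, that is, pairs of adjacent words, is a small step once the reviews are tokenized. This sketch assumes each review has already been cleaned into a list of tokens; the function name and the sample tokens are made up for illustration.

```python
from collections import Counter


def bigram_counts(token_lists):
    """Count adjacent word pairs across a collection of tokenized reviews."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor within the same review.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up tokenized reviews, for illustration only.
positive = [["great", "shoe", "fit"], ["great", "shoe"], ["son", "love"]]
counts = bigram_counts(positive)
print(counts.most_common(2))
```

Pairing within each review, rather than over one concatenated text, avoids creating spurious bigrams across review boundaries.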
Since there is a lot to explore with this type of analysis that would be outside of the scope of this project, I wanted to focus on running shoes for the following section to give an example of what can be achieved. This is one of the largest categories for Nike, one in which it enjoys a particularly high reputation.
Word Identification in Descriptions
First, we can take a look at a word cloud of the short and long descriptions (an expanded text that must be clicked to show) of running shoes. There are a lot of mentions of the material that enhances its comfort, such as "cushion", "knit material", "soft foam", "breathable", etc. We can see the large importance that Nike's marketing team puts on emphasizing aspects related to the comfort and usability that the shoe will provide. We also see some mentions of the "intended use" to guide the client on how they can use their product and their "minimalistic design" to highlight its look.
From this we can gather how Nike has decided to market this product category, which we can assume is based on their own insights into what resonates with customers and generates sales.
As comfort seems to be an important metric, we will look at a word-cloud graph of reviews that mention this word to get more context into its use.
Data on the Words Used in Reviews
We can see a lot of people are talking about the "fit", "feel", "wear", and "true size", and to a lesser degree the "look" and "color". This gives us some dimensions to optimize when looking to maximize comfort, such as making sure the fit and feel are good, and that the shoes are lightweight and true to size. Aesthetics are secondary but still important, judging by the mentions of look and color. Also, a lot of people mention that they use the shoes for walking instead of running, which could help segment this market and tailor an offer to these customers.
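A word cloud is driven by word frequencies, so the context around "comfort" can be approximated by counting the other words in reviews that mention it. This sketch assumes tokenized reviews and matches any token starting with "comfort" (for example "comfortable"); the function name, the prefix rule, and the sample tokens are assumptions for illustration.

```python
from collections import Counter


def context_counts(token_lists, keyword_prefix: str = "comfort"):
    """Count words that appear in reviews mentioning the keyword."""
    counts = Counter()
    for tokens in token_lists:
        if any(token.startswith(keyword_prefix) for token in tokens):
            # Exclude the keyword itself so it does not dominate the result.
            counts.update(t for t in tokens if not t.startswith(keyword_prefix))
    return counts


# Made-up tokenized reviews, for illustration only.
reviews = [
    ["comfortable", "fit", "true", "size"],
    ["nice", "color"],
    ["comfort", "fit", "walking"],
]
frequencies = context_counts(reviews)
print(frequencies.most_common(3))
```

The resulting frequency table is the kind of input a word-cloud renderer scales its word sizes by.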
Conclusions and Future Work
We have gained a lot of insights from this project. Some of the most important are:
- How the shoe categories are composed and which are the largest by product offering and number of reviews. This can help us focus our efforts on markets that have large potential and size, as well as understand the priorities of our competitor.
- What Nike's pricing strategy is for the largest categories and how they are segmenting their products. This gives us an idea of whether we can better compete by offering more attractive pricing or by making sure we are targeting all the important segments. For example, we might not offer enough high-priced soccer shoes for the premium segment, which Nike seems to be targeting heavily given the number of products in the US $300 range.
- Which categories and genders have lower ratings. That could be a potential opportunity for targeting, since there seems to be less satisfaction with what Nike is offering. For example, women's golf shoes and boots seem to have a lower average rating.
- There are several periods where we can see a drop in average ratings. Those could be further investigated to draw lessons for ourselves on what to avoid and to tailor our marketing to attract dissatisfied customers.
- It seems the biggest factor impacting rating is comfort, out of the three sliders available. This was reinforced by our NLP analysis of reviews that showed a large prevalence of terms directly related to comfort.
- From reviews we see that a large segment of buyers are parents buying shoes for their sons. This can help us customize our products and communication to target those clients and their needs.
- Finally, we found that Nike markets its running shoes by heavily mentioning aspects of its materials and design that are related to comfort and that a lot of people are using the shoes for walking.
Future of This Project
These lessons could be very beneficial when it comes to increasing our customer satisfaction, sales, and market share in the shoe category. Automating the scraping and analysis would allow regular reports on these subjects to be prepared at fixed intervals, keeping us aware of how best to compete with Nike.
Furthermore, this project could be enhanced by getting more macro information on the shoe sector. Delving deeper into each category should uncover more insights on how to improve each one. Finally, by complementing this analysis with our own information, we can cross-validate our findings and pinpoint strengths and weaknesses.
I hope this information has been useful. I'm very passionate about data science and analytics and would love to discuss this subject, so feel free to reach out and connect through LinkedIn.