NYC Data Science Academy| Blog

Data Exploration and Analysis of Nike Shoes

Daniel Ellenbogen
Posted on Mar 25, 2021
The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Github Repo | LinkedIn

Nike is an iconic shoe brand and arguably one of the most recognizable in its market. It is perceived worldwide as a premium, high-quality brand, an image Nike carefully cultivates.

For this project I'm working from the perspective of a potential competitor who would like to understand Nike's offering. Some of the things we will analyze are its product categories, pricing structure, the distribution of its products and reviews, how the products are marketed, what customers are saying, and how different measures relate to customer satisfaction.

Data Gathering

Nike's Website Men's Shoe Section

To acquire the data I used the Selenium package for Python to scrape information directly from the Nike website. The program parsed through the men's and women's shoe sections to gather the following information:

  • Product gender (men or women), title, URL, category, price, and description.
  • Number of reviews.
  • Average rating for the product based on the reviews.
  • Average score for three measures that Nike allows users to input: size, comfort, and durability. 
  • Individual review's title, score, body, and date.
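
As a rough illustration of how the scraped fields listed above could be organized, here is a minimal sketch using Python dataclasses. The class and field names are my own assumptions for illustration and are not taken from the original scraper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Review:
    """One customer review (field names are illustrative assumptions)."""
    title: str
    score: int   # star rating, 1 to 5
    body: str
    date: str    # kept as text here, e.g. "2021-03-01"


@dataclass
class Product:
    """One shoe listing with its aggregate review measures."""
    gender: str                 # "men" or "women"
    title: str
    url: str
    category: str
    price: float
    description: str
    n_reviews: int = 0
    avg_rating: Optional[float] = None
    # Averages of the three sliders; None when a product has no reviews yet.
    size: Optional[float] = None
    comfort: Optional[float] = None
    durability: Optional[float] = None
    reviews: List[Review] = field(default_factory=list)


# Hypothetical example record, not real scraped data.
shoe = Product(
    gender="men",
    title="Example Runner",
    url="https://example.com/example-runner",
    category="Running",
    price=120.0,
    description="Placeholder description.",
)
shoe.reviews.append(Review("Great fit", 5, "Very comfortable.", "2021-03-01"))
shoe.n_reviews = len(shoe.reviews)
```

Keeping reviews nested under their product makes it straightforward to compute both product-level and review-level statistics later.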

Data Results

From this I was able to gather data on 734 out of 739 men's products and 518 out of 523 women's products, which together represent 99.2% of all the shoe products on the website. I also gathered 35,519 reviews.
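
The coverage figure follows directly from the counts above. A quick check, using only the numbers reported in this section:

```python
# Counts reported in the text above.
scraped = {"men": 734, "women": 518}
listed = {"men": 739, "women": 523}

total_scraped = sum(scraped.values())   # 1252
total_listed = sum(listed.values())     # 1262
coverage = total_scraped / total_listed

print(f"{total_scraped} of {total_listed} products ({coverage:.1%})")
# prints: 1252 of 1262 products (99.2%)
```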

The data needed some cleaning and reorganizing, such as grouping redundant or similar categories, which I discuss later. For "size", customers could input a value from 0 to 100, where 0 meant the shoe was too small and 100 meant it was too big. To be able to run a correlation between this feature and others, such as rating, I transformed it into the distance from the midpoint: 0 is a perfect fit (50 on the original slider) and 50 is the worst fit (either 0 or 100 on the original slider). I chose to do this for the sake of simplicity, although it loses the ability to tell whether a poor fit was due to the shoe being too large or too small.
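
The size transformation described above can be sketched as the absolute distance from the slider's midpoint. The function names below are hypothetical, and `fit_direction` is an optional addition of mine that shows how the lost direction could be kept in a separate column if it were needed.

```python
def fit_error(size_score: float) -> float:
    """Distance of a 0-100 size slider value from the perfect-fit midpoint (50)."""
    if not 0 <= size_score <= 100:
        raise ValueError("size_score must be between 0 and 100")
    return abs(size_score - 50)


def fit_direction(size_score: float) -> str:
    """Optional companion that keeps the direction the magnitude discards."""
    if size_score < 50:
        return "too small"
    if size_score > 50:
        return "too big"
    return "perfect"


for raw in (0, 35, 50, 100):
    print(raw, fit_error(raw), fit_direction(raw))
```

With this encoding, a lower value always means a better fit, so its correlation with rating has a single, easy-to-read direction.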

Questions and Data Analysis

Some of the questions we will be asking in this section are:

  1. How are the products and reviews distributed by gender and category?
  2. What's the price distribution of the products for each category and gender?
  3. How are the reviews distributed, and what are the most reviewed categories by gender?
  4. How have the reviews evolved over time, in general and for the top categories?
  5. Are the numerical variables correlated, and what can we infer from this?

After this section we will do some Natural Language Processing analysis, where we will ask some other questions relating to the scraped text.

Regarding our first question, we will first take a look at the distribution by products available on the website.

Products by Gender Data 


Data Findings

We can see that there are slightly more products for men, as they represent 58.6% of all shoes available. 

There were over 125 categories, but many were redundant or similar, so I did some data cleaning to group them by use. After grouping, the top 4 of the 29 resulting categories represent over 70% of Nike's shoe products, which suggests those categories are of great importance to its strategic plan.
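
The grouping step and the "top categories share" calculation can be sketched as follows. The mapping and the labels are made up for illustration; the actual raw category names and grouping rules used in the project are not shown in this post.

```python
from collections import Counter

# Hypothetical mapping from raw site labels to grouped categories.
CATEGORY_MAP = {
    "Men's Shoe": "Lifestyle",
    "Women's Shoe": "Lifestyle",
    "Men's Running Shoe": "Running",
    "Women's Road Running Shoe": "Running",
    "Firm-Ground Soccer Cleat": "Soccer",
    "Basketball Shoe": "Basketball",
    "Slide": "Sandal",
}


def top_share(raw_labels, k):
    """Return the k largest grouped categories and their combined share."""
    grouped = Counter(CATEGORY_MAP.get(label, "Other") for label in raw_labels)
    top = grouped.most_common(k)
    share = sum(count for _, count in top) / sum(grouped.values())
    return top, share


# Made-up labels for illustration only.
labels = (
    ["Men's Shoe"] * 4
    + ["Women's Shoe"] * 2
    + ["Men's Running Shoe"] * 2
    + ["Firm-Ground Soccer Cleat"]
    + ["Slide"]
)
top, share = top_share(labels, k=2)
print(top, f"{share:.0%}")
```

Labels missing from the mapping fall into "Other", which makes it easy to spot raw categories that still need to be assigned to a group.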

Our most frequent category, for both men and women, is what Nike refers to as "Lifestyle Shoes." Those are shoes that are purchased mostly for their aesthetic qualities and not for a particular activity or sport. Basically, they are everyday shoes targeted at individuals with different tastes and styles.

Some other very popular categories include soccer, basketball, and running shoes. That suggests that Nike values the sports-centered markets and so produces a large number of shoes for that use.

Percentage of Reviews by Gender Data

Next we will take a look at the distribution of their categories and gender by number of reviews to see how they match with the number of products offered and also get an idea of what categories could be in high demand.

We can see that men's products account for an even larger proportion of reviews than they do of products. The distribution by category also changes a bit. For men, the top four categories are lifestyle (56.0%), running (12.9%), training (8.2%), and sandal (6.1%), which together represent over 80% of men's reviews. For women, the top four are lifestyle (36.8%), running (21.8%), training (9.2%), and basketball (9.1%), representing over 75% of women's reviews.

We expect the number of reviews to correlate with sales, although other factors are probably involved, such as price and the higher expectations that come with higher-priced products, as well as the consumer's age and general attitude. Even so, review counts give us a better understanding of which categories are larger and can inform us about potential market size. It seems we can conclude that some of the largest categories for Nike are lifestyle, soccer, basketball, running, training, and sandals, since they have such a large concentration of products offered and reviews.

Price Distribution Data

Next we will take a look at the histogram of price distribution for the top categories to get a better understanding of their pricing structure.

From the graphs, we can see that depending on the product there are very different distributions. Some, such as lifestyle and basketball, have a more normal distribution while others have different spikes or a more uniform distribution. For example, for the soccer shoes category we find some lower priced products and then a large spike of premium products at around the US $300 mark. This could indicate catering to a particular market segment that is willing to pay a premium price for specialty shoes.

We can also easily get a glimpse into the median and average price points for these categories. For example, we find that soccer and custom shoes have the highest average prices. We also find little difference in average and median prices between men's and women's shoes. The only two categories with some difference are running shoes, which are priced higher for men, and training shoes, which are priced higher for women.
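
A comparison of average and median prices by category and gender can be sketched with the standard library. The rows below are invented for illustration and are not Nike prices from the dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (category, gender, price) rows for illustration only.
rows = [
    ("Soccer", "men", 60.0),
    ("Soccer", "men", 275.0),
    ("Soccer", "men", 300.0),
    ("Running", "men", 130.0),
    ("Running", "women", 110.0),
    ("Running", "women", 120.0),
]

# Collect prices per (category, gender) pair.
prices = defaultdict(list)
for category, gender, price in rows:
    prices[(category, gender)].append(price)

summary = {
    key: {"mean": mean(values), "median": median(values)}
    for key, values in prices.items()
}

for key, stats in sorted(summary.items()):
    print(key, stats)
```

A median well above the mean, as in the made-up soccer rows, is the kind of pattern a cluster of premium products on top of a few cheap ones would produce.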

Review Scores

Next we will take a look at the review scores in general and for different categories, as well as how they have changed over time.

As we might have suspected, there is a heavy skew towards higher ratings. The share of 4 and 5 star ratings is 90.1%. There could be several reasons why this might be happening.

One would be customer behavior. For example, because this is the company's own website, it may attract the most loyal fans. Another reason could be that, since Nike has full control of the website, it may be publishing only the best reviews. Furthermore, since these are only the products available at the moment, we could be seeing the effect of survivorship bias: the worst performing, and thus lower rated, products are removed more quickly, leaving products with higher average ratings up for longer.
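
The share of high ratings mentioned above is a simple proportion. A minimal sketch, using made-up scores rather than the scraped reviews:

```python
from collections import Counter


def high_rating_share(scores, threshold=4):
    """Fraction of star ratings at or above the threshold."""
    counts = Counter(scores)
    high = sum(n for stars, n in counts.items() if stars >= threshold)
    return high / sum(counts.values())


# Made-up ratings for illustration only.
scores = [5] * 7 + [4] * 2 + [1]
print(f"{high_rating_share(scores):.1%}")
# prints: 90.0%
```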

Average Rating

In any case, this skew in the data will limit our ability to draw conclusions, especially about the lower rated reviews, which form a smaller and less reliable sample.

The previous graph shows the top 15 largest categories and their average rating for both men and women. It seems that Nike customers are particularly satisfied with the categories that also represent their largest product offering. This would make sense from a business perspective, as it would be wise to ensure customers perceive the products in their largest markets highly.

For a competitor, this could also signal opportunities where it would be easier to gain market share. For example, we can see that women are particularly unhappy with the golf shoes and boots available to them. More analysis of those reviews could help pinpoint the issues these customers have, which could help create an effective marketing plan to take market share from Nike in this segment.

Average Rating by Month

The previous graph shows the average rating per month for the categories with most reviews. As we can see, further back in time, some categories are missing. We suspect that is because we are taking a single snapshot of the website today. As older products are retired, their reviews disappear, leaving only newer reviews. 

The retirement of older reviews does skew the data, but we can still derive some interesting insights. First, we see that some categories, such as training or lifestyle shoes, have products that have been present for a long time, which could indicate that those categories have a longer product life cycle. This information could be valuable because, all else being equal, a product with a longer lifetime is more profitable thanks to savings in development costs.

Furthermore, we can see some periods of a noticeable decline in the average score, which can point towards particular incidents in a product or category. By doing some further research into the reviews in those periods, we could pinpoint the causes and draw some valuable lessons to avoid similar pitfalls in our brand in the future. In addition, we could prepare marketing materials that focus on attracting customers who were affected by this issue to gain market share.
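
To make the monthly view concrete, here is a minimal sketch of averaging ratings per category and month and then flagging months with a noticeable decline. The data, the helper name `monthly_drops`, and the 0.5-star threshold are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up (category, review date, score) rows for illustration only.
reviews = [
    ("Running", date(2020, 11, 3), 5),
    ("Running", date(2020, 11, 20), 5),
    ("Running", date(2020, 12, 1), 4),
    ("Running", date(2020, 12, 15), 3),
    ("Lifestyle", date(2020, 12, 9), 5),
]

# Bucket scores by category and calendar month, then average each bucket.
buckets = defaultdict(list)
for category, day, score in reviews:
    buckets[(category, day.year, day.month)].append(score)
monthly = {key: mean(scores) for key, scores in buckets.items()}


def monthly_drops(monthly, category, min_drop=0.5):
    """Months whose average fell by at least min_drop versus the previous month with data."""
    series = sorted((y, m, avg) for (c, y, m), avg in monthly.items() if c == category)
    return [
        (year, month)
        for (_, _, prev), (year, month, avg) in zip(series, series[1:])
        if prev - avg >= min_drop
    ]


print(monthly_drops(monthly, "Running"))
# prints: [(2020, 12)]
```

The flagged months are the ones whose individual reviews would be worth reading to find the cause of the decline.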

Correlations Between the Variables

The following graph showcases the correlation between the numerical features.

One interesting finding is that there is a strong correlation between the number of reviews and the score. This could indicate that strong feelings tend to motivate more reviews: customers may only bother to write a review if they really love their shoes or are disappointed in them. A practical lesson could be to find ways to motivate customers to review your product, which may help achieve higher average ratings, although correlation alone does not show that more reviews cause higher ratings.

Another insight is that, of the three sliders that Nike offers (comfort, size, and durability), the one most strongly correlated with the rating is comfort. This could point to customers placing sizable value on this feature, which could prove valuable when designing and marketing our products.
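
For readers who want to see what sits behind a correlation matrix, here is a minimal sketch of the Pearson correlation coefficient written with the standard library. The per-product values are invented, and `fit_error` is a hypothetical name for the transformed size measure (distance from a perfect fit).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up per-product averages for illustration only.
products = {
    "rating":     [4.8, 4.5, 3.9, 4.7, 3.2],
    "comfort":    [95, 88, 70, 90, 55],
    "fit_error":  [5, 10, 20, 8, 30],
    "durability": [80, 85, 75, 70, 60],
}

for feature in ("comfort", "fit_error", "durability"):
    r = pearson(products[feature], products["rating"])
    print(f"{feature:>10}: {r:+.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, while the sign shows its direction; a negative value for the fit measure would mean that worse fit goes with lower ratings.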

Next we will do some analysis using natural language processing on the description and reviews texts to attempt to get some insights. This will include a closer inspection into the notion of "comfort" to reveal its context and provide a greater understanding of what it entails.

Natural Language Processing

In this section we will use NLP to get information from the reviews and descriptions, and we will take a deeper look into the running shoes category. Some of the questions we will attempt to answer are:

  1. What are the best and worst reviews saying?
  2. Are there any common words used by Nike in the description of running shoes?
  3. What are the best reviews of running shoes saying?
  4. What is the context when "comfort" is mentioned in reviews for running shoes?

To do this analysis, we first lemmatized the text, which is the process of reducing each word to its dictionary base form (its lemma). This groups inflected forms of the same word together and enables more effective analysis. We also removed stop words and punctuation, which add little value to the analysis and can clutter our graphs. For this we used the "nltk" package for Python.
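
The preprocessing steps can be illustrated with a simplified, standard-library stand-in. This is not the nltk pipeline used in the project: the stop-word set and the lemma lookup below are tiny hand-made assumptions that only show the shape of the transformation, where a real run would rely on nltk's stop-word corpus and lemmatizer.

```python
import string

# Tiny illustrative stop-word list; nltk ships a much larger one.
STOP_WORDS = {
    "a", "an", "and", "are", "i", "is", "it", "my",
    "the", "they", "this", "to", "very",
}

# Toy lookup standing in for a real lemmatizer.
LEMMAS = {
    "shoes": "shoe",
    "feet": "foot",
    "loved": "love",
    "fits": "fit",
    "runs": "run",
}


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and map words to a base form."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


print(preprocess("I loved the shoes, they are very comfortable!"))
# prints: ['love', 'shoe', 'comfortable']
```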

Data on Reviews and Description

One thing to note before we start this analysis is that there is a large imbalance between the number of positive reviews (4 and 5 stars) and negative reviews (1 and 2 stars). As mentioned before, this will limit our ability to draw conclusions from the smaller set of negative reviews.

When looking at the bigrams (pairs of adjacent words) from the positive reviews, we see some things we expect, such as positive feelings about the purchase ("loved", "great", etc.). This is not that insightful on its own, but we can also gather some other information.

For example, we get a lot of "comfortable" mentions, which reinforces the idea from our correlation analysis in the previous section that comfort is highly valued by customers. We also note mentions of "son", which suggests that many of the purchasers are parents. We also see several "fit" and "look" statements that highlight the importance of these characteristics. Finally, a lot of people are eager to recommend the product.

When it comes to negative reviews, we see many of the same themes appear as points of dissatisfaction, with words such as "comfort", "fit", "wide foot", "small", "bigger", and "falling apart". A lot of people also comment on how long ago they bought the shoes, probably to add credibility to their reviews by showing they have used them for some time.
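
Counting bigrams can be sketched in a few lines. The token lists below are made up and assumed to be already cleaned (lowercased, stop words removed, lemmatized); the function name is hypothetical.

```python
from collections import Counter


def top_bigrams(token_lists, n=3):
    """Most common adjacent word pairs, counted within each review separately."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Made-up, already-cleaned positive reviews for illustration only.
positive = [
    ["great", "shoe", "true", "size"],
    ["great", "shoe", "son", "love"],
    ["fit", "true", "size"],
]

for pair, count in top_bigrams(positive, n=2):
    print(" ".join(pair), count)
```

Pairs are formed inside each review, so the last word of one review is never paired with the first word of the next. Running the same function separately on positive and negative reviews gives the two lists compared above.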

Since there is a lot to explore with this type of analysis that would be outside of the scope of this project, I wanted to focus on running shoes for the following section to give an example of what can be achieved. This is one of the largest categories for Nike, one in which it enjoys a particularly high reputation.

Word Identification in Descriptions

First, we can take a look at a word cloud of the short and long descriptions (an expanded text that must be clicked to show) of running shoes. There are a lot of mentions of the materials that enhance comfort, such as "cushion", "knit material", "soft foam", "breathable", etc. We can see the large importance that Nike's marketing team puts on emphasizing aspects related to the comfort and usability that the shoe will provide. We also see some mentions of the "intended use", to guide customers on how they can use the product, and of the "minimalistic design", to highlight its look.

From this we can gather how Nike has decided to market this product category, which we can assume is based on their own insights into what resonates with customers and generates sales.

As comfort seems to be an important metric, we will look at a word-cloud graph of reviews that mention this word to get more context into its use.

Data on the Words Used in Reviews

We can see a lot of people are talking about "fit", "feel", "wear", and "true size", and to a lesser degree "look" and "color". This gives us some dimensions to optimize when looking to maximize comfort, such as making sure the fit and feel are good and that the shoes are lightweight and true to size. Aesthetics are a secondary but still important consideration, given the mentions of look and color. Also, a lot of people mention that they use the shoes for walking instead of running, which could help segment this market and tailor an offer to these customers.
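
The word frequencies behind a word cloud like this one can be sketched by keeping only the reviews that mention the keyword and counting the remaining words. The reviews below are invented, and matching on the prefix "comfort" (so that "comfortable" also counts) is my own assumption.

```python
from collections import Counter


def words_in_reviews_mentioning(token_lists, keyword):
    """Count the other words in reviews containing a token that starts with keyword."""
    counts = Counter()
    for tokens in token_lists:
        if any(tok.startswith(keyword) for tok in tokens):
            counts.update(tok for tok in tokens if not tok.startswith(keyword))
    return counts


# Made-up, already-cleaned running shoe reviews for illustration only.
running_reviews = [
    ["shoe", "fit", "comfortable", "true", "size"],
    ["fit", "comfort", "feel"],
    ["look", "great"],
]

counts = words_in_reviews_mentioning(running_reviews, "comfort")
print(counts.most_common(3))
```

The keyword itself is excluded from the counts so that it does not dominate the picture, and the resulting frequencies are what determine word sizes in a word cloud.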

Conclusions and Future Work

We have gained a lot of insights from this project. Some of the most important are:

  1. How the shoe categories are composed and which are the largest by product offering and number of reviews. This can help us focus our efforts on markets that have large potential and size, as well as understand the priorities of our competitor.

  2. What Nike's pricing strategy is for the largest categories and how it segments its products. This gives us an idea of whether we can better compete by offering more attractive pricing or by making sure we are targeting all the important segments. For example, we might not offer enough high-priced soccer shoes for the premium segment, which Nike seems to be targeting heavily given the number of products in the US $300 range.

  3. Which categories and genders have lower ratings. These could be potential opportunities for targeting, since there seems to be less satisfaction with what Nike offers. For example, women's golf shoes and boots seem to have a lower average rating.

  4. There are several periods where we can see a spike in lower ratings. Those could be further investigated to draw lessons for ourselves on what to avoid and to tailor our marketing to attract dissatisfied customers.

  5. Of the three sliders available, comfort seems to be the factor most closely tied to rating. This was reinforced by our NLP analysis of reviews, which showed a large prevalence of terms directly related to comfort.

  6. From the reviews, it appears that a large segment of buyers are parents buying shoes for their sons. This can help us customize our products and communication to target those clients and their needs.

  7. Finally, we found that Nike markets its running shoes by heavily mentioning aspects of their materials and design that are related to comfort, and that a lot of people are using the shoes for walking.

Future of This Project

These lessons could be very beneficial when it comes to increasing our customer satisfaction, sales, and market share in the shoe category. Automating the scraping and analysis would make it possible to produce these reports at regular intervals, so that we stay aware of how to best compete with Nike.

Furthermore, this project could be enhanced by getting more macro information on the shoe sector. Delving deeper into each category should uncover more insights on how to improve each one. Finally, by complementing this analysis with our own information, we can cross-validate our findings and pinpoint strengths and weaknesses.

I hope this information has been useful. I'm very passionate about data science and analytics and would love to connect through LinkedIn to discuss this subject, so feel free to reach out.

About Author

Daniel Ellenbogen

Daniel Ellenbogen is an experienced Data Science professional who has worked in finance and co-founded a health and nutrition start-up. He holds a B.A. in Economics with a minor in Business from the University of Texas at Austin....