Data Analysis on Amazon Review Ratings

Posted on Jun 1, 2021
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

LinkedIn   |   GitHub   |   Author Bio


Online customer reviews of goods and services have become commonplace with the continued rise of e-commerce and are a potentially valuable source of information for consumers and businesses alike. It is widely recognized that multiple biases are present in consumer reviews, resulting in polarized and imbalanced data. Prior research has focused on the distribution of online reviews as a whole and on the variations from platform to platform.

This analysis focused on the distribution of customer ratings grouped at the product level, and applied the results to develop a tool that helps companies gain insights into how their products compare to others on the site.


About Online Customer Reviews:

Significant research has focused on the investigation and characterization of online customer reviews. Typically the crowdsourcing of preferences and experiences from a large, diverse group of consumers results in data that exhibit normal distributions[1].

However, previous studies have shown that the majority of online reviews are at the positive end of the scale, with few in the mid-range, and some reviews at the negative end of the scale[2]. Most work in the field attempts to explain why this J-shaped distribution occurs, and whether the behavior is consistent across online platforms[3]. It is often described along two dimensions, polarity and imbalance, and attributed to biases introduced through the self-selection of reviewers.

Imbalance of Reviews

Polarity is the observed overrepresentation of positive and negative reviews and underrepresentation of mid-range reviews. This is often ascribed to polarity self-selection bias, where customers generally take the time to review only products with which they are very satisfied or unsatisfied[2].

Imbalance is observed through the large overweighting of positive reviews. A commonly attributed factor is purchase self-selection. This is the idea that customers who decide to purchase a specific product are more likely to be satisfied with it than the general population[4] and that on average consumers are often satisfied with the products they purchase[5]. Another contributing factor is the hesitancy of individuals to transmit negative information over weak ties and to large audiences[6]. review data is a frequent area of research, with the popularity of the website leading to large sample sizes and widely available data for academic research. This project differentiates itself from previous studies through its focus on extracting business insights rather than attempting to characterize and explain the reviews distribution.


The Data:

Given the numerous academic studies conducted on online customer reviews and the ease of accessing this information, there are many readily available data sources.

This analysis was conducted using a dataset of 82 million customer reviews for 9.8 million unique products between 1996-2014. These were gathered by Julian McAuley of UCSD and used in the paper Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering (He & McAuley, 2016). Using the unique product key and star rating from each review, the data was transformed to yield the average ratings and distributions of ratings when grouped by product.


Data Analysis on Amazon Review Ratings

Figure 1. Example of the transformed data used in this analysis.


Product Distribution Data Analysis:

After grouping the reviews by product, the graphical representation of the average product rating distribution revealed neatly organized data. The data took on the appearance of a normal distribution with a fat lower tail, very different than the J-shaped distribution described in prior analysis without product-level grouping.

Data Analysis on Amazon Review Ratings

Figure 2. Distribution of Average Ratings for Products on


The effect on the distribution of average product ratings from changes in the number of products included in the analysis was investigated. Different cutoff levels of minimum reviews per product were tested. This metric makes sense to use as a sorting feature, since the average rating for a product is highly dependent upon the presence of infrequently observed negative product reviews. Prior research indicates that only 5%-15% of reviews for an online marketplace like are negative[3]. Without enough total reviews present, these negative reviews are unlikely to appear in a representative proportion.

Varying the minimum reviews per product cutoff level from 25 reviews up to 300 reviews resulted in a large range of product counts included in the analysis. The number of included products ranged from about 25,000 (for 300 reviews/product minimum) to 500,000 (for 25 reviews/product minimum). Despite this large range, the average product ratings distribution stayed remarkably consistent. This robust relationship between the specific products included in the analysis and the distribution of average product ratings led to the conclusion that the ratings for a set of products depended on the ratings behavior of consumers, not on how frequently-reviewed a product was.

Data Analysis on Amazon Review Ratings

Figure 3. Distribution Comparison for Different Review Cutoff Levels.

Rating By Product Data

In addition to the average ratings distribution being robust to changes in the minimum review cutoff level, the observed proportions of individual ratings levels per product was found to be relatively stable as well. The consistency of these distributions show that the results of this project are applicable to all reviewed products on, as the average ratings distribution is relatively independent of the number of reviews a product has received.

Figure 4. Star Ratings Distributions by Product.


With this information, it was a straightforward process to create a cumulative frequency table for the average ratings of products on This allowed for the placement of a product with a given average rating into a percentile of all products on, a valuable insight for businesses who sell products on the site.


Sample Size Data Analysis:

The next important question to address was how many reviews an individual product needed in order to have a meaningful average rating. Without a large enough number of reviews, the presence (or lack of presence) of just a few negative reviews could greatly change the average rating observed. In order to analyze this, random sampling was conducted on a 'typical' product to observe the sample outcomes from a population with known characteristics.

Random Sampling

For choosing the product to conduct random sampling upon, the dataset of reviews grouped by product was filtered for products that had an average rating close to the median of 4.27 (out of 5). It was then filtered again for products that had a star ratings distribution close to the median proportions seen in Figure 4. From the remaining products, a single one was selected that best fit these criteria and had a large amount of reviews (in this particular case, 14,114 reviews).

Random sampling was then conducted upon the product, with the reviews per trial number varying from 25 reviews to 300 reviews. Five thousand random sets were pulled from the product's population of reviews for each reviews per trial level that was tested. Each distribution of average trial ratings per reviews level tested was analyzed separately for the variance in trial outcomes.

Figure 5. Distributions for Randomly Sampled Averages at Various Reviews per Trial Levels.


Data Analysis

Upon visual examination of the distribution of average product ratings (Figure 2), it was determined that the maximum allowable range for a confidence interval was +-0.10 stars. This was precise enough to provide useful information on where a product's average rating fell relative to other products on

A 90% confidence interval was chosen as this allows for a high level of certainty that the true population mean lies within the range, without posing an excessively high hurdle. Combining these constraints led to the determination of a 300 reviews per product minimum as the cutoff level for products to be included in this projects' analysis. It should be noted that the results generalize to products with lower ratings counts due to the robustness of the average ratings distribution (explored earlier), but that the confidence interval for products with fewer reviews was excessively wide for the purposes of this project.

Figure 6. Comparison of Variance in 300 Reviews per Product Sampled Trials to Overall Product Averages Distribution.


Recap of Analysis:

It was determined the median product on Amazon has an average rating of about 4.27 stars, where 61% of the reviews were 5-star ratings and 5% of the reviews were 1-star ratings. Importantly, a cumulative frequency table was constructed allowing the placement of a product with an average customer rating into a percentile of products on Another key result of the study was the determination of a 300 reviews per product minimum in order to extract meaningful information from the average customer rating.

Figure 7. Cumulative Frequency Table for the Distribution of Average Product Ratings on



Reviews Analysis Tool:

With the analysis completed, it was time to put the results to use. The final product of this study was a tool to help companies extract information from online customer reviews of their products sold on When the user inputs a set of reviews into the program it performs a comparative analysis of how the reviews set stacks up to other product review sets. A message is outputted to the user describing the results of this analysis.

When using the tool, a user is prompted to enter in the number of reviews for the product of interest at each star rating level. The program then takes this information and determines the percentile of the product's rating in relation to all average product ratings, using the cumulative frequency table generated earlier in the project (Figure 7).

Usage of the Tool

In addition, the random sampling process previously described is implemented using a reviews sample size equivalent to the number of reviews entered by the user. This provides an accurate confidence interval for the percentile estimate generated from the program. The tool was developed using Python, and runs right from the command line interface. Concise, aggregated datasets were generated to aid in the speedy functioning of the tool.

Figure 8. Screenshot of the Amazon Product Review Tool Running in the Command Line Interface.


Significant consideration was given to what output the user should receive to both provide useful information and to highlight the limitations of the result. It can be difficult to provide sufficient context for the appropriate use of a program's results to users who do not understand the underlying model, and the development of this tool was no different. Special attention was paid to prevent user overconfidence in the precision of the result.

In addition to providing the percentile rank of a product with an inputted review distribution, a message was outputted providing the 90% confidence interval of the result as well as steps that could be taken to tighten up the confidence interval in the future. When the program is run on a set of reviews smaller than the 300 reviews per product cutoff level established during this project, a result is still outputted to the user but includes a message relaying that the minimum recommended reviews size for optimal results is 300 reviews.



Future Extensibility:

This tool allows companies to extract meaningful insights from the online customer reviews of their products on despite the biased and skewed nature of online reviews. There are multiple interesting areas of further study for this tool, the first of which is to extend its usefulness beyond Reviews distributions for other product marketplaces, or even services websites such as, could be analyzed using the methods from this project and the functionality of the reviews tool could be extended to those domains.

In addition, using newer, post-COVID data could help investigate whether customer reviewing behavior has changed with the wider demographic and increased prevalence of online shopping during the pandemic. A third and potentially most interesting area of expansion for this tool would be the application of natural language processing techniques to extract frequently cited praise or criticism in positive or negative reviews. This would transform the tool from being a descriptive analysis of a company's product into a prescriptive suggestion of methods for product improvement.




1. Hu, Nan, Jie Zhang, and Paul A. Pavlou (2009), “Overcoming the J-shaped distribution of online reviews,” Communications of the ACM, 52 (10), 144-147.
2. Hu, Nan, Paul A. Pavlou, and Jie Zhang (2017), “On self-selection biases in online product reviews” MIS Quarterly, 41 (2), 449-471.
3. Schoenmuller, Verena, Oded Netzer, and Florian Stahl, “The extreme distribution of online reviews: Prevalence, drivers and implications,” Columbia Business School Research Paper, 2019, (18-10).
4. Dalvi, Nilesh N., Ravi Kumar, and Bo Pang (2013), “Para 'Normal' Activity: On the distribution of average ratings,” ICWSM.
5. Anderson, Eugene W., and Claes Fornell (2000), “Foundations of the American Customer Satisfaction Index,” Total Quality Management, 11 (7), 869-882.
6. Zhang, Yinlong, Lawrence Feick, and Vikas Mittal (2014), “How males and females differ in their likelihood of transmitting negative word of mouth,” Journal of Consumer Research, 40 (6), 1097-1108.

About Author

Ryan Burakowski

Ryan Burakowski is a current NYC Data Science Academy fellow with experience in capital markets and a passion for working on difficult problems. He spent the last three years as a proprietary trader, traveling the world and living...
View all posts by Ryan Burakowski >

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI