Buyer Beware: Amazon’s Questionable Product Ratings
Amazon dominates e-commerce in the US with close to 40% market share. Various surveys suggest anywhere from 50% to 70% of online product searches in the US start on Amazon. One of the main reasons so many shoppers look to the marketplace is its abundance of product reviews. Amazon tends to have as many reviews as all other major US e-commerce marketplaces combined. It seems to provide the best of both worlds: nearly every product imaginable and enough reviews to help consumers decide which products to buy.
Unfortunately, analysis of a large dataset of Amazon reviews suggests they are polarized, biased and not very good at differentiating better from worse products.
Despite widespread suspicion of the accuracy of online reviews, most consumers still trust Amazon product ratings enough to rely on them for purchasing decisions. Skepticism has grown with repeated fake review scandals. And anyone who cares to read reviews can see the predominance of cursory or irrelevant feedback. But consumers clearly still believe aggregate product ratings have some value, perhaps leaning on the law of large numbers that would suggest that scores get more accurate as the number of reviews increases.
This analysis sheds light on the true value of Amazon reviews, which is relevant to anyone who shops online, retailers who sell there and policymakers concerned with consumer protection.
Setting the Stage
The analysis relies on multiple datasets, including two large sets of scraped Amazon reviews from Kaggle, our own review scraping done via APIs like Rainforest and comparative datasets of reviews scraped from sites like Google Shopping and Consumer Reports.
The product reviews literature tends to divide products into those bought for their usefulness and those bought to fulfill an aesthetic or personal taste. We call the former utility products and the latter lifestyle products. This analysis focuses solely on utility products - a common practice in online reviews research - because they are thought to be more objective. Utility products featured in this analysis include: outdoors products, furniture, automotive products, appliances, office products, tools and camera equipment.
We also try to improve accuracy by limiting our dataset to products with over 100 reviews. Like most averages, aggregate product ratings tend to fluctuate when the number of reviews is too low. (We’ll get into how the aggregate scores are calculated later, but for now think of them as arithmetic averages of each individual rating for a product.) These averages tend to stabilize around 100 reviews.
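To make the filtering step concrete, here is a minimal sketch of the aggregation, using hypothetical product IDs and made-up ratings (the real analysis ran over the scraped review datasets described above):

```python
from statistics import mean

# Hypothetical sample: product ID -> list of individual star ratings (1-5).
# Real data would come from a scraped review dataset.
reviews = {
    "B001": [5, 5, 4, 1, 5] * 25,   # 125 reviews
    "B002": [5, 4, 3],              # only 3 reviews, so it will be excluded
}

MIN_REVIEWS = 100  # averages tend to stabilize around this count

# Keep only products with enough reviews, then take the arithmetic average.
ratings = {
    pid: round(mean(stars), 2)
    for pid, stars in reviews.items()
    if len(stars) > MIN_REVIEWS
}
print(ratings)
```

The threshold and product IDs are illustrative; the point is simply that low-volume products are dropped before averaging.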
Reviews are Polarized and Biased
Online reviews tend to have a J-shaped distribution with the most reviews at the high end (5-stars), the second most at the low end (1-star) and the remaining few in the middle (2-4 stars). The distribution doesn’t always take this exact form, but the dominance of positive ratings tends to be ubiquitous. (See, for example, Online Reviews Are Biased. Here’s How to Fix Them by Nadav Klein, Ioana Marinescu, Andrew Chamberlain, and Morgan Smart. Harvard Business Review, March 2018.) Here’s what the distribution looks like across millions of Amazon reviews.
This skew toward high ratings (technically a left or negative skew, since the long tail stretches toward 1-star) tends to be the result of four types of reviewer bias:
- Purchasing Bias: buying a product usually means you already like it.
- Reporting Bias: only a small share of customers with extreme opinions tend to leave reviews.
- Review Inflation: just like the B grade at universities, 5-stars often just means a product is “fine” or acceptable.
- Fake Reviews: sellers pay for 5-stars for their products and 1-star reviews for their competitors’.
(For more on purchasing bias and the J-shape, see “Why Do Online Reviews Have a J-shaped Distribution…” by Hu, Pavlou & Zhang, 2009.)
Reporting bias is evident in our dataset. Over 70% of reviews came from reviewers who only left one review in the lifetime of the dataset. Over 95% of reviews came from infrequent reviewers who left fewer than 7 reviews.
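That calculation amounts to counting reviews per reviewer. A minimal sketch, with made-up reviewer IDs standing in for the real dataset:

```python
from collections import Counter

# Hypothetical reviewer IDs, one entry per review left in the dataset.
review_authors = ["u1", "u2", "u3", "u3", "u4", "u5", "u6", "u6", "u6", "u7"]

counts = Counter(review_authors)          # reviews per reviewer
one_timers = sum(n for n in counts.values() if n == 1)
share = one_timers / len(review_authors)  # share of all reviews left by one-time reviewers
print(f"{share:.0%} of reviews came from one-time reviewers")
```

In the real dataset this share was over 70%, which is what makes reporting bias so visible.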
Do you go straight for the 1-star reviews when considering a product on Amazon? Many do, because they understand the tendency toward review inflation. As you can see below, customers tend to find 1-star reviews most helpful.
Amazon Doesn’t Fix the Biases
On the company page detailing how Amazon “keeps reviews trustworthy and useful”, Amazon says that it “uses machine-learned models, instead of a simple average” that “consider factors such as how recent the rating or review is and verified purchase status” to “establish the authenticity of feedback.” As of 2019, however, when our dataset of cell phone reviews was created, that does not seem to be the case. The chart below shows the distribution of Amazon’s machine-learned star-ratings across thousands of products in the mobile phone category in red. In blue, we created another distribution of simple arithmetic average product ratings. As you can see, nearly all of the chart is magenta because the two distributions overlap. In other words, Amazon’s machine-learned product ratings are almost exactly the same as the simple average ratings - at least as of 2019.
Fixing review biases may not be possible with current technology. But Amazon could easily make it easier for customers to use product ratings to distinguish better from worse products. One simple method would be applying a curve, just like the curves used in school tests. The chart that follows shows the actual distribution of product ratings from the first page of search results for 1,000 different search terms scraped from Amazon late last year. Those are the blue lines, showing product ratings densely clustered between scores of 3.9 and 4.7. This makes it difficult to distinguish one product from another. Would you, for example, choose a product with a 4.7 rating and 5,000 reviews or a product with a 4.5 rating and 10,000 reviews?
The red bars on the chart represent a curved distribution of the same ratings. This is a normal or Gaussian distribution which, in simple terms, spreads scores out around the mean (in this case 3 stars) while maintaining the rank order. Under such a curve, the 4.7 might become a 4.0 and the 4.5 something like a 3.2. Does this make product ratings more or less accurate? No. But it does make it easier for consumers to see differences. The point here is simply that there are many ways to adjust product ratings to help counteract some of the biases.
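One way to implement such a curve is a rank-based transform: sort products by rating, then map each rank to the corresponding quantile of a normal distribution centered at 3 stars. A sketch, with a hypothetical cluster of ratings and an arbitrarily chosen spread of 0.8 stars:

```python
from statistics import NormalDist

# Hypothetical cluster of ratings like those on a search results page.
ratings = [4.7, 4.5, 4.3, 4.1, 3.9]

# Target curve: normal distribution centered at 3 stars.
# The 0.8-star spread is an arbitrary choice for illustration.
curve = NormalDist(mu=3.0, sigma=0.8)
n = len(ratings)

# Map each rating's rank to the matching quantile of the curve,
# which spreads scores out while preserving their order.
curved = {
    r: round(curve.inv_cdf((rank + 0.5) / n), 2)
    for rank, r in enumerate(sorted(ratings))
}
for original in sorted(ratings, reverse=True):
    print(original, "->", curved[original])
```

With these numbers, the tightly clustered 3.9–4.7 ratings fan out across roughly 2.0–4.0 stars, while the best-rated product stays the best rated.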
While modeling to de-bias product ratings was beyond the scope of this analysis, we did find a few types of reviews that are relatively less biased. Studies show that reviewers who are given a product to review (rather than purchasing it) tend to produce a less biased, more Gaussian distribution of review scores. Amazon’s Vine program appoints reviewers with a good track record of objectivity and relevance to review new products. Vine reviews clearly have a more balanced distribution than the J-shape above. The same is true for long reviews of over 2,000 characters, which customers also tend to find more helpful. So Amazon could significantly overweight Vine reviews and long reviews in its average product ratings to remove some degree of bias and polarization.
Why doesn’t Amazon adjust ratings more significantly to make them less biased and more useful to customers? Amazon has not made any statements about this, so we can only speculate, though it seems clear that adjusting ratings is not in the company’s interest. First of all, any reasonable adjustment would lower average ratings and lower ratings tend to mean fewer purchases. In other words, adjusting reviews would lower Amazon’s revenue. Second, even though Amazon is just the platform and not the seller, customers tend to associate product quality with the Amazon brand, so lowering review scores would hurt Amazon’s reputation. Third, if Amazon adjusted down review scores, competitor platforms that didn’t adjust reviews would take market share and sellers. In short, adjusting down inflated review scores is bad business.
Can We Trust Amazon Reviews?
The short answer is probably not. But this is a difficult question to answer with certainty.
A more specific question is if Amazon reviews reliably and accurately differentiate better products from worse products. Obviously, there is no available “correct” average rating for each product. But there are more reliable and objective sources of product reviews. There is also the question of consistency: if reviews are inconsistent across a large set of reviewers, over time or from one e-commerce platform to the next, that would suggest they are not sensible or trustworthy.
The degree of consistency among reviewers is easy enough to tease out of the data. Products where reviewers tend to disagree on product quality would have a much wider distribution of individual ratings and therefore a higher standard deviation (red in the chart below). Would you feel confident in an average rating if it was the result of a significant disparity of opinions from reviewers? What about if all reviewers generally agreed on the rating score?
Histograms of the ratings that comprise the average product rating for low-agreement vs high-agreement products show a clear trend. For low-agreement products, reviewers tend to love it or hate it. For high-agreement products, reviewers all seem to love it.
We are not dealing with absolutes here. There is no single threshold above which reviewer disagreement becomes significant enough to deem a product’s average rating to be untrustworthy. What we can say is that a significant number of products in our dataset have this love/hate distribution which makes their average ratings less trustworthy. This degree of trustworthiness could be easily integrated into the overall product review score. Amazon could just lower a product’s review score in proportion to how much reviewers disagreed about the product’s quality.
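A minimal sketch of that adjustment, using made-up ratings and an arbitrary penalty weight of 0.2 on the standard deviation:

```python
from statistics import mean, stdev

# Two hypothetical products with the same 4.0 average but different agreement.
polarized = [5] * 60 + [1] * 20 + [4] * 20  # love-it-or-hate-it
consensus = [4] * 80 + [5] * 10 + [3] * 10  # most reviewers agree

PENALTY = 0.2  # illustrative weight on disagreement; Amazon could tune this

def adjusted(stars):
    """Lower the average rating in proportion to reviewer disagreement."""
    return round(mean(stars) - PENALTY * stdev(stars), 2)

print(adjusted(polarized), adjusted(consensus))
```

Both products average 4.0 stars, but the polarized product is penalized more heavily, so the high-agreement product ends up ranked above it.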
Consistency can also be viewed over time. To do that I created sequential cohorts of 100 reviews for each product. Imagine if the first 100 reviewers loved a product, then the second 100 hated it, the third 100 loved it and so on. Would you trust the average rating for that product or be highly suspicious?
(Note: I repeated the analysis with cohorts of 200 and 300 and got the same results, so this was not an issue of sample size.)
The hypothetical above describes a highly volatile rating over time. For each product in the dataset I calculated the average volatility: the mean change in average rating from one cohort to the next. The typical product’s average rating changes about 0.2 stars between cohorts. This may seem small, but remember that nearly all Amazon product ratings fit into a slim range of roughly 3.8 to 4.8 stars, so a 0.2-star movement in that rating is the difference between a product coming up in the top spot in search results and it coming up halfway down the page, where shoppers rarely look. Again, this suggests that review scores are not consistent. Amazon could use this degree of inconsistency to adjust aggregate product ratings (lower them).
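The cohort calculation can be sketched as follows, using a made-up chronological stream of ratings whose cohort averages swing from 4.2 to 4.0 to 4.4:

```python
from statistics import mean

def cohort_volatility(stars, cohort_size=100):
    """Average absolute change in the mean rating from one
    sequential cohort of reviews to the next."""
    n_full = len(stars) // cohort_size * cohort_size  # drop a partial last cohort
    cohorts = [stars[i:i + cohort_size] for i in range(0, n_full, cohort_size)]
    means = [mean(c) for c in cohorts]
    return mean(abs(b - a) for a, b in zip(means, means[1:]))

# Hypothetical review stream in chronological order: three cohorts of 100
# whose averages are 4.2, 4.0 and 4.4.
stream = [5] * 60 + [3] * 40 + [5] * 50 + [3] * 50 + [5] * 70 + [3] * 30
print(round(cohort_volatility(stream), 2))
```

Here the rating moves 0.2 stars down and then 0.4 stars up, for an average volatility of 0.3 stars per cohort, higher than the 0.2-star typical figure found in the dataset.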
Amazon Versus Google
If ratings accurately reflect differences in product quality, then we would expect ratings to be consistent, or at least highly correlated from one e-commerce platform to the next. To test that I looked at reviews for the same products on both Amazon and Google Shopping. Google’s shopping search engine aggregates reviews from most major e-commerce marketplaces like Walmart, Target and eBay, but does not include reviews from Amazon. So, essentially, this compares Amazon ratings for a given product to ratings on most other major e-commerce sites.
The chart below shows roughly 200 different products denoted by number along the x-axis. Ratings for each product are logged as blue circles for Amazon and orange Xs for Google Shopping. The data was sorted by Amazon rating from highest to lowest to facilitate visual comparison. As you can see, Google Shopping ratings follow Amazon ratings in the general downward trend over all products, but there is significant variation in ratings from one product to the next.
The Pearson correlation coefficient between Amazon and Google Shopping ratings is 0.42, or what statisticians would call a “moderate” correlation. In other words, Google Shopping ratings tend to track the general pattern of Amazon ratings, but we cannot conclude that they are consistent with Amazon’s ratings product by product.
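For readers who want to reproduce the calculation, the Pearson coefficient can be computed directly. The ratings below are made up for illustration; the 0.42 figure came from the roughly 200 real product pairs:

```python
from math import sqrt

# Hypothetical ratings for the same five products on two platforms.
amazon = [4.7, 4.6, 4.4, 4.2, 4.0]
google = [4.2, 4.6, 3.9, 4.4, 4.0]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(round(pearson(amazon, google), 2))
```

A coefficient near 1 would mean the two platforms rank products almost identically; values in the 0.4 range mean the rankings only loosely agree.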
Amazon Ratings Versus Consumer Reports
Finally, I scraped product ratings from Consumer Reports to compare with ratings for the same products on Amazon. Consumer Reports is a non-profit organization dedicated to consumer safety. They rate products based only on objective measurements so we can consider this comparison a gauge of the objectivity of Amazon product ratings, if nothing else.
In this case there is no need to get into correlation coefficients. Amazon ratings are clearly not consistent with the more objective Consumer Reports ratings. In other words, if you’re interested in objective product quality and trust Consumer Reports, then you probably want to ignore Amazon ratings.
To sum up, Amazon product ratings have a number of problems, largely due to common reviewer biases. Amazon says it works to counteract these biases by using machine-learning to make product ratings more accurate and useful, but there is no evidence of that in the data. Furthermore, it’s clear that Amazon could counteract these biases by curving review scores or overweighting more balanced review types like Vine reviews or long reviews. But it is not in the company’s interest to adjust product ratings because any reasonable adjustment would lower average ratings and hurt revenue and the Amazon brand.
In light of the impact of bias on reviews, I questioned if Amazon reviews could be relied on to distinguish better from worse products. I found that Amazon reviewers often disagree significantly about the quality of a product, which obviously does not inspire a lot of trust in the average product score. Reviewers also seem to disagree over time. So like any average, the average product rating often obscures significant volatility in customers’ opinion of a product over time, which also does not inspire trust.
Finally, I compared Amazon ratings to ratings on Google Shopping and Consumer Reports for the same products. Amazon’s ratings clearly did not correlate with Consumer Reports, an organization known for its objective and trustworthy product ratings. The correlation between Amazon and Google Shopping ratings was better than that of Amazon and Consumer Reports, but still only “moderate”. Perhaps consumers who buy on Amazon are fundamentally different from consumers who buy on the other major US e-commerce sites whose reviews Google aggregates. Or perhaps consumer e-commerce reviews in general are not to be trusted.
Ultimately, this analysis shows that consumers should be wary of online product ratings - not just individual reviews but also average review scores per product. The wise consumer should stick to more objective, expert product reviews from sites like Consumer Reports and Wirecutter.