Why Might Customers Like or Dislike Products? Learning from Amazon Reviews

Posted on Jul 27, 2015

This script constituted my first project for the NYC Data Science Academy bootcamp. I would like to thank Bryan Valentini, Jason Liu, and Vivian Zhang for their help.

Introduction: Too Much Information

When you want to purchase a product from Amazon, one seemingly helpful step you can take toward a more informed decision is to read reviews from previous buyers. Amazon classifies reviews as "positive" or "critical," and lets users rate reviews so that the ones most users found useful can be highlighted. In this way, Amazon reviews can help potential buyers home in on particular product features that have proved consistently outstanding or unsatisfactory.

Reading reviews as an individual buyer can quickly reach a point of diminishing returns, however. Many Amazon-highlighted reviews cover multiple features of and/or issues with products. Unless you're an expert skimmer with stamina to boot, it's easy to get bogged down wading through others' issues and distracted from noticing what issues you really care about. Different buyers are picky in different ways, which translates at worst to cherry-picked criteria for "positive" or "critical" feedback, and even at best means a relatively fair but potentially too-long review. (Just think how often you've given up reading something you didn't have to simply because it was way too long.)

There's actually an easy solution to the problem of too much information clogging up Amazon products' review sections: write a script that (a) goes through all reviews for a given product and (b) outputs a list of the most-mentioned words. Sure enough, using Python I was able to carry out this solution in a natural language processing script.

The following three examples demonstrate the power of what my relatively simple script can do.


Demo #1: This Little Toaster

Screen Shot 2015-07-25 at 10.52.07 PM
Let's figure out why this little toaster got worse reviews than other, similar toasters.

(1) User Input: What Is the Product in Question?

import pandas as pd
import string 
import nltk
import numpy as np
import datetime as dt
import time
import unicodedata
import re
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
%matplotlib inline

print("Please copy and paste the address of the Amazon page of the product:\n")
address= raw_input("Paste: ")
product =re.findall('.com/(.*)/ref', address)[0]
Id= product.split('/')[2]
product = product.split('/')[0]
address="http://www.amazon.com/" + product + '/product-reviews/' + Id + '/ref=cm_cr_pr_btm_link_1?pageNumber=1&sortBy=recent'

This first part of my code produces an input prompt, allowing a user to copy and paste the webpage address for the product in question.

Screen Shot 2015-08-06 at 12.29.06 AM
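The address rewriting in step (1) can be isolated into a small function. The following is a minimal sketch in Python 3 (the post's own code is Python 2); the function name build_reviews_url and the example address are hypothetical, and the URL layout is simply the 2015-era one that the code above assumes.

```python
import re

def build_reviews_url(product_url, page=1):
    """Turn a product-page address into the matching reviews-page address.

    Assumes the layout the post relies on:
    http://www.amazon.com/<product-name>/<segment>/<ID>/ref=...
    """
    match = re.search(r'\.com/([^/]+)/[^/]+/([^/]+)/ref', product_url)
    if match is None:
        raise ValueError("Unrecognized product address: %s" % product_url)
    product, product_id = match.groups()
    return ("http://www.amazon.com/%s/product-reviews/%s"
            "/ref=cm_cr_pr_btm_link_1?pageNumber=%d&sortBy=recent"
            % (product, product_id, page))

# Hypothetical address, used only to show the transformation.
example = "http://www.amazon.com/Some-Toaster/dp/B000000000/ref=sr_1_1"
print(build_reviews_url(example))
```

Raising an error on an unrecognized address, instead of indexing into an empty list, gives the user a clearer message when the pasted text is not a product page.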

(2) Scraping the Product Reviews

Now, let's have BeautifulSoup "scrape" all the reviews of the product. ("BeautifulSoup" is a Python library, not an actual soup.)

my_dict = {}  # page number -> parsed HTML of that review page
page = 1
# NOTE: a_str is a URL suffix whose definition is not shown in this post.

text = requests.get(address).text
text = BeautifulSoup(text, 'html.parser')
my_dict[page] = text
# Find the list item that contains the address of the next page
tmp = text.find_all('li', class_="a-last")
my_str = str(tmp[0])
my_str = re.findall('href="(.*)">Next', my_str)[0]  # extract the address of the next page
address = "http://www.amazon.com" + my_str + a_str
page += 1

while True:
    text = requests.get(address).text
    text = BeautifulSoup(text, 'html.parser')
    my_dict[page] = text
    # Find the list item that contains the address of the next page
    tmp = text.find_all('li', class_="a-last")
    try:
        my_str = str(tmp[0])
    except IndexError:
        break
    try:
        my_str = re.findall('href="(.*)">Next', my_str)[0]
    except IndexError:
        # No "Next" link left: this was the last page
        print "finish downloading"
        break
    address = "http://www.amazon.com" + my_str + a_str
    page += 1
    t = np.random.rand()
    time.sleep(t)  # random pause between requests (assumed; the original use of t is not shown)

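def get_review_data_from_fix_page(review_page, row):
    # Append every review on one page to review_df, starting at the given row
    index = 0
    while True:
        try:
            review_df.loc[row] = get_review_data_from_fix_review(review_page[index])
            index += 1
            row += 1
        except IndexError:
            # Ran past the last review on this page
            break
    return row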

def get_review_data_from_fix_review( Tag ):
    #get the review ID
    review_txt =unicodedata.normalize('NFKD', Tag.prettify()).encode('ascii','ignore')
    review_ID = re.findall('id="(.*)">', review_txt)[0]
    #Reviewer account and profile address
    author_tag = Tag.find_all(class_='a-size-base a-link-normal author')[0]
    author_uni = author_tag.prettify()
    author_txt = unicodedata.normalize('NFKD', author_uni).encode('ascii','ignore')
    author = re.findall(' (.*)\n</a>', author_txt)[0]
    author_address = author_tag.get('href')
    author_address_txt = unicodedata.normalize('NFKD', author_address).encode('ascii','ignore')
    author_address = "http://www.amazon.com"+author_address_txt
    #About the review text itself:
    date_tag = Tag.find_all('span', class_="a-size-base a-color-secondary review-date")[0]
    date_uni = date_tag.get_text()
    date_txt = unicodedata.normalize('NFKD', date_uni).encode('ascii','ignore')
    date_str = re.findall('on (.*)', date_txt)[0]
    date_tmp = time.strptime(date_str, "%B %d, %Y")
    date = dt.datetime(date_tmp[0], date_tmp[1], date_tmp[2])
    verify_tag = Tag.find_all('span', class_="a-size-mini a-color-state a-text-bold")
    if len(verify_tag) > 0:
        verify_uni = verify_tag[0].get_text()
        verify_txt = unicodedata.normalize('NFKD', verify_uni).encode('ascii', 'ignore')
    else:
        verify_txt = None
    verify = (verify_txt == 'Verified Purchase')
    title_tag = Tag.find_all('a', class_="a-size-base a-link-normal review-title a-color-base a-text-bold")[0]
    title_uni= title_tag.get_text()
    title_txt= unicodedata.normalize('NFKD', title_uni).encode('ascii','ignore')
    star_tag = Tag.find_all('i', class_="a-icon-star")[0]
    star_uni = star_tag.get_text()
    star_txt= unicodedata.normalize('NFKD', star_uni).encode('ascii','ignore')
    star = int(star_txt)
    review_tag= Tag.find_all('span', class_="a-size-base review-text")[0]
    review_uni= review_tag.get_text()
    review_txt= unicodedata.normalize('NFKD', review_uni).encode('ascii', 'ignore')
    return [review_ID, author, author_address, date, verify, title_txt, 
            star, review_txt]
review_df = pd.DataFrame(columns=['review_ID', 'author', 'author_adress',
                                  'date', 'verify', 'title', 'star_given',
                                  'review_txt'])
row = 0
for item in my_dict:
    review_page = my_dict[item].find_all('div', class_="a-section review")
    row = get_review_data_from_fix_page(review_page, row)
review_sort = review_df.sort_values(by='star_given')

chars_to_remove = set(string.punctuation)

def rm_target(raw_text, exclude):
    return ''.join(ch for ch in raw_text if ch not in exclude)

def filter_stop_word(word_list):
    return [word for word in word_list if word not in stopwords.words('english')]

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

### This part collects the words that show up, counting each word once per review
bag_of_word = []
len_r = len(review_sort)
for i in range(len_r):
    text = review_sort['review_txt'].iloc[i]  # get a particular review
    text = text.lower()                       # change into lower case
    text = rm_target(text, chars_to_remove)   # remove punctuation
    text = text.split()                       # string to list
    text = list(set(text))  # repeated words in one review count only once
    bag_of_word += text

# remove the stop words (this works on a list, not a string)
filtered_word = filter_stop_word(bag_of_word)

# obtain the branch_freq dict: word -> number of reviews mentioning it
branch_freq_dict = {}
for item in filtered_word:
    if item in branch_freq_dict.keys():
        branch_freq_dict[item] += 1
    else:
        branch_freq_dict[item] = 1

# Obtain the branch_freq_df, so that we can sort by freq
branch_freq_df = pd.DataFrame(data=0, index=branch_freq_dict.keys(), columns=['freq'])
for item in branch_freq_dict.keys():
    branch_freq_df.loc[item, 'freq'] = branch_freq_dict[item]
branch_freq_df = branch_freq_df.sort_values('freq')

#print len(branch_freq_dict.keys())

filtered_stem = map(stem, filtered_word)

### The following helps to see the branches (original words) under each stem
stem_branch_dict = {}
for item in filtered_stem:
    stem_branch_dict[item] = []

for item in set(filtered_word):
    if stem(item) in stem_branch_dict.keys():
        stem_branch_dict[stem(item)].append(item)
branch_num_df = pd.DataFrame(data=0, columns=['num'], index=stem_branch_dict.keys())
for item in stem_branch_dict.keys():
    branch_num_df.loc[item, 'num'] = len(stem_branch_dict[item])
# Obtain stem_freq_dict: stem -> number of reviews mentioning it
stem_freq_dict = {}
for item in filtered_stem:
    if item in stem_freq_dict.keys():
        stem_freq_dict[item] += 1
    else:
        stem_freq_dict[item] = 1
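The stopping rule of the download loop in step (2) comes down to one question: does the page still carry a "Next" link? Here is a minimal sketch of that check in Python 3, using only the standard library on a raw HTML string rather than BeautifulSoup. The function name next_page_path and both sample snippets are hypothetical; the snippets only mimic the "a-last" list item that the code above looks for.

```python
import re

def next_page_path(page_html):
    """Return the href of the 'Next' link, or None on the last page."""
    # Narrow the search to the <li> whose class list includes "a-last".
    item = re.search(r'<li[^>]*class="[^"]*\ba-last\b[^"]*"[^>]*>(.*?)</li>',
                     page_html, re.DOTALL)
    if item is None:
        return None
    link = re.search(r'href="([^"]*)"', item.group(1))
    return link.group(1) if link else None

# Hypothetical markup: a middle page links onward, the last page does not.
sample_middle = ('<ul><li class="a-last">'
                 '<a href="/product-reviews/X?pageNumber=3">Next</a></li></ul>')
sample_last = '<ul><li class="a-disabled a-last">Next</li></ul>'

print(next_page_path(sample_middle))
print(next_page_path(sample_last))
```

Returning None instead of raising IndexError lets the calling loop end with a plain `if` test.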

(3) Uncovering Most-Mentioned Words

Before classifying particular reviews in further detail, let's first inspect what terms get mentioned most among all reviews.

stem_freq_df = pd.DataFrame(data=0, index=stem_freq_dict.keys(), columns=['freq'])
for item in stem_freq_dict.keys():
    stem_freq_df.loc[item, 'freq'] = stem_freq_dict[item]
stem_freq_df = stem_freq_df.sort_values('freq')
print('Enter the minimum occurrence of words to be investigated:')
minimum = input("Enter: ")
frqly_ocr_df = stem_freq_df.loc[stem_freq_df['freq'] > minimum]
if len(frqly_ocr_df) <= 0:
    print "No word occurs that many times."
else:
    plt.rcParams['figure.figsize'] = 12, 6
    frqly_ocr_df.iloc[0:].plot(kind='bar', fontsize=13)

Here's a graphical display of the results for that little toaster:

Screen Shot 2015-08-06 at 12.33.07 AM

Note that I discarded the most common English-language terms, or "stop words," and that I identified words by stemming (e.g., conflating the words "price," "priced," and "pricy" into the single term "price").
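To make the counting rule concrete, here is a minimal, self-contained sketch in Python 3 of the same idea: drop stop words, strip suffixes, and count each stem at most once per review. The tiny stop-word set and the three sample reviews are made up for illustration (the post itself uses NLTK's English stop-word list), and the helper names review_stems and stem_frequencies are hypothetical.

```python
from collections import Counter
import string

# Stand-in stop-word list; the post uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "is", "it", "and", "are", "too", "for", "my", "but"}
SUFFIXES = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']

def stem(word):
    """Strip the first matching suffix, as in the post's stem()."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def review_stems(review_text):
    """Return the set of stems in one review (each stem counted once)."""
    strip_punct = str.maketrans('', '', string.punctuation)
    words = review_text.lower().translate(strip_punct).split()
    return {stem(w) for w in words if w not in STOP_WORDS}

def stem_frequencies(reviews):
    """Count, for each stem, how many reviews mention it."""
    counts = Counter()
    for text in reviews:
        counts.update(review_stems(text))
    return counts

# Made-up reviews, for illustration only.
reviews = ["The slots are too small for my bread.",
           "Toasts evenly, but the slot is small.",
           "Small and cheap. It toasts bread quickly!"]
freq = stem_frequencies(reviews)
print(freq.most_common(3))
```

Counting per review rather than per occurrence keeps one long, repetitive review from dominating the list.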

(4) Classifying Reviews by Rating

The next step is to classify each review for analysis according to the rating it gave that little toaster. Since reviews on Amazon tend to give either four or five stars, I grouped one-to-three star reviews together under the category of "negative opinion."

sub_dict = {}  # star rating -> [number of reviews, {stem: review count}]
for j in range(5):
    k = j + 1
    tmp_df = review_df[review_df['star_given'] == k]
    len_r = len(tmp_df)
    tmp_bag_of_word = []
    for i in range(len_r):
        text = tmp_df['review_txt'].iloc[i]       # get a particular review
        text = text.lower()                       # change into lower case
        text = rm_target(text, chars_to_remove)   # remove punctuation
        text = text.split()                       # string to list
        text = list(set(text))  # repeated words in one review count only once
        tmp_bag_of_word += text
    # remove the stop words (this works on a list, not a string)
    tmp_filtered_word = filter_stop_word(tmp_bag_of_word)
    tmp_filtered_word = map(stem, tmp_filtered_word)
    # obtain the stem frequency dict for this rating
    tmp_stem_freq_dict = {}
    for item in tmp_filtered_word:
        if item in tmp_stem_freq_dict.keys():
            tmp_stem_freq_dict[item] += 1
        else:
            tmp_stem_freq_dict[item] = 1
    sub_dict[k] = [len_r, tmp_stem_freq_dict]

# Four- and five-star reviews: share of reviews mentioning each stem
frq_pcntg_dict = {}
for i in range(4, 6):
    tmp_dict = {}
    for item in sub_dict[i][1]:
        tmp_dict[item] = sub_dict[i][1][item] * 1.0 / sub_dict[i][0]
    tmp_df = pd.DataFrame(data=0.0, index=tmp_dict.keys(), columns=['freq'])
    for item in tmp_dict.keys():
        tmp_df.loc[item, 'freq'] = tmp_dict[item]
    tmp_df = tmp_df.sort_values('freq', ascending=False)
    frq_pcntg_dict[i] = tmp_df

# One- to three-star reviews are pooled into a single group
tmp_keys = sub_dict[1][1].keys() + sub_dict[2][1].keys() + sub_dict[3][1].keys()
tmp_keys = list(set(tmp_keys))
tmp_df = pd.DataFrame(data=0.0, columns=['freq'], index=tmp_keys)
total_len = sub_dict[1][0] + sub_dict[2][0] + sub_dict[3][0]
for item in tmp_keys:
    tmp_df.loc[item, 'freq'] = (sub_dict[1][1].get(item, 0) +
                                sub_dict[2][1].get(item, 0) +
                                sub_dict[3][1].get(item, 0)) * 1.0 / total_len

tmp_df = tmp_df.sort_values('freq', ascending=False)

frq_pcntg_dict[1] = tmp_df
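The per-rating tally above can be expressed more compactly. The following is a minimal Python 3 sketch, not the script's actual implementation: it assumes each review has already been reduced to a star rating and a set of stems, and the names bucket and mention_rates, as well as the sample data, are hypothetical.

```python
from collections import Counter, defaultdict

def bucket(stars):
    """Map a star rating to the three groups used in the post."""
    return 1 if stars <= 3 else stars  # 1 stands for "three stars or fewer"

def mention_rates(rated_stems):
    """rated_stems: iterable of (stars, set_of_stems) pairs.

    Returns {group: {stem: fraction of that group's reviews mentioning it}}.
    """
    n_reviews = Counter()
    counts = defaultdict(Counter)
    for stars, stems in rated_stems:
        group = bucket(stars)
        n_reviews[group] += 1
        counts[group].update(stems)
    return {g: {s: c / n_reviews[g] for s, c in counts[g].items()}
            for g in counts}

# Made-up data, for illustration only.
data = [(1, {"small", "slot"}), (3, {"small"}), (2, {"burn"}),
        (5, {"price", "small"}), (5, {"price"}), (4, {"nice"})]
rates = mention_rates(data)
print(rates[1])
```

Dividing by the number of reviews in each group is what makes the groups comparable, since the four- and five-star groups are usually much larger than the low-rated one.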

(5) Additional User Input: How Many Words to Inspect?

This next part of my code will allow a user to select the number of words to be inspected.

print "How many of the most frequently occurring words do you want to investigate?"
minimum = input('Enter a number: ')
limit = min(len(frq_pcntg_dict[1]), len(frq_pcntg_dict[4]), len(frq_pcntg_dict[5]))
if limit >= minimum:
    print '|---------------------------------------|'
    print '|%12s | %10s | %10s |' % ('less than 3', '4 stars', '5 stars')
    print '|---------------------------------------|'
    for k in range(0, minimum):
        print '|%12s | %10s | %10s |' % (frq_pcntg_dict[1].index[k],
                                         frq_pcntg_dict[4].index[k],
                                         frq_pcntg_dict[5].index[k])
    print '|---------------------------------------|'
else:
    print "The number entered is larger than the number of words in the list."

The result is produced as the following “word bag”:

Screen Shot 2015-08-06 at 12.35.19 AM

(6) Gaining Insight: Contextualizing Words in Key Sentences

If users would like to more deeply investigate some of the listed words, we can employ the following script:

def matching(sentence, stem_set):
    # Return the sentence if it contains any word in stem_set, otherwise None
    word_bag = sentence.split()
    if len(set(word_bag) & stem_set) == 0:
        return None
    return sentence

def separate(paragraph):
    # Split a paragraph into sentences at whitespace that follows '.' or '?'
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
    return sentences

def addlst(x, y):
    return x + y

print('The rating level to be investigated:\nEnter 5 for 5 stars\nEnter 4 for 4 stars\nEnter 3 for 3 stars or fewer\n')
star = input("Enter:")
if 4 <= star <= 5:
    select_df = review_df[review_df['star_given'] == star]
elif star == 3:
    select_df = review_df[review_df['star_given'] <= star]
else:
    print "Invalid input, rating %d doesn't exist" % star
txt_lst = list(select_df['review_txt'])
sentence_lst = map(separate, txt_lst)
word = raw_input('\nThe word you want to investigate? ')
sentence_lst = reduce(addlst, sentence_lst)
stem_set = set(stem_branch_dict[word])  # every original word sharing this stem
result_lst = [matching(sentence, stem_set) for sentence in sentence_lst]
print "\n"
i = 1
for term in result_lst:
    if term is not None:
        print str(i) + ". " + term + '\n'
        i += 1

The above code lets users specify which word stem to investigate and in which rating class (five stars, four stars, or three or fewer stars). Let's look at the word "small" in reviews with three or fewer stars:

Screen Shot 2015-08-06 at 11.55.11 AM

As we see from the sentences containing "small," it seems the toaster is not big enough to hold even regular bread bought from the supermarket! The sentences pictured in red squares above mention this problem explicitly. Readers are invited to check how higher-rated reviews discuss this potential pitfall: even some of them cite the relatively small slot size. Counterintuitively, some buyers actually appreciate this.
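The sentence lookup from step (6) can be sketched in a few lines. This is a minimal Python 3 sketch, not the script itself: it uses a simpler sentence splitter than the regular expression above, and, unlike matching(), it lowercases words and strips punctuation before comparing, which is an assumption on my part about what a reader would want. The names split_sentences and sentences_mentioning and the sample reviews are hypothetical.

```python
import re
import string

def split_sentences(paragraph):
    """Split on whitespace that follows '.', '?' or '!' (a simple heuristic)."""
    return [s for s in re.split(r'(?<=[.?!])\s+', paragraph.strip()) if s]

def sentences_mentioning(reviews, variants):
    """Return every sentence containing at least one of the given words."""
    variants = set(variants)
    strip_punct = str.maketrans('', '', string.punctuation)
    hits = []
    for review in reviews:
        for sentence in split_sentences(review):
            words = set(sentence.lower().translate(strip_punct).split())
            if words & variants:
                hits.append(sentence)
    return hits

# Made-up reviews, for illustration only.
reviews = ["Looks great. The slots are too small for regular bread!",
           "Small footprint, which I like. Toasts evenly."]
hits = sentences_mentioning(reviews, {"small", "smaller"})
for number, sentence in enumerate(hits, start=1):
    print("%d. %s" % (number, sentence))
```

Passing all the word variants that share a stem (here "small" and "smaller") plays the role of stem_branch_dict[word] in the script.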

Let's try another example.


Demo #2: This Little Face Cleaning Machine


Here’s a face cleaning machine called “Clarisonic Mia 2 Facial Sonic Cleansing System.” Because I have no idea at all what this product is (other than that my girlfriend is interested in buying it), I need to rely heavily on reviews to even understand the product.

Here’s the “word bag” result after I type in the number of words I want to inspect:


Screen Shot 2015-08-06 at 1.55.00 PM

What message this list carries isn’t yet clear. Let’s therefore collect the sentences that include these words and see if they tell us anything.

Screen Shot 2015-08-06 at 2.02.04 PM

That settles it! From here I simply left and went to the official Clarisonic webpage.

My script can perform more complicated analysis, as well, as my next two examples will show.


Demo #3: This Little Digital Camera


Here’s the word bag for the top 15 words:

Screen Shot 2015-08-06 at 2.41.38 PM

What’s that word “manual” doing in so many “less than 3” reviews? We might suspect that buyers complain about the manual because it’s poorly written...which, as it turns out, would be an understatement:

Screen Shot 2015-08-06 at 2.44.36 PM

The actual problem is that the manual is in Arabic! This is very much the kind of problem you might like to know about before you purchase the “International Version” of the Nikon Coolpix. Since digital camera setup and operation have become relatively standardized, however, the Arabic manual might not be such an insurmountable obstacle for non-Arabic-reading users.

In fact, higher raters did not mention the manual quite so much. Was it just not a serious issue for them? Do higher raters for this product tend also to be able to read Arabic? Or something else? Obviously, further study is needed to know why the manual gets brought up where it does. Still, whether an issue gets mentioned more by higher raters or lower ones is a useful metric to keep tabs on.
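One simple way to keep tabs on that metric is to subtract the share of higher-rated reviews mentioning a word from the share of lower-rated reviews mentioning it. The following Python 3 sketch shows the idea; the function name rate_gaps is hypothetical and the rates are made-up numbers, not results from the camera reviews.

```python
def rate_gaps(low_rates, high_rates):
    """Return (stem, low_rate - high_rate) pairs, largest gap first.

    Each argument maps a stem to the fraction of reviews in that group
    mentioning it; a stem missing from a group counts as 0.
    """
    stems = set(low_rates) | set(high_rates)
    gaps = {s: low_rates.get(s, 0.0) - high_rates.get(s, 0.0) for s in stems}
    return sorted(gaps.items(), key=lambda pair: pair[1], reverse=True)

# Made-up mention rates, for illustration only.
low = {"manual": 0.5, "price": 0.25, "nikon": 0.5}
high = {"manual": 0.125, "price": 0.5, "small": 0.25}
ranked = rate_gaps(low, high)
print(ranked[:2])
```

Stems at the top of the list are raised mostly by lower raters, and stems at the bottom mostly by higher raters.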

What do higher raters care about for this particular camera? We see from the word bag that both four- and five-star reviews bring up the words “price,” “nice,” “little,” and “small” a lot. (Obviously this is a different kind of “small” than in the toaster example!) As before, going over the sentences containing these words allows us to discern why people might have brought them up.

There appear to be two possible explanations behind the trends noted above. First, higher raters seem to be happy about the low price. Second, higher raters seem happy about the nice little equipment that is easy to carry everywhere.

An even more interesting contrast worth pointing to is that lower raters mentioned the name of the brand, Nikon, more often than higher raters. Why is that? Let’s take a look at the sentences with “Nikon” in it:

Screen Shot 2015-08-06 at 3.39.18 PM

I didn’t highlight any sentences here, as the message is not straightforward. However, after a little reading over the sentences above (not bad compared to wading through every single review to uncover this information), we see that these lower raters tend to have had a good impression of Nikon, and that their expectations for accordingly “good” customer service were not met.

You might say the complainers got what they paid for. The Nikon Coolpix is a high-end brand’s lower-end product, after all. Maybe certain compensations inevitably follow. Or do they? One way to determine the extent of the tradeoff is by getting a sense of the range of quality for analogous products at similar prices.

In this next demonstration, I compare the Nikon Coolpix to a product in the same price range by another name brand, Canon.


Demo #4: That Little Digital Camera


From this word bag, we can gather some analogous observations: higher raters mentioned “price” and “little” more while lower raters mentioned “Canon.”

Screen Shot 2015-08-06 at 3.53.18 PM

We can see from the word bag that lower raters talk about “disappoint[ment]” and not getting what you “want.” Could the Canon PowerShot actually be a worse buy than the Nikon Coolpix? Let’s look at some sentences to see whether this is really the case.

Screen Shot 2015-08-06 at 3.57.58 PM

Even more than in the case of Nikon, these lower raters tend to have had a good impression of Canon that got tainted by their experiences with the PowerShot.

Ought this kind of consumer disillusionment to worry companies like Nikon and Canon?

I’m no business expert, but this little app could serve as a helpful tool for professional analysts. It would allow analysts to connect the dots between new products and brand loyalty, tell a story about how products can jeopardize faith in overall brand quality, and target particular areas for improvement: actionable insights grounded in very much “real life” data. Replacing those Arabic manuals with English ones seems eminently accomplishable for Nikon, should they care to do so.


Conclusion: Loved or Hated and Why?

I set out to make a user-friendly problem locator. The app would highlight key insight sentences for a particular product, revealing those issues Amazon reviewers were most concerned about. Users would be able to tell what is loved or hated by other users and why, and decide for themselves whether they would love or hate these product features, too.

However, the more dedicated to this project I’ve become, the more I believe that analyzing distributions of issues within and potentially across rating categories can be an effective way of classifying issues and characterizing products, giving all of us a new perspective on why a product is hated or loved.
