Recommendation System and Spam Review Analysis

Posted on Feb 18, 2017

Contributed by Xu Gao. He is currently attending the NYC Data Science Academy 12-week full-time Data Science Bootcamp, which runs from Jan 9th to March 30th, 2017. This post is based on his web scraping project. - See the IPython notebook at:


I. Introduction

This post introduces a recommendation system and a spam review analysis built on reviews of the Top Charting apps in the Google App Store. The Google App Store is the biggest Android app store; it offers not only a large number of applications but also movies and songs. In this project, I scraped the apps listed on the Top Charting page. Scrapy is the key tool in this project, and I describe the details in Part II.

The recommendation system is based on the "doc2vec" algorithm, developed by Google researchers as an extension of their 2013 word2vec work, which converts text into numeric vectors. I then use cosine similarity to find the matching reviews and apps.

The spam review analysis is based on "Sentiment Analysis and Opinion Mining" by Bing Liu, published in May 2012, in which he describes the different types of spam reviews. Because of the limitations of this project, I select spam reviews manually and analyze the results.


II. Web Scraping on Google App Store


In this web scraping project, each scraped item has the fields listed in the following table:

Top Charting | Category | Company | App Name
Username | Date | Rating | Review

where Top Charting is a categorical variable that takes one of the following values: Top Free in Android Apps, Top Paid in Android Apps, Top Grossing Android Apps, Top Free Games, Top Paid Games, Top Grossing Games.
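As a rough sketch, the item structure above could be written down as a plain Python dataclass. The snake_case field names, the types and the validation step are assumptions made for illustration; the project itself uses Scrapy, where this would more typically be a Scrapy item class, and its actual definition is not shown in this post. All values in the example instance are placeholders.

```python
from dataclasses import dataclass, asdict

# The six "Top Charting" values listed in the post.
TOP_CHARTING_VALUES = (
    "Top Free in Android Apps",
    "Top Paid in Android Apps",
    "Top Grossing Android Apps",
    "Top Free Games",
    "Top Paid Games",
    "Top Grossing Games",
)


@dataclass
class ReviewItem:
    # Field names are hypothetical snake_case versions of the table headers.
    top_charting: str
    category: str
    company: str
    app_name: str
    username: str
    date: str
    rating: int  # assumed to be the star rating
    review: str

    def __post_init__(self):
        # Reject chart names that are not one of the six listed values.
        if self.top_charting not in TOP_CHARTING_VALUES:
            raise ValueError(f"unknown Top Charting value: {self.top_charting!r}")


# Placeholder values only; they do not come from the scraped dataset.
item = ReviewItem(
    top_charting="Top Free Games",
    category="(category)",
    company="(company)",
    app_name="(app name)",
    username="(username)",
    date="(date)",
    rating=5,
    review="(review text)",
)
print(asdict(item))
```

Keeping the categorical values in one tuple makes it easy to catch a mistyped chart name while scraping rather than during the analysis.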

The source page has three layers to scrape. I first use the basic parse function to parse the source page. Then I follow the "See More" button, located in the upper right corner in the screenshots, to reach the next page. Finally, I follow the picture of each app to reach the target pages, which contain the user reviews.

[Screenshots: google app, subpage1, subpage2]

Source Code: Scrapy

Finally I got 8972 reviews from these apps. To show which words matter most, I plot three word clouds for three Top Charting categories: Top Free in Android Apps, Top Paid in Android Apps, and Top Grossing Games.


[Word cloud: Top Free Apps in Android]

[Word cloud: Top Paid Apps in Android]

[Word cloud: Top Grossing Games]


From these three word clouds, we can see that "game", "money", "work" and a few other words stand out. If I were an application developer, I might look to these words for the direction of my next app.


III. Recommendation System

Natural Language Processing is one of the most important topics in machine learning. There are several algorithms for representing text, such as term frequency-inverse document frequency (TF-IDF), LDA and LSI. In this project I use the "doc2vec" algorithm, which was developed by Google researchers as an extension of their 2013 word2vec work and is freely available in the "gensim" library in Python.

This algorithm converts text into a numeric vector. I then use cosine similarity to compare the input vector with every vector in the dataset and find the most similar reviews.

[Figure: cosine similarity]
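To make the matching step concrete, here is a minimal sketch of cosine similarity and of ranking stored vectors against a query vector. It is written in plain Python rather than with the gensim calls used in the project, and the function names and the 3-dimensional toy vectors are made up for illustration; real doc2vec vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return cos(theta) = (u . v) / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=1):
    """Rank the stored vectors by cosine similarity to the query vector.

    `vectors` maps an identifier (for example a review id) to its vector.
    Returns a list of (identifier, similarity), best match first.
    """
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "review-a": [1.0, 0.0, 0.0],
    "review-b": [1.0, 1.0, 0.0],
    "review-c": [0.0, 0.0, 1.0],
}
print(most_similar([2.0, 2.0, 0.0], toy_vectors, topn=2))
```

Because cosine similarity only looks at the angle between vectors, the query [2, 2, 0] matches "review-b" best even though their lengths differ.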

First, a quick test: find the top 3 single words across all reviews that are similar to "game" and "great" but different from "bad".


The result is "app", "job" and "love", each with its similarity score. Not bad! The word "app" is quite similar to "game" in this dataset, and "love" expresses a positive attitude, which is the opposite of "bad".

Then I enter the input "great puzzle game!" After finding the highest similarity, the algorithm returns the following review.

(Review-6784, Similarity:0.8765): The game is very fun, but the amount of little bugs in the game bothers me; tricks wont register or multiplier decides to not go up. I love the game but just wish it would run smooth, especially after its been out this long

App name is: Block! Hexa Puzzle , and the similarity is equal to: 0.7963.

This is quite good. "Block! Hexa Puzzle" is obviously a puzzle game, and the user in this review shows a positive attitude, so the game is at least not bad.

But before continuing, I notice that the similarity is only 0.7963, which makes me doubt the result. The first example looks good, but what about other inputs? So I try another input, "Great RPG Game". This time I get a totally different result.

(Review-5845, Similarity:0.9248): New version loses all stats. -- New version is low quality pixelated graphics. -- New version loses all paid purchases. -- New version is slow & laggy. -- No support, thieves. -- GARBAGE -- Advertisements in a paid application. Go rot in a pit. -- Seriously, you got paid, I don't care if it is $0.01 or $1000.00. -- You got paid, therefore advertisements should be removed. -- I'll take lessons taught here & keep them in mind to decide if I should purchase future titles from publisher & developers involved.

App name is: My Talking Angela , and the similarity is equal to: 0.7891.

"My Talking Angela" is a virtual pet game like "My Talking Tom". It is not a typical role-playing game at all. So there must be some shortcomings or mistakes in this recommendation system.

In my opinion, there are three possible reasons for this bad result. The first is the small dataset. 8972 is not a big number for a review dataset; because of the complexity of natural language, research on reviews usually works with far more, and 30,000 might be the minimum. The second is spam reviews. The recommendation system assumes that all the reviews in the dataset are genuine and informative. However, people sometimes post biased or fake reviews for their own reasons, which makes the review dataset unreliable. The last is the limitation of the algorithm itself. Generally speaking, natural language processing cannot be 100% correct. Since the result depends on the training set and on labels that are set manually, the accuracy of this algorithm is quite uncertain.


IV. Spam Review Analysis

According to the book "Sentiment Analysis and Opinion Mining" by Bing Liu of the University of Illinois at Chicago, spam reviews can be divided into two types: real spam reviews (which I call trash reviews) and fake reviews. Trash reviews are reviews whose comments, for whatever reason, deviate from the user's real attitude, or whose comments show no emotion or evaluation at all. Fake reviews are just what the name says: they are created by someone for profit or other purposes.

The best way to detect fake reviews is from the user side. Duplicates indicate that a review was very likely fabricated. (P.S. In my dataset, I removed the exactly identical reviews.) But the definition of trash reviews is still not clear. So I use NLP to run a sentiment analysis and compare it with the rating, and I fit a logistic regression of the rating on several review-related factors.


From this boxplot, we can see that sentiment and rating are correlated on average. But a large amount of data falls outside the 25% to 75% quantile range.

The logistic regression details are listed below:

Rating > 3: Positive

Rating <= 3: Negative

Factors in this regression:

1. Length of review; 2. Date index; 3. Number of mentions of the brand/company name; 4. Compound sentiment; 5. Negative sentiment; 6. Neutral sentiment; 7. Positive sentiment


Accuracy on training set: 69.9%

Accuracy on test set: 41.8%

From these two facts, we can see that rating and sentiment are not truly correlated in this dataset, although they are supposed to be. As a result, I define a trash review as a review for which the absolute difference between the rating and the adjusted sentiment (rescaled to 1-5) is at least 3.

abs(Adjusted Sentiment-Rating)≥3
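The rule above can be sketched in a few lines of Python. One assumption is needed: the post does not say how the sentiment score is adjusted to the 1-5 scale, so the sketch uses a simple linear rescaling of a compound score in [-1, 1]. The function names and the example pairs are hypothetical.

```python
def adjusted_sentiment(compound):
    """Linearly rescale a compound score from [-1, 1] to the 1-5 rating scale.

    Assumption: the post does not give its adjustment formula; a linear
    map (-1 -> 1, 0 -> 3, +1 -> 5) is used here as a stand-in.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    return 3.0 + 2.0 * compound


def is_trash_review(compound, rating, threshold=3.0):
    """Apply the rule abs(Adjusted Sentiment - Rating) >= 3."""
    return abs(adjusted_sentiment(compound) - rating) >= threshold


# Hypothetical (compound score, star rating) pairs, for illustration only.
examples = [(0.9, 5), (-0.8, 5), (0.75, 1), (0.0, 3)]
for compound, rating in examples:
    print(compound, rating, is_trash_review(compound, rating))
```

With this rescaling, a 5-star rating attached to clearly negative text, or a 1-star rating attached to clearly positive text, is flagged, while reviews whose text and rating roughly agree are not.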

[Pie chart: trash reviews by app category]

Based on each review's app category, I make a pie chart to show which kinds of apps have the most spam reviews. In this graph, we can clearly see that Casino apps, such as card games and other betting apps, have the most trash reviews, followed by Action and Strategy apps.

Another topic is fake reviews. From the user side, we can divide duplicate fake reviews into 4 types:

Type 1: Duplicates from different users on the same app

Type 2: Duplicates from the same user on different apps

Type 3: Duplicates from the same user on the same app

Type 4: Duplicates from different users on different apps
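The four types can be sketched as a small classifier over pairs of reviews with identical text. This is an illustration under stated assumptions, not the code used in the project: the (username, app_name, text) tuple layout, the text normalization and the sample reviews are all made up.

```python
from collections import defaultdict
from itertools import combinations


def duplicate_type(a, b):
    """Return the type (1-4) of a pair of reviews with identical text.

    Each review is a (username, app_name, text) tuple.
    """
    same_user = a[0] == b[0]
    same_app = a[1] == b[1]
    if not same_user and same_app:
        return 1  # different users, same app
    if same_user and not same_app:
        return 2  # same user, different apps
    if same_user and same_app:
        return 3  # same user, same app
    return 4      # different users, different apps


def find_duplicate_pairs(reviews):
    """Group reviews by normalized text and classify every duplicate pair."""
    groups = defaultdict(list)
    for review in reviews:
        # Normalization (lowercase, collapsed whitespace) is an assumption.
        key = " ".join(review[2].lower().split())
        groups[key].append(review)
    pairs = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            pairs.append((a, b, duplicate_type(a, b)))
    return pairs


# Hypothetical reviews, made up for illustration only.
reviews = [
    ("user1", "AppA", "Great game"),
    ("user2", "AppA", "great  game"),
    ("user1", "AppB", "Great game"),
    ("user3", "AppC", "Too many ads"),
]
for a, b, kind in find_duplicate_pairs(reviews):
    print(a[:2], b[:2], "-> Type", kind)
```

Grouping by text first keeps the pairwise comparison limited to reviews that actually share the same wording.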

In this dataset, I search for duplicates and find that they make up approximately 19.77% of the reviews.

[Figure: duplicates]

[Pie chart: duplicate reviews by app category]

This is the pie chart of duplicate reviews by category. We can see that Casino, Action, Strategy and Casual are still the top 4. But this time, Puzzle games get a larger share of duplicate fake reviews.


V. Summary

In this project, I use Scrapy to scrape reviews from the Google App Store, build a recommendation system and do a manual analysis of spam reviews. There are several results we can take from this project. First, across all the top apps people download, "game" is the top word. Second, "doc2vec" is a very useful algorithm for turning text into numeric vectors. Third, we have a way to define and detect trash reviews manually. And the conclusion from all the techniques above is "Casino Apps are not reliable."


What's More:

  1. Bigger datasets
  2. The way I find duplicates in the dataset is not general, because users may change a few words. Would cosine similarity be a good method to address this problem?
  3. User-Interface Recommendation System
  4. With a big dataset, it would be possible to build a spam review classifier based on NLP.
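As a starting point for item 2, near-duplicate reviews could be flagged by the cosine similarity of their word counts. This is only a sketch of the idea, not something implemented in this project; the tokenizer, the 0.8 threshold and the sample texts are hypothetical choices.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in a review."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def text_cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = bag_of_words(a), bag_of_words(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(n * n for n in ca.values())) * math.sqrt(
        sum(n * n for n in cb.values())
    )
    return dot / norm if norm else 0.0


def near_duplicates(texts, threshold=0.8):
    """Return index pairs whose similarity reaches the (hypothetical) threshold."""
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if text_cosine(texts[i], texts[j]) >= threshold
    ]


# Hypothetical review texts, made up for illustration only.
texts = [
    "This game is really fun and addictive",
    "This game is really fun and very addictive",
    "Crashes every time I open it",
]
print(near_duplicates(texts))
```

Comparing every pair is quadratic in the number of reviews, which is manageable for a few thousand reviews but would need a smarter index for a much bigger dataset.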



Bing Liu, Sentiment Analysis and Opinion Mining (draft), Morgan & Claypool Publishers, May 2012

Ben Popken, 30 Ways You Can Spot Fake Online Reviews, April 14, 2010

About Author



Xu is a Master of Financial Engineering student at New York University. He received a Bachelor of Economics from the University of International Business and Economics. Xu has solid experience with machine learning and pair trading systems. Besides, he...