Round 7 Of The Yelp Dataset Challenge

Posted on Jul 25, 2016

In this blog post I document my explorations of the data from "The Yelp Dataset Challenge". In particular, I am interested in finding peculiar patterns and interesting insights the data could lend us.


  1. Yelp Dataset Challenge
  2. Data
  3. Data Munging
  4. Visual Exploratory Data Analysis
  5. Conclusions
  6. Future Work

Yelp Dataset Challenge

Yelp is a crowd-sourced review platform for local businesses where people can rate and review a variety of services. Every year since 2013, Yelp has organized the "Yelp Dataset Challenge", in which it provides some of the data collected from its website and offers prizes for the data science projects it deems good. Possible research directions, such as cultural trends, location mining and urban planning, are listed on the challenge's website.


Data

The data is available on Yelp's website as five JSON files, one for each of the business, users, checkin, tips and reviews categories. Yelp's description gives a concise summary of the contents:

  • 2.2M reviews and 591K tips by 552K users for 77K businesses
  • 566K business attributes, e.g., hours, parking availability, ambience
  • Social network of 552K users for a total of 3.5M social edges.
  • Aggregated check-ins over time for each of the 77K businesses
  • 200,000 pictures from the included businesses


Data Munging

This was my first time working with JSON data in R, and it was difficult. With some research, however, I was eventually able to load and transform the data.
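The loading code is not shown in this post. As a rough illustration only, here is a minimal sketch in Python rather than R, assuming each file stores one JSON object per line. The helper name `load_json_lines` and the two sample records are made up for the example; the field names (`business_id`, `state`, `stars`, `review_count`) are assumed to match the business file.

```python
import io
import json


def load_json_lines(handle):
    """Parse a file-like object that holds one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Two made-up records standing in for the business file (not real Yelp data).
sample = io.StringIO(
    '{"business_id": "b1", "state": "AZ", "stars": 4.0, "review_count": 12}\n'
    '{"business_id": "b2", "state": "NV", "stars": 3.5, "review_count": 40}\n'
)
businesses = load_json_lines(sample)
print(len(businesses), businesses[0]["state"])
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, where `path` is wherever the download was saved.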


Visual Exploratory Data Analysis

I first explored the distribution of business ratings. Most businesses are rated between 3.5 and 4.5 out of 5. Perhaps these ratings are warranted, or perhaps people lean towards giving positive ratings.
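One way to compute such a distribution is to count businesses per star value and convert the counts to shares. The sketch below is in Python, uses a hypothetical helper `rating_distribution`, assumes each business record carries a `stars` field, and runs on made-up records rather than the real data.

```python
from collections import Counter


def rating_distribution(businesses):
    """Return the share of businesses at each star rating, lowest rating first."""
    counts = Counter(b["stars"] for b in businesses)
    total = sum(counts.values())
    return {stars: counts[stars] / total for stars in sorted(counts)}


# Made-up records for illustration only.
sample = [{"stars": 3.5}, {"stars": 4.0}, {"stars": 4.0}, {"stars": 2.0}]
dist = rating_distribution(sample)
for stars, share in dist.items():
    print(f"{stars:.1f} stars: {share:.0%}")
```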



A similar pattern appears in the average number of reviews at each rating. On average, businesses rated between 3.5 and 4.5 stars have more reviews. I suspect that this is due to a positive feedback loop: people look for highly rated businesses on Yelp to visit and then leave high ratings themselves, which increases the number of reviews.
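The average described here is a grouped mean: sum the review counts at each star rating and divide by the number of businesses at that rating. A minimal Python sketch, assuming `stars` and `review_count` fields and using the hypothetical helper `mean_reviews_by_rating` on made-up records:

```python
from collections import defaultdict


def mean_reviews_by_rating(businesses):
    """Return the mean review count for each star rating, lowest rating first."""
    totals = defaultdict(lambda: [0, 0])  # stars -> [sum of reviews, number of businesses]
    for b in businesses:
        totals[b["stars"]][0] += b["review_count"]
        totals[b["stars"]][1] += 1
    return {stars: total / n for stars, (total, n) in sorted(totals.items())}


# Made-up records for illustration only.
sample = [
    {"stars": 4.0, "review_count": 10},
    {"stars": 4.0, "review_count": 30},
    {"stars": 2.0, "review_count": 4},
]
means = mean_reviews_by_rating(sample)
print(means)
```

A mean is sensitive to a few businesses with very many reviews, so a median per rating may be worth checking alongside it.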


Going deeper into the data, a cursory glance at the state level revealed that Arizona has the most businesses listed on Yelp.


At the city level, Las Vegas took the crown for most businesses on Yelp. This is unsurprising given the city's many casinos, hotels and restaurants.


The Las Vegas effect is perhaps responsible for Nevada having the highest number of reviews even though Arizona has more Yelp business listings. Again, I think this is a knock-on effect of its service industry.
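The two state rankings come from different aggregations: listings are counted per business, while reviews are summed over each business's review count. The Python sketch below shows how the two can disagree. It assumes `state` and `review_count` fields, uses the hypothetical helper `summarize_by_state`, and the numbers are made up to mirror the pattern, not taken from the data.

```python
from collections import Counter


def summarize_by_state(businesses):
    """Return (listings per state, total reviews per state) as two Counters."""
    listings = Counter()
    reviews = Counter()
    for b in businesses:
        listings[b["state"]] += 1
        reviews[b["state"]] += b["review_count"]
    return listings, reviews


# Made-up records: more listings in AZ, but more reviews in NV.
sample = [
    {"state": "AZ", "review_count": 5},
    {"state": "AZ", "review_count": 5},
    {"state": "AZ", "review_count": 5},
    {"state": "NV", "review_count": 50},
    {"state": "NV", "review_count": 30},
]
listings, reviews = summarize_by_state(sample)
print("Most listings:", listings.most_common(1)[0][0])
print("Most reviews:", reviews.most_common(1)[0][0])
```

Swapping `state` for `city` would give the city-level view in the same way.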


When it comes to the businesses themselves, the field is dominated by restaurants.



The distribution of the various restaurant subcategories is more uniform.
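A subcategory breakdown like this can be sketched by keeping only businesses tagged as restaurants and counting their remaining category labels. The Python example below assumes each business has a `categories` list that includes the label "Restaurants" where applicable; the helper `subcategory_counts` and the sample records are hypothetical.

```python
from collections import Counter


def subcategory_counts(businesses, parent="Restaurants"):
    """Count the other category labels on businesses tagged with `parent`."""
    counts = Counter()
    for b in businesses:
        categories = b.get("categories") or []  # tolerate a missing or null field
        if parent in categories:
            counts.update(c for c in categories if c != parent)
    return counts


# Made-up records for illustration only.
sample = [
    {"categories": ["Restaurants", "Pizza"]},
    {"categories": ["Restaurants", "Pizza", "Italian"]},
    {"categories": ["Shopping"]},
    {"categories": None},
]
counts = subcategory_counts(sample)
print(counts.most_common())
```

Because a business can carry several labels, these counts can add up to more than the number of restaurants.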




Conclusions

This is some of my preliminary analysis of this data. There is definitely room for a deeper look at its intricacies, which could reveal some of America's attitudes towards various businesses. Here are some things that I learned and possible directions for future analysis:


  • Most businesses are rated above 3.
  • Businesses rated between 3.5 and 4.5 have the most reviews on average.
  • Arizona and Nevada have the most businesses listed.
  • Arizona and Nevada have the most reviews.
  • Restaurants are the most frequently represented business in the data set.

Future Work

  • Explore the other three datasets provided: user check-ins, tips and reviews
  • Apply statistical analysis and create more visualizations
  • Apply predictive modeling when round 8 begins

About Author

Deepak Khurana

Deepak holds a Master's degree in Physics from the Indian Institute of Technology Kharagpur, one of the top engineering schools in India. He was then awarded the Henry M. MacCracken fellowship at New York University to pursue a...
