Round 7 Of The Yelp Dataset Challenge

Posted on Jul 25, 2016

In this blog post I document my explorations of the data from "The Yelp Dataset Challenge". In particular, I am interested in finding peculiar patterns and interesting insights the data could lend us.


  1. YELP Dataset Challenge
  2. Data
  3. Data Munging
  4. Visual Exploratory Data Analysis
  5. Conclusions
  6. Future Work

YELP Dataset Challenge

Yelp is a crowd-sourced review platform for local businesses where people can rate and review a variety of services. Every year since 2013, Yelp has organized the "Yelp Dataset Challenge", in which it releases some of the data collected from its website and offers prizes for the best data science projects. Possible research directions, such as cultural trends, location mining and urban planning, are listed on the challenge's website.


Data

The data is available on Yelp's website as five JSON files, one each for the business, users, checkin, tips and reviews categories. Yelp's description includes a concise look at the data’s contents.

  • 2.2M reviews and 591K tips by 552K users for 77K businesses
  • 566K business attributes, e.g., hours, parking availability, ambience.
  • Social network of 552K users for a total of 3.5M social edges.
  • Aggregated check-ins over time for each of the 77K businesses
  • 200,000 pictures from the included businesses


Data Munging

This was my first experience working with JSON data in R, and it was difficult. With some research, however, I eventually got the data loaded and transformed.
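For readers who want to try the loading step themselves, here is a minimal sketch in Python rather than the R used for this post. It assumes each file stores one JSON object per line; the helper name `read_json_lines` and the two sample records are made up for illustration and are not taken from the dataset.

```python
import io
import json


def read_json_lines(handle):
    """Parse a file-like object holding one JSON object per line into a list of dicts."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Two made-up records standing in for the business file.
# Real use would pass open(path, encoding="utf-8") instead of this in-memory sample.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4.0, "review_count": 12}\n'
    '{"business_id": "b2", "stars": 3.5, "review_count": 7}\n'
)
businesses = read_json_lines(sample)
print(len(businesses), businesses[0]["stars"])
```

Reading line by line keeps memory use predictable and avoids parsing the whole file as a single JSON document, which would fail if the file is a sequence of objects rather than one array.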


Visual Exploratory Data Analysis

I first explored the distribution of business ratings. Most businesses are rated between 3.5 and 4.5 out of 5. Perhaps these ratings are warranted, or perhaps reviewers lean towards positive ratings.



A similar pattern appears in the average number of reviews at each rating. On average, businesses rated between 3.5 and 4.5 stars have more reviews. I suspect that this is due to a positive feedback loop: people look for highly rated businesses on Yelp to visit and then leave high ratings themselves, which increases the number of reviews.
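The two summaries above (how many businesses fall at each rating, and how many reviews they have on average) can be sketched as a single grouping step. This is an illustrative Python version, not the R code behind the plots; it assumes each business record carries `stars` and `review_count` fields, and the toy records are invented.

```python
from collections import defaultdict


def reviews_by_rating(businesses):
    """Return {stars: (number of businesses, mean review_count)}, ordered by stars."""
    totals = defaultdict(lambda: [0, 0])  # stars -> [business count, review total]
    for b in businesses:
        entry = totals[b["stars"]]
        entry[0] += 1
        entry[1] += b["review_count"]
    return {stars: (n, total / n) for stars, (n, total) in sorted(totals.items())}


# Invented records, used only to show the shape of the result.
toy = [
    {"stars": 4.0, "review_count": 30},
    {"stars": 4.0, "review_count": 10},
    {"stars": 2.0, "review_count": 4},
]
summary = reviews_by_rating(toy)
for stars, (n, mean_reviews) in summary.items():
    print(stars, n, mean_reviews)
```

The first element of each pair gives the rating distribution and the second gives the average review count, so both plots could be drawn from the same table.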


Going deeper into the data, a cursory glance at the state level revealed that Arizona has the most businesses listed on Yelp.


At the city level, Las Vegas took the crown for the most businesses on Yelp. This is unsurprising given the city’s many casinos, hotels and restaurants.


The Las Vegas effect is perhaps responsible for Nevada having the highest number of reviews even though Arizona has more Yelp business listings. Again, I think this is a knock-on effect of its service industry.
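The contrast between listings and reviews comes down to counting rows per state versus summing reviews per state. Below is a hedged Python sketch of that comparison. It assumes `state` and `review_count` fields on each business record, and summing `review_count` is itself an assumption, since the totals could equally be obtained by counting rows in the reviews file. The toy numbers are invented and do not reflect the real data.

```python
from collections import Counter


def state_totals(businesses):
    """Return (listings per state, total review_count per state) as two Counters."""
    listings = Counter()
    reviews = Counter()
    for b in businesses:
        listings[b["state"]] += 1
        reviews[b["state"]] += b["review_count"]
    return listings, reviews


# Invented records: one state has more listings, the other has more reviews.
toy = [
    {"state": "AZ", "review_count": 5},
    {"state": "AZ", "review_count": 8},
    {"state": "NV", "review_count": 40},
]
listings, reviews = state_totals(toy)
print(listings.most_common(1))
print(reviews.most_common(1))
```

Replacing `"state"` with `"city"` would give the city-level ranking in the same way.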


When it comes to the businesses themselves, the field is dominated by restaurants.
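A category ranking like this one can be sketched by flattening each business's category labels and counting them. The Python below is illustrative only; it assumes the `categories` field is a list of strings (it may be missing or empty for some businesses), and the toy records are invented.

```python
from collections import Counter


def category_counts(businesses):
    """Count how many businesses carry each category label."""
    counts = Counter()
    for b in businesses:
        # Treat a missing or null category list as empty.
        counts.update(b.get("categories") or [])
    return counts


# Invented records, including one without categories.
toy = [
    {"categories": ["Restaurants", "Pizza"]},
    {"categories": ["Restaurants", "Mexican"]},
    {"categories": ["Shopping"]},
    {"categories": None},
]
counts = category_counts(toy)
print(counts.most_common(1))
```

Because one business can carry several labels, these counts add up to more than the number of businesses. Restricting the input to businesses labeled as restaurants and dropping that label would give the subcategory breakdown.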



The distribution across the various restaurant subcategories is more uniform.




Conclusions

This is some of my preliminary analysis of this data. There is definitely room for a deeper look at its intricacies, which could reveal some of America’s attitudes towards various businesses. Here are some things that I learned, followed by possible directions for future analysis:


  • Most businesses are rated above 3.
  • Businesses rated 3.5-4.5 have the most reviews on average.
  • Arizona and Nevada have the most businesses listed.
  • Arizona and Nevada have the most reviews.
  • Restaurants are the most frequently represented business in the data set.

Future Work

  • Explore the other three datasets provided: user check-ins, tips and reviews
  • Apply statistical analysis and create more visualizations
  • Apply predictive modeling when round 8 begins

About Author

Deepak Khurana

Deepak holds a Master's degree in Physics from the Indian Institute of Technology Kharagpur, one of the top engineering schools in India. He was then awarded the Henry M. MacCracken fellowship at New York University to pursue a...