Data Study on Successful Restaurants on OpenTable

Posted on Mar 3, 2020
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

The Objective

The goal of this web scraping project is to identify a restaurant type with an above-average success rate based on data from OpenTable. This project treats the number of reviews as a proxy for popularity and the average rating as a measure of how well a restaurant is liked. Creating a successful restaurant is difficult, so the standard deviation is a useful statistic when considering how likely a restaurant is to succeed or fail. Here, we compare the standard deviation of ratings and number of reviews for different restaurant groupings.

How OpenTable Operates

OpenTable is a SaaS company that focuses on connecting people with restaurants throughout the world. It provides information on the location, cuisine, price range, rating and number of reviews for many restaurants. A key feature of OpenTable's success is its online reservation system, which includes rewards. Restaurants pay to be a part of the OpenTable system and are charged a dollar per seat reserved through the website. This relationship results in OpenTable skewing its review data toward higher ratings. The lowest rating a restaurant on OpenTable can receive is 3 out of 5 stars.

Data on The Web Scraping and Cleaning Process

The data was collected using Scrapy. Getting the data to analyze required sifting through many paths. Starting on OpenTable's main page, I collected all of the URLs associated with each area. Sixty areas in total were collected, ranging from 6,580 restaurants in Miami, United States to 68 restaurants in Kyoto, Japan. Once each area was collected, the URLs for the locations within that area needed to be generated. These URLs included a latitude, longitude and metroID. This information was collected from the main page of each area and then formatted to redirect to each location.
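As a rough illustration of the URL-generation step, the sketch below assembles a location URL from a latitude, longitude and metroID. The base URL and the query parameter names are placeholders chosen for this example, not OpenTable's real endpoint or the ones used in the project.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration, not OpenTable's real URL.
BASE_URL = "https://www.example.com/search"


def build_location_url(latitude, longitude, metro_id):
    """Build a search URL for one location from its coordinates and metro ID.

    The parameter names ("latitude", "longitude", "metroId") are hypothetical.
    """
    query = urlencode(
        {"latitude": latitude, "longitude": longitude, "metroId": metro_id}
    )
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    # Example values are made up for demonstration.
    print(build_location_url(41.88, -87.63, 3))
```

In a Scrapy spider, URLs built this way would typically be yielded as new requests so that each location page is crawled in turn.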

Finally, I scraped every restaurant on OpenTable and collected the area, location, restaurant name, review count, rating, cuisine, cost, link to the restaurant, position on the website page, booking information for that day and whether or not it was currently being promoted. In total 163,831 restaurants were collected.

After collecting this information the data needed to be cleaned.  All areas, locations, restaurant names and cuisines were converted to lowercase and any white space was removed.  The cost was converted from dollar signs to the ranges associated with those dollar signs.  For example, $$$ was converted to $31 to $50.  The rating was converted to a percent.  This was done by subtracting 3, then dividing by 2 and multiplying by 100, since all restaurant scores fell between 3 and 5 stars.  Finally, the promoted section was converted to a boolean.
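The cleaning steps described above might look something like the following minimal sketch. The field names, the shape of the raw record, and the way the promoted flag is stored (assumed here to be a text label that is empty when a restaurant is not promoted) are assumptions for illustration. Only the "$$$" price mapping comes from the text, and "removing whitespace" is interpreted as stripping leading and trailing whitespace.

```python
# Mapping from dollar-sign tiers to price ranges.
# Only the "$$$" entry is stated in the article; other tiers are left out.
PRICE_RANGES = {
    "$$$": "$31 to $50",
}


def normalize_text(value):
    """Lowercase a string and strip leading and trailing whitespace."""
    return value.strip().lower()


def rating_to_percent(stars):
    """Rescale a rating on the 3-to-5 star range onto a 0-to-100 scale."""
    return (stars - 3) / 2 * 100


def clean_record(raw):
    """Clean one scraped record (hypothetical field names)."""
    return {
        "area": normalize_text(raw["area"]),
        "cuisine": normalize_text(raw["cuisine"]),
        # Fall back to the raw value for tiers not covered by the mapping.
        "cost": PRICE_RANGES.get(raw["cost"], raw["cost"]),
        "rating_pct": rating_to_percent(raw["rating"]),
        # Assumes an empty string means "not promoted".
        "promoted": bool(raw["promoted"]),
    }


if __name__ == "__main__":
    sample = {
        "area": " Chicago ",
        "cuisine": "Steakhouse",
        "cost": "$$$",
        "rating": 4.5,
        "promoted": "Promoted",
    }
    print(clean_record(sample))
```

Under this rescaling a 3-star restaurant maps to 0% and a 5-star restaurant maps to 100%.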

The Data Analysis

For the majority of the data analysis the 163,831 restaurants are reduced to 23,727 restaurants by eliminating all restaurants with fewer than 50 reviews. This prevents restaurants with very few or no ratings from heavily impacting the analysis. However, it also skews the data toward a subset that may not be representative of the entire population, an important caveat to keep in mind. This issue is illustrated in the pie charts below: a larger percentage of expensive restaurants is represented in the final dataset.
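A minimal sketch of this filtering step, and of the check that reveals the shift toward expensive restaurants, is shown here. The record layout and field names ("review_count", "cost") are assumptions, and the sample data is invented purely to demonstrate the idea.

```python
from collections import Counter

MIN_REVIEWS = 50  # threshold stated in the article


def filter_by_reviews(restaurants, min_reviews=MIN_REVIEWS):
    """Keep only restaurants with at least `min_reviews` reviews."""
    return [r for r in restaurants if r["review_count"] >= min_reviews]


def price_share(restaurants):
    """Return the fraction of restaurants in each price tier."""
    counts = Counter(r["cost"] for r in restaurants)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in counts.items()}


if __name__ == "__main__":
    # Invented records for demonstration only.
    sample = [
        {"cost": "$$", "review_count": 10},
        {"cost": "$$", "review_count": 20},
        {"cost": "$$", "review_count": 120},
        {"cost": "$$$$", "review_count": 400},
    ]
    print("before:", price_share(sample))
    print("after: ", price_share(filter_by_reviews(sample)))
```

Comparing the two share dictionaries is the tabular equivalent of comparing the pie charts before and after filtering.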

Rating Distribution

Rating Based on Area

This graph illustrates how skewed the OpenTable data is toward higher ratings. Only 8.32% of all restaurants in the dataset have a rating below 4 stars. To identify the best location to open a restaurant, we use a funneling approach. Looking only at the four major cities in the United States, we compare ratings and popularity.

Chicago Restaurant Popularity 

These graphs indicate that Chicago restaurants are more likely to be successful on OpenTable. They receive more reviews on average and have a higher average rating. Now we will investigate the top neighborhoods within Chicago.

River North and Gold Coast

River North and Gold Coast are comparable in popularity, so let's investigate further by splitting each neighborhood by cost. The Gold Coast seems to perform particularly well in the upper price range.

Top Cuisines in Chicago

So what food would be best to sell? Below we take the top cuisines represented on OpenTable and compare their performance within Chicago.

Steakhouses

From these graphs it is clear that steakhouses outperform other cuisine types. Finally, we compare this choice against all restaurants in our dataset and against all steakhouses to see whether our proposal holds up.

These graphs support the claim. While this approach is not exhaustive and does not represent the full investigation that a prospective owner might do before opening a new location, it demonstrates an approach to investigating large data to reach a conclusion.

Conclusion

When the number of reviews and the ratings of restaurants in the four major cities in the United States are used as criteria for success, the data suggests that, if you were considering investing in a restaurant, a steakhouse in Chicago would be a likely successful choice. With a smaller standard deviation in both ratings and review count, as well as higher average ratings and review counts, steakhouses in Chicago are less likely to fail than other restaurant types throughout the United States.
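The group comparison behind this conclusion could be computed along the following lines. This is a sketch under assumptions: the field names are hypothetical, the sample rows are invented, and the sample standard deviation (`statistics.stdev`) is used because the article does not say whether the sample or population form was applied.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_group(restaurants, key):
    """Return mean and sample standard deviation of rating and review count
    for each value of `key` (for example "cuisine" or "area").

    Groups with fewer than two restaurants are skipped, since a sample
    standard deviation is undefined for a single observation.
    """
    groups = defaultdict(list)
    for r in restaurants:
        groups[r[key]].append(r)

    summary = {}
    for name, rows in groups.items():
        if len(rows) < 2:
            continue
        ratings = [r["rating_pct"] for r in rows]
        reviews = [r["review_count"] for r in rows]
        summary[name] = {
            "rating_mean": mean(ratings),
            "rating_std": stdev(ratings),
            "reviews_mean": mean(reviews),
            "reviews_std": stdev(reviews),
        }
    return summary


if __name__ == "__main__":
    # Invented rows for demonstration only.
    rows = [
        {"cuisine": "steakhouse", "rating_pct": 80.0, "review_count": 100},
        {"cuisine": "steakhouse", "rating_pct": 90.0, "review_count": 300},
        {"cuisine": "italian", "rating_pct": 60.0, "review_count": 50},
        {"cuisine": "italian", "rating_pct": 95.0, "review_count": 900},
    ]
    for cuisine, stats in summarize_by_group(rows, "cuisine").items():
        print(cuisine, stats)
```

A grouping with a higher mean and a lower standard deviation on both measures is the kind of candidate the funneling approach looks for.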

Future Works on Data

  • Collect review data to identify words associated with positively reviewed restaurants
  • Compare restaurant ratings with menu items
  • Collect restaurant data over time to identify trends, assuming that restaurants that are removed from the website went out of business
  • Use regular expressions to identify chain restaurants by name and compare their success in varying locations
  • Identify the impact of a restaurant's position on the page on the number of online bookings for OpenTable
  • Identify the impact of being promoted on the number of online bookings for OpenTable
  • Examine the relationship between bookings and the average income of an area

About Me

Michael Emmert

About Author

Michael Emmert

Michael Emmert graduated from The George Washington University in May of 2019 with a Bachelor's degree in Mechanical Engineering. Through his Bachelor's he gained skills in mathematics, communicating ideas to non-technical groups, data manipulation and trend identification as...