Web Scraping OpenTable

Posted on Mar 3, 2020

The Objective

The goal of this web scraping project is to identify a restaurant type with an above-average success rate based on data from OpenTable. This project treats the number of reviews as a measure of popularity and the average rating as a measure of how well a restaurant is liked. Creating a successful restaurant is difficult, so the standard deviation is a useful statistic when considering how likely a restaurant is to succeed or fail: a smaller spread means that outcomes within a group are more consistent. Here, we compare the standard deviation of ratings and number of reviews for different restaurant groupings.
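As a rough illustration of this metric, the sketch below computes the mean and sample standard deviation of rating and review count for each grouping. The field names (`cuisine`, `rating`, `review_count`) and the sample rows are made up for illustration and are not taken from the scraped dataset.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_group(rows, group_key):
    """Return mean and sample standard deviation of rating and review count per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    summary = {}
    for name, members in groups.items():
        ratings = [m["rating"] for m in members]
        reviews = [m["review_count"] for m in members]
        summary[name] = {
            "n": len(members),
            "rating_mean": mean(ratings),
            # stdev needs at least two points; fall back to 0.0 for singletons
            "rating_std": stdev(ratings) if len(ratings) > 1 else 0.0,
            "reviews_mean": mean(reviews),
            "reviews_std": stdev(reviews) if len(reviews) > 1 else 0.0,
        }
    return summary


# Made-up rows for illustration only, not scraped data.
sample = [
    {"cuisine": "steakhouse", "rating": 4.6, "review_count": 900},
    {"cuisine": "steakhouse", "rating": 4.8, "review_count": 1100},
    {"cuisine": "italian", "rating": 4.0, "review_count": 200},
    {"cuisine": "italian", "rating": 4.8, "review_count": 1000},
]
stats = summarize_by_group(sample, "cuisine")
```

A group with a high mean and a low standard deviation on both measures is the kind of grouping this project looks for.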

How OpenTable Operates

OpenTable is a SaaS company that focuses on connecting people with restaurants throughout the world. It provides information on the location, cuisine, price range, rating and number of reviews for many restaurants. A key feature of OpenTable's success is its online reservation system, which includes rewards. Restaurants pay to be a part of the OpenTable system and are charged a dollar per seat reserved through the website. This relationship results in OpenTable's review data skewing toward higher ratings. The lowest rating a restaurant on OpenTable can receive is 3 out of 5 stars.

The Web Scraping and Cleaning Process

The data was collected using Scrapy. Getting the data to analyze required sifting through many paths. Starting on OpenTable's main page, I collected all of the URLs associated with each area. Sixty areas in total were collected, ranging from 6,580 restaurants in Miami, United States to 68 restaurants in Kyoto, Japan. Once each area was collected, the listing URLs for each area needed to be generated. These URLs included a latitude, longitude and metroID. This information was collected from the main page of each area and then formatted to redirect to each location. Finally, I scraped every restaurant from OpenTable and collected the area, location, restaurant name, review count, rating, cuisine, cost, link to the restaurant, position on the website page, booking information for that day and whether or not it was currently being promoted. In total, 163,831 restaurants were collected.
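The URL generation step can be sketched as follows. This is only an illustration: the base URL, the query parameter names and the sample values are assumptions, since the post does not give the actual URL format used on OpenTable.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; the real OpenTable URLs may differ.
BASE_URL = "https://www.opentable.com/s"


def build_area_url(latitude, longitude, metro_id):
    """Build a listing URL for one location from its coordinates and metro ID."""
    query = urlencode({
        "latitude": latitude,
        "longitude": longitude,
        "metroId": metro_id,
    })
    return f"{BASE_URL}?{query}"


# Placeholder values for illustration, not values taken from the site.
url = build_area_url(41.8781, -87.6298, 3)
```

In a Scrapy spider, URLs built this way would typically be yielded as new requests from the callback that parses each area's main page.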

After collecting this information the data needed to be cleaned.  All areas, locations, restaurant names and cuisines were converted to lowercase and any white space was removed.  The cost was converted from dollar signs to the ranges associated with those dollar signs.  For example, $$$ was converted to $31 to $50.  The rating was converted to a percent.  This was done by subtracting 3, then dividing by 2 and multiplying by 100, since all restaurant scores fell between 3 and 5 stars.  Finally, the promoted section was converted to a boolean.
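The cleaning steps above can be sketched in a few small functions. The field names and the raw format of the promoted flag are assumptions. Only the `$$$` price range is stated in this post, so the other tiers are left unmapped, and "white space was removed" is interpreted here as trimming leading and trailing whitespace.

```python
# Only the "$$$" range is given in the post; other tiers are intentionally omitted.
PRICE_RANGES = {"$$$": "$31 to $50"}


def clean_text(value):
    """Lowercase a string and trim surrounding whitespace (assumed interpretation)."""
    return value.strip().lower()


def rating_to_percent(stars):
    """Map a 3-to-5 star rating onto 0-100: subtract 3, divide by 2, multiply by 100."""
    return (stars - 3) / 2 * 100


def clean_record(raw):
    """Clean one scraped record; the keys used here are hypothetical."""
    return {
        "area": clean_text(raw["area"]),
        "location": clean_text(raw["location"]),
        "name": clean_text(raw["name"]),
        "cuisine": clean_text(raw["cuisine"]),
        # Leave unknown price tiers unchanged rather than guessing their range
        "cost": PRICE_RANGES.get(raw["cost"], raw["cost"]),
        "rating_pct": rating_to_percent(raw["rating"]),
        # Assumes the raw flag is empty or missing when the listing is not promoted
        "promoted": bool(raw.get("promoted")),
    }


# Made-up record for illustration only.
record = clean_record({
    "area": " Chicago ",
    "location": "Gold Coast",
    "name": "Example Steakhouse",
    "cuisine": "Steakhouse ",
    "cost": "$$$",
    "rating": 4.5,
    "promoted": "",
})
```

With this conversion a 3-star restaurant maps to 0 percent and a 5-star restaurant to 100 percent.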

The Data Analysis

For the majority of the data analysis, the 163,831 restaurants are reduced to 23,727 restaurants by eliminating all restaurants with fewer than 50 reviews. This prevents outliers with very few ratings, or none at all, from heavily impacting the analysis. However, it also skews the analysis toward a subset of restaurants that may not be representative of the entire population, an important caveat to keep in mind. This issue is illustrated in the pie charts below. A larger percentage of expensive restaurants is represented in the final dataset.
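A minimal sketch of this filtering step, together with a check on how it shifts the price mix, might look like the following. The field names and sample rows are made up for illustration; only the 50-review threshold comes from the analysis.

```python
from collections import Counter

MIN_REVIEWS = 50  # threshold used in this analysis


def filter_by_reviews(rows, min_reviews=MIN_REVIEWS):
    """Keep only restaurants with at least `min_reviews` reviews."""
    return [r for r in rows if r["review_count"] >= min_reviews]


def price_shares(rows):
    """Return the fraction of restaurants in each price tier."""
    counts = Counter(r["cost"] for r in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: count / total for tier, count in counts.items()}


# Made-up rows for illustration only, not scraped data.
rows = [
    {"cost": "$$", "review_count": 10},
    {"cost": "$$", "review_count": 60},
    {"cost": "$$", "review_count": 5},
    {"cost": "$$$$", "review_count": 500},
]
before = price_shares(rows)
after = price_shares(filter_by_reviews(rows))
```

Comparing `before` and `after` shows the kind of shift toward expensive restaurants that the pie charts illustrate.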

Continuing to explore the data, we plot a histogram of the restaurant ratings.

This graph illustrates how skewed the OpenTable data is toward higher ratings. Only 8.32% of all restaurants in the dataset have a rating below 4 stars. In order to identify the best location to open a restaurant, we are going to use a funneling approach. Looking only at the four major cities in the United States, we compare ratings and popularity.

 

These graphs indicate that Chicago restaurants are more likely to be successful on OpenTable. They receive more reviews on average and have a higher average rating. Now we will investigate the top neighborhoods within Chicago.

River North and Gold Coast are comparable in popularity, so let's investigate further by splitting each neighborhood by cost. The Gold Coast in Chicago seems to perform particularly well in the upper-end price range.

So what food would be best to sell? Below we take the top cuisines represented on OpenTable and compare their performance within Chicago.

From these graphs it is clear that steakhouses outperform the other cuisine types. Finally, we compare this choice with all restaurants in our dataset and with all steakhouses to see whether our proposal holds up.

These graphs support the claim. While this approach is not exhaustive and does not represent the full investigation that a prospective owner might do before opening a new location, it demonstrates an approach to investigating large data to come to a conclusion.

Conclusion

When using the number of reviews and the ratings of restaurants in the four major cities in the United States as criteria for success, the analysis shows that, if you were considering investing in a restaurant, a likely successful choice would be a steakhouse in Chicago. With a smaller standard deviation in both ratings and review count, as well as higher average ratings and review counts, steakhouses in Chicago are less likely to fail than other restaurant types throughout the United States.

Future Work

  • Collect review data to identify words associated with positively reviewed restaurants
  • Compare restaurant ratings with menu items
  • Collect restaurant data over time to identify trends, assuming that restaurants that are removed from the website went out of business
  • Use regular expressions to identify chain restaurants by name and compare their success in varying locations
  • Identify the impact of position on the page on the number of online bookings for OpenTable
  • Identify the impact of being promoted on the number of online bookings for OpenTable
  • Examine the relationship between bookings and the average income of an area

About Me

Michael Emmert

Michael Emmert graduated from The George Washington University in May of 2019 with a Bachelors degree in Mechanical Engineering. Through his Bachelors he gained skills in mathematics, communicating ideas to non-technical groups, data manipulation and trend identification as...