Web Scraping Seamless.com: East Village Restaurant Rating Predictors

Posted on Nov 24, 2019

Overview

Seamless is a food ordering service that allows users to place delivery or takeout orders online. It operates in New York City, among other major metropolitan areas in the US and Europe. I scraped data on 1,513 restaurants that deliver in my neighborhood in Manhattan, and analyzed how cuisine type, order accuracy, order timeliness, and food quality correlate with overall user restaurant ratings.

Background

During my MBA program, a colleague and I were admitted to a program within the Yale School of Management that allowed students to work on new ventures for course credit. Our venture offered demand-based pricing analysis to small independent restaurants and burgeoning chains. Our hypothesis was that while large chains may have sophisticated pricing procedures, small restaurants price non-optimally based on loose competitor analysis and general intuition, thus leaving money on the table (pun intended). Our hypothesis turned out to be correct, but our service never gained traction, as restaurateurs proved highly risk-averse when it came to experimenting with price adjustments. We felt that richer data on competitor pricing might have allowed us to overcome this barrier to adoption, but we had no systematic way to collect this data.

Building on this experience, my goal for this project was to explore how the tools I have been learning at NYC Data Science Academy might have helped our venture (or a similar venture) succeed in its earliest stages.

Starting small, I decided to investigate the following three key questions instead of pricing considerations:

  1. Are user ratings for restaurants that deliver to the East Village in Manhattan via Seamless.com significantly different by cuisine type?
  2. How are order accuracy, timeliness, and food quality correlated with ratings, if at all?
  3. Do the correlations mentioned above differ by cuisine type?

The Scraping Procedure and Data

To answer these three questions, I filtered the Seamless website for all restaurants that deliver to my address in the East Village. This yielded 76 pages of results with 20 restaurants listed per page. I then scraped the Seamless URL for each listed restaurant on each page by identifying the HTML tag containing the address as shown below.
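A minimal sketch of the URL-collection step, using only Python's standard library. The post does not show the actual markup, so the link class `restaurant-name` and the `/menu/...` paths in the sample are hypothetical placeholders, not Seamless's real structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.seamless.com"
LINK_CLASS = "restaurant-name"  # hypothetical class name, assumed for illustration


class RestaurantLinkParser(HTMLParser):
    """Collects the href of every <a> tag that carries the target class."""

    def __init__(self, link_class=LINK_CLASS):
        super().__init__()
        self.link_class = link_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.link_class in classes:
            # Turn relative links into absolute URLs.
            self.urls.append(urljoin(BASE_URL, href))


def extract_restaurant_urls(page_html):
    """Return the unique restaurant URLs on one results page, in page order."""
    parser = RestaurantLinkParser()
    parser.feed(page_html)
    return list(dict.fromkeys(parser.urls))  # de-duplicate, keep order


# Made-up sample markup standing in for one rendered results page.
sample_page = """
<div class="result">
  <a class="restaurant-name" href="/menu/example-one/111">Example One</a>
</div>
<a class="restaurant-name" href="/menu/example-two/222">Example Two</a>
<a class="ad-link" href="/promo">Promo</a>
<a class="restaurant-name" href="/menu/example-one/111">Example One</a>
"""

urls = extract_restaurant_urls(sample_page)
print(urls)
```

In practice the function would be called once per results page (76 in this case), with the rendered HTML of each page as input, for example Selenium's `driver.page_source`.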

Using Python and Selenium, I then built another web scraper to pull the data listed in the table below from each restaurant URL. I chose Selenium because the data on these pages is retrieved via Ajax calls, so an active connection was required; I also built in a wait time before running the scraper to allow the data to load. The HTML tag for the star rating only reported the range of pixels to be filled in with yellow, so I converted this range to the corresponding rating during scraping.
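The pixel-to-rating conversion might look something like the sketch below. It assumes the filled range is exposed as a CSS `width` in pixels on the star element and that a full five-star bar is 100px wide; both the attribute format and the 100px figure are assumptions for illustration, since the post does not give the actual values.

```python
import re

MAX_STARS = 5
FULL_WIDTH_PX = 100.0  # hypothetical width of a completely filled star bar


def stars_from_style(style, full_width_px=FULL_WIDTH_PX):
    """Convert a style string such as 'width: 90px;' into a star rating.

    Returns None when no pixel width can be found in the string.
    """
    match = re.search(r"width:\s*(\d+(?:\.\d+)?)px", style)
    if match is None:
        return None
    filled_px = float(match.group(1))
    # Clamp to [0, 1] in case the reported width exceeds the full bar.
    fraction = min(max(filled_px / full_width_px, 0.0), 1.0)
    return round(fraction * MAX_STARS, 2)


print(stars_from_style("width: 90px;"))  # 90 of 100 px filled
```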

Question 1: Impact of Cuisine Type

The bar chart above shows the differences in average rating by restaurant cuisine type. While these differences appear minute, an ANOVA test (F = 9.57, p < 0.01) determined that at least one of the groups showed a statistically significant difference (a Bartlett test for equal variances and visual inspection of the ratings distribution confirmed that the ANOVA assumptions were met). While not all pairwise t-tests were performed, those that were showed that lower Manhattan consumers rated Asian cuisines significantly higher than all other types except Italian. This may be due to restaurant quality, consumer preference, or demographic mix in the area.
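For readers unfamiliar with the test, here is a minimal standard-library sketch of how a one-way ANOVA F statistic is computed: the variance between group means divided by the variance within groups. The ratings below are made-up toy numbers, not the scraped data, and this is not necessarily how the analysis in this post was run. In practice, `scipy.stats.f_oneway` returns the same statistic together with its p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (cuisine types)
    n = len(all_values)    # total number of observations

    group_means = [mean(group) for group in groups]
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(
        len(group) * (g_mean - grand_mean) ** 2
        for group, g_mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of ratings around their own group mean.
    ss_within = sum(
        (value - g_mean) ** 2
        for group, g_mean in zip(groups, group_means)
        for value in group
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Toy ratings for three hypothetical cuisine groups.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_groups)
print(f_stat)
```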

Question 2: Importance of Delivery Quality, Timeliness, and Accuracy

*Data slightly stylized - jitter added along the y-axis

The graphs above show that delivery quality, timeliness, and accuracy are each moderately and positively correlated with restaurant rating. However, a separate analysis indicated a high degree of multicollinearity among these factors, suggesting that they are not all independent explanatory factors for restaurant rating.
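One simple way to screen for multicollinearity is to compute the pairwise Pearson correlations among the predictors themselves. The sketch below does this with the standard library; the column values are invented toy numbers, and the post does not say which method the separate analysis actually used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def pairwise_correlations(columns):
    """Return {(name_a, name_b): r} for every pair of columns."""
    names = list(columns)
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy predictor columns (made up for illustration only).
toy = {
    "quality": [80, 85, 90, 95],
    "timeliness": [70, 75, 80, 85],
    "accuracy": [90, 80, 95, 85],
}
corr = pairwise_correlations(toy)
for pair, r in corr.items():
    print(pair, round(r, 3))
```

Pairs with a correlation close to 1 or -1 carry largely the same information, which is why including all of them in a regression makes individual coefficients hard to interpret.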

The next step in my analysis here will be to build a multivariate linear or penalized linear regression model to isolate the best predictor(s). This has not yet been completed. Given the limited variance and interval nature of the rating variable, a classification model may also be useful.

Question 3: Differences in Correlations by Cuisine Type

*Data slightly stylized - jitter added along the y-axis

For the third question, I decided to look at just two cuisine types: Chinese and American. One might hypothesize that for Chinese delivery, customers would care more about accuracy and timeliness than they do for American food. The above graphs only show correlation, but suggest that this hypothesis may be worth investigating further. Accuracy and timeliness are more strongly correlated with rating among Chinese restaurants than among American restaurants. However, quality is also more highly correlated with ratings for Chinese restaurants, though the gap in correlation coefficients is smaller. An additional caveat is that the looser correlations among American restaurants may be due to the wide variability in quality, accuracy, and timeliness for poorly rated American restaurants.
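The per-cuisine comparison above amounts to computing the same correlation separately within each group. A minimal sketch, assuming each restaurant is a dictionary with hypothetical keys `cuisine`, `accuracy`, and `rating`, and using made-up toy records rather than the scraped data:

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_group(records, group_key, x_key, y_key):
    """Return {group: r} where r is the x/y correlation within that group."""
    grouped = defaultdict(lambda: ([], []))
    for row in records:
        xs, ys = grouped[row[group_key]]
        xs.append(row[x_key])
        ys.append(row[y_key])
    return {group: pearson_r(xs, ys) for group, (xs, ys) in grouped.items()}


# Toy records (invented values, for illustration only).
toy_records = [
    {"cuisine": "Chinese", "accuracy": 80, "rating": 3.5},
    {"cuisine": "Chinese", "accuracy": 85, "rating": 4.0},
    {"cuisine": "Chinese", "accuracy": 90, "rating": 4.5},
    {"cuisine": "American", "accuracy": 80, "rating": 4.0},
    {"cuisine": "American", "accuracy": 85, "rating": 3.5},
    {"cuisine": "American", "accuracy": 90, "rating": 4.5},
]
by_cuisine = correlation_by_group(toy_records, "cuisine", "accuracy", "rating")
print(by_cuisine)
```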

Again, the next step in this analysis would be to create two multivariate linear models, one per cuisine type, and compare coefficients across the models.

Conclusions and Next Steps

This project demonstrated that ratings differ significantly by cuisine type for restaurants that deliver to the East Village in Manhattan. Order accuracy, quality, and timeliness are also correlated with ratings, and this correlation may differ by cuisine type. As mentioned above, additional modeling is required to tease out the nuances of these relationships.

In addition, a broader business analysis is required to understand the import of these findings. For example, how does rating relate to demand? The number of reviews could be used as a proxy for demand in this scenario. How much might it cost a restaurant to improve order accuracy or timeliness, and would this lead to a bump in demand that makes such an investment worthwhile?

Finally, by allotting a longer time to run my scraper, I would be able to collect data on all restaurants in Manhattan and beyond, and could also collect the prices of all menu items offered online. Building this type of pricing database would be highly valuable to restaurant owners, as competitive dynamics are a key input to price determination. I would not be surprised to learn that Seamless already returns similar analytics to its partners as an incentive to list on the site. Regardless, it is clear that the ability to scrape and analyze restaurant data would have been highly valuable to my (failed) restaurant consulting venture.

About Author


Aron Berke

Aron is a healthcare and business-oriented data scientist. He holds a Master's Degree in Public Health (MPH) and a Master's Degree in Business Administration (MBA) from Yale University. He has previously worked as a Data Analyst for the...