Exploring The Bubble Tea Industry By Scraping Yelp.com

Posted on Aug 5, 2023

Bubble tea originated in Taiwan and has become increasingly popular around the globe since the early 2000s. The market continues to grow rapidly to this day. To gain insights into this market in the United States, this project scraped Yelp.com for related business establishments in the 10 most populous cities in the country.

Please see my GitHub repo for details on the code.

1. Introduction

The objective of this project is to explore how the bubble tea industry has grown in major US cities over the years. This project aims to answer the following questions:

  1. Is the industry actually growing?
  2. Who are the biggest players? 
  3. Where are the locations worth investing in?

2. Data

Two data sets were used:  

  • Bubble tea business: data is obtained from scraping Yelp.com
  • Most populous cities: obtained by referencing the metropolitan statistical area (MSA) data provided by the US Census Bureau

All qualifying bubble tea search results on Yelp were obtained with Scrapy. During the scraping process, the following criteria were used to define a valid bubble tea business:

  • Has the label “Bubble Tea”
  • At least 1 review
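
As an illustration, the two criteria above can be expressed as a small filter over scraped records. This is a minimal sketch, not the project's actual code: the field names `categories` and `review_count` and the sample records are hypothetical.

```python
def is_valid_bubble_tea_business(record):
    """Return True if a scraped record meets both criteria.

    Assumes (hypothetically) that each record is a dict with a
    "categories" list of Yelp labels and an integer "review_count".
    """
    labels = [label.strip().lower() for label in record.get("categories", [])]
    review_count = record.get("review_count") or 0
    return "bubble tea" in labels and review_count >= 1


# Made-up sample records, for illustration only.
scraped = [
    {"name": "Shop A", "categories": ["Bubble Tea", "Coffee & Tea"], "review_count": 12},
    {"name": "Shop B", "categories": ["Coffee & Tea"], "review_count": 40},
    {"name": "Shop C", "categories": ["Bubble Tea"], "review_count": 0},
]
valid = [r for r in scraped if is_valid_bubble_tea_business(r)]
print([r["name"] for r in valid])
```

Only Shop A passes: Shop B lacks the label and Shop C has no reviews.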

The top 10 cities by population according to the US Census Bureau are: New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San Diego, Dallas, and San Jose.

After the scraped data was cleaned, about 1,200 bubble tea stores were identified within the selected cities.

As a side note, there were some issues during the scraping process. CAPTCHA challenges appeared when too much data was scraped in a short amount of time. Also, the same crawler code sometimes returned records with missing fields. To avoid triggering CAPTCHA, the code was adjusted to request fewer results in each crawl to prevent detection and blocking. Crawls were rerun on the same pages more than once to ensure the completeness of the data obtained.
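
The two workarounds can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's code: the setting names are real Scrapy settings, but the values are made up, and `merge_crawls` with its `url` key is a hypothetical helper for combining repeated crawls.

```python
# Real Scrapy setting names; the values are illustrative guesses,
# not the ones used in the project.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                  # seconds between requests
    "AUTOTHROTTLE_ENABLED": True,         # adapt delay to server latency
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time
    "CLOSESPIDER_ITEMCOUNT": 50,          # stop each crawl after 50 items
}


def merge_crawls(crawls, key="url"):
    """Merge records from repeated crawls of the same pages.

    Records are matched on `key` (a hypothetical field name). A field is
    only filled while it is still missing or empty, so the first
    non-empty value seen across the runs is kept.
    """
    merged = {}
    for crawl in crawls:
        for record in crawl:
            target = merged.setdefault(record[key], {})
            for field, value in record.items():
                if target.get(field) in (None, ""):
                    target[field] = value
    return list(merged.values())


# Made-up example: the first run is missing a zip code.
run1 = [{"url": "/biz/a", "name": "Shop A", "zip_code": None}]
run2 = [
    {"url": "/biz/a", "name": "Shop A", "zip_code": "10013"},
    {"url": "/biz/b", "name": "Shop B", "zip_code": "77036"},
]
merged = merge_crawls([run1, run2])
print(merged)
```

In a Scrapy spider, a dictionary like `THROTTLE_SETTINGS` could be assigned to the spider's `custom_settings` attribute.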

3. Data Exploration

Growing Trend - Overall

As seen in the graph above, the number of businesses grew for 15 years, until the COVID-19 pandemic in 2020.

Growing Trend - by City

At the city level, a similar growth trend is observed.

City Statistics Comparison

To determine if certain cities are doing better than others, the ratings and number of reviews by city are shown. Overall, the ratings are quite consistent (high 3s to low 4s). However, San Jose and Los Angeles have significantly more reviews per business location than other cities, which suggests their stores may attract more visitors. On the flip side, San Antonio stores may see fewer customers than those in the other big cities, possibly because the city gets fewer visitors.
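
A per-city comparison like this one can be computed with a simple group-by. The sketch below uses only the standard library; the field names (`city`, `rating`, `review_count`) and the sample numbers are hypothetical and are not taken from the scraped data.

```python
from collections import defaultdict
from statistics import mean


def city_stats(stores):
    """Return store count, average rating, and reviews per store by city."""
    by_city = defaultdict(list)
    for store in stores:
        by_city[store["city"]].append(store)
    return {
        city: {
            "stores": len(rows),
            "avg_rating": round(mean(r["rating"] for r in rows), 2),
            "reviews_per_store": round(mean(r["review_count"] for r in rows), 1),
        }
        for city, rows in by_city.items()
    }


# Made-up sample rows, for illustration only.
stores = [
    {"city": "San Jose", "rating": 4.0, "review_count": 300},
    {"city": "San Jose", "rating": 4.5, "review_count": 500},
    {"city": "San Antonio", "rating": 4.0, "review_count": 60},
]
stats = city_stats(stores)
print(stats)
```

Reviews per store is used here as a rough proxy for foot traffic, which is the same assumption the comparison above relies on.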

Zip Codes with the Most Stores

A closer look at the zip codes with the most bubble tea stores shows that these are business areas that tend to have a larger Asian presence:

  • 77036: Houston's Chinatown
  • 95035: Milpitas
  • 95122: East San Jose
  • 10013: Lower Manhattan
  • 95014: Cupertino
  • 77072: Houston area with a lot of Asian restaurants
  • 10003: Western part of the East Village
  • 75075: Plano (Dallas area), with a lot of Asian restaurants
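
A ranking like the one above can be produced by counting stores per zip code. This is a minimal sketch with a hypothetical `zip_code` field and made-up records; zip codes are padded to five digits in case they were parsed as integers and lost a leading zero.

```python
from collections import Counter


def top_zip_codes(stores, n=8):
    """Return the n zip codes with the most stores as (zip, count) pairs."""
    counts = Counter(
        str(s["zip_code"]).zfill(5) for s in stores if s.get("zip_code")
    )
    return counts.most_common(n)


# Made-up sample records, for illustration only.
stores = [
    {"zip_code": "77036"},
    {"zip_code": "77036"},
    {"zip_code": "95035"},
    {"zip_code": "10013"},
    {"zip_code": "77036"},
    {"zip_code": "95035"},
    {"zip_code": None},  # records without a zip code are skipped
]
ranking = top_zip_codes(stores, n=2)
print(ranking)
```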

Biggest Players Account for 20% of Stores in the Study

As seen in the graph, the top 20 chains currently account for 20% of the total number of stores within the scope of this study.
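
One way to compute such a share is to group stores by a normalized name and sum the counts of the largest groups. The sketch below assumes a chain can be identified by its lowercased store name, which is a simplification; the sample names and counts are made up.

```python
from collections import Counter


def top_chain_share(store_names, top_n=20):
    """Return the top_n names by store count and their share of all stores."""
    if not store_names:
        return [], 0.0
    counts = Counter(name.strip().lower() for name in store_names)
    top = counts.most_common(top_n)
    share = sum(count for _, count in top) / len(store_names)
    return top, share


# Made-up sample: 10 stores, two chains with two locations each.
names = [
    "Kung Fu Tea", "kung fu tea", "Gong Cha", "Sharetea", "Local Shop A",
    "Local Shop B", "Gong Cha", "Local Shop C", "Local Shop D", "Local Shop E",
]
top, share = top_chain_share(names, top_n=2)
print(top, share)
```

In practice, name matching would need more care (for example, "Gong Cha" versus "Gongcha"), so the grouping rule here should be read as a placeholder.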

Top Bubble Tea Chain Expansions

“New business” counts are estimated from each business's first review date on Yelp.
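
This estimate can be sketched as taking the earliest review date per business and counting businesses by that year. The input shape (a mapping from a business identifier to its review dates) and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def new_businesses_per_year(reviews_by_business):
    """Count businesses by the year of their earliest review.

    The first review date is used as a proxy for the opening date,
    so businesses with no reviews are skipped.
    """
    first_years = (
        min(dates).year for dates in reviews_by_business.values() if dates
    )
    return dict(sorted(Counter(first_years).items()))


# Made-up sample data, for illustration only.
reviews = {
    "shop-a": [date(2019, 3, 2), date(2018, 5, 1)],
    "shop-b": [date(2019, 1, 15)],
    "shop-c": [date(2020, 2, 1), date(2019, 7, 7)],
    "shop-d": [],
}
per_year = new_businesses_per_year(reviews)
print(per_year)
```

Because a store can open well before anyone reviews it, this proxy may place some openings later than they actually happened.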

4. Conclusion 

To answer the questions in the introduction: 

  1. Is the industry actually growing?
    • Overall, yes, but there was a sharp dip in 2020
  2. Who are the biggest players?
    • Kung Fu Tea
    • Gong Cha
    • Sharetea
  3. Where are the locations worth investing in?
    • This cannot be concluded yet, so next steps need to be considered

5. Future Work

The following are potential areas of future work to improve the project:

  • Expand city locations
    • Expand the scope from the top 10 to the top 50 cities in the US so other potential locations may become more obvious
  • What contributes to stores being popular?
    • High number of reviews at certain locations
    • Demographics of the residents and visitors to that area, including age/race/income
    • Nearby business/facilities
  • Web-scraping process improvement
    • Using additional services to rotate IP addresses periodically
  • Change of data gathering method
    • Instead of web scraping, utilize APIs provided by Yelp
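
For the last point, a request to Yelp's business search API might be built as in the sketch below. It assumes the Yelp Fusion endpoint `https://api.yelp.com/v3/businesses/search` with bearer-token authentication, the category alias `bubbletea`, and a page size of 50; these details should be checked against Yelp's current documentation. `build_search_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed Yelp Fusion search endpoint; verify against Yelp's docs.
YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key, location, offset=0, limit=50):
    """Build (but do not send) a paginated business search request."""
    params = {
        "categories": "bubbletea",  # assumed category alias
        "location": location,
        "limit": limit,
        "offset": offset,
    }
    return Request(
        f"{YELP_SEARCH_URL}?{urlencode(params)}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


# "YOUR_API_KEY" is a placeholder, not a real credential.
request = build_search_request("YOUR_API_KEY", "San Jose, CA")
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` and increasing `offset` would page through results without the CAPTCHA issues described earlier, subject to the API's own rate limits.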

6. References

  • Moreh, Jack. “Financial Graph - Capital Markets.” stockvault.net.
  • Fortune Business Insights. “The global bubble tea market size was valued at $2.29 billion in 2022 & is projected to grow from $2.46 billion in 2023 to $4.08 billion by 2030.”
  • US Census Bureau. “Metropolitan and Micropolitan Statistical Areas Population Totals: 2020-2022.” census.gov.
  • Top photo by Katie Rainbow 🏳️‍🌈 on Unsplash

About Author

Brian Kuo

A consultant and data engineering professional with 6+ years of experience in business intelligence, warehousing solutions, ETL, and project management. Enjoys collaboration in team projects to support multi-disciplinary stakeholders and generate valuable results that align with business goals.