Exploring The Bubble Tea Industry By Scraping Yelp.com
Bubble tea originated in Taiwan and has become increasingly popular around the globe since the early 2000s. The market continues to grow rapidly to this day. To gain insights into this market in the United States, this project scraped Yelp.com for related business establishments in the 10 most populous cities in the country.
Please see my GitHub repo for details on the code.
1. Introduction
The objective of this project is to explore how the bubble tea industry has grown in major US cities over the years. This project aims to answer the following questions:
- Is the industry actually growing?
- Who are the biggest players?
- Where are the locations worth investing in?
2. Data
Two data sets were used:
- Bubble tea businesses: obtained by scraping Yelp.com
- Most populous cities: obtained from the metropolitan statistical area (MSA) data provided by the US Census Bureau
All qualifying bubble tea search results on Yelp were scraped with Scrapy. During scraping, a listing counted as a valid bubble tea business if it met the following criteria:
- Has the label “Bubble Tea”
- Has at least 1 review
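The two criteria above can be expressed as a small filter. This is a minimal sketch rather than the project's actual code: the record layout (a dict with `categories` and `review_count` keys) and the function name `is_valid_bubble_tea_business` are assumptions made for illustration.

```python
def is_valid_bubble_tea_business(record):
    """Return True if a record has the "Bubble Tea" label and at least 1 review."""
    # Assumed field names; the real scraper output may use different keys.
    labels = {label.strip().lower() for label in (record.get("categories") or [])}
    review_count = record.get("review_count") or 0
    return "bubble tea" in labels and review_count >= 1


# Made-up records, for illustration only.
scraped = [
    {"name": "Tea Shop A", "categories": ["Bubble Tea", "Coffee & Tea"], "review_count": 42},
    {"name": "Cafe B", "categories": ["Coffee & Tea"], "review_count": 10},
    {"name": "Tea Shop C", "categories": ["Bubble Tea"], "review_count": 0},
]
valid = [r for r in scraped if is_valid_bubble_tea_business(r)]
print([r["name"] for r in valid])
```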
The top 10 cities by population according to the US Census Bureau are: New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San Diego, Dallas, and San Jose.
After the scraped data was cleaned, about 1,200 bubble tea stores were identified within the selected cities.
As a side note, there were some issues during the scraping process. CAPTCHA challenges appeared when too much data was scraped in a short amount of time. Also, the same crawler code sometimes returned records with missing fields. To avoid triggering CAPTCHA, the code was adjusted to produce fewer results in each crawl to prevent detection and blocking. Crawls were rerun on the same pages more than once to ensure the completeness of the data obtained.
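Both workarounds can be sketched in a few lines. The settings below are standard Scrapy setting names, but the values are illustrative guesses and not the ones used in this project. The `merge_crawls` helper and its `url` key are hypothetical: it shows one way to combine repeated crawls so that a field missing in one run is filled in from another.

```python
# Illustrative Scrapy settings that slow a crawl down (values are assumptions).
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                  # seconds to wait between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the wait around DOWNLOAD_DELAY
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "CLOSESPIDER_ITEMCOUNT": 100,         # end each crawl after about 100 items
}


def merge_crawls(*crawls, key="url"):
    """Merge records from repeated crawls, filling fields missing in earlier runs."""
    merged = {}
    for crawl in crawls:
        for record in crawl:
            target = merged.setdefault(record[key], {})
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if target.get(field) in (None, "") and value not in (None, ""):
                    target[field] = value
    return list(merged.values())


# Made-up example: the first crawl is missing a rating that the rerun recovers.
first = [
    {"url": "/biz/a", "name": "Tea A", "rating": None},
    {"url": "/biz/b", "name": "Tea B", "rating": 4.0},
]
second = [{"url": "/biz/a", "name": "Tea A", "rating": 4.5}]
merged = merge_crawls(first, second)
print(merged)
```

In Scrapy, a dict like `THROTTLE_SETTINGS` could be assigned to a spider's `custom_settings` attribute or placed in the project's `settings.py`.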
3. Data Exploration
Growing Trend - Overall
As seen in the graph above, the number of businesses grew for 15 years until the COVID-19 pandemic in 2020.
Growing Trend - by City
Digging into the city level, a similar growth trend is observed.
City Statistics Comparison
To determine if certain cities are doing better than others, the ratings and number of reviews by city are shown. Overall, the ratings are quite consistent (high 3s to low 4s). However, San Jose and Los Angeles have significantly more reviews per business location than other cities, which suggests that stores there may get more visitors. On the flip side, San Antonio has fewer reviews per location, which may indicate fewer customers than in the other big cities.
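A per-city comparison like this comes down to grouping stores by city and averaging. The sketch below assumes each store is a dict with `city`, `rating`, and `review_count` keys (hypothetical field names), and the sample numbers are made up rather than taken from the scraped data.

```python
from collections import defaultdict
from statistics import mean


def city_statistics(stores):
    """Return store count, average rating, and reviews per store for each city."""
    by_city = defaultdict(list)
    for store in stores:
        by_city[store["city"]].append(store)
    return {
        city: {
            "stores": len(group),
            "avg_rating": round(mean(s["rating"] for s in group), 2),
            "reviews_per_store": round(mean(s["review_count"] for s in group), 1),
        }
        for city, group in by_city.items()
    }


# Made-up sample rows, for illustration only.
sample = [
    {"city": "San Jose", "rating": 4.0, "review_count": 300},
    {"city": "San Jose", "rating": 3.5, "review_count": 500},
    {"city": "San Antonio", "rating": 4.0, "review_count": 50},
]
stats = city_statistics(sample)
print(stats)
```

Reviews per store is only a rough proxy for foot traffic, since Yelp usage itself varies from city to city.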
Zip Codes with the Most Stores
Looking more closely at the zip codes with the most bubble tea stores, it can be seen that these are business areas that tend to have a larger Asian presence:
- 77036: Houston's Chinatown
- 95035: Milpitas
- 95122: East San Jose
- 10013: Lower Manhattan
- 95014: Cupertino
- 77072: Houston area with a lot of Asian restaurants
- 10003: West side of the East Village
- 75075: Plano area near Dallas with a lot of Asian restaurants
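A ranking like the one above can be produced by counting stores per zip code. This is a minimal sketch that assumes each store record carries a `zip_code` string (a hypothetical field name). Zip codes are kept as strings so that leading zeros survive.

```python
from collections import Counter


def top_zip_codes(stores, n=8):
    """Return the n zip codes with the most stores as (zip_code, count) pairs."""
    counts = Counter(s["zip_code"] for s in stores if s.get("zip_code"))
    return counts.most_common(n)


# Made-up sample rows, for illustration only.
sample_stores = [
    {"name": "A", "zip_code": "77036"},
    {"name": "B", "zip_code": "77036"},
    {"name": "C", "zip_code": "95035"},
    {"name": "D", "zip_code": "10013"},
    {"name": "E", "zip_code": "77036"},
    {"name": "F", "zip_code": "95035"},
    {"name": "G", "zip_code": None},  # records without a zip code are skipped
]
ranking = top_zip_codes(sample_stores, n=2)
print(ranking)
```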
Biggest Players Account for 20% of Stores in the Nation
As seen in the graph, the top 20 chains currently account for 20% of the total number of stores within the scope of this study.
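One way to compute such a share is to group stores by a normalized name and sum the largest groups. The sketch below rests on two assumptions that may differ from the project's approach: names are matched after stripping case, spaces, and punctuation, and any name with more than one location counts as a chain.

```python
import re
from collections import Counter


def normalize_name(name):
    """Lowercase and drop non-alphanumerics so 'Kung Fu Tea' matches 'KungFu Tea'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def top_chain_share(store_names, top_n=20):
    """Return the top_n chains and their combined share of all stores."""
    if not store_names:
        return [], 0.0
    counts = Counter(normalize_name(name) for name in store_names)
    # Assumption: a name with more than one location is treated as a chain.
    chains = [(name, count) for name, count in counts.most_common() if count > 1]
    top = chains[:top_n]
    share = sum(count for _, count in top) / len(store_names)
    return top, share


# Made-up store names, for illustration only.
names = [
    "Kung Fu Tea", "KungFu Tea", "Gong Cha", "Gong cha", "Sharetea",
    "Local Tea House", "Boba Spot", "Tea Time", "Leaf Bar", "Cha Cha",
]
top, share = top_chain_share(names, top_n=2)
print(top, share)
```

Name normalization this aggressive can merge unrelated businesses that happen to share a name, so the result is worth spot-checking by hand.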
Top Bubble Tea Chain Expansions
“New business” numbers are estimated from each business’s first review date on Yelp.
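The first-review-date estimate can be sketched as a simple count by year. The function name and the sample dates below are made up for illustration. The running total also ignores store closures, so it approximates openings rather than the number of stores in operation.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def new_businesses_per_year(first_review_dates):
    """Count businesses by the year of their first review (a proxy for opening year)."""
    counts = Counter(d.year for d in first_review_dates)
    return dict(sorted(counts.items()))


# Made-up first review dates, one per business.
first_reviews = [date(2018, 3, 1), date(2018, 7, 9), date(2019, 1, 15), date(2020, 6, 2)]
per_year = new_businesses_per_year(first_reviews)
cumulative = dict(zip(per_year, accumulate(per_year.values())))
print(per_year)    # {2018: 2, 2019: 1, 2020: 1}
print(cumulative)  # {2018: 2, 2019: 3, 2020: 4}
```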
4. Conclusion
To answer the questions in the introduction:
- Is the industry actually growing?
  - Overall, yes, but there was a sharp dip in 2020
- Who are the biggest players?
  - Kung Fu Tea
  - Gong Cha
  - Sharetea
- Where are the best locations to invest in?
  - Cannot conclude as of now, which means we have to consider next steps
5. Future Work
The following are potential directions for improving the project:
- Expand city locations
  - Expand the scope from the top 10 to the top 50 cities in the US so other potential locations may become more obvious
- What contributes to stores being popular?
  - High number of reviews at certain locations
  - Demographics of the residents and visitors to that area, including age/race/income
  - Nearby businesses/facilities
- Web-scraping process improvement
  - Use additional services to swap IP addresses periodically
- Change of data gathering method
  - Instead of web scraping, utilize APIs provided by Yelp
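As a rough idea of what the API route could look like, here is a sketch against the Yelp Fusion business search endpoint. It is untested against the live service. The `bubbletea` category alias, the page size of 50, and the response fields read in `parse_businesses` are assumptions to verify against Yelp's documentation, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key, location, offset=0, limit=50):
    """Build (but do not send) a search request for bubble tea businesses."""
    params = urllib.parse.urlencode({
        "categories": "bubbletea",  # assumed category alias
        "location": location,
        "limit": limit,             # assumed maximum page size
        "offset": offset,
    })
    return urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def parse_businesses(payload):
    """Keep only the fields this project uses from a search response."""
    return [
        {
            "id": biz["id"],
            "name": biz["name"],
            "rating": biz.get("rating"),
            "review_count": biz.get("review_count", 0),
            "zip_code": biz.get("location", {}).get("zip_code"),
        }
        for biz in payload.get("businesses", [])
    ]


def search(api_key, location, offset=0):
    """Send one search request; needs a valid API key and network access."""
    request = build_search_request(api_key, location, offset)
    with urllib.request.urlopen(request) as response:
        return parse_businesses(json.load(response))
```

Calling `search("YOUR_API_KEY", "San Jose, CA")` would return one page of results, and later pages can be requested by increasing `offset`.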
6. References
- Moreh, Jack. “Financial Graph - Capital Markets.” stockvault.net.
- Fortune Business Insights. “The global bubble tea market size was valued at $2.29 billion in 2022 & is projected to grow from $2.46 billion in 2023 to $4.08 billion by 2030.”
- US Census Bureau. “Metropolitan and Micropolitan Statistical Areas Population Totals: 2020-2022.” census.gov.
- Top photo by Katie Rainbow 🏳️‍🌈 on Unsplash