ClassPass Web Scraping Project

George Alster
Posted on Apr 28, 2019

Motivation

Over the past decade, multiple scientific articles have been published detailing the mental health benefits of consistent physical exercise. Leading a desk-focused lifestyle, I have come to appreciate how important it is to make time for physical activity. This led me to ClassPass.

ClassPass is a global company that provides a monthly subscription giving access to gym classes within a specific area. This project was completed during my three-month bootcamp course in New York City, so the data I scraped covers all of the gym classes listed in the New York area for one particular day.

There were three main questions that I intended to answer with this project.

  • Where should you work out?
  • What should you work out?
  • How much should it cost you?

These questions made up the bulk of my analysis, and you can find the answers below.

The Scraping

I used Selenium to scrape this website because the URL does not change when clicking through to the next page of classes. There were two layers to the scraping. The first was the log-in page (thankfully there was a two-week free trial at the time) and the second consisted of the multiple pages listing the available gym classes. The final product of the scraping was a csv file with nine variables and over 2,000 observations. Please refer to the table below for a sample of the dataframe layout.

Table 1: Sample layout of csv file produced from the scraping
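
As a rough sketch of the second layer, the code below separates the "read this page, then click next" loop from the browser calls. It is not the code used in the project: the CSS selectors, the single `listing` field and the `make_selenium_callbacks` helper are hypothetical placeholders, and it assumes the driver has already been logged in.

```python
import csv
from typing import Callable, Dict, List

Row = Dict[str, str]


def scrape_pages(read_rows: Callable[[], List[Row]],
                 go_next: Callable[[], bool],
                 max_pages: int = 500) -> List[Row]:
    """Collect rows page by page until go_next() reports there is no next page."""
    rows: List[Row] = []
    for _ in range(max_pages):  # hard cap so a broken "next" button cannot loop forever
        rows.extend(read_rows())
        if not go_next():
            break
    return rows


def write_csv(rows: List[Row], path: str) -> None:
    """Write the scraped rows to a csv file, one column per dictionary key."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def make_selenium_callbacks(driver):
    """Build read_rows/go_next for a logged-in Selenium driver.

    Both selectors below are made up for illustration.
    """
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    def read_rows() -> List[Row]:
        cards = driver.find_elements(By.CSS_SELECTOR, ".class-card")  # hypothetical selector
        return [{"listing": card.text} for card in cards]

    def go_next() -> bool:
        try:
            driver.find_element(By.CSS_SELECTOR, "button.next").click()  # hypothetical selector
        except NoSuchElementException:
            return False
        return True

    return read_rows, go_next
```

With a logged-in driver, this would be called as `rows = scrape_pages(*make_selenium_callbacks(driver))` followed by `write_csv(rows, "classes.csv")`. Because the URL never changes, a real run would most likely also need an explicit wait after each click so that the next page of classes has loaded before it is read.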

Where should you work out?

To determine where in the city one should choose to work out, I first looked at which areas have the highest concentration of gyms. The notable 'winners' of this analysis were the Upper East Side and Upper West Side, closely followed by Greenwich Village and Union Square. The general concentration of gyms across the city can be seen in the following word cloud.

Figure 1: Word cloud showing the neighborhoods with the highest concentration of gyms

Following on from this, I wanted to understand how price and popularity compared across these top locations. Figure 2 shows a two-column horizontal bar plot of this information in decreasing order of price. The two most notable takeaways are that, on average, the gyms in Williamsburg, Brooklyn are the most expensive and the gyms in NoHo are the most popular. I found both observations surprising, as Manhattan is notoriously more expensive than Brooklyn and (as far as I was aware) locations such as SoHo and Chelsea are better known for their gyms than NoHo.

Figure 2: Graph to show average pricing and number of reviews for each location with over 40 gyms
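
The aggregation behind a plot like Figure 2 can be sketched as follows. This is a minimal illustration rather than the project's code: the field names `neighborhood`, `price` and `reviews` are assumed, and the threshold is applied to the number of scraped rows per neighborhood, which may not match exactly how gyms were counted in the original analysis.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_neighborhood(classes, min_count=40):
    """Average price and review count per neighborhood, most expensive first.

    Only neighborhoods with more than min_count rows are kept.
    """
    grouped = defaultdict(list)
    for row in classes:
        grouped[row["neighborhood"]].append(row)

    summary = []
    for name, group in grouped.items():
        if len(group) > min_count:
            summary.append({
                "neighborhood": name,
                "count": len(group),
                "avg_price": mean([r["price"] for r in group]),
                "avg_reviews": mean([r["reviews"] for r in group]),
            })
    # Decreasing order of average price, matching the ordering described for Figure 2
    return sorted(summary, key=lambda r: r["avg_price"], reverse=True)
```

Each entry of the returned list corresponds to one pair of bars: one for the average price and one for the average number of reviews.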

When deciding where to work out, one must also look at the particular venues themselves as well as the locations. Table 2 below lists the ten most commonly reviewed venues in New York. Several venues appear in this list more than once. This indicates that it's not all about where you are: some particular venues are especially highly rated. We can also see that these top ten venues are above average in terms of reviews, rating and pricing. However, the majority of them are shorter than average in duration. This is unsurprising in the competitive, fast-paced business world that is Manhattan.

What should you work out?

Once you have settled on a location and a venue, you might also want to consider what type of workout to do. This analysis focused on five main categories: Boxing, Cycling, Pilates, Dance and Yoga. To answer this question, I primarily looked at the average pricing, popularity (number of reviews) and duration for each genre. The results can be found in Figures 3 and 4. Figure 3 displays some key observations. The shortest class (cycling) is also the most popular, which links back to the earlier comment about New Yorkers' busy lifestyles and lack of free time. Another key takeaway is that Pilates is expensive: on average, 25% more expensive than the next most expensive genre. Figure 4 provides a greater degree of granularity, with box plots that highlight anomalous behaviour within the different class genres. In particular, some yoga classes have very high popularity, on par with cycling. These graphs also confirm the greater duration of dance and yoga classes and the high price of Pilates.

Figure 3: Bar plots to compare the five main genres for each key variable
Figure 4: Box plots to compare the five main genres for each key variable
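
Two of the calculations behind these comparisons can be sketched in a few lines: the percentage by which one genre's average price exceeds the next, and the points a box plot would typically draw as outliers. The 1.5 × IQR rule used here is the common box plot convention and is an assumption on my part; the original plots may have used different settings.

```python
from statistics import quantiles


def percent_premium(top: float, runner_up: float) -> float:
    """Percentage by which `top` exceeds `runner_up`."""
    return 100 * (top - runner_up) / runner_up


def box_plot_outliers(values, whisker=1.5):
    """Values lying more than whisker * IQR beyond the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]
```

For example, with made-up average prices of 15 and 12 credits, `percent_premium(15, 12)` gives a 25% premium. Applying `box_plot_outliers` to the review counts of a single genre would pick out the unusually popular classes, such as the yoga classes noted above.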

How much should it cost you?

The final question I wanted to answer was how much one should be paying for these classes. To answer it, I analysed how the price correlated with the rating of the class, as shown in Figure 5. The graph shows a small number of classes in the lower price range that nonetheless have the highest ratings. The color bar on the side of the graph shows that these highly rated, inexpensive classes have few reviews. The likely cause is that these classes have a small number of loyal customers who rate them very highly. However, because these gyms are not frequently visited and reviewed, their ratings are less reliable. The majority of reliable ratings lie in the 4.8 to 4.9 range, with the most frequently reviewed classes in the 10-15 price range. Note that this price is in credits (an $80 one-month subscription provides 45 credits).

To further decide how much to spend on a class, I looked at the relationship between popularity and price (Figure 6). This pair has the closest to a linear correlation of all the variables in the dataset. There is a general trend of increasing popularity with price; however, the majority of highly popular classes again sit around the 10-15 credit mark. This is therefore the price range the analysis suggests for ensuring quality at a reasonable price.

Figure 5: Scatter plot to show how price varies with class rating
Figure 6: Relationship between popularity and price of the class
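
To put the credit prices into dollars, and to show one way the strength of a linear relationship can be measured, here is a short sketch. The conversion uses only the $80 for 45 credits figure quoted above. The Pearson coefficient is a standard measure of linear correlation, but I am assuming it here for illustration; it is not necessarily the measure used in the original analysis.

```python
from math import sqrt

SUBSCRIPTION_USD = 80      # one-month subscription price quoted in the post
SUBSCRIPTION_CREDITS = 45  # credits included in that subscription


def credits_to_usd(credits: float) -> float:
    """Convert a class price in credits to dollars at the subscription rate."""
    return credits * SUBSCRIPTION_USD / SUBSCRIPTION_CREDITS


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


low, high = credits_to_usd(10), credits_to_usd(15)
print(f"{low:.2f} to {high:.2f} USD")  # prints "17.78 to 26.67 USD"
```

At this rate one credit is worth roughly $1.78, so the suggested 10-15 credit range corresponds to about $18 to $27 per class.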

Conclusions and future work

This analysis proved fruitful in answering the three main questions set out at the start of this blog. However, there is still a vast amount of further research that can be done with this website. As mentioned earlier, the data scraped for this project covered one particular day. If the data were scraped for each day of the week, classes on different days could be compared. For example, are weekend classes more expensive than weekday classes?

ClassPass is a global website containing classes for many global gym brands. Is Peloton more expensive in Los Angeles than in New York? Is Barry's Bootcamp more popular in London than in New York? Given more time, this would be an interesting analysis to conduct.

Finally, throughout this blog the questions were answered from the consumer's perspective. However, analysis of this data could prove extremely beneficial to investors or companies in the industry. It would highlight the most profitable and competitive locations to open gyms of various types: valuable information for someone planning to open New York's next best boxing gym!

If you would like more detail on how the project was completed, please visit my GitHub page!

 

https://github.com/Alster96/gsa_project1.git

 

About Author

George Alster

George graduated with First Class Honours in his Chemical Engineering (MEng) degree at University College London (UCL) in 2018. Alongside completing groundbreaking research in the Electrochemical Innovation Lab at his university, George also has experience in the private...