ClassPass Web Scraping Project

Motivation
Over the past decade, multiple scientific articles have been published detailing the mental benefits of consistent physical exercise. As someone who leads a desk-focused lifestyle, I have come to appreciate the importance of taking the time for physical activity. This led me to ClassPass.
ClassPass is a global company that offers a monthly subscription to all gym classes within a specific area. I completed this project during my three-month bootcamp course in New York City, so the data I scraped covers all of the gym classes in the New York area for one particular day.
There were three main questions that I intended to answer with this project.
- Where should you work out?
- What should you work out?
- How much should it cost you?
These questions made up the bulk of my analysis, and you can find the answers below.
The Scraping
I used Selenium to scrape this website because the URL does not change when you click through to the next page of classes, so the pages cannot be requested directly by URL and the clicks have to be automated in a browser. The scraping had two layers: first the login page (thankfully there was a two-week free trial at the time), then the multiple pages listing the available gym classes. The final product of the scraping was a CSV file with nine different variables and over 2,000 observations. Please refer to the table below for a sample of the dataframe layout.
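As a rough illustration of the two-layer approach (not the code from this project), the sketch below logs in, then reads each listing page and clicks "next" until no further page is available. The login URL, every CSS selector, the single `listing_text` column and the function names are placeholders invented for this example; the real page structure and the nine scraped variables are not described in this post.

```python
import csv
from typing import Callable, Dict, List

Row = Dict[str, str]


def collect_pages(read_page: Callable[[], List[Row]],
                  go_next: Callable[[], bool],
                  max_pages: int = 500) -> List[Row]:
    """Read the current page, then keep advancing until go_next() returns False."""
    rows: List[Row] = []
    for _ in range(max_pages):  # hard cap so a stuck "next" button cannot loop forever
        rows.extend(read_page())
        if not go_next():
            break
    return rows


def write_csv(rows: List[Row], path: str) -> None:
    """Write the collected rows to a CSV file, one column per dictionary key."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def scrape(email: str, password: str, out_path: str = "classes.csv") -> List[Row]:
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        # Layer 1: log in. The URL and every selector below are placeholders.
        driver.get("https://example.com/login")
        driver.find_element(By.CSS_SELECTOR, "input[type='email']").send_keys(email)
        driver.find_element(By.CSS_SELECTOR, "input[type='password']").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".class-card"))
        )

        # Layer 2: read each listing page, then click "next" (the URL never changes).
        def read_page() -> List[Row]:
            cards = driver.find_elements(By.CSS_SELECTOR, ".class-card")
            return [{"listing_text": card.text} for card in cards]

        def go_next() -> bool:
            try:
                button = driver.find_element(By.CSS_SELECTOR, "button.next-page")
            except NoSuchElementException:
                return False
            if not button.is_enabled():
                return False
            button.click()
            # A real run would need an explicit wait here until the new page has rendered.
            return True

        rows = collect_pages(read_page, go_next)
    finally:
        driver.quit()

    write_csv(rows, out_path)
    return rows
```

Keeping the pagination loop in `collect_pages`, separate from the browser calls, means the loop can be checked with fake pages before any browser is launched.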

Where should you work out?
To determine where in the city one should work out, I first looked at which areas have the highest concentration of gyms. The notable 'winners' of this analysis were the Upper East Side and Upper West Side, closely followed by Greenwich Village and Union Square. The general concentration of gyms across the city can be seen in the following word cloud.
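One way to compute such a concentration is to count gyms per neighborhood. The sketch below is only an illustration: the rows are made up, the column names `venue` and `neighborhood` are assumptions, and it counts distinct venues, whereas this post does not say whether the original analysis counted venues or individual classes.

```python
from collections import Counter

# Made-up rows standing in for the scraped CSV; the column names
# ("venue", "neighborhood") are assumptions, not the real headers.
rows = [
    {"venue": "Studio A", "neighborhood": "Area 1"},
    {"venue": "Studio A", "neighborhood": "Area 1"},  # same venue, second class
    {"venue": "Studio B", "neighborhood": "Area 1"},
    {"venue": "Studio C", "neighborhood": "Area 2"},
]


def gyms_per_neighborhood(rows):
    """Count distinct venues per neighborhood, so a venue with many classes counts once."""
    unique = {(r["venue"], r["neighborhood"]) for r in rows}
    return Counter(neighborhood for _, neighborhood in unique)


counts = gyms_per_neighborhood(rows)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Counts of this kind can serve as the frequencies behind a word cloud, where each neighborhood name is sized by how many gyms it has.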

Following on from this, I wanted to understand how price and popularity compared across these top locations. Figure 2 shows a two-column horizontal bar plot of this information, in decreasing order of price. The two most notable takeaways from this graph are that, on average, the gyms in Williamsburg, Brooklyn are the most expensive and the gyms in NoHo are the most popular. I found both observations surprising, as Manhattan is notoriously more expensive than Brooklyn and (as far as I was aware) locations such as SoHo and Chelsea are better known for their gyms than NoHo.
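The numbers behind a chart like this are per-neighborhood averages sorted by price. A minimal sketch follows, assuming columns named `neighborhood`, `credits` (price) and `reviews` (popularity); the rows are invented for illustration and are not the scraped data.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; the column names are assumptions
# and the numbers are not from the scraped data.
rows = [
    {"neighborhood": "Area A", "credits": 10, "reviews": 400},
    {"neighborhood": "Area A", "credits": 14, "reviews": 200},
    {"neighborhood": "Area B", "credits": 16, "reviews": 100},
    {"neighborhood": "Area C", "credits": 8, "reviews": 900},
]


def summarize_by_neighborhood(rows):
    """Average price and review count per neighborhood, most expensive first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["neighborhood"]].append(row)
    summary = [
        {
            "neighborhood": name,
            "avg_credits": mean(r["credits"] for r in group),
            "avg_reviews": mean(r["reviews"] for r in group),
        }
        for name, group in groups.items()
    ]
    return sorted(summary, key=lambda s: s["avg_credits"], reverse=True)


summary = summarize_by_neighborhood(rows)
for s in summary:
    print(f'{s["neighborhood"]:<8} price={s["avg_credits"]:.1f} reviews={s["avg_reviews"]:.0f}')
```

Sorting on price alone is what lets a reader spot cases where the most expensive area is not the most popular one.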

When deciding where to work out, one must look at the particular venues as well as the locations. Table 2 below lists the ten most frequently reviewed venues in New York. Several venues appear in this list more than once. This indicates that it's not all about where you are: some particular venues are especially highly rated. These top ten venues are also above average in terms of reviews, rating and price. However, classes at the majority of them are shorter than average. This is unsurprising in the competitive, fast-paced business world that is Manhattan.
What should you work out?
Once you have settled on a location and a venue, you might also want to consider what type of workout to do. This analysis focused on five main categories: Boxing, Cycling, Pilates, Dance and Yoga. To answer this question, I primarily looked at the average price, popularity (number of reviews) and duration for each genre. The results can be found in Figures 3 and 4. Fig. 3 displays some key observations. The shortest class (cycling) is also the most popular, which links back to the earlier comment about New Yorkers' busy lifestyles and lack of free time. Another key takeaway is that Pilates is expensive, on average 25% more than the next most expensive class. Fig. 4 provides greater granularity, with box plots that highlight anomalous behaviour in the different class genres. In particular, some yoga classes have very high popularity, on par with cycling. These graphs also make clearer the longer duration of dance and yoga and the high price of Pilates.
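Two of the calculations mentioned above can be sketched in a few lines: the relative premium of the most expensive genre over the runner-up, and the points a box plot would draw as outliers. The genre names and numbers below are made up, and the 1.5 x IQR whisker rule is an assumption (it is a common plotting default, but this post does not state which rule was used).

```python
from statistics import quantiles


def premium_over_runner_up(avg_price_by_genre):
    """Return (top genre, runner-up, relative premium of top over runner-up)."""
    ranked = sorted(avg_price_by_genre.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_price), (second, second_price) = ranked[0], ranked[1]
    return top, second, (top_price - second_price) / second_price


def boxplot_outliers(values, k=1.5):
    """Values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up average prices (in credits) per genre, for illustration only.
avg_price = {"Genre A": 10.0, "Genre B": 8.0, "Genre C": 6.0}
top, second, premium = premium_over_runner_up(avg_price)
print(f"{top} costs {premium:.0%} more than {second}")  # Genre A costs 25% more than Genre B

# Made-up review counts with one unusually popular class.
review_counts = [10, 11, 12, 13, 14, 100]
print(boxplot_outliers(review_counts))  # [100]
```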


How much should it cost you?
The final question I wanted to answer was how much one should pay for these classes. I approached it by analysing how price correlates with class rating, as shown in Figure 5. The graph shows a small number of classes in the lower price range that nonetheless have the highest ratings. The color bar at the side of the graph shows that these highly rated, inexpensive classes have few reviews. The likely cause is that these classes have a small number of loyal customers who rate them very highly. However, because these gyms are not frequently visited and reviewed, their ratings are less reliable. Most of the reliable ratings lie in the 4.8 to 4.9 range, and the most frequently reviewed classes fall in the 10-15 price range. Note that this price is in credits (an $80 one-month subscription provides 45 credits).
To refine the decision on how much to spend on a class, I also looked at the relationship between popularity and price. Of all the pairs of variables in the dataset, this one comes closest to a linear correlation. Popularity generally increases with price. However, most of the highly popular classes again sit around the 10-15 credit mark, so this is the price range the analysis suggests for ensuring quality at a reasonable price.
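To put the credit prices in dollar terms, the subscription figures quoted above ($80 for 45 credits) give a simple conversion. The sketch below also includes a plain Pearson correlation as one way to measure how close to linear two variables are; the post does not say which correlation measure was actually used, so treat that part as an assumption.

```python
from math import sqrt

SUBSCRIPTION_USD = 80   # one-month subscription price quoted above
CREDITS_PER_MONTH = 45  # credits included in that subscription


def credits_to_usd(credits):
    """Dollar cost of a class priced in credits, at 80 / 45 dollars per credit."""
    return credits * SUBSCRIPTION_USD / CREDITS_PER_MONTH


def classes_per_month(credits_per_class):
    """How many whole classes at this price fit into one month's credits."""
    return CREDITS_PER_MONTH // credits_per_class


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes both inputs vary (non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


for credits in (10, 15):
    print(f"{credits} credits is about ${credits_to_usd(credits):.2f}, "
          f"or {classes_per_month(credits)} classes per month")
# 10 credits is about $17.78, or 4 classes per month
# 15 credits is about $26.67, or 3 classes per month
```

In other words, the suggested 10-15 credit range works out to roughly $18 to $27 per class, or three to four classes from one month's credits.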


Conclusions and future work
This analysis proved fruitful in answering the three main questions detailed at the start of this blog. However, there is still a vast amount of further research that can be done with this website. As mentioned earlier, the data scraped for this project covered one particular day. If the data were scraped for each day of the week, classes on different days could be compared. For example, are weekend classes more expensive than weekday classes?
ClassPass is a global website containing classes for many global gym brands. Is Peloton more expensive in Los Angeles than in New York? Is Barry's Bootcamp more popular in London than in New York? Given more time, these would be interesting comparisons to make.
Finally, throughout this blog, questions were answered for the consumer. However, analysis of this data could also prove extremely beneficial to investors or companies in the industry. It could highlight the most profitable and competitive locations to open gyms of various types. That is valuable information for someone planning to open New York's next best boxing gym!
If you are looking for more detail on how the project was completed, please visit my GitHub page!
https://github.com/Alster96/gsa_project1.git