Web Scraping Apple's App Store
Introduction & Motivation:
Mobile phones have become so ubiquitous that not owning one seems almost archaic. The portability and convenience of being able to chat with friends a few continents away, and to look up essentially anything, is a luxury that those born in this age are fortunate to enjoy. With the market primarily split between Apple and Android, Apple's App Store offers an eclectic mix of approximately 2 million mobile applications for its users to choose from. As a big-time Apple fan, I decided to explore the most popular mobile applications in each App Store category as a form of preliminary market research that could later support a business case for developing and launching an app in the App Store. I wanted a better understanding of the App Store than I could get by taking what I read on the internet at face value, so I scraped the most popular applications in each category and then performed some exploratory data analysis. A link to my project's code can be found in my GitHub repository.
Tools & Process:
I used Scrapy, a web scraping framework for Python, which I felt was well suited to this task.
Extracting the necessary information from the App Store required an iterative process that began on the page in Figure 1, moved through each of the categories, and collected the desired information from each app for later analysis.
The circled information in the Netflix example in Figure 2 shows what I was interested in scraping from each app. I scraped more than 5,000 applications, extracting the following information:
- App Name
- App Rating (0-5)
- Age Rating
- Total Ratings
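The fields listed above could be collected into one record per app. The sketch below is an assumption about how such a record might look, not the project's actual item definition: the class name `AppRecord`, its field names, and the extra `category` field are all hypothetical. Scrapy (version 2.2 and later) accepts dataclass instances as items, so a spider callback could yield objects like this directly. Values are kept as raw strings because cleaning happens in a later step.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AppRecord:
    """One scraped app page. All names here are illustrative assumptions."""
    name: str
    category: str                         # category page the app was reached from
    app_rating: Optional[str] = None      # raw text, e.g. "4.7"
    age_rating: Optional[str] = None      # raw text, e.g. "12+"
    total_ratings: Optional[str] = None   # raw text, e.g. "1.2M Ratings"


# Made-up values purely for illustration, not scraped data.
example = AppRecord(
    name="Example App",
    category="Entertainment",
    app_rating="4.7",
    age_rating="12+",
    total_ratings="1.2M Ratings",
)
row = asdict(example)  # plain dict, convenient for CSV or JSON export
```

Keeping optional fields as `None` rather than empty strings makes missing values easy to spot once the data is loaded for cleaning.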
Data Cleaning & Preprocessing:
After successfully scraping the raw data, I used several Python tools to make sure the data was cleaned and well formatted before any analysis was done. More specifically, the pandas library was the primary tool for data cleaning, in conjunction with regular expressions. To keep the analysis code tidy, I encapsulated all of my preprocessing code in a function inside an importable module that handles the entirety of the data cleaning.
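As a rough illustration of regex-based cleaning, the sketch below converts raw scraped strings into numbers. It is not the project's actual module: the helper names, the dictionary keys, and the assumed raw formats (such as "1.2M Ratings", "12+", or "85.3 MB") are all assumptions made for this example. It uses only the standard library; with pandas, each helper could be applied to a column with `Series.map`.

```python
import re
from typing import Optional

# Assumed raw formats: "1.2M Ratings", "3,456 Ratings", "12+", "85.3 MB".
_COUNT_RE = re.compile(r"(\d[\d.,]*)\s*([KMB])?\b", re.IGNORECASE)
_SIZE_RE = re.compile(r"(\d[\d.,]*)\s*(KB|MB|GB)\b", re.IGNORECASE)
_AGE_RE = re.compile(r"(\d+)\s*\+")
_RATING_RE = re.compile(r"\d+(?:\.\d+)?")

_COUNT_SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_SIZE_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def _to_float(text: str) -> Optional[float]:
    """Parse a number that may contain thousands separators."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def parse_count(text: Optional[str]) -> Optional[int]:
    """Turn '1.2M Ratings' or '3,456 Ratings' into an integer count."""
    match = _COUNT_RE.search(text or "")
    if match is None:
        return None
    number = _to_float(match.group(1))
    if number is None:
        return None
    suffix = (match.group(2) or "").upper()
    return int(round(number * _COUNT_SUFFIX.get(suffix, 1)))


def parse_size_mb(text: Optional[str]) -> Optional[float]:
    """Turn '85.3 MB' or '1.2 GB' into a size in megabytes."""
    match = _SIZE_RE.search(text or "")
    if match is None:
        return None
    number = _to_float(match.group(1))
    if number is None:
        return None
    return number * _SIZE_TO_MB[match.group(2).upper()]


def parse_age(text: Optional[str]) -> Optional[int]:
    """Turn an age rating such as '12+' into the integer 12."""
    match = _AGE_RE.search(text or "")
    return int(match.group(1)) if match else None


def parse_rating(text: Optional[str]) -> Optional[float]:
    """Take the first number in the text and keep it only if it is in 0-5."""
    match = _RATING_RE.search(text or "")
    if match is None:
        return None
    value = float(match.group(0))
    return value if 0.0 <= value <= 5.0 else None


def clean_record(raw: dict) -> dict:
    """Clean one raw record; the keys are hypothetical column names."""
    return {
        "name": (raw.get("name") or "").strip(),
        "app_rating": parse_rating(raw.get("app_rating")),
        "age_rating": parse_age(raw.get("age_rating")),
        "total_ratings": parse_count(raw.get("total_ratings")),
        "size_mb": parse_size_mb(raw.get("size")),
    }
```

Returning `None` for anything that fails to parse, instead of raising, lets a whole column be cleaned in one pass while leaving the bad rows easy to find afterwards.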
The first thing I looked into during my exploratory data analysis was the distribution of app sizes (MB). I wanted a better understanding of how the sizes of apps deployed to the App Store are distributed, as well as their range. As illustrated in Figure 5, the majority of applications are between 50 and 100 MB.
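A size distribution like this can be tabulated by counting apps per fixed-width bin. The sketch below is only an illustration with made-up sizes; the function name `size_histogram` and the 50 MB bin width are assumptions, not the settings used for Figure 5.

```python
from collections import Counter


def size_histogram(sizes_mb, bin_width=50):
    """Count apps per size bin; each key is the lower edge of a bin in MB."""
    counts = Counter(int(size // bin_width) * bin_width for size in sizes_mb)
    return dict(sorted(counts.items()))


# Made-up sizes in MB, purely for illustration.
sizes = [12.0, 55.5, 73.2, 98.9, 140.0, 310.4]
hist = size_histogram(sizes)
for lower_edge, count in hist.items():
    print(f"{lower_edge:>4}-{lower_edge + 50:<4} MB: {'#' * count}")
```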
I then looked more closely at the relationship between app size and category. Figure 6 shows that games tend to be the "heaviest" apps deployed to the App Store, which is not surprising. Games typically require more computational power, given that many of them now employ high-end graphics to bolster the user experience.
The rating feature was the most prominent metric collected during my scraping process, since the App Store website did not expose some other metrics (e.g., number of installs) that I felt could be stronger predictors of an app's value. Using what was available, I performed a series of exploratory analyses on the "rating" feature, comparing its relationship with other attributes of each app.
As shown above, gaming is a big-time category, with the most ratings of any category on the App Store.
Although there wasn't a very strong linear relationship between the number of languages and app rating, we can still see that, as the number of languages increased, very few apps had low ratings.
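One simple way to compare sizes across categories is to group apps by category and take the median size of each group. The sketch below does this with the standard library on made-up records; the function name and the `category` and `size_mb` keys are assumptions for illustration. With pandas, the same idea would typically be expressed with a `groupby` on the category column.

```python
from collections import defaultdict
from statistics import median


def median_size_by_category(apps):
    """Return (category, median size in MB) pairs, largest median first."""
    sizes = defaultdict(list)
    for app in apps:
        if app.get("size_mb") is not None:  # skip apps with unknown size
            sizes[app["category"]].append(app["size_mb"])
    medians = {category: median(values) for category, values in sizes.items()}
    return sorted(medians.items(), key=lambda pair: pair[1], reverse=True)


# Made-up records, purely for illustration.
apps = [
    {"category": "Games", "size_mb": 250.0},
    {"category": "Games", "size_mb": 900.0},
    {"category": "Games", "size_mb": 400.0},
    {"category": "Utilities", "size_mb": 30.0},
    {"category": "Utilities", "size_mb": 60.0},
    {"category": "News", "size_mb": 80.0},
    {"category": "News", "size_mb": None},
]
ranking = median_size_by_category(apps)
```

The median is used here rather than the mean so that a handful of very large apps does not dominate a category's summary.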
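The strength of a linear relationship like this is commonly summarized with the Pearson correlation coefficient. The sketch below computes it from scratch on made-up data; the function name `pearson_r` and the sample values are assumptions for illustration, not results from the scraped dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values: number of supported languages and app rating.
languages = [1, 5, 12, 20, 33]
ratings = [3.9, 4.2, 4.4, 4.5, 4.7]
r = pearson_r(languages, ratings)
```

A coefficient near zero would not contradict the observation above: a pattern where low ratings simply become rarer as languages increase can exist without a strong linear trend.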
According to the two charts above, Lyft appears to dominate the Travel category, while Zillow leads its competitors in Lifestyle. Interestingly, most of the apps in the Lifestyle top 10 are either dating apps (Tinder and Bumble) or real estate apps.
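A "top 10 by number of ratings" list for one category can be produced by filtering to that category and keeping the largest counts. The sketch below uses made-up records with placeholder app names; the function name `top_rated` and the dictionary keys are assumptions for illustration.

```python
from heapq import nlargest


def top_rated(apps, category, n=10):
    """Return the n apps in a category with the most ratings, highest first."""
    in_category = [
        app for app in apps
        if app.get("category") == category
        and app.get("total_ratings") is not None
    ]
    return nlargest(n, in_category, key=lambda app: app["total_ratings"])


# Made-up records with placeholder names, purely for illustration.
apps = [
    {"name": "App A", "category": "Lifestyle", "total_ratings": 500},
    {"name": "App B", "category": "Lifestyle", "total_ratings": 1500},
    {"name": "App C", "category": "Travel", "total_ratings": 900},
    {"name": "App D", "category": "Lifestyle", "total_ratings": None},
    {"name": "App E", "category": "Lifestyle", "total_ratings": 1000},
]
top_two = [app["name"] for app in top_rated(apps, "Lifestyle", n=2)]
```

Note that this ranks apps by how many ratings they have received, which is a proxy for popularity, not by their average star rating.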
I decided to contrast Medical and Fitness, just as I did with the previous pair of categories, since I thought they shared some similarities. These two categories turned out to differ more than the previous pair: GoodRx was the most rated Medical app, followed by "Leafly: Marijuana Reviews". "Weedmaps" also appeared in the top 10, much as dating and real estate apps do in the Lifestyle category. As for the Fitness category, there appears to be an eclectic mix of meditation apps, fitness logs, and diet trackers.
Lastly, we look at the most rated News apps, where Twitter and Reddit lead the top 10. This result surprised me and raised the question of what can be inferred from it. It suggests that we have drifted away from the traditional habit of watching the news on TV and have turned to social media as our news outlet. A platform such as Twitter, where one can not only follow current events but also voice personal opinions on them, appears to be a preferred way of consuming the news.
In summary, gaming appeared to be the most prominent category among all applications in the App Store, suggesting that it is most likely the category with the highest earning potential. Although more research is needed, particularly into the subcategories of games, it is fair to say that a good game on the App Store has strong profit potential.
- Scrape Google Play Store and compare it against Apple's App Store.
- Look into a specific category to find more concrete evidence that could be leveraged for market research, since the App Store spans a broad range of categories.
- Scrape more than 5,000 observations from each store to obtain a more representative sample of the 1.5 to 2 million apps in the store.