Web Scraping Apple's App Store

Precious Chima
Posted on Aug 24, 2019

Introduction & Motivation:

Mobile phones have become so ubiquitous that not owning one now seems archaic. Being able to chat with friends a few continents away, and to look up essentially anything, is a luxury that many born in this age are fortunate to enjoy. With the market primarily split between Apple and Android, Apple's App Store offers an eclectic mix of approximately 2 million mobile applications for its users to choose from. As a big-time Apple fan, I decided to explore some of the most popular mobile applications in each App Store category as a form of preliminary market research. This research could later support a business case for deciding strategically how to develop an application for launch in the App Store. I wanted a better understanding of the App Store than I could get by taking what I read on the internet at face value, so I scraped the most popular applications in each category and then performed some exploratory data analysis. A link to my project's code can be found in my GitHub repository.

Tools & Process:

I used Scrapy, a web scraping framework for Python, which I felt was capable of accomplishing this task.

Figure 1: Preview of App Store's Applications

Extracting the necessary information from the App Store required me to design an iterative process that began on the page in Figure 1, moved through each of the categories, and collected the information that would later be used for data analysis.

Figure 2: Example Web Display of an App

The circled information in the Netflix example in Figure 2 shows what I was interested in scraping from each app. I scraped a total of 5,000+ applications, extracting the following information:

  • App Name
  • Size (MB)
  • Category
  • Compatibility
  • Languages
  • App Rating (0-5)
  • Age Rating
  • Total Ratings
  • Price

Figure 3: Snippet of Raw Data after Scraping

Data Cleaning & Preprocessing:

After successfully scraping the raw data, I used several Python tools to make sure the data was clean and well formatted before any analysis was done. More specifically, the pandas library was the primary tool for data cleaning, in conjunction with regular expressions. To keep the analysis code tidy, I encapsulated all my preprocessing code in a function inside an importable module that handles the entirety of the data cleaning.

Figure 4: Preprocessing Module
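
The preprocessing module itself appears only as an image, so the following is a minimal sketch of the kind of regular-expression cleaning described above, not the actual module. The raw string formats ("152.3 MB", "1.2K Ratings", "Free" or "$4.99"), the function names, and the 1 GB = 1024 MB conversion are all assumptions made for illustration.

```python
import re

# Multipliers for abbreviated counts (assumed raw format, e.g. "1.2K Ratings").
_SUFFIXES = {"K": 1_000, "M": 1_000_000}

# Conversion factors to megabytes (assumes binary units; adjust if needed).
_UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def parse_size_mb(raw):
    """Convert a size string such as '152.3 MB' to a float in megabytes."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB)", raw, flags=re.IGNORECASE)
    if match is None:
        return None  # unrecognized format; leave the value missing
    return float(match.group(1)) * _UNIT_TO_MB[match.group(2).upper()]


def parse_rating_count(raw):
    """Convert a count such as '1.2K Ratings' or '12,345 Ratings' to an int."""
    match = re.search(
        r"(\d+(?:\.\d+)?)([KM])?\b", raw.replace(",", ""), flags=re.IGNORECASE
    )
    if match is None:
        return None
    multiplier = _SUFFIXES.get((match.group(2) or "").upper(), 1)
    return int(round(float(match.group(1)) * multiplier))


def parse_price(raw):
    """Convert 'Free' or '$4.99' to a float price."""
    text = raw.strip()
    if text.lower() == "free":
        return 0.0
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None
```

With pandas, helpers like these could be applied column by column to the raw DataFrame, which keeps each cleaning rule small and easy to test on its own.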

Data Analysis:

The first thing I looked into during my exploratory data analysis was the distribution of app sizes (MB). I wanted a better understanding of how the sizes of apps deployed to the App Store are distributed, as well as their range. As illustrated in Figure 5, the majority of applications are between 50 and 100 MB.

I then looked more closely at the relationship between app size and category. Figure 6 shows that games tend to be the "heaviest" apps deployed to the App Store, which is not surprising. Games typically require more computational power, since many of them now employ very high-end graphics to bolster their user experience.

 

Figure 5: Distribution of App Sizes (MB)
Figure 6: App Sizes per Category

The rating feature was the most prominent metric collected during my scraping process, since the App Store website did not expose some other metrics (e.g., number of installs) that I felt could be stronger predictors of an app's value. Using what was available, I performed a series of exploratory analyses on the "rating" feature, comparing its relationship with other attributes of each app.

Figure 7: Ratings per Category

As shown above, gaming is a big-time category, with the most ratings of any category on the App Store.
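
A per-category comparison like this one comes down to summing the rating counts within each category and sorting the totals. The sketch below shows that aggregation in plain Python. The record keys (`category`, `total_ratings`) and the sample rows are hypothetical placeholders, not values from the scraped dataset.

```python
from collections import defaultdict


def total_ratings_by_category(apps):
    """Sum rating counts per category and return (category, total) pairs,
    sorted from most rated to least rated."""
    totals = defaultdict(int)
    for app in apps:
        totals[app["category"]] += app["total_ratings"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only; real rows would come from the scrape.
sample = [
    {"name": "game_a", "category": "Games", "total_ratings": 1000},
    {"name": "news_a", "category": "News", "total_ratings": 300},
    {"name": "game_b", "category": "Games", "total_ratings": 500},
    {"name": "travel_a", "category": "Travel", "total_ratings": 700},
]

ranking = total_ratings_by_category(sample)
```

The same helper, restricted to one category and keyed by app name instead, would give the kind of top-10 lists discussed for individual categories.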

Figure 8: No. Languages vs. App Rating

Although there wasn't a very strong linear relationship between the number of languages and app rating, we can still see that as the number of languages increased, very few apps had low ratings.
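
One way to put a number on the strength of a linear relationship is the Pearson correlation coefficient, which ranges from -1 to 1, with values near 0 indicating a weak linear relationship. The post does not say how the relationship was measured, so the function below is only a sketch of how it could be checked, with hypothetical argument names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the number of languages per app and `ys` the corresponding app ratings. A coefficient alone would not capture the pattern noted above, where low ratings become rare as the language count grows, so the scatter plot remains the better view of that effect.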

According to the two charts above, Lyft appears to be the most dominant app in the Travel category, while Zillow leads its peers in Lifestyle. It is interesting that most of the apps in the Lifestyle top 10 are either dating apps (Tinder and Bumble) or real estate apps.

I decided to contrast Medical and Fitness, just as I did with Travel and Lifestyle, since I thought they shared some similarities. Medical and Fitness turned out to contrast more than the former pair, but I found that GoodRx was the most rated medical app, followed by "Leafly: Marijuana Reviews". "Weedmaps" also appeared in the top 10, much as dating and real estate apps do in the Lifestyle category. As for the Fitness category, there appears to be an eclectic mix of meditation apps, fitness logs, and diet trackers.

Lastly, we look at the most rated news apps, where Twitter and Reddit lead the top 10. This result surprised me and provoked some thought about what inferences can be drawn from it. It suggests that we have drifted away from the traditional habit of watching the news on TV and have turned to social media for our news. A platform such as Twitter, where one can not only see current events but also voice personal opinions on them, appears to be a preferred way of consuming the news.

Conclusion:

In summary, gaming appeared to be the most prominent category among all applications in the App Store, suggesting that it is most likely the category with the highest earning potential. Although more research is needed into the subcategories of games, it is fair to say that a good game on the App Store can be highly profitable.

Future Work:

  1. Scrape Google Play Store and compare it against Apple's App Store.
  2. Look into a specific category to find more concrete evidence that could be leveraged for market research, since the App Store spans a broad range of categories.
  3. Scrape more than 5,000 observations from each store, to obtain a more representative sample of the 1.5 to 2 million apps in the store.

About Author

Precious Chima

Precious Chima is an NYC Data Science Fellow with a Bachelor's degree in Applied Mathematics & Statistics from Stony Brook University. Prior to enrolling in the NYCDSA, he worked in the Oil & Gas industry, specializing in optimizing drilling...