Web Scraping Apple's App Store

Posted on Aug 24, 2019

Side note: To see how I use data and AI, check out my strictly-by-the-numbers player grouping algorithm in action, or my NBA player comparison dashboard 🙂

Introduction & Motivation:

Mobile phones have become so ubiquitous that not owning one now seems archaic. Being able to chat with friends a few continents away and look up essentially anything, all from a device that fits in a pocket, is a luxury that those born in this age are fortunate to enjoy. With the market split primarily between Apple and Android, Apple's App Store offers its users an eclectic mix of approximately 2 million mobile applications.

As a big-time Apple fan, I decided to explore the most popular mobile applications in each App Store category as a form of preliminary market research. That research could later support a business case for deciding how to develop and launch an application in the App Store. I wanted a better understanding of the App Store than I could get by taking what I read on the internet at face value, so I scraped the most popular applications in each category and then performed some exploratory data analysis. A link to my project's code can be found in my GitHub repository.

Tools & Process:

I used Scrapy, a web scraping framework for Python, to accomplish this task.

Figure 1: Preview of App Store's Applications

The general methodology for extracting the necessary information from the App Store was an iterative process that began on the page in Figure 1, moved through each of the categories, and collected the desired information from each app for later analysis.

 
Figure 2: Example Web Display of an App

The circled information in the Netflix example in Figure 2 is what I was interested in scraping from each app. I scraped more than 5,000 applications, extracting the following information:

  • App Name
  • Size (MB)
  • Category
  • Compatibility
  • Languages
  • App Rating (0-5)
  • Age Rating
  • Total Ratings
  • Price
 
Figure 3: Snippet of Raw Data after Scraping

Data Cleaning & Preprocessing:

After successfully scraping the raw data, I used several Python tools to make sure the data was cleaned and consistently formatted before any analysis was done. More specifically, the pandas library was the primary tool for data cleaning, in conjunction with regular expressions. To keep the analysis code tidy, I encapsulated all of my preprocessing code in a function and turned it into an importable module that handles the entirety of the data cleaning.
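
To give a feel for the regular-expression side of this step, here is a small sketch of cleaning helpers. It is an illustration rather than the module shown in Figure 4: the function names are hypothetical, and the raw formats ("85.3 MB", "1.2K Ratings", "Free", "$4.99") are assumptions about what the scraped strings look like.

```python
import re


def parse_size_mb(raw):
    """Convert a size string such as '85.3 MB' or '2 GB' to megabytes."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB)", raw, flags=re.IGNORECASE)
    if match is None:
        return None
    factor = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}[match.group(2).upper()]
    return float(match.group(1)) * factor


def parse_rating_count(raw):
    """Convert '1.2K Ratings', '3M Ratings' or '1,234 Ratings' to an int."""
    match = re.search(
        r"(\d+(?:\.\d+)?)\s*([KM]\b)?", raw.replace(",", ""), flags=re.IGNORECASE
    )
    if match is None:
        return None
    suffix = (match.group(2) or "").upper()
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000}[suffix]
    return round(float(match.group(1)) * multiplier)


def parse_price(raw):
    """Map 'Free' to 0.0 and a string such as '$4.99' to 4.99."""
    if raw.strip().lower() == "free":
        return 0.0
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def clean_record(raw):
    """Return a copy of one scraped record with numeric fields added."""
    cleaned = dict(raw)
    cleaned["size_mb"] = parse_size_mb(raw.get("size", ""))
    cleaned["total_ratings"] = parse_rating_count(raw.get("total_ratings", ""))
    cleaned["price"] = parse_price(raw.get("price", ""))
    return cleaned
```

With pandas, helpers like these can be applied column by column, for example with `Series.map`, and collected in a single importable module.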

Figure 4: Preprocessing Module

Data Analysis:

The first thing I looked into during my exploratory data analysis was the distribution of app sizes (MB). I wanted a better sense of both the typical size and the range of sizes of apps deployed to the App Store. As illustrated in Figure 5, the majority of applications are between 50 and 100 MB.

I then looked more closely at the relationship between app size and category. Figure 6 shows that games tend to be the "heaviest" apps deployed to the App Store, which is not surprising. Games typically demand more from the device, since many of them nowadays employ very high-end graphics to bolster the user experience, and those graphical assets also take up space.

 

Figure 5: Distribution of App Sizes (MB)
 
Figure 6: App Sizes per Category
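
The two size views above can be sketched with pandas roughly as follows. This is a minimal example under stated assumptions: the column names `category` and `size_mb`, the bucket edges, and the toy rows are all made up for illustration and are not the scraped data.

```python
import pandas as pd

# Toy rows standing in for the cleaned scrape; the values are invented.
apps = pd.DataFrame(
    {
        "category": ["Games", "Games", "Travel", "Lifestyle", "News", "Travel"],
        "size_mb": [950.0, 410.0, 120.0, 85.0, 60.0, 30.0],
    }
)


def size_histogram(df, edges=(0, 50, 100, 250, 500, 1000, 5000)):
    """Count apps per size bucket; edges are in MB and right-inclusive."""
    buckets = pd.cut(df["size_mb"], bins=list(edges))
    return buckets.value_counts().sort_index()


def median_size_by_category(df):
    """Median app size per category, heaviest category first."""
    medians = df.groupby("category")["size_mb"].median()
    return medians.sort_values(ascending=False)


print(size_histogram(apps))
print(median_size_by_category(apps))
```

The median is used here instead of the mean so that a handful of very large apps does not dominate a category's summary.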

The rating feature was the most prominent metric collected during my scraping process, since the App Store website did not expose some other metrics (e.g., number of installs) that I felt could be stronger predictors of an app's value. Using what was available, I performed a series of exploratory analyses on the "rating" feature, comparing its relationship with the other attributes of each app.

Figure 7: Ratings per Category

As shown above, gaming is a major category, with the most ratings of any category on the App Store.
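
A per-category comparison like this one can be produced with a simple group-by. The sketch below assumes the chart sums each app's rating count within its category; the column names and the toy numbers are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only.
apps = pd.DataFrame(
    {
        "category": ["Games", "Games", "News", "Travel"],
        "total_ratings": [1_200_000, 800_000, 500_000, 300_000],
    }
)


def ratings_by_category(df):
    """Sum the rating counts of all apps in each category, largest first."""
    totals = df.groupby("category")["total_ratings"].sum()
    return totals.sort_values(ascending=False)


print(ratings_by_category(apps))
```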

 
Figure 8: No. Languages vs. App Rating

Although there wasn't a very strong linear relationship between the number of languages and app rating, we can still see that as the number of languages increased, very few apps had low ratings.
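
One way to put numbers on both halves of that observation is to compute the Pearson correlation alongside the share of low-rated apps among apps with few versus many languages. The sketch below is illustrative only: the thresholds (a "low" rating below 3.5, "many" meaning 10 or more languages), the column names, and the toy rows are all assumptions.

```python
import pandas as pd

# Invented rows for illustration only.
apps = pd.DataFrame(
    {
        "n_languages": [1, 1, 2, 3, 10, 20, 30, 35],
        "rating": [2.0, 4.5, 3.0, 4.8, 4.2, 4.6, 4.4, 4.7],
    }
)


def language_rating_summary(df, low_rating=3.5, many_languages=10):
    """Pearson r plus the share of low-rated apps in each language group."""
    is_low = df["rating"] < low_rating
    is_many = df["n_languages"] >= many_languages
    return {
        # Series.corr computes the Pearson coefficient by default.
        "pearson_r": df["n_languages"].corr(df["rating"]),
        "low_share_few_languages": is_low[~is_many].mean(),
        "low_share_many_languages": is_low[is_many].mean(),
    }


print(language_rating_summary(apps))
```

A weak correlation and a small low-rated share in the many-language group can coexist, which is the pattern described above.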

According to the two charts above, Lyft dominates the Travel category, while Zillow leads its peers in Lifestyle. It is interesting that most of the apps in the Lifestyle top 10 are either dating apps (Tinder and Bumble) or real-estate apps.

I decided to contrast Medical and Fitness, just as I did with Travel and Lifestyle, since I thought they shared some similarities. These two categories turned out to differ more than the previous pair. GoodRx was the most-rated medical app, followed by "Leafly: Marijuana Reviews". "Weedmaps" also appeared in the top 10, so cannabis-related apps feature in Medical much as dating and real-estate apps do in Lifestyle. The Fitness category, by contrast, is an eclectic mix of meditation apps, fitness logs, and diet trackers.

Lastly, we look at the most-rated news apps, where Twitter and Reddit lead the top 10. This result surprised me and prompted some thought about what can be inferred from it. It suggests that we have drifted away from the traditional habit of watching the news on TV and have turned to social media as our news outlet. A platform such as Twitter, where one can not only follow current events but also voice a personal opinion on them, appears to be the preferred way of consuming the news.
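
The top-10 lists discussed in this section rank the apps within one category by their number of ratings. A minimal pandas sketch of that ranking follows; the column names, the placeholder app names, and the counts are invented for illustration.

```python
import pandas as pd

# Invented rows for illustration only; "App A" etc. are placeholders.
apps = pd.DataFrame(
    {
        "name": ["App A", "App B", "App C", "App D", "App E"],
        "category": ["News", "News", "News", "Travel", "Travel"],
        "total_ratings": [900, 2500, 1200, 400, 700],
    }
)


def top_rated_apps(df, category, n=10):
    """Return the n apps with the most ratings in the given category."""
    subset = df[df["category"] == category]
    return subset.nlargest(n, "total_ratings")[["name", "total_ratings"]]


print(top_rated_apps(apps, "News", n=2))
```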

Conclusion:

In summary, gaming appeared to be the most prominent category among all applications in the App Store, suggesting that it is most likely the category with the highest revenue potential. Although more research is needed, particularly into the subcategories of games, it is fair to say that a good game on the App Store has the potential to yield great profits.

Future Work:

  1. Scrape the Google Play Store and compare it against Apple's App Store.
  2. Look into a specific category to find more concrete evidence that could be leveraged for market research, since the App Store spans a broad range of categories.
  3. Scrape more than 5,000 observations from each store, to obtain a more representative sample of the 1.5 to 2 million apps in the store.

About Author

Precious Chima

Precious Chima is a Data Scientist, Solutions Architect, Technical Consultant, and Inventor working at IBM. Precious has extensive experience designing, architecting, implementing, and executing novel, cutting-edge solutions to various industries. To learn more about his passion projects, check...