App-tastic! - Navigating the App Jungle

Posted on Aug 9, 2023

Blog image by Ben Bessie, Best Brand Consult

Discover the enchanting Google Play Store, a bustling digital wonderland of apps, games, and delights, where developers' ingenuity, tech enthusiasts' curiosity, and users' needs converge. This virtual trove combines innovation, diverse tools, educational wonders, mesmerizing games, and captivating multimedia. In this blog, I explore the critical metrics and distinct patterns underpinning this dynamic platform, seeking to unravel the latest trends in the Android app market. On that basis, I aim to unearth valuable business insights about advertisements and investments. I draw on the Google Play Store dataset from Kaggle.

Data Overview

The dataset comprises 10,841 observations and 13 features. These features cover product names, product categories, product ratings, the number of reviews, app sizes, installations, subscription options, product prices, content ratings, app genres, last update dates, current versions, and compatible Android versions. My analysis focuses on a selection of these variables.

Data Preparation

To begin with, I excluded the Genres, Type, Last Updated, Current Ver, and Android Ver columns, since they were not relevant to this study. I also removed one row containing invalid inputs to ensure a clean dataset. The Reviews, Size, Installs, and Price columns were initially stored as objects (strings), so I converted them to numeric types for analysis and visualization. Lastly, I replaced the 1,474 missing values in Rating with the column mean. I chose this approach because the values appeared to be missing at random, which limits the risk of introducing systematic error.
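As a rough sketch of this preparation step (not the exact code behind this analysis), the conversions could look like the following in pandas. The sample rows, the raw string formats ("10,000+", "$4.99", "19M", "Varies with device"), the choice of 1024 kilobytes per megabyte, and the helper name size_to_megabytes are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; the raw string formats are assumed, not taken from the post.
raw = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, None, 3.9, 4.2],
    "Reviews": ["159", "967", "87510", "0"],
    "Size": ["19M", "Varies with device", "8.7M", "512k"],
    "Installs": ["10,000+", "500,000+", "5,000,000+", "0"],
    "Price": ["0", "$4.99", "0", "$0.99"],
})


def size_to_megabytes(text):
    """Convert '19M' or '512k' to megabytes; anything else becomes missing."""
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    return float("nan")


clean = raw.copy()
clean["Reviews"] = pd.to_numeric(clean["Reviews"])
clean["Size"] = clean["Size"].map(size_to_megabytes)
# Strip thousands separators and the trailing '+' before converting.
clean["Installs"] = pd.to_numeric(
    clean["Installs"].str.replace(r"[,+]", "", regex=True)
)
# Strip the dollar sign before converting.
clean["Price"] = pd.to_numeric(clean["Price"].str.replace("$", "", regex=False))
# Mean imputation: fill missing ratings with the mean of the observed ratings.
clean["Rating"] = clean["Rating"].fillna(clean["Rating"].mean())

print(clean.dtypes)
```

Mean imputation keeps the column mean unchanged but places every filled value at a single point, so it narrows the spread of the ratings.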

Data Analysis and Visualization

I selected the graphs based on the specific objectives and data types to effectively illustrate measures of centrality and reveal distinct underlying trends.


The rating represents the user's numerical evaluation of the product. On the Play Store, the rating scale spans one to five, with one denoting the lowest score. Product developers seek this feedback to assess their product's performance and offer potential users subjective insights into its quality, performance, and satisfaction. 

The figures below illustrate the statistical distribution of the ratings in a histogram and box plot.

The histogram features a prominent peak around 4.2, close to the mean (4.19), while a slight rise at 5.0 reflects a small number of perfectly rated products. The 1,474 ratings imputed with the mean also fall at this peak, which sharpens it. The remaining data points spread across the scale, illustrating how varied user assessments are.

The box plot reveals a left-skewed distribution with extreme outliers, offering valuable statistical insights:

  • The calculated mean (4.19) is lower than the median (4.30).
  • Approximately 81.6% of recorded ratings fall between 4.0 and 4.5.
  • Instagram, Subway Surfers, and Google Photos are standout products with notable attributes. These apps boast an impressive average rating of at least 4.5, surpassing the overall mean. Additionally, they garnered numerous installations and user reviews, indicating high user engagement.
  • On the other hand, 274 products received perfect ratings but experienced relatively low downloads. Such products include Hojiboy Tojiboyev Life Hacks, American Girls Mobile Numbers, and Awake Dating.

In contrast, House Party - live chat, Speech Therapy: F, and Clarksburg AH received the lowest ratings among the recorded data.
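The two statistics in the list above (a mean below the median, and the share of ratings in a range) can be reproduced with a few lines of Python. The ratings here are hypothetical values chosen to have a long left tail; they are not drawn from the dataset.

```python
from statistics import mean, median

# Hypothetical ratings: mostly high, with a few very low values in the left tail.
ratings = [1.0, 2.5, 3.8, 4.0, 4.2, 4.3, 4.3, 4.4, 4.5, 4.7, 5.0]

rating_mean = mean(ratings)
rating_median = median(ratings)

# A mean below the median commonly indicates a left-skewed distribution:
# the few low ratings pull the mean down, but barely move the median.
is_left_skewed = rating_mean < rating_median

# Share of ratings between 4.0 and 4.5, inclusive.
share_4_to_4_5 = sum(4.0 <= r <= 4.5 for r in ratings) / len(ratings)

print(f"mean={rating_mean:.2f}, median={rating_median:.2f}")
print(f"left-skewed: {is_left_skewed}, share in [4.0, 4.5]: {share_4_to_4_5:.1%}")
```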


Like ratings, reviews capture users' candid, subjective experiences of a product and opinions about it. They cover many attributes, including the product's functionality, usability, reliability, design, and value for money. Here we focus on the number of reviews per product.

The histogram above shows that most products have few reviews, with the majority recording fewer than half a million. Despite this trend, certain products stand out with significantly higher review counts than their counterparts, as demonstrated in the violin plot below.

The violin plot reveals a highly skewed distribution with extreme outliers. The products received 444,152 reviews on average, with a median of 2,094. The most reviewed products are Facebook, WhatsApp Messenger, Instagram, Messenger - Text and Video Chat For Free, and Clash of Clans. Facebook stands out with a record-breaking 78,128,208 reviews, while the others garnered between 445,000 and 69.1 million reviews.
I found that Facebook received the lowest rating compared to its counterparts. WhatsApp Messenger, Instagram, and Clash of Clans maintained consistently high ratings. This inconsistency suggests relying on review counts may not accurately identify highly-rated products. Further investigation is needed to gain deeper insights into the underlying sentiments. The graph below illustrates this evaluation.


Installs records how many times a product has been installed. The products recorded an average of 15,464,338 installations, while the median installation count was 100,000. The considerable disparity between these two metrics is mainly attributable to a few products with extreme outlier values, as evident in the violin plot below.

The violin plot above depicts a right-skewed distribution with extreme outliers, represented by the protrusions along the tail. The broad section at the bottom of the graph indicates a concentration of values between zero and fifty million (0.05 * 1e9). In the center of this section, a small black-and-white marker shows the median value. Notably, 58 products (0.5%) recorded a billion installations; among them, Subway Surfers, Instagram, and Google Photos received the highest average ratings.


Price is the product's cost in dollars. Most products, 99.8% (10,816), are priced at $50 or less, and 92.6% (10,040) are free.

Only 24 products cost more than $50, all of them different versions of the I'm Rich Lifestyle App. At $399, it is the most expensive product, and it has not attracted many installations.

On the other hand, two products, Minecraft and Hitman Sniper, stood out. Despite being premium products, they received impressive ratings of 4.6 and 4.5, respectively, and have been installed by many users, reaching 10,000,000 installations. Interestingly, these products are considerably cheaper than the other paid products. Minecraft costs $6.99, and Hitman Sniper costs only $0.99.
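The price shares quoted in this section can be checked with simple arithmetic. The counts below come from the text; the total of 10,840 is an assumption, taken as the 10,841 original rows minus the one invalid row removed during preparation.

```python
# Counts quoted in the text. The total is an assumption:
# 10,841 original rows minus the one invalid row that was removed.
total_products = 10_841 - 1
at_most_50_dollars = 10_816
free_products = 10_040

share_at_most_50 = at_most_50_dollars / total_products
share_free = free_products / total_products
above_50_dollars = total_products - at_most_50_dollars

print(f"{share_at_most_50:.1%} cost $50 or less")
print(f"{share_free:.1%} are free")
print(f"{above_50_dollars} products cost more than $50")
```

Under that assumption, the percentages and the count of products above $50 agree with the figures reported here.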

The violin plot below presents the statistical distribution.

Feature Correlation

Next, I explored the relationships between the numeric attributes to uncover any notable interrelations. The heatmap and pair plot summarize the results.

The numeric attributes are only weakly related, except for Reviews and Installs, which show a moderate correlation. This outcome was expected, given the results of the earlier assessments. The pair plot below shows kernel density estimates and scatter plots for the numeric features.

Because they showed the strongest correlation, I isolated Installs and Reviews and visualized them in a scatter plot with a regression line to examine the relationship more closely.

The graph above illustrates a positive correlation between installs and reviews. A perfect positive correlation would place every data point on the regression line. A strong but imperfect one would keep the data points close to the line.

However, this graph shows a much wider scatter around the regression line. As the installs increase, the distance between the regression line and the data points grows markedly. This spread weakens the correlation between the two attributes.

Nevertheless, it is understandable why a product with many downloads could attract many reviews. Since both features primarily depend on dynamic user behavior, establishing a consistent relationship between them proves challenging.

Business Insights and Recommendations

  • The most eligible products for potential advertisement and investment are Facebook, Subway Surfers, WhatsApp, Instagram, YouTube, Google Photos and Clash of Clans. Besides their high downloads, they recorded excellent ratings, which indicates significant user preference. 
  • Products like Facebook, Subway Surfers, Instagram and Clash of Clans will prove helpful for brand visibility due to the high engagement inferred from their figures.
  • Minecraft and Hitman Sniper are two products for game lovers to check out for a low price. Their excellent ratings and high downloads make the potential experience worth a shot.
  • Product developers should consider cost-effective strategies for their products. Most premium or expensive products attracted little user interest.
  • Review counts and ratings do not always move together, so potential users should consider both when deciding on a product.

Future Work

  • I want to analyze sentiments using natural language processing methods to understand user reviews, especially for notable products.
  • With more data, I would explore the performance of competing apps to identify their strengths and weaknesses and benchmark them against industry rivals. 
  • Also, I would like to analyze app popularity and user engagement in different regions to aid the developers in deciding where to focus efforts for app localization and marketing campaigns.
