App-tastic! - Navigating the App Jungle
Blog image by Ben Bessie, Best Brand Consult
Discover the enchanting Google Play Store, a bustling digital wonderland of apps, games, and delights, where developers' ingenuity, tech enthusiasts' curiosity, and users' needs converge. This virtual trove combines innovation, diverse tools, educational wonders, mesmerizing games, and captivating multimedia. In this blog, I explore the critical metrics and distinct patterns underpinning this dynamic platform, seeking to unravel the latest trends in the Android app market. On that basis, I aim to unearth valuable business insights about advertisements and investments. I draw on the Google Play Store dataset from Kaggle.
Data Overview
The dataset comprises 10,841 observations organized into 13 features: product names, product categories, product ratings, the number of reviews, app sizes, installations, subscription options, product prices, content ratings, app genres, last update dates, current versions, and compatible Android versions. My analysis focuses on a selection of these variables.
Data Preparation
To begin with, I excluded the Genres, Type, Last Updated, Current Ver, and Android Ver columns from the analysis since they were irrelevant to this study. I also removed one row containing invalid inputs to ensure a clean dataset. The Reviews, Size, Installs, and Price columns were initially stored as objects (strings), so I converted them to numeric types for analysis and visualization. Lastly, I replaced the 1,474 missing values in Rating with the column mean. I chose this approach because the missing values appeared to be randomly distributed, so mean imputation is unlikely to introduce systematic bias.
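The conversion and imputation steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis. It assumes raw strings shaped like `10,000+` for Installs, `$4.99` for Price, and `19M`, `201k`, or `Varies with device` for Size, and the helper names (`parse_installs`, `parse_price`, `parse_size_mb`, `impute_mean`) are hypothetical.

```python
from statistics import mean


def parse_installs(raw: str) -> int:
    """Turn an install bucket such as '10,000+' into the integer 10000."""
    return int(raw.replace(",", "").rstrip("+"))


def parse_price(raw: str) -> float:
    """Turn a price such as '$4.99' (or '0' for free apps) into a float."""
    return float(raw.lstrip("$"))


def parse_size_mb(raw: str):
    """Turn '19M' or '512k' into megabytes; return None for anything else.

    Assumption: 1 MB = 1024 kB. Non-numeric sizes are treated as missing.
    """
    if raw.endswith("M"):
        return float(raw[:-1])
    if raw.endswith("k"):
        return float(raw[:-1]) / 1024
    return None


def impute_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Made-up sample values, used only to exercise the helpers.
print(parse_installs("10,000+"), parse_price("$4.99"), parse_size_mb("19M"))
print(impute_mean([4.0, None, 5.0]))
```

With pandas, the same functions could be applied column-wise, for example through `Series.map`.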
Data Analysis and Visualization
I selected the graphs based on the specific objectives and data types to effectively illustrate measures of centrality and reveal distinct underlying trends.
Rating
The rating represents the user's numerical evaluation of the product. On the Play Store, the rating scale spans one to five, with one denoting the lowest score. Product developers seek this feedback to assess their product's performance and offer potential users subjective insights into its quality, performance, and satisfaction.
The figures below illustrate the statistical distribution of the ratings in a histogram and box plot.

The histogram features a prominent peak around 4.2, near the center of the distribution (mean 4.19, median 4.30), while a small bump at 5.0 marks the few perfectly rated products. The broad spread of data points reflects the wide range of user assessments.

The box plot reveals a left-skewed distribution with extreme outliers, offering valuable statistical insights:
- The calculated mean (4.19) is lower than the median (4.30).
- Approximately 81.6% of recorded ratings fall between 4.0 and 4.5.
- Instagram, Subway Surfers, and Google Photos are standout products with notable attributes. These apps boast an impressive average rating of at least 4.5, surpassing the overall mean. Additionally, they garnered numerous installations and user reviews, indicating high user engagement.
- On the other hand, 274 products received perfect ratings but experienced relatively low downloads. Such products include Hojiboy Tojiboyev Life Hacks, American Girls Mobile Numbers, and Awake Dating.
- In contrast, House Party - live chat, Speech Therapy: F, and Clarksburg AH received the lowest ratings among the recorded data.
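The summary statistics in the list above (a mean below the median, and the share of ratings inside a range) can be computed along these lines. The ratings here are made up for illustration and are not taken from the dataset, and `share_between` is a hypothetical helper.

```python
from statistics import mean, median

# Made-up ratings for illustration only; one very low score drags the mean down.
ratings = [1.0, 3.8, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5, 4.7, 5.0]


def share_between(values, low, high):
    """Fraction of values that fall in the closed interval [low, high]."""
    inside = [v for v in values if low <= v <= high]
    return len(inside) / len(values)


avg_rating = mean(ratings)
median_rating = median(ratings)
core_share = share_between(ratings, 4.0, 4.5)

# A mean below the median is a common sign of a left-skewed distribution.
print(f"mean={avg_rating:.2f}, median={median_rating:.2f}")
print(f"share of ratings in [4.0, 4.5]: {core_share:.0%}")
```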
Reviews
Reviews capture users' candid and subjective experiences of a product, much like ratings. They cover many attributes, including the product's functionality, usability, reliability, design, and value for money. Here we focus on the number of reviews per product.

The histogram above reveals a small number of reviews for most products, with the majority recording fewer than half a million reviews. Despite this trend, certain products stand out with significantly higher review counts than their counterparts, as demonstrated in the violin plot below.

The violin plot reveals a highly skewed distribution and extreme outliers. The products received 444,152 reviews on average, with a median of 2,094. The most reviewed products are Facebook, WhatsApp Messenger, Instagram, Messenger - Text and Video Chat For Free, and Clash of Clans. Facebook stands out with a record-breaking 78,128,208 reviews, while the others garnered between 445,000 and 69.1 million reviews.
I found that Facebook received the lowest rating compared to its counterparts, whereas WhatsApp Messenger, Instagram, and Clash of Clans maintained consistently high ratings. This inconsistency suggests that relying on review counts alone may not accurately identify highly rated products. Further investigation is needed to gain deeper insights into the underlying sentiments. The graph below illustrates this evaluation.

Installs
This attribute represents the number of times a product has been installed. The products recorded an average of 15,464,338 installations, while the median installation count was 100,000. The considerable disparity between these two metrics is mainly attributable to a few products with extreme outlier values, as evident in the violin plot below.

The violin plot above depicts a right-skewed distribution with indications of extreme outliers, represented by the protrusions along the tail. The broad section at the bottom of the graph indicates a concentration of values between zero and fifty million (0.05 * 1e9). In the center of this section, a small black-and-white marker shows the median value. Notably, 58 products (0.5%) recorded a billion installations, with Subway Surfers, Instagram, and Google Photos receiving the highest average ratings among them.
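To see how a handful of outliers can open such a gap between the mean and the median, consider the small sketch below. The install counts are hypothetical and chosen only to mimic the pattern, with a single billion-install app among otherwise modest figures.

```python
from statistics import mean, median

# Hypothetical install counts; the last value plays the role of an extreme outlier.
installs = [100, 1_000, 10_000, 100_000, 100_000, 1_000_000, 1_000_000_000]

mean_installs = mean(installs)
median_installs = median(installs)

# Dropping the single outlier shows how much of the mean it accounts for.
mean_without_outlier = mean(installs[:-1])

print(f"median:                 {median_installs:,}")
print(f"mean:                   {mean_installs:,.0f}")
print(f"mean without outlier:   {mean_without_outlier:,.0f}")
```

The median barely reacts to the outlier, which is why it is the more representative figure for a typical product in a right-skewed distribution.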
Price
Price is the product's cost in dollars. Most products, 99.8% (10,816), are priced at $50 or less, and 92.6% (10,040) are free.
Only 24 products cost more than $50, all of which are different versions of the I'm Rich Lifestyle App. Priced at $399, it is the most expensive product, and it has not attracted many installations.
On the other hand, two products, Minecraft and Hitman Sniper, stood out. Despite being premium products, they received impressive ratings of 4.6 and 4.5, respectively, and have been installed by many users, reaching 10,000,000 installations. Interestingly, these products are considerably cheaper than the other paid products. Minecraft costs $6.99, and Hitman Sniper costs only $0.99.
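The price shares quoted above can be checked with a few lines of arithmetic. This sketch assumes the percentages are taken over the 10,840 rows that remain after dropping the one invalid row from the original 10,841. The variable names are illustrative.

```python
# Counts quoted in the text; the denominator is an assumption (10,841 rows minus 1 invalid row).
total_apps = 10_841 - 1
free_apps = 10_040
at_most_50_dollars = 10_816

free_share = round(100 * free_apps / total_apps, 1)
cheap_share = round(100 * at_most_50_dollars / total_apps, 1)
above_50_dollars = total_apps - at_most_50_dollars

print(free_share, cheap_share, above_50_dollars)  # 92.6 99.8 24
```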
The violin plot below presents the statistical distribution.

Feature Correlation
Next, I explored the relationships between the numeric attributes to uncover any notable interrelations. The heatmap and pair plot summarize the results of these correlations.

A weak relationship exists among the numeric attributes, except for the connection between reviews and installs, which shows a moderate correlation. This outcome was expected, considering the results of the earlier assessments. The pair plot below illustrates the Kernel Density Estimation graphs and scatter plots between the numeric features.
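For readers who want to see what sits behind a correlation heatmap, here is a minimal sketch that computes pairwise Pearson coefficients by hand. The five rows of values are made up, the `pearson` helper is hypothetical, and the choice of Pearson correlation is an assumption, since the text does not name the coefficient used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a constant column would make a spread zero and raise ZeroDivisionError.
    return cov / (spread_x * spread_y)


# Made-up values for five apps, for illustration only.
features = {
    "Rating": [4.1, 4.5, 3.9, 4.7, 4.3],
    "Reviews": [120, 90_000, 15, 2_500_000, 4_000],
    "Installs": [10_000, 5_000_000, 1_000, 100_000_000, 500_000],
    "Price": [0.0, 0.0, 2.99, 0.0, 0.99],
}

# Nested dict playing the role of the correlation matrix behind a heatmap.
matrix = {
    a: {b: pearson(features[a], features[b]) for b in features}
    for a in features
}

for name, row in matrix.items():
    print(f"{name:<9}", [round(value, 2) for value in row.values()])
```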

Because they showed the strongest correlation, I isolated Installs and Reviews and visualized them using a scatter plot with a regression line to examine the relationship more closely.

The graph above illustrates a positive correlation between installs and reviews. In a perfect positive correlation, every data point would lie on the regression line. In a strong but imperfect one, the points would sit close to the line.
However, this graph shows a much wider scatter around the regression line. As installs increase, the distance between the data points and the regression line grows markedly, which weakens the correlation between the two attributes.
Nevertheless, it is understandable why a product with many downloads could attract many reviews. Since both features primarily depend on dynamic user behavior, establishing a consistent relationship between them proves challenging.
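The idea of points drifting away from the regression line as installs grow can be made concrete with a small least-squares sketch. The data below is invented so that low-install apps sit exactly on the line while high-install apps scatter around it. `fit_line` and `mean_abs` are hypothetical helpers, and the units (millions) are an assumption for readability.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs(values):
    """Average absolute value of a non-empty list of numbers."""
    return sum(abs(v) for v in values) / len(values)


# Invented data: installs and reviews, both in millions.
installs = [1, 1, 2, 2, 10, 10]
reviews = [1, 1, 2, 2, 5, 15]

slope, intercept = fit_line(installs, reviews)

# Residual = observed reviews minus the value predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(installs, reviews)]

low_spread = mean_abs([r for x, r in zip(installs, residuals) if x < 5])
high_spread = mean_abs([r for x, r in zip(installs, residuals) if x >= 5])

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"mean |residual| for low installs:  {low_spread:.2f}")
print(f"mean |residual| for high installs: {high_spread:.2f}")
```

A larger average residual in the high-install group is the numeric counterpart of the widening gap between the points and the line seen in the scatter plot.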
Business Insights and Recommendations
- The most eligible products for potential advertisement and investment are Facebook, Subway Surfers, WhatsApp, Instagram, YouTube, Google Photos and Clash of Clans. Besides their high downloads, they recorded excellent ratings, which indicates significant user preference.
- Products like Facebook, Subway Surfers, Instagram, and Clash of Clans will prove helpful for brand visibility due to the high engagement inferred from their figures.
- Minecraft and Hitman Sniper are two products for game lovers to check out for a low price. Their excellent ratings and high downloads make the potential experience worth a shot.
- Product developers should consider cost-effective pricing strategies for their products. Most premium or expensive products attracted little user interest.
- Ratings and review counts do not always agree (Facebook, the most reviewed product, is rated lower than its peers), so potential users should consider both when deciding on a product.
Future Work
- I want to analyze sentiments using natural language processing methods to understand user reviews, especially for notable products.
- With more data, I would explore the performance of competing apps to identify their strengths and weaknesses and benchmark them against industry rivals.
- Also, I would like to analyze app popularity and user engagement in different regions to aid the developers in deciding where to focus efforts for app localization and marketing campaigns.