Analyzing Tanzania Water Pump Maintenance Needs
When's the last time you thought to yourself, "Will I be able to find a glass of potable water today?" If you live in a first world country, you've probably never asked that question. And that's a good thing. However, in many parts of the world, like the East African country of Tanzania, this question can still arise on a daily basis.
Hand-driven and gravity-fed pumps are still a main source of potable water in Tanzania, and the maintenance of these pumps is an ongoing issue. Because many communities lack either the funds or the knowledge to maintain these pumps and wells, the pumps are constantly breaking or becoming polluted. This leads to the question: how can we use machine learning to help mitigate this issue?
I acquired this dataset from a machine learning challenge posted by Taarifa on DrivenData.
In their own words:
Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.
The dataset is composed of information pertaining to 59,400 different water pumps located in Tanzania, Africa. More info can be found here.
The challenge set by Taarifa is to use machine learning to predict which pumps are functional, non-functional, or functional but in need of repair based on the variables provided. At this point in time, I don't have a solid machine learning foundation, so my objective was to complete the first step: figuring out which variables most closely correlate with each of the three status groups listed above. A model like this could be genuinely life-saving for many communities in Tanzania. It could help ensure most pumps remain functional and show us which communities need more attention. It would be useful not only in Tanzania, but also in any other country that relies on a similar source of water.
Feel free to mess around with my Shiny App to look into other categories I didn't go over here.
The first thing I did was convert my y-axis from total count to percent in order to more accurately determine which features have a closer relationship with each of the three status groups. As seen below, I was able to isolate each Region and Basin to determine which areas are more likely to have a functional pump, a non-functional pump, or a pump in need of repair.
One insight is that the Iringa Region and the Lake Nyasa Basin are more likely than other regions and basins to have functional pumps. Meanwhile, the Lindi Region and the Ruvuma/Southern Coast Basin are more likely to have non-functional pumps.
Another key category was construction year. As you probably suspected, older pumps are more likely to be non-functional, while newer pumps are more likely to be functional. Although this seems obvious, it is important to confirm your suspicions when preparing to write a machine learning algorithm.
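The count-to-percent conversion described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the Shiny App: the sample rows are made up, and the field names (region, status_group) and status labels are assumptions about how the data is laid out.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows as (region, status_group) pairs.
# The values are invented for illustration, not taken from the dataset.
rows = [
    ("Iringa", "functional"),
    ("Iringa", "functional"),
    ("Iringa", "functional"),
    ("Iringa", "non functional"),
    ("Lindi", "functional"),
    ("Lindi", "non functional"),
    ("Lindi", "non functional"),
    ("Lindi", "functional needs repair"),
]


def status_share(rows):
    """Return, for each category, the percent of pumps in each status group."""
    # Count pumps per (category, status) pair.
    counts = defaultdict(Counter)
    for category, status in rows:
        counts[category][status] += 1

    # Divide each count by its category total so categories of
    # different sizes can be compared on the same 0-100 scale.
    shares = {}
    for category, status_counts in counts.items():
        total = sum(status_counts.values())
        shares[category] = {
            status: 100.0 * n / total for status, n in status_counts.items()
        }
    return shares


shares = status_share(rows)
for category, by_status in shares.items():
    print(category, by_status)
```

Normalizing within each category matters because a region with many pumps would otherwise dominate a raw-count chart even if its share of functional pumps were unremarkable.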
Functional: This chart confirms our suspicions; however, we have two outliers. 1971 and 1965 have a higher share of functional pumps than the years around them. This may be due to the companies that installed them, or to whether or not they've received continuous maintenance. The next step would be to look into these hypotheses.
The GPS Height relates to the altitude of the well. As we can see below, there is no obvious pattern linking GPS Height to any one of the status groups.
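One way to check a continuous variable like GPS Height for a pattern is to group it into altitude bands and compare the status mix in each band. The sketch below assumes heights are plain numbers in meters and uses an arbitrary 500 m band width; the records are invented for illustration and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (gps_height, status_group) records, invented for illustration.
records = [
    (120, "functional"),
    (480, "non functional"),
    (510, "functional"),
    (990, "functional"),
    (1400, "non functional"),
    (-20, "functional"),
]


def share_by_height_bin(records, bin_width=500):
    """Group heights into fixed-width bands and return percent per status."""
    counts = defaultdict(Counter)
    for height, status in records:
        # Floor division maps each height to the lower edge of its band,
        # e.g. 480 -> 0, 510 -> 500, -20 -> -500.
        bin_start = (height // bin_width) * bin_width
        counts[bin_start][status] += 1

    result = {}
    for bin_start in sorted(counts):
        total = sum(counts[bin_start].values())
        result[bin_start] = {
            status: round(100.0 * n / total, 1)
            for status, n in counts[bin_start].items()
        }
    return result


height_shares = share_by_height_bin(records)
for bin_start, by_status in height_shares.items():
    print(f"{bin_start} to {bin_start + 500} m:", by_status)
```

If the percentages look roughly the same from band to band, that supports the reading that altitude alone says little about pump status.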
I also created a heat map to visually locate each of the 59,400 pumps. My goal was to add more interactive features for selecting specific pumps based on density, location, etc., but unfortunately I wasn't able to accomplish this. The map still gives us an idea of where the pumps are located and raises more questions. Are the isolated pumps in the middle of the country more likely to be non-functional? Again, this needs to be looked into further.
After looking into every combination in order to figure out how best to predict which pump belongs to which status group, I realized that most of the variables are somehow interconnected. With further work, I hope to confirm my intuitions by visually showing how these groups are related, thus creating a solid foundation for a machine learning algorithm.