Analyzing Tanzania Water Pump Maintenance Needs

Stephen Shafer
Posted on May 6, 2018


When's the last time you thought to yourself, "Will I be able to find a glass of potable water today?" If you live in a developed country, you have probably never asked that question, and that's a good thing. However, in many parts of the world, such as the East African country of Tanzania, this question can still arise on a daily basis.

Hand-driven and gravity-fed pumps are still a main source of potable water in Tanzania, and maintaining them is an ongoing issue. Because many communities lack either the funds or the knowledge to maintain these pumps and wells, they frequently break down or become polluted. This leads to the question: how can we use machine learning to help mitigate this issue?


The Dataset

I acquired this dataset from a machine learning challenge posted by Taarifa on DrivenData.

In their own words:

Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.

The dataset is composed of information pertaining to 59,400 different water pumps located in Tanzania, Africa. More info can be found here.


The Mission

The challenge set by Taarifa is to use machine learning to predict, from the variables provided, which pumps are functional, non-functional, or functional but in need of repair. I don't yet have a solid machine learning foundation, so my objective was to complete the first step: figuring out which variables correlate most closely with each of the three status groups. A model like this could be life-saving for many communities in Tanzania. It would help keep more pumps functional and show which communities need more attention. It could also be applied in any other country that relies on a similar source of water.

Feel free to mess around with my Shiny App to look into other categories I didn't go over here.

My Findings

The first thing I did was convert my y-axis from total count to percent, which makes it easier to see which features have a closer relationship with each of the three status groups. As seen below, I was able to isolate each Region and Basin to determine which areas are more likely to have a functional pump, a non-functional pump, or a pump in need of repair.

One insight is that the Iringa Region and the Lake Nyasa Basin are more likely than other categories to have functional pumps. Meanwhile, the Lindi Region and the Ruvuma/Southern Coast Basin are more likely to have non-functional pumps.
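The count-to-percent conversion described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the Shiny App: the column names (`region`, `status_group`), the status labels, and the toy rows are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def status_shares(rows, feature, target="status_group"):
    """Return {category: {status: percent}} for one categorical feature."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1

    shares = {}
    for category, status_counts in counts.items():
        total = sum(status_counts.values())
        # Dividing by the category total puts small and large categories
        # on the same 0-100 scale.
        shares[category] = {
            status: 100.0 * n / total for status, n in status_counts.items()
        }
    return shares


# Made-up rows for illustration only; not taken from the real dataset.
toy_rows = [
    {"region": "A", "status_group": "functional"},
    {"region": "A", "status_group": "functional"},
    {"region": "A", "status_group": "non functional"},
    {"region": "A", "status_group": "functional needs repair"},
    {"region": "B", "status_group": "non functional"},
    {"region": "B", "status_group": "functional"},
]

shares = status_shares(toy_rows, "region")
for category in sorted(shares):
    print(category, shares[category])
```

Working in percent rather than raw counts keeps a region with many pumps from dominating the chart simply because it is larger.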


Construction Year

Another key category was construction year. As you probably suspected, older pumps are more likely to be non-functional, while newer pumps are more likely to be functional. Although this seems obvious, it is important to confirm your suspicions when preparing to write a machine learning algorithm.

Functional: This chart confirms our suspicions, with two outliers: 1965 and 1971 have a higher share of functioning pumps than the years around them. This may be due to the companies that installed them, or to whether they have received continuous maintenance. A next step would be to look into these hypotheses.
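One way to check for years that stand out like this is to compute the functional share per construction year and flag any year that sits well above its two neighbors. The sketch below is a rough heuristic under stated assumptions: the column names, the `missing_year=0` convention for unknown years, the 10-point margin, and the toy rows are all hypothetical.

```python
from collections import defaultdict


def functional_rate_by_year(rows, missing_year=0):
    """Percent of pumps labelled 'functional' for each construction year."""
    totals = defaultdict(int)
    functional = defaultdict(int)
    for row in rows:
        year = row["construction_year"]
        if year == missing_year:  # assumed placeholder for an unknown year
            continue
        totals[year] += 1
        if row["status_group"] == "functional":
            functional[year] += 1
    return {y: 100.0 * functional[y] / totals[y] for y in sorted(totals)}


def years_above_neighbors(rates, margin=10.0):
    """Flag years whose rate exceeds the mean of the adjacent years by `margin`."""
    years = sorted(rates)
    flagged = []
    for prev, cur, nxt in zip(years, years[1:], years[2:]):
        neighbor_mean = (rates[prev] + rates[nxt]) / 2
        if rates[cur] - neighbor_mean > margin:
            flagged.append(cur)
    return flagged


# Made-up rows for illustration only.
toy_rows = [
    {"construction_year": 2000, "status_group": "functional"},
    {"construction_year": 2000, "status_group": "non functional"},
    {"construction_year": 2001, "status_group": "functional"},
    {"construction_year": 2001, "status_group": "functional"},
    {"construction_year": 2002, "status_group": "functional"},
    {"construction_year": 2002, "status_group": "non functional"},
    {"construction_year": 0, "status_group": "functional"},
]

rates = functional_rate_by_year(toy_rows)
flagged = years_above_neighbors(rates)
print(rates, flagged)
```

Years with only a handful of pumps can produce extreme percentages, so it is worth checking the per-year counts before reading too much into a flagged year.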


GPS Height

GPS Height is the altitude of the well. As we can see below, there is no obvious pattern linking GPS Height to any one of the status groups.
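For a continuous variable like altitude, one simple check is to group the values into fixed-width bins and compare the status shares per bin. This is a minimal sketch; the `gps_height` column name, the 500-unit bin width, and the toy rows are assumptions for illustration.

```python
from collections import Counter, defaultdict


def shares_by_height_bin(rows, bin_width=500):
    """Return {bin_start: {status: percent}} using fixed-width altitude bins."""
    counts = defaultdict(Counter)
    for row in rows:
        # Floor division also places negative heights in the correct bin.
        bin_start = (row["gps_height"] // bin_width) * bin_width
        counts[bin_start][row["status_group"]] += 1

    result = {}
    for bin_start in sorted(counts):
        total = sum(counts[bin_start].values())
        result[bin_start] = {
            status: 100.0 * n / total
            for status, n in counts[bin_start].items()
        }
    return result


# Made-up rows for illustration only.
toy_rows = [
    {"gps_height": 100, "status_group": "functional"},
    {"gps_height": 400, "status_group": "non functional"},
    {"gps_height": 1200, "status_group": "functional"},
    {"gps_height": -20, "status_group": "functional"},
]

bins = shares_by_height_bin(toy_rows)
for start, share in bins.items():
    print(start, share)
```

If the shares look roughly the same in every bin, that supports the reading that altitude alone says little about pump status.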


Heat Map

I also created a heat map to visually locate each of the 59,400 pumps. My goal was to add interactive features for selecting specific pumps by density, location, etc., but unfortunately I wasn't able to accomplish this. The map still gives us an idea of where the pumps are located and raises more questions. Are the isolated pumps in the middle of the country more likely to be non-functional? Again, this needs to be looked into further.
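The question about isolated pumps could be explored without a map by bucketing coordinates into a coarse grid and comparing the non-functional share in sparse cells against dense ones. The sketch below is one possible approach, not an analysis that was run for this post: the `latitude` and `longitude` column names, the 0.5-degree cell size, the `sparse_max` cutoff, and the toy rows are all assumptions.

```python
import math
from collections import defaultdict


def non_functional_rate_by_density(rows, cell_deg=0.5, sparse_max=1):
    """Compare the non-functional share in sparse vs. dense grid cells."""
    cells = defaultdict(list)
    for row in rows:
        key = (
            math.floor(row["latitude"] / cell_deg),
            math.floor(row["longitude"] / cell_deg),
        )
        cells[key].append(row["status_group"])

    tallies = {"sparse": [0, 0], "dense": [0, 0]}  # [non functional, total]
    for statuses in cells.values():
        label = "sparse" if len(statuses) <= sparse_max else "dense"
        tallies[label][0] += sum(s == "non functional" for s in statuses)
        tallies[label][1] += len(statuses)

    return {
        label: (100.0 * bad / total if total else None)
        for label, (bad, total) in tallies.items()
    }


# Made-up coordinates for illustration only.
toy_rows = [
    {"latitude": -6.1, "longitude": 35.1, "status_group": "functional"},
    {"latitude": -6.2, "longitude": 35.2, "status_group": "functional"},
    {"latitude": -6.3, "longitude": 35.3, "status_group": "non functional"},
    {"latitude": -2.0, "longitude": 30.0, "status_group": "non functional"},
]

density_rates = non_functional_rate_by_density(toy_rows)
print(density_rates)
```

A fixed grid is a crude proxy for isolation, since two nearby pumps can fall on opposite sides of a cell boundary, but it is enough for a first look before moving to a distance-based measure.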



After looking into every combination to figure out how best to predict which pump belongs to which status group, I realized that most of the variables are somehow interconnected. With further work, I hope to confirm my intuitions by visually showing how these groups are related, creating a solid foundation for a machine learning algorithm.

About Author

Stephen Shafer

BS in Accounting with a concentration in Management Information Systems (MIS) at Binghamton University. Previous FinTech sales experience has allowed me to more clearly understand where true value lies in data, and how it can be directly translated...