Analyzing Tanzania Water Pump Maintenance Needs

Posted on May 6, 2018

Introduction

When was the last time you asked yourself, "Will I be able to find a glass of potable water today?" If you live in a developed country, you have probably never had to ask that question, and that's a good thing. However, in many parts of the world, including the East African country of Tanzania, this question can still arise on a daily basis.

Hand-driven and gravity-fed pumps are still a main source of potable water in Tanzania, and maintaining them is an ongoing issue. Because many communities lack either the funds or the knowledge to maintain these pumps and wells, the pumps frequently break down or become polluted. This leads to the question: how can we use machine learning to help mitigate this issue?

 

The Dataset

I acquired this dataset from a machine learning challenge posted by Taarifa on DrivenData.

In their own words:

Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.

The dataset is composed of information pertaining to 59,400 different water pumps located in Tanzania, Africa. More info can be found here.

 

The Mission

The challenge set by Taarifa is to use machine learning to predict which pumps are functional, non-functional, or functional but in need of repair, based on the variables provided. At this point I don't have a solid machine learning foundation, so my objective was to complete the first step: figuring out which variables most closely correlate with each of the three status groups. A working prediction model could be life-saving for many communities in Tanzania. It would help ensure that most pumps remain functional and show which communities need more attention. It could also be useful beyond Tanzania, in any other country that relies on similar water sources.

Feel free to mess around with my Shiny App to look into other categories I didn't go over here.

My Findings

The first thing I did was convert the y-axis from total count to percent, so that categories with very different numbers of pumps can be compared fairly across the three status groups. As seen below, I was able to isolate each Region and Basin to determine which areas are more likely to have a functional pump, a non-functional pump, or a pump in need of repair.

One insight is that the Iringa Region and the Lake Nyasa Basin are more likely than other regions and basins to have functional pumps. Meanwhile, the Lindi Region and the Ruvuma/Southern Coast Basin are more likely to have non-functional pumps.
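The count-to-percent conversion described above can be sketched in a few lines. This is not the code behind the Shiny app, just a minimal Python illustration. The column names `region` and `status_group` and the status labels are assumptions about how the challenge data is laid out, and the sample rows are invented purely to show the calculation.

```python
from collections import Counter, defaultdict


def status_share_by_category(rows, category_key, status_key="status_group"):
    """Return {category: {status: percent}}; each category's percents sum to 100."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[category_key]][row[status_key]] += 1

    shares = {}
    for category, status_counts in counts.items():
        total = sum(status_counts.values())
        shares[category] = {
            status: 100.0 * n / total for status, n in status_counts.items()
        }
    return shares


# Invented rows for illustration only; these are NOT taken from the real dataset.
sample_rows = [
    {"region": "Region A", "status_group": "functional"},
    {"region": "Region A", "status_group": "functional"},
    {"region": "Region A", "status_group": "functional"},
    {"region": "Region A", "status_group": "non functional"},
    {"region": "Region B", "status_group": "functional"},
    {"region": "Region B", "status_group": "non functional"},
]

region_shares = status_share_by_category(sample_rows, "region")
for region, shares in sorted(region_shares.items()):
    print(region, shares)
```

Working in percentages means a region with thousands of pumps and a region with a few hundred can be read on the same scale. The same function could be pointed at a basin column by changing `category_key`.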

 

Construction Year

Another key category was construction year. As you probably suspected, older pumps are more likely to be non-functional, while newer pumps are more likely to be functional. Although this seems obvious, it is important to confirm your suspicions when preparing to build a machine learning model.

Functional: This chart confirms our suspicions, but there are two outliers. Pumps built in 1965 and 1971 have a higher proportion of functioning pumps than those built in the surrounding years. This may be due to the companies that installed them, or to whether they have received continuous maintenance. A next step would be to test these assumptions.
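One way to start looking into outlier years is to put the number of pumps next to the functional rate for each year, since a year with very few pumps can show an unusually high or low percentage by chance. The sketch below is a hypothetical Python illustration, not the original analysis. It assumes columns named `construction_year` and `status_group`, assumes that a missing or zero year means "unknown", and uses invented rows.

```python
from collections import defaultdict


def functional_rate_by_year(rows, year_key="construction_year",
                            status_key="status_group",
                            functional_label="functional"):
    """Return {year: (percent_functional, pump_count)} sorted by year."""
    totals = defaultdict(int)
    functional = defaultdict(int)
    for row in rows:
        year = row.get(year_key)
        if not year:  # assumption: None or 0 marks an unknown construction year
            continue
        totals[year] += 1
        if row[status_key] == functional_label:
            functional[year] += 1
    return {
        year: (100.0 * functional[year] / totals[year], totals[year])
        for year in sorted(totals)
    }


# Invented rows for illustration only.
sample_rows = [
    {"construction_year": 1971, "status_group": "functional"},
    {"construction_year": 1971, "status_group": "non functional"},
    {"construction_year": 2005, "status_group": "functional"},
    {"construction_year": 0, "status_group": "functional"},  # skipped as unknown
]

year_rates = functional_rate_by_year(sample_rows)
for year, (percent, count) in year_rates.items():
    print(year, f"{percent:.1f}% functional", f"n={count}")
```

If an outlier year turns out to rest on only a handful of pumps, sample size would be a simpler explanation to rule out before looking at installers or maintenance history.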

 

GPS Height

GPS Height is the altitude of the well. As we can see below, there is no obvious pattern linking altitude to any one of the status groups.
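Because altitude is a continuous value, one way to check for a pattern is to group pumps into fixed-width altitude bands and compare the status mix in each band. The following Python sketch assumes a numeric `gps_height` column and a `status_group` column; the 500-unit band width is an arbitrary choice and the rows are invented for illustration.

```python
from collections import Counter, defaultdict


def status_share_by_height_band(rows, band_width=500,
                                height_key="gps_height",
                                status_key="status_group"):
    """Return {band_start: {status: percent}} for fixed-width altitude bands."""
    counts = defaultdict(Counter)
    for row in rows:
        # Floor division maps e.g. 480 -> 0 and 1300 -> 1000 when band_width is 500.
        band_start = (row[height_key] // band_width) * band_width
        counts[band_start][row[status_key]] += 1

    result = {}
    for band_start in sorted(counts):
        total = sum(counts[band_start].values())
        result[band_start] = {
            status: round(100.0 * n / total, 1)
            for status, n in counts[band_start].items()
        }
    return result


# Invented rows for illustration only.
sample_rows = [
    {"gps_height": 120, "status_group": "functional"},
    {"gps_height": 480, "status_group": "non functional"},
    {"gps_height": 1300, "status_group": "functional"},
]

band_shares = status_share_by_height_band(sample_rows)
for band_start, shares in band_shares.items():
    print(f"{band_start} to {band_start + 499}:", shares)
```

If the percentages look roughly the same from band to band, that would be consistent with the lack of a visible pattern in the chart.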

 

Heat Map

I also created a heat map to visually locate each of the 59,400 pumps. My goal was to add more interactive features for selecting specific pumps based on density, location, etc., but unfortunately I wasn't able to accomplish this. The map still gives us an idea of where the pumps are located, and it raises more questions. Are the isolated pumps in the middle of the country more likely to be non-functional? Again, this needs to be looked into further.
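The question about isolated pumps could be explored without a map by dividing the country into a coarse grid, counting pumps per cell, and comparing the non-functional rate in sparse cells against dense ones. This is only a rough sketch under stated assumptions: the `latitude`, `longitude`, and `status_group` column names, the "non functional" label, the 0.5-degree cell size, and the cutoff for "sparse" are all illustrative choices, and the rows are invented.

```python
import math
from collections import defaultdict


def non_functional_rate_by_density(rows, cell_size=0.5, sparse_max=1,
                                   non_functional_label="non functional"):
    """Return {"sparse": percent, "dense": percent} of non-functional pumps.

    A grid cell is "sparse" when it holds at most `sparse_max` pumps.
    A group with no pumps gets None instead of a percentage.
    """
    cells = defaultdict(list)
    for row in rows:
        cell = (math.floor(row["latitude"] / cell_size),
                math.floor(row["longitude"] / cell_size))
        cells[cell].append(row["status_group"])

    tallies = {"sparse": [0, 0], "dense": [0, 0]}  # [non_functional, total]
    for statuses in cells.values():
        label = "sparse" if len(statuses) <= sparse_max else "dense"
        tallies[label][0] += sum(s == non_functional_label for s in statuses)
        tallies[label][1] += len(statuses)

    return {
        label: (100.0 * nf / total if total else None)
        for label, (nf, total) in tallies.items()
    }


# Invented coordinates and statuses for illustration only.
sample_rows = [
    {"latitude": -6.1, "longitude": 35.1, "status_group": "non functional"},
    {"latitude": -6.2, "longitude": 35.2, "status_group": "functional"},
    {"latitude": -6.3, "longitude": 35.3, "status_group": "functional"},
    {"latitude": -3.0, "longitude": 30.0, "status_group": "non functional"},
]

density_rates = non_functional_rate_by_density(sample_rows)
print(density_rates)
```

A fixed grid is a blunt measure of isolation, since two neighboring pumps can fall on opposite sides of a cell boundary. A distance-based neighbor count would be a more careful follow-up.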

 

Conclusion

After looking into every combination of variables to figure out how best to predict which status group a pump belongs to, I realized that most of the variables are interconnected in some way. With further work, I hope to confirm my intuitions by visually showing how these groups are related, creating a solid foundation for a machine learning algorithm.

About Author

Stephen Shafer

BS in Accounting with a concentration in Management Information Systems (MIS) at Binghamton University. Previous FinTech sales experience has allowed me to more clearly understand where true value lies in data, and how it can be directly translated...