Data Visualization on Student Enrollment

Posted on Dec 22, 2016

This visualization project uses data collected by the World Bank for the indicator "School enrollment, secondary (gross), gender parity index (GPI)". This post describes how the data was loaded, cleansed, filtered and plotted using ggplot2 and dplyr. The reshape2 library was also used to convert the source data into a format that is useful for analysis.

Data Source

The data used for this visualization was sourced from the education section of the World Bank Gender Data portal. According to the World Bank:

Gender parity index for gross enrollment ratio in secondary education is the ratio of girls to boys enrolled at secondary level in public and private schools. Ratio of girls to boys gross enrollment ratio in secondary school is calculated by dividing female gross enrollment ratio in secondary education by male gross enrollment ratio in secondary education.

Data on education are collected by the UNESCO Institute for Statistics from official responses to its annual education survey. All the data are mapped to the International Standard Classification of Education (ISCED) to ensure the comparability of education programs at the international level. The current version was formally adopted by UNESCO Member States in 2011.

The reference years reflect the school year for which the data are presented. In some countries the school year spans two calendar years (for example, from September 2010 to June 2011); in these cases the reference year refers to the year in which the school year ended (2011 in the example).
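
As a small worked illustration of the definition quoted above, the GPI is the female gross enrollment ratio divided by the male gross enrollment ratio. The country names and figures below are made up for illustration and are not World Bank data.

```r
# Hypothetical gross enrollment ratios (percent); not real World Bank figures.
female_ger <- c(country_a = 72.0, country_b = 95.0)
male_ger   <- c(country_a = 80.0, country_b = 95.0)

# GPI = female gross enrollment ratio / male gross enrollment ratio
gpi <- female_ger / male_ger
print(round(gpi, 2))
```

A value below 1 means girls are enrolled at a lower rate than boys, and a value of 1 indicates parity.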

Data Cleansing

The raw CSV file was opened in a text editor and, as a preliminary step before analysis in R, content that did not conform to the CSV format was removed. Specifically, some header lines above the column headings were removed. The column header row was also found to be missing a field, which caused issues during analysis; this was corrected as well.
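
The cleanup described above was done by hand, but the same result could probably be reached in R without editing the file. The sketch below is only an illustration: the sample file contents, the number of leading metadata lines and the position of the header row are all assumptions, not the layout of the actual World Bank download.

```r
# Hypothetical sample file: two metadata lines, then a header row with one
# field fewer than the data rows (the data rows end with a trailing comma).
raw_lines <- c(
  '"Data Source","Hypothetical export",',
  '',
  '"Country Name","Country Code","1960","1961"',
  '"Country A","AAA","0.80","0.82",',
  '"Country B","BBB","","0.97",'
)
csv_path <- tempfile(fileext = ".csv")
writeLines(raw_lines, csv_path)

header_row <- 3  # assumed position of the header line in this sample

# Parse the header line on its own to get the column names.
all_lines <- readLines(csv_path)
col_names <- scan(text = all_lines[header_row], what = character(),
                  sep = ",", quote = "\"", quiet = TRUE)

# Read only the data rows; the trailing comma yields an extra empty column.
gpi_wide <- read.csv(csv_path, skip = header_row, header = FALSE,
                     na.strings = "", stringsAsFactors = FALSE)

# Keep only the named columns and attach the header names.
gpi_wide <- gpi_wide[, seq_along(col_names)]
names(gpi_wide) <- col_names
print(gpi_wide)
```

Reading the header separately avoids the mismatch between the header and the data rows, since `read.csv` would otherwise treat the first column as row names when the header is one field short.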

The source data came in two files. The first file contained 264 observations of 62 variables. The majority of the 264 observations corresponded to the nations of the world. Of the 62 variables, 58 related to GPI data for the years 1960 to 2016. There were also a few rows that contained summary data.

The second file consisted of country information, specifically classifying various countries into specific regions and Income Groups. Data from both these files were joined together to enable categorization of GPI data based on region and income levels.
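
One way such a join could be done with dplyr is sketched below. The post does not say which function was used, so `left_join` is an assumption, as are the column names (`Country.Code`, `Region`, `IncomeGroup`) and the sample values.

```r
library(dplyr)

# Hypothetical GPI table in wide form (one column per year).
gpi_wide <- data.frame(
  Country.Code = c("AAA", "BBB", "CCC"),
  X2014 = c(0.80, 1.01, 0.95),
  X2015 = c(0.82, 1.02, NA),
  stringsAsFactors = FALSE
)

# Hypothetical country metadata; "CCC" has no entry here.
country_info <- data.frame(
  Country.Code = c("AAA", "BBB"),
  Region = c("Region 1", "Region 2"),
  IncomeGroup = c("Low income", "High income"),
  stringsAsFactors = FALSE
)

# Keep every GPI row and attach region and income group where a match exists.
gpi_joined <- left_join(gpi_wide, country_info, by = "Country.Code")
print(gpi_joined)
```

With a left join, rows that have no matching country entry are kept with `NA` in the region and income columns, so they can be inspected or filtered out later.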

In its original form the data was difficult to use directly for visualization in R. To visualize progress across nations, the GPI data, which was organized with a separate column for each year, was converted to a melted (long) form using the reshape2 package. This produced 12098 rows of GPI observations.
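
The melting step can be sketched with `reshape2::melt` as follows. The sample data, the column names and the choice of `na.rm = TRUE` are assumptions made for illustration; the post does not state the exact arguments that were used.

```r
library(reshape2)

# Hypothetical wide table: one row per country, one column per year.
# read.csv typically prefixes numeric column names with "X".
gpi_wide <- data.frame(
  Country.Name = c("Country A", "Country B"),
  Country.Code = c("AAA", "BBB"),
  X1960 = c(0.70, NA),
  X1961 = c(0.72, 0.98),
  stringsAsFactors = FALSE
)

# Melt to long form: one row per country and year, dropping missing values.
gpi_long <- melt(
  gpi_wide,
  id.vars = c("Country.Name", "Country.Code"),
  variable.name = "Year",
  value.name = "GPI",
  na.rm = TRUE
)

# Turn "X1960" into the integer 1960 so the year can be used on a plot axis.
gpi_long$Year <- as.integer(sub("^X", "", as.character(gpi_long$Year)))
print(gpi_long)
```

The long form has one observation per row, which is the shape ggplot2 expects when mapping year to the x axis and GPI to the y axis.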


Income-based Visualization


SVG Version - rplot01
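
An income-group chart of this kind could be sketched with dplyr and ggplot2 as shown below. The sample data and column names are hypothetical, and averaging the countries in each group is an assumption: the post does not say whether the original chart used per-country averages or the summary rows from the source file.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-form data: two countries per income group, two years each.
gpi_long <- data.frame(
  IncomeGroup = rep(c("Low income", "High income"), each = 4),
  Year = rep(c(1970L, 1971L), times = 4),
  GPI = c(0.60, 0.62, 0.64, 0.66, 0.98, 0.99, 1.00, 1.01)
)

# Average GPI per income group and year, ignoring missing values.
gpi_by_income <- gpi_long %>%
  filter(!is.na(GPI)) %>%
  group_by(IncomeGroup, Year) %>%
  summarise(mean_gpi = mean(GPI), .groups = "drop")

# One line per income group, with a dashed reference line at parity (1.0).
p <- ggplot(gpi_by_income, aes(x = Year, y = mean_gpi, colour = IncomeGroup)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(x = "Year", y = "Mean GPI", colour = "Income group")

# print(p) renders the chart on the active graphics device.
```

Swapping `IncomeGroup` for a region column would give the corresponding region-based view.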


Region-based Visualization

SVG Version - rplot13

A global heat map depiction



This introductory data visualization makes it clear that the global gender parity index for secondary school enrollment has steadily climbed toward equality between the two genders (a ratio of 1.0). Regions such as Europe and the Americas have had a good ratio since the 1970s. The Arab, Middle East and North Africa areas, although historically at a low GPI of 0.7, have recently improved the ratio to 0.95. The conflict-affected and Sub-Saharan areas remain a concern, as they have improved the ratio only to 0.8.

There is also a clear pattern of higher-income countries having a better ratio than lower-income countries, which have lagged throughout the 50 years covered.

About Author

Smitha Mathew

Technology enthusiast with attention to detail and global exposure. She is a self-motivated problem solver with experience analyzing data and deriving meaningful statistical information. Her goal is to be able to make a positive difference in people's lives...