Data Visualization on Student Enrollment
Introduction
This visualization project uses data collected by the World Bank on the gender parity index (GPI) for gross secondary school enrollment. This post describes how the data was loaded, cleansed, filtered, and plotted using ggplot2 and dplyr. The reshape2 library was also used to convert the source data into a format suitable for analysis.
Data Source
The data used for this visualization was sourced from the education section of the World Bank Gender Data portal. According to the World Bank:
Gender parity index for gross enrollment ratio in secondary education is the ratio of girls to boys enrolled at secondary level in public and private schools. Ratio of girls to boys gross enrollment ratio in secondary school is calculated by dividing female gross enrollment ratio in secondary education by male gross enrollment ratio in secondary education.
Data on education are collected by the UNESCO Institute for Statistics from official responses to its annual education survey. All the data are mapped to the International Standard Classification of Education (ISCED) to ensure the comparability of education programs at the international level. The current version was formally adopted by UNESCO Member States in 2011.
The reference years reflect the school year for which the data are presented. In some countries the school year spans two calendar years (for example, from September 2010 to June 2011); in these cases the reference year refers to the year in which the school year ended (2011 in the example).
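As a quick illustration of the definition quoted above, the ratio can be computed directly in R. The enrollment figures below are made-up placeholders, not World Bank values, and the variable names are chosen only for this sketch.

```r
# Hypothetical gross enrollment ratios in percent (not real data).
female_ger <- 72
male_ger <- 80

# GPI is the female gross enrollment ratio divided by the male one.
# A value below 1 means fewer girls than boys are enrolled, relative
# to their respective populations; 1.0 means parity.
gpi <- female_ger / male_ger
print(gpi)
```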
Data Cleansing
The raw CSV file was opened in a text editor and, as a preliminary step before analysis in R, content that did not conform to the CSV format was removed. Specifically, some header lines above the column headings were removed. It was also determined that the column header row was missing a field, which led to issues during analysis; this was corrected as well.
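The cleanup described above was done by hand in a text editor. As an alternative, a minimal sketch of skipping non-CSV preamble lines at load time might look like the following. The inline sample, its column names, and the number of preamble lines are all assumptions made for illustration; the real file's layout should be checked before choosing a value for skip.

```r
# Hypothetical miniature of a raw file: preamble lines that are not
# valid CSV rows, followed by the real header and data (made-up values).
raw_lines <- c(
  "Some descriptive preamble line",
  "",
  "Another preamble line",
  "",
  "Country Name,Country Code,1960,1961",
  "Country A,AAA,0.70,0.72",
  "Country B,BBB,,0.95"
)

# skip = 4 discards the first four lines so that the fifth is read as
# the header. check.names = FALSE keeps the year columns named "1960"
# and "1961" instead of "X1960" and "X1961".
gpi_raw <- read.csv(
  text = raw_lines,
  skip = 4,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

str(gpi_raw)
```

Reading the file this way leaves the original download untouched, which makes the cleanup repeatable if the data is refreshed.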
The source data came in two files. The first file contained 264 observations of 62 variables. The majority of the 264 observations corresponded to the nations of the world, and a few rows contained summary data. Of the 62 variables, 58 were related to the GPI data for the years 1960 to 2016.
The second file consisted of country information, specifically the classification of each country into a region and an income group. Data from the two files was joined to enable categorization of the GPI data by region and income level.
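A join of this kind can be sketched with dplyr as follows. The data frames, column names, and values are hypothetical stand-ins for the two files, and the choice of inner_join is an assumption; the post does not say which join was actually used.

```r
library(dplyr)

# Hypothetical stand-in for the GPI file (values are made up).
gpi <- data.frame(
  country_code = c("AAA", "BBB", "ZZZ"),
  gpi_1960 = c(0.70, 0.95, 0.80),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the country information file.
countries <- data.frame(
  country_code = c("AAA", "BBB"),
  region = c("Region One", "Region Two"),
  income_group = c("Low income", "High income"),
  stringsAsFactors = FALSE
)

# inner_join keeps only codes present in both tables, so the row
# "ZZZ", which has no country information here, is dropped.
gpi_by_country <- inner_join(gpi, countries, by = "country_code")

print(gpi_by_country)
```

Using left_join instead would keep unmatched rows and fill region and income_group with NA, which can be useful for spotting summary rows before deciding whether to discard them.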
In its original form the data was difficult to use directly for visualization in R. To visualize progress in different nations, the GPI data, which was organized as a separate column for each year, was converted to a melted (long) form using the reshape2 package. This resulted in 12098 rows of GPI observations.
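The melting step can be illustrated with a small sketch using reshape2::melt. The data frame and its column names are hypothetical, the values are made up, and dropping missing values with na.rm = TRUE is an assumption rather than something the post confirms.

```r
library(reshape2)

# Hypothetical wide table: one row per country, one column per year.
gpi_wide <- data.frame(
  country = c("Country A", "Country B"),
  `1960` = c(0.70, NA),
  `1961` = c(0.72, 0.95),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Melt to long form: one row per country and year.
# na.rm = TRUE drops combinations with no reported GPI.
gpi_long <- melt(
  gpi_wide,
  id.vars = "country",
  variable.name = "year",
  value.name = "gpi",
  na.rm = TRUE
)

# melt stores the former column names as a factor; convert to integers
# so that year can be used as a continuous axis.
gpi_long$year <- as.integer(as.character(gpi_long$year))

print(gpi_long)
```

The long form has one observation per row, which is the shape ggplot2 expects when mapping year to an axis and a grouping variable to colour.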
Visualizations
Income based Visualization
SVG Version - rplot01
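A plot of this general kind could be built along the following lines. This is only a sketch under assumptions: the data frame, column names, and values are made up for illustration, and the original plot may have used different aggregation, geoms, or styling.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-form data (made-up values, not World Bank figures).
gpi_long <- data.frame(
  income_group = rep(c("High income", "Low income"), each = 3),
  year = rep(c(1970L, 1990L, 2010L), times = 2),
  gpi = c(0.95, 0.99, 1.00, 0.55, 0.65, 0.80),
  stringsAsFactors = FALSE
)

# Average GPI per income group and year.
income_trend <- gpi_long %>%
  group_by(income_group, year) %>%
  summarise(mean_gpi = mean(gpi, na.rm = TRUE), .groups = "drop")

# One line per income group, with a dashed reference line at parity.
p <- ggplot(income_trend, aes(x = year, y = mean_gpi, colour = income_group)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  labs(x = "Year", y = "Mean GPI", colour = "Income group")

print(p)
```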
Region based visualization
SVG Version - rplot13
A global heat map depiction
Conclusion
This introductory visualization makes it clear that the global gender parity index for secondary school enrollment has steadily climbed toward parity between the two genders (a ratio of 1.0). We can also see that regions such as Europe and the Americas have had a good ratio since the 1970s. The Arab world and the Middle East and North Africa, although historically at a low GPI of 0.7, have recently improved the ratio to 0.95. Conflict-affected areas and Sub-Saharan Africa remain a concern, as they have improved the ratio only to 0.8.
We can also see a clear pattern of higher-income countries having a better ratio than lower-income countries, which have consistently lagged over the course of 50 years.