Data Comparison Between Portuguese Schools
Contributed by Shuo Zhang. She is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from July 5th to September 23rd, 2016. Please refer to the following link for the R code:
Data shows education is a key factor for achieving long-term economic growth. Determinants of students’ performance have been the subject of ongoing debate among educators, academics, and policy makers.
This study focuses on secondary education in Portugal. Over the last few decades, the Portuguese education level has improved. In secondary schools, the core classes of Mathematics and Portuguese (the native language) are the most important, since they provide the fundamental knowledge needed for success in the remaining school subjects (e.g. physics or history).
The data on student performance in Mathematics and Portuguese holds valuable information. It can be used to improve decision making by parents and schools and to optimize student success. Modeling student performance is an important tool for educators, parents, and students alike. It can help us better understand what drives performance and ultimately improve it.
Data Set description
This data set provides information about student achievement in two Portuguese secondary schools. The data attributes include student grades and demographic, social, and school-related features. It was collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The goal is to investigate the contributing factors associated with G3 (the final year grade).
The raw data contains 382 observations and 53 variables. The target variables are G3.x (the final year grade in Math) and G3.y (the final year grade in Portuguese). The contributing factors are presented in 3 categories: school-related (e.g. extra educational support from the school), student-related (e.g. past course performance, age, study time, desire to pursue higher education) and family-related (e.g. parents' status, quality of family relationships, parents' education and jobs). I analyzed most variables and list the top contributing factors in the illustrations that follow.
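The .x and .y suffixes on G3 are what base R's merge() adds by default when two data frames share a column name that is not part of the join key. The sketch below shows one plausible way a combined table like this could be built. The toy rows and the choice of key columns are assumptions made for illustration only, not the actual data or the author's code.

```r
# Toy stand-ins for the two subject files; every value here is made up.
mat <- data.frame(
  school = c("GP", "GP", "MS"),
  sex    = c("F", "M", "F"),
  age    = c(15, 16, 17),
  G3     = c(12, 0, 9),
  stringsAsFactors = FALSE
)
por <- data.frame(
  school = c("GP", "MS", "MS"),
  sex    = c("F", "F", "M"),
  age    = c(15, 17, 18),
  G3     = c(14, 11, 10),
  stringsAsFactors = FALSE
)

# Hypothetical key: columns assumed to identify the same student in both files.
key_cols <- c("school", "sex", "age")

# merge() keeps only the students found in both files and appends .x / .y
# to the shared non-key column, giving G3.x (Math) and G3.y (Portuguese).
both <- merge(mat, por, by = key_cols)
print(both)
```

With real data, the key would need enough attributes to match each student uniquely across the two files.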
Data Visualization in R
Which school has better student performance?
The boxplots of the final year grade distribution show the difference in student performance by school. In this graph we can see that for the GP school the median final year grade is C for Math and B for Portuguese, while for the MS school the median final year grade is D for Math and C for Portuguese. We can conclude that the GP school has better student performance. In the following analysis, I separate the plots by school.
Does the current student performance have a correlation with the past?
The next question is whether G1 and G2 (the first- and second-period grades) significantly influence G3. Let's take math performance as an example.
A couple of interesting facts show up in these graphs. First, the data falls into two groups: a cluster with a final grade of 0 (students who dropped the course) and a group with a strong correlation between G3 and both G2 and G1 (students who did not drop the course). The analysis is therefore divided into two parts based on this pattern.
Students who did not drop a class:
The figure shows a linear relationship between the final grade and the earlier grades: the better a student did in the first and second periods, the higher the final year grade tends to be.
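As a rough sketch of how this relationship could be quantified in R, the example below filters out final grades of 0, then computes the correlation of G3 with G1 and G2 and fits a simple linear model. The grades are invented toy values and the object names are hypothetical; nothing here reproduces the study's numbers.

```r
# Made-up grades on a 0-20 scale; a final grade of 0 stands for a dropped course.
grades <- data.frame(
  G1 = c(8, 10, 12, 14, 16, 7, 6),
  G2 = c(9, 10, 13, 14, 17, 5, 0),
  G3 = c(9, 11, 13, 15, 17, 0, 0)
)

# Keep only the students with a non-zero final grade.
completed <- subset(grades, G3 > 0)

# Pearson correlation of the final grade with each earlier period.
cor_g1 <- cor(completed$G1, completed$G3)
cor_g2 <- cor(completed$G2, completed$G3)

# Simple linear fit of G3 on G2 for the same students.
fit <- lm(G3 ~ G2, data = completed)

print(c(G1 = cor_g1, G2 = cor_g2))
print(coef(fit))
```

Leaving the zero grades in would pull the fitted line down and understate the relationship, which is why the two groups are treated separately.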
Students who dropped a class:
Upon further inspection of the data, it becomes clear that the group with a final grade of 0 most likely consists of students who dropped the course. Two facts stand out in the previous graph. First, these students have G1 and/or G2 grades but a final grade of 0. Second, there are no G1 grades of 0, but there are G2 grades of 0.
The graph shows that the students with a G3 of 0 had failing grades in both G1 and G2. Further investigation of the data shows that 13 students dropped at G2 and 39 students dropped at G3, and that all of the students who dropped at G2 also dropped at G3.
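Counts like these can be checked with a few logical vectors. The following is a minimal sketch on invented toy grades, treating a grade of 0 as a dropped course, which is the assumption made in the text above. It counts the zero grades per period and looks at how the zero-G2 and zero-G3 groups overlap.

```r
# Made-up grades; 0 is taken to mean the student dropped at that point.
grades <- data.frame(
  G1 = c(8, 10, 7, 6, 5, 12),
  G2 = c(9, 10, 5, 0, 0, 13),
  G3 = c(9, 11, 0, 0, 0, 13)
)

dropped_g2 <- grades$G2 == 0
dropped_g3 <- grades$G3 == 0

# Number of zero grades in each period.
n_g1 <- sum(grades$G1 == 0)
n_g2 <- sum(dropped_g2)
n_g3 <- sum(dropped_g3)

# TRUE when every student with a zero G2 also has a zero G3.
g2_within_g3 <- all(dropped_g3[dropped_g2])

# Cross-tabulation of the two groups shows the overlap directly.
print(table(G2_zero = dropped_g2, G3_zero = dropped_g3))
```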
Is student performance affected by past class failure?
The graph shows that past class failures play a role in current student performance: successful students tend to have a history of success.
Does student performance change based on age?
From this graph, we can conclude that the age of the student is also a factor in the final year grade. The older the student is, the lower the final year grade is likely to be.
Does the student who wants to take higher education do better at school?
In terms of study motivation, the student with a desire to pursue higher education has a higher probability of achieving success.
Is it true that the more time a student spends on studying, the greater his chances are of getting a higher grade?
The graph shows an association between study time and the final year grade; the successful students tend to spend more time on coursework.
Does absence relate to student performance?
It is hard to see a relationship between the number of school absences and the final year grade from the raw plot. To make the plot easier to read, I grouped the number of school absences into 4 categories: 0-9, 10-19, 20-29, 30+.
The new graph shows that successful students tend to have fewer school absences.
Do the parents' education and jobs influence student performance?
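The grouping into 0-9, 10-19, 20-29 and 30+ can be done with base R's cut(). This is a small sketch on made-up absence counts; the variable names are hypothetical and the break points are simply chosen to match the four categories named above.

```r
# Made-up absence counts for illustration.
absences <- c(0, 4, 9, 10, 19, 20, 29, 30, 75)

# right = FALSE makes each interval closed on the left:
# [0,10), [10,20), [20,30), [30,Inf), which matches 0-9, 10-19, 20-29, 30+.
absence_group <- cut(
  absences,
  breaks = c(0, 10, 20, 30, Inf),
  labels = c("0-9", "10-19", "20-29", "30+"),
  right  = FALSE
)

# Number of students falling into each category.
print(table(absence_group))
```

The resulting factor can then be used as the grouping variable in a boxplot of final grades.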
Let's take the mother's education and job as an example.
The boxplots at the left of the graph indicate that students with working mothers tend to have better course performance than those with stay-at-home mothers. The boxplots at the right of the graph show that students whose mothers have a higher education level are more likely to succeed in their courses.
Furthermore, plotting the mother's job against her education shows how the job distribution varies across education levels. Here we see that working mothers make up a greater share of the higher education levels, and mothers who work as teachers have the most advanced education on average. In conclusion, students with a working and well-educated mother tend to be more successful.
What is the top consideration in choosing a school?
From the graph we can see that the first consideration is the quality of the school's courses and the second is whether the school is close to home.
I have presented data visualizations of secondary student grades in two core classes (Mathematics and Portuguese), using past school grades (first and second periods) together with demographic and school-related data. Student achievement is strongly associated with previous performance. Other relevant factors also contribute to student performance, including school-related and demographic ones (e.g. student's age, study time, desire to pursue higher education, parents' jobs and education). The conclusions are summarized below:
- The GP school has more successful students than the MS school.
- The final year grade is strongly related to the first- and second-period grades. Students are more likely to drop a course if they have had bad initial grades in that course.
- Successful students tend to be younger, have a history of success and a desire to continue on to higher education, be absent less often, and spend more time on coursework.
- Successful students tend to have working and well-educated parents.
If more data were available on student performance from more schools in the same community and in more subjects (e.g. history), the analysis could be more accurate.