NYC School Performance, Poverty and Class Size Analysis
March 26, 2015
Considering the importance of education in an increasingly knowledge-based economy, I performed an exploratory data analysis of school performance in relation to various attributes that might influence it, with the following objectives:
- Understand what attributes actually influence school performance.
- Analyse if our current affirmative action plans and college admission policies reflect such influence.
Scope, Variables and Datasets:
The analysis was restricted to:
- NYC public schools (comprising 32 school districts)
- School district size
- English language learners ratio
The SAT score, covering Math, Reading and Writing, was used as the indicator of school performance.
The following datasets were used for the analysis.
- All source datasets were merged by District-id:School-id to create the master file.
- The dataset was scaled and centered because the features are measured on vastly different scales; for example, poverty ratio is a percentage, class size is in tens and SAT scores are in hundreds.
- The dataset was checked for near-zero variance attributes using the nearZeroVar function, so that they could be dropped from the feature set; there were none.
- The dataset was checked for highly correlated variables using the vif function, so that they could be dropped from the feature set; there were none.
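The preparation steps above can be sketched in base R as follows. This is only an illustration on made-up data: the data frames, the key column and the values are assumptions, the near-zero variance check is a crude simplification of what nearZeroVar does, and the variance inflation factor is computed by hand rather than with the vif function used in the analysis.

```r
# Hypothetical toy inputs; names and values are assumptions, not the real files.
poverty <- data.frame(
  key = paste0("D", 1:6, ":S", 1:6),          # District-id:School-id style key
  poverty.ratio = c(82, 65, 91, 40, 77, 58)
)
classes <- data.frame(
  key = paste0("D", 1:6, ":S", 1:6),
  size = c(22, 27, 20, 30, 24, 26),
  female.ratio = c(51, 48, 50, 47, 52, 49)
)

# Merge the sources on the shared key to build a master table.
master <- merge(poverty, classes, by = "key")

# Centre and scale the numeric features (mean 0, standard deviation 1).
features <- master[, c("poverty.ratio", "size", "female.ratio")]
scaled <- as.data.frame(scale(features))

# Crude stand-in for a near-zero variance check:
# flag features where fewer than 10% of the values are distinct.
low_variety <- sapply(features, function(x) length(unique(x)) / length(x) < 0.1)

# Variance inflation factor by hand: 1 / (1 - R^2), where R^2 comes from
# regressing each feature on all the other features.
vif_by_hand <- sapply(names(scaled), function(v) {
  others <- setdiff(names(scaled), v)
  r2 <- summary(lm(reformulate(others, response = v), data = scaled))$r.squared
  1 / (1 - r2)
})
```

Features with a large value in vif_by_hand, or flagged in low_variety, would be the candidates to drop before modelling.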
regsubsets was used for feature selection. Of the 8 features in total, it picked the following 3 as having some influence on SAT scores.
To cross-validate, feature selection was repeated with the step function, which picked the same 3 features.
##   (Intercept) poverty.ratio          size  female.ratio
##    -4.4799063    -0.4308900     0.1759099     0.1954207
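A stepwise cross-check of this kind might look roughly like the sketch below. It uses base R's step function with backward elimination on synthetic data. The data frame toy, its noise column and the simulated coefficients are assumptions made for illustration, and the direction and settings the analysis actually used are not stated.

```r
# Synthetic stand-in for the scaled district data (32 rows, like 32 districts).
set.seed(42)
n <- 32
toy <- data.frame(
  poverty.ratio = rnorm(n),
  size = rnorm(n),
  female.ratio = rnorm(n),
  noise = rnorm(n)                 # a feature with no real influence
)
# Simulated response: depends on poverty and class size only (an assumption).
toy$total.percent <- -0.5 * toy$poverty.ratio + 0.3 * toy$size + rnorm(n, sd = 0.5)

# Start from the full model and let step() drop terms while AIC improves.
full <- lm(total.percent ~ ., data = toy)
selected <- step(full, direction = "backward", trace = 0)

# Names of the predictors that survived the selection.
kept <- attr(terms(selected), "term.labels")
print(kept)
```

Comparing kept with the subset chosen by a second method is a cheap sanity check that the selection is not an artifact of one algorithm.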
Influence of Poverty, Class Size and Gender Ratio on School Performance:
A linear regression of School Performance with these three variables as the predictors was performed.
##
## Call:
## lm(formula = total.percent ~ poverty.ratio + size + female.ratio,
##     data = scaled.district.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.0172 -0.4736 -0.1017  0.3043  1.8254
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -4.47991    1.55994  -2.872  0.00769 **
## poverty.ratio -0.43089    0.14734  -2.925  0.00676 **
## size           0.17591    0.06108   2.880  0.00754 **
## female.ratio   0.19542    0.12841   1.522  0.13925
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6657 on 28 degrees of freedom
## Multiple R-squared:  0.5997, Adjusted R-squared:  0.5568
## F-statistic: 13.98 on 3 and 28 DF,  p-value: 9.303e-06
Only the poverty and class size features displayed a statistically significant influence (p < 0.01 for both, against p = 0.139 for gender ratio), so gender ratio was dropped from further analysis.
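The significance check can be read straight off the fitted model. The sketch below fits the same formula on synthetic data and keeps the predictors whose p-value falls under a cutoff. The data, the simulated effects and the 0.05 threshold are assumptions chosen for illustration.

```r
# Synthetic stand-in for the scaled district data.
set.seed(7)
n <- 32
toy <- data.frame(
  poverty.ratio = rnorm(n),
  size = rnorm(n),
  female.ratio = rnorm(n)
)
toy$total.percent <- -0.5 * toy$poverty.ratio + 0.3 * toy$size + rnorm(n, sd = 0.5)

# Fit the three-predictor linear model.
fit <- lm(total.percent ~ poverty.ratio + size + female.ratio, data = toy)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- coef(summary(fit))

# Drop the intercept row and keep predictors below the chosen cutoff.
p_values <- coef_table[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)
```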
I decided to take a closer look at the impact of these two key attributes on overall performance.
When school performance was plotted against class size and against poverty, each with a regression line, the result was a surprise.
While the influence of poverty on SAT scores was in line with expectations (higher poverty rates go with lower scores), the impact of class size was totally unexpected.
The trend line shows performance declining with smaller class sizes.
Taking a second look at these plots, the impact of class size on school performance looks almost like a mirror image of the poverty plot. I wanted to understand the relation between these two factors. What I found was really interesting.