NYC School Performance, Poverty and Class Size Analysis

Posted on Mar 30, 2015


Considering the importance of education in an increasingly knowledge-based economy, I performed an exploratory data analysis of school performance in relation to various attributes that might influence it, with the following objectives:

  • Understand what attributes actually influence school performance.
  • Analyse whether our current affirmative action plans and college admission policies reflect that influence.

Scope, Variables and Datasets:

The analysis was restricted to NYC public schools (comprising 32 school districts).

Factors considered:

  • Attendance rate
  • School safety
  • Class size
  • School district size
  • Poverty ratio
  • Ethnic background
  • Gender ratio
  • English language learners ratio

SAT scores, covering Math, Reading and Writing, were used as the indicator of school performance.

The following datasets were used for the analysis:

  1. School Attendance File
  2. Class size File
  3. Demographics File
  4. School Safety report
  5. 2010 SAT score file
  6. 2014 SAT score file


  • All source datasets were merged by District-id:School-id to create the master file.
  • The dataset was scaled and centered because the features are measured on very different scales: for example, poverty ratio is a percentage, class size is in the tens and SAT scores are in the hundreds.
  • The dataset was checked for near-zero-variance attributes using the nearZeroVar function, so that they could be dropped from the feature set; there were none.
  • The dataset was checked for highly correlated (collinear) variables using the vif function, so that they could be dropped from the feature set; there were none.
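
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the R code used for the analysis: the school keys and values are made up, the helper names (merge_by_key, standardize, scale_columns, near_zero_variance) are hypothetical, and the near-zero-variance check is only loosely modeled on the frequency-ratio and percent-unique cutoffs that nearZeroVar applies by default.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical per-school tables keyed by "district:school" (made-up values).
attendance = {"01:015": 91.2, "01:019": 88.5, "02:047": 93.0}
class_size = {"01:015": 24.0, "01:019": 27.5, "02:047": 22.0}
sat_total = {"01:015": 1210.0, "01:019": 1145.0, "02:047": 1302.0}


def merge_by_key(**tables):
    """Inner-join several {key: value} tables on the keys they all share."""
    shared = set.intersection(*(set(t) for t in tables.values()))
    return {k: {name: t[k] for name, t in tables.items()} for k in sorted(shared)}


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def scale_columns(master):
    """Standardize every column of the merged table."""
    keys = list(master)
    columns = list(next(iter(master.values())))
    scaled = {k: {} for k in keys}
    for col in columns:
        z_scores = standardize([master[k][col] for k in keys])
        for k, z in zip(keys, z_scores):
            scaled[k][col] = z
    return scaled


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column that is constant or dominated by a single value."""
    top = Counter(values).most_common(2)
    if len(top) == 1:
        return True  # constant column
    freq_ratio = top[0][1] / top[1][1]
    pct_unique = 100.0 * len(set(values)) / len(values)
    return freq_ratio > freq_cut and pct_unique < unique_cut


master = merge_by_key(attendance=attendance, class_size=class_size, sat_total=sat_total)
scaled = scale_columns(master)
```

After scaling, every column has mean 0 and standard deviation 1, so a percentage and a score in the hundreds contribute on a comparable footing.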

Feature Selection:

regsubsets was used for feature selection. Of the 8 features considered, it picked the following 3 as having some influence on SAT scores:

  • Class size
  • Poverty ratio
  • Gender ratio
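
To give a feel for what a best-subsets search does, here is a small sketch in plain Python. It is not the regsubsets implementation and not the code behind the results above: it simply fits an ordinary least squares model for every combination of features of a given size and keeps the one with the lowest residual sum of squares. The data are synthetic (the response is constructed from class size and poverty only), and the function names solve, rss and best_subsets are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(rows, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus the given columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2 for d, t in zip(design, y))


def best_subsets(rows, y, names, max_size):
    """For each subset size up to max_size, return the feature names with the lowest RSS."""
    best = {}
    for k in range(1, max_size + 1):
        winner = min(combinations(range(len(names)), k), key=lambda cols: rss(rows, y, cols))
        best[k] = [names[c] for c in winner]
    return best


# Synthetic data: the score depends on class size and poverty only (made-up relation).
names = ["class_size", "poverty", "attendance"]
rows = [
    [20.0, 80.0, 90.0],
    [22.0, 60.0, 85.0],
    [25.0, 70.0, 92.0],
    [27.0, 40.0, 88.0],
    [30.0, 50.0, 95.0],
    [32.0, 20.0, 83.0],
]
y = [1500.0 - 5.0 * r[0] - 3.0 * r[1] for r in rows]

best = best_subsets(rows, y, names, max_size=2)
```

Adding features never increases the RSS, so the winners of different sizes are usually compared with a penalized criterion such as adjusted R-squared, Cp or BIC. Which criterion led to the 3 features above is not stated here.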