import pandas as pd
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
# the datasets used in this project come from the NYC Department of Education
data = {}
for f in data_files:
    d = pd.read_csv("/Users/33Phoebe/Documents/OneDrive/Data Scientist Path/Data Sets/NYC School Data/{0}".format(f))
    key_name = f.replace(".csv", "")
    data[key_name] = d
all_survey = pd.read_csv("/Users/33Phoebe/Documents/OneDrive/Data Scientist Path/Data Sets/NYC School Data/survey_all.txt", delimiter = "\t", encoding = "windows-1252")
d75_survey = pd.read_csv("/Users/33Phoebe/Documents/OneDrive/Data Scientist Path/Data Sets/NYC School Data/survey_d75.txt", delimiter = "\t", encoding = "windows-1252")
#combine the two survey dataframes by stacking them one after another
survey = pd.concat([all_survey, d75_survey], axis = 0)
all_survey.head()
#rename dbn to DBN and select relevant variables
survey["DBN"] = survey["dbn"]
survey = survey[["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]]
data["survey"] = survey
print(data["survey"].shape)
for key, value in data.items():
print(key)
print(value.head())
Data cleaning: make the DBN key consistent across datasets
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
#function to pad a leading 0 onto single-digit CSD values in class_size
def pad(num):
    numstr = str(num)
    if len(numstr) == 2:
        return numstr
    return numstr.zfill(2)  #pad a leading 0 onto a single-digit number
data["class_size"]["padCSD"] = data["class_size"]["CSD"].apply(pad)
#create DBN by concatenating the padded CSD and SCHOOL CODE columns
data["class_size"]["DBN"] = data["class_size"]["padCSD"] + data["class_size"]["SCHOOL CODE"]
data["class_size"].info()
#combine three scores into one
scores = ["SAT Math Avg. Score", "SAT Critical Reading Avg. Score", "SAT Writing Avg. Score"]
for score in scores:
    data["sat_results"][score] = pd.to_numeric(data["sat_results"][score], errors = "coerce")
data["sat_results"]["sat_score"] = data["sat_results"][scores[0]] + data["sat_results"][scores[1]] + data["sat_results"][scores[2]]
data["sat_results"].head()
#coordinate values are embedded in the "Location 1" column of "hs_directory"; extract them into stand-alone columns
import re
def strip(string):
    coordinates = re.findall(r"\(.+\)", string)
    lat = coordinates[0].split(",")[0].replace("(", "")
    return lat
#strip out the latitudes from the coordinates in the Location 1 col
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(strip)
data["hs_directory"].head()
#strip out the longitude and convert both columns to numeric values
def strip_lon(string):
    coordinates = re.findall(r"\(.+\)", string)
    lon = coordinates[0].split(",")[1].replace(")", "")
    return lon
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(strip_lon)
coor = ["lat", "lon"]
directory = data["hs_directory"]
for c in coor:
    directory[c] = pd.to_numeric(directory[c], errors = "coerce")
directory.head()
data["class_size"].head()
#variables "GRADE " and "PROGRAM TYPE" have many redundant or missing values that are not needed in this analysis
print(data["class_size"]["GRADE "].unique())
print(data["class_size"]["PROGRAM TYPE"].unique())
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size.head()
#condense the class_size data and make the DBN unique
import numpy
dbn_group = class_size.groupby("DBN", as_index = False)
class_size = dbn_group.agg(numpy.mean)
data["class_size"] = class_size
data["class_size"].head()
#condense demographics data by selecting the 2011-2012 school year (coded as 20112012)
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["demographics"].head()
#condense graduation data
grad = data["graduation"]
grad = grad[grad["Cohort"] == '2006']
grad = grad[grad["Demographic"] == "Total Cohort"]
data["graduation"] = grad
grad.head()
#work on the ap_2010 dataset
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
ap = data["ap_2010"]
for col in cols:
    ap[col] = pd.to_numeric(ap[col], errors = "coerce")
    print(ap[col].dtypes)
Combine datasets
#combine sat_results with ap_2010 and graduation by left join
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], how = 'left', on = "DBN")
combined = combined.merge(data["graduation"], how = 'left', on = "DBN")
combined.shape
#inner join the other datasets
dfs = [data["class_size"], data["demographics"], data["survey"], data["hs_directory"]]
for df in dfs:
    combined = combined.merge(df, how = "inner", on = "DBN")
print(combined.head())
combined.shape
Clean the combined data
#fill missing values with the column means first; anything still missing is then filled with 0
means = combined.mean()
combined = combined.fillna(means)
combined = combined.fillna(0)
combined.head()
#extract the first two characters of DBN (the school district)
def ext2(string):
    return string[:2]
combined["school_dist"] = combined["DBN"].apply(ext2)
combined["school_dist"].head()
#find correlations between variables and "sat_score"
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
#look at the high correlation between enrollment and sat_score with a scatterplot
import matplotlib.pyplot as plt
%matplotlib inline
combined.plot(x = "total_enrollment", y = "sat_score", kind = "scatter")
#the high correlation between enrollment and sat_score seems to be driven by the cluster of schools in the lower-left corner:
low_enrollment = combined[combined["total_enrollment"] < 1000]
low_enrollment = low_enrollment[low_enrollment["sat_score"] < 1000]
print(low_enrollment["School Name"])
Some research on the internet reveals that these schools target the children of new immigrants, whose unfamiliarity with the English language could be a major barrier preventing them from achieving higher SAT scores.
#this suggests that a higher percentage of English language learners might correlate with lower SAT scores; to look into it:
combined.plot(x = "ell_percent", y = "sat_score", kind = "scatter")
This scatterplot further supports the hypothesis that the language barrier faced by English Language Learners could make them lag behind on the tests.
#focus on NYC by setting longitude and latitude boundaries
from mpl_toolkits.basemap import Basemap
m = Basemap(
projection='merc',
llcrnrlat=40.496044,
urcrnrlat=40.915256,
llcrnrlon=-74.255735,
urcrnrlon=-73.700272,
resolution='i')
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
longitudes = combined["lon"].tolist()
latitudes = combined["lat"].tolist()
#zorder=2 draws the points on top of the map; latlon=True means longitude/latitude values (not projected x, y) are passed in
#cmap sets the color scheme; c=combined["ell_percent"] maps the colors onto that variable
m.scatter(x = longitudes, y = latitudes, s = 20, zorder = 2, latlon = True, c = combined["ell_percent"], cmap = "summer")
plt.show()
#group schools by district
group_dist = combined.groupby("school_dist", as_index = False)
districts = group_dist.agg(numpy.mean)
districts.head(10)
#scatter plot with school districts instead of individual schools
from mpl_toolkits.basemap import Basemap
m = Basemap(
projection='merc',
llcrnrlat=40.496044,
urcrnrlat=40.915256,
llcrnrlon=-74.255735,
urcrnrlon=-73.700272,
resolution='i')
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
longitudes = districts["lon"].tolist()
latitudes = districts["lat"].tolist()
m.scatter(longitudes, latitudes, s = 50, zorder = 2, latlon = True, c = districts["ell_percent"], cmap = "summer")
plt.show()
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_10",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
sortedcorr = correlations[survey_fields].sort_values(ascending = False)
sortedcorr.plot(kind = 'barh', figsize = (12, 6))
Findings
- It appears that the parent response rate is negatively related to SAT score, while the student response rate and students' positive responses are highly correlated with it. The teacher response rate is also positively correlated. (The exact values are printed below.)
- Safety and respect scores and communication quality scores both appear to contribute to a higher SAT score.
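For reference, the exact response-rate correlations behind these statements can be printed (a quick check using the correlations Series computed earlier):
print(correlations[["rr_p", "rr_s", "rr_t"]])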
combined.plot("saf_s_11", "sat_score", kind = 'scatter')
- Below a safety score of about 6.5, safety does not seem to correlate much with SAT score.
- From roughly 7 up to about 8, safety score shows a positive correlation with SAT score; beyond that, the effect disappears (see the binned check below).
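A rough way to verify this is to bin the student safety score and compare the mean SAT score per bin (a quick sketch; the cut points are eyeballed from the plot):
#bin saf_s_11 and compare the mean sat_score in each bin
saf_bins = pd.cut(combined["saf_s_11"], bins = [0, 6.5, 7, 8, 10])
print(combined.groupby(saf_bins)["sat_score"].mean())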
from mpl_toolkits.basemap import Basemap
m = Basemap(
projection='merc',
llcrnrlat=40.496044,
urcrnrlat=40.915256,
llcrnrlon=-74.255735,
urcrnrlon=-73.700272,
resolution='i')
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
m.fillcontinents(color='white',lake_color='#85A6D9')
m.scatter(longitudes, latitudes, s = 50, zorder = 2, latlon = True, c = districts["saf_s_11"], cmap = 'summer')
- It appears the schools in lower Manhattan and northern Brooklyn report higher safety scores.
- Safety scores are lower in the mid-Brooklyn area.
race = ["white_per", "asian_per", "black_per", "hispanic_per"]
sorted_race = correlations[race].sort_values()
sorted_race.plot(kind = 'barh')
- As all of the correlations exceed 0.25 in absolute value, they appear fairly strong for all four groups. However, white and Asian percentages are positively correlated with SAT score, while black and Hispanic percentages show strong negative correlations.
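For reference, the numbers behind this chart, plus a check that every correlation exceeds 0.25 in absolute value:
print(sorted_race)
print((sorted_race.abs() > 0.25).all())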
combined.plot.scatter("hispanic_per", "sat_score")
Hispanic percentage is clearly negatively correlated with SAT score: once the Hispanic (Latino) percentage reaches roughly 25% or higher, no school scores above 1500.
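A quick check of that cutoff (the 25% threshold is eyeballed from the plot):
#highest SAT score among schools with more than 25% hispanic students
print(combined.loc[combined["hispanic_per"] > 25, "sat_score"].max())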
combined.head()
combined["SCHOOL NAME"][combined["hispanic_per"] > 95]
- Again, most of these schools are international schools for immigrants, which might explain the low scores.
print(combined["SCHOOL NAME"][(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)])
print(combined["SCHOOL NAME"][combined["sat_score"] > 1800])
Most of these high schools are "specialized schools" for talented students that only admit applicants who pass an entrance exam, which explains the high SAT scores. Dropping the "hispanic percentage < 10" condition still returns 5 of the same schools. Perhaps Hispanic students tend to do worse on standardized exams, or they are often newer immigrants whose language difficulties keep them from being admitted to many high-quality schools.
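The overlap can be checked directly by intersecting the two lists printed above (a small sketch reusing the same filters):
high_sat = set(combined["SCHOOL NAME"][combined["sat_score"] > 1800])
low_hisp_high_sat = set(combined["SCHOOL NAME"][(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)])
print(len(high_sat), len(low_hisp_high_sat), len(high_sat & low_hisp_high_sat))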
sex = ["male_per", "female_per"]
correlations[sex].plot(kind = 'barh')
Although not very strong, the male percentage has a negative correlation with SAT score, while the female percentage has a positive one.
combined.plot.scatter("female_per", "sat_score")
It is fairly interesting that both a very low and a very high female percentage are associated with low SAT scores: below about 30% or above about 80% female, no school scores higher than 1400. Between roughly 40% and 60%, however, an increasing female percentage seems to have a positive effect.
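The thresholds read off the scatter plot can be checked with a couple of filters (a sketch; the cut points are approximate):
extreme = (combined["female_per"] < 30) | (combined["female_per"] > 80)
print(combined.loc[extreme, "sat_score"].max())
print(combined.loc[combined["female_per"].between(40, 60), "sat_score"].mean())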
combined[(combined["female_per"] > 60) & (combined["sat_score"] > 1700)]
These schools boast strong AP test preparation, which might be a strong indication of why their students perform better on the SAT.
combined["ap_per"] = combined["AP Test Takers "]/combined["total_enrollment"]
combined.plot.scatter('ap_per', 'sat_score')
combined.corr()["ap_per"]["sat_score"]
The correlation between ap_per and sat_score is fairly low, with r < 0.05. However, the scatter plot suggests a positive trend between the two before AP participation reaches about 20%. On the other hand, once AP participation exceeds about 40%, concentrating on AP not only stops helping but appears to hurt the results: no school with more than 40% AP participation scores above 1200.
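A direct check of that last claim (remembering that ap_per is a fraction, so 40% corresponds to 0.4):
#highest SAT score among schools where more than 40% of students took an AP exam
print(combined.loc[combined["ap_per"] > 0.4, "sat_score"].max())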