In [1]:
import pandas
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
recent_grads = pandas.read_csv("/Users/33Phoebe/Documents/OneDrive/Data Scientist Path/Data Sets/recent-grads.csv")
recent_grads.iloc[0]
Out[3]:
In [4]:
recent_grads.head()
Out[4]:
This dataset is based on the American Community Survey 2010-2012 Public Use Microdata Series, edited and provided by FiveThirtyEight.
In [5]:
recent_grads.tail()
Out[5]:
In [6]:
recent_grads.describe()
Out[6]:
In [7]:
raw_data_counts = len(recent_grads)
recent_grads.shape
Out[7]:
In [9]:
recent_grads = recent_grads.dropna()
cleaned_data_count = len(recent_grads)
cleaned_data_count
Out[9]:
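Before dropping rows, it can help to see which columns actually contain missing values and how many rows are lost. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the real `recent_grads` data):

```python
import pandas as pd

# Made-up stand-in for recent_grads; the real CSV is not loaded here.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Median": [30000, 45000, 52000, 38000],
    "ShareWomen": [0.62, None, 0.18, 0.75],
})

raw_count = len(toy)
missing_per_column = toy.isna().sum()   # number of missing values in each column
cleaned = toy.dropna()                  # drop every row that has any missing value
dropped_count = raw_count - len(cleaned)

# Show only the columns that have at least one missing value.
print(missing_per_column[missing_per_column > 0])
print(f"dropped {dropped_count} of {raw_count} rows")
```

Running the same three steps on `recent_grads` would show whether the rows removed by `dropna()` come from one column or several.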
In [10]:
cols = ["Sample_size", "Median", "Unemployment_rate", "Full_time", "ShareWomen", "Men", "Women"]
recent_grads.plot(x = cols[0], y= cols[1], kind = 'scatter', title = 'Sample size Vs. Median')
recent_grads.plot(x = cols[0], y= cols[2], kind = 'scatter', title = 'Sample size Vs. Unemployment_rate')
recent_grads.plot(x = cols[3], y= cols[1], kind = 'scatter', title = 'Full time Vs. Median')
recent_grads.plot(x = cols[4], y= cols[2], kind = 'scatter', title = 'ShareWomen Vs. Unemployment rate')
recent_grads.plot(x = cols[5], y= cols[1], kind = 'scatter', title = 'Men Vs. Median')
recent_grads.plot(x = cols[6], y= cols[1], kind = 'scatter', title = 'Women Vs. Median')
Out[10]:
Findings
- No clear relationship between the popularity of a major and its median income
- No clear relationship between the share of women in a major and its unemployment rate
- No clear relationship between the number of full-time employees and the median salary
- However, comparing the Men vs. Median and Women vs. Median plots, majors with many men appear to have a higher median income on average than majors with many women
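Reading "no clear relationship" off a scatter plot is subjective, so one way to back it up is a Pearson correlation coefficient. This is only a sketch on made-up numbers (the `toy` frame and the `pearson` helper are assumptions for illustration); a value near 0 suggests no linear relationship, but it says nothing about non-linear patterns.

```python
import pandas as pd

# Made-up values that reuse the notebook's column names.
toy = pd.DataFrame({
    "Sample_size": [10, 50, 120, 300, 900],
    "Median": [40000, 31000, 52000, 36000, 38000],
    "ShareWomen": [0.20, 0.85, 0.10, 0.60, 0.55],
})

def pearson(df, x, y):
    """Pearson correlation between two columns (hypothetical helper)."""
    return df[x].corr(df[y])

size_vs_median = pearson(toy, "Sample_size", "Median")
women_vs_median = pearson(toy, "ShareWomen", "Median")

print(f"Sample_size vs Median: {size_vs_median:.2f}")
print(f"ShareWomen vs Median:  {women_vs_median:.2f}")
```

On the real data, `recent_grads[cols].corr()` would give the full matrix for all the plotted columns at once.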
In [11]:
ax = recent_grads[cols[0]].hist(bins = 25, range =(0, 400))
ax.set_title("sample size")
Out[11]:
The histogram of sample size shows that most majors have fewer than 400 respondents, and about 17.5% of the majors fall in the lowest bin (fewer than 16 respondents).
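To read exact numbers rather than estimating them from bar heights, the same `bins` and `range` settings can be passed to `numpy.histogram`. The sketch below uses made-up sample sizes (an assumption, not the real column) and shows that with 25 bins over 0 to 400 each bin is 16 wide, so the first bar covers sample sizes from 0 up to, but not including, 16.

```python
import numpy as np

# Made-up sample sizes; in the notebook this would be recent_grads["Sample_size"].
sample_sizes = np.array([2, 10, 15, 16, 100, 399, 500])

# Same settings as the plotted histogram: 25 bins over the range 0 to 400.
counts, edges = np.histogram(sample_sizes, bins=25, range=(0, 400))

bin_width = edges[1] - edges[0]
# Fraction of all majors that land in the first bin.
share_in_first_bin = counts[0] / len(sample_sizes)

print(f"bin width: {bin_width}")
print(f"majors in first bin: {counts[0]} ({share_in_first_bin:.1%})")
print(f"majors outside the plotted range: {len(sample_sizes) - counts.sum()}")
```

Values above 400 are left out of the counts entirely, which is worth keeping in mind when quoting percentages from a histogram drawn with a restricted `range`.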
In [12]:
ax = recent_grads[cols[1]].hist(bins = 25, range =(20000, 80000))
ax.set_title("Median Salary")
Out[12]:
For most majors, graduates have a median salary between 30,000 and 40,000.
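The most common salary band can also be found numerically by binning the medians. This sketch assumes 10,000-wide bands (the plotted histogram uses 25 narrower bins) and made-up salaries:

```python
import pandas as pd

# Made-up median salaries; in the notebook this would be recent_grads["Median"].
medians = pd.Series([25000, 32000, 35000, 38000, 45000, 61000])

# Band edges every 10,000 from 20,000 to 80,000 (an assumed bin width).
band_edges = list(range(20000, 90000, 10000))
bands = pd.cut(medians, bins=band_edges)

# Count majors per band, keeping the bands in ascending order.
band_counts = bands.value_counts().sort_index()
most_common_band = band_counts.idxmax()

print(band_counts)
print("most common band:", most_common_band)
```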
In [13]:
col = ["Employed", "Full_time", "ShareWomen", "Unemployment_rate", "Men", "Women"]
ax = recent_grads[col[0]].hist(bins = 25, range =(0, 200000))
ax.set_title("Employed")
# Employed is concentrated between 0 and 50,000. Does this mean most majors are fairly new?
Out[13]:
In [14]:
ax = recent_grads[col[1]].hist(bins = 25, range = (0, 175000))
ax.set_title("Full Time")
Out[14]:
In [15]:
ax = recent_grads[col[2]].hist(bins = 50)
ax.set_title("ShareWomen")
# With a few exceptions, most majors have a substantial share of women students
Out[15]:
In [16]:
ax = recent_grads[col[3]].hist(bins = 25)
ax.set_title("Unemployment Rate")
# Most majors have an unemployment rate of around 6% to 8%
Out[16]:
In [17]:
ax = recent_grads[col[4]].hist(bins = 25, range = (0, 40000))
ax.set_title("Men")
Out[17]:
In [18]:
ax = recent_grads[col[5]].hist(bins = 25, range = (0, 50000))
ax.set_title("Women")
Out[18]:
Findings
- Judging by the ShareWomen histogram, there are about as many predominantly male majors as predominantly female majors
- The most common median salary is between 30,000 and 40,000
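The balance between predominantly male and predominantly female majors can be checked directly instead of by eye. A minimal sketch, assuming "predominantly female" means `ShareWomen` above 0.5 and using made-up shares:

```python
import pandas as pd

# Made-up shares; in the notebook this would be recent_grads["ShareWomen"].
share_women = pd.Series([0.12, 0.35, 0.48, 0.52, 0.77, 0.91])

# True where women are more than half of a major's graduates (assumed threshold).
majority_women = share_women > 0.5

# The mean of a boolean Series is the fraction of True values.
fraction_majority_women = majority_women.mean()

print(f"majors with more women than men: {fraction_majority_women:.0%}")
```

A result close to 50% on the real data would support the first finding.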
In [20]:
# pandas.tools.plotting was removed from pandas; scatter_matrix now lives in pandas.plotting
from pandas.plotting import scatter_matrix
In [21]:
scatter_matrix(recent_grads[["Sample_size", "Median"]], figsize = (8, 8))
Out[21]:
In [24]:
scatter_matrix(recent_grads[["Sample_size", "Median", "Unemployment_rate"]], figsize = (9, 9))
Out[24]:
In [25]:
recent_grads[:5]["Women"].plot(kind = "bar")
Out[25]:
In [26]:
recent_grads[:5].plot.bar(x = "Major", y = "Women")
Out[26]:
In [32]:
sorted_rg = recent_grads.sort_values("Median")
sorted_rg.head()
sorted_rg.tail()
Out[32]:
In [31]:
sorted_rg[:10].plot.bar("Major", "ShareWomen")
Out[31]:
The 10 lowest-income majors are made up mostly of women: three of them are about 90% women, and even the major with the fewest women in this group is more than 60% women.
In [39]:
sorted_rg[-10:].plot.barh("Major", "ShareWomen")
Out[39]:
In comparison, the 10 highest-paid majors have a very low share of women: around 15% on average, with some below 10%.
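The averages quoted from the bar charts can be computed from the sorted frame. This is a sketch on a made-up six-row frame with `n = 2` (both are assumptions for illustration; the notebook uses the 10 lowest and 10 highest majors):

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D", "E", "F"],
    "Median": [22000, 26000, 35000, 40000, 75000, 110000],
    "ShareWomen": [0.90, 0.70, 0.55, 0.45, 0.20, 0.10],
})

n = 2  # the notebook compares groups of 10
ranked = toy.sort_values("Median")   # ascending, so lowest-paid majors come first
lowest_paid = ranked.head(n)
highest_paid = ranked.tail(n)

low_share = lowest_paid["ShareWomen"].mean()
high_share = highest_paid["ShareWomen"].mean()

print(f"mean ShareWomen, {n} lowest-paid majors:  {low_share:.2f}")
print(f"mean ShareWomen, {n} highest-paid majors: {high_share:.2f}")
```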
In [40]:
sorted_rg[:10].plot.barh("Major", "Unemployment_rate")
sorted_rg[-10:].plot.barh("Major", "Unemployment_rate")
Out[40]:
These female-dominated majors have a somewhat higher unemployment rate than the male-dominated majors.
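A different way to test this claim is to split all majors by whether women are the majority and compare mean unemployment rates. Note that this split is an assumption and is not the same as the notebook's comparison of the 10 lowest-paid and 10 highest-paid majors; the data below is made up.

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "ShareWomen": [0.90, 0.70, 0.55, 0.45, 0.20, 0.10],
    "Unemployment_rate": [0.09, 0.07, 0.08, 0.06, 0.05, 0.07],
})

# Label each major by which group is the majority (assumed 0.5 threshold).
toy["group"] = (toy["ShareWomen"] > 0.5).map(
    {True: "majority women", False: "majority men"}
)

# Mean unemployment rate within each group.
mean_unemployment = toy.groupby("group")["Unemployment_rate"].mean()

print(mean_unemployment)
```

With so few majors in each group, a difference in means is only suggestive; looking at the spread within each group as well would make the comparison more convincing.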