import pandas as pd
starwars = pd.read_csv("/Users/33Phoebe/Documents/OneDrive/Data Scientist Path/Data Sets/starwars.csv", encoding = "ISO-8859-1")
starwars.head(10)
This dataset contains the results of a survey conducted by FiveThirtyEight in June 2014, with 1,186 respondents from SurveyMonkey Audience.
starwars.columns
# drop rows where RespondentID is missing
starwars = starwars[pd.notnull(starwars["RespondentID"])]
starwars.shape
Converting Yes/No to Boolean with the Series.map() Method
def convert(series):
    # map "Yes"/"No" strings to booleans; any other value becomes NaN
    yes_no = {"Yes": True, "No": False}
    return series.map(yes_no)
cols = ["Have you seen any of the 6 films in the Star Wars franchise?", "Do you consider yourself to be a fan of the Star Wars film franchise?"]
for col in cols:
    starwars[col] = convert(starwars[col])
starwars.head()
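To illustrate how Series.map() behaves with a dictionary, here is a minimal sketch on toy data (the `answers` Series is made up for illustration and is not taken from the survey file). Values that have no key in the dictionary, including missing values, come back as NaN rather than raising an error.

```python
import pandas as pd

# Toy answers, not taken from the survey file
answers = pd.Series(["Yes", "No", None, "Maybe"])

yes_no = {"Yes": True, "No": False}
mapped = answers.map(yes_no)

# "Yes"/"No" become booleans; None and "Maybe" have no key and become NaN
print(mapped.tolist())
print(mapped.isna().tolist())
```

Checking `mapped.isna()` after a conversion like this is a quick way to spot answers that the dictionary did not anticipate.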
starwar_movies = ["Star Wars: Episode I The Phantom Menace", "Star Wars: Episode II Attack of the Clones", "Star Wars: Episode III Revenge of the Sith", "Star Wars: Episode IV A New Hope", "Star Wars: Episode V The Empire Strikes Back", "Star Wars: Episode VI Return of the Jedi"]
movie_dict = {}
for movie in starwar_movies:
    movie_dict[movie] = True
# make sure the names in starwar_movies match the entries in each column
for col in starwars.columns[3:9]:
    print(starwars[col].unique())
import numpy as np
movie_dict[np.nan] = False
print(movie_dict)
cols = starwars.columns[3:9].tolist()
print(cols)
rename_dict = {}
num = 1
for col in cols:
    rename_dict[col] = 'seen_' + str(num)
    num += 1
rename_dict
# rename the six "seen" columns (positions 3 through 8):
starwars = starwars.rename(columns = rename_dict)
# work on a copy so the original frame is only replaced once the mapping is done
newcol = starwars.copy()
for col in newcol.columns[3:9]:
    newcol[col] = newcol[col].map(movie_dict)
newcol.head()
starwars = newcol
starwars.head()
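As an alternative to building a title-to-True dictionary, the same flags can be sketched with `DataFrame.notna()`, assuming that a respondent's cell holds the movie title when the film was seen and NaN otherwise. The `toy` frame below is hypothetical and only stands in for the six "seen" columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the "seen" columns (hypothetical data)
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan],
    "seen_2": [np.nan, np.nan],
})

# Any non-missing cell counts as "seen"
seen_flags = toy.notna()
print(seen_flags)
```

The trade-off is that `notna()` treats every non-missing entry as True, so an unexpected or misspelled title would go unnoticed, whereas the dictionary mapping would turn it into NaN.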
# convert the six ranking columns to float
starwars[starwars.columns[9:15]] = starwars[starwars.columns[9:15]].astype(float)
rename_dict2 = {}
ranking_cols = starwars.columns[9:15].tolist()
num = 1
for col in ranking_cols:
    rename_dict2[col] = 'ranking_' + str(num)
    num += 1
rename_dict2
starwars = starwars.rename(columns = rename_dict2)
starwars.head()
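The counter-based renaming loops above can also be written more compactly with `enumerate` and a dictionary comprehension. This is only a sketch on a toy frame; the column names in `toy` are placeholders, not the real survey headers.

```python
import pandas as pd

# Empty toy frame with placeholder column names (not the real survey headers)
toy = pd.DataFrame(columns=["RespondentID", "Q1 first film", "Unnamed: 1", "Unnamed: 2"])

old_names = toy.columns[1:4]
# enumerate(..., start=1) replaces the manual num counter
rename_map = {old: f"ranking_{i}" for i, old in enumerate(old_names, start=1)}

toy = toy.rename(columns=rename_map)
print(list(toy.columns))
```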
ranking = starwars[starwars.columns[9:15]].mean()
import matplotlib.pyplot as plt
%matplotlib inline
ranking.plot.bar()
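From the bar chart above, we can tell that movie 5 is ranked highest in the franchise. Note that a lower mean value indicates a better ranking, since respondents gave 1 to their favorite film.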
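When reading mean rankings, it helps to pick out the best-ranked column programmatically. The sketch below uses made-up numbers and assumes that a rank of 1 marks a respondent's favorite film, so the smallest mean wins; `mean()` skips NaN values by default, which matters for respondents who left a ranking blank.

```python
import numpy as np
import pandas as pd

# Made-up rankings for three films (1 = favorite, by assumption)
rankings = pd.DataFrame({
    "ranking_1": [3.0, 2.0, np.nan],
    "ranking_2": [2.0, 3.0, 2.0],
    "ranking_3": [1.0, 1.0, 1.0],
})

mean_rank = rankings.mean()   # NaN entries are ignored
best = mean_rank.idxmin()     # column with the lowest (best) mean rank

print(mean_rank)
print(best)
```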
seen = starwars[starwars.columns[3:9]].sum()
seen.plot.bar()
It appears that most respondents have seen the last two movies in episode order (5 and 6) as well as the very first one. Movies 2 and 3 are ranked the lowest and, correspondingly, have the two lowest viewership counts. Movie 4 marks a revival of the series: viewership rises with it, peaks at the best-ranked movie (movie 5), which makes sense, and stays high from movie 5 to movie 6, presumably because of audience expectation.
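The viewership counts rely on the fact that summing a boolean column counts its True values. A small sketch on toy data (the values are invented) shows this, along with `mean()`, which gives the share of respondents instead of the raw count.

```python
import pandas as pd

# Invented flags for four respondents and two films
seen_toy = pd.DataFrame({
    "seen_1": [True, False, True, True],
    "seen_2": [False, False, True, False],
})

counts = seen_toy.sum()    # number of respondents who saw each film
share = seen_toy.mean()    # fraction of respondents who saw each film

print(counts)
print(share)
```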
males = starwars[starwars["Gender"] == "Male"]
females = starwars[starwars["Gender"] == "Female"]
male_seen = males[males.columns[3:9]].sum()
female_seen = females[females.columns[3:9]].sum()
male_ranking = males[males.columns[9:15]].mean()
female_ranking = females[females.columns[9:15]].mean()
test = [male_seen, female_seen, male_ranking, female_ranking]
j = [0, 0, 1, 1]
g = [0, 1, 0, 1]
fig, axes = plt.subplots(2, 2, figsize = (9, 9))
for i in range(4):
    if i != 1:
        test[i].plot.bar(ax = axes[j[i], g[i]])
    else:
        # fix the y-axis of the female viewership panel to 0-400
        test[i].plot.bar(ax = axes[j[i], g[i]], ylim = (0, 400))
plt.show()
More men than women have watched the series, as we would expect. Both groups dislike movie 3 about equally, and female viewers rank movie 4 lower than male viewers do; apart from these two movies, females give the series more positive rankings overall.
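The manual split into `males` and `females` can also be sketched with `groupby`, which computes the per-gender statistics in one step. The `toy` frame below is invented and has only one "seen" and one "ranking" column; the named-aggregation syntax assumes a reasonably recent pandas version.

```python
import pandas as pd

# Invented data with one "seen" and one "ranking" column
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "seen_1": [True, True, False, True],
    "ranking_1": [2.0, 1.0, 4.0, 3.0],
})

# Count viewers and average the ranking within each gender
by_gender = toy.groupby("Gender").agg(
    seen_1=("seen_1", "sum"),
    ranking_1=("ranking_1", "mean"),
)
print(by_gender)
```

Raw counts depend on how many respondents each group has, so replacing `"sum"` with `"mean"` for the "seen" column would give a share that may be easier to compare across genders.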