import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
bike = pd.read_csv("Data/bike_rental_hour.csv")
bike.head()
Here are the descriptions for the relevant columns:
- instant - A unique sequential ID number for each row
- dteday - The date of the rentals
- season - The season in which the rentals occurred
- yr - The year the rentals occurred
- mnth - The month the rentals occurred
- hr - The hour the rentals occurred
- holiday - Whether or not the day was a holiday
- weekday - The day of the week (as a number, 0 to 6)
- workingday - Whether or not the day was a working day
- weathersit - The weather (as a categorical variable)
- temp - The temperature, on a 0-1 scale
- atemp - The adjusted ("feels-like") temperature, on a 0-1 scale
- hum - The humidity, on a 0-1 scale
- windspeed - The wind speed, on a 0-1 scale
- casual - The number of casual riders (people who hadn't previously signed up with the bike sharing program)
- registered - The number of registered riders (people who had already signed up)
- cnt - The total number of bike rentals (casual + registered)
This dataset contains the hourly count of rental bikes between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. The original data also includes daily counts, which are not used here.
bike["cnt"].hist()
plt.show()
# numeric_only skips non-numeric columns such as dteday (required in pandas >= 2.0,
# where corr() no longer drops them silently)
bike.corr(numeric_only=True)["cnt"].sort_values()
def assign_label(num):
    # Bucket the hour of day: 1=morning, 2=afternoon, 3=evening, 4=night
    if 6 <= num < 12:
        return 1
    elif 12 <= num < 18:
        return 2
    elif 18 <= num <= 24:
        return 3
    else:
        return 4
bike["time_label"] = bike["hr"].apply(assign_label)
plt.scatter(bike["time_label"], bike["cnt"])
plt.show()
Thinking of choosing a decision tree: the columns that correlate most with the target do not show a linear relationship, and the target is a continuous number. Given that, a DecisionTreeRegressor would be a good model to start from.
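To see the nonlinearity directly, here is a quick sketch (this plot is my addition, not from the original notebook) of average rentals by hour; the two commute peaks form a pattern a straight line can't capture:

# Sketch: mean rentals per hour of day
bike.groupby("hr")["cnt"].mean().plot()
plt.xlabel("hr")
plt.ylabel("mean cnt")
plt.show()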
bike.shape
#choose a random 80% of data from bike to be the training set
# random_state added so the split is reproducible across runs
train = bike.sample(frac=0.8, random_state=1)
train.shape
#test is the remaining 20%:
test = bike.loc[~bike.index.isin(train.index)]
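As an aside, sklearn's train_test_split would do the same job in one call; a hedged equivalent sketch (the random_state value here is arbitrary, and the _alt names are mine):

from sklearn.model_selection import train_test_split
# Equivalent split in one call; shuffles the rows before splitting.
train_alt, test_alt = train_test_split(bike, test_size=0.2, random_state=1)
print(train_alt.shape, test_alt.shape)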
The guided project asks for linear regression first, and reminds us that the model works best when the predictors are linearly correlated with the target and also independent of one another.
cols = ["season", "time_label", "atemp", "hum", "weathersit"]
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr = LinearRegression()
lr.fit(train[cols], train["cnt"])
predict = lr.predict(test[cols])
mse = mean_squared_error(test["cnt"], predict)
rmse = np.sqrt(mse)
print(rmse)
bike["cnt"].plot(kind = "box")
np.std(bike["cnt"])
The RMSE is very high. Looking at the box plot and the standard deviation, the existence of outliers has likely distorted the output heavily: since MSE squares each error, a few very large misses dominate the score.
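One way to check this is to compare against the mean absolute error, which weights all errors linearly and so is far less sensitive to outliers. This comparison is my addition, not part of the guided project:

from sklearn.metrics import mean_absolute_error
# If MAE is much lower than RMSE, a few large errors are driving the RMSE up.
mae = mean_absolute_error(test["cnt"], predict)
print(mae)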
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(min_samples_split = 50, min_samples_leaf = 15, max_depth = 11)
dt.fit(train[cols], train["cnt"])
prediction = dt.predict(test[cols])
mse = mean_squared_error(test["cnt"], prediction)
rmse = np.sqrt(mse)
print(rmse)
With the min_samples_leaf parameter, the RMSE does decrease; however, raising the value past 10 doesn't seem to change the output much. Although much better than linear regression, the decision tree still has a relatively high RMSE. We'll use a random forest below.
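The claim above can be verified with a quick parameter sweep; a sketch (the particular values tried here are arbitrary, not from the original notebook):

# Sweep min_samples_leaf to see where the RMSE stops improving.
for leaf in [1, 5, 10, 15, 25, 50]:
    dt_sweep = DecisionTreeRegressor(min_samples_leaf=leaf)
    dt_sweep.fit(train[cols], train["cnt"])
    rmse = np.sqrt(mean_squared_error(test["cnt"], dt_sweep.predict(test[cols])))
    print(leaf, rmse)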
from sklearn.ensemble import RandomForestRegressor
# max_features="auto" was removed in scikit-learn 1.3; for regressors it meant
# "use all features", which max_features=1.0 expresses in every version
rfr = RandomForestRegressor(n_estimators=50, min_samples_split=50, min_samples_leaf=12, max_features=1.0, max_depth=11)
rfr.fit(train[cols], train["cnt"])
prediction = rfr.predict(test[cols])
mse = mean_squared_error(test["cnt"], prediction)
rmse = np.sqrt(mse)
print(rmse)
The random forest improves noticeably on the single decision tree: each tree is fit on a bootstrapped sample of the training data and the predictions are averaged (bagging), which reduces the variance, and hence the overfitting, of any one tree.
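As a closing check, the forest's feature_importances_ attribute shows which predictors drive its splits; a short sketch (this step is my addition, not part of the original project):

# Rank the predictors by how much they reduce impurity across the forest.
importances = pd.Series(rfr.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))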