Results of Bike Sharing Data Analysis Using Random Forest and Gradient Boosting Regressor
Results from the bike sharing data set are presented in this first section. A random forest model and a gradient boosting regressor were fitted to this data set, and the results are shown below.
Types of the data set used in the study
Shape of the data set: (17379, 17)
instant int64
dteday object
season int64
yr int64
mnth int64
hr int64
holiday int64
weekday int64
workingday int64
weathersit int64
temp float64
atemp float64
hum float64
windspeed float64
casual int64
registered int64
cnt int64
dtype: object
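These outputs can be reproduced with pandas. A minimal sketch, assuming the hourly bike sharing file is saved as hour.csv in the working directory (the file name and path are assumptions):

import pandas as pd

# Load the hourly bike sharing data (file name assumed)
df = pd.read_csv("hour.csv")

print(df.shape)       # (17379, 17)
print(df.dtypes)      # column types as listed above
print(df.describe())  # descriptive statistics, shown in the next subsection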
Descriptive statistics of the data set:
instant season yr mnth hr
count 17379.0000 17379.000000 17379.000000 17379.000000 17379.000000
mean 8690.0000 2.501640 0.502561 6.537775 11.546752
std 5017.0295 1.106918 0.500008 3.438776 6.914405
min 1.0000 1.000000 0.000000 1.000000 0.000000
25% 4345.5000 2.000000 0.000000 4.000000 6.000000
50% 8690.0000 3.000000 1.000000 7.000000 12.000000
75% 13034.5000 3.000000 1.000000 10.000000 18.000000
max 17379.0000 4.000000 1.000000 12.000000 23.000000
holiday weekday workingday weathersit temp
count 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000
mean 0.028770 3.003683 0.682721 1.425283 0.496987
std 0.167165 2.005771 0.465431 0.639357 0.192556
min 0.000000 0.000000 0.000000 1.000000 0.020000
25% 0.000000 1.000000 0.000000 1.000000 0.340000
50% 0.000000 3.000000 1.000000 1.000000 0.500000
75% 0.000000 5.000000 1.000000 2.000000 0.660000
max 1.000000 6.000000 1.000000 4.000000 1.000000
atemp hum windspeed casual registered
count 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000
mean 0.475775 0.627229 0.190098 35.676218 153.786869
std 0.171850 0.192930 0.122340 49.305030 151.357286
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.333300 0.480000 0.104500 4.000000 34.000000
50% 0.484800 0.630000 0.194000 17.000000 115.000000
75% 0.621200 0.780000 0.253700 48.000000 220.000000
max 1.000000 1.000000 0.850700 367.000000 886.000000
cnt
count 17379.000000
mean 189.463088
std 181.387599
min 1.000000
25% 40.000000
50% 142.000000
75% 281.000000
max 977.000000
The descriptive statistics shown above include, for each variable, the count, mean, standard deviation, minimum, maximum, and the three quartiles (25%, 50%, 75%).
Results for the feature vector lengths:
Feature vector length for categorical features: 4
Feature vector length for numerical features: 7
Total feature vector length: 11
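The output does not show which columns were placed in each group. A plausible sketch, assuming the four integer-coded context columns as categorical, the remaining predictors as numerical, and cnt as the target (this split is an assumption):

# Assumed split -- the exact column choice is not shown in the output
categorical = ["season", "holiday", "workingday", "weathersit"]        # 4 features
numerical = ["yr", "mnth", "hr", "temp", "atemp", "hum", "windspeed"]  # 7 features

# Each categorical column is already integer coded in the raw data,
# so each one contributes a single slot to the feature vector
X = df[categorical + numerical].to_numpy()
y = df["cnt"].to_numpy()

print("Feature vector length for categorical features:", len(categorical))  # 4
print("Feature vector length for numerical features:", len(numerical))      # 7
print("Total feature vector length:", X.shape[1])                           # 11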
Results from the random forest regression model
Decision Tree feature vector : [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
Decision Tree feature vector length: 10
Decision Tree predictions: [ 38.07836056 38.07836056 38.07836056 …, 38.07836056 38.07836056
38.07836056]
Decision Tree depth: 2
Decision Tree depth: None
Results from the decision tree log
Decision Tree Log – Mean Squared Error 3648.31141046
Decision Tree Log – Mean Squared Error 43.7264464287
The decision tree results and the logged errors are shown above. Two mean squared errors are reported, 3648.311 and 43.726, which appear to correspond to the two reported depths (2 and None, i.e. fully grown). On the basis of these errors the second model fits the data better, as its error value is much lower. A minimal sketch of such a run follows.
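This is how such a comparison can be produced with scikit-learn, fitting one shallow and one fully grown model and logging the training mean squared error (the hyper-parameters and the use of RandomForestRegressor here are assumptions based on the section heading):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# One shallow run (max_depth=2) and one fully grown run (max_depth=None)
for depth in (2, None):
    model = RandomForestRegressor(max_depth=depth, random_state=0)
    model.fit(X, y)
    print("Decision Tree feature vector :", model.feature_importances_)
    print("Decision Tree depth:", depth)
    print("Decision Tree Log - Mean Squared Error",
          mean_squared_error(y, model.predict(X)))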
Results from task 2 are discussed in this section.
Gradient Boosting Regression
Decision Tree feature vector : [ 0. 0. 0. 0. 0. 0. 0. 0. 0.03307816 0.96692184]
Decision Tree feature vector length: 10
Decision Tree predictions: [ 87.22443325 87.22443325 87.22443325 …, 117.91996176 99.23814659
87.22443325]
Decision Tree depth: 2
Decision Tree depth: None
Decision tree log
For the gradient boosting regression, the decision tree log is shown below.
Decision Tree Log – Mean Squared Error 6359.19130385
Decision Tree Log – Mean Squared Error 57.0455144575
As the results show, in this case too the mean squared error is lower in the second case, which means that the second model is better than the first. A corresponding sketch follows.
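The corresponding sketch for the gradient boosting run, under the same assumptions as above:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

for depth in (2, None):  # max_depth=None requires a recent scikit-learn release
    gbr = GradientBoostingRegressor(max_depth=depth, random_state=0)
    gbr.fit(X, y)
    print("Decision Tree feature vector :", gbr.feature_importances_)
    print("Decision Tree depth:", depth)
    print("Decision Tree Log - Mean Squared Error",
          mean_squared_error(y, gbr.predict(X)))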
For the second part of the project, data was downloaded from Kaggle and the same methodology was applied. The mtcars data set was used for this analysis, and the results are shown in the section below.
Description of the data
Shape of the data set: (32, 12)
Unnamed: 0 object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
As the output shows, the data set contains 32 observations of 12 variables (the unnamed first column holds the car model names).
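Loading and inspecting the Kaggle download follows the same pattern as before (the file name mtcars.csv is an assumption):

import pandas as pd

cars = pd.read_csv("mtcars.csv")
print(cars.shape)   # (32, 12)
print(cars.dtypes)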
Descriptive statistics of the continuous variable
mpg cyl disp hp drat wt
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000
qsec vs am gear carb
count 32.000000 32.000000 32.000000 32.000000 32.0000
mean 17.848750 0.437500 0.406250 3.687500 2.8125
std 1.786943 0.504016 0.498991 0.737804 1.6152
min 14.500000 0.000000 0.000000 3.000000 1.0000
25% 16.892500 0.000000 0.000000 3.000000 2.0000
50% 17.710000 0.000000 0.000000 4.000000 2.0000
75% 18.900000 1.000000 1.000000 4.000000 4.0000
max 22.900000 1.000000 1.000000 5.000000 8.0000
Descriptive statistics of the variables are presented in the table above, giving the researcher an overview of the collected data set.
Results for the feature vector lengths:
Feature vector length for categorical features: 5
Feature vector length for numerical features: 5
Total feature vector length: 10
As the result shows, the total feature vector length is 10, with 5 numerical and 5 categorical features; a sketch of one possible encoding follows.
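One possible encoding matching these counts, assuming mpg as the target, the five integer-coded columns as categorical, and the five remaining continuous columns as numerical (all assumptions, since the output does not name the columns):

# Assumed split: mpg is the target, the integer-coded columns are
# treated as categorical, the continuous ones as numerical
categorical = ["cyl", "vs", "am", "gear", "carb"]   # 5 features
numerical = ["disp", "hp", "drat", "wt", "qsec"]    # 5 features

X_cars = cars[categorical + numerical].to_numpy()
y_cars = cars["mpg"].to_numpy()

print("Feature vector length for categorical features:", len(categorical))  # 5
print("Feature vector length for numerical features:", len(numerical))      # 5
print("Total feature vector length:", X_cars.shape[1])                      # 10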
Results from the decision tree:
Decision Tree feature vector : [ 0.16263247 0.31856472 0.04314577 0.00312414 0.4657291 0.0068038 0. 0. 0. ]
Decision Tree feature vector length: 10
Decision Tree predictions: [ 20.38783883 20.38783883 23.62423918 19.85791236 17.32767526
19.33232794 14.96165695 21.1793161 21.5733161 19.33232794
19.33232794 15.45256604 15.45256604 15.45256604 14.35451409
14.35451409 14.96165695 30.30945238 29.08578571 30.90028571
21.5733161 16.07032361 17.32767526 14.96165695 16.07032361
28.60745238 25.69613095 28.81042857 16.74459301 20.38783883
14.96165695 21.5733161 ]
Decision Tree depth: 2
Decision Tree depth: None
Decision Tree Log – Mean Squared Error 2.99511855857
Decision Tree Log – Mean Squared Error 1.34997530962
Here too the mean squared error for the second model (1.350) is lower than for the first (2.995); however, the difference between the two models is not very large. A sketch of such runs is given below.
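A sketch of the corresponding decision tree runs on the mtcars features (the use of DecisionTreeRegressor and its settings are assumptions suggested by the output labels):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

for depth in (2, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_cars, y_cars)
    print("Decision Tree feature vector :", tree.feature_importances_)
    print("Decision Tree predictions:", tree.predict(X_cars))
    print("Decision Tree depth:", depth)
    print("Decision Tree Log - Mean Squared Error",
          mean_squared_error(y_cars, tree.predict(X_cars)))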
Results from the gradient boosting regressor
Decision Tree feature vector : [ 0.26241313 0.29232947 0.18088804 0. 0.25787725 0.00649211 0. 0. 0. ]
Decision Tree feature vector length: 10
Decision Tree predictions: [ 20.36937837 20.36937837 21.608521 19.67536304 17.93841175
19.47194207 16.63320595 21.608521 21.608521 18.96648356
18.96648356 17.93841175 17.93841175 17.93841175 14.747734 14.747734
16.63320595 27.65988835 26.81396877 27.65988835 21.608521
17.93841175 17.93841175 16.63320595 17.93841175 26.34583871
23.69848123 26.34583871 17.05336798 19.86391986 16.63320595
21.608521 ]
Decision Tree depth: 2
Decision Tree depth: None
Decision Tree Log – Mean Squared Error 2.99511855857
Decision Tree Log – Mean Squared Error 1.34997530962
In this case as well, the mean squared error is lower in the second case than in the first.