Recently, I’ve been doing some Kaggle competitions in my spare time and sharing my approach and solution here. Last week, I tackled a binary classification task: predicting Titanic survivors. In this second installment, let’s dive into a regression problem, Bike Sharing Demand.
The problem involves Washington D.C.’s bike sharing program. The goal is to forecast hourly demand based on historical usage patterns and weather data. We’re given a dataset that’s complete for the first 19 days of every month (over a two-year period), and we need to predict how many bikes are rented each hour for the remaining days of each month.
There are several interesting pieces to this problem that can collectively help you arrive at a good model and result (top 5% of entries as of this post):
- There’s a timestamp column in the dataset that can be parsed to create several time-related variables (hour, day of the week, month, year, etc.) that are useful in modeling the intraday and seasonal trends.
- In the training data, rentals are broken out into two groups of users (registered and casual), and these groups exhibit different usage patterns (e.g., see day of week chart below). Because of this, it seems to be beneficial to regress demand for each group separately and then sum the predictions.
- Try applying data transformations to your dependent variables to better account for their skewed distributions (e.g., note the mass of values bunched at the low end of the ‘casual’ histogram).
- After landing on a suitable feature set, take some time to optimize your model parameters; even simple (non-comprehensive) tuning can dramatically improve your score in this challenge.
Enough with the chit chat, here’s the code (GitHub repo here):
| | _data | atemp | casual | count | datetime | holiday | humidity | registered | season | temp | weather | windspeed | workingday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | train | 14.395 | 3 | 16 | 2011-01-01 00:00:00 | 0 | 81 | 13 | 1 | 9.84 | 1 | 0 | 0 |
1 | train | 13.635 | 8 | 40 | 2011-01-01 01:00:00 | 0 | 80 | 32 | 1 | 9.02 | 1 | 0 | 0 |
2 | train | 13.635 | 5 | 32 | 2011-01-01 02:00:00 | 0 | 80 | 27 | 1 | 9.02 | 1 | 0 | 0 |
3 | train | 14.395 | 3 | 13 | 2011-01-01 03:00:00 | 0 | 75 | 10 | 1 | 9.84 | 1 | 0 | 0 |
4 | train | 14.395 | 0 | 1 | 2011-01-01 04:00:00 | 0 | 75 | 1 | 1 | 9.84 | 1 | 0 | 0 |
_data train
atemp 14.395
casual 3
count 16
datetime 2011-01-01 00:00:00
holiday 0
humidity 81
registered 13
season 1
temp 9.84
weather 1
windspeed 0
workingday 0
date 2011-01-01
day 1
month 1
year 2011
hour 0
dow 5
doy 1
woy 52
weeks_since_start 0
casual_log 1.386294
registered_log 2.639057
count_log 2.833213
Name: 0, dtype: object
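The derived columns in the row above (`day` through `count_log`) can be reproduced along these lines. This is a minimal sketch using only the Python standard library so that it runs anywhere; the original code most likely used pandas. The function name `add_time_features`, the `start` argument, and the exact definition of `weeks_since_start` (whole weeks elapsed since the first timestamp) are assumptions on my part. The `_log` columns match `log(x + 1)`, which is what `math.log1p` computes.

```python
import math
from datetime import datetime


def add_time_features(row, start):
    """Return a copy of `row` with time features and log1p targets added.

    `row` is a dict with a 'datetime' string plus the rental counts;
    `start` is the first timestamp in the dataset (assumed reference point).
    """
    ts = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
    out = dict(row)
    out.update({
        "day": ts.day,
        "month": ts.month,
        "year": ts.year,
        "hour": ts.hour,
        "dow": ts.weekday(),            # Monday = 0, so Saturday = 5
        "doy": ts.timetuple().tm_yday,  # day of year, 1-based
        "woy": ts.isocalendar()[1],     # ISO week number
        # Assumed definition: whole weeks elapsed since the first record.
        "weeks_since_start": (ts - start).days // 7,
    })
    # log(x + 1) keeps zero counts finite and compresses the long right tail.
    for col in ("casual", "registered", "count"):
        out[col + "_log"] = math.log1p(row[col])
    return out


first = {"datetime": "2011-01-01 00:00:00", "casual": 3, "registered": 13, "count": 16}
start = datetime(2011, 1, 1)
features = add_time_features(first, start)
print(features["dow"], features["woy"], round(features["count_log"], 6))
```

For the first training row this gives `dow` 5, `woy` 52 (January 1, 2011 falls in the last ISO week of 2010), and `count_log` of about 2.833213, which matches the output above.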
casual registered
count 456.000000 456.000000
mean 859.945175 3713.467105
std 698.913571 1494.477105
min 9.000000 491.000000
25% 318.000000 2696.000000
50% 722.000000 3700.000000
75% 1141.750000 4814.250000
max 3410.000000 6911.000000
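When the two user groups are modeled separately on a log scale, their predictions have to be mapped back to counts before being summed. Here is a minimal sketch of that step, assuming the targets were transformed with `log(x + 1)` (consistent with the `_log` columns above). The function name `combine_predictions` is hypothetical, and clipping negative predictions at zero is my own addition rather than something taken from the original code.

```python
import math


def combine_predictions(casual_log_pred, registered_log_pred):
    """Invert log1p on each group's predictions and sum them per hour."""
    totals = []
    for c_log, r_log in zip(casual_log_pred, registered_log_pred):
        # expm1 is the exact inverse of log1p; clip so counts stay non-negative.
        casual = max(0.0, math.expm1(c_log))
        registered = max(0.0, math.expm1(r_log))
        totals.append(casual + registered)
    return totals


# Log-scale "predictions" equal to the first training row's true values.
total = combine_predictions([math.log1p(3)], [math.log1p(13)])
print(round(total[0], 6))
```

The order matters: adding the two log-scale predictions directly would correspond to multiplying the shifted counts, not summing them.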
| | count | casual | registered | weather | temp | atemp | humidity | windspeed | holiday | workingday | season | day | month | year | hour | dow | doy | woy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.000000 | 0.690414 | 0.970948 | -0.128655 | 0.394454 | 0.389784 | -0.317371 | 0.101369 | -0.005393 | 0.011594 | 0.163439 | 0.019826 | 0.166862 | 0.260403 | 0.400601 | -0.002283 | 0.168056 | 0.152512 |
casual | 0.690414 | 1.000000 | 0.497250 | -0.135918 | 0.467097 | 0.462067 | -0.348187 | 0.092276 | 0.043799 | -0.319111 | 0.096758 | 0.014109 | 0.092722 | 0.145241 | 0.302045 | 0.246959 | 0.092957 | 0.079906 |
registered | 0.970948 | 0.497250 | 1.000000 | -0.109340 | 0.318571 | 0.314635 | -0.265458 | 0.091052 | -0.020956 | 0.119460 | 0.164011 | 0.019111 | 0.169451 | 0.264265 | 0.380540 | -0.084427 | 0.170805 | 0.156480 |
weather | -0.128655 | -0.135918 | -0.109340 | 1.000000 | -0.055035 | -0.055376 | 0.406244 | 0.007261 | -0.007074 | 0.033772 | 0.008879 | -0.007890 | 0.012144 | -0.012548 | -0.022740 | -0.047692 | 0.011746 | 0.019762 |
temp | 0.394454 | 0.467097 | 0.318571 | -0.055035 | 1.000000 | 0.984948 | -0.064949 | -0.017852 | 0.000295 | 0.029966 | 0.258689 | 0.015551 | 0.257589 | 0.061226 | 0.145430 | -0.038466 | 0.255887 | 0.240794 |
atemp | 0.389784 | 0.462067 | 0.314635 | -0.055376 | 0.984948 | 1.000000 | -0.043536 | -0.057473 | -0.005215 | 0.024660 | 0.264744 | 0.011866 | 0.264173 | 0.058540 | 0.140343 | -0.040235 | 0.262245 | 0.248653 |
humidity | -0.317371 | -0.348187 | -0.265458 | 0.406244 | -0.064949 | -0.043536 | 1.000000 | -0.318607 | 0.001929 | -0.010880 | 0.190610 | -0.011335 | 0.204537 | -0.078606 | -0.278011 | -0.026507 | 0.203155 | 0.216435 |
windspeed | 0.101369 | 0.092276 | 0.091052 | 0.007261 | -0.017852 | -0.057473 | -0.318607 | 1.000000 | 0.008409 | 0.013373 | -0.147121 | 0.036157 | -0.150192 | -0.015221 | 0.146631 | -0.024804 | -0.148062 | -0.145962 |
holiday | -0.005393 | 0.043799 | -0.020956 | -0.007074 | 0.000295 | -0.005215 | 0.001929 | 0.008409 | 1.000000 | -0.250491 | 0.029368 | -0.015877 | 0.001731 | 0.012021 | -0.000354 | -0.191832 | 0.001134 | 0.000976 |
workingday | 0.011594 | -0.319111 | 0.119460 | 0.033772 | 0.029966 | 0.024660 | -0.010880 | 0.013373 | -0.250491 | 1.000000 | -0.008126 | 0.009829 | -0.003394 | -0.002482 | 0.002780 | -0.704267 | -0.003024 | -0.022593 |
season | 0.163439 | 0.096758 | 0.164011 | 0.008879 | 0.258689 | 0.264744 | 0.190610 | -0.147121 | 0.029368 | -0.008126 | 1.000000 | 0.001729 | 0.971524 | -0.004797 | -0.006546 | -0.010553 | 0.970196 | 0.939284 |
day | 0.019826 | 0.014109 | 0.019111 | -0.007890 | 0.015551 | 0.011866 | -0.011335 | 0.036157 | -0.015877 | 0.009829 | 0.001729 | 1.000000 | 0.001974 | 0.001800 | 0.001132 | -0.011070 | 0.054102 | 0.018538 |
month | 0.166862 | 0.092722 | 0.169451 | 0.012144 | 0.257589 | 0.264173 | 0.204537 | -0.150192 | 0.001731 | -0.003394 | 0.971524 | 0.001974 | 1.000000 | -0.004932 | -0.006818 | -0.002266 | 0.998616 | 0.961809 |
year | 0.260403 | 0.145241 | 0.264265 | -0.012548 | 0.061226 | 0.058540 | -0.078606 | -0.015221 | 0.012021 | -0.002482 | -0.004797 | 0.001800 | -0.004932 | 1.000000 | -0.004234 | -0.003785 | -0.000837 | -0.003411 |
hour | 0.400601 | 0.302045 | 0.380540 | -0.022740 | 0.145430 | 0.140343 | -0.278011 | 0.146631 | -0.000354 | 0.002780 | -0.006546 | 0.001132 | -0.006818 | -0.004234 | 1.000000 | -0.002925 | -0.006735 | -0.006532 |
dow | -0.002283 | 0.246959 | -0.084427 | -0.047692 | -0.038466 | -0.040235 | -0.026507 | -0.024804 | -0.191832 | -0.704267 | -0.010553 | -0.011070 | -0.002266 | -0.003785 | -0.002925 | 1.000000 | -0.002786 | 0.007964 |
doy | 0.168056 | 0.092957 | 0.170805 | 0.011746 | 0.255887 | 0.262245 | 0.203155 | -0.148062 | 0.001134 | -0.003024 | 0.970196 | 0.054102 | 0.998616 | -0.000837 | -0.006735 | -0.002786 | 1.000000 | 0.961538 |
woy | 0.152512 | 0.079906 | 0.156480 | 0.019762 | 0.240794 | 0.248653 | 0.216435 | -0.145962 | 0.000976 | -0.022593 | 0.939284 | 0.018538 | 0.961809 | -0.003411 | -0.006532 | 0.007964 | 0.961538 | 1.000000 |
1.351102939
0.471594968034
cols: base_cols + []
rmse: 1.351102939
cols: base_cols + ['hour']
rmse: 0.471594968034
cols: base_cols + ['hour', 'dow']
rmse: 0.455313354995
cols: base_cols + ['hour', 'dow', 'year']
rmse: 0.362092301557
cols: base_cols + ['hour', 'dow', 'year', 'month']
rmse: 0.334982551858
cols: base_cols + ['hour', 'dow', 'year', 'month', 'weeks_since_start']
rmse: 0.337482905295
cols: base_cols + ['hour', 'dow', 'year', 'month', 'weeks_since_start', 'day']
rmse: 0.342750167966
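The `rmse` values in this log are small enough that they appear to be computed on the log-transformed targets, which would make them equivalent to the root mean squared logarithmic error (RMSLE) on raw counts. That reading is an assumption; the scoring code is not shown here. A sketch of the metric under that assumption, with `rmsle` as an illustrative name:

```python
import math


def rmsle(actual, predicted):
    """Root mean squared error between log1p(actual) and log1p(predicted)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


perfect = rmsle([16, 40, 32], [16, 40, 32])
off_by_one_log = rmsle([0], [math.e - 1])  # log1p(e - 1) is 1, log1p(0) is 0
print(perfect, off_by_one_log)
```

Because the error is measured on a log scale, it tracks relative rather than absolute misses, so a feature like `hour` that separates quiet hours from busy ones can move the score a lot.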
trees: 500, mss: 6, rmse: 0.346277501816
trees: 500, mss: 8, rmse: 0.34516067175
trees: 500, mss: 10, rmse: 0.344992452172
trees: 500, mss: 12, rmse: 0.344630590643
trees: 500, mss: 14, rmse: 0.345382630487
trees: 1000, mss: 6, rmse: 0.346321699235
trees: 1000, mss: 8, rmse: 0.345137737679
trees: 1000, mss: 10, rmse: 0.344800789859
trees: 1000, mss: 12, rmse: 0.344309234546
trees: 1000, mss: 14, rmse: 0.345338741363
trees: 1500, mss: 6, rmse: 0.346564275074
trees: 1500, mss: 8, rmse: 0.345270886888
trees: 1500, mss: 10, rmse: 0.345261507783
trees: 1500, mss: 12, rmse: 0.344792046338
trees: 1500, mss: 14, rmse: 0.344961343139
best params: {'n_estimators': 1000, 'min_samples_split': 12, 'random_state': 123, 'n_jobs': -1}, rmse: 0.344309234546
boom.
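The tuning log above corresponds to a small grid over two parameters. A minimal sketch of that loop follows. The parameter names `n_estimators` and `min_samples_split` come from the `best params` line, but the model itself is not shown in this post, so the `evaluate` function here is a hypothetical stand-in that simply looks up the scores printed above instead of training anything.

```python
from itertools import product

# Scores copied from the tuning log, keyed by (trees, min_samples_split).
LOGGED_RMSE = {
    (500, 6): 0.346277501816,
    (500, 8): 0.34516067175,
    (500, 10): 0.344992452172,
    (500, 12): 0.344630590643,
    (500, 14): 0.345382630487,
    (1000, 6): 0.346321699235,
    (1000, 8): 0.345137737679,
    (1000, 10): 0.344800789859,
    (1000, 12): 0.344309234546,
    (1000, 14): 0.345338741363,
    (1500, 6): 0.346564275074,
    (1500, 8): 0.345270886888,
    (1500, 10): 0.345261507783,
    (1500, 12): 0.344792046338,
    (1500, 14): 0.344961343139,
}


def evaluate(params):
    """Stand-in for validation scoring: return the logged RMSE."""
    return LOGGED_RMSE[(params["n_estimators"], params["min_samples_split"])]


def grid_search(tree_options, mss_options, score_fn):
    """Try every combination and keep the one with the lowest score."""
    best_params, best_score = None, float("inf")
    for trees, mss in product(tree_options, mss_options):
        params = {"n_estimators": trees, "min_samples_split": mss}
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search([500, 1000, 1500], [6, 8, 10, 12, 14], evaluate)
print(best_params, best_score)
```

In a real run, `evaluate` would fit the model with `params` and return a validation score. Note how flat the grid is: every combination lands within about 0.002 of the best one.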