At work recently, I had the chance to dig into some interesting machine learning problems around customer retention. In my spare time, I’ve also started working through problems on Kaggle to get more exposure to different datasets and modeling techniques.
I figured that blogging about it would force me to keep my code sane and legible. So to kick things off, here’s my solution to the Titanic competition. The goal is to predict which passengers survived based on attributes such as gender, age, passenger class, port of embarkation, and a few others.
My model is pretty simple, and I certainly could have done a lot more feature engineering, but even so it clocked in with an accuracy of 80.4% (top 15% at the time of this post). My decently commented code is below, and I have a repo on GitHub too if you’re into that sort of thing.
df: test
shape: 418 rows, 12 cols
column info:
* PassengerId: 0 nulls, 418 unique vals, most common: {1128: 1, 1023: 1}
* Pclass: 0 nulls, 3 unique vals, most common: {1: 107, 3: 218}
* Name: 0 nulls, 418 unique vals, most common: {'Rosenbaum, Miss. Edith Louise': 1, 'Beauchamp, Mr. Henry James': 1}
* Sex: 0 nulls, 2 unique vals, most common: {'male': 266, 'female': 152}
* Age: 86 nulls, 79 unique vals, most common: {24.0: 17, 21.0: 17}
* SibSp: 0 nulls, 7 unique vals, most common: {0: 283, 1: 110}
* Parch: 0 nulls, 8 unique vals, most common: {0: 324, 1: 52}
* Ticket: 0 nulls, 363 unique vals, most common: {'PC 17608': 5, '113503': 4}
* Fare: 1 nulls, 169 unique vals, most common: {7.75: 21, 26.0: 19}
* Cabin: 327 nulls, 76 unique vals, most common: {'B57 B59 B63 B66': 3, 'C101': 2}
* Embarked: 0 nulls, 3 unique vals, most common: {'S': 270, 'C': 102}
* data: 0 nulls, 1 unique vals, most common: {'test': 418}
------
df: train
shape: 891 rows, 13 cols
column info:
* PassengerId: 0 nulls, 891 unique vals, most common: {891: 1, 293: 1}
* Survived: 0 nulls, 2 unique vals, most common: {0: 549, 1: 342}
* Pclass: 0 nulls, 3 unique vals, most common: {1: 216, 3: 491}
* Name: 0 nulls, 891 unique vals, most common: {'Graham, Mr. George Edward': 1, 'Elias, Mr. Tannous': 1}
* Sex: 0 nulls, 2 unique vals, most common: {'male': 577, 'female': 314}
* Age: 177 nulls, 88 unique vals, most common: {24.0: 30, 22.0: 27}
* SibSp: 0 nulls, 7 unique vals, most common: {0: 608, 1: 209}
* Parch: 0 nulls, 7 unique vals, most common: {0: 678, 1: 118}
* Ticket: 0 nulls, 681 unique vals, most common: {'CA. 2343': 7, '347082': 7}
* Fare: 0 nulls, 248 unique vals, most common: {13.0: 42, 8.0500000000000007: 43}
* Cabin: 687 nulls, 147 unique vals, most common: {'G6': 4, 'C23 C25 C27': 4}
* Embarked: 2 nulls, 3 unique vals, most common: {'S': 644, 'C': 168}
* data: 0 nulls, 1 unique vals, most common: {'train': 891}
------
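For anyone who wants a similar overview of their own data, here is a minimal sketch of how a summary like the one above could be produced with pandas. The helper name `summarize`, its `top_n` parameter, and the toy frame are made up for illustration; this is not necessarily the code behind the output shown here.

```python
import pandas as pd


def summarize(df, name, top_n=2):
    """Print (and return) the shape plus per-column nulls, cardinality, and top values."""
    info = {}
    print(f"df: {name}")
    print(f"shape: {df.shape[0]} rows, {df.shape[1]} cols")
    print("column info:")
    for col in df.columns:
        info[col] = {
            "nulls": int(df[col].isnull().sum()),
            "unique": int(df[col].nunique()),  # nunique() skips NaN by default
            "most_common": df[col].value_counts().head(top_n).to_dict(),
        }
        print("* {}: {nulls} nulls, {unique} unique vals, "
              "most common: {most_common}".format(col, **info[col]))
    print("------")
    return info


# Toy frame with made-up values, only to exercise the helper.
toy = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [22.0, None, 22.0]})
toy_info = summarize(toy, "toy")
```

Returning the dict alongside the printout makes it easy to check null counts programmatically before deciding how to impute columns like Age or Cabin.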
|  | data | passengerid | survived | age | cabin | embarked | fare | name | parch | pclass | sex | sibsp | ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train | 1 | 0 | 22 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | male | 1 | A/5 21171 |
| 1 | train | 2 | 1 | 38 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 1 | female | 1 | PC 17599 |
| 2 | train | 3 | 1 | 26 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | female | 0 | STON/O2. 3101282 |
| 3 | train | 4 | 1 | 35 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 1 | female | 1 | 113803 |
| 4 | train | 5 | 0 | 35 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 3 | male | 0 | 373450 |
data train
passengerid 1
survived 0
age 22
cabin NaN
embarked S
fare 7.25
name Braund, Mr. Owen Harris
parch 0
pclass 3
sex male
sibsp 1
ticket A/5 21171
gender 1
title Mr
port_C 0
port_Q 0
port_S 1
Name: 0, dtype: object
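The row above carries three derived fields: a numeric `gender`, a `title` pulled out of the name, and one-hot `port_*` columns for the port of embarkation. Here is a rough sketch of one way to derive them with pandas. The regular expression, the fixed category list, and the two-row toy frame (copied from the table earlier in the post) are assumptions for illustration, not necessarily how the original columns were built.

```python
import pandas as pd

# Two rows copied from the table shown earlier, just to have something to transform.
toy = pd.DataFrame({
    "name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Th..."],
    "sex": ["male", "female"],
    "embarked": ["S", "C"],
})

# gender: 1 for male, 0 for female (this matches the encoding in the row above).
toy["gender"] = (toy["sex"] == "male").astype(int)

# title: the text between the comma and the first period, e.g. "Mr" or "Mrs".
toy["title"] = toy["name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# One-hot encode the port. Declaring the categories up front guarantees that
# port_C, port_Q and port_S all exist even if a port is missing from the sample.
port = toy["embarked"].astype(pd.CategoricalDtype(["C", "Q", "S"]))
ports = pd.get_dummies(port, prefix="port").astype(int)
toy = pd.concat([toy, ports], axis=1)

print(toy[["gender", "title", "port_C", "port_Q", "port_S"]])
```

Fixing the category list also keeps the train and test frames aligned on the same dummy columns, which matters once both are fed to the same model.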
|  | data | passengerid | survived | age | fare | parch | pclass | sibsp | gender | port_C | port_Q | port_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train | 1 | 0 | 22 | 7.2500 | 0 | 3 | 1 | 1 | 0 | 0 | 1 |
| 1 | train | 2 | 1 | 38 | 71.2833 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | train | 3 | 1 | 26 | 7.9250 | 0 | 3 | 0 | 0 | 0 | 0 | 1 |
| 3 | train | 4 | 1 | 35 | 53.1000 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | train | 5 | 0 | 35 | 8.0500 | 0 | 3 | 0 | 1 | 0 | 0 | 1 |
9 total features
['age' 'fare' 'parch' 'pclass' 'sibsp' 'gender' 'port_C' 'port_Q' 'port_S']
891 rows, 9 features
accuracy: 0.799102
fare 0.265518
age 0.264835
gender 0.259047
pclass 0.078497
sibsp 0.054447
parch 0.046165
port_S 0.011954
port_C 0.011904
port_Q 0.007634
dtype: float64
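An accuracy figure and a sorted importance list like the ones above are what you typically get from a random forest in scikit-learn. Here is a minimal sketch, assuming 5-fold cross-validation for the accuracy (the post does not say how its number was computed) and using synthetic stand-in data, so the printed values will not match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix; NOT the Titanic data.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.uniform(1, 70, n),
    "fare": rng.uniform(5, 100, n),
    "gender": rng.integers(0, 2, n),
})
# Made-up label rule so the forest has a signal to find.
y = ((X["gender"] == 0) | (X["fare"] > 80)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 folds (an assumed evaluation scheme).
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():f}")

# Fit on everything, then rank features by impurity-based importance.
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances always sum to 1, so they are best read as relative shares rather than absolute effect sizes.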
0.836139169473
{'max_features': 0.5, 'n_estimators': 100, 'criterion': 'entropy', 'max_depth': 7}
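A best score followed by a parameter dict is the usual shape of a grid search result. Here is a hedged sketch using scikit-learn's `GridSearchCV`. The grid values, the 3-fold split, and the synthetic data are all assumptions chosen to keep the example quick; only the four parameter names come from the dict above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 9-feature problem standing in for the real training set.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# Hypothetical grid: same parameter names as above, illustrative values.
param_grid = {
    "max_features": [0.5, "sqrt"],
    "n_estimators": [25, 50],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 7],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_score_)   # mean cross-validated accuracy of the best combination
print(search.best_params_)  # the winning parameter dict
```

Keep in mind that `best_score_` is picked as the maximum over the grid, so it tends to be a little optimistic compared with what a held-out set or the leaderboard will report.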
boom.