Another week, another Kaggle competition write-up. I’m officially dubbing this month Machine Learning May. But rest assured, this will be the last one of these write-ups for a little while; I don’t want to get typecast :)
For this competition, we’re tasked with predicting altruism. We’re given a dataset of textual requests for pizza from the Reddit community Random Acts of Pizza. All requests ask for one thing: a free pizza. Our goal is to create a model that can predict the success of these requests.
This is a nice intro challenge for people wanting to cut their teeth on some basic natural language processing (NLP). There’s also an accompanying academic paper by a group of Stanford PhD students here, which is a fun read and helpful in seeing how they approached the problem.
This time around, submissions are evaluated based on the area under the ROC curve (AUC). For more info on ROC analysis, this paper is a great primer, but basically it’s a way to visualize the performance of a classifier (true positive rate vs. false positive rate) across various discrimination thresholds (the cutoff probability for the binary classification). An AUC of 0.5 is what random guessing gets you, and 1.0 means every successful request is ranked above every unsuccessful one.
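To make the metric concrete, here is a minimal, dependency-free sketch of how AUC can be computed from labels and predicted probabilities. It is not the competition's scoring code: the function names `roc_auc` and `roc_point` and the toy labels and scores are made up for illustration. It relies on the fact that AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting half.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def roc_point(y_true, scores, threshold):
    """One point on the ROC curve: (false positive rate, true positive rate)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos


# Toy data, invented for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(labels, scores))         # 0.75
print(roc_point(labels, scores, 0.5))  # (0.0, 0.5)
```

The pairwise loop is quadratic, which is fine for a sanity check but slow on a full dataset; a library routine that sorts the scores once is the better choice there.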
Alright, time for the models. I ended up doing two. The first is a very simple Naive Bayes bag-of-words model using the text from the request (after combining the title and body fields). The code for it is below and, after some parameter tuning, it scores an AUC of 0.605. The second model uses a Gradient Boosting Classifier and incorporates more features (several of which come from the academic paper above); this one scores 0.702 (code here).
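For readers who haven't seen one before, here is a stripped-down sketch of what a Naive Bayes bag-of-words classifier does under the hood. This is not the code behind the scores above: the class name `NaiveBayesBagOfWords`, the toy requests, and the default `alpha=1.0` are all assumptions for illustration, and a real run would more likely lean on a library implementation.

```python
import math
from collections import Counter


class NaiveBayesBagOfWords:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (hypothetical default)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in d.split()}
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(w for d in class_docs for w in d.split())
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                w: math.log((counts[w] + self.alpha) / total) for w in self.vocab
            }
        return self

    def predict_proba(self, doc):
        """Return {class: posterior probability} for one document."""
        log_scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for w in doc.split():
                if w in self.vocab:  # words never seen in training are ignored
                    score += self.log_likelihood[c][w]
            log_scores[c] = score
        # Normalise in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


# Toy requests, invented for illustration only (1 = got pizza, 0 = did not).
train_docs = [
    "lost job paycheck tight",
    "unemployed paycheck rent",
    "birthday party friends",
    "studying friends free",
]
train_labels = [1, 1, 0, 0]

model = NaiveBayesBagOfWords(alpha=1.0).fit(train_docs, train_labels)
probs = model.predict_proba("paycheck tight this week")
print(probs[1] > probs[0])  # True
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log-likelihoods can simply be added up.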
Curious which individual words best predict whether a request will result in a pizza? See below!
Sample row from the training data:
request_id t3_l25d7
title Request Colorado Springs Help Us Please
body Hi I am in need of food for my 4 children we a...
got_pizza 0
Name: 0, dtype: object
Request Colorado Springs Help Us Please
--
Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
--
Request Colorado Springs Help Us Please Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
--
Request Colorado Springs Help Us Please Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
--
request colorado springs help us please hi need food children military family really hit hard times exahusted means help able feed family make another night ask know blessing coming whatever u find heart give greatly appreciated
--
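The stages printed above (title, body, combined text, then a lowercased version with numbers and common words stripped out) can be approximated with a few lines of Python. This is a sketch under assumptions rather than the original cleaning code: the function name `clean_request` is hypothetical, and the tiny `STOPWORDS` set is a stand-in for whatever fuller stopword list was actually used.

```python
import re

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {
    "i", "am", "in", "of", "for", "my", "we", "are", "a", "that", "has",
    "and", "have", "all", "to", "be", "the", "is", "so", "your", "it", "our",
}


def clean_request(title, body):
    """Combine title and body, keep letters only, lowercase, drop stopwords."""
    text = f"{title} {body}"
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # digits and punctuation become spaces
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)


cleaned = clean_request(
    "Request Colorado Springs Help Us Please",
    "Hi I am in need of food for my 4 children",
)
print(cleaned)
# request colorado springs help us please hi need food children
```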
Accuracy on training data: 0.885149
Accuracy on test data: 0.717822
AUC: 0.512876
alpha: 5.000000
min_df: 0.020000
best auc: 0.605254
Accuracy on training data: 0.757756
Accuracy on test data: 0.724752
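A quick note on the two tuned parameters. Assuming the names follow the usual convention, `alpha` is the additive smoothing strength in the Naive Bayes word probabilities, and `min_df` is the minimum fraction of documents a word must appear in to make it into the vocabulary. The sketch below shows what each one does; the helper names `build_vocabulary` and `smoothed_estimate` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, min_df):
    """Keep words that appear in at least a fraction `min_df` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    threshold = min_df * len(docs)
    return sorted(w for w, n in doc_freq.items() if n >= threshold)


def smoothed_estimate(word_count, total_count, vocab_size, alpha):
    """Additively smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Toy documents, invented for illustration only.
docs = ["pizza please", "pizza tonight please", "broke student", "pizza for student"]

vocab = build_vocabulary(docs, min_df=0.5)
print(vocab)  # ['pizza', 'please', 'student']

# A word never seen in a class still gets some probability mass,
# and a larger alpha pulls every estimate toward uniform (1 / vocab_size).
low = smoothed_estimate(0, 10, 5, alpha=1.0)
high = smoothed_estimate(0, 10, 5, alpha=5.0)
print(low < high)  # True
```

Both knobs act as regularisers: a higher `min_df` throws away rare words, and a higher `alpha` stops the model from trusting small counts too much. That fits the output above, where the tuned model gives up training accuracy without losing test accuracy.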
      request_id  requester_received_pizza
4040    t3_i8iy4                  0.325085
4041   t3_1mfqi0                  0.545561
4042    t3_lclka                  0.242250
4043   t3_1jdgdj                  0.167956
4044    t3_t2qt4                  0.142908
good words
           word  P(pizza | word)
161         jpg         0.427790
268        rice         0.383345
157       imgur         0.364823
325       tight         0.363433
141     helping         0.361993
235      person         0.354807
345  unemployed         0.344983
231    paycheck         0.338030
233      paying         0.337769
50        check         0.336646
---
bad words
           word  P(pizza | word)
34     birthday         0.181946
169       leave         0.180133
104     florida         0.175216
326        till         0.172488
112     friends         0.165682
20         area         0.162599
108        free         0.159392
285     sitting         0.159047
307    studying         0.155792
111      friend         0.143568
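One plausible way to get word rankings like the ones above is to ask a Naive Bayes model for the probability of success given a request that contains just that one word. The sketch below does this from scratch. It is an illustration under assumptions, not the code that produced the tables: the function name `pizza_probability_by_word`, the toy requests, and the `alpha=1.0` default are all made up.

```python
from collections import Counter


def pizza_probability_by_word(docs, labels, alpha=1.0):
    """P(pizza | word) for a one-word request under smoothed Naive Bayes."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    prior = {c: labels.count(c) / len(labels) for c in (0, 1)}
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in (0, 1)}
    result = {}
    for w in vocab:
        # Bayes' rule: P(c | w) is proportional to P(c) * P(w | c).
        joint = {c: prior[c] * (counts[c][w] + alpha) / totals[c] for c in (0, 1)}
        result[w] = joint[1] / (joint[0] + joint[1])
    return result


# Toy requests, invented for illustration only (1 = got pizza, 0 = did not).
docs = [
    "paycheck tight rent",
    "unemployed paycheck",
    "birthday friends",
    "friends free studying",
]
labels = [1, 1, 0, 0]

probs = pizza_probability_by_word(docs, labels)
ranked = sorted(probs, key=probs.get, reverse=True)
print(ranked[0], ranked[-1])  # paycheck friends
```

Keep in mind that these are single-word scores from a model that treats words as independent, so they show which words tend to travel with successful requests, not that adding a word to a request changes its odds.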