Another week, another Kaggle competition write-up. I’m officially dubbing this month Machine Learning May. But rest assured, this will be the last one of these write-ups for a little while; I don’t want to get typecast :)
For this competition, we’re tasked with predicting altruism. We’re given a dataset of textual requests for pizza from the Reddit community Random Acts of Pizza. All requests ask for one thing: a free pizza. Our goal is to create a model that can predict the success of these requests.
This is a nice intro challenge for people wanting to cut their teeth on some basic natural language processing (NLP). There’s also an accompanying academic paper by a group of Stanford PhD students here, which is a fun read and helpful in seeing how they approached the problem.
This time around, submissions are evaluated on the area under the ROC curve. For more info on ROC analysis, this paper is a great primer; in short, an ROC curve visualizes a classifier’s performance (true positive rate vs. false positive rate) across discrimination thresholds (the cutoff probability for the binary classification).
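To make the metric concrete, here’s a tiny toy example using sklearn’s roc_auc_score, which is a shortcut for the roc_curve/auc combination used later in this post:
# toy AUC calculation: two negatives, two positives
from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print roc_auc_score(y_true, y_score)  # 0.75 - one of the four pos/neg pairs is mis-ordered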
Alright, time for the models. I ended up building two. The first is a very simple Naive Bayes bag-of-words model using the text of the request (after combining the title and body fields). The code for it is below, and after some parameter tuning it scores 0.605. The second model uses a Gradient Boosting Classifier and incorporates more features (several of which come from the academic paper above); it scores 0.702 (code here).
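I won’t walk through the second model in this post, but to give a rough idea of its shape, here’s a minimal sketch assuming just two of the dataset’s metadata fields as stand-in features (the real model uses a much larger, paper-inspired feature set):
# minimal sketch of the gradient boosting model - NOT the full
# feature set, just two stand-in metadata columns for illustration
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_json('../data/train.json')
feature_cols = [
    'requester_account_age_in_days_at_request',
    'requester_upvotes_minus_downvotes_at_request',
]
X = train[feature_cols].as_matrix()
y = train['requester_received_pizza'].astype(int).as_matrix()
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)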
Curious which individual words best predict whether a request will result in a pizza? See below!
import json, re
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve, auc
dfs = {}
for name in ['train', 'test']:
    df = pd.read_json('../data/%s.json' % name)
    df['_data'] = name
    dfs[name] = df
# combine train and test data into one df
df = dfs['train'].append(dfs['test'])
df = df.reset_index(drop=True)
# limit to shared columns (plus predictor)
cols = list(dfs['test'].columns) + ['requester_received_pizza']
df = df[cols]
# rename a few columns to be pithier
df.rename(columns={
    'request_title': 'title',
    'request_text_edit_aware': 'body',
    'requester_received_pizza': 'got_pizza',
}, inplace=True)
# convert got pizza indicator to ints
df['got_pizza'] = df['got_pizza'].apply(lambda x: -1 if pd.isnull(x) else int(x))
# toss out unused columns
cols_to_keep = ['_data', 'request_id', 'title', 'body', 'got_pizza']
df = df[cols_to_keep]
df.iloc[0]
_data train
request_id t3_l25d7
title Request Colorado Springs Help Us Please
body Hi I am in need of food for my 4 children we a...
got_pizza 0
Name: 0, dtype: object
# combine title and body columns
df['txt_raw'] = df['title'] + ' ' + df['body']
# check that it worked
for col in ['title', 'body', 'txt_raw']:
    print df.iloc[0][col]
    print '--'
Request Colorado Springs Help Us Please
--
Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
--
Request Colorado Springs Help Us Please Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
--
# clean up text (lowercase, letters only, remove stop words)
def clean_txt(raw, remove_stop=True):
    # remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", raw)
    # convert to lower case, split into individual words
    words = letters_only.lower().split()
    # optionally remove stop words
    if remove_stop:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    # join cleaned words back into a single string
    return " ".join(words)
df['txt_clean'] = df['txt_raw'].apply(clean_txt)
# check that it worked
for col in ['txt_raw', 'txt_clean']:
    print df.iloc[0][col]
    print '--'
Request Colorado Springs Help Us Please Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
--
request colorado springs help us please hi need food children military family really hit hard times exahusted means help able feed family make another night ask know blessing coming whatever u find heart give greatly appreciated
--
# prep training set data for modeling
# creates numerical arrays for X (bag of words) and y (got pizza)
def get_xy(vectorizer=None, txt_col='txt_clean'):
    if vectorizer is None:
        vectorizer = CountVectorizer()
    dg = df[df['_data'] == 'train']
    X = vectorizer.fit_transform(dg[txt_col]).toarray()
    y = dg['got_pizza'].astype(int).as_matrix()
    return X, y
# randomly split training set and try default NB model
X, y = get_xy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)
model = MultinomialNB().fit(X_train, y_train)
print "Accuracy on training data: %f" % model.score(X_train, y_train)
print "Accuracy on test data: %f" % model.score(X_test, y_test)
y_pred = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
print "AUC: %f" % auc(fpr, tpr)
Accuracy on training data: 0.885149
Accuracy on test data: 0.717822
AUC: 0.512876
#
# whoa, not very good - an ROC AUC only a little better than
# random chance, plus a big gap between training and test
# accuracy, which suggests overfitting
#
# let's calibrate some params and pick the values that maximize
# ROC AUC: alpha (the classifier's additive smoothing) and
# min_df (the vectorizer's minimum document frequency, which
# drops words that appear in too few requests)
#
# the grid of params to search over
alphas = [1, 5, 10, 25]
min_dfs = [0.001, 0.01, 0.02, 0.05]
# loop through values to find optimal settings
best_alpha, best_min_df = None, None
max_auc = -np.inf
for alpha in alphas:
    for min_df in min_dfs:
        vectorizer = CountVectorizer(min_df=min_df)
        X, y = get_xy(vectorizer)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
        model = MultinomialNB(alpha=alpha).fit(X_train, y_train)
        y_pred = model.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_test, y_pred)
        auc_val = auc(fpr, tpr)
        if auc_val > max_auc:
            max_auc = auc_val
            best_alpha, best_min_df = alpha, min_df
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df
print "best auc: %f" % max_auc
alpha: 5.000000
min_df: 0.020000
best auc: 0.605254
# let's make sure this new model is less over-fit
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = get_xy(vectorizer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)
model = MultinomialNB(alpha=best_alpha).fit(X_train, y_train)
print "Accuracy on training data: %f" % model.score(X_train, y_train)
print "Accuracy on test data: %f" % model.score(X_test, y_test)
Accuracy on training data: 0.757756
Accuracy on test data: 0.724752
# finally, let's train on full training set with best params
# and save our predictions for submission
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = get_xy(vectorizer)
model = MultinomialNB(alpha=best_alpha).fit(X, y)
df_test = df[df['_data'] == 'test'].copy()
X_test = vectorizer.transform(df_test['txt_clean'])
y_pred = model.predict_proba(X_test)[:, 1]
df_test['requester_received_pizza'] = y_pred
final_df = df_test[['request_id', 'requester_received_pizza']]
# sanity check entries
print final_df.head(5)
# output predictions
final_df.to_csv('../output/predicted.csv', index=False)
request_id requester_received_pizza
4040 t3_i8iy4 0.325085
4041 t3_1mfqi0 0.545561
4042 t3_lclka 0.242250
4043 t3_1jdgdj 0.167956
4044 t3_t2qt4 0.142908
# and for funsies, let's see which words best predict getting a pizza
words = np.array(vectorizer.get_feature_names())
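# each row of an identity matrix looks like a document containing
# exactly one occurrence of a single word, so predict_proba on
# those rows gives P(pizza | word) under the fitted model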
x = np.eye(X.shape[1])
probs = model.predict_proba(x)[:, 1]
word_df = pd.DataFrame()
word_df['word'] = words
word_df['P(pizza | word)'] = probs
word_df.sort('P(pizza | word)', ascending=False, inplace=True)
print 'good words'
print word_df.head(10)
print '\n---\n'
print 'bad words'
print word_df.tail(10)
good words
word P(pizza | word)
161 jpg 0.427790
268 rice 0.383345
157 imgur 0.364823
325 tight 0.363433
141 helping 0.361993
235 person 0.354807
345 unemployed 0.344983
231 paycheck 0.338030
233 paying 0.337769
50 check 0.336646
---
bad words
word P(pizza | word)
34 birthday 0.181946
169 leave 0.180133
104 florida 0.175216
326 till 0.172488
112 friends 0.165682
20 area 0.162599
108 free 0.159392
285 sitting 0.159047
307 studying 0.155792
111 friend 0.143568