Tutorial on using Vowpal Wabbit and R Magic for the Oracle Data Challenge

Step 1: Get Data

A. Download full compressed data sets

links posted on the Slack general channel

B. Download sampled data

links posted on the Slack general channel

C. Create your own sample from downloaded full data sets

Download and install odc from https://github.com/CoDataScience/oracle-audience

also see Pedro's posts on the Slack competition channel

In [55]:
%%bash
#sample from full data sets
python3 /Users/jmf/workspace/Jupyter/odc/oracle-audience-master/odc.py sample 10000 data/train data/sampled_train.txt  #but this only sampled negative examples?
Single file detected
Traceback (most recent call last):
  File "/Users/jmf/workspace/Jupyter/odc/oracle-audience-master/odc.py", line 173, in <module>
    cli()
  File "/usr/local/lib/python3.5/site-packages/click-6.6-py3.5.egg/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click-6.6-py3.5.egg/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click-6.6-py3.5.egg/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/site-packages/click-6.6-py3.5.egg/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click-6.6-py3.5.egg/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/jmf/workspace/Jupyter/odc/oracle-audience-master/odc.py", line 128, in sample
    files = [smart_open.smart_open(input_path)]
  File "/usr/local/lib/python3.5/site-packages/smart_open-1.3.5-py3.5.egg/smart_open/smart_open_lib.py", line 127, in smart_open
    return file_smart_open(parsed_uri.uri_path, mode)
  File "/usr/local/lib/python3.5/site-packages/smart_open-1.3.5-py3.5.egg/smart_open/smart_open_lib.py", line 558, in file_smart_open
    return open(fname, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'data/train'
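The traceback above is just a path problem: odc.py detected a single-file input, but data/train does not exist; the Step 2 merge below produces data/train.txt. Assuming that merge has been run, the corrected call is python3 /Users/jmf/workspace/Jupyter/odc/oracle-audience-master/odc.py sample 10000 data/train.txt data/sampled_train.txt.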

Step 2: [If you are running on the full data set] Decompress Data

see http://codata.colorado.edu/competitions/oracle/

In [ ]:
%%bash

#decompress and merge train data
tar -xzvf data/train_set.tar.gz
bzip2 -d data/train/*.bz2
cat data/train/* > data/train.txt
In [ ]:
%%bash

#decompress and merge validation data
tar -xzvf data/val_set.tar.gz
bzip2 -d data/val/*.bz2
cat data/val/* > data/val.txt
In [ ]:
%%bash

#decompress and merge contest data
tar -xzvf data/contest_set.tar.gz
bzip2 -d data/contest/*.bz2
cat data/contest/* > data/contest.txt
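Note: tar extracts relative to the current working directory, so these cells assume each archive's entries already carry the data/ prefix; if they unpack elsewhere, add -C (e.g., tar -xzvf data/train_set.tar.gz -C data) so the .bz2 shards land where the bzip2 and cat commands expect them.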

Step 3: Convert training and validation data to Vowpal Wabbit format

code adapted from tutorial at http://codata.colorado.edu/notebooks/tutorials/odc_vowpalwabbit_nehal_kamat/
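Each line of the sampled files is a Python literal; judging from how the conversion code below indexes it, el[0] is the household id, el[1] is the spend label, and el[2][1] is a list of (feature_id, value) pairs (el[2][0] is unused here). A minimal sketch with a made-up record:

In [ ]:
# hypothetical input record; the structure is inferred from the conversion code below
el = (12345, 84.97, (None, [(35875, 1), (7208, 2)]))
# the conversion emits a VW line of the form: label 'tag | feature:value ...
vw_line = str(el[1]) + " '" + str(el[0]) + " | " + " ".join(
    "{}:{}".format(f, v) for f, v in el[2][1])
print(vw_line)  # 84.97 '12345 | 35875:1 7208:2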

In [1]:
import itertools
DIR = "./data/"

def file_len(fname):
    # count lines by scanning the file once; -1 handles empty files
    i = -1
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
In [103]:
# convert training data to VW format: label 'tag | feature:value ...

vw_compatible_file = open(DIR + "sampled_10000-train-vw_compatible_file.vw", 'w')  # text mode; 'wb' breaks str writes under Python 3
with open(DIR + 'sampled_10000-train.txt', 'r') as train_data:
    nRows = file_len(DIR + 'sampled_10000-train.txt')
    for line in itertools.islice(train_data, 0, nRows):
        el = eval(line)  # each line is a Python literal; ast.literal_eval would be safer
        vw_compatible_file.write(str(el[1]) + " '" + str(el[0]) + " | ")  # spend label, household-id tag
        for item in el[2][1]:
            vw_compatible_file.write(str(item[0]) + ":" + str(item[1]) + " ")
        vw_compatible_file.write("\n")
vw_compatible_file.close()
In [106]:
# convert validation data to VW format

vw_compatible_file = open(DIR + "sampled_10000-val-vw_compatible_file.vw", 'w')  # text mode; 'wb' breaks str writes under Python 3
with open(DIR + 'sampled_10000-val.txt', 'r') as val_data:
    nRows = file_len(DIR + 'sampled_10000-val.txt')
    for line in itertools.islice(val_data, 0, nRows):
        el = eval(line)
        vw_compatible_file.write(str(el[1]) + " '" + str(el[0]) + " | ")
        for item in el[2][1]:
            vw_compatible_file.write(str(item[0]) + ":" + str(item[1]) + " ")
        vw_compatible_file.write("\n")
vw_compatible_file.close()

Step 4: Train Vowpal Wabbit model(s)

see https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial

Note: vw-varinfo is a wrapper around vw that produces human-readable output of feature weights. see https://github.com/JohnLangford/vowpal_wabbit/wiki/using-vw-varinfo
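Each row of feature_weights.txt has the columns FeatureName, HashVal, MinVal, MaxVal, Weight, and RelScore (a percentage); the R cells at the end of this notebook read and rank them.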

Default Model

In [284]:
%%bash

vw-varinfo -f data/model.vw -d data/sampled_10000-train-vw_compatible_file.vw > feature_weights.txt
#  -f says write the final model here
#  -d says use this training data
#  (adding --loss_function logistic would make this binary classification; it is not used here since the labels are spend amounts)

Lasso Regression

In [122]:
%%bash

vw-varinfo --l1 1e-7 -f data/model.vw -d data/sampled_10000-train-vw_compatible_file.vw > feature_weights.txt
# --l1 adds L1 (lasso) regularization at the given strength

Neural network model

In [139]:
%%bash

vw-varinfo --nn 10 -f data/model.vw -d data/sampled_10000-train-vw_compatible_file.vw > feature_weights.txt
#  --nn N adds a neural network layer with N hidden units (here, 10)

Step 5: Run Vowpal Wabbit model on Validation Set

In [285]:
%%bash

#example
vw -i data/model.vw -t data/sampled_10000-val-vw_compatible_file.vw -p predictions.txt
#  -t says don't learn, just score
#  -i says use this model to score
#  -p says output predictions to this file
only testing
predictions = predictions.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/sampled_10000-val-vw_compatible_file.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
126.284500 126.284500            1            1.0   0.0000  11.2376      346
139.537628 152.790756            2            2.0   0.0000  12.3609      171
287.765436 435.993244            4            4.0  84.9700  57.2009     4396
322.304579 356.843722            8            8.0   0.0000   3.0295       69
241.897910 161.491241           16           16.0   0.0000  21.8121      841
391.066750 540.235590           32           32.0   0.0000   0.6538       49
346.160955 301.255160           64           64.0   0.0000   7.1842      111
1054.787658 1763.414360          128          128.0   0.0000   1.5063       57
1095.489470 1136.191283          256          256.0   0.0000   2.9224      464
7020.840507 12946.191544          512          512.0   0.0000  65.4305     6204
5337.910548 3654.980590         1024         1024.0   0.0000   9.7615      267
5415.389770 5492.868991         2048         2048.0   0.0000  18.0220      735
4276.943999 3138.498227         4096         4096.0   0.0000   5.8025      235
3832.083136 3387.222273         8192         8192.0   0.0000  30.6766     1879

finished run
number of examples per pass = 10000
passes used = 1
weighted example sum = 10000.000000
weighted label sum = 139006.738209
average loss = 4090.008185
best constant = 13.900674
total feature number = 10495922

Step 6: Convert VW predictions.txt file to submission format

see Pedro's posts on slack
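When examples carry a tag, vw -p writes one line per example with the prediction followed by the tag (here the household id); parse_lines below splits each line back into a (score, hhid) pair.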

In [10]:
#nBuyers = 100000
nBuyers = 1000

import sys
def parse_lines(file):
    lines = []
    with open(file) as f:
        for l in f:
            score, hhid = l.strip().split()
            score = float(score)
            hhid = int(hhid)
            lines.append((score, hhid))
    return lines

def write_submission(lines, output):
    sorted_predictions = sorted(lines, reverse=True)
    with open(output, 'w') as f:
        f.write('household_id,advertise\n')
        for i in range(len(sorted_predictions)):
            _, hhid = sorted_predictions[i]
            if i < nBuyers:
                f.write('{},1\n'.format(hhid))
            else:
                f.write('{},0\n'.format(hhid))
                
                
def write_submission_ordered(lines, output):
    # every household id, highest score first (nBuyers is irrelevant here)
    sorted_predictions = sorted(lines, reverse=True)
    with open(output, 'w') as f:
        f.write('household_id\n')
        for _, hhid in sorted_predictions:
            f.write('{}\n'.format(hhid))

#if __name__ == '__main__':
#    lines = parse_lines(sys.argv[1])
#    write_submission(lines, sys.argv[2])
In [286]:
lines = parse_lines('predictions.txt')
write_submission(lines,'submission.csv')
write_submission_ordered(lines, 'submission_ordered.csv')

Step 7: [If you are using sample validation data] Create sample_val_spend file for scoring

In [75]:
# write household_id,spend rows for the scoring script
sample_val_spend_file = open(DIR + "sample_val_spend.csv", 'w')  # text mode; 'wb' breaks str writes under Python 3
with open(DIR + 'sampled_10000-val.txt', 'r') as sample_val_spend_data:
    nRows = file_len(DIR + 'sampled_10000-val.txt')
    for line in itertools.islice(sample_val_spend_data, 0, nRows):
        el = eval(line)
        sample_val_spend_file.write(str(el[0]) + "," + str(el[1]))
        sample_val_spend_file.write("\n")
sample_val_spend_file.close()

Step 8: Run scoring script on validation data

see https://github.com/CoDataScience/oracle-audience and Pedro's posts on slack

Ratio Scoring

In [287]:
%%bash

python3 /Users/jmf/workspace/Jupyter/odc/oracle-audience-master/odc.py score --ratio data/all_val_spend.csv submission.csv

#0.45 default VW settings
#0.45 L1 (lasso regression) at 1e-7
#0.40 L1 (lasso regression) at 1e-4
#0.27 L1 (lasso regression) at .01

#0.44 neural network w/ 5 hidden units
#0.44 neural network w/ 10 hidden units
#0.28 neural network w/ 100 hidden units

#0.05 logistic loss function (on unmodified data sets; logistic loss expects -1/+1 labels, so training on raw spend amounts likely explains this score)

#0.44 default VW with interaction terms of top 79 features
Scoring Options
ratio: True
ordered: False
spend_file: data/all_val_spend.csv
submission_file: submission.csv
Revenue: 68285.85000000003
Possible Revenue: 139006.7400000001
Fraction of Possible Revenue: 0.49124128801236533
Number of Responders: 410
Possible Number of Responders: 909
Fraction of Possible Responders 0.45104510451045104
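In other words, the 1,000 households flagged here capture about 49% of the total validation spend and 45% of the possible responders.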

Ordered Scoring [must use full data set to be accurate?]

In [111]:
%%bash 

python3 /Users/jmf/workspace/Jupyter/odc/oracle-audience-master/odc.py score --ordered --ratio data/all_val_spend.csv submission_ordered.csv
Scoring Options
ratio: True
ordered: True
spend_file: data/all_val_spend.csv
submission_file: submission_ordered.csv
Revenue: 72796.56000000003
Possible Revenue: 139006.7400000001
Fraction of Possible Revenue: 0.5236908656371624
Number of Responders: 432
Possible Number of Responders: 909
Fraction of Possible Responders 0.4752475247524752

Step 9: Convert Contest Set data to Vowpal Wabbit format

In [ ]:
# convert contest data to VW format

vw_compatible_file = open(DIR + "contest-vw_compatible_file.vw", 'w')  # name must match the -d path in the next step; 'wb' breaks str writes under Python 3
with open(DIR + 'contest.txt', 'r') as contest_data:
    nRows = file_len(DIR + 'contest.txt')  # count lines of the input .txt, not the output .vw
    for line in itertools.islice(contest_data, 0, nRows):
        el = eval(line)
        vw_compatible_file.write(str(el[1]) + " '" + str(el[0]) + " | ")
        for item in el[2][1]:
            vw_compatible_file.write(str(item[0]) + ":" + str(item[1]) + " ")
        vw_compatible_file.write("\n")
vw_compatible_file.close()

Step 10: Run Vowpal Wabbit model on Contest Set

In [ ]:
%%bash

#example
vw -i data/model.vw -t data/contest-vw_compatible_file.vw -p predictions_contest.txt
#  -t says don't learn, just score
#  -i says use this model to score
#  -p says output predictions to this file

Step 11: Convert VW's contest set output to submission format

In [64]:
lines = parse_lines(DIR + 'predictions_contest.txt')
write_submission(lines, DIR +'submission_contest.csv')
------------------------------------------------------------
IOError                    Traceback (most recent call last)
<ipython-input-64-efbed775f5d1> in <module>()
----> 1 lines = parse_lines(DIR + 'predictions_contest.txt')
      2 write_submission(lines, DIR +'submission_contest.csv')

<ipython-input-61-04f26cf8fb94> in parse_lines(file)
      5 def parse_lines(file):
      6     lines = []
----> 7     with open(file) as f:
      8         for l in f:
      9             score, hhid = l.strip().split()

IOError: [Errno 2] No such file or directory: './data/predictions_contest.txt'
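The IOError is another path mismatch: the vw command in Step 10 wrote predictions_contest.txt to the notebook's working directory (its -p argument has no data/ prefix), so call parse_lines('predictions_contest.txt') without the DIR prefix, or change -p to data/predictions_contest.txt.

Step 12: Inspect feature weights with R magic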
In [3]:
%load_ext rpy2.ipython
In [4]:
%%R
f = read.table('feature_weights.txt', header=T)
print(dim(f))
print(head(f))
f$RelScore = as.numeric(sub("%","",f$RelScore))  # strip the trailing % so RelScore is numeric
summary(f)
[1] 130086      6
  FeatureName HashVal MinVal MaxVal  Weight RelScore
1       35875   35875      0      0 5996.03   17.65%
2       39858   39858      0      0 5177.77   15.24%
3        7208    7208      0      0 4508.64   13.27%
4      110472  110472      0      0 1889.09    5.56%
5       66485   66485      0      0 1889.09    5.56%
6       12654   12654      0      0 1750.51    5.15%
  FeatureName        HashVal           MinVal      MaxVal      Weight         
 0      :     1   Min.   :     0   Min.   :0   Min.   :0   Min.   :-33980.20  
 10     :     1   1st Qu.: 33359   1st Qu.:0   1st Qu.:0   1st Qu.:    -0.22  
 100    :     1   Median : 66742   Median :0   Median :0   Median :    -0.03  
 1000   :     1   Mean   : 66745   Mean   :0   Mean   :0   Mean   :    -0.19  
 10000  :     1   3rd Qu.:100107   3rd Qu.:0   3rd Qu.:0   3rd Qu.:     0.12  
 100000 :     1   Max.   :133530   Max.   :0   Max.   :0   Max.   :  5996.03  
 (Other):130080                                                               
    RelScore         
 Min.   :-100.00000  
 1st Qu.:   0.00000  
 Median :   0.00000  
 Mean   :  -0.00044  
 3rd Qu.:   0.00000  
 Max.   :  17.65000  
                     
In [5]:
%%R
f$absRelScore = abs(f$RelScore)
# distributions of the features with non-trivial relative scores
hist(f$absRelScore[f$absRelScore>1], breaks=15)
hist(f$RelScore[abs(f$RelScore)>=1], breaks=15)
In [6]:
%%R
topFeatures = f[order(-f$absRelScore),]  # sort by absolute relative score, descending
print(nrow(topFeatures))
topFeatures[1:100,]
[1] 130086
       FeatureName HashVal MinVal MaxVal     Weight RelScore absRelScore
130086       78025   78025      0      0 -33980.200  -100.00      100.00
1            35875   35875      0      0   5996.030    17.65       17.65
2            39858   39858      0      0   5177.770    15.24       15.24
3             7208    7208      0      0   4508.640    13.27       13.27
130085       13930   13930      0      0  -3128.130    -9.21        9.21
4           110472  110472      0      0   1889.090     5.56        5.56
5            66485   66485      0      0   1889.090     5.56        5.56
6            12654   12654      0      0   1750.510     5.15        5.15
7            15330   15330      0      0   1716.390     5.05        5.05
8            53200   53200      0      0   1601.740     4.71        4.71
130084       35139   35139      0      0  -1504.100    -4.43        4.43
9            97998   97998      0      0   1474.850     4.34        4.34
130083      104584  104584      0      0  -1256.530    -3.70        3.70
10           89909   89909      0      0   1188.030     3.50        3.50
11           24090   24090      0      0    981.228     2.89        2.89
12           40883   40883      0      0    906.858     2.67        2.67
13           29339   29339      0      0    871.751     2.57        2.57
14          123831  123831      0      0    729.806     2.15        2.15
130082        2460    2460      0      0   -726.053    -2.14        2.14
15           13675   13675      0      0    700.632     2.06        2.06
16           75887   75887      0      0    647.230     1.90        1.90
130081       96609   96609      0      0   -570.376    -1.68        1.68
17          132020  132020      0      0    564.235     1.66        1.66
18          113689  113689      0      0    548.918     1.62        1.62
130080        2769    2769      0      0   -538.172    -1.58        1.58
130079       13188   13188      0      0   -497.602    -1.46        1.46
130078      118386  118386      0      0   -464.157    -1.37        1.37
19          110394  110394      0      0    463.707     1.36        1.36
20           57070   57070      0      0    463.707     1.36        1.36
21          112669  112669      0      0    462.643     1.36        1.36
22             158     158      0      0    462.643     1.36        1.36
130077      108839  108839      0      0   -448.514    -1.32        1.32
130076       77849   77849      0      0   -441.237    -1.30        1.30
130075       11601   11601      0      0   -433.235    -1.27        1.27
23           55881   55881      0      0    428.961     1.26        1.26
130074       71374   71374      0      0   -404.756    -1.19        1.19
24           15580   15580      0      0    374.565     1.10        1.10
25          111537  111537      0      0    368.788     1.09        1.09
26           83177   83177      0      0    368.620     1.08        1.08
130073       19556   19556      0      0   -359.455    -1.06        1.06
27           84417   84417      0      0    358.088     1.05        1.05
130072      127527  127527      0      0   -346.891    -1.02        1.02
130071       24268   24268      0      0   -334.316    -0.98        0.98
130070       17874   17874      0      0   -328.244    -0.97        0.97
28           44440   44440      0      0    326.484     0.96        0.96
130068       65460   65460      0      0   -321.337    -0.95        0.95
130069       34370   34370      0      0   -322.602    -0.95        0.95
130065      114878  114878      0      0   -319.911    -0.94        0.94
130066       10686   10686      0      0   -320.925    -0.94        0.94
130067       38885   38885      0      0   -321.062    -0.94        0.94
29           55133   55133      0      0    312.607     0.92        0.92
130064       32330   32330      0      0   -299.904    -0.88        0.88
130062       94312   94312      0      0   -288.496    -0.85        0.85
130063       64412   64412      0      0   -289.259    -0.85        0.85
130061       80783   80783      0      0   -273.671    -0.81        0.81
130060      117923  117923      0      0   -267.721    -0.79        0.79
130059      110696  110696      0      0   -260.236    -0.77        0.77
30          115049  115049      0      0    251.424     0.74        0.74
130058       28373   28373      0      0   -239.144    -0.70        0.70
31           54332   54332      0      0    224.978     0.66        0.66
32           32477   32477      0      0    224.901     0.66        0.66
130057       42413   42413      0      0   -209.135    -0.62        0.62
33           28677   28677      0      0    204.781     0.60        0.60
34           13054   13054      0      0    201.646     0.59        0.59
130056       68904   68904      0      0   -199.098    -0.59        0.59
130055       49835   49835      0      0   -196.616    -0.58        0.58
35           34791   34791      0      0    189.236     0.56        0.56
130054      122914  122914      0      0   -189.534    -0.56        0.56
130053      121268  121268      0      0   -183.706    -0.54        0.54
36           94689   94689      0      0    179.752     0.53        0.53
130052       62369   62369      0      0   -181.773    -0.53        0.53
37           69561   69561      0      0    162.146     0.48        0.48
38           98657   98657      0      0    161.584     0.48        0.48
130049       57947   57947      0      0   -162.636    -0.48        0.48
130050       94635   94635      0      0   -163.295    -0.48        0.48
130051       70162   70162      0      0   -164.569    -0.48        0.48
39           78553   78553      0      0    160.475     0.47        0.47
130046       94249   94249      0      0   -151.709    -0.45        0.45
130047       14676   14676      0      0   -152.356    -0.45        0.45
130048       38735   38735      0      0   -154.583    -0.45        0.45
40           21800   21800      0      0    148.688     0.44        0.44
130043       26448   26448      0      0   -149.624    -0.44        0.44
130044      129933  129933      0      0   -150.317    -0.44        0.44
130045      120246  120246      0      0   -150.514    -0.44        0.44
41            3566    3566      0      0    146.185     0.43        0.43
130042       77584   77584      0      0   -147.687    -0.43        0.43
130041       54275   54275      0      0   -141.398    -0.42        0.42
42           54795   54795      0      0    138.091     0.41        0.41
130040      119756  119756      0      0   -140.315    -0.41        0.41
130039       41106   41106      0      0   -134.787    -0.40        0.40
43           67944   67944      0      0    133.864     0.39        0.39
44           29023   29023      0      0    130.056     0.38        0.38
45           15953   15953      0      0    129.827     0.38        0.38
130037       24259   24259      0      0   -128.626    -0.38        0.38
130038       59065   59065      0      0   -129.770    -0.38        0.38
130035       54642   54642      0      0   -124.115    -0.37        0.37
130036       74968   74968      0      0   -126.765    -0.37        0.37
46           10681   10681      0      0    119.820     0.35        0.35
130034       68021   68021      0      0   -116.651    -0.34        0.34
47            1786    1786      0      0    110.523     0.33        0.33
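Aside: the "interaction terms of top 79 features" run noted in the Step 8 scoring comments can be set up in VW by emitting the selected features under their own namespace during conversion and crossing namespaces with -q (e.g., -q st crosses every feature in namespace s with every feature in namespace t).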
In [7]:
%Rpull topFeatureNames
#pulls the topFeatureNames variable and its data from R into Python
topFeatureNames
------------------------------------------------------------
LookupError                Traceback (most recent call last)
<ipython-input-7-5d9fe6c92f6a> in <module>()
----> 1 get_ipython().magic(u'Rpull topFeatureNames')
      2 #sends topFeatureNames variable and data from R to python

//anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
   2161         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2162         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2163         return self.run_line_magic(magic_name, magic_arg_s)
   2164 
   2165     #-------------------------------------------------------------------------

//anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
   2082                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2083             with self.builtin_trap:
-> 2084                 result = fn(*args,**kwargs)
   2085             return result
   2086 

<decorator-gen-126> in Rpull(self, line)

//anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

//anaconda/lib/python2.7/site-packages/rpy2-2.8.3-py2.7-macosx-10.6-x86_64.egg/rpy2/ipython/rmagic.pyc in Rpull(self, line)
    346         outputs = args.outputs
    347         for output in outputs:
--> 348             robj = ri.globalenv.get(output)
    349             self.shell.push({output: converter.ri2py(robj) })
    350 

LookupError: 'topFeatureNames' not found
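The LookupError occurs because topFeatureNames was never defined in the R session; only the topFeatures data frame exists. A minimal fix (hypothetical variable name, assuming you want the top 100 feature names on the Python side) is to create it in R first, then pull it:

In [ ]:
%%R
# hypothetical: take the top-100 feature names from the topFeatures data frame above
topFeatureNames = as.character(topFeatures$FeatureName[1:100])
In [ ]:
%Rpull topFeatureNames
topFeatureNames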