In this competition your objective is to use prior purchase data to maximize future revenue.
Suppose you run a company called SuperMegaCorp which has enough money to send 100,000 advertisements per month to prospective customers. Your database has tens of millions of customers, but each month you only look at roughly 1 million customers before deciding which 100,000 to send advertisements to. In industry terms you are building an audience for your advertising campaign.
This competition is sponsored by Oracle Data Cloud.
While advertising and media continue to grow digitally and revolutionize marketing “online”, 93% of all consumer spending is still happening “offline”. Marketers want to find buyers of their products - not just “clickers”. Oracle Data Cloud is leveraging the power of big data, technology and predictive analytics to fuse offline sales with digital media. By helping marketers build, reach and measure purchase-based audiences, ODC serves as core marketing infrastructure for a data-driven world. The biggest advertisers and social media platforms in the world use these methods to help better understand and plan their advertising strategies.
Each row of the data represents one “household”. You can think of this as a representation of all the people and devices which live at a certain address. Each row contains a vector of sparse features representing data such as past transactional behavior and demographics. More details about these features may be provided, but for now this is all you know. Finally, each row contains the amount of money spent by the household in response to advertising.
You are provided four files for this competition:
train_set.tar.gz: Training data which contains data from 2016-03-01 to 2016-04-30
val_set.tar.gz: Validation data which contains data from 2016-05-01 to 2016-06-30
contest_set.tar.gz: Contest data which contains data from 2016-07-01 to 2016-08-30. This does not have labels
total_spend.tar.gz: Amount spent per household in the training and validation sets. This is provided for convenience.
Obtaining Data: To obtain the data please join our slack channel. If you don’t know how to do so please send an email to
Each file is a gzip-compressed tar archive which can be decompressed using
tar -xzvf file.tar.gz
The resulting directory will contain a number of files named part-00001.bz2, and so on. These can be decompressed using
bzip2 -d *.bz2
Decompressing removes the .bz2 extension from each file, so you can then merge everything into a single file using
cat part-* > everything.txt
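As an alternative to decompressing everything to disk, the part files can be streamed directly. The following is a minimal Python sketch, not an official loader: the function name iter_rows and its directory argument are hypothetical, and it assumes the files are UTF-8 text with one row per line and names matching part-*.bz2.

```python
import bz2
import glob
import os


def iter_rows(directory):
    """Yield non-empty text lines from every part-*.bz2 file, in file-name order.

    Assumption: the part files are UTF-8 text with one row per line.
    """
    pattern = os.path.join(directory, "part-*.bz2")
    for path in sorted(glob.glob(pattern)):
        # bz2.open in text mode decompresses on the fly, so nothing is written to disk.
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line
```

Because this is a generator, only one line is held in memory at a time, which may help if you plan to subsample the data rather than load all of it.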
Each row contains the household id, the amount spent by the household, the total number of features (constant across all rows), and the sparse vector of features. The format with each field labeled is shown first, followed by an example row:
'(<household id>, <spend>, (<total features>, [(<feature idx>, <feature value>), …]))'
'(36839559, 0.0, (133531, [(0, 0.0040397762585456807), (144, 0.031645569620253167), …]))'
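Since a row in this format has the shape of a nested Python tuple, one way to read it is with ast.literal_eval. The sketch below is only an illustration: the function name parse_row is hypothetical, it assumes any surrounding quote characters can be stripped, and the sample row is the example above with the elided "…" dropped (real rows carry the full feature list).

```python
import ast


def parse_row(line):
    """Parse one row into (household_id, spend, total_features, features).

    features is a dict mapping feature index -> feature value.
    Assumption: the row may be wrapped in single or double quotes.
    """
    text = line.strip().strip("'\"")
    household_id, spend, (total_features, pairs) = ast.literal_eval(text)
    return household_id, float(spend), total_features, dict(pairs)


# Example row from the description, truncated to its two visible features.
row = "'(36839559, 0.0, (133531, [(0, 0.0040397762585456807), (144, 0.031645569620253167)]))'"
household_id, spend, total_features, features = parse_row(row)
```

Storing the features as a dict keeps the representation sparse; with 133531 possible features per row, a dense array per household would be mostly zeros.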
For convenience the spend per household is also provided separately, in case you plan on training on a random subsample of the dataset. Labels are included for every household in the training and validation sets. Each row has the following form:
<household id>, <lifetime spend>
The contest set has the same format as the training and validation sets, except that spend is always 0.
Labels are provided in both the training and validation sets. The intent is to use the training set to train your models and the validation set to validate them. The validation is similar to what a leaderboard would be based on. The contest set will be used for final judging and has all 0s instead of labels.
Each of the training and validation sets contains a random sample of ~1,000,000 households which did not respond to ads and ~100,000 that did respond to ads. In general there will be few or no households in common between data sets. The following information may be helpful:
This competition will award prizes according to two primary metrics, with additional prizes for secondary goals as listed below. For the primary evaluation metrics, revenue and number of buyers, you will submit a single model which identifies 100,000 households to advertise to.
You will provide a submission in the following format named
household_id,advertise
241239,5
1242350,4
109539,0
534309,0
1398304,0
1231,3
1231541,2
474564,1
The household id should match the contest data set. The advertise column should contain exactly 100,000 non-zeros, and zeros everywhere else. A non-zero value indicates that you would like to advertise to this household and gives its relative rank, where 1 indicates the household most desirable to advertise to. A 0 indicates you do not want to advertise to this household. The order is useful but is not part of the formal evaluation.
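A submission in this format could be produced from model scores roughly as follows. This is a sketch under assumptions, not a required procedure: the names rank_households and write_submission are hypothetical, the scores dict (household id to predicted score, higher meaning more desirable) stands in for whatever your model outputs, and the output path is left to the caller.

```python
import csv


def rank_households(scores, budget=100_000):
    """Return (household_id, advertise) rows for every household in scores.

    The top `budget` households by score get ranks 1..budget (1 = most
    desirable); all others get 0. If scores holds fewer than `budget`
    households, every household receives a non-zero rank.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    rank_of = {hid: rank for rank, hid in enumerate(ranked[:budget], start=1)}
    return [(hid, rank_of.get(hid, 0)) for hid in scores]


def write_submission(rows, path):
    """Write the rows as CSV with the household_id,advertise header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["household_id", "advertise"])
        writer.writerows(rows)


# Tiny illustration with made-up scores and a budget of 2.
example_rows = rank_households({241239: 0.2, 1242350: 0.9, 109539: 0.5}, budget=2)
```

Before submitting, it is worth checking that the number of non-zero entries is exactly 100,000 and that every household id in the contest set appears once.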
Assume that $a_i$ is 1 if you decide to advertise to household $i$ and 0 otherwise. Further assume that $s_i$ is the amount spent by household $i$ if it is advertised to. With these definitions the quantity to maximize is revenue, defined as:
$$R = \sum_i a_i s_i$$
To define the second metric we introduce a variable $b_i$ which is equal to 1 if $s_i > 0$ and 0 otherwise. With this definition the quantity to maximize is number of buyers, defined as:
$$B = \sum_i a_i b_i$$
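Both metrics can be estimated locally on the validation set, where spend is known. The sketch below is an illustration only: the function name revenue_and_buyers and the dict-based inputs are hypothetical, and it assumes that a household counts as a buyer when its spend is strictly positive and that a household missing from the spend data spent nothing.

```python
def revenue_and_buyers(advertise, spend):
    """Compute (revenue, number of buyers) for a set of advertising decisions.

    advertise: dict household_id -> 0, or a non-zero rank if advertised to.
    spend:     dict household_id -> amount spent by that household.
    Assumption: a buyer is an advertised household with spend > 0.
    """
    chosen = [hid for hid, flag in advertise.items() if flag != 0]
    revenue = sum(spend.get(hid, 0.0) for hid in chosen)
    buyers = sum(1 for hid in chosen if spend.get(hid, 0.0) > 0)
    return revenue, buyers


# Made-up example: households 1 and 3 are advertised to, only 1 spends.
example = revenue_and_buyers({1: 1, 2: 0, 3: 2}, {1: 10.0, 2: 5.0, 3: 0.0})
```

Note that household 2 spends 5.0 in this example but contributes nothing, because only households you choose to advertise to count toward either metric.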
In addition to the metrics above there will also be a prize awarded to a well-performing and fast model. A separate submission is allowed for this and details will be given soon. In general though we expect your model to:
In addition to prizes above there will also be a prize for creativity. This is obviously subjective, but creative visualizations and analysis of the data are highly encouraged.