DATA

Data Description

Our raw data, roughly 3 terabytes in size, recorded all visits to BostonGlobe.com from March 2013 to March 2015. It contained billions of records, with more than 500 attributes associated with each of them. Among those attributes were system-defined identifiers (e.g. visid_high, visid_low), counters (e.g. visit_num, visit_pagenum), visiting preferences (e.g. cookies, domain), content (e.g. channel, urls) and device information (e.g. geo_country, geo_city, user_agent).
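As an illustration of how such hit-level records can be inspected, the sketch below loads a small subset of columns and builds a per-visitor key from the identifier pair. The file name, column subset and the use of pandas are our own assumptions for illustration; only the column names come from the data itself.

    import pandas as pd

    # Hypothetical file name and column subset; the raw feed is hit-level,
    # with 500+ attributes per record.
    COLUMNS = ["visid_high", "visid_low", "visit_num", "visit_pagenum",
               "post_evar5", "channel", "geo_country", "geo_city", "user_agent"]

    hits = pd.read_csv("hit_data.tsv", sep="\t", usecols=COLUMNS, dtype=str)

    # The identifier pair (visid_high, visid_low) distinguishes visitors,
    # so concatenating the two gives a single per-visitor key.
    hits["visitor_id"] = hits["visid_high"] + "_" + hits["visid_low"]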

How we pre-processed the data

We trimmed the data down from 3 terabytes to 5 gigabytes by fixing our target visitors, features of interest and time window. The original dataset contained 25 thousand unique subscribers and 80 million unique non-subscribers. Balancing efficiency against precision, we restricted our study to one year of data, 43 features, all 25,000 subscribers, and a simple random sample of 30,000 non-subscribers drawn from the database. We identified each unique visitor by the Adobe system's default identifier, and took the value '210' in the feature post_evar5 as the flag of a subscriber.
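A minimal sketch of this labelling-and-sampling step is given below, assuming the hits table and visitor_id key from the previous sketch. The aggregation and the seed are illustrative choices, not the study's actual pipeline; only the post_evar5 flag and the sample sizes come from the text.

    import pandas as pd

    # Label each visitor: per the text, post_evar5 == '210' marks a subscriber
    # on at least one of that visitor's hits.
    visitors = (hits.groupby("visitor_id")["post_evar5"]
                    .apply(lambda s: (s == "210").any())
                    .rename("is_subscriber")
                    .reset_index())

    subscribers = visitors[visitors["is_subscriber"]]        # ~25,000 in total
    non_subscribers = visitors[~visitors["is_subscriber"]]

    # Keep every subscriber and draw a simple random sample of 30,000
    # non-subscribers (random_state is an arbitrary reproducibility seed).
    sample = pd.concat([subscribers,
                        non_subscribers.sample(n=30_000, random_state=42)])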

How we performed the feature engineering and feature selection

For each of the 500-plus features, we considered its nature, examined its connection to our objectives, and analysed its predictive performance; one possible screening step is sketched below. The features listed after the sketch are the final set of predictor variables retained for our model after feature engineering and feature selection.
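The sketch below shows one common way to quantify the predictive performance of individual candidate features: score each one alone with a small cross-validated classifier. The model choice, metric, and function name are our own assumptions for illustration; the exact screening procedure used in the study may differ.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def feature_scores(X, y):
        """Rank candidate features (columns of X) by single-feature
        cross-validated AUC against the subscriber labels y."""
        scores = {}
        for col in X.columns:
            clf = RandomForestClassifier(n_estimators=50, random_state=0)
            auc = cross_val_score(clf, X[[col]], y, cv=5,
                                  scoring="roc_auc").mean()
            scores[col] = auc
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)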