Between June 2012 and September 2012 WordPress.com (with sponsorship from GigaOm and Splunk) ran a data competition on Kaggle. The competition results can be found on the Kaggle competition page.
The data consists of ~2.7 million posts published on WordPress.com over 12 weeks. We are releasing the full data set from the competition in the hope that it will foster additional research within this and other domains.
wordpress-like-data-2012.tar.bz (2.3GB)
If you are looking for a starting point, code for the top submissions is available under the BSD license from the Kaggle forums.
Please consider open sourcing any code you create and let us know about any results you achieve or any published papers.
Summary of the Data
The competition had two phases: first and final. Each phase was pulled from 6 weeks of posts from across ~500k blogs on WordPress.com. Blogs were selected if they had at least a modest amount of traffic in 2011 (approx 30+ views in Nov 2011). Additionally, 6 months of aggregate data were provided for the blogs that had a post in that 6 week period, and for all users that liked something in the period.
The 6 weeks for each data set are split into a 5 week training set and a 1 week test set.
Data Set Date Ranges
All dates are GMT, and start/end at 00:00:00.
First data set
- start training date: 2012-03-26
- end training date: 2012-04-23
- start test date: 2012-04-23
- end test date: 2012-05-07
- aggregate start date: 2011-11-23
- aggregate end date: 2012-04-23
Final data set
- start training date: 2012-07-09
- end training date: 2012-08-13
- start test date: 2012-08-13
- end test date: 2012-08-20
- aggregate start date: 2012-02-06
- aggregate end date: 2012-08-06
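As a minimal sketch of how the date ranges above might be applied, the snippet below assigns a timestamp to the training or test window of the final data set. It assumes the windows are half-open (start inclusive, end exclusive), which seems consistent with the training end date equalling the test start date but is not stated explicitly. The names `final_split`, `FINAL_TRAIN_START`, `FINAL_TEST_START`, and `FINAL_TEST_END` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Boundaries of the final data set, taken from the list above (GMT, 00:00:00).
FINAL_TRAIN_START = datetime(2012, 7, 9, tzinfo=timezone.utc)
FINAL_TEST_START = datetime(2012, 8, 13, tzinfo=timezone.utc)  # also the training end
FINAL_TEST_END = datetime(2012, 8, 20, tzinfo=timezone.utc)


def final_split(ts: datetime) -> Optional[str]:
    """Return "train", "test", or None for a timezone-aware timestamp.

    Assumption: each window includes its start and excludes its end.
    """
    if FINAL_TRAIN_START <= ts < FINAL_TEST_START:
        return "train"
    if FINAL_TEST_START <= ts < FINAL_TEST_END:
        return "test"
    return None


last_train_second = datetime(2012, 8, 12, 23, 59, 59, tzinfo=timezone.utc)
first_test_second = datetime(2012, 8, 13, tzinfo=timezone.utc)
```

Under that assumption, `final_split(last_train_second)` falls in the training window and `final_split(first_test_second)` in the test window.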
File Descriptions
trainUsers.json: One JSON object per line, where each line corresponds to one WordPress.com user, and the fields are:
- “uid”: ID for user
- “inTestSet”: whether this user is in the test set (one of the users you’re required to make predictions about)
- “likes”: a list of dictionaries, one for each like by this user during the training period. The fields are:
  - “blog”: blog liked
  - “post_id”: post liked (randomly assigned unique identifier)
  - “like_dt”: date of like
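Because trainUsers.json holds one JSON object per line, it can be read one line at a time instead of being loaded whole. The following is a rough sketch that counts training likes per user. The helper names `iter_json_lines` and `likes_per_user` are hypothetical, and the sample records are made up to match the field list above; the real value types and date format may differ.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_json_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def likes_per_user(lines: Iterable[str]) -> Dict[str, int]:
    """Map each uid to the number of training likes recorded for it."""
    return {user["uid"]: len(user.get("likes", [])) for user in iter_json_lines(lines)}


# Made-up sample records in the documented shape (values are illustrative only).
sample_users = [
    '{"uid": "1", "inTestSet": true, "likes": '
    '[{"blog": "10", "post_id": "100", "like_dt": "2012-07-10 12:00:00"}]}',
    '{"uid": "2", "inTestSet": false, "likes": []}',
]
counts = likes_per_user(sample_users)
```

With the real file, an open file handle can be passed in place of the sample list, for example `with open("trainUsers.json", encoding="utf-8") as f: counts = likes_per_user(f)`.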
trainPostsThin.json: One JSON object per line, where each line corresponds to one blog post from the training set (first 5 weeks). The fields are:
- “blog”: blog ID
- “post_id”: post ID
- “likes”: a list of objects, one for each like of this post during the training period (first 5 weeks). Later likes of the same post are not included. The fields are:
  - “date”
  - “uid” (user id)
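The per-post records in trainPostsThin.json can be aggregated in a single pass. The sketch below totals training likes per blog; `likes_per_blog` is a hypothetical helper, and the sample lines are invented to follow the field list above rather than copied from the data set.

```python
import json
from collections import Counter
from typing import Iterable


def likes_per_blog(lines: Iterable[str]) -> Counter:
    """Sum the number of training likes over all posts of each blog."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        totals[post["blog"]] += len(post.get("likes", []))
    return totals


# Made-up sample posts in the documented shape (values are illustrative only).
sample_posts = [
    '{"blog": "10", "post_id": "100", "likes": '
    '[{"date": "2012-07-10", "uid": "1"}, {"date": "2012-07-11", "uid": "2"}]}',
    '{"blog": "10", "post_id": "101", "likes": [{"date": "2012-07-12", "uid": "1"}]}',
    '{"blog": "11", "post_id": "102", "likes": []}',
]
blog_totals = likes_per_blog(sample_posts)
```

The same loop structure should also work for trainPosts.json, since it is described as carrying the same fields plus additional ones.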
trainPosts.json: This is like trainPostsThin but with many more fields about the post, including its text, tags, and categories.
testPosts.json: Same as trainPosts but without the “likes”.
testPostsThin.json: Same as trainPostsThin but without the “likes”. (So it’s very thin!)
test.csv: List of users in the test data set. These are the users about whom you should make predictions.
kaggle-stats-users-*-*.json: 6 months of aggregate statistics about each user’s like behavior.
- “user_id”
- “num_likes” (in previous 6 months)
- “like_blog_dist”: which blogs this user liked and how often
kaggle-stats-blogs-*-*.json: 6 months of aggregate statistics about each blog.
- “blog_id”
- “num_likes” (in previous 6 months)
- “num_posts” (in previous 6 months)
testUsers.csv: The set of users to evaluate results on. Restricted to users who have “liked” at least 1 post in the test period and at least 5 posts in the train period. While the test set is restricted like this, no such restriction has been made on the data provided in the training set.
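The restriction described for testUsers.csv can be expressed as a simple filter over per-user like counts. This is only an illustration of the stated rule, not the organizers' actual selection code; `eligible_test_users` and its arguments are hypothetical, and the counts are made up.

```python
from typing import Dict, Set


def eligible_test_users(
    train_likes: Dict[str, int],
    test_likes: Dict[str, int],
    min_train: int = 5,
    min_test: int = 1,
) -> Set[str]:
    """Users with at least min_train training likes and min_test test likes."""
    return {
        uid
        for uid, n_train in train_likes.items()
        if n_train >= min_train and test_likes.get(uid, 0) >= min_test
    }


# Made-up like counts per user for the two periods.
train_counts = {"1": 7, "2": 5, "3": 4, "4": 9}
test_counts = {"1": 2, "2": 0, "3": 3}
eligible = eligible_test_users(train_counts, test_counts)
```

Here user "2" is excluded for having no test likes, user "3" for having fewer than 5 training likes, and user "4" for not appearing in the test period at all.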
evaluation.csv: The gold standard results for comparing test results against. Evaluation was performed using Mean Average Precision at 100.
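For reference, Mean Average Precision at 100 can be sketched as follows. The exact normalisation used by the competition is not spelled out here, so this follows a common convention: each user's average precision is divided by min(number of relevant posts, k), and duplicate predictions are counted once. The function names and the sample data are hypothetical, and the layout of evaluation.csv is not assumed.

```python
from typing import Dict, Hashable, List, Set


def average_precision_at_k(actual: Set[Hashable], predicted: List[Hashable], k: int = 100) -> float:
    """Average precision over the first k predictions for one user."""
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count a relevant item only the first time it is predicted.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank
        seen.add(item)
    return score / min(len(actual), k)


def mean_average_precision(
    actual_by_user: Dict[Hashable, Set[Hashable]],
    predicted_by_user: Dict[Hashable, List[Hashable]],
    k: int = 100,
) -> float:
    """Mean of per-user average precision; users without predictions score 0."""
    if not actual_by_user:
        return 0.0
    total = sum(
        average_precision_at_k(actual, predicted_by_user.get(uid, []), k)
        for uid, actual in actual_by_user.items()
    )
    return total / len(actual_by_user)


# Made-up example: posts each user actually liked, and ranked predictions.
gold = {"u1": {"a", "b"}, "u2": {"c"}}
predictions = {"u1": ["a", "x", "b"], "u2": ["x", "c"]}
map_score = mean_average_precision(gold, predictions)
```

In this example user u1 scores (1/1 + 2/3) / 2 = 5/6 and user u2 scores (1/2) / 1 = 1/2, so the mean is 2/3.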