Information on WISE 2012 Challenge

The WISE 2012 Challenge was based on a dataset collected from one of the most popular micro-blogging services, Sina Weibo (http://weibo.com). The challenge had two tracks: 1) the performance track, and 2) the mining track. The official webpage of the challenge is at:

http://www.wise2012.cs.ucy.ac.cy/challenge.html

The challenge is now over (a summary report is here, shared via wuala.com). This page is maintained to provide further information on the challenge and the dataset used in it.

1. The dataset

The original data was crawled from Sina Weibo (http://weibo.com), a popular micro-blogging service in China, via the API it provides. The dataset distributed in the WISE 2012 Challenge was preprocessed as follows:

  1. User IDs and message IDs are anonymized.
  2. The content of tweets is removed, in accordance with Sina Weibo’s Terms of Service.
  3. Some tweets are annotated with events. For each event, the terms used to identify the event and a link to a Wikipedia (http://wikipedia.org) page describing the event are provided. The event information is given in the file events.txt.

The dataset used in both tracks contains two sets of files:

  1. Tweets: It includes basic information about tweets (time, user ID, message ID, etc.), mentions (user IDs appearing in tweets), re-tweet paths, and whether a tweet contains links.
  2. Followship network: It includes the follow relationships between users (based on user IDs).

In addition, a small testing dataset to be used in the mining track is provided. It contains one file, in the same format as the tweets files introduced above. The testing file gives a small portion of the re-tweeting activities of thirty-three tweets belonging to six events.

It should be noted that the dataset is not complete; it is only a sample of the whole data in the micro-blogging service.

The details of dataset format are given in Appendix 1: Data format.

2. The performance track (T1)

Attendees are required to build a system for evaluating queries over the dataset. Nineteen typical queries must be covered, and the corresponding interfaces in the BSMA performance testing tool (a modified version of YCSB) must be implemented. The goal is to achieve low response time and high throughput as reported by the BSMA performance testing tool.

The typical queries are introduced in Appendix 2: T1: Queries.

The BSMA performance testing tool manual is given in Appendix 3: T1: BSMA performance testing tool manual.

3. The mining track (T2)

In T2, attendees are required to predict the re-tweeting activities of thirty-three tweets belonging to six events. For each of these six events, only tweets (and re-tweets) posted before a given timestamp are included in the Tweets files. The thirty-three tweets are given in the Tests file; for each of them, the event it belongs to is indicated. As in Tweets, only re-tweeting information from before the timestamp is given. Attendees are required to predict two measurements at the time 30 days after the original tweet is published. These two measurements are:

  1. M1: The number of times the original tweet is re-tweeted. If a user re-tweets (also called re-posting or forwarding) a tweet twice at different timestamps, it is counted twice.
  2. M2: The number of possible views of the original tweet. The possible-view count of a single re-tweet action is defined as the number of followers of the user who performs the re-tweet. The possible-view count of a tweet is defined as the sum of the possible-view counts of all of its re-tweet actions.

It should be noted that all re-tweeting actions in a re-tweeting chain are counted towards the root of the chain.
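The two measurements above can be sketched as follows. This is a minimal illustration, not the official evaluation code; the list-of-user-IDs input shape is an assumption (one entry per re-tweet action, so a user who re-tweets twice appears twice, per the M1 rule, and every action in a chain is attributed to the root tweet).

```python
def retweet_measures(retweet_user_ids, follower_count):
    """Compute the two T2 measurements for one original tweet.

    retweet_user_ids: one entry per re-tweet action within the 30-day
        window (a user who re-tweets twice appears twice).
    follower_count: mapping from user ID to that user's number of followers.
    """
    m1 = len(retweet_user_ids)  # M1: every re-tweet action counts
    # M2: each re-tweet action contributes the re-tweeter's follower count
    m2 = sum(follower_count[uid] for uid in retweet_user_ids)
    return m1, m2
```

For example, `retweet_measures(["u1", "u2", "u1"], {"u1": 10, "u2": 5})` returns `(3, 25)`: three re-tweet actions, and 10 + 5 + 10 = 25 possible views.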

The ground truth, i.e., details on the re-tweets of those thirty-three tweets, is given in Appendix 4: results.txt.

4. Shared data and materials

The dataset is shared on Wuala.com (you may need the Wuala client to download the shared files).

  • Appendix 1: Data format: A1.txt
  • Appendix 2: T1: Queries: A2.pdf
  • Appendix 3: T1: BSMA performance testing tool manual: A3.pdf
  • Appendix 4: T2: Ground truth for testing: A4_T2GTruth.zip
  • Tweets: in twelve compressed files (please note that these files are quite large and may take a long time to download):
    • finalmicroblogs.zip.001 (1038090240 bytes)
      md5: 92E7D35F90EA8B2D2C142B0F7C214C09
    • finalmicroblogs.zip.002 (1038090240 bytes)
      md5: 35C688228B0929A961D4DB510936ABAB
    • finalmicroblogs.zip.003 (1038090240 bytes)
      md5: 033A8E30E8B05CB086679F64B3B43B00
    • finalmicroblogs.zip.004 (1038090240 bytes)
      md5: FE153B0786341A8059D3DCE2601CA2E1
    • finalmicroblogs.zip.005 (1038090240 bytes)
      md5: F823EE2C2B9C0FF2375E613B177A583D
    • finalmicroblogs.zip.006 (1038090240 bytes)
      md5: 8826C942344E468F2997E467624D407D
    • finalmicroblogs.zip.007 (1038090240 bytes)
      md5: 41DB57B998230435931BFA315F54E711
    • finalmicroblogs.zip.008 (1038090240 bytes)
      md5: 396995C04412EFC8DD3B0469045F8C58
    • finalmicroblogs.zip.009 (1038090240 bytes)
      md5: 6BDAB3F60C99349E355C4A6D62AD6D83
    • finalmicroblogs.zip.010 (1038090240 bytes)
      md5: DF6B13AB8F3A6E0BC372AEA104F587AE
    • finalmicroblogs.zip.011 (1038090240 bytes)
      md5: 80836DD1636B5D12C53EC803CE8E2C25
    • finalmicroblogs.zip.012 (1026428116 bytes)
      md5: 34A90C8B4FD796CDFD35862E278BD090
  • Followships: in three compressed zip files (please note that these files are quite large and may take a long time to download):
    • socialnetwork.zip.001 (1038090240 bytes)
      md5: 789A5C4D182766ED42241B569AFD60FD 
    • socialnetwork.zip.002 (1038090240 bytes)
      md5: 149399C4CC17A4A9E2866183D93B24CC
    • socialnetwork.zip.003 (1024559892 bytes)
      md5: D427F3BB268AA6552BDF34918FEEBA19
  • Events: events.txt
  • Testing:

    eventForTest.zip

  • BSMA performance testing tool (please run: patch -p1 < bsma20120321.patch to correct an error in Q8)

    • BSMA.zip
    • bsma20120321.patch
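Since the archives are large and split into parts, it is worth verifying each part against the published MD5 sums before extracting. A small sketch (the filenames and digests are copied from the list above; extend the table to all parts you download):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the uppercase MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# Expected digests for the first two tweet archives, from the list above.
EXPECTED = {
    "finalmicroblogs.zip.001": "92E7D35F90EA8B2D2C142B0F7C214C09",
    "finalmicroblogs.zip.002": "35C688228B0929A961D4DB510936ABAB",
}

def verify(expected=EXPECTED):
    for name, digest in expected.items():
        status = "OK" if file_md5(name) == digest else "corrupted, re-download"
        print(f"{name}: {status}")
```

Reading in 1 MiB chunks keeps memory use constant even for the ~1 GB parts.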

5. Notes

  • There are tweets with duplicate MIDs but different values in other fields. All of these records were returned by the Sina Weibo API, and there is no clue as to which record is correct. Attendees should handle these duplicate MIDs themselves.
  • There are missing events, of two types:
    • Events for which our auto-annotation system could not identify any corresponding tweets:
      • Chinese pro-democracy protests
      • Jiang Zemin disappearance and death rumor
    • Events labeled with different names in events.txt:
      • “Yao Ming retirement” and “Yao Ming retire” are actually the same event.
      • “Motorola was purchased by Google” and “Motorola was acquisitions by Google” are actually the same event.
      • “iphone4S release” and “iphone4s release” are actually the same event.
  • There are event labels in tweets that do not appear in events.txt; these are events for which no Wikipedia link is provided. Attendees may omit them. There are also keyword labels not listed in events.txt; these are keywords related to the above events, likewise without Wikipedia links.
  • Event names and keywords are case insensitive.
  • Other information related to the dataset can be found here.
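Two of the notes above (duplicate MIDs, and event labels that differ only by name or case) are easy to handle with a small preprocessing pass. The sketch below is one possible policy, not a prescribed one: the "mid" field name and dict-record shape are assumptions about how you load the files, "first record wins" is an arbitrary tie-break for duplicate MIDs, and the alias table covers only the merges listed above.

```python
def dedup_by_mid(records, id_field="mid"):
    """Keep the first record seen for each message ID.

    'First wins' is just one policy; the challenge leaves the choice
    of which duplicate to keep to the participants.
    """
    seen = set()
    kept = []
    for rec in records:
        if rec[id_field] not in seen:
            seen.add(rec[id_field])
            kept.append(rec)
    return kept

# Event names and keywords are case insensitive, and some events appear
# under several labels, so normalize before grouping by event.
ALIASES = {
    "yao ming retire": "yao ming retirement",
    "motorola was acquisitions by google": "motorola was purchased by google",
}

def normalize_event(label):
    label = label.lower()          # handles e.g. "iphone4S" vs "iphone4s"
    return ALIASES.get(label, label)
```

Lower-casing alone collapses the “iphone4S release” / “iphone4s release” pair; the alias table handles the two events whose labels differ by more than case.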

6. Winners

  • Championship on T1: Throughput and Latency
    Ze Tang, Heng Lin, Kaiwei Li, and Wentao Han
    Department of Computer Science and Technology, Tsinghua University, China
  • Championship on T1: Scalability
    Edans F.O. Sandes, Li Weigang, and Alba C. M. A. de Melo
    University of Brasilia, Brasilia, Brazil
  • Championship on T2
    Sayan Unankard, Ling Chen, Peng Li, Sen Wang, Zi Huang, Mohamed Sharaf, and Xue Li
    School of Information Technology and Electrical Engineering, The University of Queensland, Australia
  • Runner-Up on T2
    Zhilin Luo, Yue Wang, and Xintao Wu
    University of North Carolina at Charlotte, USA

The full ranking is reported on the official WISE 2012 Challenge page.

7. Contact

For any further questions or comments, please contact me at:
wise2012challenge AT gmail DOT com
