TREC Temporal Summarization 2014 (TREC-TS-2014F)

This corpus is a filtered version of the full TREC StreamCorpus generated for the KBA track. It is designed for use by participants of the 2014 TREC-TS track; this version is not suitable for TREC KBA participants. TREC-TS-2014F aims to provide a dataset with which groups can participate in the TREC-TS track without having to process the full KBA corpus. The creation of this dataset is described below.

The TREC-TS-2014F dataset is a filtered version of the KBA 2014 corpus. It is stored in the same format, follows the same file structure (ordered into per-hour folders) and is encrypted with the same GPG key (see above). To create this corpus, two levels of filtering were performed. First, any documents that were published outside the time periods of the 15 events from the TREC-TS 2014 track topics were removed, i.e. only documents with timestamps between the start and end tags of one or more TREC-TS 2014 topics were kept. Second, we filtered the remaining documents, keeping only those likely to contain one or more sentences relevant to an event. This filtering was performed as follows:

  1. For each hour within the time period of each event, all documents from the KBA 2014 corpus that were published within that hour were indexed using the open source Terrier IR platform v4.0 (see terrier.org). The title of each document (if available), and any text within the body sentences were indexed. Terrier's stopword list and Porter stemming were applied.
  2. The TREC organisers manually identified a set of queries representing the topics of interest relating to each of the 15 events, creating event-query pairs. (These will not be released until after the final submission of runs.)
  3. For each event-query pair, Terrier was used to retrieve the top 1000 documents for the query from each hour index in turn (for the hours belonging to the associated event). The retrieval model used was BM25 with default parameters. In this way, we aim to create a high-recall set of documents from which participants can summarise each event.
  4. Documents that were not retrieved for one or more queries were then filtered out, forming the final TREC-TS-2014F dataset.
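
The two filtering levels described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to build the corpus: the actual filtering was done with Terrier, and the function names, the data layout (one dictionary of tokenised documents per hour) and the BM25 constants k1=1.2 and b=0.75 are assumptions made here for illustration. Terrier's own BM25 implementation, its stopword removal and its Porter stemming are not reproduced.

```python
import math
from collections import Counter


def in_any_event_window(timestamp, events):
    """First-level filter: is the timestamp inside at least one (start, end) window?"""
    return any(start <= timestamp <= end for start, end in events)


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenised documents against a query with a textbook BM25 variant.

    docs maps a document id to its list of tokens. k1 and b are assumed
    values, not necessarily the defaults Terrier uses.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    if avg_len == 0:
        return {}

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


def filter_corpus(hour_indexes, queries, top_k=1000):
    """Second-level filter: union of the top_k documents per query and per hour."""
    kept = set()
    for docs in hour_indexes.values():
        if not docs:
            continue
        for query in queries:
            scores = bm25_scores(query, docs)
            ranked = sorted(scores, key=scores.get, reverse=True)
            kept.update(ranked[:top_k])
    return kept
```

Any document that never appears in a top-k list for any query and hour is absent from the returned set, which corresponds to step 4. Because the cut-off is applied per hour rather than over the whole event, busy hours cannot crowd out documents from quiet ones.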

Collection Statistics

General

Property         KBA 2014      TREC-TS-2014F
Size on Disk     ~16,100 GB    559 GB
# Chunk Files    2,677,758     650,980

Streams

Stream             # KBA 2014 Files    # TREC-TS-2014F Files
CLASSIFIED         45,848              5,251
linking            12,939              1,115
FORUM              109,567             19,439
MAINSTREAM_NEWS    213,704             38,491
MEMETRACKER        5,167               115
news               280,665             101,636
REVIEW             13,761              215
social             826,251             236,767
WEBLOG             1,264,605           247,951
arxiv              11,851              0
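
To see how strongly each stream was reduced by the filtering, the per-stream counts can be turned into retention percentages. The short Python sketch below simply copies the figures from the Streams table; the function name `retention_percent` is illustrative and not part of any released tooling.

```python
# (KBA 2014 files, TREC-TS-2014F files) per stream, copied from the table.
STREAMS = {
    "CLASSIFIED": (45848, 5251),
    "linking": (12939, 1115),
    "FORUM": (109567, 19439),
    "MAINSTREAM_NEWS": (213704, 38491),
    "MEMETRACKER": (5167, 115),
    "news": (280665, 101636),
    "REVIEW": (13761, 215),
    "social": (826251, 236767),
    "WEBLOG": (1264605, 247951),
    "arxiv": (11851, 0),
}


def retention_percent(kba_files, ts_files):
    """Percentage of a stream's KBA 2014 chunk files kept in TREC-TS-2014F."""
    return 100.0 * ts_files / kba_files


# Total number of TREC-TS-2014F chunk files across all streams.
total_ts_files = sum(ts for _, ts in STREAMS.values())

if __name__ == "__main__":
    for name, (kba, ts) in STREAMS.items():
        print(f"{name:<16} {retention_percent(kba, ts):5.1f}%")
    print(f"Total TREC-TS-2014F files: {total_ts_files}")
```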

Downloading the corpus

As with the KBA 2014 corpus, a free version of TREC-TS-2014F is hosted by Amazon Web Services. It is now available, and you can browse it here: streamcorpus-2014-v0_3_0-ts-filtered

The list of all file paths is available as streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz

Linux users can download the corpus by running the following sequence of commands in a bash terminal:

  1. wget http://s3.amazonaws.com/aws-publicdatasets/trec/ts/streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz;
  2. xz --decompress streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz;
  3. cat streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt | cut -d ':' -f3 | sed 's/\/\//:\/\/s3.amazonaws.com\//g' | parallel -j 10 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http{}';
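
For readers who find the third command hard to follow, the path-to-URL rewriting done by its cut and sed stages can be mirrored in Python as below. This is a sketch for explanation only: the function name `path_line_to_url` is hypothetical, and the sample line in the usage example is an invented illustration, since the exact layout of the paths file is not described here. Unlike cut, the sketch skips lines that have fewer than three colon-separated fields.

```python
def path_line_to_url(line):
    """Mirror `cut -d ':' -f3 | sed 's/\\/\\//:\\/\\/s3.amazonaws.com\\//g'` plus the `http` prefix.

    Returns None for lines with fewer than three colon-separated fields
    (a simplification; cut itself would emit an empty or unchanged line).
    """
    fields = line.rstrip("\n").split(":")
    if len(fields) < 3:
        return None
    # cut keeps only the third field; sed then rewrites every "//" in it.
    third_field = fields[2]
    return "http" + third_field.replace("//", "://s3.amazonaws.com/")


def urls_from_paths_file(path):
    """Yield one download URL per usable line of the decompressed paths file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            url = path_line_to_url(line)
            if url is not None:
                yield url
```

As a usage example, a hypothetical line such as `2014-05-01 12:34 1024 s3://aws-publicdatasets/trec/ts/example.sc.xz.gpg` would be mapped to `http://s3.amazonaws.com/aws-publicdatasets/trec/ts/example.sc.xz.gpg`, which is the form of URL that wget is given in the command above.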