TREC Temporal Summarization 2014 (TREC-TS-2014F)

This corpus is a filtered version of the full TREC StreamCorpus generated for the KBA track. It is designed for use by participants in the 2014 TREC-TS track and is not suitable for TREC KBA participants. TREC-TS-2014F aims to provide a dataset with which groups can participate in the TREC-TS track without having to process the full KBA corpus. The creation of this dataset is described below.

The TREC-TS-2014F dataset is a filtered version of the KBA 2014 corpus. It is stored in the same format, follows the same file structure (organised into per-hour folders) and is encrypted with the same GPG key (see above). To create this corpus, two levels of filtering were performed. First, any documents published outside the time periods of the 15 events from the TREC-TS 2014 track topics were removed, i.e. only documents with timestamps between the start and end tags of one or more TREC-TS 2014 topics were kept. Second, we filtered the remaining documents, keeping only those likely to contain one or more sentences relevant to an event. This filtering was performed as follows:

For each hour within the time period of each event, all documents from the KBA 2014 corpus that were published within that hour were indexed using the open source Terrier IR platform v4.0 (see terrier.org). The title of each document (if available), and any text within the body sentences were indexed. Terrier's stopword list and Porter stemming were applied.
The TREC organisers manually identified a set of queries representing the topics of interest relating to each of the 15 events, creating event-query pairs. (These will not be released until after the final submission of runs.)
For each event-query pair, Terrier was used to retrieve the top 1000 documents for the query incrementally from each hour index (for the hours belonging to the associated event). The retrieval model used was BM25 with default parameters. In this way, we aim to create a high-recall set of documents from which participants can summarise each event.
Documents that were not retrieved for one or more queries were then filtered out, forming the final TREC-TS-2014F dataset.
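
The two filtering levels can be illustrated with a minimal shell sketch. It is not the organisers' pipeline: the BM25 retrieval with Terrier is not reproduced, and the inputs are assumptions made for illustration, namely hour-folder names of the form YYYY-MM-DD-HH, event boundaries already expressed in that form, and one ranked result file per query and hour that lists one document ID per line, best first. The function names keep_hours, top_k_union and filter_docs are invented here.

```shell
# keep_hours START END
# Level 1 (assumed folder naming YYYY-MM-DD-HH): read hour-folder names on
# stdin and print those inside [START, END]. Zero-padded names sort
# chronologically, so a plain string comparison is enough.
keep_hours() {
  awk -v start="$1" -v end="$2" '($0 "") >= (start "") && ($0 "") <= (end "")'
}

# top_k_union K FILE...
# Print the union of the first K document IDs of each ranked list
# (hypothetical files: one document ID per line, best first).
top_k_union() {
  k="$1"; shift
  for f in "$@"; do head -n "$k" "$f"; done | sort -u
}

# filter_docs ALL K FILE...
# Level 2: keep the lines of ALL (one document ID per line) that appear in
# at least one top-K list; everything else is filtered out.
filter_docs() {
  all="$1"; shift
  top_k_union "$@" | grep -F -x -f - "$all"
}
```

With real data, K would be 1000 and the ranked lists would come from a BM25 run over each hour index; for example, filter_docs all_ids.txt 1000 hour1_q1.txt hour1_q2.txt, where all three file names are placeholders.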

Collection Statistics

General

Property	KBA 2014	TREC-TS-2014F
Size on Disk	~16,100 GB	559 GB
# Chunk Files	2,677,758	650,980

Streams

Stream	# KBA 2014 Files	# TREC-TS-2014F Files
CLASSIFIED	45,848	5,251
linking	12,939	1,115
FORUM	109,567	19,439
MAINSTREAM_NEWS	213,704	38,491
MEMETRACKER	5,167	115
news	280,665	101,636
REVIEW	13,761	215
social	826,251	236,767
WEBLOG	1,264,605	247,951
arxiv	11,851	0
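
To see how strongly each stream was reduced, the share of chunk files retained can be computed from the rows above. The following is a small sketch using awk; the function name retained_pct is made up for this example, and it assumes tab-separated rows with comma-grouped counts, as in the table.

```shell
# retained_pct: read "name<TAB>KBA count<TAB>TREC-TS-2014F count" rows on stdin
# and print the percentage of files kept for each stream.
retained_pct() {
  awk -F '\t' '
    # Only process rows whose two count columns are numeric (skips the header).
    $2 ~ /^[0-9,]+$/ && $3 ~ /^[0-9,]+$/ {
      gsub(/,/, "", $2); gsub(/,/, "", $3)   # drop thousands separators
      if ($2 + 0 > 0) printf "%s\t%.1f%%\n", $1, 100 * $3 / $2
    }'
}

# Two rows copied from the table above.
printf 'news\t280,665\t101,636\narxiv\t11,851\t0\n' | retained_pct
```

For the two rows shown, this prints 36.2% for news and 0.0% for arxiv.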

Downloading the corpus

As with the KBA 2014 corpus, a free version of TREC-TS-2014F is hosted by Amazon Web Services. It is now available and can be browsed at streamcorpus-2014-v0_3_0-ts-filtered

The list of all file paths is in streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz

Linux users can download the corpus by running the following commands in a bash terminal:

wget http://s3.amazonaws.com/aws-publicdatasets/trec/ts/streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz
xz --decompress streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz
# Convert each listed path to an HTTP URL and fetch 10 files at a time (requires GNU parallel).
cut -d ':' -f3 streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt \
  | sed 's|//|://s3.amazonaws.com/|g' \
  | parallel -j 10 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http{}'
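
Before starting a long download, it can help to preview the URLs that the pipeline above would request. The sketch below wraps the same cut and sed steps in a function, to_url (a name chosen here for illustration), and adds the http prefix that the parallel template supplies. The sample line is hypothetical: it assumes each line of the paths file has three colon-separated fields with the third starting with //, which is what the cut -d ':' -f3 step implies.

```shell
# to_url: read path-list lines on stdin and print the HTTP URL wget would fetch.
to_url() {
  cut -d ':' -f3 \
    | sed 's|//|://s3.amazonaws.com/|g' \
    | sed 's|^|http|'
}

# Hypothetical sample line (timestamp, size and file name are made up).
sample='2014-01-01 00:00 123 s3://aws-publicdatasets/trec/ts/streamcorpus-2014-v0_3_0-ts-filtered/2012-02-22-12/example.sc.xz.gpg'
printf '%s\n' "$sample" | to_url
```

After decompressing the list, piping its first few lines through to_url (for example with head -n 3) gives a quick check that the URLs look sensible before wget is run in parallel.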