This corpus is a filtered version of the full TREC StreamCorpus generated for the KBA track. It is designed for participants in the 2014 TREC-TS track and is not suitable for TREC KBA participants. TREC-TS-2014F aims to provide a dataset with which groups can participate in the TREC-TS track without having to process the full KBA corpus. The creation of this dataset is described below.
The TREC-TS-2014F dataset is a filtered version of the KBA 2014 corpus. It is stored in the same format, follows the same file structure (ordered into per-hour folders) and is encrypted with the same GPG key (see above). To create this corpus, two levels of filtering were performed. First, any documents published outside the time periods of the 15 events from the TREC-TS 2014 track topics were removed, i.e. only documents with timestamps between the start and end tags of one or more TREC-TS 2014 topics were kept. Second, we filtered the remaining documents, keeping only those likely to contain one or more sentences relevant to an event. This filtering was performed as follows:
Property | KBA 2014 | TREC-TS-2014F |
---|---|---|
Size on Disk | ~16,100 GB | 559 GB |
# Chunk Files | 2,677,758 | 650,980 |
Stream | # KBA 2014 Files | # TREC-TS-2014F Files |
---|---|---|
CLASSIFIED | 45,848 | 5,251 |
linking | 12,939 | 1,115 |
FORUM | 109,567 | 19,439 |
MAINSTREAM_NEWS | 213,704 | 38,491 |
MEMETRACKER | 5,167 | 115 |
news | 280,665 | 101,636 |
REVIEW | 13,761 | 215 |
social | 826,251 | 236,767 |
WEBLOG | 1,264,605 | 247,951 |
arxiv | 11,851 | 0 |
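To make the per-stream reduction easier to compare, the counts in the table above can be turned into retained fractions. The following is a minimal sketch that only restates the table's numbers; the names `FILE_COUNTS` and `retained_fraction` are illustrative and not part of any corpus tooling.

```python
# File counts per stream, copied from the table above: (KBA 2014, TREC-TS-2014F).
FILE_COUNTS = {
    "CLASSIFIED": (45_848, 5_251),
    "linking": (12_939, 1_115),
    "FORUM": (109_567, 19_439),
    "MAINSTREAM_NEWS": (213_704, 38_491),
    "MEMETRACKER": (5_167, 115),
    "news": (280_665, 101_636),
    "REVIEW": (13_761, 215),
    "social": (826_251, 236_767),
    "WEBLOG": (1_264_605, 247_951),
    "arxiv": (11_851, 0),
}


def retained_fraction(stream: str) -> float:
    """Fraction of a stream's KBA 2014 chunk files kept in TREC-TS-2014F."""
    kba_files, ts_files = FILE_COUNTS[stream]
    return ts_files / kba_files


if __name__ == "__main__":
    # Print streams from most to least retained.
    for stream in sorted(FILE_COUNTS, key=retained_fraction, reverse=True):
        print(f"{stream:<16} {retained_fraction(stream):6.1%}")
    total_ts = sum(ts for _, ts in FILE_COUNTS.values())
    print(f"{'total (TS)':<16} {total_ts:,}")
```

The TREC-TS-2014F column sums to 650,980, which matches the chunk file count given in the first table.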
As with the KBA 2014 corpus, a free version of TREC-TS-2014F is hosted by Amazon Web Services. It is now available, and you can browse it here: streamcorpus-2014-v0_3_0-ts-filtered
The list of all file paths is provided in the file streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz
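Once the path list has been downloaded, it can be inspected without decompressing it to disk. The sketch below counts chunk files per stream using Python's standard `lzma` module. It rests on an assumption that is not confirmed here: that each line of the list is a path whose file name begins with the stream name followed by a hyphen (for example, a hypothetical `2012-02-03-04/news-1-abc.sc.xz.gpg`). The function names are illustrative only.

```python
import lzma
from collections import Counter
from pathlib import PurePosixPath


def stream_of(path: str) -> str:
    """Return the stream name, ASSUMED to be the file-name prefix before the first '-'."""
    return PurePosixPath(path.strip()).name.split("-", 1)[0]


def count_streams(lines) -> Counter:
    """Count chunk files per stream over an iterable of path strings, skipping blank lines."""
    return Counter(stream_of(line) for line in lines if line.strip())


def count_streams_in_xz(list_file: str) -> Counter:
    """Read an xz-compressed path list (one path per line) and count files per stream."""
    with lzma.open(list_file, "rt", encoding="utf-8") as handle:
        return count_streams(handle)
```

For example, `count_streams_in_xz("streamcorpus-2014-v0_3_0-ts-filtered.s3-paths.txt.xz")` would return a `Counter` that, if the naming assumption holds, could be compared against the per-stream file counts in the table above.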
Linux users can download the corpus by running the following command sequence in a bash terminal: