Mirror Data
This page contains links to the data used in [LiMoRo04a] and [LiMoRo04b]. There are three files:
- entities.sort: a list of entities from the corpus, sorted by label. Each line in the file is a record for a single entity. The format is: Document_ID:sentence_ID:word_ID:name:NEType:EntityLabel:Nomeaning. The NEType field takes the value '0' for people, '1' for locations, and '2' for organizations. The Nomeaning field always has the value '1'.
- NYT_txt.tar.gz: a tarball of the raw document texts.
- NYT_column.tar.gz: a tarball of the column-formatted texts. NOTE: this tarball will untar into the current working directory. The columns are separated by tabs. The important columns, by zero-based index, are:
  - 0: word number
  - 1: sentence number
  - 2: text
  - 6: named-entity tag
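
The two line formats described above can be read with a few lines of Python. The following is a minimal sketch, not code distributed with the data: the names `EntityRecord`, `Token`, `parse_entity_line`, and `parse_column_line` are hypothetical, all fields are kept as strings because the page does not state their types, and the handling of a colon inside the name field is a precaution rather than a documented property of the data.

```python
from typing import NamedTuple, Optional

# NEType codes as described for entities.sort.
NE_TYPES = {"0": "person", "1": "location", "2": "organization"}


class EntityRecord(NamedTuple):
    document_id: str
    sentence_id: str
    word_id: str
    name: str
    ne_type: str
    entity_label: str
    nomeaning: str


class Token(NamedTuple):
    word_number: str
    sentence_number: str
    text: str
    ne_tag: str


def parse_entity_line(line: str) -> EntityRecord:
    """Parse one record of entities.sort.

    Format: Document_ID:sentence_ID:word_ID:name:NEType:EntityLabel:Nomeaning
    """
    # Take the three leading IDs from the left and the three trailing fields
    # from the right. Assumption: if a name ever contains a colon, this keeps
    # the other fields aligned. The IDs themselves are assumed colon-free.
    document_id, sentence_id, word_id, rest = line.rstrip("\n").split(":", 3)
    name, ne_code, entity_label, nomeaning = rest.rsplit(":", 3)
    return EntityRecord(
        document_id,
        sentence_id,
        word_id,
        name,
        NE_TYPES.get(ne_code, ne_code),  # unknown codes are passed through
        entity_label,
        nomeaning,
    )


def parse_column_line(line: str) -> Optional[Token]:
    """Parse one tab-separated line of a column-formatted text.

    Returns None for lines with fewer than seven columns (for example blank
    lines), since column 6 would be missing. Whether such lines occur in the
    files is an assumption.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 7:
        return None
    return Token(cols[0], cols[1], cols[2], cols[6])
```

Columns 3 to 5 are ignored here because the page does not describe them.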