NLP Shared Corpora Resource

This interface allows University of Illinois researchers to download copyrighted corpora for which the University has a license. At present, the collection is limited mainly to text and speech transcript corpora.

Enter values in the fields below and click 'submit' to list the corpora that match the specified criteria. The 'Corpus directory' link for each result leads to the root directory of that corpus. Each corpus is stored as a directory tree, which can be downloaded recursively using wget.
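
As a rough illustration of the wget download mentioned above, the following Python sketch assembles one plausible recursive wget command for a corpus directory. The URL, the destination directory, and the helper name build_wget_command are hypothetical placeholders, not part of this resource. The sketch also assumes no extra authentication options are needed, which may not hold for licensed corpora.

```python
import shlex


def build_wget_command(corpus_url, dest_dir="."):
    """Return a wget argument list that mirrors one corpus directory.

    Hypothetical helper for illustration; adjust the options to your needs.
    """
    # A trailing slash tells wget the URL is a directory, so that
    # --no-parent confines the download to that directory and below.
    if not corpus_url.endswith("/"):
        corpus_url += "/"
    return [
        "wget",
        "--recursive",              # follow links into subdirectories
        "--no-parent",              # never ascend above the corpus root
        "--no-host-directories",    # do not create a directory named after the host
        "--reject", "index.html*",  # skip auto-generated directory listings
        "--directory-prefix", dest_dir,
        corpus_url,
    ]


if __name__ == "__main__":
    # Placeholder URL: substitute the address behind a 'Corpus directory' link.
    cmd = build_wget_command("https://example.edu/corpora/some-corpus", "corpora")
    # Print the command for review; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Printing the command before running it makes it easy to check the target URL and options first. Note that wget's default recursion depth is five levels, so a deeply nested corpus may also need the --level option.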

If you have corpora you would like to make available to the wider University of Illinois research community, or if you find errors or omissions in the data here, please contact Mark Sammons at mssammon at illinois dot edu.



Select the type and language of the corpora you would like to access:

Type

Language