Corpora and other Language and Speech Data under DICE

Information about NLP and speech software can be found here.
Conference and workshop proceedings can be found here.

General Information

We have adopted a new, more systematic directory structure for corpora (and other language and speech data) under DICE. All corpora reside under /group/corpora, which has two subdirectories. The subdirectory a corpus is in depends on its licensing agreement:

/group/corpora/public  corpora with University-wide licenses
/group/corpora/restricted  corpora with more restrictive licenses

For corpora with restrictive licenses, read access is limited to certain groups of users. The group names are specified in the list of restricted corpora below. If you need access to any of these corpora, please email corpus-admin@inf.ed.ac.uk.
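Before emailing, you can check whether your account can already read a given corpus directory. The following is a minimal sketch in Python, assuming it is run on a DICE machine where /group/corpora is mounted; the helper name `can_read` is hypothetical and not part of any DICE tool.

```python
import os

CORPUS_ROOT = "/group/corpora"  # root of the corpus space, as described above


def can_read(relative_dir, root=CORPUS_ROOT):
    """Return True if the current user can list and enter the corpus directory.

    `relative_dir` is a path relative to the corpus root, e.g. "public/bnc".
    """
    path = os.path.join(root, relative_dir)
    # A directory needs read permission (to list it) and execute
    # permission (to enter it) before its files can be used.
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)
```

Calling `can_read("public/bnc")` should return True for Informatics users. For a directory under restricted/, it is expected to return False until your account has been added to the relevant group.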

Directory Structure

Corpora that exist in more than one version live in the same directory (the name of which is in all lowercase and identifies the corpus by name or acronym). Subdirectories identify different versions of a corpus, including annotated (or otherwise processed) versions. When a corpus has annotated or processed versions, the unmodified version sits under `original' (or under its version number, as for the BNC below).

Examples:

/group/corpora/public/bnc/1.0 BNC, Version 1.0, unmodified
/group/corpora/public/bnc/2.0 BNC, Version 2.0, unmodified
/group/corpora/public/bnc/parsed_ims BNC, parsed with IMS parser
/group/corpora/public/bnc/parsed_minipar BNC, parsed with Minipar
/group/corpora/public/bllip/original BLLIP, unmodified
/group/corpora/public/bllip/parsed_minipar BLLIP, parsed with Minipar
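
The layout above can be explored programmatically. The following is a minimal sketch in Python, assuming the corpus space is mounted; the function names are hypothetical. The rule used to recognise unmodified data (a subdirectory called `original` or one starting with a digit) is an assumption drawn from the examples above, not a documented convention.

```python
import os


def list_versions(corpus, root="/group/corpora/public"):
    """Return the sorted names of the version subdirectories of a corpus."""
    # Corpus directory names are all lowercase (e.g. "bnc", "bllip").
    corpus_dir = os.path.join(root, corpus.lower())
    return sorted(
        entry for entry in os.listdir(corpus_dir)
        if os.path.isdir(os.path.join(corpus_dir, entry))
    )


def unmodified_versions(versions):
    """Pick the entries that look like unmodified data.

    Heuristic (an assumption): `original`, or a name starting with a
    digit such as "1.0" or "2.0".
    """
    return [v for v in versions if v == "original" or v[0].isdigit()]
```

With the example directories listed above, `list_versions("bnc")` would include `1.0`, `2.0`, `parsed_ims` and `parsed_minipar`, and `unmodified_versions` would keep only the first two.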

Ordering Corpora

At the end of this page, you will find a list of the corpora that are installed in the DICE corpus space. We have licenses for some other corpora that are currently not installed (often due to space or licensing restrictions).

If you would like to find out whether we have a corpus that you need for your work, or if you would like to order new corpora, please email corpus-admin@inf.ed.ac.uk.

LDC Corpora

The University has been a member of the Linguistic Data Consortium (LDC) for the following years: 1995, 1996, and 1998-2005. This means that we are entitled to a reduced rate for all corpora released by the LDC during these years. Please have a look at the LDC web site for a list of available corpora.

As of 2005, the University has a subscription membership for the LDC. This means that we automatically get two copies of all new corpora released by the LDC in 2005 and subsequent years. Note, however, that not all LDC corpora are being installed automatically in the corpus space (due to constraints on disk space). If you want a new LDC corpus to be installed, please email corpus-admin@inf.ed.ac.uk.

If you are a corpus administrator, and you have an LDC membership account, please follow this link to find out more details about our LDC membership.

Conference Proceedings

We also maintain an archive of conference and workshop proceedings. These can be found at /group/corpora/public/proceedings in the corpus space. There is also a list of proceedings with a web interface for easy access at this link.

List of Corpora

All paths are relative to /group/corpora.
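
Each entry in the list is a block of `field: value` lines (name, directory, type, size, licenser, licensee, webpage, plus group for restricted corpora). If you keep a plain-text copy of the list, it can be parsed into records. The following is a minimal sketch in Python; the function names are hypothetical, and the conversion of 1 GB to 1024 MB is an assumption, since the page does not say which convention its sizes use.

```python
import re

CORPUS_ROOT = "/group/corpora"  # all `directory:` values are relative to this
FIELDS = {"name", "directory", "group", "type", "size",
          "licenser", "licensee", "webpage"}


def parse_entries(text):
    """Parse `field: value` lines into a list of dicts, one per corpus."""
    entries, current = [], {}
    for line in text.splitlines():
        # Split at the first colon only, so names containing ":" survive.
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            continue  # blank lines, headings and prose carry no field
        if key == "name" and current:
            entries.append(current)  # a new `name:` line starts the next entry
            current = {}
        current[key] = value.strip()
    if current:
        entries.append(current)
    return entries


def absolute_dir(entry):
    """Resolve an entry's relative directory against the corpus root."""
    return CORPUS_ROOT + "/" + entry["directory"].strip("/")


def size_in_mb(entry):
    """Return the size in MB, or None if the field has no MB/GB figure."""
    match = re.match(r"([\d.]+)\s*(MB|GB)", entry.get("size", ""))
    if not match:
        return None
    number, unit = float(match.group(1)), match.group(2)
    return number * 1024 if unit == "GB" else number  # assumes 1 GB = 1024 MB
```

Starting a new record at every `name:` line, rather than at blank lines, keeps the parser working even where two entries are not separated by a blank line.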

Public Corpora

These are corpora that are licensed to the University of Edinburgh, or to the School of Informatics, or are in the public domain. Access is open to all Informatics users.

name:      ACE 2004, Multilingual Training Corpus
directory: public/ace/ace_mtc/2004
type:      text
size:      34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2004, Time Normalization English Training Data
directory: public/ace/ace_tern
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2005, English SpatialML Annotations
directory: public/ace/ace_spatial
type:      text
size:      23 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2005, Multilingual Training Corpus
directory: public/ace/ace_mtc/2005
type:      text
size:      1.617 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE-2, Version 1.0
directory: public/ace/ace2
type:      text
size:      34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ATIS3 (Air Travel Information Service), NIST Speech Discs 17-1.1 - 17-3.1
directory: public/atis3
type:      speech
size:      1.3 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      An English Dictionary of the Tamil Verb
directory: public/tamil_dictionary
type:      text
size:      0.52 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Broadcast News Transcripts
directory: public/arabic_broadcast_news_transcripts
type:      text
size:      3.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Translation Corpus, Part 1
directory: public/arabic_translation/part1
type:      text
size:      2.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Translation Corpus, Part 2
directory: public/arabic_translation/part2
type:      text
size:      3.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Treebank, Part 1, Version 2.0
directory: public/arabic_treebank/part1v2.0
type:      text
size:      266 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Treebank, Part 1, Version 2.0, English Translation
directory: public/arabic_treebank/part1v2.0/translation
type:      text
size:      0.27 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Treebank, Part 3, Version 2.0
directory: public/arabic_treebank/part3v2.0
type:      text
size:      891 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Aurora Noisy TI Digits Database, Version 2.0
directory: public/aurora
type:      speech
size:      2.629 GB
licenser:  ELRA
licensee:  UoE
webpage:   here

name:      BBN IE/NE-tagged HUB-4 Training Transcripts
directory: public/bbn_ie_ne_tagged
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BBN Pronoun Coreference and Entity Type Corpus
directory: public/bbn_pronoun_coref
type:      text
size:      22 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus
directory: public/bllip/original
type:      text
size:      172 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus, parsed with Minipar
directory: public/bllip/parsed_minipar
type:      text
size:      293 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus, text extracted
directory: public/bllip/text
type:      text
size:      290 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Basic Electricity and Electronics Corpus
directory: public/bee
type:      text
size:      2 MB
licenser:  University of Pittsburgh
licensee:  freely available
webpage:   here

name:      Biomedical Information Extraction Corpus
directory: public/biomedical_ie
type:      text
size:      320 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Blog 06 Test Collection
directory: public/blogs_collection
type:      text
size:      25 GB
licenser:  University of Glasgow
licensee:  ICCS/HCRC
webpage:   here

name:      Boston University Radio Speech Corpus
directory: public/bu_radio
type:      speech
size:      2.424 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      British National Corpus, Version 1.0
directory: public/bnc/1.0
type:      text
size:      2.866 GB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, marked up in XML
directory: public/bnc/xml
type:      text
size:      815 MB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with Charniak parser
directory: public/bnc/parsed_charniak
type:      text
size:      419 MB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with IMS parser
directory: public/bnc/parsed_ims
type:      text
size:      2.088 GB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with Minipar
directory: public/bnc/parsed_minipar
type:      text
size:      448 MB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with RASP parser
directory: public/bnc/parsed_rasp
type:      text
size:      3.520 GB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, raw text without any markup
directory: public/bnc/text
type:      text
size:      579 MB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, various LTG data
directory: public/bnc/data
type:      text
size:      7 MB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 2.0 (World Edition)
directory: public/bnc/2.0
type:      text
size:      1.779 GB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 2.0 (World Edition), indexed for IMS Corpus Workbench
directory: public/bnc/corpus_workbench
type:      text
size:      967 MB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      British National Corpus, Version 3.0 (XML Edition)
directory: public/bnc/3.0
type:      text
size:      4.619 GB
licenser:  BNC Consortium
licensee:  ICCS/HCRC
webpage:   here

name:      Buckwalter Arabic Morphological Analyzer Version 2.0
directory: public/buckwalter
type:      lexicon
size:      4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      CCGbank, Version 1.1
directory: public/ccgbank
type:      text
size:      387 MB
licenser:  LDC
licensee:  UoE
webpage:   CCG group home page, LDC catalog entry

name:      CELEX Lexical Database, Version 2.0
directory: public/celex/2.0
type:      lexicon
size:      288 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      COMLEX English Syntax Corpus
directory: public/comlex/corpus
type:      text
size:      98 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      COMLEX English Syntax Lexicon
directory: public/comlex/lexicon
type:      lexicon
size:      18 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      CSTR TIMIT Sentence Data
directory: public/timit/cstr
type:      speech
size:      478 MB
licenser:  LDC
licensee:  UoE
webpage:   none

name:      CSTR Weather Database for Speech Synthesis
directory: public/synthesis/cstr/weather
type:      text
size:      255 MB
licenser:  CSTR
licensee:  UoE
webpage:   none

name:      Callhome American English, Speech
directory: public/callhome/english/speech
type:      speech
size:      1.830 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Mandarin Chinese Transcripts - XML version
directory: public/callhome/chinese
type:      speech
size:      9.5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Spanish Lexicon
directory: public/callhome/spanish/lexicon
type:      text
size:      3.1 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Spanish Transcripts
directory: public/callhome/spanish/transcripts
type:      text
size:      1.9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Canadian Hansard
directory: public/canadian_hansard
type:      text
size:      685 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Childes Child Language Database
directory: public/childes
type:      text
size:      1266 MB
licenser:  Carnegie Mellon University
licensee:  GPL
webpage:   here

name:      Chinese English Name Entity Lists, Version 1.0
directory: public/chinese_english_ne
type:      text
size:      97 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Gigaword second edition
directory: public/chinese_gigaword
type:      text
size:      1.7 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese News Translation Corpus, Part 1
directory: public/chinese_translation
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Proposition Bank 1.0
directory: public/chinese_propbank/1.0
type:      text
size:      21 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Proposition Bank 2.0
directory: public/chinese_propbank/2.0
type:      text
size:      112 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 2.0
directory: public/chinese_treebank/2.0
type:      text
size:      4.3 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 2.0, English Translation
directory: public/chinese_treebank/2.0/translation
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 3.0
directory: public/chinese_treebank/3.0
type:      text
size:      14.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 3.0, English Translation
directory: public/chinese_treebank/3.0/translation
type:      text
size:      1.7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 5.0
directory: public/chinese_treebank/5.0
type:      text
size:      31 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 6.0
directory: public/chinese_treebank/6.0
type:      text
size:      115 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese-English Translation Lexicon, Version 3.0
directory: public/chinese_english_lexicon
type:      lexicon
size:      1.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Christine Corpus of Spoken British English
directory: public/christine
type:      text
size:      4.3 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      CoNLL 2008 Shared Task Data Set
directory: public/shared_tasks/connl08
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Conference Proceedings from CDROMs
directory: public/proceedings
type:      text
size:      18 GB and growing
licenser:  various
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (CSR-III Speech)
directory: public/csr/csr3/speech
type:      speech
size:      1.952 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (CSR-III Text)
directory: public/csr/csr3/text
type:      text
size:      1.791 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (HUB-4 Language Model)
directory: public/csr/hub4
type:      text
size:      845 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Corpus of IMDB Movie Summaries, indexed for IMS Corpus Workbench
directory: public/imdb
type:      text
size:      168 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Cytology Corpus (Alvey Project)
directory: public/cytol
type:      speech
size:      372 MB
licenser:  CSTR
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2000 Dialogue Act Tagged
directory: public/darpa_communicator/2000/tagged
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2000 Evaluation
directory: public/darpa_communicator/2000/evaluation
type:      speech
size:      4.384 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2001 Dialogue Act Tagged
directory: public/darpa_communicator/2001/tagged
type:      text
size:      88 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2001 Evaluation
directory: public/darpa_communicator/2001/evaluation
type:      speech
size:      3.804 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Resource Management Continuous Speech Database (RM1)
directory: public/resource_management/rm1
type:      speech
size:      387 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Resource Management Continuous Speech Database (RM2)
directory: public/resource_management/rm2
type:      speech
size:      688 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DCIEM Sleep Deprivation Corpus
directory: public/dciem
type:      speech
size:      7.448 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DSO Corpus of Sense-Tagged English
directory: public/dso
type:      text
size:      37 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Dickens Corpus, indexed for IMS Corpus Workbench
directory: public/dickens
type:      text
size:      65 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Diphone Voices for Festival
directory: public/synthesis/diphone_voices
type:      speech
size:      4.477 GB
licenser:  CSTR
licensee:  UoE
webpage:   here

name:      Discourse Graphbank
directory: public/discourse_graphbank
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Dundee Corpus of English and French Eyemovement Data
directory: public/dundee_eyemovement
type:      speech
size:      207 MB
licenser:  Department of Psychology, University of Dundee
licensee:  UoE
webpage:   none

name:      Electromagnetic Articulograph (EMA) Data
directory: public/ema/other
type:      speech and EMA
size:      2.394 GB
licenser:  QMUC/CSTR
licensee:  UoE
webpage:   here

name:      Emotional Prosody Speech and Transcripts
directory: public/emotional_speech
type:      speech
size:      2.845 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, parsed with Minipar
directory: public/english_gigaword/parsed_minipar
type:      text
size:      17.157 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, tokenized and tagged
directory: public/english_gigaword/tagged
type:      text
size:      22.169 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Intonation in the British Isles Corpus
directory: public/ivie
type:      text
size:      2.471 GB
licenser:  University of Oxford
licensee:  freely available
webpage:   here

name:      English-Arabic Parallel Treebank
directory: public/english_arabic_treebank
type:      text
size:      18 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English-Chinese Translation Treebank 1.0
directory: public/chinese_translation_treebank
type:      text
size:      9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Enron Email Dataset
directory: public/enron/original
type:      text
size:      1.646 GB
licenser:  public domain
licensee:  none
webpage:   here

name:      Enron Email Dataset, prepared for Rainbow
directory: public/enron/rainbow
type:      text
size:      290 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Enron Email Dataset, with Topic Annotations
directory: public/enron/annotations
type:      text
size:      0.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      European Corpus Initiative Multilingual Corpus
directory: public/eci
type:      text
size:      685 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      European News Corpus
directory: public/european_news
type:      text
size:      715 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      European Parliament Proceedings Parallel Corpus, Version 2.0
directory: public/europarl
type:      text
size:      3.809 GB
licenser:  public domain
licensee:  none
webpage:   here

name:      Extended VerbNet
directory: public/verbnet
type:      lexicon
size:      2.5 MB
licenser:  University of Colorado
licensee:  freely available
webpage:   here

name:      Extended WordNet Lexical Database, WordNet Version 2.0, Extension Version 1.1
directory: public/wordnet/xwn
type:      lexicon
size:      154 MB
licenser:  University of Texas at Dallas
licensee:  freely available
webpage:   here

name:      Fisher English Training Speech, Part 1, Speech
directory: public/fisher/speech
type:      speech
size:      29.405 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Fisher English Training Speech, Part 1, Transcripts
directory: public/fisher/transcripts
type:      text
size:      280 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      FrameNet 1.1
directory: public/framenet/1.1
type:      text
size:      1.024 GB
licenser:  University of California at Berkeley
licensee:  freely available
webpage:   here

name:      FrameNet 1.3
directory: public/framenet/1.3
type:      text
size:      783 MB
licenser:  University of California at Berkeley
licensee:  freely available
webpage:   here

name:      Frankfurter Rundschau corpus (part of ECI), tokenized and tagged
directory: public/rundschau
type:      text
size:      1.074 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      French Treebank, Version 1.4
directory: public/french_treebank
type:      text
size:      147 MB
licenser:  LLF, Universite Paris 7
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Arabic Blog Parallel Text
directory: public/gale/galep1_ara_bl_ptxt
type:      text
size:      5.7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
directory: public/gale/ara_bn_ptext
type:      text
size:      36 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
directory: public/gale/ar_bn_ptxt_p2
type:      text
size:      2.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Blog Parallel Text
directory: public/gale/gale_p1_ch_blog
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
directory: public/gale/ch_bn_ptxt
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
directory: public/gale/ch_bn_ptxt
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
directory: public/gale/ch_bn_ptxt
type:      text
size:      4.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Distillation Training
directory: public/gale/galep1_distill_tr
type:      text
size:      31 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - MTPlus Pilot
directory: restricted/gale/GALE-P3-MTPlus_Pilot
group:     smt
type:      text
size:      0.86 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - Transcripts
directory: restricted/gale/GALE-P3R2/transcription
group:     smt
type:      text
size:      117 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - Translations
directory: restricted/gale/GALE-P3R2/translationGALE-P3R1
group:     smt
type:      text
size:      11.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      German Law Corpus, indexed for IMS Corpus Workbench
directory: public/german_law
type:      text
size:      40 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      GlobalPhone
directory: public/global_phone
type:      speech
size:      18 GB
licenser:  ELDA
licensee:  UoE
webpage:   GlobalPhone

name:      Google Book Corpus
directory: public/google_books
type:      text
size:      8.119 GB
licenser:  LDC
licensee:  ICCS/HCRC
webpage:   n/a

name:      Google n-Gram Corpus
directory: public/google_ngrams
type:      text
size:      25 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Gulf Arabic Conversational Telephone Speech, Transcripts
directory: public/arabic_telephone
type:      text
size:      11 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      HARD 2004 Topics and Annotations
directory: public/hard
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hebrew Treebank
directory: public/hebrew_treebank
type:      text
size:      20 MB
licenser:  Technion
licensee:  public domain
webpage:   here

name:      Hindi Wordnet
directory: public/hindi_wordnet
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong Hansard Parallel Text, Alignments
directory: public/hong_kong_hansard/alignments
type:      text
size:      91 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong Hansard Parallel Text, Text
directory: public/hong_kong_hansard/text
type:      text
size:      110 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong Laws Parallel Text
directory: public/hong_kong_laws
type:      text
size:      75 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong News Parallel Text, Alignments
directory: public/hong_kong_news/alignments
type:      text
size:      107 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong News Parallel Text, Text
directory: public/hong_kong_news/text
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ICSI Meeting Speech
directory: public/icsi_meeting/speech
type:      speech
size:      33.4 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ICSI Meeting Transcripts
directory: public/icsi_meeting/transcripts
type:      text
size:      3.51 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      IMS Corpus Workbench (corpus registry files only)
directory: public/corpus_workbench
type:      text/speech
size:      0 MB
licenser:  IMS Stuttgart
licensee:  ICCS/HCRC
webpage:   here

name:      ISL Meeting Speech
directory: public/isl_meeting/speech
type:      speech
size:      5.975 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ISL Meeting Transcripts
directory: public/isl_meeting/transcripts
type:      text
size:      1.81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Instruction-based Learning for Mobile Robots Corpus
directory: public/ibl
type:      speech
size:      123 MB
licenser:  University of Edinburgh/University of Plymouth
licensee:  freely available
webpage:   here

name:      KAIST Korean Speech Database
directory: public/kaist
type:      speech
size:      3711 MB
licenser:  The Korean Advanced Institute of Science and Technology
licensee:  UoE
webpage:   none

name:      Korean Broadcast News Transcripts 
directory: public/korean_news_transcripts
type:      text
size:      1.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Korean Propbank
directory: public/korean_propbank
type:      text
size:      24 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Korean Treebank, Version 1.0
directory: public/korean_treebank/1.0
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Korean Treebank, Version 2.0
directory: public/korean_treebank/2.0
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Lancaster Corpus of Mandarin (LCMC)
directory: public/lcmc
type:      text
size:      46 MB
licenser:  ELRA
licensee:  UoE
webpage:   here

name:      Levantine Arabic QT Training Data Set 5, Transcripts
directory: public/arabic_qt_data
type:      text
size:      27 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Lucy Corpus of Written British English
directory: public/lucy
type:      text
size:      4.7 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      MDE RT-02 Rich Transcription Broadcast News and Conversational Telephone Speech 2002
directory: public/mde/rt-02
type:      speech
size:      815 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-03 Training Data, Speech
directory: public/mde/rt-03/speech
type:      speech
size:      5256 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-03 Training Data, Text and Annotations
directory: public/mde/rt-03/text
type:      text
size:      723 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-04 Training Data, Speech
directory: public/mde/rt-04/speech
type:      speech
size:      4829 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-04 Training Data, Text and Annotations
directory: public/mde/rt-04/text
type:      text
size:      567 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)
directory: public/mandarin_transcripts/hub4-ne
type:      text
size:      2.35 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MOCHA Electromagnetic Articulograph (EMA) Corpus
directory: public/ema/mocha
type:      speech
size:      2221 MB
licenser:  QMUC/CSTR
licensee:  UoE
webpage:   here

name:      MRC Psycholinguistic Database
directory: public/mrc
type:      lexicon
size:      11 MB
licenser:  MRC
licensee:  freely available
webpage:   here

name:      Machine-readable Spoken English Corpus
directory: public/marsec
type:      speech
size:      2 MB
licenser:  Reading University
licensee:  UoE
webpage:   here

name:      Macrophone: American English Segment of the Polyphone Corpus
directory: public/macrophone
type:      speech
size:      3809 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Mandarin Transcripts (HUB-5, 2001)
directory: public/mandarin_transcripts/hub5
type:      text
size:      0.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Mandarin Transcripts, HKUST Telephone Data, Part 1
directory: public/mandarin_transcripts/hkust
type:      text
size:      11 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Maptask Corpus
directory: public/maptask
type:      speech
size:      13665 MB
licenser:  LDC and UoE/LDC
licensee:  UoE
webpage:   Maptask home page, LDC catalog entry

name:      Mawukakan Lexicon
directory: public/mawukakan_lexicon
type:      lexicon
size:      4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 6
directory: public/muc/muc6
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 6, Additional News Text 
directory: public/muc/muc6/additional_text
type:      text
size:      0.67 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 7
directory: public/muc/muc7
type:      text
size:      45.7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multext East
directory: public/multext_east
type:      text
size:      295 MB
licenser:  Jozef Stefan Institute, Ljubljana
licensee:  ICCS/HCRC
webpage:   here

name:      Multilingual Corpora for Cooperation
directory: public/mlcc
type:      text
size:      1223 MB
licenser:  internal
licensee:  internal
webpage:   here

name:      Multilingual Semcor, Version 1.1
directory: public/semcor/multisemcor
type:      text
size:      142 MB
licenser:  ITC/IRST
licensee:  University of Edinburgh
webpage:   here

name:      Multiple-Translation Arabic Corpus, Part 1
directory: public/mt_arabic/part1
type:      text
size:      5.0 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Arabic Corpus, Part 2
directory: public/mt_arabic/part2
type:      text
size:      2.5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 1, Version 1.0
directory: public/mt_chinese/part1/1.0
type:      text
size:      4.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 1, Version 2.0
directory: public/mt_chinese/part1/2.0
type:      text
size:      2.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 2
directory: public/mt_chinese/part2
type:      text
size:      3.5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 3
directory: public/mt_chinese/part3
type:      text
size:      1.1 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 4
directory: public/mt_chinese/part4
type:      text
size:      5.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NIST Meeting Pilot Corpus Transcripts and Metadata
directory: public/nist_meeting_pilot
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NIST Speaker Recognition Evaluation 2002
directory: public/nist_speaker_rec
type:      speech
size:      4724 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NIST TI Digits
directory: public/tidigits
type:      speech
size:      786 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NTIMIT Acoustic-Phonetic Continuous Speech Corpus
directory: public/ntimit
type:      speech
size:      1146 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      New York Times Annotated Corpus
directory: public/nyt_annotated
type:      text
size:      3202 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NYNEX Phonebook
directory: public/phonebook
type:      speech
size:      1.4 GB
licenser:  LDC
licensee:  UoE
webpage:   ?

name:      Newsgroup Corpus
directory: public/newsgroups
type:      text
size:      55 MB
licenser:  public domain
licensee:  none
webpage:   various newsgroups

name:      NomBank
directory: public/nombank
type:      text
size:      56 MB
licenser:  NYU
licensee:  none
webpage:   here

name:      North American News Text Corpus
directory: public/american_news/original
type:      text
size:      2342 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      North American News Text Corpus, parsed with Minipar
directory: public/american_news/parsed_minipar
type:      text
size:      3392 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OntoNotes Release 1.0
directory: public/ontonotes/1.0
type:      text
size:      750 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OntoNotes Release 2.0
directory: public/ontonotes/2.0
type:      text
size:      1299 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OHSUMED Corpus (also used for the TREC 9 Filtering Track)
directory: public/ohsumed
type:      text
size:      1176 MB
licenser:  NIST
licensee:  freely available
webpage:   here

name:      Penn Discourse Treebank, Version 1.0
directory: public/penn_discourse_treebank/1.0
type:      text
size:      10 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Discourse Treebank, Version 2.0
directory: public/penn_discourse_treebank/2.0
type:      text
size:      38 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Treebank, Version 2.0
directory: public/penn_treebank/2.0
type:      text
size:      655 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Penn Treebank, Version 3.0
directory: public/penn_treebank/3.0
type:      text
size:      256 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Prague Czech-English Dependency Treebank, Version 1.0
directory: public/prague_treebank
type:      text
size:      587 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Proposition Bank, Version 1.0
directory: public/propbank
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      RST Discourse Treebank
directory: public/rst_treebank
type:      text
size:      26 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Research Cyc
directory: public/cyc
type:      text
size:      4118 MB
licenser:  Cycorp
licensee:  UoE
webpage:   here

name:      Reuters Text Categorization Corpus 21578
directory: public/reuters/21578
type:      text
size:      28 MB
licenser:  Reuters
licensee:  freely available
webpage:   here

name:      Roget's Thesaurus from 1911
directory: public/roget
type:      text
size:      12 MB
licenser:  public domain
licensee:  freely available
webpage:   here

name:      SAID (Syntactically Annotated Idiom Dataset)
directory: public/said
type:      text
size:      3 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      SOLE Project Corpus
directory: public/synthesis/cstr/sole
type:      speech
size:      895 MB
licenser:  HCRC
licensee:  UoE
webpage:   here

name:      Search Engine Logs (Alltheweb, Excite, Altavista)
directory: public/searchengine_logs
type:      text
size:      440 MB
licenser:  Jim Jansen, Penn State University
licensee:  ICCS/HCRC
webpage:   ?

name:      Semcor Semantically Annotated Corpus, Version 1.6
directory: public/semcor/1.6
type:      text
size:      39 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Semcor Semantically Annotated Corpus, Version 2.0
directory: public/semcor/2.0
type:      text
size:      34 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Sinorama Chinese English Parallel Text
directory: public/sinorama
type:      text
size:      64 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Broadcast News
directory: public/spanish_broadcast_news
type:      speech
size:      5.2 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Gigaword, First Edition
directory: public/spanish_gigaword
type:      text
size:      1775 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Newswire, vols 1 and 2
directory: public/spanish_newswire
type:      text
size:      556 MB + 624 MB
licenser:  LDC
licensee:  UoE
webpage:   here, here

name:      Spanish Treebank
directory: public/spanish_treebank
type:      text
size:      8 MB
licenser:  University of Barcelona
licensee:  freely available
webpage:   here

name:      Susanne Corpus of Written American English, Version 1.0
directory: public/susanne/1.0
type:      text
size:      5 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Susanne Corpus of Written American English, Version 5.0
directory: public/susanne/5.0
type:      text
size:      6 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Switchboard 1 Telephone Speech Corpus, Release 2
directory: public/switchboard/switchboard1
type:      speech
size:      1485 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard 2 Telephone Speech Corpus, Phases 1-3
directory: public/switchboard/switchboard2
type:      speech
size:      50643 MB
licenser:  LDC
licensee:  UoE
webpage:   here and here and here

name:      Switchboard Cellular Telephone Speech Corpus, Part 1, Audio
directory: public/switchboard/cellular/part1/audio
type:      speech
size:      1401 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard Cellular Telephone Speech Corpus, Part 1, Transcripts
directory: public/switchboard/cellular/part1/transcripts
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard Cellular Telephone Speech Corpus, Part 2, Audio
directory: public/switchboard/cellular/part2/audio
type:      speech
size:      11364 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT2 Careful Transcription Text
directory: public/tdt2_careful_text
type:      text
size:      1.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT5 topics and annotations
directory: public/tdt5_topics_and_annotations
type:      text
size:      80 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TIMIT Acoustic-Phonetic Continuous Speech Corpus
directory: public/timit/original
type:      speech
size:      668 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Tageszeitung (TAZ) Corpus
directory: public/taz
type:      text
size:      1439 MB
licenser:  Contrapress Media GmbH
licensee:  ICCS/HCRC
webpage:   here

name:      Talbanken05 Swedish Treebank
directory: public/talbanken
type:      text
size:      144 MB
licenser:  University of Växjö and University of Lund 
licensee:  freely available
webpage:   here

name:      Timebank 1.2
directory: public/timebank
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ToBI Guidelines and Examples
directory: public/tobi_course
type:      text
size:      19 MB
licenser:  Ohio State University 
licensee:  UoE
webpage:   here

name:      Translanguage English Database (TED), Speech
directory: public/ted/speech
type:      speech
size:      2903 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Translanguage English Database (TED), Transcripts
directory: public/ted/transcripts
type:      text
size:      1.3 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Ummah Arabic English Parallel News Text
directory: public/ummah
type:      text
size:      6.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Underspecified Rhetorical Markup Language (URML) Corpus aka
           Potsdam Commentary Corpus
directory: public/urml
type:      text
size:      1.7 MB
licenser:  University of Potsdam
licensee:  HCRC/ICCS
webpage:   here

name:      WSJCAM0 Cambridge Read News
directory: public/wsjcam0/original
type:      speech
size:      3848 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      WSJCAM0 Cambridge Read News, processed data
directory: public/wsjcam0/data
type:      speech
size:      13571 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Wall Street Journal Corpus (1991 version distributed by the
           ACL data collection initiative)
directory: public/wsj
type:      text
size:      17 MB
licenser:  LDC
licensee:  UoE
webpage:   n/a

name:      Wikipedia Corpus (INEX 2006 Corpus)
directory: public/wikipedia/original
type:      text
size:      4959 MB
licenser:  various
licensee:  ICCS/HCRC
webpage:   here

name:      Wikipedia Corpus (INEX 2006 Corpus), Question Answering Version
directory: public/wikipedia/qa
type:      text
size:      5143 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 1.6
directory: public/wordnet/1.6
type:      lexicon
size:      40 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 1.7.1
directory: public/wordnet/1.7.1
type:      lexicon
size:      40 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 2.0
directory: public/wordnet/2.0
type:      lexicon
size:      41 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 2.1
directory: public/wordnet/2.1
type:      lexicon
size:      38 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Wordlists for various languages
directory: public/wordlists
type:      lexicon
size:      2 MB
licenser:  n/a
licensee:  freely available
webpage:   n/a

name:      Global Yoruba Lexical Database 1.0
directory: public/yoruba
type:      lexicon
size:      183 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Xinhua Chinese English Parallel News Text, Version 1.0 beta
directory: public/xinhua
type:      text
size:      40 MB
licenser:  LDC
licensee:  UoE
webpage:   here

Restricted Corpora

These are corpora that are licensed to a particular institute, project, or group of individuals. Access is limited to specific Unix groups made up of the authorised users; the group for each corpus is given in its `group' field.

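As a rough illustration of the group-based access described above, the following Python sketch checks whether the current user belongs to the Unix group listed for a restricted corpus. The function names are hypothetical and not part of any DICE tooling, and the check only looks at group membership; it assumes a Unix system where the standard `grp` module is available and does not inspect the actual file permissions under /group/corpora/restricted.

```python
import grp
import os


def current_group_names():
    """Return the names of the Unix groups the current process belongs to."""
    names = set()
    # os.getgroups() lists supplementary groups; add the primary group too.
    for gid in set(os.getgroups()) | {os.getgid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The gid has no entry in the group database; skip it.
            continue
    return names


def can_read_corpus(required_group, group_names):
    """Return True if required_group is among the given group names."""
    return required_group in group_names


if __name__ == "__main__":
    # 'smt' is the group listed for the GALE corpora in this catalogue.
    print(can_read_corpus("smt", current_group_names()))
```

If the check comes back False for a corpus you need, the route given on this page is to email corpus-admin@inf.ed.ac.uk and ask to be added to the relevant group.
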
name:      Reuters Corpus, various supporting data
directory: restricted/reuters/data
group:     reuters01
type:      text
size:      2 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      AQUAINT-2 Information-Retrieval Text Research Collection
directory: restricted/aquaint
group:     trec
type:      text
size:      2498 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (HUB-4)
directory: restricted/csr
group:     corpman
type:      speech
size:      908 MB
licenser:  LDC
licensee:  UoE
webpage:   here, here, here
note:      contains a mixture of LDC and proprietary data, thus restricted

name:      DMM German Morphological Database
directory: restricted/dmm
group:     dmm
type:      lexicon
size:      21 MB
licenser:  University of Erlangen-Nuernberg
licensee:  ICCS/HCRC
webpage:   here

name:      GALE Kickoff
directory: restricted/gale/kickoff
group:     smt
type:      text
size:      106 MB
licenser:  LDC
licensee:  UoE
webpage:   here and here

name:      GALE Phase 2 Releases 1, 2 and 3
directory: restricted/gale/GALE-P2*
group:     smt
type:      text
size:      1.5 GB
licenser:  LDC
licensee:  UoE
webpage:   release 1: here and here; release 2: here and here; release 3: here and here

name:      GALE Phase 3 DevTest - Source Text, Transcripts and Translations
directory: restricted/gale/GALE-P3-DevTest-V1_0
group:     smt
type:      text
size:      5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Distillation
directory: restricted/gale/GALE-Phase3-Distillation-TrainingData-V1_0
group:     smt
type:      text
size:      1.34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - English Translation Treebank
directory: restricted/gale/GALE-P3R1-EBNTT-Sep07
group:     smt
type:      text
size:      3.76 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Found Parallel Text
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      222.13 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Transcripts
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      70.57 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Translations
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      7.91 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 - IBM Arabic-English Word Alignment Corpus
directory: restricted/gale/Y1
group:     smt
type:      text
size:      25 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 Q3
directory: restricted/gale/GALE-Y1Q3
group:     smt
type:      text
size:      14 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 Q4
directory: restricted/gale/GALE-Y1Q4
group:     smt
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 4 Release 1 - Transcripts V1.0
directory: restricted/gale/GALE-P4R1
group:     smt
type:      text
size:      72 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 4 Release 1 - Translations V1.0
directory: restricted/gale/P4R1
group:     smt
type:      text
size:      72 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GermaNet (German WordNet) 4.0
directory: restricted/germanet
group:     dmm
type:      lexicon
size:      11 MB
licenser:  University of Tuebingen
licensee:  ICCS/HCRC
webpage:   here

name:      Lancaster-Oslo-Bergen Corpus of British English
directory: restricted/lob
group:     corpman
type:      text
size:      8 MB
licenser:  ?
licensee:  LTG?
webpage:   here

name:      London-Lund Corpus of Spoken English
directory: restricted/london_lund
group:     corpman
type:      text
size:      10 MB
licenser:  ?
licensee:  LTG?
webpage:   here

name:      Maptask corpora for different languages and situations
directory: restricted/maptask
group:     corpman
type:      text
size:      622 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      Medline Corpus
directory: restricted/medline
group:     umls
type:      text
size:      7602 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      NEGRA Parsed Corpus of German
directory: restricted/negra
group:     negra
type:      text
size:      55 MB
licenser:  Saarland University
licensee:  ICCS/HCRC
webpage:   here

name:      Reuters Corpus Volume 1 (English), Release 2000-11-03
directory: restricted/reuters/english
group:     reuters01
type:      text
size:      1012 MB
licenser:  NIST/Reuters
licensee:  Informatics/CSTR
webpage:   here

name:      Reuters Corpus Volume 2 (Multilingual), Release 2000-05-31
directory: restricted/reuters/multilingual
group:     reuters01
type:      text
size:      622 MB
licenser:  NIST/Reuters
licensee:  Informatics/CSTR
webpage:   here

name:      Search Engine Logs (AOL)
directory: restricted/searchengine_logs/aol
group:     querylogs
type:      text
size:      449 MB
licenser:  AOL
licensee:  freely available [but privacy concerns, hence restricted]
webpage:   ?

name:      Search Engine Logs (Excite)
directory: restricted/searchengine_logs/excite
group:     querylogs
type:      text
size:      52 MB
licenser:  Excite
licensee:  freely available [but privacy concerns, hence restricted]
webpage:   ?

name:      TIGER Parsed Corpus of German
directory: restricted/tiger
group:     negra
type:      text
size:      140 MB
licenser:  IMS Stuttgart
licensee:  ICCS/HCRC
webpage:   here

name:      TREC-9 Question Answering Track Corpus
directory: restricted/trec/trec9/question_answering
group:     corpman
type:      text
size:      62 MB
licenser:  NIST
licensee:  LTG?
webpage:   here

name:      Tuebingen Partially Parsed Corpus of German, Newspaper (TüPP-D/Z) [based on TAZ corpus]
directory: restricted/tuebingen/tueppdz
group:     negra
type:      text
size:      7651 MB
licenser:  Tuebingen University
licensee:  ICCS/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 2 [based on TAZ corpus]
directory: restricted/tuebingen/tuebadz
group:     negra
type:      text
size:      185 MB
licenser:  Tuebingen University
licensee:  ICCS/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Speech (TüBa-D/S) [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebads
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ICCS/HCRC
webpage:   here

name:      UMLS Metathesaurus 2005AC
directory: restricted/umls
group:     umls
type:      text
size:      7780 MB
licenser:  National Library of Medicine
licensee:  ICCS/HCRC
webpage:   here


Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 650 2690, Fax: +44 131 651 1426, E-mail: hod@inf.ed.ac.uk
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh