Steve Isard's information on the Cytology Corpus

That's data from the original Alvey project, intended for training an automatic system that pathologists at the Western could use for dictating reports while looking at slides. It was a natural application for automatic recognition because the pathologists already spoke their reports into dictaphones, to be transcribed by typists. So the pathologist could carry on dictating as before, and there was already an error rate, which meant that the pathologist had to check the transcription that came back from the typist. The advantage of automatic transcription was going to be speed, because the hospital was concerned about the length of time it took to get the report back to the patient (or GP, or whoever needed to see it to decide whether treatment was called for). That meant that there was a realistic-looking goal of just getting the error rate low enough to be a good tradeoff against the extra speed. Unfortunately the error rate stayed up around 100%. I wonder whether the pathologists have adopted Dragon or Naturally Speaking in the meantime. It might be interesting to try running that data through a modern recognition system to see how hard the task now appears to be.

The dataset was produced by pathologists taking the time to blank out names and identifying details from old reports for us, and then having Gordon Watson (and maybe others) read the transcripts.

Also see Henry Thompson / Ellen Bard.


Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh