Speech recognition converts spoken words into machine-readable input. It is used in health care, the military sector, telephony and other domains, and it is also a helpful tool for disabled people. But what does speech recognition mean in the field of captioning? And what does it mean in the field of language transfer?

Dr Mike Wald, a researcher in this field and a speaker at the Languages and the Media Conference, makes clear that speech recognition can help reduce the cost of captioning, particularly live captioning, making it available to more people. For the wide range of multimedia content now in use, synchronising captions with speech also allows the multimedia to be searched through the synchronised text. A further advantage of text captions could be support for readers’ language development, which would benefit both hearing-impaired people and non-native speakers.

In a short interview with Languages and the Media, Dr Wald provides insight into his research and gives an overview of what speech recognition means in the field of captioning.

L&M: Dr Wald, could you please briefly describe how speech recognition captioning works?
Mike Wald: Speech recognition captioning automatically changes what is being said into writing. This allows people who cannot hear or, for other reasons, cannot understand what is being said to read it instead. The technique works by analysing sounds and then determining what words have been uttered. The system makes a best guess from knowing what sounds are normally used when speaking a particular language, how words are normally pronounced and how words are written in that language. Having analysed a large inventory of sentences, the system also knows which word is most likely to occur at a given point in a sequence of words.
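The "best guess" described above can be illustrated with a minimal sketch. It assumes a toy setup that is not taken from the interview: each position in an utterance has a few candidate words with made-up acoustic probabilities (how well the sound matches the word), and a made-up table of word-pair probabilities stands in for the knowledge gained from analysing many sentences. The names `ACOUSTIC`, `BIGRAM` and `best_guess` are hypothetical, and real recognisers use far larger models and more efficient search.

```python
import itertools
import math

# Hypothetical acoustic scores: how well each candidate word matches the sound
# heard at each position. All numbers are invented for illustration.
ACOUSTIC = [
    {"recognise": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Hypothetical word-pair probabilities, standing in for what a system learns
# from a large inventory of sentences. "<s>" marks the start of the utterance.
BIGRAM = {
    ("<s>", "recognise"): 0.02,
    ("<s>", "wreck a nice"): 0.001,
    ("recognise", "speech"): 0.3,
    ("recognise", "beach"): 0.01,
    ("wreck a nice", "speech"): 0.01,
    ("wreck a nice", "beach"): 0.2,
}


def best_guess(acoustic, bigram, floor=1e-6):
    """Return the word sequence with the highest combined score.

    The score adds the log of the acoustic probability and the log of the
    word-pair probability for every word. Unseen word pairs get a small
    floor probability instead of zero.
    """
    best_words, best_score = None, -math.inf
    # Try every combination of candidates (fine for a toy example only).
    for words in itertools.product(*(step.keys() for step in acoustic)):
        score = 0.0
        previous = "<s>"
        for step, word in zip(acoustic, words):
            score += math.log(step[word])
            score += math.log(bigram.get((previous, word), floor))
            previous = word
        if score > best_score:
            best_words, best_score = list(words), score
    return best_words


print(best_guess(ACOUSTIC, BIGRAM))
```

In this toy data "speech" and "beach" sound equally likely, so the choice between them is made entirely by the word-pair probabilities.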

L&M: Speech recognition is easy to associate with automatic translation. What are the advantages and what are the problems of this technique?
Mike Wald: Automatic translation of speech from one language to another would be very useful, but so far, accurate machine translation of text from one language to another is difficult to achieve. Converting speech into text using speech recognition introduces additional errors into the system. Translation of speech is more successful if the system only has to identify what has been said from a limited number of possible phrases.

L&M: This topic quickly leads to the question of quality and usability. Where do you see its efficiency and usability?
Mike Wald: As a member of the Liberated Learning Consortium, I have worked with colleagues for the past ten years on researching and improving the efficiency and usability of speech recognition technologies. Our main focus is on the presentation of live captions in classrooms as well as transcripts and synchronised captions on the web to help disabled students and non-native speakers.

Speech recognition is most accurate when there is no background noise, when there is only one speaker who pronounces words clearly, and when the system has been trained to know how that particular person pronounces words, what words they use most and what they are most likely to be talking about. Using a trained person to repeat what is being said into a trained speech recognition system is one way to try to overcome the limitations speech recognition faces with background noise, multiple people speaking or people speaking unclearly.

A human operator can correct errors in the captions created by speech recognition, but this takes time and a substantial delay is unacceptable in some situations.

L&M: Do you have a concrete example that illustrates how speech recognition captioning can be used to enhance multimedia learning?
Mike Wald: We are developing technologies to make multimedia easier to access and use through the creation of synchronised notes, bookmarks, tags, links, images and text captions. While users can easily bookmark, search, link to, or tag the whole of a podcast or video recording available on the web, they cannot easily find or associate their notes with part of that recording. As an analogy, users would clearly find a textbook difficult to use if it had no contents page, index or page numbers. I have come up with the new term ‘Synnotations’ for these synchronised annotations, as there would appear to be no existing appropriate single word.

Providing synchronised text captions with audio and video makes the communication qualities and strengths of all three media available. Text can reduce the memory demands of spoken language, speech can better express subtle emotions, and images can communicate moods, relationships and complex information holistically.

Deaf learners and non-native speakers may be particularly disadvantaged if multimedia involving speech is not captioned.
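The idea of synchronised captions and notes described in this answer can be sketched in a few lines. This is only an illustration under stated assumptions, not the way Synote or any other tool mentioned here is implemented: the `Caption` and `Note` types, the functions `search` and `caption_at`, and the sample data are all hypothetical. Each caption carries the start and end time of the speech it transcribes, so a text search returns positions in the recording, and a note taken at a given time can be attached to the part of the recording it belongs to.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Caption:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    text: str


@dataclass
class Note:
    time: float   # moment in the recording the note refers to
    text: str


def search(captions: List[Caption], term: str) -> List[float]:
    """Return the start times of all captions that contain the search term."""
    term = term.lower()
    return [c.start for c in captions if term in c.text.lower()]


def caption_at(captions: List[Caption], time: float) -> Optional[Caption]:
    """Return the caption being spoken at the given time, if any."""
    for c in captions:
        if c.start <= time < c.end:
            return c
    return None


# Invented sample data for a short lecture recording.
captions = [
    Caption(0.0, 4.5, "Welcome to today's lecture."),
    Caption(4.5, 9.0, "We begin with speech recognition."),
    Caption(9.0, 15.0, "Captions make recordings searchable."),
]
note = Note(time=10.2, text="Ask about search accuracy")

print(search(captions, "speech"))             # times to jump to in the player
print(caption_at(captions, note.time).text)   # caption the note belongs to
```

With timings like these, a player could jump straight to the matching part of a recording instead of linking only to the recording as a whole.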

L&M: Dr Wald, thank you very much for your time.

Dr Wald is currently working on Net4Voice, a two-year European-funded project with Liberated Learning colleagues in Italy, Germany and England that uses speech recognition in classrooms to enhance learning for disabled students and others. He is also working on a fifteen-month, JISC-funded UK project developing Synote, a web-based application to create synchronised annotations.

Mike Wald will give his presentation ‘Using Speech Recognition Captioning to Enhance Multimedia Learning’ on Friday, October 31st, from 11:45 to 13:15 in the session ‘Tools’.

September 25, 2008