British Universities Film & Video Council


Intelligent Speech

Speech-to-text technologies could transform how academic research is conducted. Such technologies take a digital audio speech file and convert it into word-searchable text, with varying degrees of accuracy, comparable to uncorrected OCR (optical character recognition) for text. What was pioneering science a few years ago has entered the mainstream, with speech recognition services becoming increasingly familiar to the general public as smartphone applications. However, the technological challenge is far greater when it comes to tackling large-scale speech archives.

Over 2012-13 the British Library has hosted a research project, funded by the Arts & Humanities Research Council, entitled Opening up Speech Archives. Paul Wilson was co-investigator with myself, and Mari King was the project researcher. The project has looked not so much at the technical solutions on offer but rather at what value such technologies might bring to researchers across a range of academic disciplines. It involved interviewing researchers in groups and individually, and getting them to test out trial applications and demonstrator services. We have surveyed the opinions of specialists in the field, asking where they think the technology is going, and we have built a test service, entitled Searching Speech, developed by GreenButton Inc. using the Microsoft MAVIS system (marketed under the name inCus). This offers 8,000 hours of audio and video, featuring television and radio news from 2011 (Al-Jazeera English, CNN, NHK World and BBC Radio 4), historic radio programmes and oral history interviews. Rights considerations prevent us from making this freely available online, but the service can be consulted onsite by appointment. We also organised the Opening up Speech Archives conference in February 2013, which brought together product developers, service providers, archivists, curators, librarians, technicians and researchers from various disciplines; and the Semantic Media @ The British Library workshop, organised with Queen Mary, University of London, in September 2013.

There are different kinds of speech-to-text systems, but essentially they work in one of two ways. Dictionary-based systems match the sounds they ‘hear’ to a corpus of words that they understand. If they do not recognise a particular word, as happens in particular with proper names, they will suggest the nearest word in their dictionary that matches it. This leads to all manner of comic results, such as one service which confidently reported that French troops had entered the city of Tim Buckley (meaning Timbuktu), or another which read ‘Croydon’ as ‘poison’. The results are quite unlike those of uncorrected OCR for text, where gobbledegook is produced when a word is not fully recognised. Dictionary-based speech-to-text systems (which form the majority) only output complete words, and so frequently produce plausible but misleading answers. For example, one system we tried out picked up the phrase “no tax breaks for married couples” from a BBC TV news report, when the actual words used were “new tax breaks for married couples”. A researcher who relies on the text results alone, without listening to the actual recording, will go away with the wrong answer.
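The behaviour described above can be sketched in a few lines of code. This is a toy illustration only, not the algorithm of any product mentioned here: the vocabulary and the ‘heard’ tokens are invented, and string similarity stands in for acoustic matching. The point it demonstrates is that a dictionary-based system always emits a real word from its vocabulary, even when the input is ambiguous or outside it.

```python
import difflib

# Illustrative only: a dictionary-based system can only ever output words
# from its vocabulary. An unrecognised or ambiguous token is replaced by
# the nearest vocabulary entry, so the result is always plausible text,
# but sometimes wrong (e.g. 'no' versus 'new').
VOCABULARY = ["new", "no", "tax", "breaks", "for", "married", "couples"]

def recognise(heard_tokens):
    """Map each 'heard' token to its closest vocabulary word."""
    result = []
    for token in heard_tokens:
        # cutoff=0.0 forces a match however poor -- there is no way to
        # say "word not recognised", only "nearest word in dictionary"
        nearest = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.0)
        result.append(nearest[0])
    return result

# A garbled opening token is silently mapped onto a real word:
print(recognise(["nw", "tax", "breaks", "for", "married", "couples"]))
```

Whatever is fed in, every output token belongs to the vocabulary, which is precisely why such systems produce misleading rather than obviously broken transcripts.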

Other speech-to-text systems are phoneme-based. These recognise the building blocks of words rather than words themselves. For example, if one types ‘Barack Obama’ into a phoneme-based service, it will search for ‘BUH-RUK-OH-BAH-MA’, bringing back results where that combination of phonemes, or anything close to it, can be found. Such systems will always find something because they will bring up the best results they can find, no matter what the query. This leads to many ‘false positives’ – essentially, incorrect answers. We have had fun testing a demonstration service using recent American news programmes and searching for unlikely terms such as ‘turnips’. Inevitably the service found something (usually the words ‘turn it’ or ‘turn in’). Such systems do not present researchers with a transcript, or pseudo-transcript, because they do not store words, only units of sound. This is frustrating for the researcher who wants to browse search results, but the compensatory advantage is that a phoneme-based system can deal that much better with unusual accents or words. Most speech-to-text systems currently on the market have been trained on American English, and tend to be less effective with other accents.
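The ‘turnips’ effect can likewise be sketched in code. Again this is a toy, not a real phonetic search engine: the phoneme spellings are invented, the ‘recording’ is a made-up sequence, and fuzzy string comparison stands in for acoustic scoring. What it shows is the structural point: because the system always returns its best-scoring window, a query that appears nowhere in the recording still gets an answer.

```python
import difflib

# Illustrative only: a recording stored as a sequence of (invented)
# phoneme tokens, roughly 'turn it in to the police'.
RECORDING = "T-ER-N IH-T IH-N T-UW DH-AH P-AH-L-IY-S".split()

def best_match(query_phonemes, recording=RECORDING):
    """Return the recording window that best matches the query phonemes.

    There is no minimum score: the best available window is always
    returned, which is exactly what produces 'false positives'.
    """
    window = len(query_phonemes)
    query = " ".join(query_phonemes)
    best_score, best_span = -1.0, None
    for i in range(len(recording) - window + 1):
        candidate = " ".join(recording[i:i + window])
        score = difflib.SequenceMatcher(None, query, candidate).ratio()
        if score > best_score:
            best_score, best_span = score, candidate
    return best_span, round(best_score, 2)

# 'turnips' is nowhere in the recording, but the system still offers
# its nearest phonemic neighbour ('turn it'):
print(best_match(["T-ER-N", "IH-P-S"]))
```

Note that nothing word-like is stored at any point: the system holds only phoneme sequences, which is why it can offer no transcript to browse, but also why an out-of-vocabulary name or accent does not defeat it.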
