OpenSpeech Recognizer
Most speech applications allow "barge-in" so that callers can interrupt long prompts and complete their tasks faster. This capability is implemented by a speech detector, or "endpointer," which determines when the caller starts speaking and signals that prompt playback should stop. Unless this happens within a fraction of a second, the prompt continues to play while the caller speaks, and the resulting disfluencies can reduce speech recognition accuracy. At the same time, the speech detector must not be triggered by background noise; otherwise a door slam, or even a lip smack in anticipation of speech, would silence the prompt and leave the caller puzzled, listening to dead air.
OpenSpeech Recognizer includes a robust endpointer that efficiently analyzes the acoustic signal to identify speech-like sounds based on amplitude and spectral characteristics, reliably discriminating between background noise and the caller's speech. The OSR endpointer automatically adapts during each call to accommodate quiet and noisy environments. In addition, the endpointer "sensitivity" setting, as required by VoiceXML, can be manually adjusted over a wide range to fine-tune performance in unusual applications. The OSR endpointer can be run independently of the speech recognizer itself for distributed applications, and can even be replaced for complete integration flexibility.
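To make the idea of an adaptive endpointer concrete, here is a minimal sketch in Python. It is not the OSR algorithm: it uses amplitude only (no spectral features), and the function name, the `sensitivity` mapping, and all thresholds are assumptions chosen for illustration. It tracks a noise floor that adapts during non-speech frames and requires several consecutive loud frames before declaring speech, so a single door slam does not trigger barge-in.

```python
import math

def detect_speech_onset(frames, sensitivity=0.5, min_speech_frames=3, adapt_rate=0.05):
    """Return the index of the first frame of a sustained loud run, or None.

    frames: sequence of frames, each a list of float samples in [-1, 1].
    sensitivity: 0.0 (least sensitive) to 1.0 (most sensitive). The mapping
    to a decibel margin below is a hypothetical choice, not a VoiceXML rule.
    """
    # Higher sensitivity means a smaller margin above the noise floor suffices.
    margin_db = 20.0 - 14.0 * sensitivity  # 20 dB at 0.0, 6 dB at 1.0
    noise_db = None
    run = 0
    for i, frame in enumerate(frames):
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        level_db = 20.0 * math.log10(max(rms, 1e-6))
        if noise_db is None:
            # Assumption: the first frame contains background noise only.
            noise_db = level_db
            continue
        if level_db > noise_db + margin_db:
            run += 1
            if run >= min_speech_frames:
                return i - min_speech_frames + 1
        else:
            run = 0
            # Adapt the noise floor only while no speech-like energy is present.
            noise_db += adapt_rate * (level_db - noise_db)
    return None

# Synthetic call: quiet line, one-frame door slam, quiet again, then speech.
quiet, slam, speech = [0.01] * 80, [0.5] * 80, [0.3] * 80
call = [quiet] * 10 + [slam] + [quiet] * 5 + [speech] * 5
onset = detect_speech_onset(call)
```

In this synthetic call the one-frame slam is ignored because it is not sustained, and the onset is reported at the first of the five speech frames. A production endpointer would add spectral cues and hangover logic on top of a scheme like this.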
When comparing speech recognition systems, evaluators often turn to accuracy. Unfortunately, it is extremely difficult to generate meaningful accuracy figures because so many factors can affect the accuracy achieved in a particular application. For example:
Accuracy tends to decrease as vocabulary size increases. However, even small vocabularies can prove difficult to recognize if they contain words that sound alike.
The most reliable test data comes from actual callers, but this data may not be available until the application is built. Data recorded from friendly participants, perhaps calling from a quiet office and reading a script, won't include the disfluencies and background noise a live application must contend with.
Tests are often conducted using only a single channel, allowing the speech recognition system to access computing resources and memory that may not be available on a fully loaded deployment system. The additional resources can be applied to boost accuracy.
OpenSpeech Recognizer's accuracy is comparable to that of the best commercial systems available, and leads the industry by some measures. In live deployments on fully loaded systems, OSR has been shown to return the correct response more than 98% of the time for small English vocabularies and more than 95% of the time for very large English vocabularies. While these figures are reliable, remember that accuracy on other tasks, in other languages, on other systems, and with other callers may vary.
OpenSpeech Recognizer is designed for extremely efficient operation to allow cost-effective deployment of high-capacity applications.
OSR achieves its efficiency through several techniques, including the explicit segmentation approach already mentioned. It also includes patented Finite State Transducer (FST) technology that represents grammars compactly by sharing redundant segments. Removing this redundancy saves memory and computation by reducing the number of phonemes that must be processed to determine the recognition result. It also allows grammars to be compiled and loaded up to five times faster. The savings can be dramatic: a 40,000-word grammar that consumes 170 Mbytes of memory is reduced to just 15 Mbytes with FST technology.
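The effect of sharing redundant segments can be illustrated with a small sketch. This is not the patented OSR technique: it shares only common prefixes in a tree, whereas a full FST construction would typically also merge common suffixes. The phoneme strings below are rough, hypothetical transcriptions used only to show the counting.

```python
def flat_arc_count(pronunciations):
    """Phoneme arcs needed when every entry is stored separately."""
    return sum(len(phones) for phones in pronunciations)

def shared_arc_count(pronunciations):
    """Phoneme arcs needed when entries share common prefixes in a tree."""
    root = {}
    arcs = 0
    for phones in pronunciations:
        node = root
        for ph in phones:
            if ph not in node:
                node[ph] = {}  # a new arc is created only for unseen continuations
                arcs += 1
            node = node[ph]
    return arcs

# Hypothetical pronunciations for three city names with a common beginning.
grammar = [
    ["n", "uw", "y", "ao", "r", "k"],  # "New York"
    ["n", "uw", "ah", "r", "k"],       # "Newark"
    ["n", "uw", "p", "ao", "r", "t"],  # "Newport"
]

flat = flat_arc_count(grammar)
shared = shared_arc_count(grammar)
```

Here the separate representation needs 17 phoneme arcs and the prefix-shared one needs 13, because the leading "n uw" is stored once. On a grammar with tens of thousands of entries, the same effect compounds. For scale, the 170 Mbyte to 15 Mbyte figure quoted above corresponds to roughly an 11-fold reduction.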
Every speech recognizer can deliver higher accuracy with the application of more computing resources. OSR is less sensitive to such changes than competing recognizers because of its efficient design. Nonetheless, OSR incorporates load-sensitive algorithms that use all computing resources available to best advantage. In fact, SpeechWorks was the first company to develop such techniques.
OSR automatically allows one copy of a grammar loaded in memory to be shared by speech recognition processes on all channels. This provides a significant reduction in memory footprint for large-scale deployments where the same application runs on dozens of channels.
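The sharing pattern can be sketched as a load-once cache. This is an assumption-laden illustration, not OSR's mechanism: it shares a grammar between channels inside a single Python process, while sharing across operating-system processes would need shared memory or a similar facility. `GrammarCache` and `load_grammar` are hypothetical names.

```python
import threading

class GrammarCache:
    """Load each grammar once and hand the same object to every channel."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._grammars = {}
        self.load_count = 0  # how many times the expensive load actually ran

    def get(self, name):
        # The lock ensures two channels cannot load the same grammar at once.
        with self._lock:
            if name not in self._grammars:
                self._grammars[name] = self._loader(name)
                self.load_count += 1
            return self._grammars[name]

def load_grammar(name):
    # Stand-in for an expensive grammar compile-and-load step.
    return {"name": name, "words": ("yes", "no")}

cache = GrammarCache(load_grammar)
# 48 channels request the same grammar; only the first request loads it.
channel_grammars = [cache.get("yes_no") for _ in range(48)]
```

Every channel ends up holding a reference to the same in-memory object, so the memory cost of the grammar is paid once rather than once per channel.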