Technical Discussion of Misrecognitions
in Layman Terms

The following was posted to VoiceGroup@yahoogroups.com on 11/12/2004. The author, Chuck Runquist, is a former employee of ScanSoft. He is well respected in the speech recognition community for his in-depth inside knowledge of SR software.

First, DNS uses two statistical models: one for vocabulary alone, and one for context probability, which evolves through use, document/writing style analysis, and corrections.

Second, the Mendez Language Translation Group, which L&H acquired in 1996 and which accounted for the largest number of L&H employees (over 3000), was the largest language translation organization in the world, translating more than 3 million documents a year for corporations worldwide in over 30 languages. The work of this group formed the foundation of the language models used by Dragon Systems, IBM, and L&H in the mid to late 1990s. The initial work involved the analysis of over 10 million documents in English alone, and was based on the careful selection of everything from everyday personal correspondence to classical and contemporary literature, as well as news, technical, professional (medical, legal, etc.), and many other forms of writing, to ensure that these probabilities reflected the most up-to-date and accurate word usage statistics. Every year some 3 to 4 million documents are analyzed and the statistical probability tables are updated. Therefore, the current language models contain the most current overall probabilities that specific words will occur together in context (i.e., digrams and trigrams, or two-word and three-word combinations). This work was, and is, all done by professional linguists using the most current research tools and information. The problem is not the accuracy of the language models.
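To make the idea of digram probabilities concrete, here is a minimal sketch of how the chance that one word follows another can be estimated from raw counts. It is only an illustration: the three-sentence corpus and the function name `bigram_probabilities` are made up for this example, and nothing here is taken from the actual DNS language models, which are built from millions of documents.

```python
from collections import Counter


def bigram_probabilities(corpus):
    """Estimate P(second | first) for adjacent word pairs from raw counts.

    `corpus` is a list of sentences (hypothetical toy data, not real
    language-model training text).
    """
    first_word_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        # Walk over every adjacent pair of words in the sentence.
        for first, second in zip(words, words[1:]):
            first_word_counts[first] += 1
            pair_counts[(first, second)] += 1
    # Relative frequency: how often `second` follows `first`,
    # out of all the times `first` was followed by anything.
    return {
        pair: count / first_word_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


corpus = [
    "i want to write a letter",
    "i want the right answer",
    "turn right at the light",
]
probs = bigram_probabilities(corpus)
print(probs[("i", "want")])     # "want" always follows "i" in this toy corpus
print(probs[("the", "right")])  # "the" is followed by "right" half the time
```

A trigram table works the same way, except that the counts are keyed on three consecutive words instead of two.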

Third, the real problem is time. SR applications only have so much time to analyze dictation and select the correct word or phrase. I forget the exact range of CPU cycles allotted to this analysis, but you control it when you set the Speed vs. Accuracy slider in DNS Options. Regardless, for real-time transcription, the time available is capped so as to provide the "Best Match" analysis while giving the fastest possible results display. This process is immensely complex and requires very precise timing, execution, and performance balancing. If any SR application were given enough time to analyze dictation, the accuracy would always be 100%, or at least 99.999999999999%. However, in order to do this the user would have to wait about 4 minutes for the SR application to analyze all the possible combinations in order to be able to select the correct word(s).
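A rough way to see why time is the bottleneck is to count how many word combinations would have to be examined. The sketch below compares an exhaustive search with a pruned search that keeps only a fixed number of the best partial guesses at each word (often called a beam search). The numbers and function names are hypothetical and chosen only for illustration; this is not a description of how DNS actually budgets its CPU cycles.

```python
def exhaustive_hypotheses(candidates_per_word, num_words):
    """Number of full word sequences if every combination is examined."""
    return candidates_per_word ** num_words


def pruned_hypotheses(candidates_per_word, num_words, beam_width):
    """Number of partial sequences examined when only the best
    `beam_width` guesses survive after each word."""
    total = 0
    alive = 1  # start with a single empty hypothesis
    for _ in range(num_words):
        # Every surviving hypothesis is extended by every candidate word.
        expanded = alive * candidates_per_word
        total += expanded
        # Pruning: keep at most `beam_width` of them for the next word.
        alive = min(expanded, beam_width)
    return total


# Hypothetical figures: 10 plausible candidates per word, 8-word utterance.
print(exhaustive_hypotheses(10, 8))   # 100000000
print(pruned_hypotheses(10, 8, 50))   # 3110
```

Widening the beam plays a role loosely similar to moving a speed-versus-accuracy setting toward accuracy: more combinations are kept alive, so the right one is less likely to be pruned, but each utterance takes longer to process.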

Lastly, even if the correct word or words are at the top of the final analysis list at the end of this time period, SR engines still have to make a judgment call based on the "Best Match" between the user's Acoustical Model (recorded enunciation patterns) and the phonemes of each candidate. Therefore, even if the correct word(s) are in the first slot in this final listing, it is the user's recorded enunciation, compared against the selections in that list, that determines the final selection. This makes the user's Acoustical Model the single most important condition for accuracy. And, as I pointed out in a previous e-mail, when two people converse over the telephone, where neither can see the other or hear as clearly as when they are face to face, there is an increased tendency to misunderstand what the other is saying (human misrecognition). People are thousands of times better at interpreting (transcribing) speech than computers are, just as computers are thousands of times better at crunching numbers. If people misrecognize words, it is absurd to think that computers will be any better.
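The interplay between the acoustic match and the language statistics can be pictured with a small sketch. It assumes each candidate phrase carries two scores, one for how well it matches the speaker's enunciation and one for how likely the words are in context, and that the two are combined by adding their logarithms. The scores, the `lm_weight` parameter, and the function name are all invented for this illustration and do not come from any real SR engine.

```python
import math


def pick_best(candidates, lm_weight=1.0):
    """Return the candidate with the highest combined log score.

    Each candidate is a dict with hypothetical keys:
      "acoustic": how well the phrase matches the recorded speech (0..1)
      "language": how likely the phrase is in context (0..1)
    """
    def score(candidate):
        return (math.log(candidate["acoustic"])
                + lm_weight * math.log(candidate["language"]))

    return max(candidates, key=score)


candidates = [
    {"text": "recognize speech", "acoustic": 0.40, "language": 0.020},
    {"text": "wreck a nice beach", "acoustic": 0.45, "language": 0.001},
]

# With both scores in play, the contextually likely phrase wins.
print(pick_best(candidates)["text"])

# With the language score ignored, the slightly better acoustic match wins,
# even though it is the wrong phrase.
print(pick_best(candidates, lm_weight=0.0)["text"])
```

The second call shows how a small difference in the acoustic match can tip the final selection, which is why a well-trained acoustic model matters so much.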

So, when you think of S