Wednesday, August 23, 2006

You can't tune silence

Tuning refers to the post-deployment tweaking of grammars in a speech app. Your QA department can test the features, but nothing probes your app's grammars like real callers in all their variations of speaking style. If the open-source mantra is "with enough eyeballs, all bugs are shallow", then the speech app mantra should be "with enough speakers, all grammar bugs are shallow".

Tuning is time-consuming because a human has to listen to the audio recordings of hundreds of phone calls. If the system can log speech rec errors, then the person doing the tuning can zero in on those utterances. If not, and we will see how this can happen, tuning becomes very difficult.

Out-of-grammar errors are the easiest to detect because the speech recognizer returns a NoRec error. Some of these are due to coughs, background noise or other audio problems that grammar changes can't fix. Others are due to variations in pronunciation that your grammar doesn't know about. People's names are notoriously hard this way because: (a) names are multi-lingual in origin, and (b) many people will mispronounce a name they've only read. We had a Mr Biber here, pronounced bee-ber, but many people would say bye-ber. Tuning revealed this, as well as the fact that some people would say only the last name. The grammar needed to be changed to cover all these possibilities.
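To make the Biber fix concrete, here is a minimal sketch of expanding one directory entry into the phrases a grammar might need to accept. The function name, the first name "Pat", and the trick of using a sound-alike spelling ("Byber") to stand in for the alternate pronunciation are all assumptions for illustration; a real engine may want a lexicon entry instead.

```python
def name_variants(first, last, sound_alike_lasts=()):
    """Return the phrases a directory grammar might accept for one person.

    sound_alike_lasts holds hypothetical respellings that approximate
    mispronunciations discovered during tuning.
    """
    variants = [f"{first} {last}", last]  # full name, then last name only
    for alt in sound_alike_lasts:
        variants.append(f"{first} {alt}")
        variants.append(alt)
    # Lower-case and drop duplicates while keeping the original order.
    return list(dict.fromkeys(v.lower() for v in variants))


# "Pat" is a made-up first name; the post only mentions "Mr Biber".
phrases = name_variants("Pat", "Biber", ["Byber"])
print(phrases)
```

Each phrase would map back to the same directory entry, so the caller reaches Mr Biber whichever way they say it.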

The more difficult problem is what we call the Rumplestiltskin error. Most speech engines want to recognize. Try saying "Rumplestiltskin" to a speech-rec auto-attendant. You'll be surprised at how often it will find a (wrong) match. Saying something completely outside the grammar will often cause a medium-confidence recognition of some wrong word in the grammar. Of course you can reduce this by raising the recognition and confidence thresholds, but that may cause other unwanted problems: more valid utterances get rejected or sent to confirmation, and confirmation is an added step in the dialog that callers can grow to resent. The problem with a Rumplestiltskin error is that it's invisible; the system really has no knowledge that a recognition error has occurred. This makes finding the error a time-consuming task for the tuner. The caller may be saying an acceptable alternative for a word, but until we discover this and add it to the grammar, the system will not be pleasing this caller.
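The threshold trade-off can be sketched as a three-way decision on the recognizer's confidence score. The threshold values, the 0-to-1 scale, and the function name below are assumptions, not any particular engine's defaults.

```python
def decide(confidence, reject_below=0.3, confirm_below=0.7):
    """Map a recognition confidence (assumed 0.0 to 1.0) to a dialog action."""
    if confidence < reject_below:
        return "reject"    # treated as a NoRec; visible in the logs
    if confidence < confirm_below:
        return "confirm"   # ask "Did you say ...?"; an extra dialog step
    return "accept"        # proceed silently; wrong matches go unnoticed


# A hypothetical out-of-grammar utterance that scored 0.55 against a wrong word:
print(decide(0.55))                     # confirm: caller can say "no"
print(decide(0.55, confirm_below=0.5))  # accept: the error becomes invisible
```

Lowering `confirm_below` spares callers the extra step but lets medium-confidence wrong matches through unlogged; raising it catches more of them at the cost of more confirmations.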

Tuning becomes a big problem when dynamic grammars are used, because you can't do any pre-deployment tuning. The app reads a list of phrases from a database, say Little League team names, and generates a grammar. Consider the "Nowell Gnats". It's not unusual for the TTS engine to pronounce a word differently from what the speech rec engine accepts as a pronunciation. This is bad because your prompt may say "you can say Nole Nats..." whereas the speech rec engine wants "Now-well Nats".

It's easy to end up with a situation where the app won't accept any sensible pronunciation for a word, and even tells callers the wrong pronunciation to use! All without generating any recognition errors. Silent failure.
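One way to catch this before callers do is to compare what the TTS engine will say against what the recognizer will accept for every word in a generated grammar. The sketch below assumes both pronunciations can be exported as simple strings, which many platforms do not make easy; the two lexicon dictionaries and the function name are hypothetical.

```python
def pronunciation_mismatches(phrases, tts_lexicon, asr_lexicon):
    """List words whose TTS pronunciation the recognizer would not accept.

    tts_lexicon: word -> the single pronunciation the prompt will speak
    asr_lexicon: word -> set of pronunciations the recognizer accepts
    """
    mismatches = []
    for phrase in phrases:
        for word in phrase.lower().split():
            spoken = tts_lexicon.get(word)          # None if TTS has no entry
            accepted = asr_lexicon.get(word, set())
            if spoken is None or spoken not in accepted:
                mismatches.append((word, spoken, sorted(accepted)))
    return mismatches


# Hypothetical lexicon entries for the team name from the post.
tts = {"nowell": "nole", "gnats": "nats"}
asr = {"nowell": {"now-well"}, "gnats": {"nats"}}
print(pronunciation_mismatches(["Nowell Gnats"], tts, asr))
```

Anything this check flags is a tuning candidate that would otherwise never show up as a recognition error.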

Can the speech rec platform help? Yes, here are a couple of ideas. First, the app should have a "tuning mode" that can be switched on temporarily. It raises reco thresholds to force more confirmations, and logs all rejected confirmations as candidates for tuning. Second, the system should have a batch process that studies patterns of calls. If the same person calls back several times in a row and never seems to complete a transaction, then these calls are also tuning candidates.
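The batch process might look something like the following sketch. The call-log shape (caller ID, start time in seconds, completed flag), the limit of three calls, and the ten-minute window are all assumptions chosen for illustration.

```python
from collections import defaultdict


def repeat_failure_candidates(calls, min_calls=3, window_seconds=600):
    """Return caller IDs with min_calls incomplete calls in a row, close together.

    calls: iterable of (caller_id, start_time_seconds, completed) tuples.
    """
    by_caller = defaultdict(list)
    for caller_id, start, completed in calls:
        by_caller[caller_id].append((start, completed))

    candidates = []
    for caller_id, records in by_caller.items():
        records.sort()   # oldest call first
        streak = []      # start times of consecutive incomplete calls
        for start, completed in records:
            if completed:
                streak = []  # a completed transaction resets the streak
                continue
            # Keep only earlier failures that fall inside the time window.
            streak = [t for t in streak if start - t <= window_seconds]
            streak.append(start)
            if len(streak) >= min_calls:
                candidates.append(caller_id)
                break
    return candidates


# Made-up log: one caller fails three times in five minutes, another succeeds.
log = [
    ("555-0100", 0, False), ("555-0100", 120, False), ("555-0100", 300, False),
    ("555-0199", 0, False), ("555-0199", 60, True),
]
print(repeat_failure_candidates(log))
```

The recordings for the flagged callers are where the tuner should start listening.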


Marshall Harrison - "the gotspeech guy" said...

I really like the idea of changing the reco threshold at runtime to force logging of what the user is saying.

Eduardo Olvera said...

One more thing that's pretty powerful is what some people call "evaluative usability", which is nothing more than spending some time after deployment listening to actual calls taking place. This not only helps identify usability problems but can also help find trouble spots, areas where misrecognitions may be taking place, and even changes in behavior over time (mostly due to callers learning how to interact with the system).