Wednesday, August 23, 2006

You can't tune silence

Tuning refers to the post-deployment tweaking of grammars in a speech app. Your QA department can test the features, but there's nothing like real callers, in all their variations of speaking styles, to really probe your app's grammars. If the open-source mantra is "with enough eyeballs, all bugs are shallow", then the speech app mantra should be "with enough speakers, all grammar bugs are shallow".

Tuning is time-consuming because a human has to listen to the audio recordings of hundreds of phone calls. If the system can log speech rec errors, the person doing the tuning can zero in on those utterances. If it can't (and we will see how this can happen), tuning becomes very difficult.
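To make this concrete, here is a minimal sketch of how a tuner might zero in on suspect utterances. The log format is invented; fields like result, confidence and audio_file are assumptions for illustration, not any platform's actual schema.

    # Sketch: pick out the utterances a tuner should listen to first.
    # Assumes a hypothetical CSV call log; field names are illustrative.
    import csv

    def utterances_worth_reviewing(log_path, confidence_floor=0.45):
        """Yield log rows whose audio deserves a human listen."""
        with open(log_path, newline="") as f:
            for row in csv.DictReader(f):
                if row["result"] == "NoRec":
                    yield row        # out-of-grammar or audio problem
                elif float(row["confidence"]) < confidence_floor:
                    yield row        # shaky match -- possible mis-recognition

    for utt in utterances_worth_reviewing("call_log.csv"):
        print(utt["call_id"], utt["audio_file"], utt["result"], utt["confidence"])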

Out-of-grammar errors are the easiest to detect because the speech recognizer returns a NoRec error. Some of these are due to coughs, background noise or other audio problems that grammar changes can't fix. Others are due to variations in pronunciation that your grammar doesn't know about. People's names are notoriously hard this way because: (a) names are multilingual in origin, and (b) many people will mispronounce a name they've only read. We had a Mr. Biber here, pronounced bee-ber, but many callers would say bye-ber. Tuning revealed this, as well as the fact that some people would say only the last name. The grammar needed to be changed to cover all these possibilities, as sketched below.
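Here is a rough idea of what "cover all these possibilities" can look like when generating a grammar rule for a name. It's a Python sketch: the first name and the spelled-out "sounds-like" variants are made up, and real platforms often take alternate pronunciations through a lexicon rather than extra grammar items.

    # Sketch: build an SRGS <rule> that accepts the full name, the last name
    # alone, and spelled-out "sounds-like" variants. Entries are illustrative.
    from xml.sax.saxutils import escape

    def name_rule(first, last, sounds_like=()):
        phrases = [f"{first} {last}", last, *sounds_like]
        body = "\n".join(f"    <item>{escape(p)}</item>" for p in phrases)
        return (f'  <rule id="{last.lower()}">\n'
                f'   <one-of>\n{body}\n   </one-of>\n  </rule>')

    # "bye-ber" callers and last-name-only callers are both covered:
    print(name_rule("John", "Biber", sounds_like=["John Byber", "Byber"]))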

The more difficult problem is what we call the Rumplestiltskin error. Most speech engines want to recognize something. Try saying "Rumplestiltskin" to a speech-rec auto-attendant. You'll be surprised at how often it finds a (wrong) match. Saying something completely outside the grammar often produces a medium-confidence recognition of some wrong word in the grammar. You can reduce this by raising the recognition and confidence thresholds, but that may cause other unwanted recognition problems. Confirmation, for example, is an added step in the dialog that callers can grow to resent. The problem with a Rumplestiltskin error is that it's invisible; the system has no knowledge that a recognition error has occurred. This makes finding the error a time-consuming task for the tuner. The caller may be saying an acceptable alternative for a word, but until we discover this and add it to the grammar, the system will keep failing this caller.
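One way to picture the trade-off is a simple three-band policy on the confidence score. This is only a sketch with invented threshold values; confirm, reprompt and accept stand in for whatever the dialog layer actually does.

    # Sketch: three-band confidence policy. Thresholds are illustrative.
    REJECT_BELOW  = 0.30   # below this, treat as NoRec and re-prompt
    CONFIRM_BELOW = 0.60   # in between, ask the caller to confirm

    def handle_result(text, confidence, confirm, reprompt, accept):
        if confidence < REJECT_BELOW:
            reprompt()                 # visible failure -- shows up in the logs
        elif confidence < CONFIRM_BELOW:
            if confirm(f"Did you say {text}?"):
                accept(text)
            else:
                reprompt()             # rejected confirmation -- a tuning candidate
        else:
            accept(text)               # silent if wrong -- the Rumplestiltskin case

Raising the thresholds shrinks the silent top band, but it pushes more callers through the confirmation step they grow to resent.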

Tuning becomes a big problem when dynamic grammars are used, because you can't do any pre-deployment tuning. The app reads a list of phrases from a database, say Little League team names, and generates a grammar. Consider the "Nowell Gnats". It's not unusual for the TTS engine to pronounce a word differently from what the speech rec engine accepts as a pronunciation. This is bad because your prompt may say "you can say Nole Nats..." whereas the speech rec engine wants to hear "Now-well Nats".
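One way to keep the prompt and the grammar honest with each other is to build both from the same rows, and to store an explicit "sounds-like" alternative wherever the TTS and the recognizer are known to disagree. A sketch, assuming a hypothetical teams table with name and sounds_like columns:

    # Sketch: dynamic grammar phrases from a database. Table and column
    # names (teams, name, sounds_like) are assumptions for illustration.
    import sqlite3

    def team_phrases(db_path):
        con = sqlite3.connect(db_path)
        phrases = []
        for name, sounds_like in con.execute("SELECT name, sounds_like FROM teams"):
            phrases.append(name)              # e.g. "Nowell Gnats"
            if sounds_like:
                phrases.append(sounds_like)   # e.g. "Nole Nats", what the TTS will say
        con.close()
        return phrases

    # Prompt and grammar come from the same list, so whatever the prompt
    # tells callers to say is something the recognizer will accept.
    phrases = team_phrases("league.db")
    prompt = "You can say " + ", or ".join(phrases)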

It's easy to end up with a situation where the app won't accept any sensible pronunciation of a word, and even tells callers the wrong pronunciation to use! All without generating any recognition errors. Silent failure.

Can the speech rec platform help? Yes, here are a couple of ideas. First, the app should have a "tuning mode" that can be enabled temporarily. It raises reco thresholds to force more confirmations, and logs all rejected confirmations as candidates for tuning. Second, the system should have a batch process that studies patterns of calls. If the same person calls back several times in a row and never seems to complete a transaction, then those calls are also tuning candidates.
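A sketch of that second, batch idea follows. The call-record fields caller_id, completed and start_time are assumptions about what the platform logs, not any product's actual schema.

    # Sketch: flag callers who keep phoning in but never complete a transaction.
    from collections import defaultdict

    def repeat_failure_candidates(calls, min_attempts=3):
        """Return caller IDs whose last few calls all ended without success."""
        by_caller = defaultdict(list)
        for call in calls:
            by_caller[call["caller_id"]].append(call)
        candidates = []
        for caller, history in by_caller.items():
            history.sort(key=lambda c: c["start_time"])
            recent = history[-min_attempts:]
            if len(recent) >= min_attempts and not any(c["completed"] for c in recent):
                candidates.append(caller)
        return candidates

The calls those IDs made are exactly the recordings a tuner should listen to, since silent Rumplestiltskin failures leave no other trace in the logs.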

Tuesday, August 08, 2006

Bye Bye Speech Server -- Hello SPS

So, Microsoft Speech Server is being folded into Office Communications Server, and will be known as Speech Platform Services. This was announced at today's SpeechTek conference. Office Communications Server (OCS) is basically Live Communications Server, which does presence and SIP call routing.

I have fairly mixed views about this move. First, the negatives. Yet again, Microsoft is making a mid-course correction with its speech offering. At the beginning the vision was multimodal speech. Then it was telephony-based speech rec by web developers (sprinkling "SALT" on their web pages to speech-enable them). Then it was telephony-based speech rec using more standard methods, such as VoiceXML. Now it's speech rec as part of an enterprise information system. For developers actually trying to build applications, all these course corrections are unnerving, to say the least. If the overall goal, as Microsoft says, is to create an "ecosystem", then they need to realize that stability is a key factor. A constantly shifting climate doesn't help.

Now the positives. OCS is a strategic product for Microsoft. Communication is becoming a key part of every organization. Communication across devices. Synchronous and asynchronous communication. Features such as presence, IM, VoIP, video, Find-Me/Follow-Me, and ad-hoc conferencing. Think EMail++. For speech rec to be bundled into such a strategic product is a huge win. It also shows that the speech folks at Microsoft have been listening and learning. Changing course often works out better than not changing course!

Let's hope the transition to OCS is as seamless as possible for existing MSS speech developers.

Nice Clutch Play by Microsoft

Anyone who does speech recognition demos knows how they can go bad. This one by Microsoft's Rob Chambers (YouTube) is about as bad as it could get. After mis-recognizing the presenter's first utterance, the recognizer got even worse at understanding his correction attempts like "select all and delete".

It turns out it was a Longhorn bug in its audio software. Today Rob did the demo again, at SpeechTek no less! Eight minutes long, and it worked fabulously. It was a nice comeback and a gutsy move -- way to go, guys!