Rambling about Speech Recognition

As I mentioned last time around, after getting hired at Edify due to my experience with telephony in general, and ISDN in particular… both of which were probably overestimated in my favor… I was asked if I wanted to work on our new speech recognition integration.

Edify Corporation

This is a recurring theme in my career.  When asked if I want to do something new or take on some project, I just say “yes.”  This has occasionally gotten me into some unenviable positions, but has for the most part worked out for me.

I have volunteered myself both into and out of projects that were disasters by stepping up for the next new thing.  Combined with my ability to ignore pending disasters and focus on what needs to get done today, I have ended up doing some interesting things I probably had no business being involved in.  I have succeeded, depending on your definition of success, as a generalist in an industry that hates the idea of generalists.  It works once you have been hired, but getting hired as one is a tall order.

Anyway, my signing on for speech recognition, a technology I had no knowledge of and no experience with when I jumped in, found me sitting in a conference room at Nuance Communications learning about the architecture and configuration of their speech recognition engine.  We were going to integrate Edify Electronic Workforce with Nuance’s speech recognition engine.

Fun facts.

This was not the first Edify speech recognition integration.  Some work had been done around this with the OS/2 version of EWF in support of a deal with Sears.  That resulted in every Sears store in the US and Canada having an OS/2 EWF box somewhere on the premises.  That box, with an NMS AG/4 or AG/8 analog phone card installed, would answer the main line, then do a loop back transfer to a data center in Golden, Colorado.  We would pass in the store identifier and pipe the speech to the data center, which would recognize the utterance (something like “sporting goods” or “appliances”) and return to us the extension for the appropriate department at that store.  We would then close the loop back transfer, tell the party we were transferring them, do a flash transfer to hand them off to that extension, and be done with things.

This was all because an MBA somewhere at Sears came up with a return on investment model that said this monstrosity… and what else can I call it… would be less expensive over time than having somebody making minimum wage answer the phone and transfer people.  It was also assumed that the new system would be more accurate with its transfers.  This was before my time at the company, so I couldn’t tell you if either metric was met; I only heard about the support calls, as the system was still live when I started in 1998.

The support calls were always with some poor, hapless individual with little or no computer savvy who was trying to figure out why people were coming in and complaining that they were not answering the phone.  Our tech support people would have to walk them through diagnostics which, according to legend, involved one tech asking a store employee to open a window only to be told they were in the basement.  Perhaps apocryphal, but those are the tales that sustain support teams when trying to teach somebody in Topeka how to use a computer over the phone.

Anyway, that was not exactly the sort of tight integration with our product that we were now planning.

Also, the company in whose conference room I was sitting in 1999 is pretty well divorced from the Nuance Communications of today, having been acquired, merged, and eventually ended up as part of Microsoft in 2021.

The Nuance in Menlo Park was at one point purchased by competitor ScanSoft, a Xerox spin-off that was bought by the onetime Ray Kurzweil venture Visioneer.  They took the ScanSoft name, then went off and acquired SpeechWorks, another speech reco vendor, went by both names for a while, and then, on buying Nuance, took over the Nuance company name, it having the best reputation and brand recognition of the lot.  Somewhere along the way they also purchased Lernout & Hauspie, a Belgian speech technology vendor which had purchased Dragon, maker of Dragon NaturallySpeaking, a name you might recognize.  This legacy is all owned by Microsoft now.

I mention this only because all of the entries around these companies on Wikipedia, if not outright wrong, are written from the perspective of a specific time slice, where what they say is true but dramatically and fundamentally incomplete.  Also, I ended up dealing with products from all of these companies, along with a speech recognition engine from a company called BBN (now owned by Raytheon) which was, among other things, used by the Department of Defense (or maybe it was the NSA) to transcribe news broadcasts in real time, foreign and domestic.  (This seemed impressive when somebody told me, then I remembered that this isn’t the tough part of speech reco.  We’ll get to that.)

My job, for a while, was literally managing the integrations with all of these companies even as their existences collapsed and contracted like a probability wave after a bad decision.  I had a developer on my team, a smart and professional guy named Dan, who worked on all these integrations and with whom I had to repeatedly have conversations that began with something like “you know all that work you did on the L&H integration… yeah, ScanSoft bought them and are killing that product and rolling its tech into their product, so you’ll need to update that integration instead of shipping the one you’ve been working on.”

I swear, he spent 18 months working for me, did five major projects, and not one of them saw the light of day because of the ongoing consolidation of the market.  I think the company cancelled the BBN integration just on the assumption that it, too, would somehow be submerged into the growing technological tar ball  that was becoming whatever passed for Nuance at the time.

But that is getting way ahead of myself.  I am still sitting in a conference room in Menlo Park in early 1999 learning about speech recognition.

Nuance offered three services: speech recognition, text to speech, and speaker verification.  I’ll get to the last two in their own posts, as speech recognition was where we went first.

What is speech recognition?  It is the ability of a computer to take spoken voice and convert it into something that has meaning to the computer.  At the time… and my knowledge of the technology ends somewhere in 2003 or so, when I volunteered for some other project, because that is my ongoing MO… the speech engine itself would take what you said and convert it into the component sounds of speech, which it could then assemble into words based on a language dictionary.
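To make that last step a little more concrete, here is a toy sketch of the dictionary-lookup idea: a sequence of phoneme-like units gets matched against a pronunciation lexicon to assemble words.  The lexicon, the phoneme labels, and the greedy matching are all made up for illustration and say nothing about how the actual Nuance engine was implemented.

```python
# Toy illustration of the dictionary-lookup step: the engine has already
# turned audio into a sequence of phoneme-like units, and we match runs of
# those units against a pronunciation lexicon to assemble words.
# Hypothetical data and logic -- not how the actual Nuance engine worked.

PRONUNCIATION_LEXICON = {
    ("y", "eh", "s"): "yes",
    ("n", "ow"): "no",
    ("ow", "k", "ey"): "okay",
}

def phonemes_to_words(phonemes):
    """Greedily match the longest known run of phonemes, left to right."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):  # try the longest run first
            word = PRONUNCIATION_LEXICON.get(tuple(phonemes[i:j]))
            if word:
                words.append(word)
                i = j
                break
        else:
            i += 1  # no match starting here, skip this unit
    return words

print(phonemes_to_words(["uh", "y", "eh", "s"]))  # ['yes']
```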

It helped if you spoke at a consistent pace, stuck to core dictionary words, and kept your voice in a specific frequency range.  As a male in the tenor range, my voice was very recognizable.  As is often the case with technologies put together by a bunch of men in a lab, women often found recognition more difficult to attain.  Somebody once told me that it was because the female voice range moved into the DTMF spectrum, which also had to be recognized, but that sounded more like a theory thrown out as an excuse.

Anyway, speech goes in and words come out.  But words don’t mean anything to your application on their own any more than pressing the pound key on your keypad means anything unless you have set your program to do something based on that input.

In comes the concept of a speech recognition grammar, which is a translation table that converts what words are said into what they should mean for your app… and here the fun begins.

We had a VAR in New York absolutely irate at us early on because, after doing DTMF-based apps for years, they found speech recognition to be extremely unreliable.  If you tell somebody to press one to confirm a choice, that button press and its corresponding tone gets caught correctly 99.99% of the time.

But they were asking people to say “yes” on a speech test app and it was failing a lot.  Well, a lot relative to DTMF.  This is because there is a lot more going on with speech recognition.

First, think of all the ways people say “yes” to something.  If you train your app to just move forward on that exact, single response, what happens to people who say “uh, yes” or “yeah” or “okay” or “yes sir” or half a hundred other variations I’ve heard when we’ve captured utterances in an app to see why we’re getting failures on a specific prompt?

Your grammar has to account for all the possibilities.  If you leave the question open to interpretation, people will respond in all sorts of crazy ways.  There is a whole art to crafting a prompt to get somebody to answer in a specific way.  Our professional services team used to call all sorts of systems and record horribly bad prompts that would lead to errors, just to help educate customers.

And even then you have to account for a lot of possibilities.  If you are doing speech apps as a profession, you probably have a starter grammar that will deal with all the things people say in the middle of their sentences… the “ahs,” “uhs,” “hrmms,” and whatever… around which you will build the grammar to actually return useful responses.  A million ways to affirm something, but your app probably only cares if it is “yes,” “no,” or “operator.”  (Good apps will just send you to a live agent after three requests.  Bad apps, and I remember one that United Airlines had, clips from which were part of the “bad apps” presentation our speech team would give, will hold you in reco hell trying to force you to use the app because somebody’s bonus depends on the percentage of calls handled by automation versus live agents.)
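Boiled down, a grammar like that really is just a translation table with some noise filtering in front of it.  Here is a minimal sketch in Python rather than any vendor’s actual grammar format; the filler list and the phrase lists are made up for illustration, not anything from a real production grammar.

```python
# Minimal sketch of a grammar as a translation table: strip common filler
# words, then map whatever is left onto the few results the app actually
# cares about.  The filler list and phrase lists are illustrative only, not
# a real production grammar or any vendor's grammar format.

FILLERS = {"ah", "uh", "um", "hrmm", "well", "sir", "please"}

GRAMMAR = {
    "yes": "YES", "yeah": "YES", "yep": "YES", "okay": "YES", "sure": "YES",
    "no": "NO", "nope": "NO", "nah": "NO",
    "operator": "OPERATOR", "agent": "OPERATOR", "representative": "OPERATOR",
}

def interpret(utterance):
    """Return 'YES', 'NO', 'OPERATOR', or None if nothing in the grammar matched."""
    words = [w.strip(".,!?") for w in utterance.lower().split()]
    for word in words:
        if word in FILLERS:
            continue
        if word in GRAMMAR:
            return GRAMMAR[word]
    return None

print(interpret("uh, yes sir"))                             # YES
print(interpret("hrmm, nope"))                              # NO
print(interpret("lemme talk to a representative please"))   # OPERATOR
```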

Something else you get with your word response is a confidence score, a percentage value indicating how sure the system is that the response it returned is what the individual actually said.  Our VAR had set up their app to fail on anything less than 95%, that being the margin of error they felt they were willing to accept relative to the reliability of DTMF input.

That wasn’t going to work.  The scoring system isn’t that tight and even my voice, well tuned for the engine, couldn’t yield 95s on demand.  We had to get them to dial that back and then teach them how to use the confidence score for the different levels of confirmation.

If you get a good score, you just go ahead.

If you get a marginal score, you might repeat back, “I heard you say X, is that correct?”

And if you’re out in the weeds with a bad score, you re-prompt with more specific guidance.
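Put together, that tiered handling looks something like the sketch below.  The thresholds are made-up illustrative numbers, not anything Nuance shipped or recommended.

```python
# Sketch of the three-tier handling described above.  The thresholds are
# made-up illustrative numbers, not anything Nuance shipped or recommended.

HIGH_CONFIDENCE = 0.80
LOW_CONFIDENCE = 0.45

def handle_result(meaning, confidence):
    if confidence >= HIGH_CONFIDENCE:
        return ("ACCEPT", meaning)  # good score: just go ahead
    if confidence >= LOW_CONFIDENCE:
        # marginal score: repeat it back and ask for confirmation
        return ("CONFIRM", f"I heard you say {meaning}, is that correct?")
    # bad score: re-prompt with more specific guidance
    return ("REPROMPT", "Please say yes, no, or operator.")

print(handle_result("YES", 0.91))  # ('ACCEPT', 'YES')
print(handle_result("YES", 0.60))  # ('CONFIRM', 'I heard you say YES, is that correct?')
print(handle_result("YES", 0.20))  # ('REPROMPT', 'Please say yes, no, or operator.')
```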

Or maybe you ask the server to bring back a list of results with scores, a feature I recall Nuance named “N-Best” or some such.

We did an app with a major delivery service… there are only two, or three if you count DHL… that allowed a caller to speak their tracking number over the phone in order to get delivery status.  Talk about something prone to errors.

What we did was get back a ranked list of possible responses, then go to the database to check whether each was an active tracking number.  If a candidate wasn’t valid, we discarded it.  When we got to the first valid one, if the confidence score was high, we went with it.  If the confidence score was low, we confirmed by asking whether the delivery was for a particular city or street or something like that.  If we ran out of viable options, we carefully reprompted.
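That flow is simple enough to sketch out.  The threshold, the database lookup, and the data shapes below are all hypothetical stand-ins, just to show the shape of the N-best validation logic.

```python
# Sketch of the N-best flow for the tracking number app: walk the ranked
# list of hypotheses, keep the first one that is an active tracking number,
# and let the confidence score decide whether to confirm or just proceed.
# The threshold, lookup, and data shapes here are all hypothetical.

CONFIRM_THRESHOLD = 0.70

def is_active_tracking_number(candidate, active_numbers):
    """Stand-in for the real database lookup of active tracking numbers."""
    return candidate in active_numbers

def resolve_tracking_number(n_best, active_numbers):
    """n_best is a list of (tracking_number, confidence) pairs, best first."""
    for candidate, confidence in n_best:
        if not is_active_tracking_number(candidate, active_numbers):
            continue  # not a valid number, discard this hypothesis
        if confidence >= CONFIRM_THRESHOLD:
            return ("ACCEPT", candidate)  # high confidence: go with it
        return ("CONFIRM", candidate)     # low confidence: confirm city/street first
    return ("REPROMPT", None)             # ran out of viable options

active = {"1Z999AA10123456784", "1Z999AA10123456785"}
hypotheses = [("1Z999AA10123456790", 0.81), ("1Z999AA10123456784", 0.62)]
print(resolve_tracking_number(hypotheses, active))  # ('CONFIRM', '1Z999AA10123456784')
```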

This, by the way, was all more than 20 years ago.  I was sitting in that conference room in Menlo Park 25 years ago… almost exactly as I recall… and went back to our lab in Santa Clara with the dev I had gone with, CDs, instructions, and license keys in hand, and set up a reco server for us to start working with.

He developed the code to integrate it into the telephone subsystem, and an intern with us for the semester built a grammar manager that was so good the company offered her a job to keep working with us.  She, likely wisely, declined.  Not that we didn’t have a good work environment… this was probably one of the best times in my career… but in hindsight we were doomed.

There was, as noted above, a lot of pushback on speech reco… and a lot of really dumb implementations.  Companies wanted to make the transition on the cheap, so they would try to update their DTMF apps.  I saw more than a couple of such apps where you could “say or press 1” for something, which was a really piss-poor way to go about things.

On the flip side, some companies ran with this and tried to do “say anything” apps early on, which were often worse experiences than the DTMF conversions.

The technology was still a work in progress and the acoustic models behind basic recognition were still being refined.  At that time you couldn’t be in a loud location with background noise or in your car on a cell phone… there was no prohibition on talking on your cell phone while driving back then… and there were still issues with women’s voices.  It used to be a hack to say “fuck” with every other word when using speech reco, because the model would recognize and remove that word, so it effectively put start/stop points around what you were trying to say.

A lot of this was worked out over time.  Cell phones became the key use case.  While it is easy to key things in via DTMF while sitting at your desk, with the cell phone up to your ear it is much better to simply be able to say a response.

By the time the iPhone came out I was no longer working directly on this tech, and by the time the Siri voice assistant was launched in 2011, powered by the company that was called Nuance at that point, Edify had pretty much ceased to be after multiple acquisitions and a massive layoff.

But for a stretch of a few years I was happy in the lab, dabbling with this tech, setting up test cases, writing “how to” guides for our professional services people and VARs, duplicating and writing up issues found in the field, validating updates, writing release notes, and generally not being worried about much else.

Except, of course, Y2K.  We were ALL worried about that.
