Voice Recognition Software System That Works
How to build accurate speech recognition software systems for smartphone applications




For outdoor and mobile use, voice recognition software on its own is not very accurate. But it is still possible to build a smartphone product and supporting system whose outgoing messages are 100% accurate, because the user can review and correct each transcription before it is sent. This article examines that proposition from the perspective of mobile applications.

(Learn more about our flagship product, MyCaption for BlackBerry® smartphone -- one of the best BlackBerry apps for business.)

Achieving Robust Speech Recognition Performance for Smartphone Applications

Smartphone applications are where voice recognition software would be most useful, because typing on small keyboards is a major obstacle to mobile productivity. But making voice recognition software work natively on smartphones is difficult because of background noise, the limited fidelity of microphones, and the limited computing power of mobile devices. However, one can work around these limitations by using four techniques.

  1. Send audio files to servers.
  2. Use data lines, not phone lines.
  3. Leverage speaker dependence.
  4. Implement smartphone centered architecture.

1. Send Audio Files to Servers

Speech recognition software demands a lot of computing power and memory. Smartphones are getting smarter, but they are still not powerful enough to produce accurate results for free-flowing, natural-language dictation such as business-length emails and memos. Sending the audio files to servers is the appropriate division of labor: servers run the voice recognition software and coordinate other processes chosen by the system implementer, while smartphones handle user interaction and audio recording. With this arrangement, system implementers can also go beyond software and add value with processes that improve accuracy (such as human editors) and with proprietary techniques for confidentiality protection. The cost-performance tradeoff depends on the system goals.

2. Use Data Lines, Not Phone Lines

When audio is sent from smartphones to servers over a wireless phone line, it can suffer from dropped calls and from the limited audio quality of a voice channel. In contrast, sending it as a file over data lines (Secure HTTP over TCP/IP) gives reliable delivery: TCP retransmits lost packets, so the file that arrives is complete and uncorrupted. If the user is passing through an area of poor wireless coverage, the audio file is held on the smartphone and the transmission automatically resumes when coverage improves. As a result, the audio that arrives at the server is as clean as it was when recorded on the smartphone. This makes it much easier for speech recognition software at the server to do its job.
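The hold-and-resume behavior described above can be sketched as a small store-and-forward queue. This is a minimal illustration, not the product's actual implementation: the class name `UploadQueue` and the `send` callable are hypothetical, and `send` stands in for whatever HTTPS upload the real client performs. The only assumption is that the transport raises `ConnectionError` when there is no coverage.

```python
from collections import deque
from typing import Callable, Deque, List


class UploadQueue:
    """Holds recorded audio on the device until it can be delivered."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        # `send` is a hypothetical transport (for example an HTTPS POST).
        # It is assumed to raise ConnectionError when coverage is lost.
        self._send = send
        self._pending: Deque[bytes] = deque()

    def enqueue(self, audio: bytes) -> None:
        """Store a finished recording for later delivery."""
        self._pending.append(audio)

    def flush(self) -> int:
        """Try to deliver queued recordings in order; return how many were sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # no coverage: keep the file and try again later
            self._pending.popleft()  # remove only after a successful send
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    coverage = {"ok": False}
    delivered: List[bytes] = []

    def fake_send(audio: bytes) -> None:
        # Simulated link used only for this demonstration.
        if not coverage["ok"]:
            raise ConnectionError("no wireless coverage")
        delivered.append(audio)

    queue = UploadQueue(fake_send)
    queue.enqueue(b"memo-1")
    print(queue.flush(), len(queue))  # 0 1 (file is held on the device)
    coverage["ok"] = True
    print(queue.flush(), len(queue))  # 1 0 (delivered once coverage returns)
```

A recording leaves the queue only after `send` returns without error, so a dropped connection never loses audio. A real client would call `flush` whenever the device reports that the data connection is back.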

3. Leverage Speaker Dependence

Smartphones are personal devices, i.e., we can safely assume that the audio messages recorded on each smartphone will be spoken by the same person every time. This information is very useful to a voice recognition system: over time, the system can build a profile for that user, progressively improving recognition accuracy. In addition, the system can be designed to give you some control over customizing the vocabulary. Phrases specific to your profession can be learned by the system more quickly if you can "feed" those words to it by logging into your account.
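One simple way to picture a per-user profile is as a vocabulary that nudges the recognizer toward words this speaker is known to use. The sketch below is an illustrative assumption, not the method any particular product uses: `SpeakerProfile`, its methods, and the `boost` value are all hypothetical, and the recognizer is assumed to return several candidate transcriptions, each with a confidence score.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SpeakerProfile:
    """Hypothetical per-user profile kept on the server."""

    user_id: str
    # word -> number of times the user has supplied or confirmed it
    vocabulary: Dict[str, int] = field(default_factory=dict)

    def add_terms(self, terms: List[str]) -> None:
        """Words the user 'feeds' to the system through their account."""
        for term in terms:
            self.vocabulary.setdefault(term.lower(), 1)

    def learn_from(self, confirmed_text: str) -> None:
        """Update the profile from a transcript the user has approved."""
        for word in confirmed_text.lower().split():
            self.vocabulary[word] = self.vocabulary.get(word, 0) + 1

    def pick(self, hypotheses: List[Tuple[str, float]], boost: float = 0.1) -> str:
        """Choose among (text, score) candidates, favoring known words."""

        def score(item: Tuple[str, float]) -> float:
            text, base = item
            known = sum(1 for w in text.lower().split() if w in self.vocabulary)
            return base + boost * known

        return max(hypotheses, key=score)[0]


if __name__ == "__main__":
    profile = SpeakerProfile("user-1")
    candidates = [("insert the stint", 0.52), ("insert the stent", 0.50)]
    print(profile.pick(candidates))  # insert the stint
    profile.add_terms(["stent"])
    print(profile.pick(candidates))  # insert the stent
```

Before the user adds "stent", the higher raw score wins. Afterwards the profile tips the choice toward the professional term, which is the effect the paragraph above describes.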

4. Implement Smartphone Centered Architecture

In a smartphone centered architecture, the user's smartphone is at the center of message flow. Audio messages are produced by the smartphone and sent to the server for processing by voice recognition software. When the messages are converted to text (data), they are returned to the smartphone. The smartphone then decides what to do with that data -- for example, to send it as email, or to integrate it into CRM.

[Figure: smartphone centered architecture]

This is an important feature, because a smartphone centered architecture gives you the power to review and edit the text of the message before it is sent, so you can bring it to 100% accuracy. Of course, you have the option to send the transcribed message without reviewing it, but the key point of smartphone centered architecture is that the message makes a "home run": it returns to your smartphone and is sent out from there, whether you choose to edit it or not. If you want all your messages to go out with 100% accurate transcription, you can ensure it by reviewing each one.
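The message flow in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions: `handle_message`, `Draft`, and the three callables are hypothetical names, with `transcribe` standing in for the server round trip, `review` for the user's optional edit on the phone, and `dispatch` for whatever the phone does with the final text (email, CRM, and so on).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    audio: bytes
    text: str = ""
    reviewed: bool = False


def handle_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    review: Optional[Callable[[str], str]],
    dispatch: Callable[[str], None],
) -> Draft:
    """Run one message through the phone-centered flow."""
    draft = Draft(audio=audio)
    # 1. The phone sends audio to the server and gets text back.
    draft.text = transcribe(audio)
    # 2. The user may review and correct the text on the phone.
    if review is not None:
        draft.text = review(draft.text)
        draft.reviewed = True
    # 3. The phone, not the server, sends the final message.
    dispatch(draft.text)
    return draft


if __name__ == "__main__":
    outbox = []
    handle_message(
        b"...",  # placeholder audio bytes
        transcribe=lambda audio: "meet at for pm",  # simulated server result
        review=lambda text: text.replace(" for ", " four "),  # user's correction
        dispatch=outbox.append,
    )
    print(outbox)  # ['meet at four pm']
```

Passing `review=None` models the option of sending without reviewing. In both cases the text is dispatched from the phone, which is the defining property of the architecture.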


While there are limits to how well voice recognition software can perform, there are system-level design decisions that implementers can make to substantially improve the user's experience in terms of robustness and accuracy. This article outlined four such design decisions.