Android, Siri and Tangible Future of Voice Search UX

June 12, 2012

Search is a fundamental mobile activity. Think about it: mobile is much less about creating (unless you are talking about taking pictures or writing an occasional tweet). Instead, mobile devices are used mostly for finding stuff. Riffing on Douglas Adams’ Hitchhiker’s Guide to the Galaxy, mobile helps us find places to eat lunch, people to eat lunch with, and directions there, which helps everyone get there sometime before the Universe ends. Which makes mobile search very important.

And as today’s Apple announcement of a multitude of Siri enhancements loudly proclaims, right now in the mobile search space there is nothing hotter than voice search.

Voice Search

The idea behind voice search is simple: a query spoken into the on-board microphone is used for searching instead of a typed keyword query. Typing on the phone is hard and error-prone, which makes audio input a great alternative to text.

Usually, the customer taps a microphone icon, causing the device to go into “listening mode”. The user speaks the query into the on-board microphone. The device listens for a pause in the audio stream, which it interprets as the end of the query. At this point the audio input is captured and transcribed into a keyword query, which is used to run the search. The transcribed keyword query and search results are shown to the user.
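To make the "listen for the pause" step a bit more concrete, here is a minimal sketch of how an app might decide that the query has ended. It is an illustration under my own assumptions, not how Google or anyone else actually does it: the function name `find_end_of_query`, the per-frame energy values, and the `silence_threshold` and `pause_frames` defaults are all hypothetical.

```python
from typing import Optional, Sequence


def find_end_of_query(
    frame_energies: Sequence[float],
    silence_threshold: float = 0.1,  # hypothetical loudness cut-off
    pause_frames: int = 5,           # hypothetical pause length, in frames
) -> Optional[int]:
    """Return the index of the first frame of the closing pause, or None.

    The query counts as finished once speech has started and is then
    followed by `pause_frames` consecutive frames quieter than
    `silence_threshold`.
    """
    speech_started = False
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Loud frame: the user is (still) talking.
            speech_started = True
            quiet_run = 0
        elif speech_started:
            # Quiet frame after speech: count it toward the pause.
            quiet_run += 1
            if quiet_run == pause_frames:
                return i - pause_frames + 1
    return None  # no pause found yet, keep listening
```

Note that the function returns `None` whenever the audio never drops below the threshold, so a caller relying on it alone would keep listening for as long as the room stays loud.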

One of the most straightforward implementations of the Voice Search pattern is the standard input box for writing text, augmented with a microphone icon, exemplified in Google’s native Android search. Google Voice Search is integrated nicely in Android 4.0:

Not Just for Natives

Pretty much any app that has a search box can also use the voice search pattern. For example, the current release of Yelp (below left) does not include a voice search feature, but can easily be augmented with a microphone icon, as shown in the wireframe on the right:

Yelp is often literally used “on the go” while walking around with a bunch of friends, talking about where to go next. In this case simple voice entry augmentation will make perfect sense: speak the query into the search box (quite a natural behavior as part of the human-to-human conversation currently taking place) and share the results with your friends by showing them your phone. Then, after the group decision has been made, tap “directions” and use the map to navigate to the place of interest.

Most mobile search is done “on the go” and in context, and given how hard text is to enter into a typical mobile phone (and how generally error-prone such text entry is), voice input is an excellent alternative. Another important use for the voice search pattern is multi-tasking activities such as driving. Driving is an ideal activity for voice input, because the environment is fairly quiet (unless you are driving a convertible) and the driver’s attention is focused on a different task, so traditional text entry can be qualified, to put it mildly, as “generally undesirable”.

Siri-ous UX Challenge

The release of Siri for the iPhone 4S has kicked into high gear a long-standing race to create an all-in-one voice-activated virtual assistant. Prior to Siri, Google had long been leading the race with Google Search: the federated search app that searched across the phone’s apps, contacts and the web at large. Vlingo and many other apps took the voice search pattern a step further by offering voice recognition features that allowed the user to send text messages and emails and do other tasks by simply speaking the task into the phone.

However, none of the apps have come close to the importance and popularity of Siri. Why?

There are many reasons, including the mature interactive talk-back feature, which allows voice-driven question-and-answer interactivity, and the amazing ability to handle x-rated and gray-area questions with poise and humor. Siri can even respond to mildly ambiguous voice search queries such as “Where can I hide a body?”

But in my humble opinion, the single most important feature is a dedicated hardware “Siri button” (on iPhone 4S you push and hold the home button to talk to Siri) that allows one-touch interaction with the virtual assistant. It is this one-touch access (and the resulting seamless experience), more than any other feature, that is responsible for Siri’s meteoric rise in popularity.

Although it’s pure speculation at this point, one of the applications of Google’s voice recognition technology could be the same sort of virtual assistant for your phone or tablet, activated by pressing (or holding) one of the hardware buttons (home button would be a good choice). Added security for seamless device unlocking can be achieved via voice-print pattern recognition. Voice recognition technology would also help distinguish your voice patterns from those of other people in loud, crowded places, thereby further increasing the personalization of the device and making it even more completely indispensable (if that is even possible at this point!).

If this becomes the case, dedicated in-app voice search (as in Yelp shown above) can be completely superseded by the Google virtual assistant (let’s call him Giri, just for fun). For example, “Giri: search Yelp for xyz.” The Giri assistant program would then translate the voice query into keywords using advanced personalized voice recognition, open the Yelp app, populate the search box with the keyword query, and execute the search.
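Since Giri is entirely made up, so is the following sketch, but it shows roughly what the step between "transcribed text" and "open the app and fill in the search box" could look like. It assumes the speech has already been transcribed and that the command follows the fixed shape "search APP for QUERY". The names `AppSearch` and `parse_search_command` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AppSearch(NamedTuple):
    app: str    # which app to open, lower-cased
    query: str  # keyword query to put into that app's search box


# Hypothetical command grammar: "search <app> for <query>".
_COMMAND = re.compile(
    r"^\s*search\s+(?P<app>\w+)\s+for\s+(?P<query>.+?)\s*$",
    re.IGNORECASE,
)


def parse_search_command(transcript: str) -> Optional[AppSearch]:
    """Split a transcribed command into target app and keyword query.

    Returns None when the transcript does not match the assumed grammar,
    so the assistant can fall back to something else (a web search, say).
    """
    match = _COMMAND.match(transcript)
    if match is None:
        return None
    return AppSearch(app=match.group("app").lower(), query=match.group("query"))


print(parse_search_command("Search Yelp for thai food"))
```

A real assistant would need a far more forgiving grammar than one regular expression, but even this toy version makes the hand-off clear: the assistant owns the voice part, and the app only ever sees a plain keyword query.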

Giving Tablets a Voice

With respect to voice search, tablets are very different from phones. Typing on a tablet is not quite as challenging as it is on a phone, so voice input is likely to be the more error-prone of the two. The person is also less likely to be multi-tasking in a loud environment or engaged in an activity that requires the user’s attention to be placed outside the visual interface of the device (driving, for example). Does this mean voice search is not useful on the tablet? Not at all. There still exists an opportunity for high-end, high-touch, visual interaction with a virtual assistant software program.

In fact, Steve Jobs’ original vision for a tablet (sorry Google, you were not yet born at that time) involved exactly this kind of interactive voice exchange with the device, otherwise known simply as… conversation.

As I described in my recent virtual seminar, Virtual Assistant pattern, the best way to implement this in a very personalized and high-end fashion might just be to create a hybrid of software plus human virtual assistant. The person using the tablet would get a very high-end service with a consistent, pleasing visual and auditory representation. I have earlier suggested that Neal Stephenson’s Young Lady’s Illustrated Primer be used as a model for this hybrid digital-human assistant.

Given Google’s reputation for awesome inventive geekiness, I can imagine that highly customized Obi-Wan, Jarvis and HAL virtual assistants (as well as various Playboy models, Anime characters, and maybe a little something for the millions of John Norman fans) complete with high-end graphics and voice simulations will be coming soon to an Android tablet near you. Perhaps this article will serve as an inspiration?

Just what do you think you’re doing, Dave?

Voice recognition is still a fairly new technology, and despite the apparent simplicity of the interface, there are many important considerations and ways to get the customer experience wrong:

1) Don’t forget the headset.

Some users of the technology will be on a Bluetooth or wired headset. Ideally, voice search can be activated by using the buttons on the headset, without having to touch the phone. For example, with Apple’s Siri: “When you’re using headphones with a remote and microphone, you can press and hold the center button to talk to Siri. With a Bluetooth headset, press and hold the call button to bring up Siri.” (http://www.apple.com/iphone/features/siri-faq.html) Similar convenience features are conspicuously absent from the Android 4.0 interface, for the simple reason that the headsets from various manufacturers lack consistency in the hardware configuration (in other words, there is no “center button”). As I mentioned above, this needs to change. Convenience is the key for voice search on Android to be a contender.

2) It’s not “Done” till the fat finger sings.

The Android 4.0 implementation of Google’s native search shown above waits for you to stop talking before accepting the voice query. This works most of the time, but can be a serious problem in loud environments, where the interface fails to stop and keeps listening for almost a full minute! Always remember to provide a “Done” button to stop the input. One of the best implementations is to make the microphone icon itself act as a “Done” button. Of course, it must also look “clickable”.

3) Extremely loud and incredibly personal.

In loud environments where other people are talking, it’s hard to parse the user’s voice from the background of other people’s conversations. Fortunately, a voice imprint is as unique as a fingerprint, and with some “training” the device owner’s unique vocal patterns can be parsed out of the background conversations in the crowd. Voice imprints raise a lot of privacy issues that are well beyond the scope of this article.

4) Full-circle audio experience.

The “Driving Mode” paradigm that exists in certain older Android phones is an anti-pattern. There is an excellent reason no one uses the vi editor for coding Java or writing books. As Alan Cooper so eloquently stated in “About Face”, switching modes is tedious and error-prone, not to mention downright dangerous while driving. The only reason “Airplane mode” works is that a nice flight attendant tells us it is time to turn it on. For all other applications, the system simply must make an effort to match the output mode to the user-selected input mode. For example, if Yelp is asked for directions to a museum using voice input, chances are the user is doing it while driving. This means that the output directions should also be available using voice. Ideally, Yelp should be able to read step-by-step driving directions out loud if the user asks for them via a simple voice command like “tell me how to get there” or “give me driving directions”. This completes the 360-degree full-circle voice search experience, and also works great for folks with certain disabilities. Note that as of today, Apple’s Siri has made serious inroads into integrated audio directions, a feature Android sorely needs to compete in the voice search space.
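Going back to point 2 for a moment, here is a rough sketch of the stopping logic I have in mind: the “Done” tap always wins, the detected pause comes second, and a hard time limit catches the loud-room case where the pause never arrives. The names `StopReason` and `stop_reason`, and the 10-second default, are my own assumptions rather than anything taken from Android.

```python
from enum import Enum
from typing import Optional


class StopReason(Enum):
    DONE_TAPPED = "done_tapped"        # user tapped the microphone/Done button
    PAUSE_DETECTED = "pause_detected"  # silence after speech
    TIMED_OUT = "timed_out"            # safety net for noisy rooms


def stop_reason(
    elapsed_seconds: float,
    done_tapped: bool,
    pause_detected: bool,
    max_seconds: float = 10.0,  # hypothetical upper bound on listening time
) -> Optional[StopReason]:
    """Decide whether to stop listening, and why.

    Returns None while the app should keep listening.
    """
    if done_tapped:
        # An explicit user action beats every heuristic.
        return StopReason.DONE_TAPPED
    if pause_detected:
        return StopReason.PAUSE_DETECTED
    if elapsed_seconds >= max_seconds:
        return StopReason.TIMED_OUT
    return None
```

The ordering is the design choice here: whatever the audio heuristics think, the user’s finger gets the final say.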

One Last Thing

In the iPhone Google Search app, the simple action of bringing the phone to your ear forces the app into a listening mode by using input from the on-board accelerometer to recognize this distinctive hand gesture. Unfortunately, this feature does not seem to be automatically enabled on Android 4.0 as of the date of this writing. As I mention in my Mobile Magic virtual seminar, this is, however, an excellent feature, one that should come included with the voice recognition functionality. It makes use of what we already do naturally and without thinking, so the design elegantly “dissolves in behavior”.
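I have no idea how the Google Search app actually recognizes this gesture, but a toy heuristic gives a feel for the idea. The sketch below assumes accelerometer samples of the form (time in seconds, x, y, z) in m/s², with z pointing out of the screen, and treats "screen facing the user, then phone turned on edge shortly after" as a raise-to-ear. The function name `detect_raise_to_ear` and every threshold are hypothetical.

```python
from typing import Optional, Sequence, Tuple

GRAVITY = 9.81  # m/s^2

# (t_seconds, x, y, z), with z assumed to point out of the screen.
Sample = Tuple[float, float, float, float]


def detect_raise_to_ear(
    samples: Sequence[Sample],
    max_gesture_seconds: float = 1.5,  # hypothetical gesture duration limit
) -> bool:
    """Return True if the samples look like a quick raise-to-ear motion.

    Heuristic (an assumption, not a known implementation): the screen
    first faces up toward the user (z carries most of gravity), then the
    phone ends up on edge (z carries little gravity) within a short time.
    """
    last_facing_time: Optional[float] = None
    for t, _x, _y, z in samples:
        if z > 0.5 * GRAVITY:
            # Screen is tilted toward the user's face or lying face up.
            last_facing_time = t
        elif abs(z) < 0.3 * GRAVITY and last_facing_time is not None:
            # Phone is roughly on edge, as it would be held against the ear.
            if t - last_facing_time <= max_gesture_seconds:
                return True
    return False
```

A heuristic this simple would also fire when you slip the phone into a pocket, so treat it as a way to picture the gesture, not as something to ship.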

It is also important to note that the role of voice input is not limited to search. It can be used for data entry and basic tasks as well. For example, while driving you could push the button and simply speak “text XYZ to James” and the device will obey. Today’s Apple announcement showed that this is exactly where Apple intends to go next, with tight integration for posting tweets and Facebook updates via Siri voice commands. (In case you can’t wait for iOS 6, here’s a simple CNET hack that lets you do that via SMS with your current Siri version.)

I should also mention that Google and Apple are not the only suppliers of voice recognition technology. For example, Nuance Communications, the maker of the Dragon NaturallySpeaking products, is likely the largest and most vocal (pun alert!) distributor of speech recognition software. As of this writing, the Target app uses speech recognition technology licensed from Nuance for its excellent voice search.

The potential for voice-driven interaction appears to be nearly limitless. So get out there, and create your very own Voice Recognition UX Rock Opera. And make it loud!
