Why AI needs to learn to understand what we're not saying

May 23, 2018

This article is published in collaboration with The Conversation.

A Lynx robot with Amazon Alexa integration is on display at the Robotics Marketplace at CES in Las Vegas, U.S., January 5, 2017. REUTERS/Rick Wilking

Tech giants are set to create a more 'advanced' next generation of smart speakers so they understand tone of voice. Image: REUTERS/Rick Wilking

Trevor Cox

Voice control gadgets – such as Amazon’s Alexa, Google’s Home or Apple’s Homepod – are becoming increasingly popular, but people should pause for thought about advances in machine learning that could lead to applications understanding different emotions in speech.

The CEO of Google, Sundar Pichai, recently said that 20% of the company’s searches are initiated by voice via mobile phones. And, at the end of 2017, analysis of the US market suggested that a total of 44m Amazon Alexa and Google Home devices had been sold.

The technology has increasingly impressive abilities to recognise words, but – as an expert on acoustics – it is clear to me that verbal communication is far more complex. How things are said can be just as important as the words themselves. When someone says “I’m alright”, the tone of their voice might tell you their mood is the opposite to what they claim.

Voice control gadgets, also known as smart speakers or virtual assistants, can be frustrating to use because they only pay attention to the words, and mostly ignore how speech is expressed. Tech giants hope that the next frontier for devices, such as Amazon Echo, will be to detect how a person is feeling from their voice to make interactions more natural.

Image: Grand View Research

The human voice can give away information about who that person is, where they come from and how they are feeling. When a stranger talks, people immediately pick up on their accent and intonation and make assumptions about their class, background and education.

If voice control gadgets pick up on such information, speech interfaces could be improved. But it’s worth remaining wary of unintended consequences. The technology relies on machine learning – a branch of artificial intelligence that brings together algorithms and statistics learnt by a machine that has been fed reams of data – and so its behaviour is not entirely predictable.

Is the future smart or dumb?

Research shows that speech examples used to train the machine learning application is likely to lead to bias. Such problems with the technology have been evident in popular tools such as Google Translate.

When used, for example, to translate the Turkish phrases “o bir doktor” and “o bir hemşire” into English, Google’s service returns the results “he is a doctor” and “she is a nurse”. But “o” is a gender-neutral third-person pronoun in Turkish. The presumption that a doctor is male and a nurse is female reflects cultural prejudices and the skewed distribution of gender in the medical profession.

Have you read?

Google Translate picked up a human cultural bias that was in the data the algorithms were trained on and the end result is a sexist translation system.

Google’s machine learning translation displays cultural biases. Image: The Conversation/Screengrab

It is not an easy task to solve such problems because machine learning echoes human stereotypes. When humans listen to voices, they simplify the job of working out how to respond to someone by using rules of thumb.

Research shows that when people hear a woman ending lots of her sentences with an upward inflexion, known as uptalk, the typical assumption is that they are young. If a man speaks with a deep voice, there’s an assumption that he is big and strong. Such simplified assumptions about speech can lead to prejudiced judgements.

Criminal suspects with a Birmingham accent, one study found, were more likely to be rated as guilty compared to those with a more neutral accent. Research has also revealed that a non-native accent was perceived to be more untruthful.

Working out whether a person is angry, happy or sad from their speech could be really useful for anyone who uses voice control devices. But the vocal cues people emit vary from person to person, and across languages and cultures. Humans don’t always recognise emotions correctly, as anyone who has ever been in a relationship will testify, so why should it be expected that machines can do a better job?

Research into people’s aural “gaydar” – a colloquial term used by some who claim they can intuitively tell if someone is gay, lesbian or bisexual – offers a good example of ambiguous and even bogus signals. Listeners make assumptions, for example, about how a gay male should sound, such as having a higher pitched voice, but these are often wrong. Actors playing up to incorrect stereotypes in apparent responses to audience expectations became something of a cultural norm on TV screens, research shows.

The individualised, natural ambiguity of vocal signals is likely to lead to mistakes unless technology companies learn from their mishaps. Ingrained prejudices could be learned by applications that attempt to interpret human voices, given that the technology relies so heavily on learning from the data it is fed.

Technology firms developing voice control devices and services may already talk to acoustic experts. But they need to listen closely to the warnings to better understand the pitfalls to avoid, before applying machine learning to decoding the human voice.

Don't miss any update on this topic

Create a free account and access your personalized content collection with our latest publications and analyses.

License and Republishing

World Economic Forum articles may be republished in accordance with the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License, and in accordance with our Terms of Use.

The views expressed in this article are those of the author alone and not the World Economic Forum.