(Inside Science) -- Automatic speech recognition is an important technology for many people. They ask Alexa to play them music or Siri to call their mother. Sometimes the technology doesn't understand the users or provide the answers they want. With some technologies, that's because artificial intelligence just isn't as adaptable and responsive as an actual human. With others, there can be unintended biases within either the data used to train the technology or the software's interpretation of the data. And sometimes, the weaknesses of the technology aren't immediately obvious.
So when computational linguist Emily M. Bender from the University of Washington in Seattle spoke with Inside Science's Chris Gorski earlier this month at a meeting of the American Association for the Advancement of Science in Seattle, there was a lot to talk about. The conversation below, which has been edited and condensed for clarity and brevity, began with introductions and then quickly moved into a pretty meta place. That's where the text below begins.
Emily M. Bender: Once you turned on the recorder, both you and I changed a little bit how we're speaking -- we sort of said, "Now we're doing the interview." ... I might be doing it more than you and you might just be accommodating me, but these are sort of sociolinguistic facts.
Chris Gorski, Inside Science: I'm certainly performing in some way anytime that I'm trying to learn more about what someone does.
Right. You're going to talk to me differently than you're going to talk to your friends, or even when you're talking to me about the subway or your kids because [an interview] is something different. We do that all without thinking about it.
This weekend you were part of a group of researchers discussing the ethical risks of voice technology, which includes technologies like Alexa and Siri. In what ways can these efforts go wrong?
If the sample [used to train the technology] isn't representative of the broad population, then it's not going to work as well for people who are not the ones who are represented. And what tends to happen is the people who speak the language variety that has been anointed the standard are the ones who are best represented. ... So if you design something that works only for people who have the privilege of being raised to speak the standard variety [of a language], and then deploy it in the world without thinking carefully about that, you could end up just exacerbating current inequities in society because life just gets that much harder for someone for whom that is not their own dialect. ... It's not that that variety of English is any harder for machines, it's just it's not the one that the machines have been trained on.
My nightmare is that someone's going to try to embed automatic speech recognition in the 911 response system. Everybody in the community needs to have access to 911 and that needs to be equitable. To my knowledge, no one's done this, it's just something that I worry about. But if you put a computer in the way there, then are you hampering people's access to emergency first responders?
It does happen now with things like trying to change your flights or reach your health insurance company.
I just heard about a trans man who can no longer access his bank account, which is in another country, because, as part of his transition, he's done hormones that have changed his voice. The computer system of the bank is basically saying you're not the same person anymore.
Is this a new frontier that doesn't have laws? Can it be solved from the tech side by being more encouraging of diverse datasets, having people from all kinds of backgrounds? Or do we need laws plus all of the above?
I'm not a lawyer, so I can't really tell if the old frameworks are just inadequate or we haven't learned how to apply the old frameworks. But what's new is something about scale, big data. The amount of information that can be gathered and processed -- that's operating at a scale where you cannot go through and do quality control. The whole point is that it's so big you need a computer and so you can't go in and say, "All right, what kind of garbage data should we avoid using?"
What are some other areas where these issues are appearing?
An example of this is the voice-based interview screening, where you have a computer either listening in on people talking to each other or actually talking to the job candidate. If that job candidate is coming in with a language variety that the computer is not prepared for, then the computer is going to give completely spurious answers. And depending on how it's calibrated it could just be, we didn't understand you, so you don't get to come work here -- further exacerbating marginalization. If you don't have transparency and accountability around that, then you're going to get worse and worse. ... If the laws are protecting against harms, does it matter how the harms are being carried out? Or can we still use the same legal framework to try to prevent or at least provide consequences for those harms? Hopefully that was somewhat coherent.
I'll have to listen to my recording and see what it understands and spits out.
Oh, you have some automated transcription?
I do. Usually I listen to the recording and compare the audio to the text output. Maybe there's a little found poetry thing that happens where it misunderstands something and relays this sort of flighty, wonderful collection of words that fit together in a funny way. But then I can listen and correct it. There's a big difference between what I'm going to do with this interview and a hiring decision or something like what you're talking about.
Many of the mistakes that speech tech makes are of the type that you're talking about where it just completely goes off the rails and you can tell that it got it wrong, because there will be some funny sequence of words that shows up. But there are places where it makes mistakes that are really important to the meaning and harder to notice. Machine translation is notoriously bad with negation [Editor's note: The automated transcription program indicated that the word was not "negation," but "litigation"]. It's the difference between "I did go to the subway" and "I didn't go to the subway." Short little sound in there. They are very close acoustically, and I'm sure you've had the experience of misunderstanding someone or being like, "Did they just say can or can't?"