title: Kathy Reid
status: Auto-transcribed, with minor edits by James Parker

Kathy Reid (00:00:00) - Uh, hi, I’m Kathy Reid. Um, so I have my background in open-source systems. I came to voice from open source rather than from, say, computational linguistics or other disciplines. So I’ve worked for a university for a lot of my professional life, where I was taking emerging technologies like video conferencing and bedding them down and operationalizing them into normal sort of roles, taking them from that edge and that emerging space into an operational space. Um, from there, I went into developer relations at Mycroft AI and, uh, for me that was a natural intersection ’cause my first undergrad was in languages. Um, I majored in Indonesian language and it was this beautiful intersection of, um, being able to take my technical skills and some language skills into a company. And one of the things that I did when I was at Mycroft was set up their Mycroft Translate platform.

Kathy Reid (00:01:00) - One of the challenges that I think we see time and time again with voice assistants is that they’re only assisting certain people, certain cohorts of people. And what we wanted to do at Mycroft was try and open up voice assistants a little bit more to the world. The Translate platform was a crowdsourcing platform, uh, and what it allowed people to do was to translate, uh, voice skill commands into their language. Uh, and it’s one of the necessary things that’s required for voice assistants to be able to support, uh, many different languages. So while we might have voice models for some of the major languages, we don’t necessarily have some of those command translations for many others. So it was a very interesting time. Um, I’ve also worked with Mozilla’s voice team and, uh, I’ve worked on various projects with a lot of their different voice technologies, like Common Voice, which again is a crowdsourcing platform for, uh, recording, uh, voices and utterances in different languages.

Kathy Reid (00:02:07) - It has one of the most language-diverse voice datasets in the world. I work a little bit with DeepSpeech, which is the speech-to-text offering from Mozilla, and occasionally with their TTS system as well. So, uh, that’s a little bit of the through line and, uh, now I’m embarking on a PhD. Um, and what’s the PhD on? It’s related, funnily enough. So I’ve just started a PhD with the 3A Institute, uh, within the College of Engineering and, I should know this, Computer Science at ANU. I’m going to get marks deducted for that, I’m sure. And, uh, at this stage, ’cause we all know how PhDs evolve and how they change over time, um, but at this stage, what I’m really interested in is how voice assistants go to scale, what are the trajectories through which they go to scale, and what opportunities do we have to influence or to, uh, change the course of that evolution, primarily for the benefit of more people. So that’s where I’m at at the moment. I want to see how they go to scale, why they go to scale, and can we shift some of the axes around how they go to scale?

Sean Dockray (00:03:29) - Um, what is the sort of landscape of voice assistants? ’Cause I’m sort of familiar with, obviously, you know, the big ones and Mycroft. Is there more? Are there some other smaller ones?

Kathy Reid (00:03:42) - Uh, so if I cover off the major ones that you probably know at the moment, uh, so you have Amazon with Alexa, uh, which has a significant amount of market share. Um, you have Google with their Google Home. Uh, you will have seen Siri. Uh, in the Chinese market, Baidu is the largest, uh, voice assistant. I’m gonna forget the name. I think ... is the name of Baidu’s voice assistant coming out of China. Uh, in the African market, there’s not really anything at the moment. And, uh, as far as I’m aware, there’s not a lot that’s coming out of South America either. So we have that divide between the global North and the global South. In the open source space, uh, I would characterize that as a lot more fragmented. So with the voice assistants from Amazon and Google, uh, they are quite mature. They have reasonable speech recognition capabilities. They have thousands and thousands of skills, which give them, as an ecosystem, significant utility. The open source space is a lot more fragmented. Mycroft is a key player. We’ve seen, um, Mozilla release their, um, uh, Firefox Voice, uh, web-based voice assistant recently.

Kathy Reid (00:05:03) - Uh, the other open source tools all tend to suffer from one of two problems. One is that they’re a single person trying to develop something which is an ecosystem, and they haven’t quite figured that out. Or you’ll have a group of people who, uh, tend to take the technology, do what they need to do with it, but there’s no impetus to sustain or maintain or evolve that technology. So we’ve seen several, uh, open source projects come and die, like, uh, Julius and PocketSphinx. And the Sphinx project that originated out of CMU, development on that is now dead. One person, Daniel Povey, is primarily doing most of the Kaldi development at the moment. And Kaldi was a flagship, uh, ASR tool for several years, you know, over a decade, but, uh, those projects are now decaying because, uh, we don’t have a way to sustain them. So I think the next question you had for me was, can I explain automatic speech recognition in layman’s terms? Uh, that would be great.

James Parker (00:06:08) - Thanks for doing it. Thanks for doing our job for us. We’re a victim of our own scale, aren’t we? There’s four of us and everyone’s being too polite.

Joel Stern (00:06:17) - I mean, I think I was just taking a moment to sort of take it all in because it was, um, you know, extremely helpful for me to get a sense of the scene outside of the major, um, corporate voice assistants, you know. So that was the pause, but yes, please do explain, um, ASR.

Kathy Reid (00:06:40) - And please do interrupt me, ’cause you can tell I’m quite passionate about this area and, like most nerds who are passionate about an area, I’ll just talk until you make me stop. Not now, Kathy, you sure she can’t be. Um, so there’s, uh, basically two different approaches to automatic speech recognition. We have the more traditional approach and we have a newer approach that’s emerging. In the traditional approach there are two different models, a language model and an acoustic model. And what happens is that when somebody, uh, says a phrase, and the technical term for that phrase is an utterance, what happens is that the language model tries to extract from that recording, uh, the basic building blocks of speech, and we call those phonemes. Um, so they’re the sounds that you might hear in words. Like, “baby” has two phonemes, you know, it has “bay” and “bee”.

Kathy Reid (00:07:37) - Um, you might think of these as syllables. It’s a good sort of point of reference. They’re not quite syllables, but it’s a helpful way to think of them. So the acoustic model, sorry, I’ve got this totally wrong. I’m going to have to start again. I’ve mixed up my acoustic model with my language model. I knew it was going to catch me out, totally. Um, I’m sure you can do some fancy sort of editing tricks, but basically what happens is that when somebody says something, and they say a phrase, the utterance, it’s the job of the acoustic model to match that to the basic building blocks of speech, to phonemes. Then what happens is that once the phonemes are extracted, the language model needs to figure out which of those phonemes goes into which words. And so it’s the job of the language model to reconstruct words from phonemes. And this is why you need different speech recognition models for different languages, because in different languages you have different phonemes. So the phonemes in French are, uh, somewhat different to English and the phonemes in Japanese are quite different to English. Um, and so this is one of the reasons why, uh, voice assistants in different languages is quite a difficult problem, because languages are so different.
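
The two-stage pipeline described here can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the phoneme symbols, the tiny lexicon and the function names (`acoustic_model`, `language_model`, `recognize`) are invented for the example, and lookup tables stand in for what are statistical models in real systems.

```python
# Toy sketch only: real acoustic and language models are statistical models,
# not lookup tables. All names and data below are hypothetical.

# Hypothetical pronunciation lexicon for ONE language: phoneme sequence -> word.
ENGLISH_LEXICON = {
    ("P", "L", "EY"): "play",
    ("S", "T", "AA", "P"): "stop",
    ("M", "Y", "UW", "Z", "IH", "K"): "music",
}


def acoustic_model(utterance):
    """Stand-in for stage 1: map an utterance to phoneme labels.

    A real acoustic model scores audio frames. Here the "audio" is already a
    space-separated string of phoneme labels, so we only split it.
    """
    return utterance.split()


def language_model(phonemes, lexicon):
    """Stand-in for stage 2: rebuild words from phonemes.

    Uses greedy longest-match against the lexicon; phonemes that match no
    entry are emitted as "<unk>".
    """
    longest = max(len(key) for key in lexicon)
    words = []
    i = 0
    while i < len(phonemes):
        for n in range(min(longest, len(phonemes) - i), 0, -1):
            chunk = tuple(phonemes[i:i + n])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i += n
                break
        else:
            # No lexicon entry starts here: mark it and skip one phoneme.
            words.append("<unk>")
            i += 1
    return words


def recognize(utterance, lexicon=ENGLISH_LEXICON):
    """Run both toy stages in order."""
    return language_model(acoustic_model(utterance), lexicon)


print(recognize("P L EY M Y UW Z IH K"))  # ['play', 'music']
```

Passing a lexicon built for another language changes what stage 2 can reconstruct, which is a much simplified picture of why each language needs its own models.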

Joel Stern (00:08:57) - So the acoustic model, um, is common across different languages, but a separate language model is then applied?

Kathy Reid (00:09:04) - No, that’s not true. So you need, I’m sorry for being so direct, you need a different acoustic model per language. Different phonemes are pronounced differently in different languages. So we might have a phoneme called “k” in English, but in Indonesian the “k” sound is a lot softer. So it would be, um, and in Arabic that “k” sound is different again. It’s a much more guttural sound, which is why you need a different acoustic model per language.

James Parker (00:09:33) - And presumably that immediately gets complicated by regional dialect and, uh, and so on. So you sort of have an infinitely expanding number of acoustic and language models, the more widespread you want the technology to become.

Kathy Reid (00:09:54) - Absolutely. So imagine, imagine that you speak English, but you have a heavy Welsh accent. Your acoustic model is going to be incredibly different from somebody who speaks with, um, a very proper British accent. That acoustic model is going to be very, very different. On top of that, you also have slang.

Kathy Reid (00:10:14) - So, not only does the acoustic model have to match the speaker, the language model has to recognize, um, idiosyncrasies and idioms in the language. So, uh, well, I know a couple of us on the voice call, uh, are Australian. Uh, so imagine the sentence, uh, “there’s been a bingle, uh, up at Broadie and the traffic’s chockers back to the servo and I’m going to be late for bevies at Tomo’s.” Now, if you’re Australian, um, or you’re not but you’ve been here a few years ...

James Parker (00:10:49) - I’ve been here 10 years. I’ve still got no idea what you said.

Joel Stern (00:10:53) - I got it.

Kathy Reid (00:10:55) - Points for Joel, he’s going to bevies at Tomo’s. Um, but so you see the issue here that we have with slang. It’s not just the acoustic model, which, um, needs to be cognizant of accents. It’s also the language model as well, and idioms of language.

James Parker (00:11:12) - And I’m wondering about age. I mean, you know, the moment you probe, presumably it gets increasingly complex again. And my son was speaking to, uh, my phone the other day, saying something like, you know, “show me Paw Patrol” or something like that. And I was amazed that it could understand him, and I couldn’t help but think there must have been some specific modeling done for four-year-olds. And that, you know, politically, it’s kind of important for organizations like, you know, Google, Amazon, and so on to be able to get that sort of, you know, young market early, right? Like if I was going to invest in modeling, I’d probably go there, because if you can get people to adapt to voice user interfaces before they can type, um, you’re onto a win.

James Parker (00:12:07) - And I just, I mean, I said Paw Patrol, but, you know, Paw Patrol uses voice interfaces in the show, right? So the dogs are literally like, you know, God, I can’t even think what they do. Like, you know, “helicopter, go” or whatever, and it does it, right? And so I just can’t help but think that there’s a kind of a process of habituation that’s sort of going on very explicitly across cultural, technical, um, domains. And that must be something that the industry is conscious of.

Kathy Reid (00:12:39) - Uh, so yes and no. Uh, I agree and I disagree on that point. So I think you’re absolutely right in that the industry is trying to make voice user interface a default, uh, or make voice user interface, uh, a widely used, uh, interface mechanism, particularly in a time of COVID where touching surfaces is actually a dangerous form of interaction. Um, I don’t know how many of us have been to an ATM and typed in our PIN since COVID. You know, we’re not using cash anymore because it’s tactile, physical. So I think, yes, absolutely, the voice industry is trying to get people used to voice, just in the same way that when the iPhone came out in 2007, there was a process of habituation where we had to acclimate to using a touch screen on a mobile phone. Um, I’m, I’m the generation.

Kathy Reid (00:13:32) - As you can tell, from before iPhones. And so I had to go through a process of, you know, going from my Nokia, you know, 3210 to something that had a touch interface that I wasn’t used to. So you’re seeing that industry is absolutely trying to get people to make voice the default way and to get very used to using voice. Uh, by the same token, this isn’t anything new, right? So socially and culturally, we have an imaginary of being able to speak with machines. So if we go back to Star Trek in the 1960s, um, you know, “Computer, do this,” or, you know, Jean-Luc Picard, you know, “Tea, Earl Grey, hot,” or my personal favorite, Kate Mulgrew, “Coffee, black.” So we have this long cultural history of expecting computers, uh, to listen to us and to do our bidding. You know, we see it with KITT in Knight Rider, we saw it in Time Trax, uh, with SELMA. So we have a very, very long cultural history of expecting to be able to speak with computers, but it’s only now that the technology has been able to deliver to that imaginary, if that makes sense.

Sean Dockray (00:14:49) - Yeah, definitely makes sense. Um, do you think though, um, that there’s, um, an attention to children specifically, like, uh, James was hypothesizing just then? I’m sort of curious, ’cause I was thinking about how, um, also, like, having children not have to ask their parents to enable some piece of technology to do something, um, they don’t have to ask for permission, they can go directly to the, you know. This also, um, well, it saves me time, so then I don’t have to get exasperated, right? So it kind of functions to, you know, allow this kind of, like, working-from-home parents to not be distracted too much. Like, um, it would be quite a clever thing for voice companies to be sort of, like, aiming for in particular. Do you have any sense that, uh, children’s language models are being developed, uh, in particular, or ...

Kathy Reid (00:15:55) - So I really don’t know for sure. It’s not an area that I work in specifically. From a technical point of view, it’s absolutely plausible that voice models trained on children are being deployed. So if we look at, for example, differences between women speaking and men speaking, um, not to cast gender as a binary, and I recognize gender is not binary, if we look at some of those different characteristics, we tend to have different fundamental frequencies when we’re speaking. So men tend to speak at a lower fundamental frequency than women. And so children speak at a, sorry, women tend to speak at a higher fundamental frequency than men. Men tend to operate at a lower vocal range. Kids tend to have a higher range again than women. And so being able to capture voice recordings or samples that have that higher range, then being able to train, uh, speech recognition models on those samples, is absolutely something that I think voice companies are doing in order that children can be heard.
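
To make “fundamental frequency” concrete, here is a minimal sketch that estimates it by finding the autocorrelation peak of a signal. Everything in it is an assumption made for illustration: pure sine tones stand in for voices, the sample rate and search range are arbitrary, and the three pitches are invented values rather than measurements of real speakers. Real speech needs far more robust pitch tracking than this.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def synth_tone(f0_hz, n_samples=400):
    """Generate a pure sine tone as a crude stand-in for a voiced sound."""
    return [math.sin(2 * math.pi * f0_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]


def estimate_f0(samples, fmin_hz=60, fmax_hz=500):
    """Estimate fundamental frequency from the autocorrelation peak.

    Tries every lag between the shortest and longest plausible period and
    keeps the lag at which the signal best matches a shifted copy of itself.
    """
    min_lag = int(SAMPLE_RATE / fmax_hz)
    max_lag = int(SAMPLE_RATE / fmin_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag


# Illustrative pitches only, not measurements of any real speaker.
for label, f0 in [("lower voice", 125), ("higher voice", 200),
                  ("higher again", 320)]:
    print(label, round(estimate_f0(synth_tone(f0))), "Hz")
```

A recognizer trained mostly on samples clustered around the lowest of these ranges has seen little data from the higher ones, which is the gap the speaker describes.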

Kathy Reid (00:17:04) - I know when I was at Mycroft, one of the big issues that we had in training some of their models was that they didn’t respond to and recognize women and children as well. And we found that was primarily because we didn’t have samples of women’s voices and children’s voices in the dataset that we were training from. So I think that’s where that problem starts. But if we tie this to the broader ecosystem view and we think critically about what it is that voice assistants are doing: what is the intent of a voice assistant? What does a voice assistant want? If we think critically about what commercial voice assistants want, they’re funnels to something else, right? So if you’re speaking to Alexa, part of Alexa’s job is to get you to order more stuff from Amazon so that Amazon can take a cut of that sale.

Kathy Reid (00:17:57) - If you’re speaking to Google, part of what the Google Assistant is, uh, trying to do is put advertisers’ products and services in front of you via that interface, so that they can get a cut from their partners who advertise on the platform. So if we think critically about the intent of a voice assistant and how that might intersect with children using that voice assistant, then absolutely, um, these companies will be looking to see how they can, uh, commercialize the use by that cohort. So I think there’s another element here too, with children using voice interfaces.

Kathy Reid (00:18:38) - If I cast my mind back more decades than I care to imagine now, and I think about when I was learning to drive a car. So I learned to drive a manual because automatics were fancy, new and shiny. And so I had to learn how to operate a clutch, and I had to learn, you know, where all the gears were. Now automatic cars are a lot more common. And I suspect that we’ve seen the last generation that will ever need to learn how to drive, because in another generation’s time, autonomous vehicles will have increased in their maturity, and our underlying infrastructure and regulations will have caught up to where the technology is. And so we won’t have to learn how to drive a car anymore. And I think what we’re starting to see in the user interface space, in the HCI space, is a similar evolution.

Kathy Reid (00:19:26) - So, uh, I learned to type on a manual typewriter because that’s how old I am. Uh, and then I shifted to an electric typewriter and, you know, I can type at 130 words a minute and that’s fantastic for working in tech. But if you’re trying to learn to type at the moment and you can talk to your computer instead and have it transcribe faster than you can type, then why wouldn’t you speak to your computer and have it transcribe the words instead? And so I think what we’re starting to see is the inculcation of a different default way of interacting with computers.

Kathy Reid (00:20:01) - Rather than a keyboard and mouse. And so I think the keyboard and mouse has been the default for so long that this is really a seismic shift, and the generation that’s coming through now, that is much more comfortable talking to a voice assistant than they are typing, uh, is going to find voice a much more fluent experience than typing.

Sean Dockray (00:20:20) - Yeah. One interesting thing about that, I think, too, is, um, I mean, I realize that the sort of, like, trained models are becoming small. Like, it’s possible to miniaturize them and to place them on the devices, but even then they require constant updating because, like you said, the language model is always sort of evolving by the day. When the interface is only about, like, touching or, you know, entering in text, that’s something that can happen on the device. But, um, now, as we’ve moved to a kind of voice-enabled interface, it sort of means that you depend on external computational power, you depend on, you know, um, remote servers, you depend on the cloud in order to either deliver the language model to you or to actually do the speech to text.

Sean Dockray (00:21:12) - And so that dependency on the remote. Yeah. Can I ask a variation on, uh, that exact theme? Um, you know, because I was also thinking, as you were speaking then, you know, that a keyboard isn’t owned, uh, and sort of corporatized in the same way that a voice user interface is. Now, obviously, you know, you’ve been working in the open source space, but in order to interact with Siri, I’m interacting with, you know, an enormous corporation, one of the biggest in the history of the world. So it’s not, you know, just a sort of infrastructural dependency, but a corporate dependency as well. And so it’s interesting, you know, you said before about, you know, what voice assistants want, and one of the things that many of them want is, you know, dependency on corporate infrastructure specifically. So, to me, Sean’s question is related to that question of kind of corporate path dependency and interface design.

Kathy Reid (00:22:21) - Absolutely. So I think there are two threads to this discussion, no pun intended. I think the first thread here is, um, sort of exploring what that dependency is and the dichotomy between a cloud-enabled voice assistant and what affordances that has versus the affordances of something that is offline or embedded. Um, very happy to speak to that. And then I think the second part of that question is really getting to the type of relationship that a voice assistant user has, which is no longer with a device, but with an ecosystem. And what are some of the emergent properties of that changing relationship? Um, so first let me cover off on the cloud versus the embedded, ’cause I think that’s a very, uh, prescient, uh, dichotomy at the moment. Previously, the key reason why we couldn’t have, uh, the functionality of a voice assistant in something that was offline and not connected to the internet was that those devices generally lacked the computational power to be able to do speech recognition, intent parsing, natural language parsing, and then to generate voice. What we’re now seeing, as hardware accelerates and improves, uh, and also the intersection of that with advances in machine learning algorithms and the ability to condense our machine learning models onto embedded devices, is that it is now possible to have comparable, uh, algorithms: comparable speech recognition, comparable language parsing, comparable speech generation on commodity-class hardware. So as an example, DeepSpeech, which is Mozilla’s speech recognition technology, can run at real time. So that means it can sort of keep up with what people are saying. It can run at real time on Raspberry Pi 4 hardware. Uh, and we haven’t seen anything near that in the last couple of years. So embedded hardware is getting better. Algorithms are getting better.
So we’re removing some of those technical barriers to having voice assistants work in an offline capacity. That doesn’t solve the problem of updates, and it doesn’t solve the problem of the services and, uh, skills that those voice assistants connect to. You still need to be online to be able to access their services and skills, but it means that your speech and your utterances can stay on device.
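
“Running at real time” is often expressed as a real-time factor: the time taken to process the audio divided by the length of the audio, with anything at or below 1.0 keeping up with the speaker. The sketch below assumes that definition; the function names and the timings are hypothetical and are not benchmarks of DeepSpeech or of any Raspberry Pi.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Ratio of compute time to audio length; <= 1.0 means it keeps up."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up(processing_seconds, audio_seconds):
    """True when recognition finishes no later than the audio itself."""
    return real_time_factor(processing_seconds, audio_seconds) <= 1.0


# Hypothetical timings for a 5-second utterance on two made-up devices.
for device, seconds_taken in [("device A", 4.0), ("device B", 7.5)]:
    verdict = "real time" if keeps_up(seconds_taken, 5.0) else "falls behind"
    print(device, real_time_factor(seconds_taken, 5.0), verdict)
```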

Kathy Reid (00:24:59) - Of course, that doesn’t serve the purpose of some voice assistant companies and ecosystems, right? Because it’s that voice data, that utterance data, the commands that you’re giving to your voice assistant, that are actually, uh,

Kathy Reid (00:25:12) - the source of wealth or the source of revenue for, for people. That’s the data that they use to train their models with. So that’s another line of dependency between, uh, voice assistant users and the ecosystem that produces voice assistants. So that’s a little bit about the dichotomy between sort of cloud and the embedded. If I look to the relationship between users and voice assistants, and we see how that relationship is changing compared to other, uh, interfaces, I think that’s really, really interesting. So one of the approaches to market, or one of the plays to market, for voice assistant companies is not as a voice assistant, but as an interface to the rest of their walled garden or the rest of their ecosystem. So for example, you might use your voice assistant to turn off the lights or to turn on the lights, or you might use your voice assistant to play music or stop music.

Kathy Reid (00:26:15) - Um, what voice assistant companies want to do is not just own that voice assistant experience, but own, uh, that entire smart home, that entire connected experience. And then if we go outside to the garage, where we might have our semi-autonomous car, you can see how the voice assistant paradigm is also extending not just inside the home, but out of the home, into the car. And then it’s not too much of an imaginary to imagine that when you get to work, if we ever go back to the office again, you might also have a corporate assistant that assists you at work in your work context. Voice companies want to own as much of that experience as possible, because the more they can own, the more services they can sell you, the more, uh, other things they can get you to buy. Voice assistants themselves have very little, uh, revenue generation capability other than as entry points or as funnels for other systems and services. And so that’s changing the relationship between the voice user and the ecosystem that they’re interacting with.

Sean Dockray (00:27:31) - Yeah. So maybe following up on that, can you talk a little bit about the Mycroft ecosystem then? Just because, obviously, Mycroft isn’t just an alternative voice assistant and alternative device. It’s obviously seen as an entry point into some alternative ecosystem.

Kathy Reid (00:27:49) - So the ecosystem that Mycroft is the entry point for is, again, an open source ecosystem. So people write, um, skills and make them available through the Mycroft, uh, skill page in a very, very similar way that people write Alexa skills, or in a similar way to how, uh, they write Google skills. The key difference is that that’s not monetized or incentivized at all. And I think that’s actually a benefit and a problem. So for example, there’s no financial benefit to a skill developer in writing a skill for the Mycroft platform. They get no revenue from it. Um, they need to maintain and sustain that skill. Whereas if a, uh, well, let’s take the example of Spotify. You can only access the Spotify API, which is required to create a skill, if you have a premium account with Spotify. So if you use the free account, you can’t actually use Spotify with your, uh, voice assistant, because you need to have a premium account to be able to do that. So that’s one way in which the service or the skill that the voice assistant links to is trying to drive revenue from the voice assistant, but the voice assistant itself gets no revenue from that. The voice assistant ecosystem gets no revenue. What we also see is a pattern that’s common to many other open source endeavors as well. And that’s that open source is generally created to scratch an itch. Somebody creates something because it solves a single-point problem for them. Somebody else uses it, modifies it, um, you know, gives it back to the community.

Kathy Reid (00:29:43) - And so what we see is a very different development of skills in the Mycroft ecosystem. So we have skills, for example, for diagnosing, uh, wireless and network problems. That was one of the first skills that went into the Mycroft skill store there, the aircraft skill, um, because the people who are creating Mycroft skills are also network administrators and system administrators. And so they use this to diagnose their networks. So that’s a very different pattern as well. So I think if we think about how skills are incentivized, that sort of cracks open this problem a little bit as well. At the moment, there aren’t very many incentives for skill developers, unless they’re employed by a company that gets a revenue stream from having a voice assistant skill.

Kathy Reid (00:30:33) - If we think about this and compare it to the mobile phone app market, then we don’t have the sort of market economics in play at the moment. And one of the dangers is that if we follow the mobile app store sort of paradigm, you have developers there who put a huge amount of time and effort into developing apps. Studies have shown that people tend not to buy apps. So people are quite, um, you know, they might buy a thousand-dollar mobile phone, but they’re really stingy about actually buying apps to put on the phone. And so trying to generate revenue as a mobile app developer is very difficult, unless that mobile app is actually an entry to a sort of platform-as-a-service or software-as-a-service offering. And I think we’re starting to see the same thing play out in the voice space, especially ’cause I think, uh, it’s Apple that takes 30% of the revenue from each, uh, app store sale. So Apple takes a cut from the app and the app developer is left struggling to try and find a revenue stream.
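
As a small worked example of that cut, the sketch below splits a single sale using the 30% figure mentioned above. The function name `split_sale` and the $4.99 price are hypothetical, and amounts are kept in whole cents so the arithmetic stays exact.

```python
def split_sale(price_cents, platform_cut_percent=30):
    """Split one sale between platform and developer, in whole cents.

    The platform share is rounded down; the developer gets the remainder.
    """
    platform = price_cents * platform_cut_percent // 100
    developer = price_cents - platform
    return platform, developer


# Hypothetical $4.99 app sold once.
platform, developer = split_sale(499)
print(f"platform: {platform} cents, developer: {developer} cents")
```

Since, as the speaker notes, few people pay for apps at all, the developer’s share applies to a small number of sales to begin with.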

Joel Stern (00:31:42) - I mean, this takes us really neatly into the, um, next question, uh, I suppose, which is around the politics of open source voice assistants. And, you know, um, having so eloquently described the kind of, um, incentivization in the commercial sphere for developers, what motivates, um, the production of an open source voice assistant? Or, you know, to put it another way, what do you think are the sort of political imperatives at work, um, in an endeavor like that?

Kathy Reid (00:32:18) - So I think what I might do here is talk a little bit about the challenges of an open-source voice assistant and then sort of relate that back to some of the politics that are in play that open source might be able to, uh, have some influence on. Uh, so thinking about the challenges of open source voice, there’s many of them. Uh, some of them are common to open-source endeavors across the world, like sustainability, maintainability. How do you derive revenue in order to fund some of the infrastructure that open source is working on? We’ve talked a little bit about, um, skill stores and the lack of incentive for voice skills. Uh, but the trick there, though, is that the utility of a voice assistant is really a function of how many skills it’s able to do, except that all the user studies are showing that, uh, when people use voice assistants, they only use a very small handful of skills like weather or time or setting a timer, those sorts of things.

Kathy Reid (00:33:27) - That’s a very, very small number of skills. And so even if you start to increase the number of skills that a voice assistant can do, until you get people discovering those skills and using them, uh, you’re not actually increasing the utility. So it’s not just a pure numbers game of how many skills can be developed. You actually have to get people using the skills, um, and actually getting them deriving some utility from those skills. So that’s one of the challenges of open source, you know, not just having the skills available, but how do we show people what skills are available, that discovery layer. And I think that’s common to most voice assistants, not just open source ones. One of the other challenges that we have in open source, um, and keep in mind that open source’s sort of sell to the market is about privacy and, uh, not using your data for commercial gain, you know, not impugning your privacy, is that people don’t care about privacy.

Kathy Reid (00:34:27) - People are not willing to pay to have their privacy protected, or only a small proportion of people are. So we’re, we’re all too willing to give up our personal information and our privacy for access to a service like, um, you know, different social media services. So as a society, we don’t value privacy. And, uh, I’ve already mentioned that sustainability piece, that the ability to derive a revenue stream from open source voice is very, very difficult, which in turn makes the sustainability and maintainability very difficult as well. The revenue opportunities just as a voice assistant aren’t there until you connect it to an ecosystem like an e-commerce platform or an advertising platform. Linking that back through to the question of politics, there are huge politics in voice assistants. Firstly, I think, uh, wherever there is politics, there is power or lack of power.

Kathy Reid (00:35:29) - And I think that plays out in voice assistants as well. So the first question that comes to mind here is which voices are recognized. So there’s a very famous study which was done recently, uh, by Allison Koenecke. I’m going to have to double check her name, but basically her study found that voice assistants, commercial voice assistants, are much less likely to recognize African-American speakers. And there’s been some work done on that to try and find out why. But at the core of the issue is that voice assistant language models are not trained on people who speak with African-American vernacular. They’re trained on people who have white speech. And so there’s all this sort of power tied up in voice assistants, where you have a history of inequity, a history of marginalization. And this again is manifesting itself in a voice assistant. We see politics of gender playing out in voice assistants.

Kathy Reid (00:36:30) - So most of your voice assistants have a female voice. So if you think of a voice assistant as a machine that is ready to listen to your commands and do your bidding, here we have the manifestation of yet another thing in society that expects women to be subservient and to take orders and to do the bidding of others. So, uh, it again casts women in a subservient, service-oriented role. Uh, I’d love to see a voice assistant that is very authoritative, uh, that has a man’s voice instead. So that I can give a command to a voice assistant that sounds like a man, and it says in a very authoritative tone, thank you, I’ll get to that immediately. I’d love to have a voice assistant like that. So there are politics of gender that play out here as well.

Kathy Reid (00:37:22) - And we see this in conversational design as well. If you interact with the voice assistant and you’re rude or you’re abusive, then the way that the voice assistant handles that from a conversational design perspective has politics as well. If you have a voice assistant that is gendered to be female and the voice assistant, uh, deals with that sort of, uh, dialogue in a way that’s very subservient and very passive, what message does that send to the user? You know, does it normalize patterns of behavior that we’re trying to de-normalize and problematize in society? Uh, so yes, I think there’s a huge amount of different politics tied up in voice assistants.

James Parker (00:38:06) - And do you think that open source is particularly well suited to, uh, confronting or dealing with those problems? Or is that, uh, is that a problem that skates across all the different forms of voice assistant?

Kathy Reid (00:38:23) - It’s a great question. Uh, and again, I’d have to answer with yes and no. So from the perspective of technology, where open source is available and you can alter and modify and bend the open source code and hardware to your will, yes, it does very much lend itself to challenging some of the orthodoxies, some of the established patterns. I’d also have to answer no, because less than 10% of the open-source community are women. And so you have a predominantly male community building open source, excuse me, building open source software and building open source voice assistants. And as humans, we tend to build things like we are. And so I think the lack of diversity in open source communities, not just along gender lines, but along, uh, racial diversity as well, is also a problem for open source, because we don’t have that diversity to draw from, to build from. So yes and no.

James Parker (00:39:28) - As a sort of follow-up to that, you know, so some of the things that you’ve been saying about, uh, open source or the political problems that you’ve mentioned, uh, are basically, uh, almost like problems of access or the completion of the project of voice. So you said, you know, well, the problem is that Africa doesn’t have voice assistants in various African languages, or South America. And another problem is that, um, African-Americans also can’t access their voice assistants. And so I’m immediately thinking, you know, all of that’s true. And of course, if there are going to be voice assistants, you should have access to them. But on the other hand, there’s a politics, a very serious politics to the project of, cause there’s a kind of, uh, colonial or expansionist dimension to, you know, to always treating the problem as, um, we need more of the thing that we started off with, you know, there’s something about that, that.

James Parker (00:40:35) - I mean, maybe it’s just because I kind of, I’m just like a little bit, I have a bit of an aversion to voice assistants somehow, that if the problem is always there not being enough of it, um, then that is, it strikes me, there might be other political, maybe. I mean, maybe another way of, sorry to interrupt, James. I was just, um, Kathy, when you said before that, you know, the political question, um, you know, is who gets to be heard or who, you know, is audible to a voice assistant. And one of the questions we’ve been sort of exploring is, um, who is allowed, um, not to be heard, you know, who has the option to evade, um, the capture of these always-on listening devices, which are increasingly pervasive. And, um, obviously there’s a lot of people, um, in our communities, you know, who, um, would feel threatened and insecure, um, by having a sort of, um, device that captures the sonic environment and what they’re saying, and where, um, they might feel they have sort of limited control over how that information is used.

Joel Stern (00:41:54) - So I suppose, you know, this, um, political question is both about, you know, access to these devices and the benefits, but also, um, protection from them, um, in some sense. Just to offer a further extension, you know, um, it’s not just voice. Voice assistants don’t just listen to voices. You know, so increasingly audio event detection, audio scene analysis, and whatever are all being integrated in as well. And sometimes by means of the, kind of the Trojan horse of voice. So in other words, to understand your voice, we need to better understand the sonic environment. And so they kind of, they get sort of entangled up in each other and, again, it just seems like the kind of the horizon is always more, you know, we must listen to more, capture more. First it’s the voice, then it’s the context, then it’s, you know, a kind of total ambient sensing. So if you’re, you know, um, let’s say an activist involved in a campaign where you feel sort of, um, very insecure or sort of threatened by, um, not just the authorities, but, you know, the major, um, industrial partners who work with them, I suppose we’re sort of thinking about what are both the positive and negative horizons of these technologies with that kind of thing in mind.

Kathy Reid (00:43:25) - So, oh, wow. A huge can of worms here. So I think, let me tackle the ambience and the sonic environment one first, and then let me talk a little bit about how do I protect myself from being heard, um, you know, how do we treat silence as a value as well? How do we treat privacy and silent voices as valuable as well? So if we think about learning more about the sonic and acoustic environment, that might be noise, uh, doors slamming, what is the traffic outside? If we look at more-than-human design principles, can I hear birds outside? Can I hear a dog barking? What is the level of traffic outside? What are the industrial noises? We can start to get that context. Where that starts to get incredibly scary is not at a point in time, but where we’re able to gather that data to give us a picture over a temporal context.

Kathy Reid (00:44:24) - So how does that change week by week or month by month or year by year or season by season? So if we think about scale along that index line, then we start to think about what other data could that be combined with. So we have a huge smart city and a huge open data movement at the moment in a lot of cities in order to better control things like waste and smart parking and those sorts of things. How do those data sets interact? And what are the affordances from those either sort of collisions or collaborations? And I don’t think we’ve done a lot of thinking about what that might look like or what imaginaries might come out of that, except what we do know is that there’s going to be an intent behind those. So if we look at the intent of smart parking, you know, it’s so that people can find a car park wherever they want. It’s not so that we reduce our dependence on cars. If we look at things like industrial noise, is the intent there to regulate, or is the intent there to do something else altogether?

Kathy Reid (00:45:34) - So I think we need to be really careful about the collisions and the collaborations of those datasets. Uh, I don’t know of anything specifically in the literature that looks at how voice assistant sound data can be combined with other forms of data, um, to infer much more about a person than what we already know. But what we do know is that the ecosystems to which these voice assistants are entry points, the portals, know a huge amount about us from our web browsing activity, from our phone activity, from our location activity. Um, and so it’s not inconceivable that voice and acoustic data would be used to augment things like geolocation data. So for example, you don’t just know the time I was at a park because you have my GPS and you have a timestamp. You also get to know, were the ducks quacking in the park, were the other dogs feeding, was the wind howling. And so we start to add these different layers of context onto existing data sets as well. And that might be incredibly scary.

Joel Stern (00:46:45) - Yeah, we already have. Yeah. Yeah. I was just going to invite you to say what, what, what scares you about that?

Kathy Reid (00:46:53) - So I’m scared about voice assistants in many different ways. Uh, so there’s the ability now for law enforcement to subpoena people in the United States to have their voice assistant recordings used in legal proceedings. You don’t consent to that when you get a voice assistant that might be marked down or might be bundled with another product, and then suddenly you have a microphone in your house. And we’re not just talking about a microphone in your lounge room or your kitchen or your bathroom. If we think about context, and I go back to Jonathan Habers classification of context of spaces, uh, public, uh, shared, private and intimate, we’re starting to see voice assistants move along that spectrum. So they’re not just in lounge rooms or in kitchens or in bathrooms, which is private space. We now have voice assistants at our bedside table, in one of the most intimate places in the home.

Kathy Reid (00:47:54) - And what might it record there? How could that be used against us in ways that we haven’t imagined yet? By the same token, it might protect us. It might enable us to, uh, have our truths told and to have those truths believed in a way, uh, that people and witnesses often aren’t. So I can see a dichotomy there as well. Uh, the other things that scare me incredibly about voice assistants, and we go back to, uh, the political angle here. If we start to think about acoustic models and language models, and we start to think about people who have accents, and if those accents are able to be detected, are voice assistants another mechanism to do racial profiling and over-policing of marginalized groups? Absolutely. So if you think about Portland and the protests that are going on at the moment, one of the ways that police could possibly target protestors is to try and determine whether the voice assistant that you’re speaking to can detect.

Kathy Reid (00:49:01) - Whether you speak with African-American vernacular. That’s incredibly scary. Or we don’t have to look that far away, right? We have an Indigenous population with a lifespan that is still 20 years less than white people in Australia. What if we did racial profiling of Aboriginal Australians? Because it’s not like that hasn’t been done before. So these are some incredibly, incredibly scary things that voice assistant technology might do, and regulators aren’t anywhere to be seen. Right? And I think there are some major problems there. This technology is difficult to get your head around from a technical perspective. It’s difficult to get your head around from an affordances, benefits and risk perspective. And it’s also difficult to see what trajectories or through lines the technology might have. Used in certain ways, it might be very beneficial to certain cohorts of the population, but it might also have a devastating impact on other cohorts. And I don’t think regulators and lawmakers are really grappling with those issues yet.

James Parker (00:50:09) - Which is something that I’ve been grappling with recently. I also find smart assistants or voice assistants quite scary. One of the reasons I find it scary is because of the, kind of the sort of completionist, um, dynamic or sort of impulse that seems to be sort of implicit in it, you know, more, more, we want more, more sound, more analytics, and the way that that’s tied to a specific political paradigm or kind of, uh, capitalist paradigm of data extraction. And so to me, like, it’s not so much, um, is my privacy at stake in terms of, you know, will somebody hear something specific, but just the idea that the data collection, data extraction imperative is so strong for these companies that we’re going to continue to feed the beast, basically, the kind of data colonization of our entire auditory worlds. So I find that really scary. The trouble is that it’s quite hard to specify that.

James Parker (00:51:19) - And it’s interesting that when you started talking about, you know, the context of policing, things immediately got speculative. And one of the problems I find is that I can’t point to a specific example right now of, you know, the use of, um, you know, Machine Listening in over-policing. And, you know, it feels like the pushback in relation to facial recognition is strong because it’s being used in that way. And it’s a real challenge because if we let it get that far, we’ve already lost. So I don’t know how to, I don’t know, like rhetorically or politically or legally how to confront that problem, because it’s easy to sound like a paranoid maniac, you know, that, you know, imagine it could be used for this. To me, it seems empirically true. It’s obviously going to be used. I can point to, you know, papers where scientists and engineers are literally developing the applications right now. I see it specifically in relation to COVID actually, you know, scientists saying, well, you know, you could use it in the enforcement of social distancing in this way and the other, and here’s how you would do it. But there’s a kind of a speculative mode at the moment about addressing the politics of voice and Machine Listening that I find challenging.

James Parker (00:52:34) - Um, so I, I don’t know if that’s a comment or a question actually.

Kathy Reid (00:52:39) - Um, I think it’s, uh, a provocation, it’s a great provocation. So I think we are now with Machine Listening where we were with facial recognition five to 10 years ago. Just because you wear a tinfoil hat doesn’t mean you’re wrong. So if we think about it, these exact same arguments were being made with facial recognition and image recognition in artificial intelligence five years ago, and in that intervening time, those technologies have been productized and platformized and now weaponized against cohorts of the population. And I think the point that you make is, uh, how can we get the regulators’ attention and the attention of consumer rights people and the attention of lawmakers now, before the technology is already in place and we’re trying to put Pandora back in the box again. Uh, and I think you’re absolutely right. We are not thinking critically about some of the speculative, and what I would call quasi-speculative, because we know that technically it’s possible, which means that for me, that puts it firmly outside the boundary of purely speculative. So I think that the problem is how do we get the attention now? And I think part of that bigger problem is that we don’t have a history in legislation or consumer rights of thinking preemptively, of thinking speculatively into the future about what are the principles and ethical axes that we’re going to need in order to harness the benefits of emerging technology without exposing particularly vulnerable populations to the damages and harms that it can cause. And I think that actually requires a fundamental shift in technology regulation, because we always seem to regulate after the technology has been built. We don’t seem to proactively regulate as the technology is about to hit the market.

Joel Stern (00:54:46) - Uh, I just want to, um, remind you of, um, something that you sort of started saying, um, at the outset of the answer, about valuing silence and valuing, you know, not being heard. And it’d be great to hear you just expand on that a little bit.

Kathy Reid (00:55:07) - So I think one of the benefits of having a voice assistant not recognize your language is the same benefit that we’ve seen throughout history and throughout culture, where language has been a signifier or a marker of an exclusive group that perhaps doesn’t welcome outsiders or people who are not part of that group as easily. So we see this with Romani people, who speak a dialect that’s often not the language of the country in which they’re living. We saw this in World War Two with Navajo speakers, who used Navajo language that was unintelligible to the opposing forces that were listening in on those conversations. So we’ve seen language as cryptography. One of the benefits of having a language that a voice assistant doesn’t speak is that it can’t hear you. You can’t be recorded. The downside of that silence might be the death of your language though.

Kathy Reid (00:56:03) - So if we think about, for example, the 700 or so Aboriginal languages in Australia that are still active, voice assistant technology might actually be one way to, uh, help preserve, uh, and help those languages, those sleeping languages, come to life and be reanimated and reborn again. At the moment there are about 7,100 languages in the world, but 23 of those languages account for 50% of the world’s population. And we’re going to see languages decline, uh, as speakers age, and, uh, and die out. Could voice assistants actually be used for language preservation? And that for me is the flip side of that silence. So I think, uh, absolutely people do have a right to silence as they have a right to privacy, particularly in their own homes. But I think we also need to be aware that if a voice assistant can’t hear you and doesn’t understand your language, then that’s one less mechanism we have for preserving some of those languages. And because language is a marker of culture and history, it’s a way of preserving culture and history as well. So I think, uh, yes, we all have a right to silence, but silence in a voice assistant can also have other consequences as well.

Sean Dockray (00:57:28) - Thank you. Yeah, you just gave a great example for the question I was about to ask, which I will ask now. I was thinking that it’s so important to do that work of paranoid speculation, you know, that you were talking about with James, um, you know, especially imagining the kind of, like, imminent threat of the proliferation of these voice assistants. Uh, and I’d wanted you to kind of do the opposite work, the opposite kind of speculation, which you kind of hinted at. You said there are all these possible, um, you know, uses, and it’s such a shame that this technology is being sort of misused and abused and so likely to travel down the road of kind of like surveillance and, you know, mass marketing and all that kind of stuff, because it has all this other potential. And I was just wondering, yeah, what do you have in mind? Like, have you seen through Mycroft, like, some, you know, really interesting possibilities? And I think that that language preservation example is a fantastic example, but, um, I’m just wondering if you could even continue, uh, just sort of thinking about that, if you have the ability to stay.

Kathy Reid (00:58:42) - Yep. And again, a great provocation. What does, uh, what does a voice intent? What does a voice assistant with good and pure intent look like? What does a voice assistant for all of humanity look like? I think there’s some incredibly beautiful provocations

Joel Stern (00:58:57) - We’ve been calling it socialist Siri.

Kathy Reid (00:59:00) - I like that, too. Um, so many options. At the risk of boring you and sending you to sleep, cause I know we’ve gone a little bit over time, so if you’re all happy to stay with me and listen to, um, the nerd rant. Um, so one of my favorite books is a book by Neal Stephenson that’s called The Diamond Age: Or, A Young Lady’s Illustrated Primer, and the voice assistant in that book isn’t actually a voice assistant. It’s a real human, and it guides the book’s protagonist, Nell, and teaches her leadership skills and helps her grow, uh, from a very poor socioeconomic, uh, cohort that really has no options into the leader of her community at the end of the book. And that for me is a role that I see voice assistants being able to play. Could voice assistants be mentors, be guides, be leaders?

Kathy Reid (00:59:54) - Particularly in a socially distanced way. Now, we’re protecting and isolating a lot of our older and more vulnerable people for very, very good reason, but it means the mentorship and the leadership and the coaching roles that a lot of these people play are not able to be played as much with younger people. So could we have a voice assistant as coach, as mentor, as guide, as a Young Lady’s Illustrated Primer? I think that would be lovely. I also like the idea of the subversive Siri, um, the Siri that questions the status quo, that calls bullshit on fake news, that provides access to alternative news, alternative content, alternative paradigms, alternative ways of thinking. Instead of a voice assistant that got you to buy things from the store that pays the most money to that voice assistant’s ecosystem, what about a voice assistant that let you uncover, uh, the unknowns, the gems in your local area?

Kathy Reid (01:00:58) - You know, teaching you that the best place for burritos is actually this tiny little corner around the street, or that the best coffee in Brunswick doesn’t actually have a webpage because that’s Brunswick. You know, remember the hipster map of Melbourne that was drawn, you know, about five or six years ago, where they plotted all the hipster locations of Melbourne on a map? I’m from Geelong, so coffee’s about as hipster as I get. I’m sorry. Um, but imagine a voice assistant that could do that, or a voice assistant that, uh, upheld the ideological values of the community or the house or the space in which it was placed. You know, um, imagine a voice assistant that was supportive, you know, uh, in various social situations that Siri or Google or Alexa completely back away from. You know, imagine trying to have a conversation with your voice assistant about how to navigate a polyamorous relationship. That’s something that, uh, your commercial voice assistants sidestep, but there are interventions and there are niches here that I think a voice assistant could actually play.

James Parker (01:02:08) - I’m really interested that you didn’t mention, and I’m not suggesting that it’s not part of your thinking, but disability applications. Um, you know, as I said, I’m a little bit sort of, uh, um, sort of instinctively attuned against, um, some voice assistants. And then I just keep thinking to myself, but there’s one context where, you know, the word assistant just really rings true, and that’s, you know, blind communities or people with severe speech impairments. I was reading about a new company that was, um, trying to develop, um, voice assistants for speech. Um, so people with severe sort of, um, uh, physical impairments of their voice box, you know, the sort of escalation of the throat and so on and so on. I just, I can never go past them as, uh, as something that seems really, really valuable.

James Parker (01:03:13) - Um, but then immediately my sort of creepy radar kind of turns on, because then, you know, I’ve read about things like ambient assisted living in elder care, and the way in which it seems like elder care sort of seems to be a kind of a laboratory almost for kind of smart home and sort of extreme surveillance techniques, including Machine Listening and so on, you know, again, using ability and disability as a kind of a framework for doing that. And so, you know, it’s not to say that there’s some sort of pure space of voice assistants in disability or, uh, elder contexts, where it’s kind of unimaginable that there would be dark political consequences, but anyway, yeah.

Kathy Reid (01:04:04) - So taking that as a provocation, you know, what are the affordances, or what are the trajectories, of voice assistants with people who have disabilities? Again, that’s not an area that I know a lot about, so I don’t feel particularly comfortable speaking about that area specifically, simply because I don’t have a lot of the background. It’s not my specialist area. I’m going to have to sort of say, no, I don’t know about that area. Okay. Sorry.

James Parker (01:04:33) - No, no, no, no worries. No worries. It’s admirable to not go and wildly speculate, like I would have a tendency to. I mean, I think we’ve run you through the wringer. That’s right. I think I was just going to say that, um, what you’ve given has been really generous and has covered a lot of ground and.

Joel Stern (01:04:59) - It’s been, it’s been really valuable for us. So, you know, thank you so much for that.

Sean Dockray (01:05:06) - Your last answer on the not positive speculation, so many possibilities,

Joel Stern (01:05:13) - Because I wasn’t sort of thinking about, it’s like a really, I mean, one thing, um, not to ask you another question, but just to sort of add to that, it’s, um, one of the things that the, the festival that’s hosting us in, in Poland has continually asked of us is to sort of, um, survey the, the positive and negative imaginaries, you know, so it’s, it’s really valuable to kind of start to collect those in these interviews. Um, and I imagine that over the course of a few interviews, there’ll be some wildly different imaginaries that work for different people and what they think the future holds for this technology.

Kathy Reid (01:05:58) - And I guess my parting thought is that those imaginaries, they’re not mutually exclusive. You know, the same voice assistant that can help me order my groceries might also be telling the government that I’m in a minority population and assisting in over-policing. So I think that that’s something that we also need to grapple with. None of this is black and white. A lot of this is overlapping and complementary, and there are collisions. It’s not a mutually exclusive, cut and dried area.

Joel Stern (01:06:28) - That’s a great note to end on. I mean, actually one of the questions that we didn’t ask is, um, who else should we speak to? One of the things we do kind of want to do is develop a bit of a network of interesting. I mean, I don’t mean just for the purposes of this interview of interviewing them, but just developing a kind of a community around those sort of these kinds of political questions.

Kathy Reid (01:06:49) - I have a list prepared. Um, so what I might do, uh, cause I know that Sean has to go, what I might do is pop that in a list and sort of, um, let you know what their context is. So, um, I know that Mozilla, I think, would be very interested in this, because Mozilla’s huge on privacy and huge on sort of, uh, chipping away at some of those established paradigms. Um, the person who took over from me when I left Mycroft is an Indigenous language person. Uh, and he will have a political perspective as well. He’s based in Darwin, so his context is again different to mine.

James Parker (01:07:27) - Is he Indigenous?

Kathy Reid (01:07:28) - Uh, no, he’s not. Um, I wish we had more Indigenous people in voice technology because it means we might be building it differently. Um, but it’s like every other sort of participation in, um, every other field as well, uh, unfortunately. Uh, the folks at OVAL at Stanford who are working on an open-source voice assistant there, I think they would be interesting to talk to, because I think they are trying to partner with commercial providers and they’re at a very nascent, emerging, embryonic stage, and their through line is going to be influenced by who they can partner with commercially. So is there a way to intervene at that sort of early stage? And then at ANU, I think, uh, if you don’t already know him, Ben Swift, um, I hear from the creative computing people where he’s, uh, just had a baby boy, and Nick ... now Nick is I think one of the deputy deans of computer science and engineering at ANU, and the reason I think he would have, uh, an interesting perspective on this is he’s a materials engineer by training, uh, but he’s a part-time geeky DJ. Uh, so he comes into this from, um, the music background as well and as a materials engineer. So I think he would be fascinating to talk to as well.

James Parker (01:08:56) - Amazing. So, I mean, if you would send us that list, that would be fantastic. And you know, we’ll just keep the conversation going, I guess,