Jùnchéng Billy Lì (auto-transcribed)

James Parker (00:00:00) - Thanks so much for speaking to us, Billy. Would you maybe introduce yourself and say a little about the kind of work you do, how you ended up doing that work, where you do it, and so on?

Jùnchéng Billy Lì (00:00:10) - Absolutely. I am currently a PhD student at Carnegie Mellon University, in the Language Technologies Institute of the School of Computer Science. Our department's main research focus is NLP along with speech and audio analysis, which is also what I focus on: my major research goal is audio and machine learning, basically. Previously, from 2015 until 2019, I was a research engineer working for the Bosch Research and Technology Center in their AI department, in a joint collaboration between Bosch and CMU. I worked for them in a very similar domain, on exactly Machine Listening and sound event recognition, for the use cases Bosch had around industrialization of, say, machine health monitoring, and also a "smart ear" project, where the idea was for Bosch to potentially launch a device that can listen for ambient sound events.

Jùnchéng Billy Lì (00:01:35) - I don't know exactly where that is going, but I left Bosch in 2019. The adversarial robustness aspect started in 2019, where we were generally interested in exploring whether, as I said, the current machine learning techniques for sound event recognition and for automatic speech recognition are actually robust in a real-world setting, where we are not free from the influence of ambient noise. Ambient noise is what we cared about in our work. We really tried to answer this sort of question: can we demonstrate that the current machine learning models are actually robust against ambient noise in a real room setting? Can we demonstrate a specific case where they break down in certain scenarios, or are they at best actually robust to any type of perturbation?

Jùnchéng Billy Lì (00:02:45) - What we found is that, with a certain decibel level of sound, the artifact that we generated using the techniques presented in our paper, which is basically the projected gradient descent method, we can actually replay this piece of audio and trick the machine learning classifier into thinking that the wake word, in our case, does not exist in a real room setting. So, to put it in a word, it shows that the current state-of-the-art machine learning models that we deploy in smart speakers or smart devices are not robust against a certain type of curated noise aimed at breaking these models. That's what our research shows as an early-stage finding. How to improve it is still an open question.

Jùnchéng Billy Lì (00:03:48) - We can train a model to let these devices remember: these are the adversarial audios, these are the potential noises that break us, so that the device doesn't get tricked again. But that certainly doesn't guarantee these models are robust against other types of noise, where we can change the threat model. Basically, as an attacker, you have full information about the model, or at least you are more flexible than the defenders. The defenders can only defend against known attacks, which are actually very predictable, but as an attacker you have the freedom to do many different things that the defender is not necessarily able to foresee, so they are less flexible in defending. When it comes to specific security-sensitive applications, that could be real trouble.

Sean Dockray (00:04:48) - Could you talk a little bit more about what some of those threats might be? Just to give some context as to why people might be testing, I mean, why people might be attacking, some of these devices or some of these machine learning models.

Jùnchéng Billy Lì (00:05:07) - Yes. So, the motivation of our work: now I can speak as purely a CMU student, because I left Bosch. From a research standpoint, the motivation is pretty strong. As we can observe in the general tech world, Google, Amazon, and Apple all have voice assistants storing a bunch of your own credentials, which is a huge liability: your personal information sitting behind sound-enabled devices. Specifically, for example, with Amazon you can order goods through Alexa with only a couple of audio commands, basically saying, "Alexa, can you order this and put it in the cart," and these things will be directly linked with your credit card info or your personal account. If you have an annoying child at home, for example, they will order a bunch of random stuff, and you don't expect these things to happen.

Jùnchéng Billy Lì (00:06:24) - These are the extreme cases, but in general, audio monitoring is happening more and more. If you look into it, we have these monitors almost everywhere. It's kind of ubiquitous: cell phones, smart speakers, Nest webcams, home security doorbells. All these things have sound recording, and they have certainly created a huge research corpus for people like me, where we can collect a huge amount of personal audio sources to train our fancy models on, to tell whether these are events of interest, so to say. For example, for audio event detection: if you have environmental recordings from a doorbell for, say, a month or three months, you can basically tell what the ambient sound is going to be like, how it is going to sound when someone knocks on the door, or approaches the garage, or walks across the floor. All these things have their own signature.

Jùnchéng Billy Lì (00:07:45) - And of course everyone's speech has its own signature. If your model can pick up recordings of somebody for long enough, you can train a model to totally mimic that person with speech synthesis. The general rule of thumb right now is: if you have a hundred hours of clean data of someone, I can pretty much use Tacotron or WaveNet to regenerate your sound and basically fake your voice for a phone call, and you will not be able to tell that I am actually a fake person, as long as no video shows up and you don't get another modality of input to verify. So in my opinion, these things are very safety-critical, and the public, as you have observed, has not actually looked into this, because audio has not yet become such a hype as autonomous driving. These things are less conspicuous, but it is actually scary in a way, because it is ubiquitous. The recordings are happening.

Jùnchéng Billy Lì (00:09:00) - They're all around us, but we're not very aware that they are actually listening to us. For example, in Alexa's case, I think they have an algorithm recording constantly so that they can wait for the wake word; the wake word detection model is actually listening all the time. I am not sure whether this data gets sent to the cloud of Alexa or Google Home constantly, but I'm sure they have definitely opened a kind of back door to record for a period of time to collect this data. Otherwise they wouldn't have enough training data to train the wake word model, which would be very troubling for their application. From a legal perspective, I'm not an expert, but.

Jùnchéng Billy Lì (00:09:52) - I'm sure that, whatever they claim, they need to collect training data to improve their models, so there has got to be some way for them to acquire this data. Basically, it's up to the public to really look into the terms and conditions of using these smart devices, which I haven't looked into. But from a technical aspect, I'm pretty sure they have to acquire at least thousands of hours of good-quality audio recordings in order to train a model of satisfactory performance. So, to sum up, I think these things are very sensitive right now, and not many people have actually put their attention onto these sound-related issues. I'm really glad that you have caught onto them, and we can talk about them if you are interested.

Joel Stern (00:10:55) - Yeah, absolutely. I think you've covered many of the things that we're specifically interested in, which is great, and it's also great to hear them from the perspective of a researcher working on the technical side of these questions. I think we want to move into talking more specifically about adversarial audio in a minute, but I just wanted to ask: when you were saying that, with WaveNet and other applications, there's the potential to reproduce a human voice in a way that is indistinguishable from that person's voice, is that indistinguishable to a human, or indistinguishable to other machine listeners? That question continually comes up, about the difference between the way these sounds are heard by humans as opposed to by machine listeners.

Jùnchéng Billy Lì (00:12:00) - I can tell you my experience of WaveNet. If you have, say, 100 hours of recordings of you in your daily life, then I can pretty much guarantee that, for a simple sentence, you would not be able to tell a WaveNet-generated sound from your real human voice. There are only very subtle differences between the two, because WaveNet is driven by the same techniques as audio event recognition, where they break down each element of the recording, or of the generation, into phones, phonemes specifically. So basically whatever you pronounce is broken into phonemes; in Alexa's case, the word has six phonemes. All six phonemes get picked up differently, and the machine learning model is able to tell, or in WaveNet's case regenerate, exactly how you pronounce these things.

Jùnchéng Billy Lì (00:13:20) - It might sound mechanical if you don't have enough training data, but given enough data it can compute a language model. In our case, a language model stands for a probabilistic model predicting what the next phoneme will be; it captures the likelihood of the next utterance of your sound. So it's kind of intuitive for the public to understand: basically, if you capture enough of your daily words, your daily active vocabulary, then it will be able to regenerate whatever you say, as long as you don't exert very distinguishable emotions or an exaggerated way of speaking. So it's kind of dangerous: if someone gets access to hundreds of hours of your speaking data stored on their cloud, and they really do have malicious intent, then they can really replicate this. If you are interested, I can point you to the WaveNet demonstrations on their official website, and also to some of the experiments that my colleagues and I have run, so you can try to see if.

Jùnchéng Billy Lì (00:14:55) - you can actually tell the difference between the human voices and the WaveNet-generated ones.
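As a rough illustration of the "probabilistic model of the next phoneme" idea Billy describes, here is a toy bigram phoneme model in Python. The phoneme symbols and the tiny corpus are invented for illustration (ARPAbet-style labels, not any vendor's actual lexicon), and real systems use far larger corpora and neural models rather than bigram counts.

```python
from collections import Counter, defaultdict

# Toy corpus: each word as a sequence of phonemes (hypothetical ARPAbet-style symbols).
corpus = [
    ["AH", "L", "EH", "K", "S", "AH"],  # "Alexa"
    ["AH", "L", "OW"],                  # "aloe"
    ["L", "EH", "T"],                   # "let"
]

# Count bigram transitions: how often phoneme b follows phoneme a.
transitions = defaultdict(Counter)
for word in corpus:
    for a, b in zip(word, word[1:]):
        transitions[a][b] += 1

def next_phoneme_probs(phoneme):
    """Probability distribution over the next phoneme, given the current one."""
    counts = transitions[phoneme]
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

# After "L", "EH" occurs twice in the corpus and "OW" once.
print(next_phoneme_probs("L"))
```

A speech synthesizer uses this kind of likelihood (in practice from a neural model, not counts) to decide how the next unit of sound should be generated.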

James Parker (00:15:01) - That would be great. I mean, I've spent some time with WaveNet, the official website, but you never know exactly how curated that is, like the famous example where Google demonstrates booking a haircut live. Obviously that's sort of amazing in the video, but you never quite know how curated it is, right? So it would be fascinating to listen to an independent researcher's attempt.

Jùnchéng Billy Lì (00:15:36) - I can definitely look into several related links that you can play around with, to see the real effect of WaveNet, and of Tacotron, if you are familiar with that paper; those are basically the techniques behind speech synthesis at Google. Many people are doing similar things to regenerate a piece of audio in that way. From the audio-feature perspective, they really capture the raw features of your sound and regenerate a bunch of similar features of your speech, and the source features get perfectly matched with the target generated features. So you can't really tell these things are actually different, because human sound perception is built around the cochlear nerves, and sound recognition systems are built to mimic that with filter banks: more sensitive to low-frequency sound, less sensitive to high-frequency sound. These features get perfectly captured by the synthesizer, so you can actually get tricked when classifying whether a sound is machine-generated or human-generated.
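The filter-bank idea mentioned here, more resolution at low frequencies and less at high, can be sketched in plain NumPy. This is the standard mel-scale triangular filter-bank construction, as a minimal sketch; it is not the specific front end of any commercial device.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: compresses high frequencies, expands low ones,
    # roughly matching the ear's frequency resolution.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=10, n_fft=512, sr=16000):
    # Filter centers are spaced evenly on the mel scale, hence unevenly in Hz:
    # dense at low frequencies, sparse at high frequencies.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope of the triangle
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope of the triangle
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

bank = mel_filter_bank()
# Low-frequency filters cover fewer FFT bins than high-frequency ones,
# i.e. the model "listens" more finely where human speech lives.
widths = (bank > 0).sum(axis=1)
print(widths)
```

A synthesizer that matches a voice in this feature space sounds right to both the detector and, largely, to the ear, which is Billy's point about why the fakes are hard to tell apart.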

Joel Stern (00:17:10) - I'm glad you mentioned trickery, because I think it would be good if you could give a general introduction to adversarial audio: what it means, what it does, how it works. Because when I look at the demonstrations and the examples online, it is a form of trickery. It's tricking the machine, it's tricking the human, but of course underlying that trickery is complex computation and modeling. Would you maybe be able to give a broad definition of what adversarial

Jùnchéng Billy Lì (00:17:54) - audio is and how it works? Absolutely. In a nutshell, adversarial audio is basically trying to be adversarial against a machine learning model rather than against human beings. We are aiming to trick a machine learning model deployed on smart speakers, a piece of computer software that is able to recognize certain sounds, as they claim. In our case that is Alexa, the Echo devices, which claim to be able to listen for the keyword "Alexa" all the time and respond to your spoken commands. That model is our target of interest for attack. The way we attack it is basically to try to understand how that model works first. We looked into the publications of the Alexa team, where over the years they revealed certain technical parameters of their models and exposed to the public how they train the model and what kind of dataset they train on,

Jùnchéng Billy Lì (00:19:13) - and what kind of techniques, specifically the architecture of the neural network models, they have used to train these models to tell whether the wake word "Alexa" exists in an ambient environment or not. So that's the model we're attacking. The first thing we do is emulate the model. We build a very similar model using the publicly available pipelines for machine learning, and we collected 2,000 samples of the wake word through our own data acquisition pipeline: we went out to collect this data ourselves, having a bunch of friends record "Alexa," basically. So we have about 5,000 active cases, and we synthesized the rest as inactive cases for the ambient environment classes, so that the model fires on those as negative cases. This is the model that we emulated as a fake Alexa, our own version of Alexa. Basically we

Jùnchéng Billy Lì (00:20:23) - we don't want to use the word "reverse engineer," but we actually did reverse engineer Alexa to some degree. The key thing we want this fake, emulated model to do is to really expose the gradients, the technical parameters, of the real Alexa model. We expect the gradients of our emulated model to look very similar to the real Alexa model's, because the behavior of the models is the same: they fire as true when the wake word "Alexa" is present, and they fire as false when there is no "Alexa." Then we can attack our own emulated model as a white-box attack, instead of running totally blind in a black-box attack, where we don't get to observe what the model is doing along the way. So, assuming we have a reasonably good white-box model for the Amazon Alexa, we can really look into its propagation of gradients.

Jùnchéng Billy Lì (00:21:34) - Pretty much all current state-of-the-art machine learning models, if they are deep neural network models, are trained by the same technique of backpropagation, which means propagating the errors of each layer of the neural net back from the last layer toward the surface layer, to collect the total error you get from one data sample. You have a target signal for the neural net and let it converge to the target you want, through the proper loss function that you define. What we do is: we don't want this loss function to actually decrease. Normally, when we train a neural net, we want the loss function to decrease, because we define a target.

Jùnchéng Billy Lì (00:22:36) - We want the model to converge to the minimum loss, where we have the minimum error, in this case the error in detecting whether the true "Alexa" word actually exists or not. Whenever there is an "Alexa" in the ambient sound, the model fires, because our loss is very low and the model has very little error in deciding whether it is actually there. But in our case, we're generating the adversarial audio, which acts as a delta added to the original input X. We add a piece of noise from our own generated data, and what we're training is actually this adversarial delta, with the same shape, the same type of features, as the original sound, to confuse the model into taking in X plus delta. This delta is our adversarial noise, and it gets overlapped with the original sound. What we're training this X plus delta toward is actually maximizing the original loss. We don't want the original loss to decrease; we don't want the error to be small. We want it to go the other direction, to amplify the error. So in the presence of our delta, your model will not be able to decrease its error anymore; the error will only go up, because we have curated a bunch of data, trained over many iterations, to make sure that your original loss is not going down but actually going up.
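The loss-maximizing delta Billy describes can be sketched with a toy differentiable model. The sign-of-gradient ascent step with clipping to a small budget is the projected-gradient-descent style update run in reverse (maximizing rather than minimizing the loss); the logistic model here is a hypothetical stand-in for the wake-word network, not the actual Alexa model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a wake-word detector: logistic regression on a feature vector.
# (The real detector is a deep network; the gradient logic is the same.)
w = rng.normal(size=16)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y=1.0):
    # Cross-entropy loss for the "wake word present" label y = 1.
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_wrt_input(x, y=1.0):
    # d(loss)/dx for logistic regression: (p - y) * w
    p = sigmoid(w @ x + b)
    return (p - y) * w

# A clean input X that the model classifies confidently (loss is low).
x = 3.0 * w / np.linalg.norm(w)

eps, step = 0.5, 0.1
delta = np.zeros_like(x)
for _ in range(50):
    # Ascend the loss gradient, then project delta back into the eps-ball,
    # so the perturbation stays small while the error is amplified.
    delta += step * np.sign(grad_wrt_input(x + delta))
    delta = np.clip(delta, -eps, eps)

# The model's loss on X + delta is larger than on the clean X.
print(loss(x), loss(x + delta))
```

In the real attack the "input" is audio features rather than an abstract vector, and the delta is additionally constrained to survive playback in a room, which is covered later in the conversation.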

Jùnchéng Billy Lì (00:24:34) - So with the presence of our trained adversarial audio mixed into your original data input, the model is actually doing the opposite of what it originally did. The error of your original model will not be small; it will be very big. The model will be confused, and it will misclassify a bunch of the targets that it was supposed to get right on clean data. What we're doing is basically jamming the model to make it do the wrong thing, because X plus delta is not the original X that the model recognizes, or remembers how to do the right thing on.

Jùnchéng Billy Lì (00:25:23) - We are confusing the model: by feeding it input with our noise present, we give the model something it is not good at, and it will output the wrong classes; it will always misfire. In Alexa's case, we feed the Alexa an X plus delta. We cannot control the X, because that's the ambient sound we are in, but we can always control the delta. The delta is our adversarial noise. When Alexa listens to the ambient world, it necessarily picks up X plus delta, because our delta gets overlapped with the ambient sound environment you are in. So X plus delta is Alexa's input, and it will get confused because we put our tricks in the delta term. Alexa will misclassify whatever it was supposed to classify, which is the wake word in its case. It will not be able to recognize that the wake word is active: when there is a wake word plus our delta, the wake word is gone. Alexa will stay silent; the light will not blink anymore. That was a long story, and sorry about my verboseness, but I just want to make sure the entire logic gets across and you are able to follow what I was saying.

Sean Dockray (00:26:54) - Thanks. That was really thorough and really helpful to listen to. Just a quick follow-up, to clarify: when you say the ambient environment, I just want to picture the process you're going through. You're not sending the data directly into the model with the perturbations on it; you're talking about playing audio into a room, right? So the Alexa is sitting in the room, and this is something that works in real space?

Jùnchéng Billy Lì (00:27:24) - Yes, yes. That is also one of the tricks, and one reason our paper got accepted: these are the technical novelties. I mean, the general idea was also novel, so it was actually an early work that raised a bit of the community's awareness. When you process audio, we are familiar with all these room impulse responses, the transformations a sound goes through when it gets recorded by a microphone. That goes to the same question I think Joel asked me through email, whether this would work over Zoom. I'm not really sure, because every sound recording device is doing a sort of transformation of the original signal. The rule of thumb is that it always does an FFT, which stands for fast Fourier transform, and breaks down your signal, stratifying it into different frequencies.
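The frequency "stratification" the FFT performs can be seen in a few lines of NumPy: a low tone in the speech band and a high tone end up in separate, identifiable bins. The sample rate and tone frequencies here are arbitrary illustration values.

```python
import numpy as np

sr = 16000                      # sample rate in Hz (illustrative choice)
t = np.arange(sr) / sr          # one second of sample times

# A signal mixing a low "speech-band" tone with a weaker high-frequency tone.
signal = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)

# The FFT stratifies the signal into frequency bins, as described above.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two strongest bins land exactly at the two tone frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))  # [300.0, 6000.0]
```

Once the signal is split into bins like this, a pipeline can weight the low-frequency bins more heavily than the high ones, which is the point Billy makes next.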

Jùnchéng Billy Lì (00:28:34) - You can keep more information in the lower frequencies, where human speech is, and somewhat neglect the higher frequencies, where there is more high-frequency noise. All of these things get taken care of in the algorithm space. In our case, we wanted to trick Alexa without really hacking its hardware, so we really have to do the same transforms as Alexa does. Basically, this delta, as I said, is already taken care of by the transformation functions we define as the sound transforms. They handle the room distortion, the echo distortion, and also the ambient angle distortion you get from a real room. So when the digital microphone picks up the piece of audio that we generate, it is able to do the exact things we tell it to do. Through the optimization process, we train on the network, on a computer, as if this delta were generated in a physical space, so that the delta, the noise that we train, is already trained with the consideration of being in a real physical space, so that this delta, when played out,

Jùnchéng Billy Lì (00:30:20) - is the real sound that can actually play in the ambient environment without getting distorted any further, because it already got distorted in the training procedure, where we anticipated what would happen when a sound is played in an open environment, in my living room in the real case. All these things are already taken care of by the bunch of transformation functions we define. So when we actually play the noise we generated, it does the exact same thing as in our simulation. The Alexa will pick up whatever delta we expect it to get, and that delta is adversarial to its original prediction target, which tricks its original algorithm. That's how these things are done.
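The idea of baking the room's distortion into training can be sketched as follows. The impulse response here is synthetic and the transform set is deliberately simplified to a single convolution; the paper's actual transformation functions model real rooms, echo, and angle, as Billy describes.

```python
import numpy as np

rng = np.random.default_rng(1)

def synthetic_rir(length=800, decay=0.995):
    # A crude stand-in for a measured room impulse response: a direct-path
    # spike followed by exponentially decaying random reflections.
    rir = rng.normal(size=length) * decay ** np.arange(length)
    rir[0] = 1.0
    return rir

def room_transform(delta, rir):
    # What the room "does" to the played-back perturbation: convolution with
    # the impulse response (echo + reverberation), truncated to input length.
    return np.convolve(delta, rir)[: len(delta)]

# During optimization, the perturbation would be scored only after passing
# through room transforms like this, so that the delta the microphone finally
# hears matches the delta the attack was trained to produce.
delta = 0.01 * rng.normal(size=16000)   # one second of candidate noise
distorted = room_transform(delta, synthetic_rir())

print(distorted.shape)  # same length as the original delta
```

Optimizing through the transform (rather than on the clean delta) is what lets the played-back noise "arrive pre-distorted," matching the simulation once the room has done its work.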

James Parker (00:31:16) - Thanks so much, Billy. Can I just ask a very basic follow-up question, because you've talked a number of times about risk and security and so on. Could you say, in a line, what it is you're doing this for? When you develop an adversarial audio system, what is the purpose? Is it about commercializing it, selling it to Google, alerting the public, a thought experiment, just the technical challenge? Could you say very briefly what the motivation is behind developing an adversarial system like this?

Jùnchéng Billy Lì (00:32:00) - Yeah, absolutely. Honestly, I was doing it because, as I said, I was familiar with this research domain, and smart audio devices are really picking up in awareness. So I was doing it more for fun. I was thinking of a specific scenario: say NPR plays my audio. It can go both ways, because now we can nullify the piece of audio with the Alexa wake word in it, so the device will not be able to hear the wake word; but in the other direction, if I really curate my sound, it will trick the Alexa into waking up without you speaking to it. A piece of, say, highly curated guitar music would be able to wake up Alexa involuntarily, even if you don't say "Alexa"; our adversarial audio can take care of that.

Jùnchéng Billy Lì (00:32:56) - So the original motivation for me was thinking: what if NPR plays this piece of music, and then, geez, hundreds of thousands of households in the US are playing NPR in their living rooms with an Alexa there, and all of these devices wake up all of a sudden. It was kind of fun to think about that type of scenario, and that is absolutely the sort of thing I thought about. And yeah, it is a huge concern.

Joel Stern (00:33:37) - That's precisely the thing that is perceived as the biggest threat: in the wrong hands, someone on NPR using the adversarial audio wouldn't just wake up the Alexas, but would then, you know, get them to do something untoward.

Jùnchéng Billy Lì (00:33:57) - No, no, no, that was just part of my thinking; I didn't really do that. I mean, I was thinking of it as a kind of awareness-raising, for a non-malicious purpose. As I said, these things will really be a huge risk in the future if everybody has one of these devices at home, really listening all the time, and that could create an involuntary party of all these devices doing random things.

James Parker (00:34:28) - What about the field itself of Machine Listening? I mean, just for example, do you even talk about "machine listening," or is that just not a word that you use? We've been using this phrase for various reasons, and it seems like it crops up in the literature, but there aren't textbooks or symposia organized around it. So I'm kind of interested in that end of things.

Jùnchéng Billy Lì (00:34:53) - I think this word is definitely being used, by me at least. It's a legit word; what else are we supposed to call it? I think it's a perfect word for this domain. But, as you said, it can be further segmented into automatic speech recognition and sound event recognition, partly because there are different research interests for commercialization.

Jùnchéng Billy Lì (00:35:25) - Some applications are more interested in just purely recognizing the words that you say, and that is itself a huge task for natural language processing; it is its own discipline. But there are more researchers going into sound events, which are basically the environmental sounds of daily events. These are more general research studies into how we can better utilize machine learning models to pick up the daily sound events that not many people have looked into. These things are getting more and more usage in, say, webcams and security applications. With Google Nest you can actually have these sound detection functions activated, so that it can pick up dog barks, glass-break events, and all these security-related things. In my opinion, all of this is really getting in front of the public eye: machines are getting smarter and smarter.

Jùnchéng Billy Lì (00:36:41) - They're able to hear, or understand, these sound events, and of course the speech events that happen around us. A lot of people say that computer vision is privacy-intrusive, but in my opinion the multimodal input, the sonic input, is also privacy-intrusive. Because, as I said, if someone gets 100 hours of your own personal recordings, you could get into deep trouble: if I, or people like me, decided to train a model to fake your voice, then I could fake you in a lot of phone calls, and also detect your activities from sound events. Thanks so much.