|Mara Mills, Xiaochang Li, Jessica Feldman, Michelle Pfeifer||Auto-transcribed by reduct.video and edited by Zoe|
James Parker (00:00:00) - Okay, well, thanks so much for joining us, everybody, across all of the many different time zones. There are so many different ways we could begin, but perhaps we could just start by everybody introducing themselves. Just to begin with, maybe Mara, would you want to kick things off?
Mara Mills (00:00:19) - Sure. I’m Mara Mills and I’m an Associate Professor of Media, Culture and Communication at NYU. I also co-direct and co-founded the Center for Disability Studies at NYU, and I’ve been in the MCC department for 10 years. I feel like I should say something about that department, because it’s influenced all four of us: Jessica, Xiaochang, and Michelle, who will also introduce themselves, are either grad students or alums of that department. It’s a department with a really unique history because it was founded by Neil Postman as the Department of Media Ecology in 1971 at the urging of Marshall McLuhan. And Neil Postman was the sole chair of the department until 2002, the year before he died. So it really heavily bears the imprint of this, like, McLuhanesque Canadian media studies moment from the seventies, even though we’ve renamed the department and it’s changed in all sorts of ways.
Mara Mills (00:01:22) - So the department has grown quite massively and we do have a sound studies branch that all of us have been part of. And, in fact, we’re even part of a smaller branch, which is voice studies. But even though very few people cite McLuhan or Postman anymore, the emphasis on thinking about media technology as much as content is still really present in the department. Postman used to describe media as like Petri dishes in which culture is grown. And I think people are still doing work along those lines, but also thinking about media ecology in bigger terms: media as existing in environments with each other and with the natural environment and the social world. The four of us today have all incubated to some degree in what was formerly known as the Media Ecology department.
Mara Mills (00:02:15) - I came to the department with a degree in history of science. I had the training in the Harvard department that was really based on historical epistemology, and it was quite a culture shock at first to be part of the media studies department with fewer historians and many more anthropologists and literary theorists. My work over the last 10 years focuses on the question of sound and sound technologies and disability, both in terms of literal technologies related to disability, hearing aids, cochlear implants, and the ways people use them, and also the epistemology of sound seen from minoritarian viewpoints. I’m actually on Zoom right now. And for those of you who can see me on Zoom, the other interviewees, I’m sitting in front of a framed picture by the Deaf sound artist who lives in Berlin, Christine Sun Kim, who in addition to doing sound art, does charcoal drawings.
Mara Mills (00:03:19) - And she always describes herself as unlearning sound culture and trying to re-imagine the meaning of sound and the politics of sound through things like the image and through tactile vibrations. So one part of my work is along those lines, you know, epistemologies of sound understood through disability communities, or so-called minor technologies of sound, which tend not to be minor at all. And the other part of my work has been like 10 years of research into the history of the Bell System, the parent organization of which was and still is AT&T, and all of its subsidiaries, like the research branch Bell Labs. AT&T was the largest corporation in the world for most of the 20th century. And many people describe AT&T as bringing many elements of electroacoustics and present-day sound culture into being.
Mara Mills (00:04:14) - So I’ve done work on like tiny components, like some miniature vacuum tubes coming out of AT&T and the beginnings of electronics and the beginnings of amplification, all the way through things like the sound spectrograph and the vocoder. And it’s not just the technologies. It’s the user groups that come out of them. It’s techniques of listening that come from these technologies. It’s also techniques like amplification or filtering that end up being much broader than just the world of sound. And so trying to understand these entirely new modes of thinking that come out of a tiny little component is something I track as well. And, in a non-deterministic way, I also track the culture and politics from which those things arose in the first place.
Mara Mills (00:05:00) - I’ll stop there cause there’s four of us. I want to just say that I’m probably the most ancient historian in this group. And the lovely thing about working with people who have training in sound art and computation and anthropology is that I actually learn a lot more about like what’s happening now and the outcome of some of those components for the present.
James Parker (00:05:20) - Fantastic. Thanks so much, Mara. I mean, there are so many other aspects of your work that I really want to get to, but maybe let’s move on. I don’t know who should go next. Jessica, maybe, would you like to introduce yourself?
Jessica Feldman (00:05:34) - Yeah, so I am Jessica Feldman. I’m an assistant professor in the department of Communication, Culture and Media at the American University of Paris, based in Paris. And as Mara mentioned, I did my PhD in MCC at NYU studying under Mara, so I’m heavily influenced by this science and technology studies way of thinking. I’m also an artist with a background in sound and like weird-techie-new media-robotics stuff, originally trained as a composer. I think this way of thinking about media from the perspective of what are we making, and why are we making what we’re making, and thinking about it from the perspective of the creator, the creating, and the politics of what we’re making, really carries through into my scholarly research. Right now, my work is sort of at the intersection of sound studies and values in design, and also social movement studies and sort of protest culture in some ways.
Jessica Feldman (00:06:51) - And right now I have two parallel projects. One is maybe more related to Machine Listening, which we’ll probably talk about more later, but it’s on emotion detection by apps, AI, and IoT devices, and how the sound signal gets sort of monitored and then interpreted psychologically and emotionally by these algorithms. That’s one area of study, more sort of critical. And the other project I’m working on right now is a more kind of hopeful or proactive project, which is researching how left-wing activists and citizens’ assemblies and sort of grassroots democratic groups listen to each other and design technology in order to do decision-making and coordination and things like that. That’s been super fun and I think provides an interesting sort of prefigurative model for how we could think about design in a way that is outside of the corporate and surveillance capitalism model that permeates most of what we’re using, in fact, what we’re using right now.
Jessica Feldman (00:08:06) - I think that’s really been amplified as something I want to think about right now, since COVID and since all forms of assembly have moved online. So I’m thinking about that a lot and thinking about Zoom a lot. Yeah.
James Parker - Thank you. And maybe Xiaochang?
Xiaochang Li - Sure. I am currently an Assistant Professor in Communication at Stanford University and, like everyone here, I have close ties to the Media, Culture and Communication program at NYU. I did my PhD there, Mara was on my dissertation committee and we’ve written together, and Jess and I were in the same cohort. So we’re a pretty tight-knit group of folks. In terms of my work, I’m kind of the accidental sound scholar, in that I never really set out to work with or think about sound per se.
Xiaochang Li (00:09:02) - I had started a project on machine learning and language processing, which is still part of my project now. It’s a question about how computation and language really get tethered together as an algorithmic process, in which the problem of trying to bring language under the purview of algorithmic processing starts to get us towards what we would recognize as machine learning, right? It really starts to set up some of the technical and epistemic groundwork for a kind of radical data orthodoxy that we would now recognize as machine learning, in which everything is organized around large-scale pattern recognition that has to sift through large volumes of data. But at the same time, it also opens the door to make it thinkable for computation to be the means by which we come to understand social expression.
Xiaochang Li (00:10:00) - Right. In terms of making it thinkable that we can not only tally, but also determine meaning within language by sorting through it. Right. And so the sound piece comes in because part of that history is really about the development of speech recognition, and not only how language processing came to be, but specifically the challenge of doing acoustic signal processing alongside language. And I came upon it almost entirely by accident, because I made a mistake in my early research. I was working on text prediction technologies, and I was reading up about T9, and I read in a passing interview with one of the creators of T9 that it originated as assistive technology. And so I immediately assumed that the assistive technology in question was speech recognition, because that sort of made sense to me in terms of a legacy of assistive technology coming into something like autocorrect.
Xiaochang Li (00:10:59) - That was completely incorrect. It turned out that what he had originally designed T9 for, and what makes complete sense now as we think about it, was eye tracking: you have nine positions, et cetera. But at that point I had already started down the rabbit hole of the history of speech recognition and started to see the ways in which it was such a crucial piece in the puzzle of making language thinkable as a computational project. It also addresses the question of why it is that we keep trying to make computers do language things, even though they’re extremely bad at doing language things, and also at doing perceptual things, things like listening for instance, because they don’t have the same kind of perceptual coordinates as people do. They don’t recognize sounds as sounds per se. And so I had already gone so deep down that rabbit hole, and it turned out that it did in fact have so much to do with this history that I was interested in, that it just sort of worked out.
Mara Mills (00:12:00) - I want to just jump in and say, you mentioned that Xiaochang and I had written together, but we didn’t realize till very late, I think even after you finished your PhD, where our own research intersected, because I knew you as doing work on text prediction with a tiny fragment of it being around speech recognition. And then all of a sudden our work converged around the sound spectrograph and understanding speech in terms of filtered speech in a spectral view. I had done work on the pre-history of that machine, and you had done work after World War II, and it was just really fortuitous that we were able to unite, especially around a machine called Audrey, the automatic digit recognizer created at Bell Labs as presumably, arguably, the first speech recognizer. It was like, I couldn’t have done that work on my own.
Mara Mills (00:12:52) - So just being able to collaborate with someone who had a totally different time period and with that anchor was really, really quite an amazing experience.
James Parker - Fantastic. I want to talk about that work at some point as well, but finally, and not leastly, Michelle.
Michelle Pfeifer - Yeah. My name is Michelle Pfeifer and I’m a PhD candidate at the department of Media, Culture and Communication at NYU, and Mara is also on my dissertation committee. So yeah, I’m still there. And really broadly, I would say my research is about the relationship between media technology and border and migration policing, and the different forms of surveillance that are associated with that. And the dissertation that I’m writing now looks more particularly at sound and different forms of listening that are crucial, or that I say are crucial, for these forms of border and migration control.
Michelle Pfeifer (00:13:55) - And I’m mostly focusing on what’s going on in Germany, but also Europe more broadly. I was very curious how Mara was describing the department’s thinking about media and the environment, because for me, I kind of started by really thinking about what is referred to as the European border regime, which describes the different kinds of infrastructures, laws, technologies, geographies, and forms of policing that create these fortified borders of Europe. And then I started the PhD in media studies and I became really interested in looking at the role of media technology. So probably most related to questions of machine listening is one big part of the dissertation that looks at dialect recognition and how it’s used in asylum determination in Germany. And maybe, I mean, I never really described myself like that, but Xiaochang was saying that she came to sound accidentally. I think for me, it was kind of similar.
Michelle Pfeifer (00:15:08) - Because basically I was coming from Germany and started the PhD just after this moment that was referred to as the refugee crisis in 2015. I started the PhD in 2016 and I started noticing that there were all these different legal and administrative shifts in asylum determination that were making use of automated and semi-automated biometric tools, forensic tools, and one of them was this dialect recognition software. And at the time I didn’t really know anyone working on it. Now I know a few people who are working on it, but I was very captivated by it for thinking about questions of personhood, like who can make what kind of claims. And what happens if we think about the voice as kind of like, I guess, almost like political infrastructure, right?
Michelle Pfeifer (00:16:20) - The thing that you need in, like, political theory, this thing you need to make a claim for something: what happens when that is reduced from content and is actually only thought about in terms of phonetics or acoustics? So yeah, maybe I can describe this technology a little bit more later, but for me, I kind of stuck with that because I felt that it was really opening these questions about how you can make political claims and how they become inscribed in these technologies, but also this question where I started: what is actually a border and where is it? And, you know, how does it exist in bodies, in media, but also in voices?
Mara Mills - Michelle, you probably know that Lawrence Abu Hamdan was part of the project, the Machine Listening project, early on, I think. And, I mean, James can probably say more about what his contribution was, but I feel like your work is basically the ethnographic follow-up on some of the stuff he was creating at that time.
James Parker (00:17:28) - Yeah. We know that Lawrence has work on the LADO regime, but that’s pre-automation. We actually recorded an interview with him. I mean, this may be a way into thinking about Machine Listening, but we recorded an interview with him and it went all over the place. It was very sort of wide-ranging and stuff. That’s why it’s not up online yet. But, you know, he said, well, I was working on accent recognition in relation to refugees before it was being automated. And to me, the interesting question is not so much the automation, it’s the concept of a voice being a passport. Right. And so that became a kind of a way, when we were talking with Lawrence, into thinking about, well, what is the problem with Machine Listening? What is machinic about the listening?
James Parker (00:18:16) - Is it the fact that the machine is involved, or the sort of desire to apprehend the voice or sound in this kind of very, you know, quantified way? I mean, I don’t know exactly how to describe it, but it became a question of what do we mean by Machine Listening? What’s the problem with Machine Listening? And I think that’s a really good question for everybody, because, you know, we’ve been trying to formulate Machine Listening as a kind of a political problem. Machine Listening seems to be a language that some people use. It seems to be a language that some scientists use. Particularly composers and computer musicians have used this language of Machine Listening, but it’s not widespread. People don’t already know what Machine Listening is necessarily. You know, it sounds a little bit like machine learning, and the scientists also talk about computer audition and auditory scene analysis, and the machine in some ways is a kind of a very old figure that doesn’t really suggest AI or machine learning, you know?
James Parker (00:19:27) - So, what are the histories of Machine Listening? What do we understand by it? Do you recognize your work as being about Machine Listening in some way? Perhaps that’d be a good opening to everybody: what, if anything, does your work have to do with Machine Listening? Or, like Lawrence, do you sort of want to reject that that’s really the problem, and say that we should be thinking about it, you know, in other kinds of ways?
Mara Mills (00:19:56) - I feel like Xiaochang is the one who is working the most directly on the internalist definition of Machine Listening. I think we all can expand that category into many broader realms, in fact, possibly even to legal protocols, as Michelle was mentioning, because those protocols can be almost algorithmic, making them every bit as inflexible as machine learning. But Xiaochang, you have a computer science background, so I feel like you have a sense of the internalist definition as well, and you mentioned the phrase even just now. Am I putting you on the spot?
Xiaochang Li (00:20:35) - Machine Listening, it conjures a few things for me. I don’t know that this is a term that appears historically very often, right? It’s certainly a term that’s in circulation now. And I think what is happening is that it animates a very particular fantasy about how sound and listening and automation come together. On one hand, Machine Listening really taps into the machine learning, big data fantasy of understanding and knowledge at a scale that is beyond the horizon of human scrutiny, and the promise of what that would entail. But at the same time, it comes into this kind of queasy combination with popular fears about surveillance, the pervasive surveillance that is necessitated by the very promise of machine learning and big data, right? That you need to constantly produce more data. So Machine Listening captures both the dream of what that has to offer and the fear of the voracious appetite for data that it requires. And I think that, for me, is maybe why it has such a clear circulation right now. Though I think, in my own work, it also taps into something that doesn’t get articulated that often when we think about machine learning and the sorts of technologies around sound that we’re thinking about today, especially in relation to voice, which is that Machine Listening implies something that is happening in terms of apprehension and interpretation on the end of the machine, as opposed to something like machine hearing. And so central to what we’re looking at now is the development of what historically was called new perceptual coordinates, right? We’re not simply digitizing the voice, right? We’re not simply reformatting it in computational form; we are actually creating a kind of new perceptual calculus, right?
We’re creating computationally differentiated categories for how we think about sound and the voice in ways that aren’t actually mappable onto what we think of as human categories. And I think that is what Machine Listening conjures up for me.
James Parker (00:23:05) - So Machine Listening is a new kind of listening, not a machinic attempt to emulate listening?
Xiaochang Li (00:23:15) - Right. Yeah. So Machine Listening is something different from simply putting sound through a machine, right, or simply reformatting sound in machine-processable form. In doing so, you have to do something to what it is that we think listening is.
Mara Mills (00:23:34) - Yeah. I mean, your work, especially the post-World War II histories you tell, makes me realize that my pre-World War II histories are often more about machine hearing than machine listening: about replicating in some sort of machinic form how human hearing works, but without the processing part, without the perception part, without the learning part, simply because it wasn’t technically possible, even if some of the components or tools, like the idea of the sound spectrograph, are still informing machine listening today. My work has been more about machine hearing. So I’ve worked on things like the roots of the cochlear implant, attempts to recreate human hearing in a machine, or the sound spectrograph, creating a machine that captures what the ear captures, a sound wave as it’s traveling through air. So not the actual reception moment, but the transmission moment, and it visualizes what that sound would look like rather than what it would sound like. When I was thinking about Machine Listening, though, you know, as someone who trained a long time ago in biology and was in the sciences for a while, I remember all of my intro physics classes always started with the question of what is a machine, and they start with, like, the lever or the inclined plane. To be the historian of science for a moment, there is a way in which, if we take the word machine to include even simple machines, something that’s mechanical, not electronic, that is just able to modulate or change motion or force, then even some of my early work on ear trumpets, which are, you know, channeling sound waves, channeling the motion of air, could in that sense be considered to be something related to, or in the neighborhood of, machine listening. You know, it’s more listening and hearing through machines.
I think when I’ve written about mechanical aids or amplifiers or tools related to sound in the 19th century, I usually don’t use the word machine for them, but there’s no reason why one couldn’t, if one’s just taking the basic definition of a simple machine. But with Xiaochang’s definition, I often personally work on the history of machine hearing, and/or on hearing and listening through machines. So the mechanization of hearing, rather than its automation, that mechanization step that comes before automation.
James Parker (00:26:10) - One of the moves you often make, Mara, is to point to the fact that the reason for turning to the machine is often to supplement or delegate hearing in the context of assistance. Right. You know, the idea that the deaf need a machine to hear for them, and that the machine is coming in as a kind of supplement to address a deficit. You see that in your work and you critique it, and you talk about how that story is part of the history of sound technologies per se. It’s not a minor history or a marginal history, and we see that story a lot even today. We were speaking with somebody from Google recently, who was saying that a lot of the ASR and audio assistive technologies that he’s been working on sort of first get in because of the assistive function. But he was a bit suspicious, I think, that maybe that’s a little bit of a marketing thing. So I’m really interested in how to think about that history, which precedes, you know, AI or machine learning, but which is about assistive or delegated listening, and which naturalizes a certain kind of privileged form of listening, hearing, speaking and so on.
Mara Mills (00:27:41) - Yeah, I mean, media theorists and especially media historians love this narrative of media as prosthesis, whether you’re talking about McLuhan or Kittler or Paul Virilio - the sense that some humans, or maybe even all human beings, have these deficits, which require some kind of supplementation, as you put it. Actually, if you look at the historical record, you know, first off, in the case of hearing, since that’s what we’re talking about, hearing loss is so common. Everyone loses hearing as they age. It’s so prevalent and it’s such an internally diverse category that tons and tons of inventors, whether you’re talking about someone like Edison or Fleming, who was one of the developers of the vacuum tube, were hard of hearing or deaf, either from childhood or in later age. And it’s not necessarily that their deafness caused them to be inventive. Surely it had some influence on their work, but I just think it’s important to remember that this is such a common phenomenon that it appears all throughout history, and sometimes it’s correlated with, but it’s not causal of, some of these inventions.
Mara Mills (00:28:52) - And the other thing I’ve found is, you know, deaf people themselves, I mean, again, Edison being deaf is a great example, have done a lot of the invention themselves. They haven’t necessarily needed someone to come create a machine to integrate them into oral culture. And a lot of deaf people reject that anyway, and identify as linguistic minorities. So there is a rhetoric, and I think in the United States it can be linked to things like the NSF and the NIH, big granting organizations requiring someone to show broader impacts when they write a grant proposal. And a really easy way to do that is to talk about rehabilitating a disability or other kind of impairment or illness. In fact, that tends not to be where the market is and where the money comes from. So a lot of inventors will start by getting funding by talking about doing work on behalf of, say, deaf people, but then when they market their product, they often end up leaving disabled people completely behind, and they market for a broader audience and their tools aren’t necessarily accessible. So I write about that phenomenon as what I call the pretext.
Mara Mills (00:30:00) - And I think if you’re McLuhan or Kittler looking through the record, it’s like, ‘Oh look, they invented this for someone deaf.’ And then as you peel back the layers of the history, they actually didn’t, and the tool often wasn’t even marketed that way. Historians always look at things in almost an annoyingly contrarian and fine-grained way, and I could go on and on with other examples, but I’ll pause myself there because, you know, we have two other people who aren’t historians and who I think are getting at the broader implications, the social implications beyond the technology, in other ways.
James Parker (00:30:39) - That’s a very generous segue. Mara, I’d love to talk with you about the work you’ve done that relates to cybernetics and information theory. But maybe I could just ask Jessica a question, because, Mara, you talked about historians being really fine-grained, and Jessica, I was reading your paper on affect and emotion detection in relation to the voice, and I was just really struck by how fine-grained the analysis was. You weren’t talking about Machine Listening, right? You were talking about this specific technology that this specific company is developing in this very specific way. It has a very specific imagination of, I can’t remember what you called it, the human emotional structure or something like this. And you were tracking, again in this really fine-grained way, these very different technologies that might appear in the marketing as simply, you know, AI-led, or similarly machinic, but that actually work very differently, and then following through how they come to market, and their effects, and so on. I was just really struck by your text methodologically, actually. And I wondered, do you think of your work as having to do with Machine Listening, and do you have an idea of Machine Listening in the background? Or is it about always watching the technology in its specificity?
Jessica Feldman (00:32:05) - Hmm, I think when I was doing that work, well, I didn’t know what I was going to find, and there were only a handful of companies working on this at the time. So I was able to do a really close patent analysis of, you know, five different companies that were kind of pioneering this vocal emotion detection. I think it matters actually a lot, these rubrics of emotional and psychological constitution that are deeply embedded in these technologies, even though once they come to market, they’re sort of all being marketed for the same thing. I think what matters here is how they’re imagining the human soul, and they are doing it somewhat differently. I don’t think there’s a right way to do it necessarily. I’m quite concerned by all of these tools. But in the research, I was really curious to see these different paradigms and how they developed as they got closer and closer to the market.
James Parker (00:33:17) - Could you maybe give an example of some of the technologies in their specificity?
Jessica Feldman (00:33:23) - Sure, sure. Well, so the first one that I looked at is developed by this Israeli company called Nemesysco that really focuses on security technology. And their work is really coming out of a history of lie detection technology, which has been proven to be inaccurate and is illegal in many places. But they’re using that technology, basically. And what they’re looking for is kind of like micro-tremors in the voice that they read as indicative of a sort of discomfort that could mean that you are not telling the truth, or that you’re a security risk. So the switch was in the rhetoric, from it being factual to it being a risk. I think this is a rhetorical change. This isn’t a design change, because they couldn’t actually claim that they were able to determine whether someone was lying or not. They could just determine whether it was probable that someone was lying, or they could claim to determine that it was probable that someone was lying. So, I mean, it’s a bit like selling snake oil, I think. But it’s important because it’s being adopted so widely now. So they’re just looking for these tiny tremors in the voice that operate kind of at the level of, I forget the name of the muscles, but, you know, the muscles that we cannot control consciously.
Jessica Feldman (00:34:58) - So that’s one paradigm, and they’re really just lining this up with stress, and lining up stress with truth. And then there are some other companies that have these entirely different mappings of the human soul, basically: information processing systems that they imagine as the human soul, and by analyzing your voice, they can sort of claim to tell what your mood is or what your feelings are in that moment. And then there are others that really make a very pure affective science claim that the voice is expressing something that is universal and uncontrollable and pre-conscious and pre-linguistic. And if you can just pull that out, then you will know something about what the person is reacting to. So this is less a model of ‘this is how the soul is structured’ and more a model of ‘there is some kind of universal pre-linguistic communication that’s happening here that we don’t even need a human to get at.’ I would say those are the three main models that emerged from the research, and they understand the human in different ways. And I think that matters because it’s being used on us.
Mara Mills (00:36:35) - Jessica, I wanted to ask you, because I know you mentioned that one of these companies focused on surveillance, which is very similar to the work Michelle is doing. I spent so much time, and so did Xiaochang, on speech recognition that I wasn’t thinking very much about the para-linguistic – about questions of accent or affect and the other aspects of voice, the things that count as voice. I was focusing on the things that count as speech. But what terrified me when I read your piece was how pervasive, as you mentioned, these tools are. It’s not just border control. It’s so many different automated systems, like for calling a call center to ask for help. And I feel like now, after reading your dissertation and this article, I always try to modulate my voice in certain ways in the hopes that someone will actually not think I’m as angry as I am as a customer. But I’m just curious: do you feel that the surveillance and policing function is where the initial money came from, or is it really consumer culture, or is it just that once this technology became available, it instantly flooded into all of these different markets?
Jessica Feldman (00:37:41) - Hmm. Well, the first thing I want to tell you is that if you want to escalate your claim, you should be more angry and you will more quickly get to a human. That’s one of the most useful things that I’ve learned from my research. I think the money came from security. Yeah. At least as far as I can tell from the early, early stuff. There’s also an element of rhetoric, this sort of, uh, what do you call it? The assistive pretext. There’s a lot of that language too, about helping autistic people. In any sort of emotion recognition, you start to get the disability language for autistic people, but it doesn’t ever materialize in a tool that actually is useful to them, as you said.
Jessica Feldman (00:38:43) - I think what I’m seeing so far is that the money usually comes from surveillance. And then the deployments are first in games and, you know, children’s tools, like a little robot that you can talk to and some fun app you can play with on your phone, and that’s how they build their training data set. And then it sort of scales up into consumer products like marketing and the call center and the neuromarketing sort of stuff. And then we’ll see where it goes from there.
Xiaochang Li (00:39:20) - The stuff that you were bringing up reminded me of the kind of deep naturalism that is present among a lot of the speech technology researchers, right? And there’s a long sort of legacy of that in many of these early speech recognition technologies, too. There’s this fantasy that what this would produce is a more natural form of writing that preceded the current form of writing. And it comes from the fact that a lot of these engineers were referencing material from these kinds of amateur scientists that had fashioned themselves as naturalists, because they were just, like, barons, and that’s a thing you do, and so on. And so they carried forward all these ideas about how the vocalization of speech is a predecessor of writing, of course, but it’s also the next level after the gesture, because you have to speak out when your hands are full. I mean, these kinds of, to us, very silly assumptions that were deeply embedded in these naturalist ideas about primitivism as well, in the moment of colonialism. And it very much syncs up with, you know, and I’m thinking of Fatimah Tobing Rony’s work on the third eye of cinema, the early anthropological film and the capture of motion as imagined to be unfashioned evidence, right? Evidence that the person couldn’t themselves consciously manipulate. It’s the same kind of fantasy that your systems catch the micro-tremors of the vocal apparatus, which therefore can’t be controlled or manipulated by the person, and that this leads us to discover evidence about them beneath their capacity to exert control.
Mara Mills - Yeah, this idea of the media industry or the tech industry wanting to, like, undo itself, wanting to somehow enhance the state of nature and make itself invisible, rather than creating all sorts of other technologies that we haven’t even thought about, where we might use our bodies in all sorts of different ways. And not recognizing that speech is learned. Speech is a technology; quote-unquote unaided speech is a technology. I know this is at the heart of a lot of Michelle’s work on accent too, and how absurd the idea is that you are going to find something innate about a person, unlearned and untouched, based on voice or speech. But it is just bizarre, what you’re saying, that all this rhetoric in the tech industry is about the sort of evils of non-natural tech.
Xiaochang Li - Writing is the artificial technology here, not the machine that now interprets language.
Jessica Feldman (00:42:05) - Yeah. And something that sort of came up in my research, and it probably connects a lot to Michelle’s work, is that these technologies do not at all accommodate inflected languages. So even though they’re making these claims to be universal, in fact, it wouldn’t work with Chinese. They’re very Western-centric.
Mara Mills – Cochlear implants, same thing. They totally were designed for phonetic languages, not tonal ones. That was one of the major – if you’re coming at it from a values-in-design perspective – major biases in those.
Michelle Pfeifer - I keep thinking about this question, for me, not just of what Machine Listening is, but also whether it actually matters that the machine is doing the listening. Because in the case that I looked at, there was a kind of previous iteration where linguists were doing this work: linguists were analyzing language and, on the basis of that, trying to determine where someone is from, which is a really crucial, really important indicator in an asylum claim. And then there’s this move to semi-automation, and of course the assumption is the same. The assumption is still that language is actually an indicator of citizenship, which of course it is not. But also that the language doesn’t change, or that it’s stable, it’s not mobile. And also that people are not moving, that people wouldn’t be immersed in different languages or be bilingual or all these different things. And especially when it comes to migration, I think, a migrant biography usually doesn’t really look like that.
Actually, if we think about how European borders work, it’s intentionally made so that people cannot move from one point to another point, so that they get stuck at many different places. So of course that assumption is the same, whether a linguist is doing this work or the machine. But then I think there are some things that I am trying to take seriously. I guess one thing is how the machine can conceal what it’s doing more easily. Not to use an overused metaphor, like the black box metaphor, but I think it’s really the case that it’s really hard to figure out what actually happens in that determination of the dialect someone speaks, and what the impact of that is on the outcome of an asylum claim. There’s actually that sort of back and forth between human and machine, or there is supposed to be. The German government has always emphasized that the machine is not making decisions. It’s only supposed to assist decision-making, but it’s also supposed to be a solution to this human error or failure, or something that the human is not capable of doing. If we think about it in the context of the so-called refugee crisis, it’s also important that they wanted something that they could scale up.
They wanted to solve all these logistical problems, that there were so many people that they couldn’t process all the asylum claims. I mean, every state representative I talked to was always just talking about how they needed to make something that they could scale up, that could do the work much more quickly. And the linguists who are doing this, I don’t know, nobody was agreeing on what was the right way to do it, or whether you should be doing it at all. And also of course the money – who is investing in the development of this kind of stuff? Even though the assumption of understanding the voice as the passport is still there and operative, I do think that there’s something about how the violence is maybe easier to conceal or smooth over in the machine. And then there’s this idea of something that is supposed to be more objective, and I think that’s really powerful. That’s also something that needs to be confronted by just actually making visible the sort of effects.
Mara Mills - What is the tool right now that is the main one being used by German border control to identify people’s accents and are more people being turned away? I guess my question is, are more people being turned away and denied asylum claims because of the machine or is it actually just incidental? Is it just, is it the same as it would have been with linguists and it’s just adding this air, as you say, of objectivity.
Michelle Pfeifer - It’s kind of hard to answer that question because so many things happened that made it so that it’s much harder to actually make it to Germany. So there’s like less people who are actually applying for asylum, which is a more general kind of development of externalizing border policing. And, the other thing is also that there was this backlog of asylum claims being processed. So I don’t know. It’s just now this moment that some of the cases make it to the courts, because there was like a backlog there too.
I wouldn’t be able to say that more people are not getting asylum because of that. But there are, I think, more subtle ways in which people describe how, when they’re doing their asylum interviews, the ways of questioning are referring to these reports. So then we can also think about how that kind of objective aura influences the people who are asking the questions. I think that this kind of stuff is happening, but it’s hard to quantify – to see it in numbers.
James Parker (00:49:19) - It’s very helpful to think about the way in which the technical systems are embedded in and related to border regimes and economic systems. Jessica, in your work, you’re very clear about the way in which the coming to market is a really important point that shifts things, and it’s not as if there’s something called Machine Listening that is separate from all of these things. And actually I was wondering – because it seems like what you’re saying, Michelle, is that we need to un-black-box and to teach the technologists (or the border agents or the government) a lesson about voice – like sound studies can do political work here. It’s not just sound studies, but you know that their theory of the voice is wrong. And also their theory of migration and refugees, you know, it’s profoundly wrong. And, Jessica, in your work, I was struck by what you said about snake oil, and I was reading one bit about Beyond Verbal, one of the companies you discuss. You say that their 2011 patent includes rubrics relating to vocal pitch. For example, the pitch of C is associated with the need for activity and survival, whereas E is associated with self-control and B with command and leadership. Now that’s an example where there’s no black box, because they said it in the patent, and it is bananas! Right. Apart from the fact that it’s based on a theory of Western harmony, which is very recent and specific… I mean, it’s just bananas. So part of the politics there is to draw a lesson from sound studies or music theory in saying this is just wrong – so profoundly wrong. Whereas in Xiaochang’s work, and perhaps Mara’s too (although it depends a bit which part of your work we are talking about), the politics are a bit different.
In general, Xiaochang, you’re talking kind of about the emergence of a new, I don’t know if it’s an episteme exactly, but you talk about the statistical turn, the emergence of statistical listening and the way in which this automatic speech recognition is a precursor to our being embedded in, or kind of leads to, data knowledge, data power, or something more generally. So it strikes me that there are different ways that politics is happening in each of the different projects going on here. And I just sort of wanted to maybe draw that out and maybe also invite Xiaochang to say something about the politics of your work, or where they are in your work, because it seems like there are really significant political stakes. But they’re not of the same kind that Michelle and Jessica are dealing with, not to say that I’ve mapped the entire politics of your work.
Xiaochang Li (00:52:34) - If I were to kind of draw a through line in terms of the things that we’re talking about that starts to edge into the realm of the political, it is that if we’re thinking about something like Machine Listening, we all seem to be working with technologies that are trying to map some kind of measurement-meaning relation, right? In terms of how it’s separate from the earlier voice technologies that are registering and quantifying the voice for different kinds of visual and mathematical scrutiny, what starts to happen with speech recognition, what we see with the emotion detection, and what we see with the accent detection is that they’re trying to make the machine do the work of linking the measurements, the acoustic measurements, to meaningful categories of things. Be those phonetic categories of language, dialect, ‘C equals some kind of emotion’. Those are the sites where the politics really start to enter in a lot of ways. I think in some of the work it’s much more explicit. I think in mine, it is kind of either hovering in the background or sort of there as an undertow, whichever metaphor you want to go with, but I’m still actually trying to work out what the more explicit direct link is. A part of it is that it’s part of the drive that makes data into a commercial commodity, right? It helps set up the imperative for constantly gathering data, because as speech recognition and natural language processing grow, all of a sudden data becomes very relevant to things like search engines, things like advertising, right? And things like surveillance, which in new ways is really part of this move from a kind of scientific resource into an industrial commodity. And it starts to set up that kind of data arms race that you see with all the big tech companies, which are essentially all data brokering companies in one way or another.
So it’s tied in there, but it’s also tied into the development of models, or the popularization of models I should say, that cannot be assessed except in how well they predict things. It’s the shift from using statistics as a kind of descriptive quantification tool to one that is fundamentally about predicting outcomes. And so you set up this lens in which the outcomes are used to deal with populations, and so different kinds of groups – people like refugees, for instance – are made statistical in new ways that are not only descriptive, but fundamentally about predicting what they will do and assessing them in terms of risk. Then you get locked into those kinds of models because that’s fundamentally all the mathematical model permits you to do. And, because it doesn’t model anything in particular, you can’t open that black box and say, ‘Oh, okay, it categorizes the voice this way, and this determines X and this determines Y.’ It’s a black box not because it’s enclosed and hidden from view; it’s a black box because there’s nothing in it. It’s just trying to discern patterns between inputs and outputs, and so it can only be assessed in terms of how well it predicts, which also means that it can only predict things that have happened before. So it’s fundamentally conservative in that way and also ushers in a regime in which we’re always assessing things in terms of risk. And so I think the politics really start to come together around those things.
James Parker (00:56:32) - This is why your doctoral thesis was about text prediction, right? So ASR becomes part of a story about the emergence of automated prediction as a kind of political force or something in its own right. Am I reading too much into that?
Xiaochang Li (00:56:52) - No, I think part of what I was trying to understand was what this kind of epistemic regime of prediction was. Part of what I’m trying to figure out now, and haven’t quite locked in on, is how it is that a set of epistemic priorities gets plugged into systems of power and things like that, which is I think where a lot of Michelle’s work and Jessica’s work really hones in. I’m trying to figure out where those two pieces lock together.
Jessica Feldman (00:57:23) - One thing that I was thinking as you were speaking, Xiaochang, is that in my work you totally see everything that you’re talking about. You start to see it happening in the rhetoric, and the only epistemology that is acceptable for this type of technology is that of risk and prediction. You elucidate that history through your research, and then I was thinking, well, risk and prediction of what? Maybe one of the things that we can work on articulating and critiquing is the categories into which we can be put by these technologies. And maybe that’s a place of political intervention, and that’s kind of why I’m interested in this middle layer where I’m like, okay, well, what rubrics of the soul even exist that we are being predicted into? Maybe that’s an interesting and important question: what categories, what futures are being offered to us through these forms of machine prediction?
Xiaochang Li (00:58:35) – Yeah. I think there’s something really interesting about the production of machine categories, right? The rhetoric is often something like because the machine doesn’t have a conceptualization of humans as individuals – as meaningful units to produce categories around – it’s therefore more objective because it doesn’t care and doesn’t know what a person is. So that makes it more suitable to produce categories of persons, which is a very strange line of reasoning. Right?
Jessica Feldman (00:59:11) - Yeah. I mean, I think that’s the efficiency part? And that’s where we save time and money: instead of the psychoanalytic approach where you look at an individual’s history, you can do this affective listening approach.
Xiaochang Li (00:59:27) – And the scale and logistics part that Michelle brought up. Something about this really is fundamentally about the ability to scale, which is not so much about the relationship of the human to the machine, as we sometimes thought of it, but perhaps the relationship between the machine and the infrastructure that it’s supposed to support.
James Parker (00:59:48) - I mean, this reminds me a little bit, Jessica, of your argument, where you talk about this diversity at the level of design and technology. So, the different ways in which all of these different affect recognition technologies conceive the soul, or are made to conceive the soul, and so on. And then you said the moment they hit market, they all start predicting things, but you would expect them to, and there’s a kind of convergence. So, yeah, the tail is wagging the dog. The market imperatives really just take over. And I was struck, you know, that’s so similar to what Michelle was saying about efficiency. And then I was looking at Mara’s work on the hearing glove and how you talk, Mara, about the importance of telephony and early information theory and things in the emergence of what you call an industrial conception of language. So it seems like the market imperatives just rush in and take over pretty quickly in all of the stories that we’re telling, and in Mara’s story, you know, there’s a pre-history there too…
Mara Mills (01:01:08) - And the industrial conception, it’s commodification along with technication. You can’t forget the commodity piece. It’s always there. I mean, the telephone system in the United States at least was a legally sanctioned monopoly. It wasn’t like the post office. It wasn’t a service the way it might’ve been in other places. So that’s what I meant by the industrial conception, there’s always that commodification and market piece there too.
James Parker (01:01:33) - So that’s a crucial piece of the history for you, telephony as part of the history of Machine Listening?
Mara Mills (01:01:39) - I mean, that’s definitely part of the story that I was telling. I mean, telling the history of telephony in a nutshell is quite hard because the phone system is bigger than just the phone. It’s the beginning of electroacoustics, and the byproducts of research in the American telephone system, AT&T, include sound film and loudspeakers, and not just cellular networks. It’s so massive. It’s like someone in the next century trying to put Google into a nutshell, or Amazon; it’s almost impossible to do that. In my research I look at the history of things that were happening before 1948, because many people talk about 1948 as this miracle year of, you know, the transistor and cybernetics and Claude Shannon that gave birth to the so-called information age, and I always want to look at what was happening before. And in my work, like Xiaochang’s, I do think there are some of the bigger-picture questions about, as we were mentioning before, phonetic biases. Say, you know, the telephonic bias towards speech over music actually ended up in the signal processing that was coming out of Bell Labs, like vocoding, which got built into the earliest cochlear implants, and many cochlear implant users complained that they couldn’t listen to music and they couldn’t hear tonal languages. So I do think some of my work is useful for people doing work on technology today because you can see where those biases have crept in that, as Michelle said, have been black boxed or completely invisibilized.
Even if the technology looks different, going through the patent chain, you often find that it has older roots than you would think. But then there are also just parallels to the sort of thing that Michelle was describing that maybe don’t even exist anymore. There are parallels: the idea of a technological fix is a long-held dream across many different industries, including education, including governance. And it never turns out to be true because technology doesn’t act on its own, even automated technology. People have to learn how to use cochlear implants; there’s finance, money, training, education that goes into the deployment of these technologies in the 1960s and seventies, just as in Michelle’s case legislation, police officers, actual laws, actual bodies migrating, all of these things have to take place. The tool doesn’t just happen on its own. So we see some of the same stories playing out across our different domains, and even then they’re just in parallel. And then in some cases, Xiaochang and I are providing sort of bedrock stories for components or ideas that are still proliferating in technologies today; it’s more genealogical than parallel.
James Parker (01:04:44) - I wonder if I could spin off there to a slightly different pre-history. Because, you know, Xiaochang, your work is about automatic speech recognition, so sort of content on some level. And then we’ve talked about kind of form or the grain of the voice with Michelle and Jessica’s work, but Mara and Xiaochang, your collaboration on the voiceprint sort of takes us in a slightly different direction. It’s sort of related to Michelle’s story about the voice as passport, but it’s voice as identity, the emergence of biometrics as a kind of big field in relation to Machine Listening. I’d love to hear a little bit more about that project, because that’s an historical project again, but I read that and all I can see is all of these crazy companies that exist now that claim to be doing voice biometrics, and you hinted at it. You’re telling a much older story. So I just wondered if either or both of you would be interested in saying a little bit about that work and the history of the voiceprint and if it has any lessons for us, not that history only has lessons.
Mara Mills (01:05:59) - This is an example where people should have spent some time looking at the history, because we saw, across many decades, over a century in fact, and in many different national contexts and in different industries, I mean, again, whether it’s government or policing or private industry, this fantasy that one could create a voiceprint, that a human being could be uniquely identified by their voice and their voice alone. Even though, even in the 1920s, speech scientists and engineers were realizing as they worked on these tools, like with the early oscillographs, that, you know, people’s voices change with emotion. That’s Jessica’s work. That’s one of the things that thwarts the voiceprint. And people’s voices change with age. I’m sure people are working on voiceprints today, but what we found and argued in our study was that the impulse toward voiceprinting shifted to an impulse towards speech recognition using the exact same tools instead, which is the idea that you can identify a word spoken by lots of different people.
Mara Mills (01:07:09) - It’s also a huge technical challenge, but it’s a quite different one than having someone uniquely identified by their voice as if it were an unchanging fingerprint, which it isn’t. Yeah. Speech recognition is something that supposedly has to work across accent. Now, one thing that we didn’t raise was this sort of in-between category that Michelle is describing, where people as a group can be identified, where a population gets identified based on accent. It’s a different kind of identification puzzle. It’s not recognition, it’s identification. And then if your accent doesn’t seem quite right for the place you’re saying you’re from, then you don’t get to immigrate. You don’t get asylum. We didn’t come across anything quite like that in our research, Xiaochang.
Xiaochang Li - I don’t think we did, but I think we start to see hints of that, right? Because, as you pointed out, what’s really strange about what happens to voiceprinting as you hit the sound spectrograph is that they’re using the same instrument and the same measurements to make completely opposite arguments. One, that the voice is so unique that you can identify a single person, whatever they’re saying, however they’re saying it. And then one, that speech, that the acoustics of speech, is sufficiently common that you can identify specific words regardless of who’s saying them, right? And so these don’t seem like arguments that should fit together. And yet they’re being produced using the same instrument and same measurements. And part of how that comes to work is the way in which they end up mathematically compositing measurements across different voices, across different sort of localized sounds, things like that. And I think that’s where you start to see the beginnings of that same kind of population thinking, as the statistical manipulation starts to come in. Right. And that’s where you start to see, ‘Oh, well, if we can kind of map a pattern of behavior, there is then a tendency of a kind of voice, even if we cannot specify a particular word or a particular person.’ And I think too that there’s another story in there that needs to be explored more, which is that this opens up a world of biometrics in a slightly different way than we’ve seen talked about, because it’s Kersta, who Mara and I wrote about in this story. He goes on to found his own company doing voiceprints. But one of the other things that the company was trying to do was to create all kinds of other biometric identifications based on sound.
So like listening to heartbeats and like blood movements and other sounds produced, or not even sounds, like other acoustic measurements produced in the body that were not actually like perceptible or like things that a person could hear that would serve as biometric identifiers.
Mara Mills - And he was doing stuff with birdsong. So of course, biometric doesn’t have to be the human; there are lots of other biotic life forms out there. And he was doing, like, birdsong prints, and some of that work still exists at, like, the Cornell Lab of Ornithology. It hasn’t tended to be interpreted by critics in as, like, anxious terms.
James Parker (01:10:30) - That’s an interesting way of putting it. Do you think it should be? Because basically, you know, you were talking about the pretext, the assistive pretext, and every time we have this conversation, people will say, ‘well, look – there are ecological uses and so it can’t be all bad.’ We haven’t really got into the critique or the possibility of critiquing the ornithologist.
Mara Mills (01:11:00) - Well, I think some people in sound studies have looked at the misapplication of some of these sound tools for estimating populations of wildlife in the oceans, for instance. And if you misestimate, then you are saying something’s endangered or not endangered. So, like, tracking whale song and trying to predict whale populations and whale migration. Maybe it’s just human narcissism and anthropocentrism that we’re not quite as anxious about that as we are about the human applications, but I’m sure – this isn’t something I do primary research on – but I have read work by my colleague, Alex Hui, on some of the underwater listening tools, which are also militarized and also prone to bias and error.
James Parker (01:11:45) - There was just one other thread I wanted to pick up from your article on the voiceprint stuff. You talk about vocal criminology, and so it just strikes me that there’s a thread, you know, we talked about commercial imperatives driving research, but to tie a thread into Michelle’s work again, about the concept of the criminal or the risky subject. Yeah, I’m not sure if it’s part of a story or a long history of the emergence of criminal law or criminology, but it just struck me that there’s something going on there, the idea that the voiceprint could be somehow at the birth of the concept of the criminal, or arrive at the same time as the idea of the criminal.
Mara Mills (01:12:36) - We dropped that thread because it was sort of physiognomic, right? There was an early moment with, like, Doegen and other phoneticians working in Berlin who were creating these archives of phonograph recordings. And at first they did hope that they could identify a criminal type of voice. So that’s different, that’s a sort of typological thinking that’s like physiognomy or something. You know, other people were interested in, you know, racialized types or gender types or all sorts of pathological types at that moment. And they weren’t based on, like, statistical data sets, the way we describe in our article later population-based thinking around the voice existing. And they also weren’t one-on-one. So yeah, there were a lot of attempts to identify what a criminal voice was like. And we basically, in our study, we’re like, ‘Oh, and then people stopped doing that because it’s proven to not be that scientific.’ And instead the move is towards creating big data sets and having, you know, more statistical thinking. But actually, hearing some of the things that Jessica is describing, it sounds like that physiognomic kind of, not that mathematical, not even based on a big dataset kind of thinking. And also, I actually don’t know how the technology that Michelle is describing is created – and Michelle, I haven’t asked. I mean, maybe that’s not what you’re researching since you’re doing ethnographic work, you’re looking at how it’s applied, but I really don’t understand how they’re coming up with this idea of an accent being particular to a particular place. That to me seems sort of physiognomic as well. I don’t know if they’re asserting that certain people with certain accents are more prone to be criminals. I mean, that’s certainly, like, a 1920s kind of logic. I’m hoping it’s gone now.
Xiaochang Li - Yeah. I mean, just to expand on that a little bit too, I think part of it is that our article kind of stops at a certain moment, and that kind of typological thinking does come back in big data and machine learning a little bit later, and it comes back in a slightly different form, which is that rather than finding these kinds of direct, it’s not that they know. So like the early sort of physiognomic kind of thinking really thinks about the body as the site in which you can identify these things. And it seems to sort of foreclose the possibility of the social factors. And the kind of machine learning imagination around creating these typologies is that actually machine learning is very good at this, because all of those sort of vast, complex social factors are then already embedded in the measurement itself.
And because we can’t untangle it as humans, we can actually just ask the machine to sort the patterns for us if we get enough examples of this. Right? And so it gets to this black boxing thing that I think is actually more important, less the, like, blacking mode of the black box in terms of the obfuscation and more about the boxing of all the factors together in such a way that we can no longer disentangle them, such that as the models then get ported to conditions that have very different factors involved, you’re now sort of mapping things that don’t actually fit together. But now you can’t disentangle which ones are sort of based on, like, institutional factors and social factors and geographic things and all of that stuff.
Mara Mills - Well, and the physiognomic is based on human perceptions. As we were talking about in our article, like higher level human understandings of what a speech feature is. And it’s usually not based on a huge data set. It’s like, ‘Oh, these characteristics seem to be common to criminals’ on a more ad hoc basis. And it’s things that humans can perceive. And as Xiaochang mentions in her research, there’s, like, imperceptible to humans, lower level acoustic elements that actually are the basis for making these identifications of speech or voice in the systems we’re talking about now. So features that wouldn’t be considered features by the human ear, wouldn’t even be detected.
Michelle Pfeifer - Yeah. I was just going to say that I looked at your article again this week, Mara and Xiaochang, and you quote someone and they’re kind of saying that the problem with the body, like with fingerprints, is that you can tamper with them. Like you can burn them, but with the voice, you can’t do that. And for me, it was just so interesting because looking at the voice to identify someone in the context of asylum is only one thing that states, especially in the global north, are doing. And there’s a long history of, like, looking at the body, doing age assessments, looking at DNA or these different kinds of things. And of course also registering people with their fingerprints. So often people try to tamper with their fingerprints because in the EU there’s, like, a fingerprint database. And if you get registered in one country in the EU, you have to apply for asylum there. And so that might be Greece, but maybe you actually want to go to Germany, right?
Or somewhere else. And so there’s a reason you don’t really want to get fingerprinted. And for me, it just made me think again about what is it about the voice that there’s this assumption that it is this more, like, I don’t know, it gives you, like, access to the soul. Maybe that’s how Jessica would say it, right? There’s, like, somewhere embedded this idea that through the voice you actually get to the identity of someone better than maybe through other parts of the body. I just wanted to mention that because I think it also kind of relates to questions about risk. And I feel like in my research it’s really like people who migrate are always seen as kind of risky to the nation state, you know? And they’re always, like, under suspicion. So the categories are always like ‘Are you trustworthy or are you not?’ These are the sort of two categories that are there, and why is the voice supposed to give us this kind of ultimate verification or, like, truth about who you actually are?
James Parker (01:19:25) - Great point. And, you know, we probably should wrap up soon, but I can’t help myself because everything you were saying just reminds me of that Adriana Cavarero book, For More Than One Voice: Toward a Philosophy of Vocal Expression, where she’s trying to kind of build a political philosophy basically on the idea of, I think she calls it the phenomenology of vocalic uniqueness. And I know that this is, like, extremely academic, but it seems like there’s a kind of… yeah, I was always struck in that book. I don’t know if any of you have read the book, is it familiar? Yeah. You know, she’s so committed to that, what she calls a phenomenology of uniqueness, that it’s really asserted. And I’m sort of, like, okay, well, yeah, maybe. But it seems like what you’re saying and from your research, Mara and Xiaochang, is that basically, I mean, as a sort of matter of science, that’s sort of highly questionable. And it’s not to say that, like, Adriana Cavarero is complicit with, you know, policing and border security forces and whatever, but that’s the kind of cutting edge. One of the biggest books in sound studies or voice studies for the last however many years. And it’s very committed to that basic idea, which I don’t know, it says something maybe about what the public and what people are going to be willing to accept in terms of the promise of voice identification, that somehow the voice really is tethered to our soul. And this is philosophically defensible, but it’s sort of empirically justifiable and things. So I’m just, yeah, it’s a very esoteric note to end on, but I just wondered if, yeah, if you have any thoughts about that, like, is it that…
Mara Mills (01:21:20) - I think, I mean, what I loved about that book when it came out was that she was interested in voice, the grain of the voice, and not just in speech. But I think, you know, voice is learned, vocalization is learned, including the vocalizations that count as speech. We know that. And you know, I guess the question I would ask is what’s at stake in arguing that there is vocal uniqueness. I mean, I think if we can parse vocal uniqueness from something that’s essentialist, it actually might still have political force, but there are so many communities who don’t want there to be an essentialism around one’s vocal uniqueness. I mean, I’m thinking about all the interesting activist work and speech therapy work and also academic work that’s happening around the trans voice right now. And people are using the sound spectrograph and its digital online form to retrain their voices, to change the gender of their voice – gender’s learned, and it can be relearned.
And so I would say, like, disentangling vocal uniqueness from the essentialist piece is a really important step that needs to be taken. Across all of the studies that we are doing, you see that the voice can be made to tell a story about population, about universality, about speech, about affect, about an individual. It can almost be endlessly interpreted by whoever the scientist or the perceiver is – using actually the same tools, as I think Xiaochang stated at the beginning of our conversation. So yeah, I guess my question, I don’t know if anyone can even answer this question, but what is at stake right now in making an argument about individual vocal uniqueness as opposed to something universal across all humans or something that’s population-based? I’m not sure.
Xiaochang Li (01:23:11) - Yeah. I mean, I haven’t read the book, so I can’t really speak to the arguments within the book itself, but I will say that I think, you know, if it does coincide with some of the thinking that we see from these speech technology engineers, I mean, that is not an accident, insofar as the issue with a lot of these speech technologies, especially early on, is that they were neither particularly feasible nor particularly useful. And so part of what made them desirable to pursue was the already existing imagination of what the voice had to offer us. So I think the alignment comes from the fact that the engineers’ thinking is derivative of these fantasies already. And not so much that these kinds of political imaginings around the voice are, like, coming out of the engineering fantasies.
Joel Stern (01:24:11) - Can I just give an anecdote, because I was just thinking about how a few years ago I had to call the Australian tax office to sort of sort through my finances. And it was the first time I’d encountered vocal biometrics as a security device. So previously, you would have to give a password and sort of give details about your date of birth and things like that. But in this instance, I encountered a recorded message, which informed me that they would make a recording of my voice. And that would, from that point on, be my passport into the sort of secure zone of discussing my taxation. And the words that the tax office system asked you to say three times, over and over, in order to take a sample of your voice and run it through the system were: ‘In Australia, my voice identifies me. In Australia, my voice identifies me.’ And it was so striking that you were not only giving them, in that moment, the vocal data to analyze, but you were also sort of saying the words that indoctrinate you into the system, the ideological or the sort of political system that you have to believe in order for that to work. So it’s sort of interesting because in some senses, it does work, you know, and I’ve been able to continue to access my data using that system. In other ways, it’s striking that they – rather than just sort of saying, you know, ‘hello, my name is Joel’ or whatever – they want you to say ‘in Australia, my voice identifies me.’ And for you to believe it, as you’re saying it.
Mara Mills (01:26:03) - Although it’s verification, probably not identification and checking. And I talk about this in the article and we probably don’t have time to get into the details of which is which, but it’s true. There’s lots of speech or speaker verification systems now, but they’re not the same as the hundred percent identity that one would need in, say, policing. Although I think it’s a slippery slope and it’s probably convincing some people that it’s possible to, again, we’ll start all over again and get to speaker identification and voiceprinting.
Joel Stern (01:26:38) - No, that’s right. But I was also just thinking, and I know we’re going to end up, when I think Jessica, you were sort of talking about the proliferation of sort of machining classifications, and we were speaking with someone a couple of weeks ago about genre recognition in music. And the way that Spotify, for example, have tried to tag genres to songs using Machine Listening and sort of different forms of audio analysis. And it’s so often wrong because the construction of genre is so much a cultural formation. And just because a song sounds like another song, it doesn’t mean that it’s part of the same genre because, you know, the different subcultures that produced it have complex affectations and there’s irony and, you know, there’s all of those different things. So that’s just an example of where someone who listens to a lot of music and is part of a musical subculture will have so much more capacity to identify genre than a Machine Listening that is sort of an algorithm that is attempting to do it using some statistical kind of process.
Mara Mills (01:27:58) - The disability historian in me actually wants to add one little final comment on the question. I was just realizing that I wanted to also mention, in terms of the question of the, like, atypical or unique voice, again, with my disability history hat, that so many voices that are called unique are actually called that in the spirit of the atypical, and it’s a pathologizing claim. And so I think I would just say with regard to your revisiting of Cavarero, even though this doesn’t specifically come up in her reading, like, there is a hierarchy to vocal uniqueness. And a lot of people are told that they have unique voices and that’s not good. They’re told it in the spirit of it being somehow pathological or having vocal absence or not entering into speech in the appropriate way.
James Parker (01:28:53) - That’s an excellent point. It does actually remind me of some companies that are specifically developing speech recognition tools and things for non-typical voices.
James Parker (01:29:07) - But I don’t know whether they’re well used or well liked by the communities they’re supposed to be for.