---
title: "(Against) the coming world of listening machines"
has_experiments:
[
"top-ten.md",
"choose-your-own-smart-wife.md",
"living-with-a-black-box.md",
"diagnostic-listening.md",
]
sessions:
[
"unsound-2020-session-1.md",
]
---
# (Against) the coming world of listening machines
## Alexa, what is machine listening?
{{< yt id="iUbglqQLdrI" yt_start="2" modestbranding="true" color="white" >}}
"Machine listening" is one common term for a fast-growing interdisciplinary field of science and engineering which uses audio signal processing and machine learning to "make sense" of sound and speech.[^Cella, Serizel, Ellis] Machine listening is what enables you to be "understood" by Siri and Alexa, to Shazam a song, and to interact with many audio-assistive technologies if you are blind or vision impaired.[^alper] As early as the 90s, the term was already being used in computer music to describe the analytic dimension of ['interactive music systems'](https://wp.nyu.edu/robert_rowe/text/interactive-music-systems-1993/chapter5/),[^rowe] whose behavior changes in response to live musical input, though there are ![precedents](audio:static/audio/maier.mp3) even before that.[^maier_audio_1] Machine listening was also, of course, a cornerstone of the mass surveillance programs revealed by Edward Snowden in 2013: SPIRITFIRE's "speech-to-text keyword search and paired dialogue transcription"; EViTAP's "automated news monitoring"; VoiceRT's "ingestion", according to one NSA slide, of Iraqi voice data into voiceprints. Domestically, machine listening technologies underpin the vast databases of vocal biometrics now held by many [prison providers](https://theintercept.com/2019/01/30/prison-voice-prints-databases-securus/ "Prisons Across the U.S. Are Quietly Building Databases of Incarcerated People’s Voice Prints") and, for instance, the [Australian Tax Office](https://www.computerworld.com/article/3474235/the-ato-now-holds-the-voiceprints-of-one-in-seven-australians.html "The ATO now holds the voiceprints of one in seven Australians"). And they are quickly being integrated into infrastructures of development, security and policing.
![Automatic speech recognition](audio:static/audio/kathy-reid-intro-to-ASR.mp3),[^kathy_audio_1] transcription and translation [[i](https://www.statnews.com/2020/05/22/ai-startup-transcribes-annotates-doctor-visits-for-patients/ "AI startup transcribes and annotates doctor visits for patients"), [ii](https://www.iflytek.com/en/products/#/Home "iFlyTek: Create a better world with A.I."), [iii](https://www.wired.com/story/iflytek-china-ai-giant-voice-chatting-surveillance/ "How a Chinese AI Giant Made Chatting—and Surveillance—Easy")] -
targeted key word detection [[i](https://theintercept.com/2015/05/05/nsa-speech-recognition-snowden-searchable-text/ "How the NSA Converts Spoken Words Into Searchable Text")] -
vocal biometrics and audio fingerprinting[^li and mills] [[i](https://www.nice.com/engage/real-time-technology/voice-biometrics/ "NICE leverages voice biometrics for safer and more secure customer authentication"), [ii](https://www.acrcloud.com/audio-fingerprinting/ "What Is Audio Fingerprinting?")] -
speaker identification, differentiation, enumeration and location [[i](https://theintercept.com/2018/01/19/voice-recognition-technology-nsa/ "Finding Your Voice"), [ii](https://patents.google.com/patent/US20100235169A1/en "Google Speech differentiation Patent")] -
personality and emotion recognition [[i](https://www.youtube.com/watch?v=86I3-VYIvAM "callAIser in action: Call Center agent gets desperate over angry customer")] -
accent identification [[i](https://www.theverge.com/2017/3/17/14956532/germany-refugee-voice-analysis-dialect-speech-software "Germany to use voice analysis software to help determine where refugees come from")] -
sound recognition [[i](https://reality.ai/sound-recognition/ "Let your device hear and recognize sound with Reality AI")] -
audio object recognition [[i](https://ieeexplore.ieee.org/document/7295798 "Multimodal object recognition from visual and audio sequences")] -
auditory scene analysis [[i](https://www.amazon.com/Computational-Auditory-Scene-Analysis-Applications/dp/0471741094 "Computational Auditory Scene Analysis: Principles, Algorithms, and Applications 1st Edition")] -
intelligent audio analysis[^intelligent_audio_analysis] -
audio event analysis[^virtanen] -
audio context awareness [[i](https://ieeexplore.ieee.org/document/1285814 "Audio-based context awareness acoustic modeling and perceptual evaluation")] -
music mood analysis [[i](https://www.semanticscholar.org/paper/Multi-Modal-Non-Prototypical-Music-Mood-Analysis-in-Schuller-Weninger/ed0c10ca76ea8ee17514fa569ddf9d0ac7c3a6d5 "Multi-Modal Non-Prototypical Music Mood Analysis in Continuous Space: Reliability and Performances")] -
music identification [[i](https://www.shazam.com/ "Shazam: Name any song in seconds")] -
cover song identification [[i](https://steinhardt.nyu.edu/marl/research/projects/cover-song-identification)] -
music playlist generation[^seaver] -
audio synthesis [[i](https://paperswithcode.com/task/audio-generation "Audio Generation")] -
speech synthesis [[i](https://deepmind.com/blog/article/wavenet-generative-model-raw-audio "WaveNet: A generative model for raw audio"), [ii](https://cloud.google.com/text-to-speech "Convert text into natural-sounding speech using an API powered by Google’s AI technologies."), [iii](https://www.descript.com/lyrebird "Lyrebird AI: Using artificial intelligence to enable creative expression.")] -
musical synthesis [[i](https://openai.com/blog/jukebox/ "Jukebox, a neural net that generates music")] -
adversarial music [[i](https://arxiv.org/abs/1911.00126 "Real World Audio Adversary Against Wake-word Detection System")] -
brand sonification [[i](https://www.audioanalytic.com/brand-sonification-power-recognising-sounds-brands/ "Brand sonification: The power of recognising the sounds of brands")] -
aggression detection [[i](https://www.soundintel.com/products/overview/aggression/ "Deterring and Preventing Assault"), [ii](https://www.audeering.com/what-we-do/automotive/ "Cars take care of their passengers")] -
depression detection [[i](https://news.mit.edu/2018/neural-network-model-detect-depression-conversations-0830 "Model can more naturally detect depression in conversations")] -
laughter detection [[i](http://www.hannahishere.com/project/the-laughing-room-with-jonny-sun/ "The Laughing Room")] -
emotion detection [[i](https://www.theverge.com/platform/amp/2020/8/27/21402493/amazon-halo-band-health-fitness-body-scan-tone-emotion-activity-sleep?__twitter_impression=true&s=09 "Amazon announces Halo, a fitness band and app that scans your body and voice")] -
intoxication detection [[i](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3872081/ "Intoxicated Speech Detection: A Fusion Framework with Speaker-Normalized Hierarchical Functionals and GMM Supervectors")] -
scream detection[^lei_mak] -
lie detection [[i](https://www.nemesysco.com/lva-technology/ "https://www.nemesysco.com/lva-technology/")] -
hoax detection [[i](https://amp.abc.net.au/article/12568084 "University of Southern Queensland gets $300k for hoax emergency call detection technology")] -
gunshot detection [[i](https://www.shotspotter.com/ "Enhance Public Safety and Site Security with ShotSpotter")] -
Parkinson's diagnosis [[i](http://www.canaryspeech.com/ "Using voice to identify human conditions sooner.")] -
COVID-19 diagnosis [[i](https://app.surveylex.com/surveys/5384d6d0-6499-11ea-bc3a-b32c3ca92036 "We are launching an initiative to collect your voices with a goal to be able to triage, screen and monitor COVID-19 virus.")] -
machine fault diagnosis -
psychosis diagnosis [[i](https://www.sciencedaily.com/releases/2019/06/190613104552.htm "The whisper of schizophrenia: Machine learning finds 'sound' words predict psychosis")] -
bird sound identification [[i](https://voicebot.ai/2020/06/26/voice-match-is-for-the-birds-new-google-competition-seeks-avian-audio-ai/ "Voice Match is for the Birds")] -
gender identification and ethnicity detection [[i](https://theintercept.com/2018/11/15/amazon-echo-voice-recognition-accents-alexa/ "Amazon’s Accent Recognition Technology Could Tell the Government Where You’re From")] -
age determination [[i](https://www.phonexia.com/en/product/age-estimation "Phonexia Age Estimation speech technology automatically estimates the age of a speaker")] -
voice likability determination [[i](https://dl.acm.org/doi/10.1145/3123266.3123338 "A Paralinguistic Approach To Speaker Diarisation: Using Age, Gender, Voice Likability and Personality Traits")] -
risk assessment [[i](https://www.clearspeed.com/ "Clearspeed: Using the Power of Voice for Good")]...
{.nosup}
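Several items in the list above, music identification and audio fingerprinting especially, rest on one basic move: reduce a recording to a compact signature that survives changes in volume or encoding, then match signatures instead of sounds. A toy numpy sketch of that move, nothing like a production system such as Shazam's (the function name and parameters here are illustrative only):

```python
import hashlib
import numpy as np

def fingerprint(signal, frame_len=256):
    """Toy fingerprint: hash the sequence of loudest frequency bins per frame.

    Real systems hash constellations of spectral peak *pairs*; this keeps
    only the single dominant bin, which is enough to show the principle.
    """
    peaks = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len]))
        peaks.append(int(np.argmax(spectrum)))  # dominant bin, volume-invariant
    return hashlib.sha1(bytes(peaks)).hexdigest()

sr = 8000
t = np.arange(sr) / sr
melody = np.sin(2 * np.pi * 440 * t)   # the "song"
quiet = 0.1 * melody                   # same song, played quietly
assert fingerprint(melody) == fingerprint(quiet)  # matches despite volume change
```

The point of the sketch is that the machine never "hears" the song: it stores and compares hashes, which is all that identification requires.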
These applications are all either currently in use by states, corporations and other entities around the world, or under development. The list is obviously ![not exhaustive](audio:static/audio/mattern_critique.mp3).[^mattern] Nor does it convey the real diversity of markets, and of the cyberphysical, social and political contexts, into which these applications are quickly embedding themselves:
Digital voice assistants -
voice user interfaces -
state and corporate surveillance [[i](https://paranoid.com/products "Paranoid Home. Data is forever. Get Paranoid.")] -
profiling -
border security -
home security -
pre-emptive policing -
weapons systems -
court systems [[i](https://verbit.ai/industries-legal/ "Revolutionizing Legal Transcription"), [ii](https://www.wired.com/story/star-witness-your-smart-speaker/ "Meet the Star Witness: Your Smart Speaker")] -
hospital systems -
call centre optimisation -
oral hygiene [[i](https://www.mediapost.com/publications/article/354686oral-b-ushers-alexa-into-bathroom-with-new-voice-i.html "Oral-B Ushers Alexa Into Bathroom With New Voice-Integrated Toothbrush")] -
disability services -
grocery store wayfinding [[i](https://edition.cnn.com/2020/08/27/business/amazon-fresh-first-grocery-store/index.html "Alexa, what aisle is the milk in?")] -
ambient elderly monitoring [[i](https://get.cherryhome.ai/care/ "Cherry Home")] -
baby monitoring [[i](https://www.washingtonpost.com/technology/2020/02/25/ai-baby-monitors/ "AI baby monitors attract anxious parents: ‘Fear is the quickest way to get people’s attention’")] -
house arrest monitoring [[i](https://www.shadowtrack.com/about_us/security/ "VOICE BIOMETRICS FOR HOUSE ARREST MONITORING")] -
![human rights monitoring](audio:static/audio/intro-to-pulse-and-radio-content-analysis.mp3)[^andre_audio_1] -
remote education and proctoring [[i](https://www.freep.com/story/news/education/2020/07/28/michigan-online-bar-test-michigan/5518279002/ "Michigan’s online bar exam testers worry software tracks eye movements, noises")] -
school security [[i](https://features.propublica.org/aggression-detector/the-unproven-invasive-surveillance-technology-schools-are-using-to-monitor-students/ "In response to mass shootings, some schools and hospitals are installing microphones equipped with algorithms")] -
remote diagnostics -
biomonitoring and personalised health [[i](https://www.voiceome.org/ "The Voiceome Project")] -
social distancing -
music streaming -
music education -
composition [[i](https://disclaimer.org.au/contents/holly-herndon-and-mat-dryhurst-in-conversation-with-sean-dockray "Inhuman Intelligence")] -
gaming [[i](https://voicebot.ai/2020/06/05/new-sony-patent-elaborates-how-the-playstation-5-voice-assistant-will-help-you-kill-zombies/ "New Sony Patent Elaborates How the PlayStation 5 Voice Assistant Will Help You Kill Zombies")] -
brand development [[i](https://www.adnews.com.au/news/the-most-effective-brand-audio-logos-in-australia "The most effective brand audio logos in Australia")] -
marketing [[i](https://www.veritonic.com/ "Veritonic The Sonic Truth")] -
acoustic ecology [[i](https://www.mitpressjournals.org/doi/pdf/10.1162/isal_a_00059 "Toward a Synthetic Acoustic Ecology")] -
employee performance metrics [[i](https://www.theverge.com/2018/12/21/18151738/walmart-eavesdrop-patent-customer-employee-privacy "Walmart secured a patent to eavesdrop on shoppers and employees")] -
wearables [[i](https://www.wareable.com/wearable-tech/amazon-halo-wearable-announced-8068 "If you thought Alexa was creepy, try the new Amazon Halo wearable")] -
hearables [[i](https://www.youtube.com/watch?v=vuoxprrEwbg "Sony: Ambient Sound Control"), [ii](https://www.audioanalytic.com/use-cases/hearables/ "The future is ear")] -
recruitment, banking and insurance [[i](https://www.clearspeed.com/ "Identify Risk. Reduce Fraud. Build Trust.")] -
voice banking and vocal prosthetics [[i](https://vocalid.ai/individual/vocal-legacy/ "Vocal Legacy")]
{.nosup}
As with all forms of machine learning, questions of efficacy, access, privacy, bias, fairness and transparency arise with every use case. But machine listening also demands to be treated as an epistemic and political system in its own right, one that [casts a shadow](https://youtu.be/iyyM4vDg0xw), that increasingly enables, shapes and constrains basic human possibilities, that is making our auditory worlds knowable in new ways, to new institutions, according to new logics, and is remaking (sonic) life in the process.
Machine listening is much more than just a new scientific discipline or vein of technical innovation, then. It is also an emergent field of knowledge-power and cultural production, of data extraction and colonialism, of capital accumulation, automation and control. We must make it a field of political contestation and struggle. If there is to be a world of listening machines, we must make it emancipatory.[^exemplary projects]
## ~~Machine listening~~
Machine listening isn't just machinic.
Materially, it entails enormous exploitation of both human and planetary resources: to build, power and maintain the vast infrastructures on which it depends, along with all the microphones and algorithms which are its [most visible manifestations](https://anatomyof.ai/ "Anatomy of an AI System").[^Crawford and Joler] Even these are not so visible, however. One of the many political challenges machine listening presents is its tendency to ![disappear](audio:static/audio/vladan.mp3) at the point of use, even as it indelibly marks the bodies of distant workers and permanently scars ecological systems.
Scientifically, machine listening demands enormous volumes of data: exhorted, extracted and appropriated from auditory environments and cultures which, though numerous already, will never be diverse enough. This is why responding to machinic bias with a politics of inclusion can also be ![a trap](audio:static/audio/lawrence_trap.mp3).[^halcyon_audio_1] It means committing to the very system that is oppressing or occluding you: a "techno-politics of perfection."[^goldenfein]
Because machine listening is trained on (more-than) human auditory worlds, it inevitably encodes, invisibilises and reinscribes normative listenings, along with a range of more arbitrary artifacts of the datasets, statistical models and computational systems which are at once its lifeblood and fundamentally opaque.[^mcquillan] This combination means that machine listening is simultaneously an alibi or front for the proliferation and normalisation of specific auditory practices _as_ machinic, and, conversely, often irreducible to human apprehension; which is to say, the worst of both worlds.
Moreover, because machine listening is so deeply bound up with logics of automation and pre-emption, it is also recursive. It feeds its listenings back into the world - ![gendered and gendering](audio:static/audio/gender-and-gendering.mp3),[^ys] colonial and colonizing, ![raced and racializing](audio:static/audio/halcyon-siri-imperialism.mp3),[^halcyon_audio_1] classed and productive of class relations - as Siri's answer or failure to answer; by alerting the police, denying your [claim for asylum](https://www.theverge.com/2017/3/17/14956532/germany-refugee-voice-analysis-dialect-speech-software), or continuing to play Autechre - and this incites an auditory response to which it listens in turn. The soundscape is increasingly cybernetic. Confronting machine listening means recognising that common-sense distinctions between human and machine simply fail to hold. We are all machine listeners now. We have been becoming machine listeners for a long time. Indeed, the becoming machinic of listening is a foundational concern for any contemporary politics of listening; not because mechanisation _itself_ is a problem, but because it is the condition in which we increasingly find ourselves.[^Abu Hamdan]
But machine listening isn't exactly listening either.
Technically, the methods of machine listening are diverse, but they bear little relationship to the biological processes of human audition or the psychocultural processes of meaning making. Many are fundamentally [imagistic](https://medium.com/@krishna_84429/audio-classification-using-transfer-learning-approach-912e6f7397bb "Audio classification using transfer learning approach"), in the sense that they work by first transforming sound into spectrograms. Many work by combining auditory with other forms of data and sensory inputs: machines that [listen by looking](https://www.wired.com/story/lamphone-light-bulb-vibration-spying/ "Spies Can Eavesdrop by Watching a Light Bulb's Vibrations"), or by cross-referencing audio with geolocation data. In the field of Automatic Speech Recognition, for instance, it was only when researchers at IBM moved away from attempts to simulate human listening towards statistical data processing in the 1970s that the field began making decisive steps forward.[^airplanes] Speech recognition needed to untether itself from "human sensory-motor phenomenon" in order to start recognising speech. Airplanes don't flap their wings.[^airplanes]
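The "imagistic" point can be made concrete. What many machine listening systems actually classify is not sound but a spectrogram: a two-dimensional picture of frequency over time, to which ordinary image classifiers are then applied. A minimal numpy sketch, with illustrative names and parameters rather than any particular library's API:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Turn a 1-D audio signal into a 2-D magnitude spectrogram.

    Rows are frequency bins, columns are time steps: sound rendered
    as an image, which is where transfer learning from vision begins.
    """
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)

# A 440 Hz tone at a 16 kHz sample rate: its spectrogram is an
# "image" with a single bright horizontal band near bin 14 (~440 Hz).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = magnitude_spectrogram(tone)
print(spec.shape)  # (257, 61): 257 frequency bins by 61 time steps
```

From here the pipeline the linked article describes is literally image classification: the spectrogram is fed to a convolutional network pretrained on photographs.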
Even if machine listening did work by analogising human audition, the question of cognition would still remain. Insofar as "listening" implies a subjectivity, machines do not (yet) listen. But this kind of anthropocentrism simply begs the question. What is at stake with machine listening is precisely a new auditory regime: an analogue of Paul Virilio's "sightless vision",[^virilio] the possibility of a listening without hearing or comprehension, a purely correlative listening, with the human subject decentered as privileged auditor.
One way of responding to this possibility would be to simply bracket the question of listening and think in terms of "listening effects" instead, so that the question is no longer whether machines _are_ listening, but what it means to live in a world in which they act like it, and we do too.
Another response would be to say that when or if machines listen, they listen ![operationally](audio:static/audio/andrejevic-on-operationalism.mp3): not in order to understand, or even to facilitate human understanding, but to perform an operation: to diagnose, to identify, to recognize, to trigger.[^Faroki, Paglen] And we could notice that as listening becomes increasingly operational, sound does too. [Operational acoustics](https://www.trillbit.com/trillbit-home-page.html "Creating the Internet of sound"): sounds made by machines for machine listeners. [Adversarial acoustics](https://www.youtube.com/watch?v=r4XXGDVs0f8 "Adversarial Music Demo Video"): sounds made by machines _against_ human listeners, and vice versa.[^Billy Li]
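Operational acoustics in this sense can be as mundane as data encoded into tones, the principle behind data-over-sound systems like the one linked above. A hedged numpy sketch of two-tone signalling (an FSK-style scheme; the frequencies, rates and names are arbitrary illustrations, not Trillbit's protocol):

```python
import numpy as np

SR, BAUD = 8000, 10   # illustrative sample rate (Hz) and symbol rate (bits/s)
F0, F1 = 1000, 2000   # tone frequencies standing for bit 0 and bit 1

def modulate(bits):
    """Emit one pure tone per bit: sound made by a machine for a machine."""
    n = SR // BAUD
    t = np.arange(n) / SR
    return np.concatenate([np.sin(2 * np.pi * (F1 if b else F0) * t) for b in bits])

def demodulate(audio):
    """Listening at its most operational: no meaning, just correlate and trigger."""
    n = SR // BAUD
    bits = []
    for i in range(0, len(audio), n):
        chunk = audio[i:i + n]
        t = np.arange(len(chunk)) / SR
        # Energy at each candidate frequency; the louder one wins.
        e0 = abs(np.dot(chunk, np.exp(-2j * np.pi * F0 * t)))
        e1 = abs(np.dot(chunk, np.exp(-2j * np.pi * F1 * t)))
        bits.append(1 if e1 > e0 else 0)
    return bits

message = [1, 0, 1, 1, 0]
assert demodulate(modulate(message)) == message  # round-trips through audio
```

Nothing here understands anything: the receiving machine measures energy at two frequencies and acts, which is the whole of its "listening".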
# Resources
[^Cella, Serizel, Ellis]: ![](bib:7c769ce6-5e9e-40d3-96ef-1838a7f57365)
[^alper]: Meryl Alper, _Giving Voice: Mobile Communication, Disability, and Inequality_ (MIT Press, 2017)
[^rowe]: Robert Rowe, [_Interactive Music Systems: Machine Listening and Composing_](https://wp.nyu.edu/robert_rowe/text/interactive-music-systems-1993/) (Cambridge, MA: The MIT Press, 1993)
[^maier_audio_1]: Stefan Maier, [_Machine Listening_](https://technosphere-magazine.hkw.de/p/Machine-Listening-kmgQVZVaQeugBaizQjmZnY), Technosphere Magazine (2018); Interview with [Stefan Maier](http://stefanmaier.studio/info/) on September 11, 2020
[^li and mills]: Xiaochang Li and Mara Mills, "Vocal Features: From Identification to Speech Recognition by Machine" 60(2) _Technology and Culture_ (2019) pp. S129-S160. DOI: https://doi.org/10.1353/tech.2019.0066
[^intelligent_audio_analysis]: ![](bib:827d1f44-5a35-4278-a527-4df67e5ba321)
[^virtanen]: ![](bib:7cf99c5d-1a28-44d9-958a-8ff5e9cb4441)
[^lei_mak]: Lei, B. and Mak, M.W., "Robust scream sound detection via sound event partitioning" _Multimedia Tools and Applications_ 75, 6071–6089 (2016). https://doi.org/10.1007/s11042-015-2555-z
[^Abu Hamdan]: Interview with Lawrence Abu Hamdan, publication forthcoming 2021
[^airplanes]: ![](bib:6676af8a-7a4d-4aa8-af96-f26452f58753)
[^kathy_audio_1]: Interview with [Kathy Reid](https://blog.kathyreid.id.au) on August 11, 2020
[^mattern]: Interview with [Shannon Mattern](https://wordsinspace.net/shannon/) on August 18, 2020
[^andre_audio_1]: Interview with [André Dao](https://andredao.com/) on September 4, 2020
[^seaver]: ![](bib:d2b1e24c-c800-42b9-ba67-105b0b25efc9)
[^exemplary projects]: See for instance [Data 4 Black Lives](https://d4bl.org/programs.html) and the [Feminist Data Manifest-No](https://www.manifestno.com/)
[^Crawford and Joler]: ![](bib:3f8dd486-3e28-45ef-929f-65086850870e)
[^goldenfein]: ![](bib:6e8f7c36-d251-4a07-ac5d-0b938c5f5fee)
[^mcquillan]: ![](bib:c58be9a5-a599-4a4b-b58f-a07721fc1721)
[^halcyon_audio_1]: Interview with [Halcyon Lawrence](http://www.halcyonlawrence.com/) on August 31, 2020. See also Thao Phan, "Amazon Echo and the Aesthetics of Whiteness" 5(1) _Catalyst: Feminism, Theory, Technoscience_ (2019), 1-38.
[^virilio]: ![](bib:8558647f-101d-43ff-a531-5df8eb87199a) p. 53
[^Faroki, Paglen]: Mark Andrejevic, [Operational Listening (Eavesdropping)](https://youtu.be/OxOKlgsc3_M), recorded on August 10, 2018
[^Billy Li]: ![](bib:fac6c1a2-946f-43c4-83f5-e54fd7185c18) For a good introduction to adversarialism, see ![](bib:bc39dd7f-1dcc-46dc-9f52-6a16b913ff5a)
[^ys]: ![](bib:26f7b730-9064-464b-b905-fbe63c5d4e4b)