> Voice agents allow businesses to be available to their customers 24/7 to answer questions, schedule appointments, or complete purchases. Customer availability and business availability no longer have to match 1:1 (ever tried to call an East Coast bank after 3 p.m. PT?). With voice agents, every business can always be online.
I don't get it -- textual support chatbots have been around for decades. Even if we accept the premise that people would rather speak to them by voice, how do voice agents represent some kind of sea change in availability?
(And I personally find customer support chatbots deeply frustrating to use for reasons that have nothing to do with the modality or the quality of the AI model. I only ever need to use one when the question I have is not answered in the documentation, which is often the extent of the chatbot's business-specific training data. Inevitably I end up being led in circles, screaming for a human.)
Before LLMs, chatbots and voice bots were dumb pattern matchers. You had to list every “utterance” that you wanted to match on. The only variance was in the “slots”.
An utterance is something like “give me directions from $source to $destination”.
LLMs mean that you don’t have to give the system every utterance in every supported language.
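The pre-LLM design described above can be sketched in a few lines. This is a toy illustration, not any particular vendor's NLU: every utterance template has to be enumerated by hand, only the slots vary, and any phrasing nobody listed simply fails to match.

```python
import re

# A toy pre-LLM "NLU": every utterance template is listed by hand.
# Only the $slots (named groups) vary; unlisted phrasings fail to match.
UTTERANCES = [
    (r"give me directions from (?P<source>.+) to (?P<destination>.+)", "get_directions"),
    (r"how do i get from (?P<source>.+) to (?P<destination>.+)", "get_directions"),
    (r"navigate to (?P<destination>.+)", "get_directions"),
]

def match_utterance(text: str):
    """Return (intent, slots) for the first matching template, else None."""
    for pattern, intent in UTTERANCES:
        m = re.fullmatch(pattern, text.lower().strip())
        if m:
            return intent, m.groupdict()
    return None  # i.e. "Sorry, I didn't understand that."

print(match_utterance("Give me directions from Boston to Salem"))
# -> ('get_directions', {'source': 'boston', 'destination': 'salem'})

# A phrasing nobody enumerated falls straight through:
print(match_utterance("I'd like to drive from Boston over to Salem"))
# -> None
```

An LLM-based system inverts this: instead of maintaining that template list in every supported language, the model maps arbitrary phrasings onto the intent and slots directly.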
First, I can't listen to this article, so that makes their point kinda less relevant.
> It is the most frequent (and information dense)
Second, this is false. Voice is effective when the sensory context is available to both people, e.g. at the dinner table, where "pass the salt" makes immediate sense. Otherwise it is an erratic form of communication, prone to misunderstanding, often repetitive and redundant.
It is not more information dense, but it is the most immediate. The latency of AI applications makes its immediacy less useful.
Voice is pretty sweet if you're driving, for example.
Indeed, the one example I can see this being great for: driving alone. Which is maybe a few hours per year for me anyway, but I can imagine it could be a few hours per workday for many people.
I'm pretty convinced that voice interaction will be the biggest UI change since apps.
Voice is simply natural to humans. Downloading an app to learn about the departure of the next bus is not.
I used voice bots to let my 5-year-old play role-playing games (e.g., checking into a hotel) or let my parents (60+) call a fake car dealership.
It's amazing to observe. They behave as if they're talking to a human, especially when doing it via a phone. That is exactly the UX a computer system should have—simply a phone number and voice.
As soon as people have to learn something new (a new webpage, a new app, etc.), something is wrong.
Voice interaction requires an enclosed area. I find it difficult to use any voice assistants in my life. Other people think I'm talking to them. Perhaps we'll all get single person offices with closing doors.
When you reference enclosed do you mean it needs to be enclosed because TTS is so bad that any background noise throws it off, or do you mean for privacy reasons?
- Noise: I expect this will be solved soon. E.g., LiveKit just announced a VAD model that keys on human speech behavior rather than raw voice detection.
- Privacy: this seems to be a cultural thing, and it can change quickly. People moved fast from everyone talking on Bluetooth headsets in the mid-2000s to taking calls anywhere in the 2020s.
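For context on the noise point: the details of LiveKit's model aren't shown here, but the classic energy-threshold VAD it improves on can be sketched generically. A fixed loudness threshold flags *any* loud frame as speech, which is exactly why background noise breaks naive voice detection and why classifying speech behavior is a better bet.

```python
import numpy as np

def naive_vad(samples: np.ndarray, frame_len: int = 160, threshold: float = 0.02):
    """Classic energy-threshold VAD: a frame counts as 'speech' whenever its
    RMS energy exceeds a fixed threshold. Loud background noise (a cafe,
    traffic, music) clears the bar too, which is why newer models classify
    speech *behavior* instead of raw energy."""
    n = len(samples) // frame_len
    frames = samples[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

rate = 16_000
t = np.linspace(0, 1, rate, endpoint=False)
quiet = 0.001 * np.random.default_rng(0).standard_normal(rate)  # near-silence
noisy = 0.2 * np.sin(2 * np.pi * 440 * t)                       # loud tone, no speech

print(naive_vad(quiet).any())  # False: silence is correctly rejected
print(naive_vad(noisy).all())  # True: pure noise is misflagged as "speech"
```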
No. Privacy and social courtesy.
You’re underestimating how many people are super antisocial, or at least don’t like talking that much! But it’s a fair point — I’d use Siri more if it was reliable
For some it might have to do with being anti-social; I'm very social and like talking a lot socially (try to make me shut up). But for getting stuff done, I find voice incredibly time-wasting and inefficient. Typing/reading is always faster for me.

For example, I did a wiring job in the house of someone who only speaks English, and their plumber only speaks Spanish. So I call the plumber in Spanish, and he explains what's up; there were at least 20 occasions in those 30 minutes where he dropped out, or one of us didn't hear some part of a sentence, so it had to be repeated. Then I call the English speakers to explain this to them. If the Spanish guy had sent a WhatsApp/Signal/whatever message, and I had pulled it through AI and sent it on to the English speakers, we would have been done in 5 minutes with what took almost an hour. But the plumber AND the English speakers are young and seemingly incapable of reading, and really bad at writing.

It's not anti-social, for me at least; apart from sitting in a room for a focused discussion about a feature or so, I cannot imagine how it's not more efficient to do things in writing. Not to mention that I can look for or search it later (though AI does solve that).
You are right. Funny enough, this can be mitigated by stating you talk to an AI. People are not as "afraid" of talking to an AI as to a human.
I agree that voice control is great, but I feel we’re at an “uncanny valley” moment. You can talk to a machine fluently in natural language, until you suddenly can’t and it makes the dumbest misunderstanding, either from recognition or from parsing.
You still get the best results by talking like a robot.
> For enterprises, AI directly replaces human labor with technology. It's cheaper, faster, more reliable — and often even outperforms humans.
That’s… quite the claim. I guess we’re picking the worst people, the best voice-based AI, the easiest of scenarios, and a total desire for humanity to remove other humans from interaction.
Pretty dark and sinister if you ask me.
Voice is the most dense form of communication? Maybe if AI does STT perfectly all the time. But then the reverse, TTS, is really not very efficient for me: I read far faster, and I can do a fast skim (taking milliseconds) to see whether the answer is in there, or reprompt, instead of having to listen to the slow warbling of something/someone only to conclude it was worthless. Oh, and STT, at least for me, is not perfect; it often gets things wrong, making the other side return nonsense too.
> Voice is the most dense form of communication
This is one of those claims that's like... yeah, I guess you can go on the internet and just say things.
What a stupid slide deck. Jesus Christ.
I'd much rather type questions than ask them. Being able to review what I've written before I hit send gives me a sense of control lacking in voice interfaces.
Talking to machines is a generational hangup, making for a lot of anti-voice curmudgeons. Watching younglings talk to a chatbot like it's just another participant in a conversation makes the opposition seem futile. TBH, I think most of us would love voice interfaces if they were silent... aka subvocalization / functional mind reading, but ultimately that's just talking in your head.
My beef with AI voice is that it's so fucking slow. As someone used to podcasts at 3-4x speed, I can't wait to ditch human interaction as voice agents adopt variable speech rates.
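The simplest version of variable speech rate is just resampling the audio, though that also shifts pitch; podcast apps instead use time-stretching algorithms (e.g. WSOLA or a phase vocoder) to speed up speech without the chipmunk effect. A minimal sketch of the naive approach, with made-up signal parameters:

```python
import numpy as np

def speed_up(samples: np.ndarray, factor: float) -> np.ndarray:
    """Naive variable-rate playback: resample via linear interpolation.
    Plays 'factor' times faster, but also shifts pitch up by the same
    factor; real players use time-stretching to keep pitch constant."""
    n_out = int(len(samples) / factor)
    idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(idx, np.arange(len(samples)), samples)

audio = np.sin(2 * np.pi * 220 * np.linspace(0, 2, 32_000))  # 2 s at 16 kHz
fast = speed_up(audio, 3.0)  # plays in roughly 0.67 s
print(len(audio), len(fast))  # 32000 10666
```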
Who cares what businesses do; what I want is an AI agent I can point at a business with my goal (e.g., be relentless in negotiating my cable bill down by 25%, and if after 30 days you fail, cancel my subscription) and have it do it.
You obviously missed the memo where it said that AI would only be used by giant corps to maximize shareholder return.
I do like your idea though -- it reminds me of William Shatner donning boxing gloves and "fighting" to get you the best deal on priceline.com (gosh, I just checked and that's from 2016!!)
You mean the memos all over LinkedIn about how companies can reimagine and automate their customer experience to better instrumentalize customers to hand over their money while removing the need to actually interact with said customers?
I have yet to 'meet' a voice AI on a phone. If I do and I can tell, I will hang up, and the company just lost a client. I am a person, and I like speaking to persons, not machines. If a company thinks I am not worth talking to a human, it is not worth my money.
I dunno, that seems a bit narrow minded to me. You're making an assumption about talking to AI being a worse experience than talking to a person (which is frequently _terrible_).
What if you were able to get helpful support, 24/7/365, with no time waiting in a queue, in your own language (regardless of the service provider's location and 'native' language support)? And the company was able to provide the product and support for it cheaper, resulting in less cost to you?
We're far from there, but I expect it'll happen.
I am a software developer. I avoid technology in my house. I like people; I would love to see other people get paid, not fired and replaced. I am Dutch, so it will not happen in the near future for me: we have strict employment laws, plus we are always behind in tech (except for self-service).

It is not narrow-minded; AI (ahem, machine learning) is quickly replacing the wrong things, in my opinion.
So you prefer static menus in IVR systems? Seems they'd typically be more cumbersome to use (unless you use the same one frequently perhaps and have memorized its menus).
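The static-menu alternative being referenced can be sketched as a hard-coded tree (the menu text here is invented for illustration): every path has to be pre-built, and a caller whose request doesn't map to a digit has nowhere to go, which is the cumbersomeness the comment is pointing at.

```python
# A toy static IVR tree: every option is pre-enumerated, and callers
# navigate by digits rather than by saying what they actually want.
IVR_MENU = {
    "prompt": "Press 1 for billing, 2 for support, 3 for hours.",
    "1": {"prompt": "Press 1 for your balance, 2 to dispute a charge."},
    "2": {"prompt": "Press 1 for internet, 2 for TV."},
    "3": {"prompt": "We are open 9 to 5, Monday through Friday."},
}

def navigate(menu: dict, keys: list[str]) -> str:
    """Walk the digit path; any request nobody anticipated has no digit."""
    node = menu
    for key in keys:
        node = node.get(key, {"prompt": "Invalid option. Returning to main menu."})
    return node["prompt"]

print(navigate(IVR_MENU, ["3"]))  # We are open 9 to 5, Monday through Friday.
print(navigate(IVR_MENU, ["9"]))  # Invalid option. Returning to main menu.
```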
Have you used chatgpt advanced voice mode (voluntarily) and what was your experience like?
I don't use AI, I have a brain to compensate.
What's your plan for when you can't tell anymore?
If I can't tell I can't do anything about it. If I can tell I might switch company/service.
I’m the opposite, I’ll talk to a machine, but I want to talk as if they are a machine, not a person. And I want them to be as quick as possible, not rambling on about bullshit. I don’t want to talk to some Indian.
Wow, lots of negative responses here on voice. I’m a reader. I read. A lot. And I still think 4o’s advanced voice mode is unique and extremely useful, and I dearly wish we had open models, or even some closed competitive models, that were as good.
I will note that the model has been successively nerfed, massively, since launch. You can watch some pre-launch demo videos, or just try out some basic engagement: for instance, ask it to talk to you in various accents and see which ones OpenAI deems “inappropriate” to ask for and which are fine. This kind of enshittification is, I think, pretty likely when you are the only one in town with a product.
That said, even moderately enshittified, there’s something magic about an end-to-end trained multimodal model: it can change tone of voice on request. In fact, my standard prompt asks it to mirror my tone of voice and cadence. This is really unique; it’s not achievable through a Whisper -> LLM -> synthesizer/TTS approach. It can give you a Boston accent, speculate that a Marseille accent is the equivalent in French, and then (at least try to) give you a Marseille accent. This is pretty strong medicine, and I love it.
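Why the cascaded approach can't mirror tone becomes obvious once you sketch it. All three stage functions below are hypothetical stand-ins, not real APIs; the point is structural: the only thing that crosses each stage boundary is a plain-text string, so accent, pitch, and cadence are discarded before the LLM ever sees the input, and the TTS stage synthesizes in its own fixed voice regardless.

```python
# Sketch of the cascaded "Whisper -> LLM -> TTS" architecture the comment
# contrasts with end-to-end multimodal models. Every function body here is
# a hypothetical placeholder.

def transcribe(audio: bytes) -> str:
    # STT stage (a Whisper-like model): emits text only,
    # so the speaker's accent, tone, and cadence are gone.
    return "what's the weather like?"

def respond(text: str) -> str:
    # LLM stage: sees just the words, nothing about how they were said.
    return "Sunny and 22 degrees."

def synthesize(text: str) -> bytes:
    # TTS stage: one fixed voice; placeholder returns the text as bytes.
    return text.encode()

def voice_agent(audio: bytes) -> bytes:
    # Text is the only signal that survives each hand-off.
    return synthesize(respond(transcribe(audio)))
```

An end-to-end model, by contrast, consumes and produces audio tokens directly, so prosody can flow all the way through.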
There’s been so much LLM commoditization this year, and of course the chains keep moving forward on intelligence. But, I hope Ms. Moore is correct that we’ll see better and more voice models soon, and that someone can crack the architecture.
I'm not convinced TTS can get all the way to the quality of professional actors for things like audiobooks.
I'll take a professional actor over TTS any day - incomparably better quality even with the best TTS.
The technology is still extremely young and immature. As models get better, it should be possible to build tools that allow manual annotations to tweak how much emotion and expressiveness goes into what's being communicated, and eventually it should be possible to fully automate a first pass at this which produces passable results.
In any case, I think the biggest win is that tons of books which have never received audiobooks now have the option of getting a way better alternative than legacy TTS tools. Even if current TTS tools are a bit limited, they still feel like a massive leap in quality from what was available a few years back. Making it trivial to generate better audiobooks will help make tons of information more accessible to people.
The choice of audiobook is rarely going to be between a professional actor and TTS, but between no audiobook at all or a TTS version.
Slightly off-topic, but here’s a video comparing a real voice actor to a mod in a video game. Personally, I think the mod sounds much better.
https://www.youtube.com/watch?v=Ug4h-3qTd1E
Meh - even though the original got negative reactions (for not hitting the mark as a sultry femme fatale), I still think her VO did better readings than a lot of the ones in this. Some of the mod's lines sound like several different takes by different people spliced together.
(This is still a crazy impressive amount of work, they clearly labored over matching things to facial expressions)
I think many of the negative commenters in this thread haven't recently seen the human-machine interactions of the younger generations with Siri and her chatbot friends.
We use https://www.lindy.ai/. I wonder why it's not on the map; I thought it was widely used.
lindy is voice activated?
AI voice is like AI art. I am sure many people will appreciate it and love it.
But the whole point of this medium is that you want the humanity and personality. Otherwise just use text.
Lots of negativity in the comments. If voice works, it's a superior UI to a GUI. It's an article from an investment firm that is betting on this; nothing wrong with that.
> it's a superior UI than GUI.
I don't believe that. For input, maybe (though you probably draw things to explain stuff, or send reference documents). For output, not at all; it really sucks. Not only is reading faster and more economical (if you can read, of course, but that's another story); adding visuals (images, charts, tables, animations, videos, calendars, kanban boards, mindmaps, etc.) really helps in communicating. That's all GUI.
Can it wreck a nice beach?
Voice. There’s something about talking to an AI that just always feels wrong. An uncanny valley for audio communication. Maybe it would help if devs dropped the attempt at imitating humans and just made them talk like machines, like Glados or something. At least then you know upfront no one is thinking they can fool you with fake pleasantries.
Anthropomorphism is to AI what skeuomorphism is to UIs. I can’t wait for us to move into the “flat design” era of AI, where instead of being patronized with phrases like “Hi! I’m Bobby! Your intelligent AI assistant, how can I help you?” we just get something cold and straight to the point like “Ready for Instructions”, in some crunchy byte encoding. Sorry for the rambling, I’m a little drunk.
Some of the latest models have focused on this and their results will surprise you. Much more emotional, real time reactions.
What is the point of building an emotional connection with a voice model?
The only use case I can see for voice is while driving, or maybe some other professional setting where you need to be hands-free. I would never use a voice assistant in a café or somewhere along those lines.
> There’s something about talking to an AI that just always feels wrong.
That the potential for scams and emotional manipulation seems much higher than any "positive" use cases.
Is this a stolen article to build backlinks?: https://a16z.com/ai-voice-agents-2025-update/
The a16z article starts with "View this report on Gamma", so I'd assume they're on board with it. Maybe they're an investor?