> Voice agents allow businesses to be available to their customers 24/7 to answer questions, schedule appointments, or complete purchases. Customer availability and business availability no longer have to match 1:1 (ever tried to call an East Coast bank after 3 p.m. PT?). With voice agents, every business can always be online.
I don't get it -- textual support chatbots have been around for decades. Even if we accept the premise that people would rather speak to them by voice, how do voice agents represent some kind of sea change in availability?
(And I personally find customer support chatbots deeply frustrating to use for reasons that have nothing to do with the modality or the quality of the AI model. I only ever need to use one when the question I have is not answered in the documentation, which is often the extent of the chatbot's business-specific training data. Inevitably I end up being led in circles, screaming for a human.)
Before LLMs, chatbots and voice bots were dumb pattern matchers. You had to list every “utterance” that you wanted to match on. The only variance was in the “slots”.
An utterance is something like “give me directions from $source to $destination”.
LLMs mean that you don’t have to give the system every utterance in every supported language.
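The pre-LLM design described above can be sketched in a few lines. This is a toy illustration, not any particular vendor's NLU: every utterance template has to be enumerated by hand, only the slots vary, and any phrasing nobody listed simply fails to match.

```python
import re

# A toy pre-LLM "NLU": every utterance template is listed by hand.
# Only the $slots (named groups) vary; unlisted phrasings fail to match.
UTTERANCES = [
    (r"give me directions from (?P<source>.+) to (?P<destination>.+)", "get_directions"),
    (r"how do i get from (?P<source>.+) to (?P<destination>.+)", "get_directions"),
    (r"navigate to (?P<destination>.+)", "get_directions"),
]

def match_utterance(text: str):
    """Return (intent, slots) for the first matching template, else None."""
    for pattern, intent in UTTERANCES:
        m = re.fullmatch(pattern, text.lower().strip())
        if m:
            return intent, m.groupdict()
    return None  # i.e. "Sorry, I didn't understand that."

print(match_utterance("Give me directions from Boston to Salem"))
# -> ('get_directions', {'source': 'boston', 'destination': 'salem'})

# A phrasing nobody enumerated falls straight through:
print(match_utterance("I'd like to drive from Boston over to Salem"))
# -> None
```

An LLM-based system inverts this: instead of maintaining that template list in every supported language, the model maps arbitrary phrasings onto the intent and slots directly.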
First, I can't listen to this article, so that makes their point kinda less relevant.
> It is the most frequent (and information dense)
Second, this is false. Voice is effective when the sensory context is available to both people, e.g. at the dinner table, where "pass the salt" makes immediate sense. Otherwise it is an erratic form of communication, prone to misunderstanding, often repetitive and redundant.
It is not more information dense, but it is the most immediate. The latency of AI applications makes its immediacy less useful.
Voice is pretty sweet if you're driving, for example.
Indeed, the one example I can see this being great for: driving alone. Which is maybe a few hours per year for me anyway, but I can imagine it could be a few hours per workday for many people.
I'm pretty convinced that voice interaction will be the biggest UI change since apps.
Voice is simply natural to humans. Downloading an app to learn about the departure of the next bus is not.
I used voice bots to let my 5-year-old play role-playing games (e.g., checking into a hotel) or let my parents (60+) call a fake car dealership.
It's amazing to observe. They behave as if they're talking to a human, especially when doing it via a phone. That is exactly the UX a computer system should have—simply a phone number and voice.
As soon as people have to learn something new (a new webpage, a new app, etc.), something is wrong.
Voice interaction requires an enclosed area. I find it difficult to use any voice assistants in my life. Other people think I'm talking to them. Perhaps we'll all get single person offices with closing doors.
When you reference enclosed do you mean it needs to be enclosed because TTS is so bad that any background noise throws it off, or do you mean for privacy reasons?
- Noise: I expect this will be solved soon. E.g., LiveKit just announced a VAD model that keys on human speech behavior rather than raw voice detection.
- Privacy: this seems to be a cultural thing, and it can change quickly. People moved fast from everyone talking on Bluetooth headsets in the mid-2000s to taking calls anywhere in the 2020s.
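For context on the noise point: the details of LiveKit's model aren't shown here, but the classic energy-threshold VAD it improves on can be sketched generically. A fixed loudness threshold flags *any* loud frame as speech, which is exactly why background noise breaks naive voice detection and why classifying speech behavior is a better bet.

```python
import numpy as np

def naive_vad(samples: np.ndarray, frame_len: int = 160, threshold: float = 0.02):
    """Classic energy-threshold VAD: a frame counts as 'speech' whenever its
    RMS energy exceeds a fixed threshold. Loud background noise (a cafe,
    traffic, music) clears the bar too, which is why newer models classify
    speech *behavior* instead of raw energy."""
    n = len(samples) // frame_len
    frames = samples[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold

rate = 16_000
t = np.linspace(0, 1, rate, endpoint=False)
quiet = 0.001 * np.random.default_rng(0).standard_normal(rate)  # near-silence
noisy = 0.2 * np.sin(2 * np.pi * 440 * t)                       # loud tone, no speech

print(naive_vad(quiet).any())  # False: silence is correctly rejected
print(naive_vad(noisy).all())  # True: pure noise is misflagged as "speech"
```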
No. Privacy and social courtesy.
You’re underestimating how many people are super antisocial, or at least don’t like talking that much! But it’s a fair point — I’d use Siri more if it was reliable
For some it might have to do with being anti-social; I'm very social and like talking a lot socially (try to make me shut up). But for getting stuff done, I find voice incredibly time-wasting and inefficient. Typing/reading is always faster for me.

For example, I did a wiring job in the house of someone who only speaks English, and their plumber only speaks Spanish. So I call the plumber in Spanish, and he explains what's up; there were at least 20 occasions in those 30 minutes where he dropped out, or one of us didn't hear some part of a sentence, so it had to be repeated. Then I call the English speakers to explain this to them. If the Spanish guy had sent a WhatsApp/Signal/whatever message, and I had pulled it through AI and sent it on to the English speakers, we would have been done in 5 minutes with what took almost an hour. But the plumber AND the English speakers are young and seemingly incapable of reading, and really bad at writing.

It's not anti-social, for me at least; apart from sitting in a room for a focused discussion about a feature or so, I cannot imagine how it's not more efficient to do things in writing. Not to mention that I can look for or search it later (though AI does solve that).
You are right. Funny enough, this can be mitigated by stating you talk to an AI. People are not as "afraid" of talking to an AI as to a human.
I agree that voice control is great, but I feel we’re at an “uncanny valley” moment. You can talk to a machine fluently in natural language, until you suddenly can’t and it makes the dumbest misunderstanding, either from recognition or from parsing.
You still get the best results by talking like a robot.
> For enterprises, AI directly replaces human labor with technology. It's cheaper, faster, more reliable — and often even outperforms humans.
That’s… quite the claim. I guess we’re picking the worst people, the best voice-based AI, the easiest of scenarios, and a total desire for humanity to remove other humans from interaction.
Pretty dark and sinister if you ask me.
Voice is the most dense form of communication? Maybe if AI does STT perfectly all the time. But then the reverse, TTS, is really not very efficient for me: I read far faster, and I can do a fast skim (taking milliseconds) to see whether the answer is in there, or reprompt, instead of having to listen to the slow warbling of something/someone only to conclude it was worthless. Oh, and STT, at least for me, is not perfect; it often gets things wrong, making the other side return nonsense too.
> Voice is the most dense form of communication
This is one of those claims that's like... yeah, I guess you can go on the internet and just say things.
What a stupid slide deck. Jesus Christ.
I'd much rather type questions than ask them. Being able to review what I've written before I hit send gives me a sense of control lacking in voice interfaces.
Talking to machines is a generational hangup, making for a lot of anti-voice curmudgeons. Watching younglings talk to a chatbot like it's just another participant in a conversation makes the opposition seem futile. TBH, I think most of us would love voice interfaces if they were silent... aka subvocalization / functional mind reading, but ultimately that's just talking in your head.
My beef with AI voice is that it's so fucking slow. As someone used to podcasts at 3-4x speed, I can't wait to ditch human interaction as voice agents adopt variable speech rates.
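The simplest version of variable speech rate is just resampling the audio, though that also shifts pitch; podcast apps instead use time-stretching algorithms (e.g. WSOLA or a phase vocoder) to speed up speech without the chipmunk effect. A minimal sketch of the naive approach, with made-up signal parameters:

```python
import numpy as np

def speed_up(samples: np.ndarray, factor: float) -> np.ndarray:
    """Naive variable-rate playback: resample via linear interpolation.
    Plays 'factor' times faster, but also shifts pitch up by the same
    factor; real players use time-stretching to keep pitch constant."""
    n_out = int(len(samples) / factor)
    idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(idx, np.arange(len(samples)), samples)

audio = np.sin(2 * np.pi * 220 * np.linspace(0, 2, 32_000))  # 2 s at 16 kHz
fast = speed_up(audio, 3.0)  # plays in roughly 0.67 s
print(len(audio), len(fast))  # 32000 10666
```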
Who cares what businesses do; what I want is an AI agent I can point at a business with my goal (e.g., be relentless in negotiating my cable bill down by 25%, and if after 30 days you fail, cancel my subscription) and have it do it.
You obviously missed the memo where it said that AI would only be used by giant corps to maximize shareholder return.
I do like your idea though -- it reminds me of William Shatner donning boxing gloves and "fighting" to get you the best deal on priceline.com (gosh, I just checked and that's from 2016!!)
You mean the memos all over LinkedIn about how companies can reimagine and automate their customer experience to better instrumentalize customers to hand over their money while removing the need to actually interact with said customers?
I have yet to 'meet' a voice AI on a phone. If I do and I can tell, I will hang up, and the company just lost a client. I am a person, and I like speaking to persons, not machines. If a company thinks I am not worth talking to a human, it is not worth my money.
I dunno, that seems a bit narrow minded to me. You're making an assumption about talking to AI being a worse experience than talking to a person (which is frequently _terrible_).
What if you were able to get helpful support, 24/7/365, with no time waiting in a queue, in your own language (regardless of the service provider's location and 'native' language support)? And the company was able to provide the product and support for it cheaper, resulting in less cost to you?
We're far from there, but I expect it'll happen.
I am a software developer. I avoid technology in my house. I like people; I would love to see other people get paid, not fired and replaced. I am Dutch, so it will not happen in the near future for me: we have strict employment laws, plus we are always behind in tech (except for self-service).

It is not narrow-minded; AI (ahem, machine learning) is quickly replacing the wrong things, in my opinion.
So you prefer static menus in IVR systems? Seems they'd typically be more cumbersome to use (unless you use the same one frequently perhaps and have memorized its menus).
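The static-menu alternative being referenced can be sketched as a hard-coded tree (the menu text here is invented for illustration): every path has to be pre-built, and a caller whose request doesn't map to a digit has nowhere to go, which is the cumbersomeness the comment is pointing at.

```python
# A toy static IVR tree: every option is pre-enumerated, and callers
# navigate by digits rather than by saying what they actually want.
IVR_MENU = {
    "prompt": "Press 1 for billing, 2 for support, 3 for hours.",
    "1": {"prompt": "Press 1 for your balance, 2 to dispute a charge."},
    "2": {"prompt": "Press 1 for internet, 2 for TV."},
    "3": {"prompt": "We are open 9 to 5, Monday through Friday."},
}

def navigate(menu: dict, keys: list[str]) -> str:
    """Walk the digit path; any request nobody anticipated has no digit."""
    node = menu
    for key in keys:
        node = node.get(key, {"prompt": "Invalid option. Returning to main menu."})
    return node["prompt"]

print(navigate(IVR_MENU, ["3"]))  # We are open 9 to 5, Monday through Friday.
print(navigate(IVR_MENU, ["9"]))  # Invalid option. Returning to main menu.
```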
Have you used chatgpt advanced voice mode (voluntarily) and what was your experience like?
I don't use AI, I have a brain to compensate.
What's your plan for when you can't tell anymore?
If I can't tell I can't do anything about it. If I can tell I might switch company/service.
I’m the opposite, I’ll talk to a machine, but I want to talk as if they are a machine, not a person. And I want them to be as quick as possible, not rambling on about bullshit. I don’t want to talk to some Indian.
Wow, lots of negative responses here on voice. I’m a reader. I read. A lot. And I still think 4o’s advanced voice mode is unique and extremely useful, and I dearly wish we had open models, or even some closed competitive models, that were as good.
I will note that the model has been successively nerfed, massively, since launch. You can watch some pre-launch demo videos, or just try out some basic engagement: for instance, ask it to talk to you in various accents and see which ones OpenAI deems “inappropriate” to ask for and which are fine. This kind of enshittification is, I think, pretty likely when you are the only one in town with a product.
That said, even moderately enshittified, there’s something magic about an end-to-end trained multimodal model: it can change tone of voice on request. In fact, my standard prompt asks it to mirror my tone of voice and cadence. This is really unique; it’s not achievable through a Whisper -> LLM -> synthesizer/TTS approach. It can give you a Boston accent, speculate that a Marseille accent is the equivalent in French, and then (at least try to) give you a Marseille accent. This is pretty strong medicine, and I love it.
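Why the cascaded approach can't mirror tone becomes obvious once you sketch it. All three stage functions below are hypothetical stand-ins, not real APIs; the point is structural: the only thing that crosses each stage boundary is a plain-text string, so accent, pitch, and cadence are discarded before the LLM ever sees the input, and the TTS stage synthesizes in its own fixed voice regardless.

```python
# Sketch of the cascaded "Whisper -> LLM -> TTS" architecture the comment
# contrasts with end-to-end multimodal models. Every function body here is
# a hypothetical placeholder.

def transcribe(audio: bytes) -> str:
    # STT stage (a Whisper-like model): emits text only,
    # so the speaker's accent, tone, and cadence are gone.
    return "what's the weather like?"

def respond(text: str) -> str:
    # LLM stage: sees just the words, nothing about how they were said.
    return "Sunny and 22 degrees."

def synthesize(text: str) -> bytes:
    # TTS stage: one fixed voice; placeholder returns the text as bytes.
    return text.encode()

def voice_agent(audio: bytes) -> bytes:
    # Text is the only signal that survives each hand-off.
    return synthesize(respond(transcribe(audio)))
```

An end-to-end model, by contrast, consumes and produces audio tokens directly, so prosody can flow all the way through.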
There’s been so much LLM commoditization this year, and of course the chains keep moving forward on intelligence. But, I hope Ms. Moore is correct that we’ll see better and more voice models soon, and that someone can crack the architecture.
I'm not convinced TTS can get all the way to the quality of professional actors for things like audiobooks.
I'll take a professional actor over TTS any day - incomparably better quality even with the best TTS.
The technology is still extremely young and immature. As models get better, it should be possible to build tools that allow manual annotations to tweak how much emotion and expressiveness goes into what's being communicated, and eventually it should be possible to fully automate a first pass at this which produces passable results.
In any case, I think the biggest win is that tons of books which have never received audiobooks now have the option of getting a way better alternative than legacy TTS tools. Even if current TTS tools are a bit limited, they still feel like a massive leap in quality from what was available a few years back. Making it trivial to generate better audiobooks will help make tons of information more accessible to people.
The choice of audiobook is rarely going to be between a professional actor and TTS, but between no audiobook at all or a TTS version.
Slightly off-topic, but here’s a video comparing a real voice actor to a mod in a video game. Personally, I think the mod sounds much better.
https://www.youtube.com/watch?v=Ug4h-3qTd1E
Meh - even though the original got negative reactions (for not hitting the mark as a sultry femme fatale), I still think her VO did better readings than a lot of the ones in this. Some of the mod's lines sound like several different takes by different people spliced together.
(This is still a crazy impressive amount of work, they clearly labored over matching things to facial expressions)
I think many of the negative commenters in this thread haven't recently seen the human-machine interactions of the younger generations with Siri and her chatbot friends.
We use https://www.lindy.ai/. I wonder why it's not on the map; I thought it was widely used.
lindy is voice activated?
AI voice is like AI art. I am sure many people will appreciate it and love it.
But the whole point of this medium is that you want the humanity and personality. Otherwise just use text.
Lots of negativity in the comments. If voice works, it's a superior UI to a GUI. It's an article from an investment firm that is betting on this; nothing wrong with that.
> it's a superior UI than GUI.
I don't believe that. For input, maybe (though you probably draw things to explain stuff, or send reference documents). For output, not at all; it really sucks. Not only is reading faster and more economical (if you can read, of course, but that's another story); adding visuals (images, charts, tables, animations, videos, calendars, kanban boards, mindmaps, etc.) really helps in communicating. That's all GUI.
Can it wreck a nice beach?
Voice. There’s something about talking to an AI that just always feels wrong. An uncanny valley for audio communication. Maybe it would help if devs dropped the attempt at imitating humans and just made them talk like machines, like Glados or something. At least then you know upfront no one is thinking they can fool you with fake pleasantries.
Anthropomorphism is to AI what skeuomorphism is to UIs. I can’t wait for us to move into the “flat design” era of AI, where instead of being patronized with phrases like “Hi! I’m Bobby! Your intelligent AI assistant, how can I help you?” we just get something cold and straight to the point like “Ready for Instructions”, in some crunchy byte encoding. Sorry for the rambling, I’m a little drunk.
Some of the latest models have focused on this and their results will surprise you. Much more emotional, real time reactions.
What is the point of building an emotional connection with a voice model?
The only use case I can see for voice is while driving, or maybe some other professional setting where you need to be hands-free. I would never use a voice assistant in a café or somewhere along those lines.
> There’s something about talking to an AI that just always feels wrong.
That the potential for scams and emotional manipulation seems much higher than any "positive" use cases.
Is this a stolen article to build backlinks?: https://a16z.com/ai-voice-agents-2025-update/
The a16z article starts with "View this report on Gamma", so I'd assume they're on board with it. Maybe they're an investor?