Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com) 101
An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating the audio of a person speaking from text. The system is Google's second official generation of the technology, and it consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that a specific word is an important part of a sentence. Quartz has embedded several examples in its report that pair a sentence generated by the AI with the same sentence read aloud by a human hired by Google. Can you tell which is the AI-generated sample?
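A minimal sketch of that two-stage flow, with trivial stand-ins for both stages (every function name, frame size, and formula below is invented for illustration -- the real stages are deep neural networks trained on recorded speech):

```python
# Toy sketch of the Tacotron 2 data flow described above:
# text -> spectrogram -> vocoder -> audio samples.
# Both stages here are trivial stand-ins, NOT the real models.

def text_to_spectrogram(text):
    """Stage 1 stand-in: map each character to a fake 8-bin 'frame'
    of frequency magnitudes (a real mel spectrogram has ~80 bins)."""
    return [[(ord(c) * (b + 1)) % 256 / 255.0 for b in range(8)]
            for c in text]

def vocoder(spectrogram, frame_len=160):
    """Stage 2 stand-in for WaveNet: turn each frame into a chunk of
    audio samples. Here we just emit the frame's mean level."""
    audio = []
    for frame in spectrogram:
        level = sum(frame) / len(frame)
        audio.extend([level] * frame_len)
    return audio

spec = text_to_spectrogram("Hello")
audio = vocoder(spec)
print(len(spec), len(audio))   # 5 frames -> 800 samples
```

The point of the split is that stage 1 only has to decide *what* the speech should look like in time-frequency terms; stage 2 handles the much harder job of producing a realistic waveform from that plan.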
Not so much (Score:5, Informative)
Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation is terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.
Re: (Score:1)
Heck, a good number of the ads I hear on radio have unnatural timing. Even a politician on a teleprompter sounds unnatural to me. Lots of people are bad (or untrained) at sounding natural as they read from copy.
Re: (Score:2)
Heck, a good number of the ads I hear on radio have unnatural timing.
Part of that is because audio can now be digitally sped up without a corresponding pitch change, which obviates the need to hire actors like John Moschitta Jr. [wikipedia.org] to read the terms, conditions, warnings, etc., at the end of an ad. I'm starting to suspect some agencies compress the entire ad in this manner to try to fit in more content without their actors sounding out of breath.
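For the curious, here is a toy version of that trick, assuming a crude granular overlap-add scheme (commercial tools use more careful algorithms such as WSOLA or a phase vocoder): grains are read from the input faster than they are written out, with a cross-fade at each join, so duration shrinks while pitch stays put -- unlike simply playing the tape faster.

```python
import math

def time_compress(samples, rate=1.25, grain=400, overlap=100):
    """Naive granular time compression: read grains from the input
    faster than we write them to the output, cross-fading at the
    joins so the pitch is unchanged."""
    out_hop = grain - overlap
    in_hop = int(out_hop * rate)
    out = []
    pos = 0
    while pos + grain <= len(samples):
        g = samples[pos:pos + grain]
        if out:
            # linear cross-fade over the overlap region
            for i in range(overlap):
                w = i / overlap
                out[-overlap + i] = out[-overlap + i] * (1 - w) + g[i] * w
            out.extend(g[overlap:])
        else:
            out.extend(g)
        pos += in_hop
    return out

# a 1 kHz test tone at 16 kHz, compressed to roughly 80% duration
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(16000)]
shorter = time_compress(tone, rate=1.25)
print(len(shorter) / len(tone))
```

Played back at the original sample rate, the result is the same tone at the same pitch, just shorter -- which is exactly what those sped-up ad disclaimers are doing.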
Re: (Score:2)
"Even a politician on a teleprompter sounds unnatural to me."
But some of them 'have the best words', or so they say.
Re: (Score:2)
Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation is terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.
Funny thing is, I thought both samples sounded more like a computer than a human.
Re: (Score:3)
Re: Not so much (Score:1)
Re: (Score:3)
I remember a claim about the Final Fantasy movie, that its CGI characters were indistinguishable from real people. Instead it hit the uncanny valley very hard.
The problem I expect with the audio is that, as with CGI, it is a bit too perfect: it misses the human imperfections. A computer doing a voice will do exactly the voice it is supposed to do, while a narrator, even an expert at his craft, is affected by his emotions. Reading the text will emotionally move him, and that response comes through in his voice.
Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.
Re: (Score:2)
I remember a claim about the Final Fantasy movie, that its CGI characters were indistinguishable from real people. Instead it hit the uncanny valley very hard. The problem I expect with the audio is that, as with CGI, it is a bit too perfect: it misses the human imperfections. A computer doing a voice will do exactly the voice it is supposed to do, while a narrator, even an expert at his craft, is affected by his emotions. Reading the text will emotionally move him, and that response comes through in his voice. Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.
Still ... "it took over a hundred questions with Rachel, didn't it??"
Re: (Score:2)
Which never made sense to me. All through the movies, artificial organisms have serial numbers, as did the Nexus 8 in 2049. Couldn't Deckard just sample Rachel's DNA? Probably do it with a hand held reader by that time.
Re: (Score:2)
That makes sense. Our speaking apparatus, the muscles and nerves and whatnot are modulated by the emotions running through us at the moment. At the same time our own listening apparatus is trained through endless repetition to catch many of those modulations and identify them, consciously or not. For AI speech to be "indistinguishable from humans" it would need to simulate modulation by emotions which depend on the person and the context.
Re: (Score:2)
Yeah... it said that when I commented. Hence my claim that it is not indistinguishable. Do you understand?
Re:Baloney (Score:4, Insightful)
Re: (Score:1)
Words matter, caveman.
and those words' meanings change all the time.
Re: (Score:3)
But not in science they don't! AI has a definite scientific meaning.
And since its inception in the 1950s, AI has included basic algorithms used to approximate the results of intelligent thought.
Re: (Score:3)
Re: (Score:2)
Before the mid-1900s, if you saw the term AI, it would almost certainly have meant artificial insemination, so I assure you the meaning of AI has changed over time.
Re: (Score:2)
No, they don't.
Yes [wikipedia.org], they do [ted.com]
If your argument was somehow about "AI" specifically, you can see ranton's comment and/or picture how "AI" can become another instance of the example words I linked to.
Re: (Score:2, Insightful)
Everyone is going to call it AI, though.
Everyone can be wrong, of course, but who loses in normal conversation? The Average Joe or a pedant?
I'm sure the technology will be referred to in the correct terms by the people who use and probably invented the correct terms. For everyone else, there's AI.
Re: Baloney (Score:1)
I feel your pain, binary. You should relax, though: can you remember the mainframe, cloud, and e- buzzwords? Everything will be called AI for a short while because it sounds cool and advanced to the masses, but this buzzword shall pass.
Re: (Score:2)
Re: (Score:3)
Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".
People really need to start modding these types of comments as Troll and move on. AI has included basic algorithms used as a stand in for intelligent thought since the field arguably began at The Dartmouth Summer Research Project on Artificial Intelligence over 60 years ago. At the time they were very aware of how difficult it could be to define intelligence, so they intentionally did not let that limit what was considered artificial intelligence research.
Today the researchers and field of scientific journa
Re: (Score:3)
Words matter, caveman. What we are calling "AI" is definitely artificial, but not intelligent. If we are going to start calling computer programs "AI" just to start another VC hype cycle, then what is the point? Microsoft Word is "AI".
There's a straightforward difference. If the logic (or business logic, or branching structure / conditionals) was authored by a human programmer then we call it a conventional program. If the logic was an emergent property of running a learning algorithm over a training set, then we call it AI.
This is a practically useful distinction for us working software engineers. (Why? The latter can't usefully be checked into source control itself; only its training data. You can't diff it. The typical bugs you get is
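A contrived illustration of that split (all names here are made up): the same "is this word shouted?" check, once as hand-authored branching you can diff in source control, and once as a parameter that emerges from running a learning loop over training examples.

```python
# The same check, implemented two ways.

def shouted_authored(word):
    # logic written by a programmer; lives in source control, diffable
    return word.isupper() and len(word) > 1

def train_shouted(examples):
    """Learn a single threshold on the fraction of upper-case
    characters. The learned number IS the logic -- you check in the
    training data, not the value that falls out of training."""
    best_t, best_acc = 0.0, -1
    for t in [i / 10 for i in range(11)]:
        acc = sum((sum(c.isupper() for c in w) / len(w) >= t) == label
                  for w, label in examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

data = [("HELLO", True), ("WOW", True), ("hello", False), ("Ok", False)]
threshold = train_shouted(data)
shouted_learned = lambda w: sum(c.isupper() for c in w) / len(w) >= threshold
```

Change one training example and the learned threshold can silently move; there is no meaningful line-by-line diff of the "program", which is the practical difference the parent is pointing at.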
Re:Baloney (Score:4, Insightful)
Even my doorbell has AI in it, because it rings when it "knows" someone is at the door looking for me.
Re:Baloney (Score:5, Funny)
Same with my electric heater. The thermostat has built-in AI so that it knows to turn the heater off when it gets too hot.
Re: (Score:2)
Re:Baloney (Score:5, Informative)
Listen for the "plosives", the "p" or "b" sounds. All text-to-speech systems get them wrong, because they are generally programmed from recorded speech that is very frequency-limited. There are reasons for that. Digital sampling of sound uses analog-to-digital converters, and the fidelity is limited by the sampling rate. To reduce the amount of digital storage and processing required, the designers of both recording and synthesis tools lower the sampling frequency as far as possible. They also add band-limiting filters on the inputs and the outputs, to avoid sharp step functions generating undesired artifacts on the output, and to avoid weird "beat" harmonics with the sampling frequency confusing the recorded inputs. But the result is smearing of sharp sounds that are rich in transients, such as "t" and "p". And dear lord, does it screw up languages with "click" sounds like Zulu.
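A crude way to see the smearing being described: a low-pass filter (here a plain moving average, standing in for the real band-limiting filters) applied to an idealised plosive -- a single-sample click surrounded by silence. The click's energy gets spread across time and its peak collapses.

```python
# Moving-average low-pass applied to an idealised one-sample click.

def moving_average(samples, width):
    half = width // 2
    return [sum(samples[max(0, i - half):i + half + 1]) /
            len(samples[max(0, i - half):i + half + 1])
            for i in range(len(samples))]

click = [0.0] * 20 + [1.0] + [0.0] * 20   # one-sample transient
smeared = moving_average(click, width=9)

peak = max(smeared)
spread = sum(1 for s in smeared if s > 0)
print(peak, spread)   # the 1.0 peak drops to 1/9, spread over 9 samples
```

The total energy is still there, but the sharp attack that makes a "t" or "p" sound crisp is gone -- which is roughly what aggressive band-limiting does to plosives.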
Re: (Score:2)
Sounds like bullshit. A CD is only 650 MB, and holds 80 minutes of high quality audio. Who cares about the amount of digital storage for a couple of "b" and "t" samples ?
Re:Baloney (Score:4, Informative)
But regarding speech synthesis specifically -- there is software out there, still being used by somebody I'm sure, that was designed to run on consumer PCs back in the 90s. At that time, on those systems, there were computational limits that were relevant to sound quality. Whatever outdated software Stephen Hawking uses sounds like it renders the output at no higher than a 10 or 12 kHz sampling rate (compared to the 40-50 kHz needed to cover the human hearing range). But the sampling rate is a very small part of why Hawking sounds bad. The artifacts you hear from a low sampling rate are mostly limited to high-frequency sounds being cut (and possibly temporal smearing, depending on how you filter). It sounds similar to turning the treble knob on your stereo all the way down.
The quality problems with Hawking's synthesizer go way beyond a treble knob. Things like pacing, emphasis, minor slurring of certain sounds that are adjacent to each other, etc... problems that you take care of by making the software more intelligent, not upping the sample frequency. Which is exactly what Google is doing, and making some progress at it too. No, it doesn't sound like a human yet.
Re: (Score:2)
Re: (Score:1, Troll)
Re: (Score:2)
If you smash a pickaxe through your eye, you will no longer care what people call AI, and we won't have to read your inane shit. It's a win/win.
Re: (Score:2)
And a hacker is someone who enjoys making technology do interesting things. Good luck trying to redefine common language.
For that matter, this isn't even "common" language. Researchers in the field call it AI as well, and have for decades. When necessary they distinguish between strong AI and weak AI, but most of the time it's not necessary because strong AI doesn't yet exist.
Re: (Score:2)
When necessary they distinguish between strong AI and weak AI, but most of the time it's not necessary because strong AI doesn't yet exist.
And you haven't even started distinguishing between AI the result (what you're talking about) and AI the field (which you need to have before you arrive at the former).
Re: (Score:1)
It's funny how angry you keep getting every time the word AI appears in a slashdot article.
And yet, for all your rants, nothing changes. The world keeps on using AI to mean what you insist it doesn't mean.
In the English language, popular use determines meanings. So, this word has attained a new meaning, whether you approve of it or not.
But hey, keep posting your angry rants. Maybe they will go viral and convince the world to change.
Re: (Score:2, Insightful)
Re: (Score:2)
Of course this is more "AI" baloney as you can clearly tell it is speech synthesis.
Meanwhile, actual speech synthesis researchers are acutely aware that mimicking human speech requires dedicating significant NLP resources to generating correct prosody, which may very well be hard or next to impossible without the machine actually understanding what the text is about.
Welcome to the wide world of.... (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
Ha! Sabash!! Great competition. (Score:2)
Re: (Score:2)
I guess I need to listen to it to see just how bad it is. You make it seem like William Shatner should be worried about losing work to automation.
About 10 or so years ago, there was an automated voice reading weather reports on an HDTV sub-channel. I think it was actually the official National Weather Service radio audio. Whenever it came across "patchy fog", it would always say "patch-eef ogg". So now I'm expecting that times a hundred.
What about accents? (Score:2)
I'm going to guess that this is with an American accent. I've yet to hear a Google voice that says "kilometres" the same way we do in Ireland. (It's something I find a little irritating when using Google Maps for navigation.)
Re: What about accents? (Score:2, Interesting)
As speech synthesis rises in usage, my guess is evolution will eliminate harder accents like the Irish, Jamaican, Cuban, etc. It will also eventually eliminate plosive sounds, etc. The language we speak will end up leaning towards how these systems speak, because they'll be more ubiquitous.
Re: (Score:2)
Re: (Score:2)
No need to guess, it says so right in the last paragraph of the article.
However, the system is only trained to mimic the one female voice; to speak like a male or different female, Google would need to train the system again.
Training against different accents is something that would easily be within Google's reach, once they're satisfied with the main product.
Re: What about accents? (Score:2)
I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary. Audiobooks with Whispersync being especially useful.
Re: (Score:2)
I read some time back, that when first working on their Translate application, Google contracted with the United Nations for access to their professional translation archive. Thousands of samples of source material and professional translations in dozens of different languages.
If that included voice recordings as well as written translations, it could be the solution to the problem of training material. Not regional accents, of course, but still, a big leg up.
Re: (Score:2)
I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary. Audiobooks with Whispersync being especially useful.
The problem is neural networks can be unpredictable in their response to training. Start feeding it different voices and it might just start averaging them out, or start doing the voice equivalent of code switching. That would be really weird to listen to.
Also, don't go getting the authors' guilds and voice actors all riled up. They'll be suing preemptively.
Re: (Score:2)
Nobody else says anything the same way you do in Ireland.
Re: What about accents? (Score:2)
True.
Specifically for this, most say Keelow-meters or Killow-meters, while we say kill-Om-eters. Emphasis is on the Om.
Terrible comparisons (Score:1)
Breath (Score:5, Insightful)
One thing that seems to be missing from all of these is a programmatic understanding of how much air is in the lungs.
"Alexa, what is 69! (factorial)"
Listen in amazement as she rhymes off the number but then enters the uncanny valley about the time she should be taking a breath...
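For scale: 69! is about 1.7 x 10^98, roughly the largest factorial a pocket calculator will display, so that's a 99-digit number rhymed off with nowhere to breathe:

```python
import math

# 69! digit count -- one long monologue with no natural breath pause
digits = str(math.factorial(69))
print(len(digits))   # 99 digits
```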
Re: (Score:2)
This will be great! (Score:5, Funny)
Hey google, read all slashdot comments to me with a sarcastic tone.
I noticed this after the last upgrade. (Score:2)
That's not saying much. (Score:2)
When I was a kid, 35 years ago, I had a TI-99/4A home computer with a speech synthesizer (which was actually 5-year-old tech at the time). Sure, it didn't sound great, but it was totally understandable. With the Terminal Emulator II cartridge you could build words from phonemes directly and thus have it say any English word, not just words from its predefined "dictionary" of words it knew how to pronounce already. That was 35 years ago, with a consumer-grade home computer running at 3 MHz, that a 10 year old
Re: (Score:2)
Replaying pre-recorded phonemes is an adequate solution for poor quality speech, but you can't extend that method to reach high quality. In order to do that, you have to start over from scratch, using a much more difficult method.
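A toy version of the phoneme-replay approach the parent describes (the phoneme labels and "recordings" below are invented stand-ins): look each phoneme up in a table of canned sample chunks and splice them together. The hard joins are exactly why this method caps out at "understandable" rather than "natural".

```python
import math

# Stand-in phoneme "recordings": one short tone chunk per phoneme.
SR = 8000  # samples per second

def tone(freq, ms=80):
    return [math.sin(2 * math.pi * freq * n / SR)
            for n in range(SR * ms // 1000)]

PHONEME_BANK = {"HH": tone(300), "EH": tone(500),
                "L": tone(350), "OW": tone(450)}

def speak(phonemes):
    out = []
    for p in phonemes:
        out.extend(PHONEME_BANK[p])   # hard splice, no cross-fade
    return out

hello = speak(["HH", "EH", "L", "OW"])   # "hello" as four spliced chunks
```

High quality requires the opposite approach -- generating the waveform itself from a model of the whole utterance -- which is what WaveNet-style systems do.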
Maybe not the best test subject (Score:2)
This is huge for the audio book market (Score:1)
Compared to what humans? (Score:2)
A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text.
If anyone remembers "reading groups" from primary school, there is a pretty big range in the term "human accurate reading".
Still sounds choppy (Score:2)
Good enough for Hawking maybe.
I'd prefer a nice high-class British female voice. Or Paul Bettany as Jarvis.
Eh, I think the title might be better worded... (Score:2)
Just a start (Score:3)
In a few years, AI will progress so that it sounds more human than humans.
Re: (Score:2)
Have you heard about the woman working in a tourist shop on "The Sunshine Coast" of British Columbia, Canada?
She sells sea shells on the Sechelt Peninsula.
More voices please (Score:2)
I like Australian Siri and wish Alexa would offer similar accents. $0.02