Date: 2017-12-27
Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com)
An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text. The system is Google's second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, as well as alter the way it enunciates based on punctuation. For instance, capitalized words are stressed, as someone would do when indicating that a specific word is an important part of a sentence. Quartz has embedded several different examples in their report that feature a sentence generated by AI along with a sentence read aloud by a human hired by Google. Can you tell which is the AI-generated sample?
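For readers unfamiliar with the term, here is a rough sketch of what "a spectrogram" means: slice the audio into short overlapping frames and measure how much energy sits at each frequency in each frame. This is not the Tacotron 2 front end (the summary gives no detail on its exact features); the function name `spectrogram`, the frame size, the hop, and the test tone are all assumptions chosen for illustration, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of per-frequency magnitudes for each audio frame.

    Hypothetical helper for illustration: a plain magnitude spectrogram
    built from a naive DFT, not the features used in the paper.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce leakage at the frame edges.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive DFT, keeping only the non-negative frequency bins.
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Assumed test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = spectrogram(samples)
# The strongest bin in the first frame should correspond to the tone.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), "frames,", len(spec[0]), "bins, peak near", peak_hz, "Hz")
```

Each row of `spec` is one time slice and each column one frequency band, which is the "chart" the second network would read in the pipeline described above.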
Not so much (Score:5, Informative)
Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.
Re: (Score:1)
Heck, a good number of the ads I hear on radio have unnatural timing. Even a politician on a teleprompter sounds unnatural to me. Lots of people are bad (or untrained) at sounding natural as they read from copy.
Re: (Score:2)
Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.
Funny thing is, I thought both samples sounded more like a computer than a human.
Baloney (Score:1)
Re: (Score:3)
Re: (Score:1)
Words matter, caveman.
and those words' meanings change all the time.
Re: (Score:3)
Even my doorbell has AI in it, because it rings when it "knows" someone is at the door looking for me.
Welcome to the wide world of.... (Score:4, Insightful)
Ha! Sabash!! Great competition. (Score:2)
What about accents? (Score:2)
I'm going to guess that this is with an American accent. I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland. (It's something I find a little irritating when using Google Maps for navigation.)
Re: (Score:2)
Re: (Score:2)
No need to guess, it says so right in the last paragraph of the article.
However, the system is only trained to mimic the one female voice; to speak like a male or different female, Google would need to train the system again.
Training against different accents is something that would easily be within Google's reach, once they're satisfied with the main product.
Re: What about accents? (Score:2)
I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary, with Whispersync audiobooks being especially useful.
Terrible comparisons (Score:1)
Breath (Score:3)
One thing that seems to be missing from all of these is a programmatic understanding of how much air is in the lungs.
"Alexa, what is 69! (factorial)"
Listen in amazement as she rhymes off the number, but then enter the uncanny valley about the time she should be taking a breath...
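To put a size on that number, a quick sketch using only Python's standard library (the variable names are just for illustration):

```python
import math

# 69! is an exact integer in Python, so we can count its digits directly.
value = math.factorial(69)
digits = str(value)
digit_count = len(digits)

# Trailing zeros come from the factors of 5 (paired with 2s) in 1..69.
trailing_zeros = len(digits) - len(digits.rstrip("0"))

print(digit_count, "digits,", trailing_zeros, "trailing zeros")
```

That comes to 99 digits, which is a long stretch for anyone to read aloud without pausing for air.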