AI Communications Google Software Science Technology

Google's Voice-Generating AI Is Now Indistinguishable From Humans (qz.com) 101

An anonymous reader quotes a report from Quartz: A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text. The system is Google's second official generation of the technology, and it consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet's AI research lab DeepMind, which reads the chart and generates the corresponding audio. The Google researchers also demonstrate that Tacotron 2 can handle hard-to-pronounce words and names, and can alter its enunciation based on punctuation and capitalization. For instance, capitalized words are stressed, as a speaker would do to indicate that a specific word is an important part of a sentence. Quartz has embedded several examples in its report, each pairing a sentence generated by the AI with the same sentence read aloud by a human hired by Google. Can you tell which is the AI-generated sample?
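To make that two-stage pipeline concrete, here is a minimal, runnable Python sketch of the spectrogram half of the description: it computes a mel spectrogram (the frequencies-over-time chart) from a speech clip and then inverts it back to audio. Griffin-Lim serves here as a classical stand-in for the WaveNet vocoder stage, and "speech.wav" is a placeholder input file; the neural vocoder exists precisely because this classical inversion sounds noticeably worse.

```python
# Sketch of the spectrogram <-> audio relationship described above.
# Griffin-Lim is a classical stand-in for the neural (WaveNet) vocoder;
# "speech.wav" is a placeholder for any short speech recording.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=22050)

# Stage 1 analogue: a mel spectrogram, i.e. audio frequencies over time.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Stage 2 analogue: turn the chart back into a waveform (Griffin-Lim).
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256)
sf.write("reconstructed.wav", y_rec, sr)
```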
This discussion has been archived. No new comments can be posted.


  • Not so much (Score:5, Informative)

    by smallfries ( 601545 ) on Wednesday December 27, 2017 @08:07AM (#55814515) Homepage

    Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.

    • by Anonymous Coward

      Heck, a good number of the ads I hear on radio have unnatural timing. Even a politician on a teleprompter sounds unnatural to me. Lots of people are bad (or untrained) at sounding natural as they read from copy.

      • Heck, a good number of the ads I hear on radio have unnatural timing.

        Part of that is because audio can now be digitally sped up without a corresponding pitch change, which obviates the need to hire actors like John Moschitta Jr. [wikipedia.org] to read the terms, conditions, warnings, etc., at the end of an ad. I'm starting to suspect some agencies compress the entire ad in this manner to try to fit in more content without their actors sounding out of breath.
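For the curious, the effect described above is easy to reproduce: a phase-vocoder time stretch changes speed without changing pitch. A minimal sketch using librosa, with "ad_read.wav" as a placeholder input file:

```python
# Speed speech up ~30% without raising its pitch (phase-vocoder stretch).
import librosa
import soundfile as sf

y, sr = librosa.load("ad_read.wav", sr=None)      # placeholder input clip
fast = librosa.effects.time_stretch(y, rate=1.3)  # rate > 1 means faster
sf.write("ad_read_fast.wav", fast, sr)
```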

      • "Even a politician on a teleprompter sounds unnatural to me."

        But some of them 'have the best words', or so they say.

    • Despite choosing a low-quality human comparison (the audio fidelity is fine, but the timing and pronunciation are terrible), it is still quite obvious which is which. The synth version is slightly too clipped and the timing does not sound natural.

      Funny thing is, I thought both samples sounded more like a computer than a human.

      • by kwoff ( 516741 )
        The voice reminded me of the narrator for "Physics Videos by Eugene Khutoryansky" [youtube.com]. Several people have asked in that channel's comments section if it is computer-generated, but it's claimed to be a woman named Kira. AFAICT, it's a voice actor, Kira Vincent. It makes me wonder if Google had her pronounce things, and her pronunciation just happens to be somewhat synthetic-sounding :) (though I looked quickly at the research paper and didn't find a mention of "Kira" or a name for the voice).
    • Still easy to distinguish. Just wait a few seconds and then try to interrupt and see if it stops talking.
    • I remember the Final Fantasy movie claiming its CGI characters were indistinguishable from real people, when in fact they hit the uncanny valley very hard.
      The problem I expect with the audio is, like with CGI, that it is a bit too perfect and misses human imperfections. A computer doing a voice will do the voice it is supposed to do, while a narrator, even an expert at the craft, is affected by their emotions. What they read moves them emotionally, and that response comes through in their voice.
      Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.

      • I remember the Final Fantasy movie claiming its CGI characters were indistinguishable from real people, when in fact they hit the uncanny valley very hard. The problem I expect with the audio is, like with CGI, that it is a bit too perfect and misses human imperfections. A computer doing a voice will do the voice it is supposed to do, while a narrator, even an expert at the craft, is affected by their emotions. What they read moves them emotionally, and that response comes through in their voice. Much like how CGI characters, even perfectly rendered ones, just don't show the details of the emotions.

        Still ... "it took over a hundred questions with Rachel, didn't it??"

        • Which never made sense to me. All through the movies, artificial organisms have serial numbers, as did the Nexus 8 in 2049. Couldn't Deckard just sample Rachel's DNA? He could probably do it with a handheld reader by that time.

      • That makes sense. Our speaking apparatus, the muscles and nerves and whatnot are modulated by the emotions running through us at the moment. At the same time our own listening apparatus is trained through endless repetition to catch many of those modulations and identify them, consciously or not. For AI speech to be "indistinguishable from humans" it would need to simulate modulation by emotions which depend on the person and the context.

  • by Steve Jackson ( 4687763 ) on Wednesday December 27, 2017 @08:25AM (#55814565)
    Robocalls! :-D
    • by Megane ( 129182 )
      Wake me up when they can answer out-of-band questions like "What is today?", or respond in a human way to talking over their script with "Hello? Hello? Hello? Hello?" I'm not saying it won't happen, but for now, those are the fastest ways to fail them on a Turing test. When they figure those out, I'll move up to a next level of ez-fail questions.
    • Welcome? I've been in that world for years. Anyway, most robocalls play a recording of an actual human voice, so I fail to see what they'd gain by using a synthesizer. I doubt that *recording the message* is the thing that limits their profits.
  • Just yesterday we saw a thread about someone giving Alexa the skills to ask questions. Now we see Google Home is answering them. Set one against the other and watch the fun!
  • I'm going to guess this is with an American accent. I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland. (It's something I find a little irritating when using Google Maps for navigation.)

    • by Anonymous Coward

      As speech synthesis rises in usage, my guess is evolution will eliminate harder accents like the Irish, Jamaican, Cuban, etc. It will also eventually eliminate plosive sounds and the like. The language we speak will end up drifting toward how these systems speak, because they'll be more ubiquitous.

    • by jrumney ( 197329 )
      Have you tried setting your default language to English (Ireland) or English (UK)? (they seem to both be the same South-East England accent) The way they pronounce kilometers is definitely different than the US English voice.
    • by chill ( 34294 )

      No need to guess, it says so right in the last paragraph of the article.

      However, the system is only trained to mimic the one female voice; to speak like a male or different female, Google would need to train the system again.

      Training against different accents is something that would easily be within Google's reach, once they're satisfied with the main product.

      • I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary. Audiobooks with Whispersync would be especially useful.

        • by chill ( 34294 )

          I read some time back that, when first working on their Translate application, Google contracted with the United Nations for access to their professional translation archive: thousands of samples of source material and professional translations in dozens of different languages.

          If that included voice recordings as well as written translations, it could be the solution to the problem of training material. Not regional accents, of course, but still, a big leg up.

        • by EvilSS ( 557649 )

          I would add that the volume of training material is huge and varied. Though one imagines that Amazon have easier access to the material through their Audible subsidiary. Audiobooks with Whispersync would be especially useful.

          The problem is neural networks can be unpredictable in their response to training. Start feeding it different voices and it might just start averaging them out, or start doing the voice equivalent of code switching. That would be really weird to listen to.

          Also, don't go getting the authors' guilds and voice actors all riled up. They'll be suing preemptively.
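For what it's worth, the usual mitigation for the "averaging" worry raised above is to condition a multi-speaker synthesizer on a learned per-speaker embedding, so one network keeps many voices apart. A toy PyTorch sketch of the idea; the shapes and class names are illustrative, not anything from Google's paper:

```python
# Toy sketch of speaker-embedding conditioning: the same text features
# decode to different voices depending on the speaker ID. Illustrative only.
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    def __init__(self, n_speakers, text_dim=256, spk_dim=64, n_mels=80):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)
        self.decoder = nn.Linear(text_dim + spk_dim, n_mels)  # toy decoder

    def forward(self, text_feats, speaker_id):
        # Broadcast the speaker vector across every text frame, so the
        # same sentence renders differently for different speaker IDs.
        spk = self.speaker_emb(speaker_id)                  # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        return self.decoder(torch.cat([text_feats, spk], dim=-1))

model = MultiSpeakerTTS(n_speakers=10)
mel = model(torch.randn(1, 50, 256), torch.tensor([3]))  # speak as voice #3
```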

    • I've yet to hear a Google voice that says "kilometres" in the same way we do in Ireland.

      Nobody else says anything the same way you do in Ireland.

  • by Anonymous Coward
    I'm impressed with the progress, but annoyed at how the results are oversold. First, they seem to have asked the human comparison speaker to sound like a robot, and she succeeded, but credit for that doesn't go to the robot. Second, they only demonstrated sentences that fit in one breath. The way humans read a paragraph or a book chapter requires us to adjust our pauses for breath and our pacing to the content being read. I expect that Google know this and are working on it, and to be fair to them, it was s
  • Breath (Score:5, Insightful)

    by lazarus ( 2879 ) on Wednesday December 27, 2017 @09:02AM (#55814699) Journal

    One thing that seems to be missing from all of these is a programmatic understanding of how much air is in the lungs.

    "Alexa, what is 69! (factorial)"

    Listen in amazement as she rhymes off the number but then enters the uncanny valley around the time she should be taking a breath...
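As a worked example of why that request is a breath-killer: 69! is a 99-digit number (famously the largest factorial a pocket calculator with a two-digit exponent can hold), so reading every digit aloud takes far longer than a single human breath.

```python
# How many digits would the assistant have to read in one "breath"?
import math

n = math.factorial(69)   # exact, via Python's arbitrary-precision integers
print(n)
print(len(str(n)))       # 99 digits
```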

  • by burhop ( 2883223 ) on Wednesday December 27, 2017 @10:58AM (#55815489)

    Hey google, read all slashdot comments to me with a sarcastic tone.

  • I do not like it. It is unsettling.
  • When I was a kid, 35 years ago, I had a TI-99/4A home computer with a speech synthesizer (which was actually 5-year-old tech at the time). Sure, it didn't sound great, but it was totally understandable. With the Terminal Emulator II cartridge you could build words from phonemes directly and thus have it say any English word, not just words from its predefined "dictionary" of words it knew how to pronounce already. That was 35 years ago, with a consumer-grade home computer running at 3 MHz, that a 10 year old

    • Replaying pre-recorded phonemes is an adequate solution for poor quality speech, but you can't extend that method to reach high quality. In order to do that, you have to start over from scratch, using a much more difficult method.
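A minimal sketch of that older phoneme-concatenation approach, assuming a hypothetical directory of pre-recorded phoneme clips ("hh.wav", "eh.wav", and so on; the file names, phoneme set, and 22050 Hz sample rate are all made up for illustration). The naive joins are exactly why this method plateaus at "understandable but robotic":

```python
# Naive concatenative synthesis: string pre-recorded phoneme clips together.
# No blending at the joins, hence the characteristic robotic sound.
import numpy as np
import soundfile as sf

def synthesize(phonemes, clip_dir="phonemes"):
    # sf.read returns (samples, samplerate); keep just the samples.
    chunks = [sf.read(f"{clip_dir}/{p}.wav")[0] for p in phonemes]
    return np.concatenate(chunks)

audio = synthesize(["hh", "eh", "l", "ow"])  # roughly "hello"
sf.write("hello.wav", audio, 22050)          # assumes clips recorded at 22050 Hz
```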

  • I would think if they were trying to showcase their technology they would have chosen someone with a less "robotic" voice to copy. I guess they just wanted someone who spoke very clearly?
  • Imagine if every book could be accessed by those who want to listen instead of read! Not a trivial development at all.
  • A research paper published by Google this month -- which has not been peer reviewed -- details a text-to-speech system called Tacotron 2, which claims near-human accuracy at imitating audio of a person speaking from text.

    If anyone remembers "reading groups" from primary school, there is a pretty big range covered by the term "human-accurate reading".

  • Good enough for Hawking, maybe.

    I'd prefer a nice high-class British female voice, or Paul Bettany as Jarvis.

  • I think it might be more realistic to say that Google and a speaker speaking in a monotonous, robotic way are pretty much indistinguishable from one another. They both sound robotic to me. When it can imitate what people really sound like, normal people, then talk to me. Not that this isn't cool, but from the cursory bits I read and heard, it seems over-hyped.
  • by BradMajors ( 995624 ) on Wednesday December 27, 2017 @05:20PM (#55818361)

    In a few years, AI will progress so far that it will sound more human than humans.

  • I like Australian Siri and wish Alexa would offer similar accents. $0.02

"There is no statute of limitations on stupidity." -- Randomly produced by a computer program called Markov3.

Working...