Phoneme Approach For Text-to-Speech in SCIAM

jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme-based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."

Comments Filter:
  • cool (Score:1, Interesting)

    by Graspee_Leemoor ( 302316 ) on Monday March 17, 2003 @08:14AM (#5528081) Homepage Journal
    Whoa- finally something better than what we've had for years.

    Try "I never promised you a rose garden." -The speaker sounds genuinally pissed-off!

    graspee

  • by jkrise ( 535370 ) on Monday March 17, 2003 @08:28AM (#5528128) Journal
    Text-to-speech and vice versa take a lot of memory and CPU time, both of which get cheaper as time goes on. Given the market potential for these apps, surely their quality and availability should have been much, much better by now.

    Is MS carrying any patents on this, and acting Dog-In-The-Manger..ish? Any good low-footprint Linux-based apps for text-to-speech?
  • by anubi ( 640541 ) on Monday March 17, 2003 @08:29AM (#5528130) Journal
    About 30 years ago, I built a voice synthesizer for my IMSAI-8080 based on the General Instruments SC-01 Phoneme Synthesizer chip, which was available at that time from Radio Shack.

    I googled for +"General Instrument" +"SC-01" and got links shown here [http].

    I think Votrax was in bed with General Instrument, as they have another chip by the same name that apparently does the same thing, but I do remember mine was a GI part.

    It turns out all speech is nothing but sequences of utterances (vowel and syllabic sounds). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.

    I know IBM is refining this, but the concept is really old hat.
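
    For the curious, the naive version of "string them together" is literally concatenating per-phoneme recordings; the "very carefully" part is things like crossfading the joins so they don't click. A toy Python sketch (the phonemes/*.wav layout and phoneme names are made up for the example; uses numpy):

    # Toy concatenative synthesis: join per-phoneme recordings with
    # short crossfades. Assumes 16-bit mono WAV files, one per phoneme.
    import wave
    import numpy as np

    RATE = 11025             # sample rate of the phoneme recordings
    FADE = int(0.01 * RATE)  # 10 ms crossfade at each join

    def load(phone):
        with wave.open("phonemes/%s.wav" % phone, "rb") as w:
            raw = w.readframes(w.getnframes())
        return np.frombuffer(raw, dtype=np.int16).astype(np.float64)

    def say(phones):
        out = load(phones[0])
        ramp = np.linspace(0.0, 1.0, FADE)
        for p in phones[1:]:
            nxt = load(p)
            # overlap-add the join so the splice doesn't click
            out[-FADE:] = out[-FADE:] * (1 - ramp) + nxt[:FADE] * ramp
            out = np.concatenate([out, nxt[FADE:]])
        return out.clip(-32768, 32767).astype(np.int16)

    # "hello" as a rough phoneme sequence
    pcm = say(["HH", "EH", "L", "OW"])
    with wave.open("hello.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(RATE)
        w.writeframes(pcm.tobytes())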

  • TTS is great (Score:4, Interesting)

    by jjohn ( 2991 ) on Monday March 17, 2003 @08:31AM (#5528134) Homepage Journal
    Last year, I started playing with this IBM tech. I thought it would be cool to have RSS feeds read to you in the middle of streamed music. It's kind of do-it-yourself radio. Although I don't have anything to show for that idea, I did make a few songs with it, like Make the Pie Higher [taskboy.com], Plug Nickle [taskboy.com] and Progress [taskboy.com].

    mmm. I hope the server can take a slashdotting...

    The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:

    #!/bin/bash
    # Make a text file into a spoken MP3

    if [ -z "$1" ]; then
        echo "usage: $0 <input.txt>"
        exit 1
    fi

    base=`basename $1 .txt`
    echo "attempting to create $base.mp3"

    # render the text to temp.au, convert to WAV, then encode as MP3
    /home/jjohn/src/c/viavoice/cmdlinespeak/speakfile $1
    writewav.pl temp.au temp.wav
    lame -h temp.wav $base.mp3
    rm -f temp.au temp.wav

    speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(

    It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.

    #!/usr/bin/perl
    use Audio::SoundFile;
    use Audio::SoundFile::Header;

    my $BUFFSIZE = 16384;
    my $ifile = shift || usage();
    my $ofile = shift || usage();
    my $buffer;
    my $header;

    my $reader = new Audio::SoundFile::Reader($ifile, \$header);
    $header->{format} = SF_FORMAT_WAV | SF_FORMAT_PCM;
    my $writer = new Audio::SoundFile::Writer($ofile, $header);

    # copy the audio across in chunks; the rewritten header
    # takes care of the AU-to-WAV conversion
    while (my $length = $reader->bread_pdl(\$buffer, $BUFFSIZE)) {
        $writer->bwrite_pdl($buffer);
    }

    $reader->close;
    $writer->close;
    exit(0);

    sub usage {
        print <<EOT;
    usage: $0 <infile> <outfile>
    EOT
        exit(1);
    }


  • by Anonymous Coward on Monday March 17, 2003 @08:48AM (#5528181)
    Methinks another case of /.ers obtaining their scant science knowledge from bad TV and movie sci-fi (real SF comes in books!)

    Anybody willing to write "The Extended Phoneme?"
    Homer Simpson perhaps....
  • by aseidl ( 656884 ) on Monday March 17, 2003 @08:56AM (#5528215)
    I'm surprised by how many people (Mac users and otherwise) haven't noticed how long MacOS has shipped with text-to-speech. It's been included since at least MacOS 7.5, maybe even 7.0 (I was using it on my trusty ol' IIci yesterday). You could use it via SimpleText or even have it speak the text of dialog boxes. The quality of the voices could be better, but they do seem better than Festival. But, I have to admit it is pretty fun to scare people who don't know about it. One of my friends told me that his mother gets scared if she doesn't click OK or Cancel in a dialog because "those voices are going to come."
  • Old news (Score:3, Interesting)

    by payndz ( 589033 ) on Monday March 17, 2003 @08:58AM (#5528226)
    Text-to-speech? Come on, this has been around for donkey's years - maybe the computer voice doesn't sound like Majel Barrett yet, but it's hardly new and amazing stuff.

    I want to know what's going on with speech-to-text, and whether I'll be able to dictate rather than type a novel any time soon. (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")

  • by Anonymous Coward on Monday March 17, 2003 @09:04AM (#5528247)
    Hey, maybe the IT industry should take a note from us musicians for a change (excuse the pun)...

    With sampling technology, especially multisampling, where for example each note can have different sounds associated with it depending on the accent, you could achieve some really stunning results in the text-to-speech market.

    People like EastWest [eastwestsounds.com] have created such systems for virtual choirs... check out Voices Of The Apocalypse [soundsonline.com], as this is a pretty basic but revolutionary way of using samplers...
  • by Sheriff Fatman ( 602092 ) on Monday March 17, 2003 @09:20AM (#5528321) Homepage

    Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al. can create completely convincing digital characters, but (leaving aside the issue of how much a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.

    OK, the point of this article is on-demand realtime speech synthesis - roughly analogous to the 3D engines used in games. It has to compromise quality and detail for the sake of speed and responsiveness. Could there be a market for 'voice rendering' - a system which can take a script (possibly with some additional mark-up to indicate emotions, emphasis, timing, etc.) and generate an audio version which approaches a reading of the same script by a competent voice actor?
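
    The mark-up side of that is already in the works, incidentally - the W3C has been drafting a Speech Synthesis Markup Language (SSML) for roughly this purpose. A hypothetical fragment of an annotated script might look something like this (the element names are real SSML; the content is invented):

    <speak>
      <voice name="narrator">
        It was quiet. <break time="800ms"/>
        <prosody rate="slow" pitch="low">Too quiet.</prosody>
        <emphasis level="strong">Run!</emphasis>
      </voice>
    </speak>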

    As well as the obvious 'virtual thespian' Hollywood angle, I'm thinking about stuff like low-budget audio drama [slashdot.org] - people who have the time and the technology to tweak a voice script, but can't afford professional actors to do it the old-fashioned way. There could be applications for creating audio books for the visually impaired, or to make life easier for students working through Shakespeare and Chaucer - I'm still amazed every time I hear Shakespeare read out loud how much meaning can be conveyed by the nuances of the human voice, compared to dry printed prose.

    Is anyone actively working on anything like this? If not, why not? Is it really that hard to fool the human ear? Or is it just a case that it's still cheaper and easier just to employ people to read things into a mic?

    --
  • by Mandrake ( 3939 ) <mandrake@mandrake.net> on Monday March 17, 2003 @11:14AM (#5528924) Homepage Journal
    We've also been doing this for quite some time. You can check out the Cepstral On-Line High Quality Synthesis Demos [cepstral.com], as well as our High Quality Limited Domain Demos [cepstral.com].
  • by wcb4 ( 75520 ) on Monday March 17, 2003 @11:46AM (#5529114)
    I have actually used TextAloud MP3 (from NextUp) to read Project Gutenberg e-texts aloud. It's not perfect, far from it, but it gets better over time since you can correct mispronunciations (my exceptions file now has about 200 entries). The program is a Windows front end to ANY installed text-to-speech engine, be it Microsoft's or L&H's or AT&T's. I often have it read into MP3 files, which I burn onto CDs and listen to on the way to work. I can usually get about 5-6 full books on a single CD, and it's free (well... once you spend the $50 for the software and the TTS engine and the high quality voices).
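
    The exceptions file is conceptually just a list of respellings applied before the text ever reaches the engine. A sketch of the idea in Python (the one-pair-per-line format is made up for illustration, not TextAloud's actual format):

    # Apply a pronunciation-exceptions file before handing text to a
    # TTS engine. Assumed format: one "word=respelling" pair per line.
    import re

    def load_exceptions(path):
        rules = {}
        for line in open(path):
            line = line.strip()
            if line and "=" in line:
                word, respelling = line.split("=", 1)
                rules[word.strip().lower()] = respelling.strip()
        return rules

    def fix_pronunciations(text, rules):
        # replace whole words only, case-insensitively
        def sub(match):
            return rules.get(match.group(0).lower(), match.group(0))
        return re.sub(r"[A-Za-z']+", sub, text)

    rules = load_exceptions("exceptions.txt")  # e.g. "Goethe=Gerta"
    print(fix_pronunciations("Goethe wrote Faust.", rules))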
  • by MrScience ( 126570 ) on Monday March 17, 2003 @12:36PM (#5529507) Homepage
    This was used in Mission to Mars for the spaceship's voice. The director was planning to use sound FX to create one from a human voice, then found AT&T's product, which was a perfect fit [att.com].

    I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.
  • by silentbozo ( 542534 ) on Monday March 17, 2003 @02:12PM (#5530258) Journal
    Apple's TTS technology is pretty old... and it shows. I've been waiting for them to release voice upgrades since the original PowerPC Macs came out, but after they axed their (basic) research section, the likelihood of that happening decreased dramatically. The IBM approach is also pretty old, but the voice quality is slightly better, probably because there are more voice samples/higher quality.

    No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded speaker - you cannot synthesize a brand new voice (with on-the-fly inflections that were never recorded, etc.) with that TTS method. There has been research into modeled speech synthesis [ogi.edu], where a mathematical model of the lungs, windpipe, vocal cords, and mouth/tongue/lips is manipulated in order to generate speech. Given the extreme amount of computing power available today, you'd expect more people to use that type of TTS, since it's inherently more flexible. However, the biggest problem so far is that nobody really has a good model for how all the various fleshy parts of the human speech apparatus interact. Any open source people want to tackle this problem and start implementing some of these modeled synthesis speech algorithms?
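
    To give a feel for what "modeled" means here: the simplest version is a source-filter model - an impulse train standing in for the vocal cords, pushed through a few resonators standing in for the vocal tract. A toy Python sketch of a static vowel (the formant figures are rough textbook values for /a/; the pitch and bandwidths are arbitrary choices for the demo):

    # Toy source-filter synthesis of a static vowel: a glottal impulse
    # train filtered through a cascade of two-pole formant resonators.
    import wave
    import numpy as np
    from scipy.signal import lfilter

    FS = 16000
    F0 = 110                                          # pitch, Hz
    FORMANTS = [(730, 90), (1090, 110), (2440, 170)]  # (freq, bw) Hz

    # source: one impulse per glottal period, one second long
    source = np.zeros(FS)
    source[::FS // F0] = 1.0

    # filter: one resonator per formant
    signal = source
    for freq, bw in FORMANTS:
        r = np.exp(-np.pi * bw / FS)
        theta = 2 * np.pi * freq / FS
        a = [1.0, -2 * r * np.cos(theta), r * r]
        signal = lfilter([1.0 - r], a, signal)

    pcm = (0.5 * signal / np.abs(signal).max() * 32767).astype(np.int16)
    with wave.open("vowel.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(FS)
        w.writeframes(pcm.tobytes())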
  • by HoldmyCauls ( 239328 ) on Monday March 17, 2003 @02:43PM (#5530497) Journal
    I'm taking a Linguistics course this semester, and I've always found things like this interesting. You make several good points, but I feel that, like most doubters, you oversimplify, treating trial as inevitable failure. You have to be careful when saying things like "Linux won't catch on," "Artificial Intelligence won't happen," or "phonemes are too hard to separate."

    In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these will be the same algorithm in reverse will be for the future to decide).

    "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

    Right there, you've laid out a *very complicated* but by no means impossible way of looking at phonemes individually. In my class, we have some 20 people who all have great difficulty just figuring out allomorphs, which some Slashdotters might not know are variant realizations either in complementary distribution, as with the English plural:

    the /s/ sound at the end of /kæts/ {cats}
    the /z/ sound at the end of /kIdz/ {kids}
    the /Iz/ sound at the end of /mætʃIz/ {matches}

    or in free variation, such as the names Lisa/Liza, which mean the same thing and are derived from the same root, but which have split for geographical/cultural/other reasons.
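
    (Since this is Slashdot: the complementary-distribution rule above is small enough to write as code. A sketch, with the phoneme classes simplified and written in rough ASCII:)

    # Pick the English plural allomorph from the stem's final sound.
    SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}
    VOICELESS = {"p", "t", "k", "f", "th"}

    def plural_allomorph(final_phoneme):
        if final_phoneme in SIBILANTS:
            return "Iz"   # matches -> /mætʃIz/
        if final_phoneme in VOICELESS:
            return "s"    # cats    -> /kæts/
        return "z"        # kids    -> /kIdz/

    for stem, final in [("cat", "t"), ("kid", "d"), ("match", "ch")]:
        print(stem, "+ plural ->", plural_allomorph(final))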

    Now, where the average English major might not always recognize similarities and patterns, the average Slashdotter has trained him/herself to do so, and some are likely saying to themselves, "where else does this happen?" and "where is this not true?", which are useful, scientific questions.

    You yourself present the answer to the problem you raise: we have to look at the surrounding phonemes in order to figure out how to make one particular sound fit the word it's in. This is *damn hard*, but not impossible. It's like the fact that stress affects a phoneme in certain languages: we just need to adapt to thinking about language in different terms than simply speaking it and spelling vague representations of it (by first realizing how vague those representations are, which is why the phoneme set is taught first in Linguistics classes).

    Personally, I think the problem lies in the fact that we all want TTS/STT and we want it *now!* - why can't the computer just say it or hear it the way we do? - along with all the other questions that come from a lack of understanding, both of how the machine represents everything and of the garbled way in which our language is represented. Phonemes are the obvious solution: the software should only have to do STP/PTS conversion, and our language should really conform to that, since it's the creative dialectal shifts that create a problem - but we'll end up devising a creative solution for that, too.

    Now, we all know what happens with lossy compression...

    Yes, we get a slightly inaccurate but highly useful jpeg of the Andes, or someone's new desktop widget set, or a very listenable 192kbps mp3 of "Hurt" covered by Johnny Cash (even sadder than the original, IMHO).

    And TTS/STT will have its flaws as well, but a digital (though wide) set of sound symbols like phonemes will help us to break things down somewhat until we figure out that something *smaller* about those sounds is very functional, *and* how to represent *that* level of speech, just as we represented matter by some informal type, then by molecules, then atoms, and now we know quite a bit about how the electron, proton and neutron work, and are working on a smaller level.

    To say you "don't really believe in phonemes" oversimplifies the problem.
  • by decrocher ( 444733 ) on Monday March 17, 2003 @03:57PM (#5531129) Homepage
    I think it is widely recognized that you need to take coarticulation and _meaning_ into account when converting between speech and text.

    You argued in another post for models of 4+ phonemes. We don't see this because it's not a huge theoretical leap from triphones (and thus bores researchers), and because there are computational/storage/training efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-phones, you may be better off trying to understand the phrase from a higher level of abstraction.
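
    For readers who haven't met triphones: each phoneme gets modeled in the context of its left and right neighbours, so the unit inventory grows roughly with the cube of the phoneme set - that's the storage/training cost I mean. The usual expansion, sketched in Python (the "left-center+right" naming is the common convention; the code itself is just illustrative):

    # Expand a phoneme sequence into context-dependent triphone units.
    def triphones(phones):
        units = []
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else "sil"
            right = phones[i + 1] if i < len(phones) - 1 else "sil"
            units.append("%s-%s+%s" % (left, p, right))
        return units

    # "speech" as a rough phoneme string
    print(triphones(["s", "p", "iy", "ch"]))
    # -> ['sil-s+p', 's-p+iy', 'p-iy+ch', 'iy-ch+sil']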

    Have we correctly identified the right compact expression of speech? I doubt it. Getting speech stuff to work involves a lot of tweaking that is theoretically ungrounded. Tweaking in a methodical and science-biased way _is_ engineering, however.

    BTW, I seem to remember a prof saying that X-ray cinematography more-or-less proved the existence of vocal tract target configurations in speech, which correspond to phonemes. Not to mention that you can encode a message in IPA and have it understood by someone else. Even if they're not totally correct, phonemes may be a sufficient basis for building speech systems.

Say "twenty-three-skiddoo" to logout.

Working...