IBM Science

Phoneme Approach For Text-to-Speech in SCIAM

Posted by Hemos
from the understanding-the-language dept.
jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."
  • by watzinaneihm (627119) on Monday March 17, 2003 @08:04AM (#5528053) Journal
    Does the poster have something against IBM, linking an application directly from a Slashdot post?
    Even the guys who don't read the articles might now be tempted to try clicking the "enter text" link.
  • by Tucan (60206) on Monday March 17, 2003 @08:04AM (#5528054)
    Phonemes are the building blocks of language, not phenomes.
    • by Anonymous Coward
      Methinks another case of /.ers obtaining their scant science knowledge from bad TV and movie sci-fi (real SF comes in books!)

      Anybody willing to write "The Extended Phoneme?"
      Homer Simpson perhaps....
  • by LeoDV (653216) on Monday March 17, 2003 @08:04AM (#5528055) Journal
    If memory serves, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case with this project.

    What's so special about it?
    • by Rubyflame (159891) on Monday March 17, 2003 @09:03AM (#5528243) Homepage
      Used to? Still does! It's called "AT&T Natural Voices," and there's an online demo [].
      • I thought that the IBM one was better. The acoustic stuff seemed to be about the same, but the intonation on the IBM one was a lot nicer for the two samples I tried.

        Incidentally, they don't seem to have improved a great deal from the concatenative TTS systems IBM had 4 years ago. There was one model of the UK marketing woman for ViaVoice, and for some sentences the TTS was almost indistinguishable from the real thing. The only problem with these systems is that the memory footprint is massive, so they tak
      • This was used in Mission to Mars for the spaceship's voice. The director was looking to do some sound FX to create one from a human voice, then found AT&T's product which was a perfect fit [].

        I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.
      • I'm working on a project involving voice synthesis, so we've been shopping around and evaluating different systems.

        We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the slashdot linux fanboy publicity for all it was worth, th

      • SpeechWorks also offers a high-quality network telephony concatenative TTS engine, called Speechify []. We also offer a formant-based TTS engine, as well as an embedded TTS one based on Speechify. See some demos here. []

        We also offer quite a large range of languages. Our Canadian French voice, which was just released, is fantastic! Looks like marketing hasn't put him on the demo page yet though... :(


    • I might be biased as an IBMer, but the IBM one sounds better to me. Both are certainly better than the one included with Notes Buddy [], which is all the rage in IBM right now since it is so much better than our previous IM tool.
    • We've also been doing this for quite some time. You can check out the Cepstral On-Line High Quality Synthesis Demos [], as well as our High Quality Limited Domain Demos [].
      • Much worse than the AT&T version. The words are run together too much.

        Haven't gotten the IBM one to work yet.
        • Our synthesizer runs in a very, very small fraction of the footprint (memory and disk space) of the AT&T synthesizer. The AT&T synthesizer is also based on earlier work from our CTO (the AT&T synthesizer is ultimately just Festival with some other code on top of it).
  • by trelanexiph (605826) on Monday March 17, 2003 @08:06AM (#5528061) Homepage
    I guess IBM didn't have much to say on the matter.

    IBM Text-to-Speech Research Demonstration

    Input Communications Error.

    You have reached this page because of a severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.

  • ...if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work.

    With the right technical document, this could cure insomnia as well...

    • if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work.

      You should check out the Digital Talking Book specs. It is an open format and there are readers available which allow text-to-speech and other effects. Most of the readers have been designed with visually impaired target audien
    • by wcb4 (75520)
      I have actually used TextAloud MP3 (from NextUp) to read Project Gutenberg e-texts aloud. It's not perfect, far from it, but it gets better over time since you can correct mispronunciations (my exceptions file now has about 200 entries). The program is a Windows front end to ANY installed text-to-speech engine, be it Microsoft's or L&H's or AT&T's. I often have it read into MP3 files, which I burn onto CDs and listen to on the way to work. I can usually get about 5-6 full books on a single CD, and it's free (
    • What would be REALLY funny is a tts / voice recognition battle between different computers - maybe running an eliza type system. As it messes up on the recognition, things could go down hill fast... :-)
  • by texchanchan (471739) <> on Monday March 17, 2003 @08:10AM (#5528073)
    Phoneme, a unit of sound in a word. From []: "The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. [... from Greek phōnēma, phōnēmat-, utterance, sound produced, from phōnein, to produce a sound, from phōnē, sound, voice...]"

    Related to "telephone," "phonics," etc.
  • by Anonymous Coward on Monday March 17, 2003 @08:13AM (#5528079)
    If you visit here:

    You'll find AT&T's version a whole lot better. The main problem with voice synthesis is the smoothing of phoneme edges; if it is done too aggressively, the synthesized speech can sound too "lumpy".

    The other thing is, speech synthesis via phonemes is very basic practice indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90's.
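    The edge-smoothing trade-off described above can be sketched as a simple linear crossfade between two phoneme samples. This is an illustrative toy, not any shipping synthesizer's code; the function name and sample lists are made up, and the "overlap" length stands in for how aggressively the splice is smoothed:

```python
# Toy sketch: splice two phoneme sample buffers with a linear crossfade.
# A longer overlap smears the phoneme edge more (the "lumpy" effect the
# comment above describes); overlap 0 is a hard, clicky concatenation.

def crossfade(a, b, overlap):
    """Join sample lists a and b, linearly crossfading `overlap` samples."""
    if overlap == 0:
        return a + b
    head, tail_a = a[:-overlap], a[-overlap:]
    tail_b, rest = b[:overlap], b[overlap:]
    mixed = [
        ta * (1 - i / overlap) + tb * (i / overlap)  # fade a out, b in
        for i, (ta, tb) in enumerate(zip(tail_a, tail_b))
    ]
    return head + mixed + rest
```

    The joined signal is shorter than the two inputs by the overlap length, since the faded tails share time.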
    • The Currah speech unit for the Spectrum was hilarious. It came with a free game which was supposed to say "The Banshee wails at you but nothing happens".

      It actually sounded like "Shbansheehailsacthoowawaaaawaaaens"

      I remember you could also turn it on while you were programming, so every time you pressed a key it would say "ONE ZERO PRINT QUOTE ACH EE ELL ELL O QUOTE ENTER TWO ZERO ENTER RUN ENTER". It used to drive me batty. It was one of those eighties things which you thought was "cool" at the time, but

    • by Anonymous Coward
      The IBM product seems to take the recording of a long text read by a human and automatically produce the data collection that is the artificial voice. It uses speech recognition methods to align text and recording. It also stores more than just a simple collection of phonemes: where older text-to-speech solutions would modify the sample of a phoneme to reflect a certain position in a sentence, IBM's solution appears to use a phoneme sample from the same context, making the result much less monotone. This app
    • The way to smooth out the lumps is to not use phonemes at all, but diphones. Imagine recording two phonemes uttered by a human speaker in sequence, and then slicing through the middle of each phoneme and discarding the ends. That gives you a diphone. Diphones are far superior because phonemes do not change in the middle, so there are no "lumps" at the splice. On the other hand, phonemes do change depending on what phoneme is uttered next, simply because in articulating different phoneme sequences the
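    The diphone idea in the comment above can be sketched in a few lines, assuming phonemes arrive as lists of samples. Everything here (the names, the sample values) is hypothetical, purely to show where the cuts fall:

```python
# Toy sketch of diphone concatenation: cut through the *middle* of each
# recorded phoneme and keep the transition between them, so that splices
# land where the signal is stable rather than at a phoneme boundary.

def extract_diphone(phone_a, phone_b):
    """Keep the second half of phone_a and the first half of phone_b."""
    return phone_a[len(phone_a) // 2:] + phone_b[:len(phone_b) // 2]

def synthesize(phones):
    """Concatenate diphones for a phoneme sequence (lists of samples)."""
    out = []
    for a, b in zip(phones, phones[1:]):
        out += extract_diphone(a, b)
    return out
```

    Note the output starts and ends mid-phoneme; a real system also needs half-phones (silence-to-phoneme diphones) at utterance edges.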
  • cool (Score:1, Interesting)

    Whoa- finally something better than what we've had for years.

    Try "I never promised you a rose garden." -The speaker sounds genuinally pissed-off!


  • by wiggys (621350) on Monday March 17, 2003 @08:16AM (#5528090)

    Some of the voices sound okay I guess. Better than Stephen Hawking anyway.

  • *blush* (Score:5, Funny)

    by WeeBull (645243) on Monday March 17, 2003 @08:22AM (#5528107)
    Uhm, OK, who else just spent 10 minutes (thoroughly) checking whether IBM filters naughty words at the text-to-speech interface? Getting the female voices to utter favourable phrases regarding one's studliness, perhaps?

    Oh ... just me? *blush*

  • hmmmm... (Score:1, Informative)

    by koekepeer (197127)
    festival anyone?

    cut'n paste:
    • this link:

      does the same as IBM's demo page. sounds the same as well. but hey, i'm a layman in linguistic matters, so there's prolly a *huge* improvement i understand crap about
  • by wzrd2002 (596945) on Monday March 17, 2003 @08:23AM (#5528110)
    There is already a freely available open-source speech synthesis application for both Linux and Windows, called Festival [], created by the University of Edinburgh [].
    • I hope it doesn't have a strong Scottish accent, they're hard enough to understand in real life...
    • i was one minute earlier :-) but you'll prolly get the karma, because of the direct links. i am too lazy to type in a href="etcetcetc.

      o wait, this will cost me karma as well! -1 offtopic :-)
    • Festival is great, especially with the OGI patches []. I was completely blown away by Festival's quality compared to other opensource TTS engines, and OGI stuff makes stock Festival sound pathetic. Really great stuff, regrettably still not as good as IBM's or AT&T's stuff, but they have got a TTS that I can listen to hours without making my ears bleed.

      Regrettably OGI patches are for personal/research use only, so Debian won't ship them...

      • That's the problem with BSD-style licenses (under which Festival was released): you may extend and restrictively license the result. I'm still a little surprised that the OGI stuff is for non-commercial use only, although it was at least partly government funded.

        Unfortunately, free TTS (i.e., synthesizing any text, not just replaying canned speech) is a growing area, there will definitely be large commercial potential, and everyone seems to know this.

    • You should also check out CMU Flite [], which is by one of the guys who built Festival. He also works on other, high quality synthesizers at our company, which you can get demos of at our demo site [].
    • The only problem with Festival is that it practically requires a PhD to get it up and running correctly, and the documentation is aimed at the speech synthesis development community, not the end users. The only reason I got mine working was the FreeBSD ports system and running across a reasonably small demo script I could hack to get what I wanted.
      • Doesn't seem that hard...

        # apt-get install festival festvox-poslex festvox-kallpc16k
        # lynx -dump -nolist | festival --tts

        • well...some people are retarded.
        • Eww, you're using the default voices. What you want to do is install the OGI RES LPC pack, the OGI Lexicon, the tll voice, and write a bit of scheme to get the thing configured. For instance, if you want it to just say whever you give it on the command line of a script:
          echo "(voice_tll_diphone) (Parameter.set 'Audio_Method 'freebsd16audio)(SayText \"$*\")" | festival --pipe

          Obviously using whatever sound system you have. By default it will try to use NAS if it is installed on your system, but I've nev
          • Eww, you're using the default voices. What you want to do is install the OGI RES LPC pack, the OGI Lexicon, the tll voice, and write a bit of scheme to get the thing configured.

            Someone who has figured out how to configure that should put it into Debian as a package... then ordinary users could use it.

  • by inblosam (581789) on Monday March 17, 2003 @08:26AM (#5528119) Homepage
    I run Mac OS X, and in a lot of applications you have the option of having the computer read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.

    How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.
    • by aseidl (656884) on Monday March 17, 2003 @08:56AM (#5528215)
      I'm surprised by how many people (Mac users and otherwise) haven't noticed how long MacOS has come with text-to-speech. It's been included since at least MacOS 7.5, maybe even 7.0 (I was using it on my trusty ol' IIci yesterday). You could use it via SimpleText or even have it speak the text of dialog boxes. The quality of the voices could be better, but they do seem better than Festival's. But I have to admit it is pretty fun to scare people who don't know about it. One of my friends told me that his mother gets scared if she doesn't click OK or Cancel in a dialog because "those voices are going to come."
      • "those voices are going to come."

        Maybe that explains the fanatical devotion of Mac users...

        "I do what the voices in my Mac tell me" sounds like a t-shirt begging to be printed up.

      • IIRC, it wasn't standard, but you could get MacinTalk for OS 6. OS 7 shipped with it standard. The default voice is the same one Koko the Gorilla and Stephen Hawking use. IIRC the entire module was 100k in size and left ample CPU time for other projects (like animating Moose lips) on a 16MHz 68020.
      • Apple's TTS technology is pretty old... and it shows. I've been waiting for them to release voice upgrades since the original PowerPC Macs came out, but after they axed their (basic) research section, the likelihood of that happening decreased dramatically. The IBM approach is also pretty old, but the voice quality is slightly better, probably because there are more voice samples/higher quality.

        No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded

      cat -a is even cooler than snoop -a. :)

    • The "How does this compare to Apple's TTS" is really a two part question (at least, I may have missed something).

      The one you probably want answered is which sounds better. At this point the IBM voices sound better than the Apple TTS, but not by very much. Especially when you consider that Apple hasn't improved the voices in over 7 years IIRC (of course, given the choice between better voices and having OS X, I'll forgo the voices). Playing several phrases from IBM's and Apple's TTS systems yields the opinion that
  • by rpiquepa (644694) on Monday March 17, 2003 @08:28AM (#5528122) Homepage
    IBM is not alone in working on text-to-speech technology and in having demos [] where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos []. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.
  • Text-to-speech (and vice versa) takes more memory and CPU time as time goes on. Surely, given the market potential for these apps, their quality and availability should have been much greater by now.

    Is MS carrying any patents on this, and acting dog-in-the-manger-ish? Any good low-footprint Linux-based apps for text-to-speech?
    • Debian has several text-to-speech systems built-in. One of them is Festival, based on a research prototype from Edinburgh. It's a few years behind IBM and ATT, but passable. With more training data, it would get better. There are also several open source speech recognition engines of varying quality, again, mostly derived from university research (I believe Cambridge, CMU, and a few others).

      Up to now, Microsoft has not really made any significant contributions to speech technology. They have bought l

      • They have bought lots of companies and hired away experts from other companies and universities.

        This reminded me of an amusing sideline in the history of speech recognition. Cambridge University Engineering Department (CUED) originally built an engine called HTK []. This was then sold to a company called Entropic. Entropic were then bought by Microsoft, who have licensed HTK back to CUED, who distribute it for free. This leads to the amusing situation in which the license [] for a piece of Microsoft code contains the

  • These systems seem to be getting incrementally better, but it doesn't look like a big breakthrough.

    Of course, the intonation is roughly the kind of compromise a PR spokesman employs when trying to sound convincing without any clue what he is saying. That's not surprising, given that the TTS systems really do not have any understanding of the meaning of what they are saying.

  • About 30 years ago, I built a voice synthesizer for my IMSAI-8080 based on the General Instruments SC-01 Phoneme Synthesizer chip, which was available at that time from Radio Shack.

    I googled for +"General Instrument" +"SC-01" and got the links shown here [].

    I think Votrax was in bed with General Instruments, as they have another chip by the same name, that apparently does the same thing, but I do remember mine was a GI part.

    It turns out all speech is nothing but sequences of utterances (vowels and syllabic). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.

    • Dammit... I thought I checked that link..

      The Google General Instruments SC-01 Links [].

      Sorry for the botched post.

    • "It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine."

      It's a whole lot more complicated than that. If you think phonetically about the way we talk, we often merge words together rather than leave short discrete pauses between words. (For example, do you say "leaderovthepack" or "leader. ov. the. pack"?) Also note

    • I was doing some more tracing on what I reported in the parent

      Votrax made the SC-01 chip.

      General Instruments made the SP0256 chip

      I do not remember if the chip I had was dual marked - so I do not know if they were the same chip but under different numbers, and quite frankly I do not wanna tear into the old machine right now to verify.

      And it was in the early 1980's , which was about 20 years ago. Not 30.

      You can read more about it here [].

    • This is not a new approach.

      No, but it's a fairly sophisticated refinement of an old(ish) approach. The core ideas that make it possible have been around for a number of years, but there are a lot of constraints that make it difficult to achieve. And just for rant's sake, the qualifying use of the term 'phoneme' in the post is misleading. Phonemes are the fundamental units of vocal articulation; it would be impossible to synthesize speech without them. What sets different TTS systems apart is how they are
  • TTS is great (Score:4, Interesting)

    by jjohn (2991) on Monday March 17, 2003 @08:31AM (#5528134) Homepage Journal
    Last year, I started playing with this IBM tech. I thought it would be cool to have RSS feeds read to you in the middle of streaming music. It's kind of do-it-yourself radio. Although I don't have anything to show for that idea, I did make a few songs with it, like Make the Pie Higher [], Plug Nickle [] and Progress [].

    mmm. I hope the server can take a slashdotting...

    The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:

    # Make a text file a spoken MP3

    if [ -z "$1" ]; then
        echo "usage: $0 <input.txt>"
        exit 1
    fi

    base=`basename $1 .txt`
    echo "attempting to create $base.mp3"
    /home/jjohn/src/c/viavoice/cmdlinespeak/speakfile $1 temp.wav
    lame -h temp.wav $base.mp3
    rm -f temp.wav

    speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(

    It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.

    use Audio::SoundFile;
    use Audio::SoundFile::Header;

    my $BUFFSIZE = 16384;
    my $ifile = shift || usage();
    my $ofile = shift || usage();
    my $buffer;
    my $header;

    my $reader = new Audio::SoundFile::Reader($ifile, \$header);
    $header->{format} = SF_FORMAT_WAV | SF_FORMAT_PCM;
    my $writer = new Audio::SoundFile::Writer($ofile, $header);

    while (my $length = $reader->bread_pdl(\$buffer, $BUFFSIZE)) {
        $writer->bwrite_pdl($buffer, $length); # copy each block to the output
    }

    $reader->close;
    $writer->close;

    sub usage {
        print <<EOT;
    usage: $0 <infile> <outfile>
    EOT
        exit;
    }

    mmm. There was indenting in code at one point. Sigh...

  • by lingqi (577227) on Monday March 17, 2003 @08:31AM (#5528138) Journal
    Unless the female voice can render the lines below with feeling, I don't think it's a mature technology.

    give me! give me! oh! I am coming!! OHHHH!

    Actually, I did try it. The result (of the above line) was not spectacular. I am impressed with the quality in general, though. Tried "Sticking feathers up your butt does not make you a chicken," but that needs to be said with feeling as well, I suppose.

    Oh yeah, this kind of technology is excellent for a computer to read out the sites to you, if, say, your eyes are tired. It should work wonders for slashdot, even.

    • Oh yeah, this kind of technology is excellent for a computer to read out the sites to you

      I think you discovered the killer application for this technology: the voice reads erotic stories to you while you surf pr0n.

  • What's the status of the infinitely more amazing speech-to-text? Being from Belgium, and thus being scammed by Lernout & Hauspie, who promised true S2T to be reality by 2000, I'm kinda sceptical towards it by now.

    Will it ever be possible? As far as I can tell, S2T is quite a bit more difficult than English-to-French translation, for instance, and that still has a long way to go...
  • uttering the sequence:
    "Aargh! I've been slashdotted!" []

    Bandwidth sponsored by Danish research funding...
  • by Sam Lowry (254040) on Monday March 17, 2003 @08:52AM (#5528204)
    There are basically two TTS technologies on the market:
    • diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival []. Diphone-based synthesis will hardly ever sound better than in Festival, because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter needed to produce a given phrase.
    • corpus-based synthesis takes a different approach: a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. Such a database is used to extract the best and longest sequence of diphones at production time. This approach gives natural-sounding results for short sentences where intonation is not so important. Given that the cost of developing a database for corpus synthesis may be orders of magnitude higher than for diphone synthesis, there are very few companies that make them. Two companies offer a demo on the internet: ATT [] and Scansoft [] (formerly L&H) and
    • Actually, there are more types than this: for example, formant synthesis and HMM synthesis.

      Also, festival supports unit selection synthesis (which is what you're calling corpus synthesis - the corpus is just the body of text to be recorded, which is used in diphone synthesis also) as well as diphone synthesis.
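      The unit-selection search both comments describe boils down to finding the cheapest sequence of recorded units: each candidate pays a target cost (mismatch with the wanted prosody) plus a join cost (discontinuity with its neighbour). A toy dynamic-programming sketch, with made-up cost functions standing in for real spectral and pitch distances:

```python
# Toy unit selection: for each slot in the target sequence there are
# several recorded candidates; pick the path minimizing the sum of
# target costs and join costs between adjacent units.

def select_units(candidates, target_cost, join_cost):
    """candidates: list (per slot) of candidate lists; returns min-cost path."""
    # best[j] = (cost, path) of the cheapest path ending at candidate j
    best = [(target_cost(u), [u]) for u in candidates[0]]
    for slot in candidates[1:]:
        best = [
            min(
                (c + join_cost(path[-1], u) + target_cost(u), path + [u])
                for c, path in best
            )
            for u in slot
        ]
    return min(best)[1]
```

      With numeric "units" and absolute-difference costs, the search prefers a smooth path over individually cheap but badly joining units, which is exactly the behaviour that makes corpus synthesis sound natural on short sentences.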

  • In the '80s, TI had a number of speech synth chips that were of amazing quality. The one used with the add-in modules for the TI-99/4A was remarkable. I still have not heard a better-quality speech synth since then. I wonder what happened to that TI technology.

  • Old news (Score:3, Interesting)

    by payndz (589033) on Monday March 17, 2003 @08:58AM (#5528226)
    Text-to-speech? Come on, this has been around for donkey's years - maybe the computer voice doesn't sound like Majel Barrett yet, but it's hardly new and amazing stuff.

    I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")

  • I guess this is what comes of dopes who don't know their own language...
  • I think this has been the first time I've been able to experience some sort of off-site media before it has been slashdotted.

    That just makes my day! :)
  • Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.

    OK, the point of this article

  • My (former) university: mbrola []

    It is even free (as in beer) for personal use.
  • by Bertie (87778) on Monday March 17, 2003 @10:01AM (#5528535)
    I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.

    In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.

    Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticians set about mapping all the sounds of all the languages in the world to phonemes, and we got the international phonetic alphabet.

    Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

    The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...

    Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

    In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.
    • I'm taking a Linguistics course this semester, and I've always found things like this interesting. You make several good points, but I feel that, like most doubters, you dismiss the attempt as inevitable failure. You have to be careful when saying things like "Linux won't catch on," "Artificial Intelligence won't happen," or "phonemes are too hard to separate."

      In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these
    • I think it is widely recognized that you need to take coarticulation and _meaning_ into account when converting between speech and text.

      You argued in another post for models of 4+ phonemes. The reason we don't see this is that it's not a huge theoretical leap from triphones (thus boring researchers), and there are computational/storage/training efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-ph
    • I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes....In the beginning, there was the word. And the word was spoken...

      ...sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion.

      This is not a very coherent argument. You might as well say that you doubt the existence of musical notes, since you've diagrammed the power spectrum

  • So TTS with synthesized phonemes sounds bad, and they try to use recorded phonemes instead. Those still sound bad when the computer has to produce a phoneme combination that wasn't recorded.

    So what's the next step? Is there anyone working on physical modelling of the acoustic properties of the mouth, tongue, throat, larynx, and lungs as they glide between different phonemes to produce speech sounds? This seems like the only way you're gonna get something closer to natural than this recorded-phoneme technol
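    Work in that direction does exist: articulatory synthesis and its simpler cousin, source-filter ("formant") synthesis, which models the vocal tract as resonators acting on a glottal source. A bare-bones source-filter sketch in pure Python — the sample rate, pitch, and formant values are ballpark figures for a vowel, not a tuned model:

    ```python
    import math

    RATE = 8000  # sample rate in Hz (an assumption for this sketch)

    def resonator(signal, freq, bandwidth, rate=RATE):
        """Two-pole IIR resonator: a crude model of one vocal-tract formant."""
        r = math.exp(-math.pi * bandwidth / rate)
        theta = 2 * math.pi * freq / rate
        a1, a2 = 2 * r * math.cos(theta), -r * r
        gain = 1 - r  # rough amplitude normalisation
        out, y1, y2 = [], 0.0, 0.0
        for x in signal:
            y = gain * x + a1 * y1 + a2 * y2
            out.append(y)
            y1, y2 = y, y1
        return out

    # Source: a glottal pulse train at ~100 Hz (a male-ish pitch).
    n = RATE // 10  # 100 ms of audio
    source = [1.0 if i % (RATE // 100) == 0 else 0.0 for i in range(n)]

    # Filter: cascade two resonators near the first two formants of /a/.
    voiced = resonator(resonator(source, 700, 130), 1220, 70)
    ```

    Gliding the formant frequencies over time, instead of holding them fixed, is what would give you the between-phoneme transitions the parent is asking for.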
  • There is a book called MITalk (MIT Talk) that documents the effort, years ago, of using some major hardware to do this. They were using a VAX (780?) just for one part of the processing and a few other big computers to do the rest. This led to DECtalk (a.k.a. the voice of Stephen Hawking)

    It seems to me that with modern DSPs cranking along at many more calculations per second than a VAX could ever hope for, and one of the best theoretical mathematicians ever having a reliance on the technology, that
  • This raises the bar on fake sound bites. Imagine recording thousands of phrases spoken by Mr. Burns and piecing them together with this technique to make him say "Hello, Smithers. You're quite good at turning me on".
  • by Mandrake (3939) <> on Monday March 17, 2003 @11:01AM (#5528844) Homepage Journal
    This sort of technology has been under development for a long time, and we have demos up on our website, also: Cepstral Online Speech Synthesis Demos []. In fact, we have Higher Quality Limited Domain Demos [] available as well.
  • Is it just me? (Score:2, Insightful)

    by evronm (530821)

    Or does anyone else not understand what the big deal about text to speech is?

    I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.

    Speech to text, on the other hand...

    • I had that program; we used to make it try to call the dog.

      I think, though, that in retrospect it was not quite so good as we remember it; getting something like that to sound more natural is no small thing, nor is it to make it a smaller, faster program that makes fewer pronunciation errors. Incremental advancements are the name of the game for most technologies -- what was Apollo, after all, except a series of incremental advancements over Sputnik?
  • Seems to me that text to speech would be a good problem for Darwinian competitive algorithms []. You could take a book on tape, feed the text as input, and let different algorithms compete, judging them against the human speaker.

    Many iterations later, you probably can get a computer sounding just like a person. And since it has had a whole book to practice over, it should be pretty general.
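    The idea can be sketched as a toy evolutionary loop: score candidate parameter vectors against a target (standing in for the human recording), keep the best, mutate, repeat. Everything here is illustrative — a real TTS fitness function would compare audio against the tape, not a list of numbers:

    ```python
    import random

    random.seed(42)  # fixed seed so the toy run is reproducible

    # Stand-in for acoustic features extracted from the human reading.
    TARGET = [0.2, 0.8, 0.5, 0.1]

    def fitness(candidate):
        """Lower is better: squared error against the target features."""
        return sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

    def mutate(candidate, scale=0.1):
        """Random perturbation of each parameter."""
        return [c + random.uniform(-scale, scale) for c in candidate]

    # Simple (1+5) evolution: keep the best of the parent and its offspring.
    best = [random.random() for _ in TARGET]
    for _ in range(500):
        children = [mutate(best) for _ in range(5)]
        best = min(children + [best], key=fitness)
    ```

    Because the selection is elitist, fitness never gets worse; whether such a scheme scales from four numbers to a whole book of speech is the open question.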
  • Prediction: They'll look at their server logs and find:

    a) requests for female voices saying dirty things and
    b) requests for male voices saying: "How are you gentlemen!! All your base are belong to us!! You have no chance to survive make your time!!"
    c) "I got an error, you insensitive clod!"

  • The quality of AT&T's TTS or SpeechWorks' TTS is far more advanced. I had some fun with Speechworks' one and posted samples:

    What I wish On-Star would actually say []

    A slightly-edited announcement calling our Bulldog to attend to a special matter []


  • Phonemes aren't really going to help. It's easy to wreck a nice peach.
