IBM Science

Phoneme Approach For Text-to-Speech in SCIAM

jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."
This discussion has been archived. No new comments can be posted.

  • by Tucan ( 60206 ) on Monday March 17, 2003 @08:04AM (#5528054)
    Phonemes are the building blocks of language not phenomes.
  • by LeoDV ( 653216 ) on Monday March 17, 2003 @08:04AM (#5528055) Journal
    If memory serves me, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case of this project.

    What's so special about it?
  • by texchanchan ( 471739 ) <ccrowley@gmail . c om> on Monday March 17, 2003 @08:10AM (#5528073)
    Phoneme, a unit of sound in a word. From Dictionary.com [reference.com]: "The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. [... from Greek phōnēma, phōnēmat-, utterance, sound produced, from phōnein, to produce a sound, from phōnē, sound, voice...]"

    Related to "telephone," "phonics," etc.
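
The mat/bat example in the definition above is what linguists call a minimal pair: two words whose phoneme sequences differ in exactly one position. A minimal sketch of that check follows; the function name and the ARPAbet-style labels ("M", "AE", "T") are illustrative assumptions, not something taken from the dictionary entry.

```python
def is_minimal_pair(a, b):
    """Return True when two phoneme sequences have the same length
    and differ in exactly one position."""
    if len(a) != len(b):
        return False
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences == 1


# Hypothetical transcriptions, chosen only for illustration.
mat = ["M", "AE", "T"]
bat = ["B", "AE", "T"]
bad = ["B", "AE", "D"]

print(is_minimal_pair(mat, bat))  # differ only in the first phoneme
print(is_minimal_pair(mat, bad))  # differ in two positions
```

If swapping one sound for another turns one word into a different word, the two sounds count as separate phonemes in that language.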
  • by Anonymous Coward on Monday March 17, 2003 @08:13AM (#5528079)
    If you visit here:
    http://www.naturalvoices.att.com/demos/

    You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".

    The other thing is, speech synthesis via phonemes is very basic practise indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90s.
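
The "smoothing of phoneme edges" mentioned in the comment above can be pictured as a crossfade where two recorded units meet. The sketch below uses a plain linear crossfade over raw sample lists; this is an assumed, simplified technique for illustration, not the method used by AT&T or IBM, and `crossfade_join` is a hypothetical name.

```python
def crossfade_join(left, right, overlap):
    """Join two lists of audio samples, blending the last `overlap`
    samples of `left` with the first `overlap` samples of `right`
    using a linear ramp."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        # No smoothing at all: a hard cut at the boundary.
        return list(left) + list(right)

    blended = []
    tail_start = len(left) - overlap
    for i in range(overlap):
        # Weight of the right-hand unit rises steadily across the overlap.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * left[tail_start + i] + w * right[i])

    return list(left[:tail_start]) + blended + list(right[overlap:])


# Toy "units": a constant loud segment followed by silence.
joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2)
print(joined)
```

A larger `overlap` hides the join more thoroughly but blurs more of each unit, which is the trade-off the comment is pointing at.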
  • hmmmm... (Score:1, Informative)

    by koekepeer ( 197127 ) on Monday March 17, 2003 @08:22AM (#5528109)
    festival anyone?

    cut'n paste:

    http://www.cstr.ed.ac.uk/projects/festival/
  • by wzrd2002 ( 596945 ) on Monday March 17, 2003 @08:23AM (#5528110)
    There is already a freely available open source speech synthesis application for both Linux and Windows, called Festival [ed.ac.uk], created by The University of Edinburgh [ed.ac.uk]
  • by inblosam ( 581789 ) on Monday March 17, 2003 @08:26AM (#5528119) Homepage
    I run Mac OS X and in a lot of applications you have the option for the computer to read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking...in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.

    How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.
  • by rpiquepa ( 644694 ) on Monday March 17, 2003 @08:28AM (#5528122) Homepage
    IBM is not alone in working on text-to-speech technology and in having demos [ibm.com] where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos [bell-labs.com]. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.
  • by wiggys ( 621350 ) on Monday March 17, 2003 @08:47AM (#5528179)
    "It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine."

    It's a whole lot more complicated than that. If you think phonetically about the way we talk, we often merge words together rather than leave short discrete pauses between words. (For example, do you say "leaderovthepack" or "leader. ov. the. pack"? Also note the "ov" instead of "of")

    Not only that, we pronounce words differently depending on the context in which they appear (if you think about the mechanics of speaking you'll realise our mouths change shape, therefore if you've just pronounced an "m" you may find it tricky to hit an immediate "l"). Also, we give away many clues about our state of mind as we speak - when we say "yours truly" we often sound humble, but when we say "Mine's better than yours" the "yours" in the latter sentence sounds more aggressive.

    Probably the most important difference is emotion. A good narrator or speaker can draw you in to what he's saying because of the way he says it. Think about Kennedy delivering the line "We do these things not because they are easy..." - now feed the same line into a speech synthesizer. It's dead, isn't it? No impact, no emotion, no feeling. Personally, I find I can concentrate much more when a good narrator is reading an audio book than I can if a bad one reads it.

    I found an audio book on Kazaa once where Stephen Hawking's synthesizer reads aloud A Brief History Of Time. I had to stop listening after 2 minutes because it no longer made sense - had Richard Dawkins been reading it then I'm sure I could have absorbed it 10 times better.

  • by g4dget ( 579145 ) on Monday March 17, 2003 @08:51AM (#5528195)
    Debian has several text-to-speech systems built-in. One of them is Festival, based on a research prototype from Edinburgh. It's a few years behind IBM and ATT, but passable. With more training data, it would get better. There are also several open source speech recognition engines of varying quality, again, mostly derived from university research (I believe Cambridge, CMU, and a few others).

    Up to now, Microsoft has not really made any significant contributions to speech technology. They have bought lots of companies and hired away experts from other companies and universities. Those people are now toiling away at Microsoft research and waiting for their options to be worth something. Whether they'll make significant contributions to speech research while at Microsoft remains to be seen.

  • by Sam Lowry ( 254040 ) on Monday March 17, 2003 @08:52AM (#5528204)
    There are basically two TTS technologies on the market:
    • diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival [ed.ac.uk]. Diphone-based synthesis will hardly sound better than in Festival because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter that is needed to produce a given phrase.
    • corpus-based synthesis takes a different approach: a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. At synthesis time, the system extracts the best and longest sequences of diphones from that database. This approach gives natural-sounding results for short sentences where intonation is not so important. Given that the cost of developing a database for corpus synthesis may be orders of magnitude higher than for diphone synthesis, there are very few companies that make them. Two companies offer a demo on the internet: ATT [att.com] and Scansoft [scansoft.com] (former L&H) and
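
The two approaches in the comment above can be sketched in a few lines. The code below first splits a phoneme sequence into diphones, then covers a target diphone sequence with the longest contiguous runs it can find in a small labelled corpus. The greedy longest-match search is a deliberate simplification standing in for the cost-based search a production system would presumably use; the function names, the "_" silence marker, and the ARPAbet-style phoneme labels are all assumptions made for illustration.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into diphones (pairs of adjacent sounds),
    padding both ends with a silence marker '_'."""
    padded = ["_"] + list(phonemes) + ["_"]
    return list(zip(padded, padded[1:]))


def select_units(target, corpus):
    """Greedily cover `target` (a list of diphones) with the longest
    contiguous runs found in `corpus` (a list of diphone lists).
    Returns (utterance_index, start, length) triples."""
    units = []
    pos = 0
    while pos < len(target):
        best = None
        for u_idx, utt in enumerate(corpus):
            for start in range(len(utt)):
                # Count how far this corpus position matches the target.
                length = 0
                while (pos + length < len(target)
                       and start + length < len(utt)
                       and utt[start + length] == target[pos + length]):
                    length += 1
                if length > 0 and (best is None or length > best[2]):
                    best = (u_idx, start, length)
        if best is None:
            raise KeyError(f"no recorded unit for diphone {target[pos]}")
        units.append(best)
        pos += best[2]
    return units


# Hypothetical two-utterance corpus: "bad" and "cat".
corpus = [to_diphones(["B", "AE", "D"]), to_diphones(["K", "AE", "T"])]
target = to_diphones(["B", "AE", "T"])  # "bat" was never recorded as a whole
print(select_units(target, corpus))
```

In this toy run the word "bat" is assembled from the start of "bad" and the end of "cat", so only one join is needed instead of three. Longer matching runs mean fewer joins and less artificial modification, which is why a bigger recorded corpus tends to sound more natural.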
  • by Rubyflame ( 159891 ) on Monday March 17, 2003 @09:03AM (#5528243) Homepage
    Used to? Still does! It's called "AT&T Natural Voices," and there's an online demo [att.com].
  • by WWWWolf ( 2428 ) <wwwwolf@iki.fi> on Monday March 17, 2003 @10:17AM (#5528634) Homepage

    Festival is great, especially with the OGI patches [ogi.edu]. I was completely blown away by Festival's quality compared to other open source TTS engines, and the OGI stuff makes stock Festival sound pathetic. Really great stuff, regrettably still not as good as IBM's or AT&T's stuff, but they have got a TTS that I can listen to for hours without making my ears bleed.

    Regrettably OGI patches are for personal/research use only, so Debian won't ship them...

  • by Anonymous Coward on Monday March 17, 2003 @10:50AM (#5528790)
    including AT&T. This demo sounds much more natural over a broader range of words to my ear. Not much, but some better.
  • by Mandrake ( 3939 ) <mandrake@mandrake.net> on Monday March 17, 2003 @11:01AM (#5528844) Homepage Journal
    This sort of technology has been under development for a long time, and we have demos up on our website, also: Cepstral Online Speech Synthesis Demos [cepstral.com]. In fact, we have Higher Quality Limited Domain Demos [cepstral.com] available as well.
  • by Mandrake ( 3939 ) <mandrake@mandrake.net> on Monday March 17, 2003 @11:10AM (#5528899) Homepage Journal
    You should also check out CMU Flite [cmuflite.org], which is by one of the guys who built Festival. He also works on other, high quality synthesizers at our company, which you can get demos of at our demo site [cepstral.com].
  • by Anonymous Coward on Monday March 17, 2003 @11:17AM (#5528948)
    It's funny how most synthesized voices sound like the Software Automatic Mouth (S.A.M.) software that was available for Atari 800 computers long ago.
  • by Anonymous Coward on Monday March 17, 2003 @11:29AM (#5529021)

    Check out FreeTTS. It's free, open source, and very good. The quality of the supplied voice (as of now) is not as good, but the engine is very good. The footprint is small. And it's pure Java. Also, it's faster than C code (Flite) if some of you want to compare speed.
  • by Anonymous Coward on Monday March 17, 2003 @02:25PM (#5530349)
    I've done a bit of research on text to speech systems, and the absolutely BEST most natural text to speech I've come across is Rhetorical..

    Demo here [rhetorical.com]

    It's got a good range of voices. My answering machine is using one of them...
  • by SimHacker ( 180785 ) on Monday March 17, 2003 @03:47PM (#5531069) Homepage Journal
    I'm working on a project involving voice synthesis, so we've been shopping around and evaluating different systems.

    We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the Slashdot Linux fanboy publicity for all it was worth, they apparently didn't see any purpose in actually SUPPORTING the product -- so once their libraries stopped working against the latest GNU/Linux libraries (happy birthday RMS!), they dropped their Linux voice synthesizer product like a hot potato instead of bothering to recompile it and issue an update.

    So we hoped AT&T would show more commitment to the promises they made on their web site about their flagship voice synthesizer product, but...

    Has anyone actually tried buying a single user copy of Natural Voices from AT&T? YOU CAN'T ANYMORE! They used to sell the synthesizer for workstations and voices for competitive prices (in the 100s of dollars range). So we bought a few voices to evaluate, and sent some simple technical questions into the email address they provided for support, never receiving a reply.

    After several weeks they never answered any of our questions, but we decided to buy some more voices to evaluate anyway. But by then, AT&T had pulled the consumer single user version of Natural Voices off of the market (and it took weeks of phone tag to find that out because they don't give out "technical" information on the phone, and they never answer their email support address).

    Now if you want to buy a Natural Voice from AT&T, you have to buy the server edition for tens of thousands of dollars. Had their support not absolutely sucked, it might have been worth us paying such a high price, but no way we'd ever consider going with AT&T, after they demonstrated such horrible unresponsive service.

    Actually it's a good thing we didn't go with AT&T's voice synthesizer, because we need support for voice authoring tools, and AT&T is incompetent in that regard, since they refuse to give out technical information over the phone, and never answer their email. No support whatsoever. Zilch. Nada. Forget about it.

    Fortunately we found some excellent open source software that works together (and whose authors are MUCH more responsive than IBM or AT&T): the Festival Speech Synthesis System [ed.ac.uk], the FestVox voice authoring tools [festvox.org], the small fast Flite runtime speech engine [cmu.edu], the Edinburgh Speech Tools [ed.ac.uk], the CSLU speech tools [colorado.edu], the OGI Festival tools [ogi.edu], and the MBROLA Multilingual Speech Project [fpms.ac.be]. This is state of the art research software, where IBM and AT&T got their ideas.

    The quality of the commercial voices comes more from throwing lots of time and money into the production process -- the commercial software is not any more advanced than the open source research projects -- in fact the research projects inspired the commercial products!

    -A speech synthesizer user who's been jerked around by AT&T and IBM, and is now happy to have no other choice but to use excellent open source software.

  • by tchapin ( 90910 ) on Monday March 17, 2003 @05:11PM (#5531754)
    SpeechWorks also offers a high-quality network telephony concatenative TTS engine, called Speechify [speechworks.com]. We also offer a formant-based TTS engine, as well as an embedded TTS one based on Speechify. See some demos here. [speechworks.com]

    We also offer quite a large range of languages. Our Canadian French voice, which was just released, is fantastic! Looks like marketing hasn't put him on the demo page yet though... :(

    Todd
