Phoneme Approach For Text-to-Speech in SCIAM

Posted by Hemos on Monday March 17, 2003 @07:59AM from the understanding-the-language dept.

jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."

This discussion has been archived. No new comments can be posted.

Does the poster have something against IBM (Score:4, Insightful)

by watzinaneihm ( 627119 ) writes: on Monday March 17, 2003 @08:04AM (#5528053) Journal

Does the poster have something against IBM ... to link an application to a slashdot post?
Even the guys who don't read the articles might now be tempted to click the "enter text" link.

- Re:Does the poster have something against IBM (Score:2, Funny)
  
  by borgdows ( 599861 ) writes:
  
  It looks like IBM is not running their servers on a dead fly ;)
  - Re:Does the poster have something against IBM (Score:2)
    
    by jelle ( 14827 ) writes:
    
    Didn't they have that scalable supercomputer on demand thing [google.com] going on?
    
    Obviously, they are testing the system under load now, and this is part of their test plan.
    
    Tomorrow, we'll see a 'get your own freshly compiled linux ISO from IBM' here...
    - - Re:Does the poster have something against IBM (Score:2)
        
        by jelle ( 14827 ) writes:
        
        "Interesting, but I'm sure IBM can afford to pay for testing of their own."
        
        But.... nothing compares to the /. effect. It's the real thing(tm)!
  - I think that ... (Score:2)
    
    by Snork Asaurus ( 595692 ) writes:
    
    it may be a problem with the U.B.A.
Phonemes not phenomes (Score:4, Informative)

by Tucan ( 60206 ) writes: on Monday March 17, 2003 @08:04AM (#5528054)

Phonemes are the building blocks of language not phenomes.

- Re:Phonemes not phenomes (Score:1, Interesting)
  
  by Anonymous Coward writes:
  
  Methinks another case of /.ers obtaining their scant science knowledge from bad TV and movie sci-fi (real SF comes in books!)
  
  Anybody willing to write "The Extended Phoneme?"
  Homer Simpson perhaps....
- - Re:As a concerned American patriot, (Score:1)
    
    by shepd ( 155729 ) writes:
    
    Yes, they should be called Freedomgnomes. Stupid latins editing our language for PH sounds.
    
    And where's freedomdot? It's all wrong, I tells ya, it's all freedomin' wrong!
    
    Freedom. The new Marklar.
    - Re:As a concerned Slashdot reader.. (Score:1)
      
      by jkrise ( 535370 ) writes:
      
      I find your posts, though insightful, tend to divert attention from the topic at the top of the thread. If you start a new thread, I promise to read all your posts. Just remember to retain the same title though. Thanks.
I was expecting better... (Score:5, Informative)

by LeoDV ( 653216 ) writes: on Monday March 17, 2003 @08:04AM (#5528055) Journal

If memory serves me, I believe it was AT&T (?) that used to have a similar webpage with near-perfect text-to-speech, which is hardly the case of this project.

What's so special about it?

- Re:I was expecting better... (Score:5, Informative)
  
  by Rubyflame ( 159891 ) writes: on Monday March 17, 2003 @09:03AM (#5528243) Homepage
  
  Used to? Still does! It's called "AT&T Natural Voices," and there's an online demo [att.com].
  
  - Re:I was expecting better... (Score:2)
    
    by perky ( 106880 ) writes:
    
    I thought that the IBM one was better. The acoustic stuff seemed to be about the same, but the intonation on the IBM one was a lot nicer for the two samples I tried.
    
    Incidentally, they don't seem to have improved a great deal from the concatenative TTS systems IBM had 4 years ago. There was one model of the UK marketing woman for ViaVoice, and for some sentences the TTS was almost indistinguishable from the real thing. The only problem with these systems is that the memory footprint is massive, so they tak
  - Re:I was expecting better... (Score:3, Interesting)
    
    by MrScience ( 126570 ) writes:
    
    This was used in Mission to Mars for the spaceship's voice. The director was looking to do some sound FX to create one from a human voice, then found AT&T's product which was a perfect fit [att.com].
    
    I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.
  - Natural Voices Gagged: AT&T is asleep at the d (Score:3, Informative)
    
    by SimHacker ( 180785 ) writes:
    
    I'm working on a project involving voice synthesis, so we've been shopping around and evaluating different systems.
    We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the slashdot linux fanboy publicity for all it was worth, th
  - Re:I was expecting better... (Score:2, Informative)
    
    by tchapin ( 90910 ) writes:
    
    SpeechWorks also offers a high-quality network telephony concatenative TTS engine, called Speechify [speechworks.com]. We also offer a formant-based TTS engine, as well as an embedded TTS one based on Speechify. See some demos here. [speechworks.com]
    We also offer quite a large range of languages. Our Canadian French voice, which was just released, is fantastic! Looks like marketing hasn't put him on the demo page yet though... :(
    Todd
- Re:I was expecting better... (Score:2)
  
  by John Harrison ( 223649 ) writes:
  
  I might be biased as an IBMer, but the IBM one sounds better to me. Both are certainly better than the one included with Notes Buddy [ibm.com], which is all the rage in IBM right now since it is so much better than our previous IM tool.
- Re:I was expecting better... (Score:3, Interesting)
  
  by Mandrake ( 3939 ) writes:
  
  We've also been doing this for quite some time. you can check out the Cepstral On-Line High Quality Synthesis Demos [cepstral.com], as well as our High Quality Limited Domain Demos [cepstral.com].
  - Re:I was expecting better... (Score:2)
    
    by swb ( 14022 ) writes:
    
    Much worse than the AT&T version. The words are run together too much.
    
    Haven't gotten the IBM one to work yet.
    - Re:I was expecting better... (Score:2)
      
      by Mandrake ( 3939 ) writes:
      
      Our synthesizer runs in a very small fraction of the footprint (memory and disk space) of the AT&T synthesizer. The AT&T synthesizer is also based on earlier work from our CTO (the AT&T synthesizer is ultimately just Festival with some other code on top of it).
speaking of the /. effect (Score:4, Funny)

by trelanexiph ( 605826 ) writes: on Monday March 17, 2003 @08:06AM (#5528061) Homepage

I guess IBM didn't have much to say on the matter.

IBM Text-to-Speech Research Demonstration

Input Communcations Error.

You have reached this page because of an severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.

- Re:speaking of the /. effect (Score:1)
  
  by timmie... ( 141368 ) writes:
  
  Hardly surprising. Down at the bottom of the page there's a note that says the application can only be used 30 times per day... and it's linked to slashdot. :)
This could be a hit... (Score:1, Funny)

by WegianWarrior ( 649800 ) writes:

...if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work.

With the right technical document, this could cure insomnia as well...
- Re:This could be a hit... (Score:1)
  
  by yora ( 254503 ) writes:
  
  "if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work."
  
  You should check out the Digital Talking Book specs. It is an open format and there are readers available that allow text to speech and other effects. Most of the readers have been designed with visually impaired target audien
- Re:This could be a hit... (Score:2, Interesting)
  
  by wcb4 ( 75520 ) writes:
  
  I have actually used textaloudMP3 (from nextUp) to read Project Gutenberg e-texts aloud. It's not perfect, far from it, but it gets better since you can correct mispronunciations over time (my exceptions file now has about 200 entries). The program is a Windows front end to ANY installed text-to-speech engine, be it Microsoft's or L&H or AT&T. I often have it read into mp3 files, which I burn onto CDs and listen to on the way to work. I can usually get about 5-6 full books on a single CD, and it's free (
- Re:This could be a hit... (Score:2)
  
  by walt-sjc ( 145127 ) writes:
  
  What would be REALLY funny is a tts / voice recognition battle between different computers - maybe running an eliza type system. As it messes up on the recognition, things could go down hill fast... :-)
PHONEME, y'all, not *phenome (Score:3, Informative)

by texchanchan ( 471739 ) writes: <ccrowley.gmail@com> on Monday March 17, 2003 @08:10AM (#5528073)

Phoneme, a unit of sound in a word. From Dictionary.com [reference.com]: "The smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. [... from Greek phōnēma, phōnēmat-, utterance, sound produced, from phōnein, to produce a sound, from phōnē, sound, voice...]"

Related to "telephone," "phonics," etc.

- Re:PHONEME, y'all, not *phenome (Score:5, Funny)
  
  by WeeBull ( 645243 ) writes: on Monday March 17, 2003 @08:15AM (#5528083)
  
  .. and often uttered in distressed tones at the end of a night out, usually by desperate males attempting to re-attach themselves to some female. PHONEME! PLEASE PHONEME! I LOVE YOU! PHONEME!
  
AT&T have been doing this for a while! (Score:5, Informative)

by Anonymous Coward writes: on Monday March 17, 2003 @08:13AM (#5528079)

If you visit here:
http://www.naturalvoices.att.com/demos/

You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".

The other thing is, speech synthesis via phonemes is very basic practise indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-90s.
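
To make the "smoothing of phoneme edges" point concrete, here is a minimal sketch of one common way to smooth a splice: a linear crossfade over a short overlap. This is only an illustration under assumptions, not what AT&T or IBM actually do; the function name `crossfade` and the plain lists of samples are made up for the example.

```python
def crossfade(a, b, overlap):
    """Join two lists of audio samples, linearly blending `overlap` samples.

    Hypothetical helper for illustration: the last `overlap` samples of `a`
    are mixed with the first `overlap` samples of `b`.
    """
    if overlap <= 0:
        return list(a) + list(b)  # hard splice, no smoothing at all
    overlap = min(overlap, len(a), len(b))
    head = list(a[:len(a) - overlap])
    tail = list(b[overlap:])
    blended = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of b rises across the overlap
        blended.append((1 - w) * a[len(a) - overlap + k] + w * b[k])
    return head + blended + tail


# A one-sample overlap between a "loud" and a "silent" segment.
print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 1))
```

A longer overlap hides the joint better but also smears the sounds on either side of it, which is the trade-off being argued about here.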

- Re:AT&T have been doing this for a while! (Score:3, Funny)
  
  by wiggys ( 621350 ) writes:
  
  The Currah speech unit for the Spectrum was hilarious. It came with a free game which was supposed to say "The Banshee wails at you but nothing happens".
  It actually sounded like "Shbansheehailsacthoowawaaaawaaaens"
  I remember you could also turn it on while you were programming, so every time you pressed a key it would say "ONE ZERO PRINT QUOTE ACH EE ELL ELL O QUOTE ENTER TWO ZERO ENTER RUN ENTER". It used to drive me batty. It was one of those eighties things which you thought was "cool" at the time, but
- Re:AT&T have been doing this for a while! (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  The IBM product seems to take the recording of a long text read by a human and automatically produce the data collection that is the artificial voice. It uses speech recognition methods to align text and recording. It also stores more than just a simple collection of phonemes: Where older text-to-speech solutions would modify the sample of a phoneme to reflect a certain position in a sentence, IBMs solution appears to use a phoneme sample from the same context, making the result much less monotone. This app
- Re:AT&T have been doing this for a while! (Score:2, Insightful)
  
  by prowley ( 587280 ) writes:
  
  The way to smooth out the lumps is to not use phonemes at all, but diphones. Imagine recording two phonemes uttered by a human speaker in sequence, and then slicing through the middle of each phoneme and discarding the ends. That gives you a diphone. Diphones are far superior because phonemes do not change in the middle, so there are no "lumps" at the splice. On the other hand phonemes do change depending on what phoneme is uttered next, simply because in articulating different phoneme sequences the
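
  A rough sketch of the diphone idea, assuming phonemes are just string labels and that silence pads both ends of an utterance. The function name `to_diphones`, the `sil` label and the phoneme symbols are illustrative assumptions, not taken from any particular synthesizer.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the diphone unit names needed to speak a phoneme sequence.

    Each unit runs from the middle of one phoneme to the middle of the
    next, so every splice falls in the stable centre of a phoneme.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "hello" as four phoneme labels needs five diphones.
print(to_diphones(["h", "eh", "l", "ow"]))
# prints ['sil-h', 'h-eh', 'eh-l', 'l-ow', 'ow-sil']
```

  The cost is inventory size: with N phonemes you need up to N*N recorded diphones instead of N phoneme samples.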
cool (Score:1, Interesting)

by Graspee_Leemoor ( 302316 ) writes:

Whoa- finally something better than what we've had for years.

Try "I never promised you a rose garden." -The speaker sounds genuinely pissed-off!

graspee
Here's another text-to-speech site (Score:4, Funny)

by wiggys ( 621350 ) writes: on Monday March 17, 2003 @08:16AM (#5528090)

http://www.research.att.com/~ttsweb/cgi-bin/ttsdemo [att.com]
Some of the voices sound okay I guess. Better than Stephen Hawking anyway.

*blush* (Score:5, Funny)

by WeeBull ( 645243 ) writes: on Monday March 17, 2003 @08:22AM (#5528107)

Uhm, ok, who else just spent 10 minutes (thoroughly) checking if IBM filters naughty words at the text-to-speech interface? Getting the female voices to utter favourable phrases regarding one's studliness, perhaps?
Oh ... just me? *blush*

- Re:*blush* (Score:2)
  
  by wiggys ( 621350 ) writes:
  
  Maybe they should be used to generate the speech in those Weebl and Bob animations you link to in your profile!
- - Re:*blush* (Score:2)
    
    by Doomrat ( 615771 ) writes:
    
    I wish I could be cool just like you, but mother says that I'm not allowed to use those words.
hmmmm... (Score:1, Informative)

by koekepeer ( 197127 ) writes:

festival anyone?

cut'n paste:

http://www.cstr.ed.ac.uk/projects/festival/
- to try it out (Score:1)
  
  by koekepeer ( 197127 ) writes:
  
  this link:
  
  http://festvox.org/voicedemos.html
  
  does the same as IBM's demo page. sounds the same as well. but hey, i'm a layman in linguistic matters, so there's prolly a *huge* improvement i understand crap about
Open Source Speech Synthesis (Score:5, Informative)

by wzrd2002 ( 596945 ) writes: on Monday March 17, 2003 @08:23AM (#5528110)

There is already a freely available open source speech synthesis application for both Linux and Windows, called Festival [ed.ac.uk], created by the University of Edinburgh [ed.ac.uk].

- Re:Open Source Speech Synthesis (Score:2)
  
  by wiggys ( 621350 ) writes:
  
  I hope it doesn't have a strong scottish accent, they're hard enough to understand in real life...
- hehe (Score:1)
  
  by koekepeer ( 197127 ) writes:
  
  i was one minute earlier :-) but you'll prolly get the karma, because of the direct links. i am too lazy to type in a href="etcetcetc.
  
  o wait, this will cost me karma as well! -1 offtopic :-)
- Re:Open Source Speech Synthesis (Score:3, Informative)
  
  by WWWWolf ( 2428 ) writes:
  
  Festival is great, especially with the OGI patches [ogi.edu]. I was completely blown away by Festival's quality compared to other opensource TTS engines, and OGI stuff makes stock Festival sound pathetic. Really great stuff, regrettably still not as good as IBM's or AT&T's stuff, but they have got a TTS that I can listen to for hours without making my ears bleed.
  Regrettably OGI patches are for personal/research use only, so Debian won't ship them...
  - Re:Open Source Speech Synthesis (Score:2)
    
    by anonymous cupboard ( 446159 ) writes:
    
    That's the problem with BSD style licenses (under which Festival was released). You may extend and restrictively licence the result. I'm still a little surprised that the OGI stuff is for non-commercial use only although it was at least partly government funded.
    Unfortunately free-TTS (i.e, playing any, not just replaying canned speech) is a growing area and there will definitely be a large commercial potential and everyone seems to know this.
- Re:Open Source Speech Synthesis (Score:3, Informative)
  
  by Mandrake ( 3939 ) writes:
  
  You should also check out CMU Flite [cmuflite.org], which is by one of the guys who built Festival. He also works on other, high quality synthesizers at our company, which you can get demos of at our demo site [cepstral.com].
- Re:Open Source Speech Synthesis (Score:2)
  
  by jandrese ( 485 ) writes:
  
  The only problem with Festival is that it practically requires a PhD to get it up and running correctly, and the documentation is aimed at the speech synthesis development community, not the end users. The only reason I got mine working was the FreeBSD ports system and running across a reasonably small demo script I could hack to get what I wanted.
  - Re:Open Source Speech Synthesis (Score:2)
    
    by g4dget ( 579145 ) writes:
    
    Doesn't seem that hard...
    # apt-get install festival festvox-poslex festvox-kallpc16k
    # lynx -dump -nolist http://www.slashdot.org/ | festival --tts
    - Re:Open Source Speech Synthesis (Score:2)
      
      by MisterFancypants ( 615129 ) writes:
      
      well...some people are retarded.
    - Re:Open Source Speech Synthesis (Score:2)
      
      by jandrese ( 485 ) writes:
      
      Eww, you're using the default voices. What you want to do is install the OGI RES LPC pack, the OGI Lexicon, the tll voice, and write a bit of scheme to get the thing configured. For instance, if you want it to just say whatever you give it on the command line of a script:
      echo "(voice_tll_diphone) (Parameter.set 'Audio_Method 'freebsd16audio)(SayText \"$*\")" | festival --pipe
      
      Obviously using whatever sound system you have. By default it will try to use NAS if it is installed on your system, but I've nev
      - Re:Open Source Speech Synthesis (Score:2)
        
        by g4dget ( 579145 ) writes:
        
        Eww, you're using the default voices. What you want to do is install the OGI RES LPC pack, the OGI Lexicon, the tll voice, and write a bit of scheme to get the thing configured.
        Someone who has figured out how to configure that should put it into Debian as a package... then ordinary users could use it.
comparison to Apple's technology? (Score:4, Informative)

by inblosam ( 581789 ) writes: on Monday March 17, 2003 @08:26AM (#5528119) Homepage

I run Mac OS X and in a lot of applications you have the option for the computer to read an entire document. For example, in TextEdit (a simple text editor by Apple) you can go to Edit, Speech, Start Speaking...in the menu and it will read everything for you. There are 10-15 different default voices to choose from, and built into the OS you can control pretty much everything by speech and get information by voice.

How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.

- Re:comparison to Apple's technology? (Score:4, Interesting)
  
  by aseidl ( 656884 ) writes: on Monday March 17, 2003 @08:56AM (#5528215)
  
  I'm surprised by how many people (Mac users and otherwise) haven't noticed how long MacOS has come with text to speech. It's been included since at least MacOS 7.5, maybe even 7.0 (I was using it on my trusty ol' IIci yesterday). You could use it via SimpleText or even have it speak the text of dialog boxes. The quality of the voices could be better, but they do seem better than Festival. But, I have to admit it is pretty fun to scare people who don't know about it. One of my friends told me that his mother gets scared if she doesn't click OK or Cancel in a dialog because "those voices are going to come."
  
  - Re:comparison to Apple's technology? (Score:2)
    
    by Croaker ( 10633 ) writes:
    
    "those voices are going to come."
    Maybe that explains the fanatical devotion of Mac users...
    
    "I do what the voices in my Mac tell me" sounds like a t-shirt begging to be printed up.
  - Re:comparison to Apple's technology? (Score:2)
    
    by jandrese ( 485 ) writes:
    
    IIRC, it wasn't standard, but you could get Macintalk for OS 6. OS7 shipped with it standard. The default voice is the same one Koko the Gorilla and Stephen Hawking use. IIRC the entire module was 100k in size and left ample CPU time for other projects (like animating Moose lips) on a 16Mhz 68020.
  - Re:comparison to Apple's technology? (Score:3, Interesting)
    
    by silentbozo ( 542534 ) writes:
    
    Apple's TTS technology is pretty old... and it shows. I've been waiting for them to release voice upgrades since the original PowerPC macs came out, but after they axed their (basic) research section, the likelihood of that happening decreased dramatically. The IBM approach is also pretty old, but the voice quality is slightly better, probably because there are more voice samples/higher quality.
    
    No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded
- Don't forget the talking cat: (Score:2)
  
  by pHDNgell ( 410691 ) writes:
  
  http://gnufoo.org/macosx/
  
  cat -a is even cooler than snoop -a. :)
- Re:comparison to Apple's technology? (Score:2)
  
  by gerardrj ( 207690 ) writes:
  
  The "How does this compare to Apple's TTS" is really a two part question (at least, I may have missed something).
  
  The one you probably want answered is which sounds better. At this point the IBM voices sound better than the Apple TTS, but not by very much. Especially when you consider that Apple hasn't improved the voices in over 7 years IIRC (Of course given the option of better voices or having OS X, I'll forgo the voices). Playing several phrases from IBM's and Apple's TTS systems yields the opinion that
And don't forget Bell Labs (Score:5, Informative)

by rpiquepa ( 644694 ) writes: on Monday March 17, 2003 @08:28AM (#5528122) Homepage

IBM is not alone in working on text-to-speech technology and in having demos [ibm.com] where you can type a phrase and listen to it. The Bell Labs Text-to-Speech system (TTS) has its own page featuring fun demos [bell-labs.com]. "You can play with our basic interface for some of our Text-to-Speech systems: American English, German, Mandarin Chinese, Spanish, French, Italian and Canadian French." This page is pretty old (it makes references to Netscape 3!!), but the demos still run fine.

I've always wondered why... (Score:2, Interesting)

by jkrise ( 535370 ) writes:

Text to Speech and vice versa take more memory and CPU time as time goes on. Surely given the market potential for these apps, their quality and availability should've been much much more.

Is MS carrying any patents on this, and acting Dog-In-The-Manger..ish? Any good low-footprint Linux-based apps for text-speech?
- Re:I've always wondered why... (Score:3, Informative)
  
  by g4dget ( 579145 ) writes:
  
  Debian has several text-to-speech systems built-in. One of them is Festival, based on a research prototype from Edinburgh. It's a few years behind IBM and ATT, but passable. With more training data, it would get better. There are also several open source speech recognition engines of varying quality, again, mostly derived from university research (I believe Cambridge, CMU, and a few others).
  Up to now, Microsoft has not really made any significant contributions to speech technology. They have bought l
  - Re:I've always wondered why... (Score:2)
    
    by perky ( 106880 ) writes:
    
    They have bought lots of companies and hired away experts from other companies and universities.
    This reminded me of an amusing sideline in the history of speech Reco. Cambridge University Engineering department (CUED) originally built an engine called HTK [cam.ac.uk]. This was then sold to a company called Entropic. Entropic were then bought by Microsoft, who have licensed HTK back to CUED, who distribute it for free. This leads to the amusing situation in which the license [cam.ac.uk] for a piece of Microsoft code contains the
incremental (Score:2)

by g4dget ( 579145 ) writes:

These systems seem to be getting incrementally better, but it doesn't look like a big breakthrough.
Of course, the intonation is roughly that kind of compromise a PR spokesman employs who is trying to sound convincing but has no clue what he is saying. That's not surprising, given that the TTS systems really do not have any understanding of the meaning of what they are saying.
This is not a new approach. (Score:2, Interesting)

by anubi ( 640541 ) writes:

About 30 years ago, I built a voice synthesizer for my IMSAI-8080 based on the General Instruments SC-01 Phoneme Synthesizer chip, which was available at that time from Radio Shack.
I googled for +"General Instrument" +"SC-01" and got links shown here [http].
I think Votrax was in bed with General Instruments, as they have another chip by the same name, that apparently does the same thing, but I do remember mine was a GI part.
It turns out all speech is nothing but sequences of utterances ( vowels and syllabic )
- Re:This is not a new approach. (Score:1)
  
  by anubi ( 640541 ) writes:
  
  Dammit... I thought I checked that link..
  The Google General Instruments SC-01 Links [google.com].
  Sorry for the botched post.
- Re:This is not a new approach. (Score:3, Informative)
  
  by wiggys ( 621350 ) writes:
  
  "It turns out all speech is nothing but sequences of utterances ( vowels and syllabic ). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine."
  It's a whole lot more complicated than that. If you think phonetically about the way we talk we often merge words together rather than leave short discrete pauses between words. (For example, do you say "leaderovthepack" or "leader. ov. the. pack"? Also note
- Re:This is not a new approach. (Score:1)
  
  by anubi ( 640541 ) writes:
  
  I was doing some more tracing on what I reported in the parent
  Votrax made the SC-01 chip.
  General Instruments made the SP0256 chip
  I do not remember if the chip I had was dual marked - so I do not know if they were the same chip but under different numbers, and quite frankly I do not wanna tear into the old machine right now to verify.
  And it was in the early 1980's , which was about 20 years ago. Not 30.
  You can read more about it here [redcedar.com].
- Re:This is not a new approach. (Score:2)
  
  by foqn1bo ( 519064 ) writes:
  
  This is not a new approach.
  
  No, but it's a fairly sophisticated refinement of an old(ish) approach. The core ideas that make it possible have been around for a number of years, but there are a lot of constraints that make it difficult to achieve. And just for rant's sake, the qualifying use of the term 'phoneme' in the post is misleading. Phonemes are the fundamental of vocal articulation; it would be impossible to synthesize speech without them. What sets different TTS systems apart is how they are
  - - Re:This is not a new approach. (Score:2)
      
      by foqn1bo ( 519064 ) writes:
      
      Apology accepted. :) Sometimes I'm a little too quick to defend my area of study against real or imagined attacks against its legitimacy.
      
      And I think you're right. Placement is everything. Cheers.
TTS is great (Score:4, Interesting)

by jjohn ( 2991 ) writes: on Monday March 17, 2003 @08:31AM (#5528134) Homepage Journal

Last year, I started playing with this IBM tech. I thought it would be cool to have RSS feeds read to you in the middle of streaming music. It's kind of do-it-yourself radio. Although I don't have anything to show for that idea, I did make a few songs with it, like Make the Pie Higher [taskboy.com], Plug Nickle [taskboy.com] and Progress [taskboy.com].
mmm. I hope the server can take a slashdotting...
The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:

#!/bin/bash
# Make a text file a spoken MP3
if [ -z "$1" ] ; then
    echo "usage: $0 <input.txt>"
    exit
fi
base=`basename $1 .txt`
echo "attempting to create $base.mp3"
/home/jjohn/src/c/viavoice/cmdlinespeak/speakfile $1
writewav.pl temp.au temp.wav
lame -h temp.wav $base.mp3
rm -f temp.au temp.wav

speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(
It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.

#!/usr/bin/perl
use Audio::SoundFile;
use Audio::SoundFile::Header;

my $BUFFSIZE = 16384;
my $ifile = shift || usage();
my $ofile = shift || usage();
my $buffer;
my $header;

my $reader = new Audio::SoundFile::Reader($ifile, \$header);
$header->{format} = SF_FORMAT_WAV | SF_FORMAT_PCM;
my $writer = new Audio::SoundFile::Writer($ofile, $header);

while (my $length = $reader->bread_pdl(\$buffer, $BUFFSIZE)) {
    $writer->bwrite_pdl($buffer);
}
$reader->close;
$writer->close;
exit(0);

sub usage {
    print <<EOT;
usage: $0 <infile> <outfile>
EOT
    exit(1);
}

mmm. There was indenting in code at one point. Sigh...

ack. no good (Score:3, Funny)

by lingqi ( 577227 ) writes: on Monday March 17, 2003 @08:31AM (#5528138) Journal

Unless the female voice can render the below lines with feelings, I don't think it's a mature technology.
give me! give me! oh! I am coming!! OHHHH!
Actually I did try it. the result (of the above line) was not spectacular. I am impressed with the quality in general, though. Tried "Sticking feathers up your butt does not make you a chicken," but that needs to be said with feelings as well, I suppose.
Oh yeah, this kind of technology is excellent for a computer to read out the sites to you, if, say, your eyes are tired. It should work wonders for slashdot, even.

- Re:ack. no good (Score:2)
  
  by Wylfing ( 144940 ) writes:
  
  Oh yeah, this kind of technology is excellent for a computer to read out the sites to you
  I think you discovered the killer application for this technology: the voice reads erotic stories to you while you surf pr0n.
This is cool and all, but (Score:2)

by selderrr ( 523988 ) writes:

what's the status of the infinitely more amazing speech-to-text ? Being from belgium, and thus being scammed by Lernout&Hauspie who promised true S2T to be reality by 2000, I'm kinda sceptical towards it by now.

Will it ever be possible ? As far as I can tell, S2T is quite a bit more difficult than english->french translation for instance, and that still has a long way to go...
- - Re:Evil Anti-War Belgian Fries!!!! (Score:2)
    
    by selderrr ( 523988 ) writes:
    
    I am pouring all my Belgian wine down the toilet
    
    being belgian, so am I !
    we don't have much of a wine culture, dumbo. We're beer drinkers. Check out www.belgianbeer.com. We practically invented the stuff.
Listen to "US female 2" (Score:1, Funny)

by infolib ( 618234 ) writes:

uttering the sequence:
"Aargh! I've been slashdotted!" [fys.ku.dk]

Bandwidth sponsored by Danish research funding...
- And here's the Bell Labs version: (Score:2)
  
  by infolib ( 618234 ) writes:
  
  "Aargh! I've been slashdotted!" [fys.ku.dk]
  
  This one is much better at saying "slashdotted". Neither of them do the "Aargh!" very well. Especially the IBM one ought to be convincing, given current circumstances ;-)
  
  Generate more samples for yourself at http://www.naturalvoices.att.com/demos/ [att.com]
State of the art in TTS (Score:4, Informative)

by Sam Lowry ( 254040 ) writes: on Monday March 17, 2003 @08:52AM (#5528204)
There are basically two TTS technologies on the market:
- diphone-based synthesis, where the database contains one diphone (end of first sound + start of next sound) for each possible sound combination. This approach is used in Festival [ed.ac.uk]. Diphone-based synthesis will hardly sound better than in Festival because diphones have to be modified artificially to fit every variation of pitch, duration and any other parameter that is needed to produce a given phrase.
- corpus-based synthesis takes a different approach, where a large database of several hours of speech is recorded and manually labelled to mark the start and end of each sound. Such a database is used to extract the best and the longest sequence of diphones during the production. This approach gives natural-sounding results for short sentences where intonation is not so important. Given that the cost of developing a database for corpus synthesis may be orders of magnitude higher than for diphone synthesis, there are very few companies that make them. Two companies offer a demo on the internet: ATT [att.com] and Scansoft [scansoft.com] (former L&H) and
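
To make the corpus-based idea above a bit more concrete, here is a minimal Python sketch of "pick the longest matching run from the recorded database". It is only an illustration under loud assumptions: the toy corpus, the sound labels and the function names are all made up, sounds are plain strings instead of labelled audio, and the search is greedy. Real unit-selection systems weigh how well candidate pieces fit the target and each other, which this toy ignores.

```python
from typing import List, Tuple

# A "unit" here is just (utterance index, start, end) into the toy corpus.
# Real systems would point at labelled stretches of recorded audio instead.
Unit = Tuple[int, int, int]


def longest_match(target: List[str], pos: int, corpus: List[List[str]]) -> Unit:
    """Find the longest run anywhere in the corpus that matches target[pos:]."""
    best: Unit = (-1, 0, 0)
    for u, utt in enumerate(corpus):
        for start in range(len(utt)):
            length = 0
            while (pos + length < len(target)
                   and start + length < len(utt)
                   and utt[start + length] == target[pos + length]):
                length += 1
            if length > best[2] - best[1]:
                best = (u, start, start + length)
    return best


def select_units(target: List[str], corpus: List[List[str]]) -> List[Unit]:
    """Greedily cover the target with the longest available corpus runs."""
    units: List[Unit] = []
    pos = 0
    while pos < len(target):
        u, start, end = longest_match(target, pos, corpus)
        if u < 0:
            raise ValueError(f"sound {target[pos]!r} is not in the corpus")
        units.append((u, start, end))
        pos += end - start
    return units


# Hypothetical labelled recordings (labels are illustrative, not a real phone set).
corpus = [
    ["h", "e", "l", "ou"],   # "hello"
    ["w", "er", "l", "d"],   # "world"
    ["l", "ou", "w"],
]
print(select_units(["h", "e", "l", "ou", "w", "er", "l", "d"], corpus))
```

The fewer and longer the selected runs, the fewer joins there are to smooth over, which is roughly why a large database helps.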
- Re:State of the art in TTS (Score:2)
  
  by Mandrake ( 3939 ) writes:
  
  actually, there are more types than this. For example, formant synthesis, and HMM synthesis.
  Also, festival supports unit selection synthesis (which is what you're calling corpus synthesis - the corpus is just the body of text to be recorded, which is used in diphone synthesis also) as well as diphone synthesis.
Better than the TI speech synth chips? (Score:2)

by farrellj ( 563 ) writes:

In the 80's, TI had a number of speech synth chips that were of amazing quality. The one used with the add-in modules for the TI-99/4A was amazing. I still have not heard a better quality speech synth since then. I wonder what happened to that TI technology.

ttyl
Farrell
Old news (Score:3, Interesting)

by payndz ( 589033 ) writes: on Monday March 17, 2003 @08:58AM (#5528226)

Text-to-speech? Come on, this has been around for donkey's years - maybe the computer voice doesn't sound like Majel Barrett yet, but it's hardly new and amazing stuff.
I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")

Bonehead: it's P-H-O-N-E-M-E (Score:1)

by evodas ( 244473 ) writes:

I guess this is what comes of dopes who don't know their own language...
Unbelievable! (Score:1)

by Tuffnut ( 618438 ) writes:

I think this has been the first time I've been able to experience some sort of off-site media before it has been slashdotted.

That just makes my day! :)
Hollywood applications for speech synthesis? (Score:2, Interesting)

by Sheriff Fatman ( 602092 ) writes:

Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.

OK, the point of this article
In the 'has been doing that for a while' series : (Score:2)

by dago ( 25724 ) writes:

My (former) university : mbrola [fpms.ac.be]

It is even free (as in beer) for personal use.
I'm not actually convinced phonemes exist, y'know (Score:5, Insightful)

by Bertie ( 87778 ) writes: on Monday March 17, 2003 @10:01AM (#5528535) Homepage

I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.

In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.

Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticists set about mapping all the sounds of all the languages in the world to phonemes, and we got the international phonetic alphabet.

Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...

Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.
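
For anyone wondering what "sequences of two or three phonemes" looks like in practice, here is a rough Python sketch. It relabels each phoneme with its left and right neighbour (a triphone) and then counts how fast the inventory grows as the context widens. The "left-phone+right" label style, the "sil" padding and the figure of 40 phonemes are illustrative assumptions, not taken from any particular system.

```python
from typing import List


def triphones(phones: List[str]) -> List[str]:
    """Label each phone with its left and right neighbour, e.g. 'k-ae+t'.

    'sil' (silence) pads both ends; the label format is just a convention
    chosen for this sketch.
    """
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


def inventory_size(n_phones: int, context_width: int) -> int:
    """Upper bound on distinct units when each unit spans context_width phones."""
    return n_phones ** context_width


print(triphones(["k", "ae", "t"]))  # the word "cat"

# Assuming roughly 40 phonemes: 1 = plain phonemes, 2 = pairs, 3 = triphones.
for width in (1, 2, 3, 7):
    print(width, inventory_size(40, width))
```

Under that assumption a three-phoneme context already allows 64,000 distinct units, and a seven-phoneme context, closer to the span of influence described above, allows about 1.6 x 10^11, which gives a feel for why wider contexts are costly to record or train.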

- Re:I'm not actually convinced phonemes exist, y'kn (Score:2, Interesting)
  
  by HoldmyCauls ( 239328 ) writes:
  
  I'm taking a Linguistics course this semester, and I've always found things like this interesting. You make several good points, but I feel that, like most doubters, you oversimplify trial as inevitable failure. You have to be careful when saying things like "Linux won't catch on," "Artificial Intelligence won't happen," or "phonemes are too hard to separate."
  
  In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these
- Re:I'm not actually convinced phonemes exist, y'kn (Score:2, Interesting)
  
  by decrocher ( 444733 ) writes:
  
  I think it is widely recognized that you need to take coarticulation and _meaning_ into account when converting between speech and text.
  
  You argued in another post for models of 4+ phonemes. Why we don't see this is because it's not a huge theoretical leap from triphones (thus boring researchers) and there are computational/storage/training efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-ph
- Phonemes don't exist? Do YOU??? (Score:2)
  
  by Doug Merritt ( 3550 ) writes:
  
  I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes....In the beginning, there was the word. And the word was spoken...
  ...sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion.
  This is not a very coherent argument. You might as well say that you doubt the existence of musical notes, since you've diagrammed the power spectrum
- - LPC vocoder (Score:2)
    
    by Latent Heat ( 558884 ) writes:
    
    I ran the AT&T synthesis through my trusty spectrum analyser and glottal pulse inverse filter analyser (http://www.medsch.wisc.edu/~milenkvc/tools.html) to see what they are up to.
    It looks like they are using glottal pulses as you say, and they are doing the female voice (Crystal) by boosting the first two harmonics and by filtering out the range past 4 kHz and replacing it with noise to give it that breathy sound that is characteristic of female voices in American culture (this varies with culture
What about physical modelling? (Score:2)

by PenguiN42 ( 86863 ) writes:

So TTS with synthesized phonemes sounds bad, and they try to use recorded phonemes instead. Those still sound bad when the computer has to produce a phoneme combination that wasn't recorded.

So what's the next step? Is there anyone working on physical modelling of the acoustic properties of the mouth, tongue, throat, larynx, and lungs as they glide between different phonemes to produce speech sounds? This seems like the only way you're gonna get something closer to natural than this recorded-phoneme technol
Check a university library (Score:2)

by thogard ( 43403 ) writes:

There is a book called MITalk (MIT Talk) that involves the efforts of using some major hardware to do this years ago. They were using a Vax (780?) just for one part of the processing and a few other big computers to do the rest. This led to DECtalk (aka the voice of Stephen Hawking).

It seems to me that with modern DSPs cranking along with many more calculations per second than a VAX could ever hope for, and one of the best theoretical mathematicians ever having a reliance on the technology, that
Counterfeit sound bites (Score:2)

by p3d0 ( 42270 ) writes:

This raises the bar on fake sound bites. Imagine recording thousands of phrases spoken by Mr. Burns and piecing them together with this technique to make him say "Hello, Smithers. You're quite good at turning me on".
we've been doing this for a while (Score:3, Informative)

by Mandrake ( 3939 ) writes: <mandrake@mandrake.net> on Monday March 17, 2003 @11:01AM (#5528844) Homepage Journal

This sort of technology has been under development for a long time, and we have demos up on our website, also: Cepstral Online Speech Synthesis Demos [cepstral.com]. In fact, we have Higher Quality Limited Domain Demos [cepstral.com] available as well.

Is it just me? (Score:2, Insightful)

by evronm ( 530821 ) writes:

Or does anyone else not understand what the big deal about text to speech is?

I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.

Speech to text, on the other hand...
- Re:Is it just me? (Score:2)
  
  by Dolohov ( 114209 ) writes:
  
  I had that program; we used to make it try to call the dog.
  
  I think, though, that in retrospect it was not quite so good as we remember it; getting something like that to sound more natural is no small thing, nor is it to make it a smaller, faster program that makes fewer pronunciation errors. Incremental advancements are the name of the game for most technologies -- what was Apollo, after all, except a series of incremental advancements over Sputnik?
Good problem for competitive algorithms? (Score:2)

by MojoRilla ( 591502 ) writes:

Seems to me that text to speech would be a good problem for darwinian competitive algorithms [sciam.com]. You can take a book on tape, feed the text as input, and have the computer have different algorithms compete by judging them against the human speaker.

Many iterations later, you probably can get a computer sounding just like a person. And since it has had a whole book to practice over, it should be pretty general.
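
A toy version of that loop, as a hedged Python sketch: candidates are pitch contours, the "human speaker" is a made-up target contour, and the judge is plain squared error. All names and numbers here are hypothetical. Scoring real synthesized audio against a real recording is a far harder problem than this stand-in suggests.

```python
import random
from typing import List

# Stand-in for the human reading: a hypothetical pitch contour in Hz.
TARGET = [120.0, 135.0, 150.0, 140.0, 110.0]


def error(candidate: List[float]) -> float:
    """Sum of squared differences from the human contour (lower is better)."""
    return sum((c - t) ** 2 for c, t in zip(candidate, TARGET))


def mutate(parent: List[float], rng: random.Random, step: float = 5.0) -> List[float]:
    """Copy the parent with small Gaussian noise added to each value."""
    return [v + rng.gauss(0.0, step) for v in parent]


def evolve(generations: int = 200, population: int = 20, seed: int = 0) -> List[float]:
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    pool = [[rng.uniform(80.0, 200.0) for _ in TARGET] for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=error)
        survivors = pool[: population // 2]  # best half survives unchanged
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        pool = survivors + children
    return min(pool, key=error)


best = evolve()
print(round(error(best), 2))
```

Because the best candidates always survive unchanged, the best score can never get worse from one generation to the next. Whether it ever gets good enough to fool a listener is another matter.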
Slashdot Demographics (Score:2)

by SomeGuyFromCA ( 197979 ) writes:

Prediction: They'll look at their server logs and find:

a) requests for female voices saying dirty things and
b) requests for male voices saying: "How are you gentlemen!! All your base are belong to us!! You have no chance to survive make your time!!"
c) "I got an error, you insensitive clod!"
Not very good TTS (Score:2, Funny)

by DulcetTone ( 601692 ) writes:

The quality of AT&T's TTS or SpeechWorks' TTS is far more advanced. I had some fun with Speechworks' one and posted samples:
What I wish On-Star would actually say [dweebsofdeath.com]
A slightly-edited announcement calling our Bulldog to attend to a special matter [dweebsofdeath.com]
tone
Oh sure... (Score:2)

by sohp ( 22984 ) writes:

Phonemes aren't really going to help. It's easy to wreck a nice peach.
- Re:This is AT&T's Watson from 1995! (Score:2)
  
  by perky ( 106880 ) writes:
  
  The web page is even called webtts.watson.ibm.com. Obviously the quality of TTS has not improved much since 1996.
  
  Assuming this isn't a troll, then you might notice that IBM operates the massive Thomas J Watson research lab. Perhaps the URL has something to do with that? Second, you might want to have a listen if you think TTS hasn't moved in 8 years.
- Re:This is AT&T's Watson from 1995! (Score:2)
  
  by bmetz ( 523 ) writes:
  
  Wrong.
  
  It's the Watson Research Lab, as in T. J. Watson, as in the CEO who started the company over 80 years ago.
- Singing speech synthesizers: Dictionaraoke! (Score:2)
  
  by SimHacker ( 180785 ) writes:
  
  Festival [ed.ac.uk] has some singing demos, using a simple XML format to mark up text with beat duration and note pitch information.
  And Oregon Graduate Institute's CSLU Toolkit [ogi.edu] extends Festival with an implementation of Sable: an XML format that lets you mark up text with arbitrary timing, pitch and volume envelopes.
  And of course there's Dictionaraoke [dictionaraoke.org]!
  Main Entry: dictionaraoke Pronunciation: 'dik-sh&-"ner-A-O-ke Definition: Audio clips from online dictionaries sing the hits of yesterday and today. The fun of
