

Phoneme Approach For Text-to-Speech in SCIAM
jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme-based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."
Does the poster have something against IBM (Score:4, Insightful)
Even the guys who don't read the articles might now be tempted to try clicking the "enter text" link.
Re:Does the poster have something against IBM (Score:2, Funny)
Re:Does the poster have something against IBM (Score:2)
Obviously, they are testing the system under load now, and this is part of their test plan.
Tomorrow, we'll see a 'get your own freshly compiled linux ISO from IBM' here...
Re:Does the poster have something against IBM (Score:2)
But.... nothing compares to the
I think that ... (Score:2)
Phonemes not phenomes (Score:4, Informative)
Re:Phonemes not phenomes (Score:1, Interesting)
Anybody willing to write "The Extended Phoneme?"
Homer Simpson perhaps....
Re:As a concerned American patriot, (Score:1)
And where's freedomdot? It's all wrong, I tells ya, it's all freedomin' wrong!
Freedom. The new Marklar.
Re:As a concerned Slashdot reader.. (Score:1)
I was expecting better... (Score:5, Informative)
What's so special about it?
Re:I was expecting better... (Score:5, Informative)
Re:I was expecting better... (Score:2)
Incidentally, they don't seem to have improved a great deal from the concatenative TTS systems IBM had 4 years ago. There was one model of the UK marketing woman for ViaVoice, and for some sentences the TTS was almost indistinguishable from the real thing. The only problem with these systems is that the memory footprint is massive, so they tak
Re:I was expecting better... (Score:3, Interesting)
I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.
Natural Voices Gagged: AT&T is asleep at the d (Score:3, Informative)
We were hoping AT&T would do a better job than IBM at supporting their voice synthesizer. IBM pulled the Linux version of ViaVoice off the market without so much as a peep to their adoring fans on Slashdot, and wiped all mention of the Linux version from their web server. (Google isn't even allowed to cache it.) After IBM milked the Slashdot Linux fanboy publicity for all it was worth, th
Re:I was expecting better... (Score:2, Informative)
We also offer quite a large range of languages. Our Canadian French voice, which was just released, is fantastic! Looks like marketing hasn't put him on the demo page yet though... :(
Todd
Re:I was expecting better... (Score:2)
Re:I was expecting better... (Score:3, Interesting)
Re:I was expecting better... (Score:2)
Haven't gotten the IBM one to work yet.
Re:I was expecting better... (Score:2)
speaking of the /. effect (Score:4, Funny)
IBM Text-to-Speech Research Demonstration
Input Communcations Error.
You have reached this page because of an severe input error. It appears that the client didn't connect to the server. Please inform the system administrator using the feedback mechanism on the main home page.
Re:speaking of the /. effect (Score:1)
This could be a hit... (Score:1, Funny)
...if they make some sort of interface between e-books and text-to-speech. Instant 'sound-book' *smiles*. No longer do the visually impaired have to wait for someone to make the soundbook for them, no longer do I need to actually read the long, boring documents people send me at work.
With the right technical document, this could cure insomnia as well...
Re:This could be a hit... (Score:1)
You should check out the Digital Talking Book specs. It is an open format and there are readers available which allow text to speech and other effects. Most of the readers have been designed with visually impaired target audien
Re:This could be a hit... (Score:2, Interesting)
Re:This could be a hit... (Score:2)
PHONEME, y'all, not *phenome (Score:3, Informative)
Related to "telephone," "phonics," etc.
Re:PHONEME, y'all, not *phenome (Score:5, Funny)
AT&T have been doing this for a while! (Score:5, Informative)
http://www.naturalvoices.att.com/demos/
You'll find AT&T's version a whole lot better. The main problem with voice synthesis is smoothing of phoneme edges, where if it is done too aggressively the speech synthesis can sound too "lumpy".
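To make the "smoothing of phoneme edges" point concrete, here is a minimal sketch of the simplest possible kind of smoothing: a linear crossfade where two recorded units meet. This is only an illustration, not what AT&T or IBM actually do. The function names, the 8 kHz rate, and the sine tones standing in for recorded units are all assumptions made up for the example.

```python
import math

def sine_unit(freq_hz, n_samples, rate=8000):
    """Stand-in for a recorded speech unit: a plain sine tone (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]

def crossfade_concat(a, b, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    if overlap < 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        return a + b  # hard join, no smoothing at all
    head = a[:len(a) - overlap]
    tail = b[overlap:]
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming unit, 0 < w < 1
        blended.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + blended + tail

def max_jump(samples):
    """Largest sample-to-sample step; a crude proxy for an audible click."""
    return max(abs(y - x) for x, y in zip(samples, samples[1:]))

if __name__ == "__main__":
    first = sine_unit(200, 400)
    second = sine_unit(330, 400)
    for overlap in (0, 20, 200):
        joined = crossfade_concat(first, second, overlap)
        print(overlap, len(joined), round(max_jump(joined), 4))
```

A larger overlap hides the seam but also blurs more of each unit, which is the trade-off the post is describing: too little smoothing clicks, too much smears the sounds together.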
The other thing is, speech synthesis via phonemes is very basic practice indeed! I remember having a Currah Speech module for my ZX Spectrum (1982 home computer) - and the first thing you were taught about was phonemes. I'm not entirely sure what's new about this IBM product. It's basically not that much evolved from the mid-'90s.
Re:AT&T have been doing this for a while! (Score:3, Funny)
It actually sounded like "Shbansheehailsacthoowawaaaawaaaens"
I remember you could also turn it on while you were programming, so every time you pressed a key it would say "ONE ZERO PRINT QUOTE ACH EE ELL ELL O QUOTE ENTER TWO ZERO ENTER RUN ENTER". It used to drive me batty. It was one of those eighties things which you thought was "cool" at the time, but
Re:AT&T have been doing this for a while! (Score:2, Insightful)
Re:AT&T have been doing this for a while! (Score:2, Insightful)
cool (Score:1, Interesting)
Try "I never promised you a rose garden." -The speaker sounds genuinely pissed-off!
graspee
Here's another text-to-speech site (Score:4, Funny)
Some of the voices sound okay I guess. Better than Stephen Hawking anyway.
*blush* (Score:5, Funny)
Oh ... just me? *blush*
Re:*blush* (Score:2)
Re:*blush* (Score:2)
hmmmm... (Score:1, Informative)
cut'n paste:
http://www.cstr.ed.ac.uk/projects/festival/
to try it out (Score:1)
http://festvox.org/voicedemos.html
does the same as IBM's demo page. sounds the same as well. but hey, i'm a layman in linguistic matters, so there's prolly a *huge* improvement i understand crap about
Open Source Speech Synthesis (Score:5, Informative)
Re:Open Source Speech Synthesis (Score:2)
hehe (Score:1)
o wait, this will cost me karma as well! -1 offtopic
Re:Open Source Speech Synthesis (Score:3, Informative)
Festival is great, especially with the OGI patches [ogi.edu]. I was completely blown away by Festival's quality compared to other open-source TTS engines, and the OGI stuff makes stock Festival sound pathetic. Really great stuff, regrettably still not as good as IBM's or AT&T's stuff, but they have got a TTS that I can listen to for hours without making my ears bleed.
Regrettably OGI patches are for personal/research use only, so Debian won't ship them...
Re:Open Source Speech Synthesis (Score:2)
Unfortunately, free-TTS (i.e., playing any text, not just replaying canned speech) is a growing area with definite, large commercial potential, and everyone seems to know this.
Re:Open Source Speech Synthesis (Score:3, Informative)
Re:Open Source Speech Synthesis (Score:2)
Re:Open Source Speech Synthesis (Score:2)
# apt-get install festival festvox-poslex festvox-kallpc16k
# lynx -dump -nolist http://www.slashdot.org/ | festival --tts
Re:Open Source Speech Synthesis (Score:2)
Re:Open Source Speech Synthesis (Score:2)
echo "(voice_tll_diphone) (Parameter.set 'Audio_Method 'freebsd16audio)(SayText \"$*\")" | festival --pipe
Obviously using whatever sound system you have. By default it will try to use NAS if it is installed on your system, but I've nev
Re:Open Source Speech Synthesis (Score:2)
Someone who has figured out how to configure that should put it into Debian as a package... then ordinary users could use it.
comparison to Apple's technology? (Score:4, Informative)
How does this compare? I think it is at least at the same level, if not further along! Good work Apple for being in the game, if not ahead of the game on this one.
Re:comparison to Apple's technology? (Score:4, Interesting)
Re:comparison to Apple's technology? (Score:2)
Maybe that explains the fanatical devotion of Mac users...
"I do what the voices in my Mac tell me" sounds like a t-shirt begging to be printed up.
Re:comparison to Apple's technology? (Score:2)
Re:comparison to Apple's technology? (Score:3, Interesting)
No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded
Don't forget the talking cat: (Score:2)
cat -a is even cooler than snoop -a.
Re:comparison to Apple's technology? (Score:2)
The one you probably want answered is which sounds better. At this point the IBM voices sound better than the Apple TTS, but not by very much. Especially when you consider that Apple hasn't improved the voices in over 7 years IIRC (Of course given the option of better voices or having OS X, I'll forgo the voices). Playing several phrases from IBM's and Apple's TTS systems yields the opinion that
And don't forget Bell Labs (Score:5, Informative)
I've always wondered why... (Score:2, Interesting)
Is MS carrying any patents on this, and acting Dog-In-The-Manger..ish? Any good low-footprint Linux-based apps for text-speech?
Re:I've always wondered why... (Score:3, Informative)
Up to now, Microsoft has not really made any significant contributions to speech technology. They have bought l
Re:I've always wondered why... (Score:2)
This reminded me of an amusing sideline in the history of speech Reco. Cambridge University Engineering department (CUED) originally built an engine called HTK [cam.ac.uk]. This was then sold to a company called Entropic. Entropic were then bought by Microsoft, who have licensed HTK back to CUED, who distribute it for free. This leads to the amusing situation in which the license [cam.ac.uk] for a piece of Microsoft code contains the
incremental (Score:2)
Of course, the intonation is roughly that kind of compromise a PR spokesman employs who is trying to sound convincing but has no clue what he is saying. That's not surprising, given that the TTS systems really do not have any understanding of the meaning of what they are saying.
This is not a new approach. (Score:2, Interesting)
I googled for +"General Instrument" +"SC-01" and got links shown here [http].
I think Votrax was in bed with General Instruments, as they have another chip by the same name, that apparently does the same thing, but I do remember mine was a GI part.
It turns out all speech is nothing but sequences of utterances ( vowels and syllabic )
Re:This is not a new approach. (Score:1)
The Google General Instruments SC-01 Links [google.com].
Sorry for the botched post.
Re:This is not a new approach. (Score:3, Informative)
It's a whole lot more complicated than that. If you think phonetically about the way we talk we often merge words together rather than leave short discrete pauses between words. (For example, do you say "leaderovthepack" or "leader. ov. the. pack"? Also note
Re:This is not a new approach. (Score:1)
Votrax made the SC-01 chip.
General Instruments made the SP0256 chip
I do not remember if the chip I had was dual marked - so I do not know if they were the same chip but under different numbers, and quite frankly I do not wanna tear into the old machine right now to verify.
And it was in the early 1980s, which was about 20 years ago. Not 30.
You can read more about it here [redcedar.com].
Re:This is not a new approach. (Score:2)
This is not a new approach.
No, but it's a fairly sophisticated refinement of an old(ish) approach. The core ideas that make it possible have been around for a number of years, but there are a lot of constraints that make it difficult to achieve. And just for rant's sake, the qualifying use of the term 'phoneme' in the post is misleading. Phonemes are the fundamental units of vocal articulation; it would be impossible to synthesize speech without them. What sets different TTS systems apart is how they are
Re:This is not a new approach. (Score:2)
And I think you're right. Placement is everything. Cheers.
TTS is great (Score:4, Interesting)
mmm. I hope the server can take a slashdotting...
The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:
speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(
It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.
mmm. There was indenting in code at one point. Sigh...
ack. no good (Score:3, Funny)
give me! give me! oh! I am coming!! OHHHH!
Actually I did try it. the result (of the above line) was not spectacular. I am impressed with the quality in general, though. Tried "Sticking feathers up your butt does not make you a chicken," but that needs to be said with feelings as well, I suppose.
Oh yeah, this kind of technology is excellent for a computer to read out the sites to you, if, say, your eyes are tired. It should work wonders for slashdot, even.
Re:ack. no good (Score:2)
I think you discovered the killer application for this technology: the voice reads erotic stories to you while you surf pr0n.
This is cool and all, but (Score:2)
Will it ever be possible? As far as I can tell, S2T is quite a bit more difficult than English->French translation, for instance, and that still has a long way to go...
Re:Evil Anti-War Belgian Fries!!!! (Score:2)
being belgian, so am I !
We don't have much of a wine culture, dumbo. We're beer drinkers. Check out www.belgianbeer.com. We practically invented the stuff.
Listen to "US female 2" (Score:1, Funny)
"Aargh! I've been slashdotted!" [fys.ku.dk]
Bandwidth sponsored by Danish research funding...
And here's the Bell Labs version: (Score:2)
This one is much better at saying "slashdotted". Neither of them do the "Aargh!" very well. Especially the IBM one ought to be convincing, given current circumstances
Generate more samples for yourself at http://www.naturalvoices.att.com/demos/ [att.com]
State of the art in TTS (Score:4, Informative)
Re:State of the art in TTS (Score:2)
Also, festival supports unit selection synthesis (which is what you're calling corpus synthesis - the corpus is just the body of text to be recorded, which is used in diphone synthesis also) as well as diphone synthesis.
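For anyone wondering what "unit selection" means in practice, here is a rough sketch of the usual textbook formulation: each candidate recording gets a target cost (how well it matches what you want) and a join cost (how badly it clashes with its neighbour), and a Viterbi-style search picks the cheapest chain. This is not Festival's code. The `(phone, pitch)` tuples, the pitch-only cost functions, and all the numbers are invented for illustration.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so total target + join cost is minimal."""
    # best[i][j] = (cheapest cost of a chain ending at candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace the cheapest chain back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

# Hypothetical costs: units are (phone, pitch_hz); only pitch is compared.
def pitch_target_cost(target, unit):
    return abs(target[1] - unit[1])

def pitch_join_cost(prev_unit, next_unit):
    return abs(prev_unit[1] - next_unit[1])

if __name__ == "__main__":
    targets = [("h", 100), ("e", 110), ("l", 120)]
    candidates = [
        [("h", 100), ("h", 140)],
        [("e", 150), ("e", 105)],
        [("l", 118), ("l", 90)],
    ]
    path, total = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
    print(path, total)  # [('h', 100), ('e', 105), ('l', 118)] 25
```

Diphone synthesis is roughly the degenerate case with a single candidate per slot, so there is nothing to search and all the work goes into signal processing at the joins.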
Better than the TI speech synth chips? (Score:2)
ttyl
Farrell
Old news (Score:3, Interesting)
I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")
Bonehead: it's P-H-O-N-E-M-E (Score:1)
Unbelievable! (Score:1)
That just makes my day!
Hollywood applications for speech synthesis? (Score:2, Interesting)
Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.
OK, the point of this article
In the 'has been doing that for a while' series : (Score:2)
It is even free (as in beer) for personal use.
I'm not actually convinced phonemes exist, y'know (Score:5, Insightful)
In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a
Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticists set about mapping all the sounds of all the languages in the world to phonemes, and we got the international phonetic alphabet.
Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.
The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...
Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).
In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.
Re:I'm not actually convinced phonemes exist, y'kn (Score:2, Interesting)
In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these
Re:I'm not actually convinced phonemes exist, y'kn (Score:2, Interesting)
You argued in another post for models of 4+ phonemes. Why we don't see this is because it's not a huge theoretical leap from triphones (thus boring researchers) and there are computational/storage/training efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-ph
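A quick back-of-the-envelope sketch of why longer context windows get expensive: if every combination of phonemes in a window needs its own model or recording, the inventory grows exponentially with the window length. The figure of 40 phonemes is an assumption (a common ballpark for English), and real systems prune or cluster contexts that never occur, so treat these as upper bounds.

```python
N_PHONEMES = 40  # assumed inventory size, roughly right for English

def context_unit_count(n_phonemes, context_len):
    """Upper bound on distinct units when each unit spans `context_len` phonemes."""
    return n_phonemes ** context_len

if __name__ == "__main__":
    for n in (1, 2, 3, 4, 7):
        print(f"{n}-phone units: {context_unit_count(N_PHONEMES, n):,}")
    # 3-phone units: 64,000
    # 7-phone units: 163,840,000,000
```

Going from triphones to 4-phone units multiplies the bound by 40, which is the storage and training cost the post is pointing at.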
Phonemes don't exist? Do YOU??? (Score:2)
This is not a very coherent argument. You might as well say that you doubt the existence of musical notes, since you've diagrammed the power spectrum
LPC vocoder (Score:2)
It looks like they are using glottal pulses as you say, and they are doing the female voice (Crystal) by boosting the first two harmonics and by filtering out the range past 4 kHz and replacing it with noise to give it that breathy sound that is characteristic of female voices in American culture (this varies with culture
What about physical modelling? (Score:2)
So what's the next step? Is there anyone working on physical modelling of the acoustic properties of the mouth, tongue, throat, larynx, and lungs as they glide between different phonemes to produce speech sounds? This seems like the only way you're gonna get something closer to natural than this recorded-phoneme technol
Check a university library (Score:2)
It seems to me that with modern DSPs cranking along with many more calculations per second than a VAX could ever hope for, and one of the best theoretical mathematicians ever having a reliance on the technology, that
Counterfeit sound bites (Score:2)
we've been doing this for a while (Score:3, Informative)
Is it just me? (Score:2, Insightful)
Or does anyone else not understand what the big deal about text to speech is?
I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.
Speech to text, on the other hand...
Re:Is it just me? (Score:2)
I think, though, that in retrospect it was not quite so good as we remember it; getting something like that to sound more natural is no small thing, nor is it to make it a smaller, faster program that makes fewer pronunciation errors. Incremental advancements are the name of the game for most technologies -- what was Apollo, after all, except a series of incremental advancements over Sputnik?
Good problem for competitive algorithms? (Score:2)
Many iterations later, you probably can get a computer sounding just like a person. And since it has had a whole book to practice over, it should be pretty general.
Slashdot Demographics (Score:2)
a) requests for female voices saying dirty things and
b) requests for male voices saying: "How are you gentlemen!! All your base are belong to us!! You have no chance to survive make your time!!"
c) "I got an error, you insensitive clod!"
Not very good TTS (Score:2, Funny)
What I wish On-Star would actually say [dweebsofdeath.com]
A slightly-edited announcement calling our Bulldog to attend to a special matter [dweebsofdeath.com]
tone
Oh sure... (Score:2)
Re:This is AT&T's Watson from 1995! (Score:2)
Assuming this isn't a troll, then you might notice that IBM operates the massive Thomas J Watson research lab. Perhaps the URL has something to do with that? Second, you might want to have a listen if you think TTS hasn't moved in 8 years.
Re:This is AT&T's Watson from 1995! (Score:2)
It's the Watson Research Lab, as in T. J. Watson, as in the CEO who started the company over 80 years ago.
Singing speech synthesizers: Dictionaraoke! (Score:2)
And Oregon Graduate Institute's CSLU Toolkit [ogi.edu] extends Festival with an implementation of Sable: an XML format that lets you mark up text with arbitrary timing, pitch and volume envelopes.
And of course there's Dictionaraoke [dictionaraoke.org]!
Main Entry: dictionaraoke Pronunciation: 'dik-sh&-"ner-A-O-ke Definition: Audio clips from online dictionaries sing the hits of yesterday and today. The fun of