Phoneme Approach For Text-to-Speech in SCIAM 197
jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."
cool (Score:1, Interesting)
Try "I never promised you a rose garden." -The speaker sounds genuinely pissed-off!
graspee
I've always wondered why... (Score:2, Interesting)
Is MS carrying any patents on this, and acting dog-in-the-manger-ish? Any good low-footprint Linux-based apps for text-to-speech?
This is not a new approach. (Score:2, Interesting)
I googled for +"General Instrument" +"SC-01" and got links shown here [http].
I think Votrax was in bed with General Instrument; they had another chip by the same name that apparently does the same thing, but I do remember mine was a GI part.
It turns out all speech is nothing but sequences of utterances (vowels and syllabics). Just string them together and you get speech. String them together very carefully and the speech begins sounding like it came from a human instead of a machine.
I know IBM is refining this, but the concept is really old hat.
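The "string them together carefully" part is where the work is. A minimal sketch of the idea, with an invented table of toy "waveforms" and a simple crossfade at the joins (real concatenative systems do far more, but the skeleton looks like this):

```python
# Minimal sketch of concatenative synthesis: string pre-recorded phoneme
# samples together with a short crossfade so the joins sound less robotic.
# The phoneme table here is hypothetical placeholder data, not real audio.

def crossfade(a, b, overlap):
    """Blend the tail of `a` into the head of `b` over `overlap` samples."""
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = [
        tail[i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + mixed + b[overlap:]

def synthesize(phonemes, table, overlap=2):
    """Concatenate waveforms for a phoneme sequence, crossfading each join."""
    out = []
    for p in phonemes:
        out = crossfade(out, table[p], overlap if out else 0)
    return out

# Toy "waveforms": just short lists of sample values.
table = {
    "h":  [0.1, 0.2, 0.3, 0.2],
    "ae": [0.5, 0.6, 0.5, 0.4],
    "t":  [0.2, 0.1, 0.0, 0.0],
}
wave = synthesize(["h", "ae", "t"], table)
```

The careful part IBM is refining is exactly what this sketch glosses over: choosing which recorded unit to use and how to blend it given its neighbors.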
TTS is great (Score:4, Interesting)
mmm. I hope the server can take a slashdotting...
The TTS interface is C++, but it comes with a program that will compile text into AU files. I wrote the following script to change those AU files into mp3s:
speakfile is a slightly hacked version of the demo program IBM ships. Unfortunately, /.'s lameness filter doesn't like C++ code. :-(
It's pretty messy C++ hacking on my part, anyway. The Perl program is based on the CPAN module Audio::SoundFile. It's also hacked from a demo script that shipped with the module.
mmm. There was indenting in code at one point. Sigh...
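Since the lameness filter ate the original code, here's a rough sketch of the same pipeline as command lines. The `speakfile` name comes from the comment above; `sox` and `lame` are assumed stand-ins for the Perl conversion step, not what the poster actually used:

```python
# Sketch of the text -> AU -> mp3 pipeline described above.  `speakfile`
# is the hacked IBM demo mentioned in the comment; `sox` and `lame` are
# assumed substitutes for the Audio::SoundFile-based Perl conversion.
from pathlib import Path

def build_pipeline(text_file: str) -> list[list[str]]:
    """Return the commands to turn a text file into an mp3, in order."""
    stem = Path(text_file).stem
    au, wav, mp3 = f"{stem}.au", f"{stem}.wav", f"{stem}.mp3"
    return [
        ["speakfile", text_file, au],   # text -> AU (IBM demo program)
        ["sox", au, wav],               # AU  -> WAV (format conversion)
        ["lame", "--quiet", wav, mp3],  # WAV -> mp3
    ]

cmds = build_pipeline("chapter1.txt")
```

Each command list could be handed to `subprocess.run` in turn; building them as data first makes the pipeline easy to inspect or log.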
Re:Phonemes not phenomes (Score:1, Interesting)
Anybody willing to write "The Extended Phoneme?"
Homer Simpson perhaps....
Old news (Score:3, Interesting)
I want to know what's going on with speech-to-text, and will I be able to dictate rather than type a novel any time soon? (Preferably with some form of intelligent speech recognition, so it doesn't end up with passages like "She, ah... walked, no strode into the room to find, uh, er, dammit, did I say Rob left the tape on the counter or the desk? Oh, bloody hell. Hello? No, I'm not interested in double glazing. How did you get this number anyway? Bye. Where was I? Oh, crap! Computer, pause-")
take a note from musicians (Score:1, Interesting)
With sampling technology, especially multisampling, where each note can have a different sound associated with it depending on the accent, you could achieve some really stunning results in the text-to-speech market.
People like EastWest [eastwestsounds.com] have created such systems for virtual choirs... check out Voices Of The Apocalypse [soundsonline.com]; it's a pretty basic but revolutionary way of using samplers.
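The core of a multisample instrument is just a lookup from (pitch zone, articulation) to a recorded sample. A toy sketch, with invented zone boundaries and file names:

```python
# Toy multisample lookup: each note/articulation pair maps to a different
# recorded sample, the way choir libraries layer vowels and accents.
# Zone boundaries and sample names here are invented for illustration.

SAMPLES = {
    # (low_midi_note, high_midi_note, articulation) -> sample file
    (48, 59, "legato"):   "choir_low_legato.wav",
    (48, 59, "staccato"): "choir_low_staccato.wav",
    (60, 72, "legato"):   "choir_high_legato.wav",
    (60, 72, "staccato"): "choir_high_staccato.wav",
}

def pick_sample(note: int, articulation: str) -> str:
    """Select the sample whose key zone and articulation match the note."""
    for (lo, hi, art), sample in SAMPLES.items():
        if lo <= note <= hi and art == articulation:
            return sample
    raise KeyError(f"no sample mapped for note {note} ({articulation})")
```

The TTS analogy would be swapping "articulation" for accent or emphasis, with a much larger table.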
Hollywood applications for speech synthesis? (Score:2, Interesting)
Computer graphics have now advanced to the point where, given enough time and processing power, you can simulate almost anything with near-photographic realism. ILM, Digital Domain, Weta, et al can create completely convincing digital characters, but (leaving aside the issue of how much a digital performance is based on the 'actor' - e.g. Andy Serkis' 'performance' in LOTR:TTT, or Dex in SW:AOTC) they're still entirely dependent on human voice actors to complete the performance.
OK, the point of this article is on-demand realtime speech synthesis - roughly analogous to the 3D engines used in games. It has to compromise quality and detail for the sake of speed and responsiveness. Could there be a market for 'voice rendering' - a system which can take a script (possibly with some additional mark-up to indicate emotions, emphasis, timing, etc.) and generate an audio version which approaches a reading of the same script by a competent voice actor?
As well as the obvious 'virtual thespian' Hollywood angle, I'm thinking about stuff like low-budget audio drama [slashdot.org] - people who have the time and the technology to tweak a voice script, but can't afford professional actors to do it the old-fashioned way. There could be applications for creating audio books for the visually impaired, or to make life easier for students working through Shakespeare and Chaucer - I'm still amazed every time I hear Shakespeare read out loud how much meaning can be conveyed by the nuances of the human voice, compared to dry printed prose.
Is anyone actively working on anything like this? If not, why not? Is it really that hard to fool the human ear? Or is it just a case that it's still cheaper and easier just to employ people to read things into a mic?
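The "script plus additional mark-up" idea could be as simple as tagging each line with an emotion before handing it to a renderer. A sketch, with an invented `[angry]`/`[calm]` tag syntax (not any real standard - W3C's SSML covers similar ground with XML):

```python
# Sketch of the "script plus mark-up" idea: a tiny parser that splits an
# annotated script into (emotion, line) cues a voice renderer could consume.
# The [angry]/[calm] tag syntax is invented for illustration, not a standard.
import re

def parse_script(script: str) -> list[tuple[str, str]]:
    """Split '[emotion] text' lines into (emotion, text) rendering cues."""
    cues = []
    for line in script.strip().splitlines():
        m = re.match(r"\[(\w+)\]\s*(.*)", line)
        if m:
            cues.append((m.group(1), m.group(2)))
        else:
            cues.append(("neutral", line.strip()))
    return cues

cues = parse_script("""
[angry] I never promised you a rose garden.
[calm] Where was I?
""")
```

The hard part is not parsing the cues but making the renderer honor them; the mark-up layer itself is cheap.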
Re:I was expecting better... (Score:3, Interesting)
I wanted the same voice for my computer-controlled house, and tracked down where they got it. Now my handheld says, "Warning. Power failure imminent." when its battery is about to die.
Re:comparison to Apple's technology? (Score:3, Interesting)
No matter how good these phoneme-based techniques are, they're limited to the original timbre of the recorded speaker - you cannot synthesize a brand-new voice (with on-the-fly inflections that were never recorded, etc.) with that TTS method. There has been research into modeled speech synthesis [ogi.edu], where a mathematical model of the lungs, windpipe, vocal cords, and mouth/tongue/lips is manipulated to generate speech. Given the amount of computing power available today, you'd expect more people to use that type of TTS, since it's inherently more flexible. However, the biggest problem so far is that nobody really has a good model of how all the various fleshy parts of the human speech apparatus interact. Any open source people want to tackle this problem and start implementing some of these modeled synthesis algorithms?
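A full articulatory model is exactly the open problem the parent describes, but the classic source-filter simplification gives a feel for the approach: an excitation signal (standing in for the vocal cords) driven through resonators (standing in for the vocal tract). The formant values below are rough textbook figures for an /a/-like vowel, used only to illustrate:

```python
# Toy source-filter sketch (a gross simplification of articulatory
# modelling): a pulse train excites two-pole resonators that stand in
# for vocal-tract formants.  Formant frequencies/bandwidths are rough
# illustrative values, not a calibrated model.
import math

def pulse_train(n_samples, period):
    """Impulse every `period` samples -- a crude glottal source."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, freq, bandwidth, fs):
    """Two-pole resonant filter: one 'formant' of the vocal tract."""
    r = math.exp(-math.pi * bandwidth / fs)
    theta = 2 * math.pi * freq / fs
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

fs = 8000
source = pulse_train(800, period=80)        # ~100 Hz pitch at 8 kHz
voiced = resonator(source, 700, 130, fs)    # first formant
voiced = resonator(voiced, 1200, 70, fs)    # second formant
```

Changing the resonator frequencies over time is what would give you "new" voices and inflections - and modelling how real articulators move those resonances is the part nobody has nailed.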
Re:I'm not actually convinced phonemes exist, y'kn (Score:2, Interesting)
In fact, much of what you've said indicates the *eventual* possibility of a very conversable TTS/STT translating algorithm. (Whether or not these will be the same algorithm in reverse will be for the future to decide).
"Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.
Right there, you've laid out a *very complicated* but by no means impossible way of looking at phonemes individually. In my class, we have some 20 people who all have great difficulty just figuring out allomorphs, which (some Slashdotters might not know) are phonemes either in complementary distribution, as with the plural ending ([s] in "cats", [z] in "dogs", [ɪz] in "horses"),
or in free variation, such as the names Lisa/Liza, which mean the same thing and derive from the same root, but which have split for geographical/cultural/other reasons.
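The plural case is regular enough to sketch as a rule. The classification below goes by spelling, which is a rough stand-in for real phonemic analysis (and misses plenty of words), but it shows the complementary-distribution logic:

```python
# Sketch of the plural-allomorph rule: English plural /-z/ surfaces as
# [ɪz] after sibilants, [s] after other voiceless sounds, [z] elsewhere.
# Classifying the final sound by spelling is a crude stand-in for real
# phonemic analysis and only handles the easy cases.

SIBILANT_ENDINGS = ("s", "z", "sh", "ch", "x")
VOICELESS_ENDINGS = ("p", "t", "k", "f", "th")

def plural_allomorph(word: str) -> str:
    """Pick the plural allomorph from the (spelled) final sound."""
    if word.endswith(SIBILANT_ENDINGS):
        return "ɪz"   # buses, churches, foxes
    if word.endswith(VOICELESS_ENDINGS):
        return "s"    # cats, books
    return "z"        # dogs, days
```

A TTS front end needs rules of exactly this shape (just far more of them, operating on phonemes rather than letters) before it ever touches audio.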
Now, where the average English major might not always recognize similarities and patterns, the average Slashdotter has trained him/herself to do so, and some are likely saying to themselves, "where else does this happen?" and "where is this not true?", which are useful, scientific questions.
You yourself present the answer to the problem you raise: we have to look at the surrounding phonemes in order to figure out how to make one particular sound fit the word it's in. This is *damn hard*, but not impossible. It's like the fact that stress affects a phoneme in certain languages: we just need to adapt to thinking about language in different terms than simply speaking it and spelling vague representations of it (by first realizing how vague those representations are, which is why the phoneme set is taught first in Linguistics classes).
Personally, I think the problem is that we all want TTS/STT and we want it *now!* Why can't the computer just say it or hear it the way we do? Questions like that come from a lack of understanding, both of how the machine represents everything and of the garbled way in which our language is represented. Phonemes are the obvious solution: the software should only have to do STP/PTS conversion, and our language should conform to that, really, since it's the creative dialectical shifts that create a problem; but we'll end up devising a creative solution for that, too.
Now, we all know what happens with lossy compression...
Yes, we get a slightly inaccurate but highly useful jpeg of the Andes, or someone's new desktop widget set, or a very listenable 192kbps mp3 of "Hurt" covered by Johnny Cash (even sadder than the original, IMHO).
And TTS/STT will have its flaws as well, but a digital (though wide) set of sound symbols like phonemes will help us to break things down somewhat until we figure out that something *smaller* about those sounds is very functional, *and* how to represent *that* level of speech, just as we represented matter by some informal type, then by molecules, then atoms, and now we know quite a bit about how the electron, proton and neutron work, and are working on a smaller level.
To say you "don't really believe in phonemes" oversimplifies things.
Re:I'm not actually convinced phonemes exist, y'kn (Score:2, Interesting)
You argued in another post for models of 4+ phonemes. The reason we don't see this is that it's not a huge theoretical leap from triphones (and thus bores researchers), and there are computational/storage/training-efficiency requirements to consider. This is why one doesn't record an exhaustive library of every possible utterance in the first place. I think once you get to 7-phones, you may be better off trying to understand the phrase from a higher level of abstraction.
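The storage/training objection is easy to quantify. Assuming an inventory of roughly 40 phonemes (a common ballpark for English, used here only for illustration), the number of distinct n-phone context units is:

```python
# Back-of-envelope on why context width explodes: with ~40 phonemes, the
# number of distinct n-phone context models grows as 40**n.  This is the
# storage/training cost weighed against triphones in the comment above.

def n_phone_models(inventory_size: int, n: int) -> int:
    """Number of distinct n-phone context units over a phoneme inventory."""
    return inventory_size ** n

monophones = n_phone_models(40, 1)    # 40
triphones = n_phone_models(40, 3)     # 64,000 -- already needs state tying
sevenphones = n_phone_models(40, 7)   # ~1.6e11 -- hopeless to train directly
```

Even triphone systems only work because most of those 64,000 contexts are tied together or never occur; at 7-phones the combinatorics alone push you toward word- or phrase-level modelling instead.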
Have we correctly identified the right compact expression of speech? I doubt it. Getting speech stuff to work involves a lot of tweaking that is theoretically ungrounded. Tweaking in a methodical and science-based way _is_ engineering, however.
BTW, I seem to remember a prof saying that X-ray cinematography more-or-less proved the existence of vocal tract target configurations in speech, which correspond to phonemes. Not to mention that you can encode a message in IPA and have it understood by someone else. Even if they're not totally correct, phonemes may be a sufficient basis for building speech systems.