IBM Science

Phoneme Approach For Text-to-Speech in SCIAM

jscribner writes "Scientific American is running a feature on IBM Research's Text-to-Speech technology. It discusses the current state of affairs in this field, and describes IBM's phoneme based 'Supervoices' approach. The IBM site provides a demonstration, allowing users to enter text to be rendered to speech, as well as providing several examples in other languages."

  • by watzinaneihm ( 627119 ) on Monday March 17, 2003 @08:04AM (#5528053) Journal
    Does the poster have something against IBM ... to link an application to a slashdot post?
    Even the guys who don't read the articles might now be tempted to try clicking the "enter text" link.
  • by Anonymous Coward on Monday March 17, 2003 @08:38AM (#5528154)
    The IBM product seems to take the recording of a long text read by a human and automatically produce the data collection that is the artificial voice. It uses speech recognition methods to align text and recording. It also stores more than just a simple collection of phonemes: where older text-to-speech solutions would modify the sample of a phoneme to reflect a certain position in a sentence, IBM's solution appears to use a phoneme sample recorded in the same context, making the result much less monotone. This approach does, however, raise the question of whether "phoneme based" is still its most important characteristic. English has only about 40 phonemes, not 10000 (the number of samples used by the IBM "voices").
  • by jdoeii ( 468503 ) on Monday March 17, 2003 @09:38AM (#5528410)
    Apparently IBM bought the Watson project, formerly AT&T's and later Lucent's. The web page is even called webtts.watson.ibm.com. Obviously the quality of TTS has not improved much since 1996.

    Can someone please tell me why this 8 y.o. project is considered news?
  • by Bertie ( 87778 ) on Monday March 17, 2003 @10:01AM (#5528535) Homepage
    I have a master's in linguistics, specialising in speech processing and the like, and I don't really believe in phonemes.

    In the beginning, there was the word. And the word was spoken. A long, long time later came writing. Most early forms of writing seem to have been pictographic. Eventually that started to be a bit too complicated for most, and somewhere along the line we switched to trying to represent the sounds of the words that we used. These writing systems had to be sort of retrofitted onto the sounds we used, and so they were never going to amount to a perfect transcription of the sounds used. Huge alphabets quickly become unwieldy, and while there is a great deal of variation between languages in terms of how they deal with these issues, in most cases sounds end up being shoehorned into one category or another - "oh, that's sort of a /t/, I'll write it down like that". You know yourselves how often words in English bear no relation to their spoken forms.

    Anyway, a long time after that, people got interested in phonetics. Conditioned as we were into thinking of words as collections of letters, along came the concept of the phoneme, which, as somebody said above, is the smallest individual unit of speech which can be distinguished from other such units. Phoneticians set about mapping all the sounds of all the languages in the world to phonemes, and we got the International Phonetic Alphabet.

    Later still, we managed to invent machines which allowed us to analyse sound spectra. Run a spoken utterance through one of these and what you'll see most certainly isn't a succession of distinct sounds. Truth is, our brain does so much work on the raw sound that our perception of the sounds is entirely different from the reality. "Phonemes" don't just start and end neatly - they overlap massively. A single vowel can affect maybe the preceding four segments and the following six because of the effects of reconfiguring your vocal tract. The next sound might do the same. And the next one... As you can probably imagine, it's a pretty messy picture really. Believe me, I have suffered greatly trying to segment voice spectra by hand.

    The point of all this is that when we started speaking yonks ago, we were making use of the vocal tract nature (God, natural selection, take yer pick, I don't want to get into an argument about it) gave us. We weren't thinking of phonemes and stuff, we were just making noises subject to the limitations of the equipment we had. The notion that this is a nice, ordered system of sounds is an artificial one imposed by us in an attempt to make sense of it all, and it amounts to an expanded version of an oversimplified system (the alphabet). Now, we all know what happens with lossy compression...

    Simply drawing lines down the spectrogram in the name of making it easier to work with just throws away subtlety, so that when you use a phoneme-based TTS system you get a series of disjointed sounds with perhaps some token effort at coarticulation (i.e. the phenomenon of overlapping sounds described above), and it's always going to sound awful. The consequences for speech recognition are much worse (sure, your hidden Markov model-based systems working with sequences of two or three phonemes are pretty effective, but they'll never be 100% successful in my opinion).

    In short, what you have here's an engineer's approach to art. It's like taking a painting by your favourite artist and turning it into a 256-colour bitmap, then analysing the result and trying to make new paintings in the same style.
  • Is it just me? (Score:2, Insightful)

    by evronm ( 530821 ) <<ten.cnictd> <ta> <mnorve>> on Monday March 17, 2003 @11:55AM (#5529169) Homepage

    Or does anyone else not understand what the big deal about text to speech is?

    I had a program for my C64 circa 1983 that did pretty good text to speech. Granted the voice was pretty robotic, but I'd think that 20 years later, this should be a cinch.

    Speech to text, on the other hand...

  • by prowley ( 587280 ) on Monday March 17, 2003 @04:10PM (#5531235)
    The way to smooth out the lumps is to not use phonemes at all, but diphones. Imagine recording two phonemes uttered by a human speaker in sequence, and then slicing through the middle of each phoneme and discarding the ends. That gives you a diphone. Diphones are far superior because phonemes do not change in the middle, so there are no "lumps" at the splice. On the other hand, phonemes do change depending on what phoneme is uttered next, simply because in articulating different phoneme sequences the human vocal tract must perform different gymnastics. The only downside is that a full set of diphones is much larger than a full set of phonemes - and they are all buggers to record.
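
    For anyone who wants to see the diphone idea in concrete terms, here is a minimal Python sketch. It only handles symbols, not audio: it splits a phoneme sequence into middle-to-middle units and gives an upper bound on how big the inventory gets. The function names, the "_" silence marker, and the ARPAbet-style transcription of "cat" are illustrative assumptions, not anything from IBM's system or the SciAm article. The figure of 40 phonemes is the rough count mentioned earlier in the thread.

```python
SILENCE = "_"  # hypothetical marker for the silence before/after an utterance


def to_diphones(phonemes):
    """Split a phoneme sequence into diphones.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so every splice point falls in the stable centre of a phoneme.
    """
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def inventory_size(n_phonemes, with_silence=True):
    """Upper bound on distinct diphones: every ordered pair of symbols."""
    n = n_phonemes + (1 if with_silence else 0)
    return n * n


if __name__ == "__main__":
    # "cat" in a rough ARPAbet-style transcription (illustrative only)
    print(to_diphones(["K", "AE", "T"]))
    # Ordered pairs of 40 phonemes, ignoring silence
    print(inventory_size(40, with_silence=False))
```

    With 40 phonemes the bound is 40 x 40 = 1600 ordered pairs, which is why a diphone set is so much bigger than a phoneme set. A real inventory would be somewhat smaller, since not every pair of sounds actually occurs in a given language.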
