Neural Net Outperforms Humans in Speech Recognition 203
orac2 writes "Here's a press release (with a real video clip) on a neural net that can recognise speech better than humans - even in noisy environments. The network uses just 11 neurons. They did it by incorporating an aspect of biological neural networks normally ignored by artificial networks: the timing of signals between neurons. Beyond the immediate application to speech recognition, the wider implications for all neural networks are obvious. " Neurons. Mmm.
HAL-9000 (Score:1)
Re:Article misses biggest (and scariest) use ... (Score:1)
Tape surveillance and a human filter can only go so far. Parabolic antennae are bulky and easily recognized. Humans using digital filters require training and equipment booking, which could be expensive depending on the amount of noise in the recording and the expertise of the user.
These speech recognition devices could put surveillance within reach of anyone, regardless of their expertise, without hiring potentially expensive human filters. You or I, without any surveillance experience or knowledge of digital filtering technology, could trail certain people and tape their conversations even in crowded areas.
How does this new technique affect learning time? (Score:1)
Re:The scariest thing... (Score:2)
In essence, what was created is little more than a super-hearing-aid. Certainly a good thing for the hard-of-hearing (and this one, it would seem, could significantly boost the hearing of anyone, even those with "normal" hearing to start).
Re:How to extend this: (Score:1)
There is some very definite predefined structure to the brain between the gross and cellular anatomy. It isn't just a raw neural net that is trained by physical pleasure and pain. Somehow, consciousness, intent, recognition of success and failure are built in at the highest level and recognition of the same sound in different pitches and visual recognition independent of image location on the retina are built in at a low level.
I suspect if you could read and analyse all of the connections, it wouldn't look like one big mess of seemingly random connections, but a lot of small neural nets arranged and interconnected in hierarchies, sorting pipes, buses, and the like (which aren't especially neural mechanisms, they're probably better done in the familiar ways we've developed that suit silicon).
I agree about using more neural nets on the problem. I never meant to imply that it would be anything more than a single component in a larger system. At some point in the process, you need to recognize sounds, whatever you do with them later.
What I dislike is the way some people treat neural nets as a magic bullet, as if we only need to make a big enough neural net and it will solve any problem. I think only small neural nets really work well; beyond a few dozen neurons, an external structure is needed to get anything to work.
(IMHO, the most important thing for *-recognition programs to start doing is admitting that they didn't understand and asking for clarification; "best guess" is not a good strategy)
The imperative nsa connection (Score:2)
And how many still believe that Echelon is not capable of recognizing words in conversations automatically?
-
This Kills Intel -- and AMD (Score:1)
Re:Lies, Damned Lies, and Neural Nets (Score:1)
Re:Call me crazy... (Score:2)
No, it's just preachy BS.
It's obvious you are just tripping. The creator of the first one that says "I'm Sorry Dave" will think he is Dave, or that he has been mistaken for someone who IS Dave.
As far as "glorifying themselfs" for mimicking what they already saw with the human brain, WTF are you talking about? Of course they're proud of themselves! They did something with a computer that nobody had ever been able to do before. That right there is fucking cool! I'm proud of them too, and I go to UCLA! (the researchers in question were from USC, in case anybody missed that)
Re:Is everybody a pessimist?? (Score:2)
Almost equally disturbing is the apparent ability of the Berger-Liaw system to distinguish individual voices from background noise, which raises the specter of governments being able to use almost unimaginably faint sounds to avoid more intrusive methods of bugging, and the monitoring of conversations in crowds. Combine that with existing off-the-shelf technology for face recognition...
Let's just say that I will be very surprised if the first customers for this technology aren't in Beijing and even more surprised if they aren't quickly followed by the dolts in Washington.
And hey, if I can reconstruct what you say inside your home from the weak sound waves that drift out into the street, that might not even require a warrant...
Re:I get the impression (Score:2)
NO, IT DOES NOT!!
It all comes down to statistics. Speech is a non-white signal. Noise is white. If you have two microphones/ears, you simply search for the linear combination of the two signals that is the most un-correlated temporally, and voilà! You have found the speech signal. This is known as blind separation.
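For the curious, here's a rough numpy sketch of that kind of blind separation (loosely the AMUSE approach: whiten the two microphone signals, then diagonalize a time-lagged covariance). Everything below is synthetic toy data I made up, not the USC system:

```python
# Toy sketch of the "blind separation" the parent describes (roughly the AMUSE
# algorithm): mix two sources into two "microphones", then recover them by
# whitening and diagonalizing a time-lagged covariance matrix. All signals here
# are synthetic; this illustrates the idea, not the USC system.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
t = np.arange(n)

# Source 1: a "speech-like" (temporally correlated) signal; source 2: white noise.
speech = np.sin(0.03 * t) * np.sin(0.0011 * t)
noise = rng.standard_normal(n)
S = np.vstack([speech, noise])

A = np.array([[0.8, 0.6], [0.4, 0.9]])   # unknown mixing matrix (two mics)
X = A @ S                                 # what the microphones record

# Whiten the mixtures.
X = X - X.mean(axis=1, keepdims=True)
cov = X @ X.T / n
d, E = np.linalg.eigh(cov)
W_white = E @ np.diag(d ** -0.5) @ E.T
Z = W_white @ X

# Diagonalize a time-lagged covariance: the rotation that does so separates
# sources with different temporal structure (speech vs. white noise).
lag = 1
C_lag = Z[:, :-lag] @ Z[:, lag:].T / (n - lag)
C_lag = (C_lag + C_lag.T) / 2             # symmetrize
_, V = np.linalg.eigh(C_lag)
recovered = V.T @ Z

# The recovered row with the most temporal structure should match the speech.
for row in recovered:
    print("lag-1 autocorrelation:", np.corrcoef(row[:-1], row[1:])[0, 1])
```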
All this and a counting horse (Score:5)
a) There is no statement of the train/test procedure for the neural net. It's fairly easy to get good performance if you're training your system on the same dataset that you test on. Without this information, you cannot make a reasonable judgement.
b) If you listen to the audio samples in the video at
http://www.usc.edu/ext-relations/news_service/r
you can notice a significant difference in the lengths of the samples (e.g. "stop" is shorter than "yes"). A fairly unsophisticated NN can pick up on the length of a sound sample and generalize from there. I didn't hear any statement saying that in the official training and testing all sound samples were of the same length.
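To make the worry concrete, here's a hypothetical toy (the word durations below are invented) showing how a "classifier" that only ever looks at sample length, never at the audio content, can still score well on a four-word test:

```python
# Hypothetical illustration of the parent's worry: if the four test words have
# systematically different durations, a classifier that never looks at the audio
# content at all can still score well. The durations below are made up.
import numpy as np

rng = np.random.default_rng(1)
words = ["yes", "no", "stop", "fire"]
mean_len = {"yes": 0.62, "no": 0.45, "stop": 0.38, "fire": 0.55}   # seconds, invented

def make_sample(word):
    return word, rng.normal(mean_len[word], 0.03)    # (label, duration only)

train = [make_sample(w) for w in words for _ in range(50)]
test = [make_sample(w) for w in words for _ in range(50)]

# "Classifier": nearest mean duration. No spectral information whatsoever.
centroids = {w: np.mean([d for lbl, d in train if lbl == w]) for w in words}
correct = sum(
    min(centroids, key=lambda w: abs(centroids[w] - d)) == lbl
    for lbl, d in test
)
print(f"accuracy from duration alone: {correct / len(test):.0%}")
```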
It's really a mess. If someone has a journal article or other piece of reliable information on this research, a pointer would be appreciated. Until then, I'll be feeding Clever Hans.
Is it really recognizing speech? (Score:1)
Couldn't the system simply be detecting the length of the signal and interpreting it that way? With so few sample words it seems hard to tell how the thing is really working -- something which is not really explained.
Maybe I'm just way off tho.
Re:Long way to go, but cool for AI (Score:1)
This is my feeling, too. This might be just the tip of the iceberg.
Can anyone see how something like this could be made using software?
Kythe
(Remove "x"'s from
This is cool but... (Score:2)
I'm more concerned that USC is trying to patent the "system and the architectural concepts on which it is based". As a computational biologist who uses neural nets in my work, I rely on the AI community to develop the underlying algorithms. If they get a patent on the algorithm and not just their hardware, that would severely limit the use of this breakthrough in other scientific areas.
JMC
Re:Long way to go, but cool for AI (Score:1)
Doesn't the English language use only a few dozen sounds ("phonemes" or something)?
Once you can recognize those sounds, I'm pretty sure it's easy to convert a list of them into a written sentence. I'd bet it could be done in under 200 lines of Perl. :)
But I'm no speech recognition expert.
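In that spirit, a toy version of the "easy part" might look like the sketch below (in Python rather than Perl, with a made-up pronunciation dictionary standing in for something like CMUdict); the replies below explain why the real problem is much nastier:

```python
# A toy of the "easy part" the parent imagines: turning an already-recognized
# phoneme sequence into words with a pronunciation dictionary. The dictionary
# entries here are invented stand-ins; real systems use something like CMUdict,
# and (as replies below note) ambiguity makes the real problem much harder.
PRON_DICT = {
    ("Y", "EH", "S"): "yes",
    ("N", "OW"): "no",
    ("S", "T", "AA", "P"): "stop",
    ("F", "AY", "ER"): "fire",
}

def phonemes_to_text(phonemes):
    words, i = [], 0
    while i < len(phonemes):
        # Greedily try the longest dictionary entry starting at position i.
        for length in range(len(phonemes) - i, 0, -1):
            chunk = tuple(phonemes[i:i + length])
            if chunk in PRON_DICT:
                words.append(PRON_DICT[chunk])
                i += length
                break
        else:
            words.append("?")   # unknown phoneme: punt
            i += 1
    return " ".join(words)

print(phonemes_to_text(["N", "OW", "F", "AY", "ER"]))   # -> "no fire"
```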
Re:HAL-9000 (Score:1)
So unless he's been built secretly in this reality, it's too late.
What's on top of houses? Dog: Ruff (C'mon...) (Score:1)
All it did was match her voice recording of her saying his name as key to a database against her saying his name. Sheesh, hype hype hype hype hype hype.
Patent? (Score:3)
Re:This Kills Intel -- and AMD (Score:2)
Missing the point (Score:1)
Ok, yeah, this thing can do voice recognition. But the advantage of that is not so that I can dictate this speech instead of typing it (although I would); the advantage is changing the user interface.
Right now, computers are very specific in their instruction taking. Click a little to the left of a button, and the computer has no idea of what you are doing. Type copy instead of cp into most Unixes and it won't even copy the dumb file.
If this neural net can distinguish my voice from others in a room, I can talk to it. "Computer, check all TV channels for the baseball scores and display on this screen please."
Not only does this make the computer easier to use, and more usable for more people, it makes it more useful. ***It frees us from sitting at a workstation.*** Notice in Star Trek they can say, "Computer, what is the population of Earth?" and it will respond.
Granted, Artificial Intelligence must be improved to allow the computer to understand this instruction, but the voice communication is ESSENTIAL.
I look forward to the days when I can chat with my computer.
Morons still my favorite (Score:2)
----------
sprechen mit dem Computer (Score:1)
Imagine calling your computer up with a phone.. any phone, a pay phone, etc.. and getting it booted up and ready for you.. how convenient!
Oh great, just what the world needs. (Score:3)
Now, instead of requiring at least 2 people to invade your privacy and listen to everything you say, one supercomputer and a bunch of listening devices let The Man (tm) listen to thousands of people at once and scan the transcripts for keywords and sentences.
I get the impression (Score:5)
Here's the original link
http://www.usc.edu/ext-relations/news_service/releases/sto
If I'm right about that, then this development (while still insanely cool - don't get me wrong) might not be so surprising. As I recall from college brain-and-mind psych courses, humans use a variety of factors when singling out a lone voice or conversation in a noisy environment. These include spatial orientation, visual cues, etc. My prof called this the "cocktail party effect". Rob them of these cues, and it isn't surprising that they are hobbled.
Also, computers have the mixed blessing of ignoring information patterns unless they are instructed to do otherwise. A person, listening to white noise, would subconsciously attempt to find meaning in every bleep and scratch. A computer, listening only for certain cues, can disregard the majority of the signal.
I would be interested in learning what rate of word recognition this system achieves. Current technology manages about 90%, which means one in every ten words is heard incorrectly. If they could improve that to 99.9% or even just 99%, we might actually get some speech-processors in Office desktop products.
-konstant
Bummer! (Score:1)
Only 11 neurons? (Score:1)
I wonder how long before we see silicon for a net like this... or Sony incorporates it in the next Aibo...
Re:Remember in _Snow Crash_... (Score:1)
Time to open your wallet, dude!
Check out this link:
http://www.hp.com/jornada/products/430se/overview.html
An entirely frivolous application? Maybe not... (Score:1)
My wife's constantly complaining that I don't listen, whereas I think the problem's increasingly that I don't hear well; I'm over 35 and have been putting off going to audiological screening for a while now. This article makes me wonder: will we eventually see hearing aids that specialize in recognizing and resynthesizing speech? (In case you care, what triggered my pondering was the mention that this works well even in noisy environments, and in any kind of background noise at all, I'm having real trouble understanding speech lately.)
Re:Morons still my favorite (Score:1)
Use your cell phone at a loud rock concert (Score:1)
I can just imagine such a call now...
"Yes! No! Fire! Stop! No! Fire! Yes!"
Of course, you'll need to have only 11 neurons to understand the conversation.
Hmmm.. Interesting picture (Score:1)
Looking closer at this pic and zooming in a bit.. I'm noticing something..
11 Neurons and 30 connections, hmm? Well, in the center (the big black circle) there's 11 little circles (or twelve if you'd call the third from the top on the left a circle.. looks like a mistake to me). Count all the lines going between these, and include the lines coming in from the left (the red ones) and the black one going to the big black circle and you have 30 lines.....
Anyone more knowledgeable than I care to figure this one out?
---
Re:The scariest thing... (Score:1)
Re:The scariest thing... (Score:1)
That was proven a long time ago when the first computer was invented (I'll include calculators). You try to multiply 1935219 * 32946214 and compare the time-response with a computer and you'll agree with me....
Re:The scariest thing... (Score:1)
Re:Patent? (Score:1)
My understanding of patent law is that a patent isn't just for a device, but for a use of that device. Please correct me if I'm wrong.
Given that this idea (temporal information in neural networks) has so many really cool possible applications, they'd have a very difficult job patenting all the uses.
I can see this being useful in just about any real-time control system, such as autopilots, assembly lines, controlling the temperature of your shower. Anything involving streaming data really.
Actually, the more I think about it, the more *really* crazy ideas for this I come up with. I've got a problem with automatic garbage collection in a system at the moment where this might help... Oh, dear - they probably can't patent it for that now. Whoops.
Incidentally, are there any regular
Re:Patent? (Score:1)
My children will not have their brains surgically removed at differentiation to avoid infringing USC's patent.
Re:All this and a counting horse (Score:3)
What I noticed (and it makes me wish they actually had a technical paper linked to the article to appease my methodological curiosity) is that the 'random background noise' was exactly the same for each word in a given round of testing.
If they were training by those samples, the entire story is bogus because the pure, unmasked original word could be extrapolated by taking one sample, inverting the wave, and adding a second sample.
To put it another way, the net wouldn't be learning how to interpret the word "no" or "fire" in a crowd. It would be learning how to recognize that particular sound bite of cocktail party babble and distinguish in what way the original cocktail party sound was modified.
This is completely useless, because you'll never have a need (or the opportunity) to have two (or four) different words masked over the exact same sound wave. The background noise will always be different from sample to sample in a real-world test.
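A quick synthetic demo of that point: when two samples share the exact same background track, inverting one and adding the other cancels the background completely, which is precisely the shortcut a net could exploit:

```python
# Sketch of the parent's point: if two test samples share the exact same
# background track, inverting one and adding the other cancels the background
# entirely, leaving word_a - word_b. A real "cocktail party" never works this
# way because the background differs from sample to sample. All signals synthetic.
import numpy as np

rng = np.random.default_rng(2)
n = 8000
background = rng.standard_normal(n)              # the reused "cocktail party" track
word_a = np.sin(0.02 * np.arange(n))             # stand-in for the word "no"
word_b = np.sin(0.05 * np.arange(n))             # stand-in for the word "fire"

sample_a = word_a + background
sample_b = word_b + background

residual = sample_a + (-sample_b)                # invert one sample and add
print(np.allclose(residual, word_a - word_b))    # True: background is gone
```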
Re:The scariest thing... (Score:1)
Re:The scariest thing... (Score:1)
What is it that makes everyone think that the government agencies don't already have this technology?
If you don't think these agencies are capable of keeping important technologies all to themselves you should dig up the history of undersea channel microphones.
And what is this business about machines suddenly overtaking humans on a "perceptual" basis? Technology has long held the advantage over humans in most fundamental categories.
The historical problems with noise filtering were entirely due to the fact that we didn't know how to solve the problem. Finally these cloistered scientists have got off their asses and figured out what they were doing wrong all along.
The only "breakthrough" here is that we learn that the "insurmountable" human advantage accrues to a neuronal system which can be replicated (and even improved) with a model consisting of eleven neurons.
Wow. What a huge edge the human being has over the machine.
But don't worry. Our "innate" advantage in searching complex problem spaces is probably safe for another ten years.
Eleven Neurons? (Score:1)
This AI machine certainly has more neurons than I have delegated to speech recognition as I am in the process of patenting a process in which you can switch 95% of your brain capacity over to web surfing, useless pop culture quotes, and slashdot posting.
I figure if everyone was less intelligent, it would be way easier to create artificial intelligence.
Neural Networks -- a farce or fact? (Score:3)
I've recently done quite a bit of research on neural networks, including coding and simulating them by hand... There are some (quite drastic) flaws with neural networks...
I started my research doing a classic 5 pixel by 5 pixel OCR (optical character recognition) task on the domain of digits, using a single-layer perceptron-type network (similar to what these guys were using, minus the delayed firing rate).
Not surprisingly, the training algorithm converged to an answer quite quickly, and I proceeded to run tests with noisy data to test the generalization of the network.
This isn't shocking in itself until you realize that once you go above fifty percent distortion rates you are actually INVERTING the digit!
I retrained the network with inverted digits as well as the normal digits and re-ran the tests on the same set of data (note: the net WILL NOT converge on normal & inverted 5x5 digits with only ten cells). The correctness rate was only twenty percent throughout the whole domain of noise levels.
I then retrained again using TWENTY cells (9 more than this article's) and it converged quite nicely and gave me a quadratic function with an R-squared value of .9995 or so.
People sometimes view neural networks as a fix-all solution. The article on /. earlier about "evolutionary computing" rests on the same premise as neural networks: try stuff randomly (or using calculus) until we get a decent solution.
I'm sorry kiddos, but that just doesn't cut it. A neural network can't ever outperform a Turing machine, so there isn't a chance in hell it will ever outperform us in non-specialized tasks.
Of course, I'd probably be more optimistic if these guys had released their algorithms, papers, source code, etc. so we could actually figure out HOW the HELL they can get an 11-cell network to recognize speech...
The moral of the story? Understanding speech is a hell of a lot harder than recognizing ten digits!
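For anyone who wants to poke at the same kind of experiment, here's a stripped-down toy in Python/numpy: a single-layer perceptron on two 5x5 patterns, tested under pixel-flip noise. The patterns and the noise procedure are my own stand-ins, not the poster's code; note how performance collapses as the flip rate approaches and passes 50%, where the digit effectively inverts:

```python
# A stripped-down version of the experiment described above: a single-layer
# perceptron on 5x5 binary "digits", then test with pixel-flip noise. The two
# patterns and the training loop are a toy setup, not the original code.
import numpy as np

rng = np.random.default_rng(3)

DIGITS = {
    0: np.array([[1,1,1,1,1],
                 [1,0,0,0,1],
                 [1,0,0,0,1],
                 [1,0,0,0,1],
                 [1,1,1,1,1]]),
    1: np.array([[0,0,1,0,0],
                 [0,1,1,0,0],
                 [0,0,1,0,0],
                 [0,0,1,0,0],
                 [0,1,1,1,0]]),
}

X = np.array([DIGITS[d].ravel() for d in DIGITS], dtype=float)
y = np.array([-1.0, 1.0])                     # -1 for "0", +1 for "1"

# Classic perceptron training rule.
w, b = np.zeros(25), 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:
            w += yi * xi
            b += yi

def classify(img):
    return 1 if w @ img.ravel() + b > 0 else 0

# Test generalization under increasing pixel-flip noise.
for flip_prob in (0.1, 0.3, 0.5, 0.7):
    correct = 0
    for _ in range(500):
        d = rng.integers(0, 2)
        noisy = DIGITS[d] ^ (rng.random((5, 5)) < flip_prob).astype(int)
        correct += classify(noisy.astype(float)) == d
    print(f"flip prob {flip_prob}: accuracy {correct / 500:.2f}")
```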
Re:Long way to go, but cool for AI (Score:2)
But even when you've got the phonemes, you've still got a fair amount of work cut out for you. A number of phonological processes take place. For instance, the "in" of "in plain sight" may be pronounced "im". These kinds of transformations (and more complicated ones) are happening all over the place, in every spoken language.
Linguists generally describe this kind of thing by writing context-sensitive rules to enumerate the transformations. Similar syntactic translations are context-sensitive.
Computer programming languages' syntax (er, not counting types and identifier agreement, which are special-cased) is not even typically a generic context-free language; it is almost always in the LL(1) or LR(1) subsets, meaning that you can determine what's going on just by looking ahead one token. Otherwise you end up with N^3 parsing time, and that's for context-free languages. Parsing of context-sensitive languages is far more problematic (think halting problem).
Unless you can parse the syntax, you can't really resolve ambiguities (to/two/too, there/they're, or even things which merge because of phonology (bitter/bidder/bit her)). Note that humans don't always do so great with these issues either, so a partial solution will still be quite amazing.
But the fact still stands that turning samples into phonemes is only the first step in a very complicated process towards even something as simple as taking dictation. In fact, I'd say that syntax->semantics may be a smaller step than phonemes->syntax.
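For a flavor of what those context-sensitive rules look like in code, here's a toy nasal-assimilation rule (a regex hack purely for illustration, not a real phonological engine):

```python
# A toy context-sensitive rewrite rule of the kind linguists write: nasal place
# assimilation, where /n/ surfaces as [m] before a labial consonant. This is
# only meant to show the *shape* of such rules.
import re

def assimilate_nasals(phrase: str) -> str:
    # n -> m / _ (word boundary) [p, b, m]
    return re.sub(r"n(?=\s+[pbm])", "m", phrase)

print(assimilate_nasals("in plain sight"))   # -> "im plain sight"
print(assimilate_nasals("in all cases"))     # unchanged
```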
They're solving a much easier problem (?) (Score:1)
I think the really important thing here is that the neural system almost certainly knew there were only four possibilities, and never had to respond 'none of the above'. So this is a comparatively simple two-bit classification problem, which is a far easier thing than what Dragon Dictate (or people) are trying to do, i.e. recognise an arbitrary string of phonemes, with its combinatorial explosion of possible words. So the performance of this system probably is actually not that impressive.
But there is a huge interest building in biological neural networks' sensitivity to the temporal sequence of input spikes (rather than just the average rates of inputs spiking, which is what software neural networks try to model).
There was a talk I went to in London in June by Terry Sejnowski, who's head of the computational neurobiology lab at the Salk Institute in California. Apparently, rather than neurons learning that signal A correlates with signal B (Hebbian learning), it's surprisingly easy to wire two neurons up so that they correlate signal A occurring just before signal B -- becoming more sensitised to this the more times they see it, so they effectively learn to predict signal B as soon as they see signal A.
This obviously appears to be very important for tracking objects at a low level, and as here for identifying temporal patterns (Sejnowski's suggestion was bats' echolocation); but it may be even more important at a higher level, for recognising causality (if this thing happens, then that good thing/bad thing may happen), and perhaps for learned behaviour (if I do this, under these circumstances, then that happens).
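A toy, STDP-flavored version of that timing-dependent rule might look like this (the constants and the exponential window are my own illustrative choices, not from the talk): the A-to-B weight grows when A fires shortly before B and shrinks when the order is reversed.

```python
# Toy spike-timing-dependent update in the spirit of the learning described
# above: strengthen the A->B connection when A spikes shortly *before* B,
# weaken it when A spikes shortly after. All constants are illustrative.
import math

def stdp_update(w, t_pre, t_post, a_plus=0.05, a_minus=0.055, tau=20.0):
    """Update weight w given one pre-spike time and one post-spike time (ms)."""
    dt = t_post - t_pre
    if dt > 0:      # pre before post: strengthen (the prediction was "right")
        w += a_plus * math.exp(-dt / tau)
    elif dt < 0:    # post before pre: weaken
        w -= a_minus * math.exp(dt / tau)
    return w

w = 0.5
for trial in range(50):
    w = stdp_update(w, t_pre=10.0, t_post=15.0)   # A reliably precedes B by 5 ms
print(f"weight after repeated A-before-B pairings: {w:.2f}")
```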
Re:Long way to go, but cool for AI (Score:2)
"Pulsed Neural Networks". [amazon.com] I know Amazon has a copy (that's where I got mine a few months back).
Re: noise levels (Score:1)
Cool! (Score:1)
Can we please drop the conspiracy theories? (Score:1)
Go e-mail 'michael' about it. I'm sure he'll be happy to write up another Your Rights Online editorial thing where all you folks can go discuss the latest evils between yourselves, but let's keep the conspiracy theories out of "normal" articles, OK?
Re:Use your cell phone at a loud rock concert (Score:1)
Of course, you'll need to have only 11 neurons to understand the conversation.
Why, this'll stand me in good stead, then! I've only got about eleven left after all the booze and drugs I did at all the rock concerts that caused the degenerative hearing loss.
If my employer sniffs this packet I'll probably be taking a piss test Monday morning...
Potential for AI... (Score:1)
Anyway, these results are _quite_ significant in that they really show an advantage to using this new type of NN, and also make it clear to people that if we integrate these sorts of sensors into ourselves, or an AI such as CYC (check it out...), the resulting system will be able to process sensory information much more effectively than humans...
Of course, we've always known that the vision of hawks is like a couple hundred times more acute than that of humans, but some people never made the connection -- If hawks have better vision, and they have NNs to process that data, and we can learn how to make good, well trained NNs, then our AIs can have better vision than us, based on a biological model...
And on a similar note, I think it's amazingly cool that they've been able to show that a neural net trained by humans for a special purpose can -way- outperform biologically evolved neural nets... :)
Re:Things are not so simple. (Score:1)
/si/ vs. /su/
Any sound made by a human can be called a phone. Many of these crop up in language. These sounds can be classified into groups. These categories of sounds are semantically the same - switching from one s to the other does not alter the meaning of a word. These are called phonemes.
Some phones within a phoneme can be chosen by the speaker, these are said to be in free variation. Others are determined by context (and sound funny otherwise) - these are called allophones. (Your
Spanish does not distinguish between b and v, similarly to the way Japanese lumps r and l (two separate phones) together into what in Japanese is the same phoneme.
I omitted to mention this added layer of complexity - the sonic properties of a given phoneme (which is really what you want to extract, in order to build morphemes) can vary a lot, to a degree dependent upon the language, dialect, and accent.
Nice to see some other language geeks here - keep me on my toes.
Re:Patent? (Score:1)
*I* Think It's Exciting and Promising (Score:1)
Whether speech recognition has advanced greatly with this particular claim is yet to be seen. Powerful speech recognition, however, has many great potential benefits.
Science marches on!
When it really gets rolling, encrypted voice communication will be more of a necessity than a paranoid indulgence. Conspiracy theory? Try this test: Would you use this tech to spy on people?
When you say, "Dude, the conversation at the next table triggered my autogrep of the word 'computer'," you could be talking hardware instead of wetware autogrepping.
Imagine donning headphones and hearing only a computer-enhanced (probably a little time-delayed) version of the surrounding sounds where selected voices are augmented. The same tech could probably be applied to identifying and reducing known noises. Chatting at a dance club wouldn't have to be a shouting match. (But, then there's less excuse to get close to their necks...)
"Yes! No! Stop! FIRE?" I wonder who's sponsoring this, or to whom these researchers are whoring themselves... "Yes! No! Retreat! Use the nerve gas!" War marches on!
It's been said already, but man, I have to echo this. Practical speech recognition + language analysis & translation + voice synthesis will rock. Just imagine being able to hit on a lovely Italian by telling her that you like her hairstyle and that she's a pretty lady: "A lot I appreciate your style of hats. You are one Mrs. much graceful one. Beep." A whole new era of international misunderstanding.
The idea of Ctrl-key-free chorded typing still excites me. I'll pop you in the speech-recognized mouth with my data gloves.
Re:Long way to go, but cool for AI (Score:1)
Micheal is going to get you!#$^ (Score:2)
Honestly, do we have anything to fear from the technology as it is now? No, of course not. However, you have to expect plenty of fear on the part of people from
This mass paranoia against governments isn't bred because someone reads Fahrenheit 451 and says "shock!" (although it probably does happen in SMALL quantities). It's because we see it in our government today. We see corruption, and special interests, and all sorts of scary, scary things, in government TODAY. The fact that this could be used to track all of the recordings a person ever made is scary.
Is it a long way off? Sure. Can you blame them for being overprotective of their rights? No, of course not.
Nothing personal but I don't see how you can mock or make fun of anyone for holding these fears.
-[ World domination - rains.net ]-
Re:One step closer to Star Trek every day... (Score:1)
I wonder if that would be terribly successful. Apparently, the first car phones marketed were speaker phones, which sounded like a good idea because both hands would be free for driving. The idea flopped because people looked kind of odd talking to themselves while driving.
I bet there would be a similar effect (at least for a long time). People walking down the sidewalk talking to themselves usually get some pretty strange looks.
Dana
Re:AAAAAAAAAARRRRRRRGGGGGHHHHHHH! (Score:1)
Re:White noise? (Score:1)
Actually it does not matter whether the background signal is completely white or not. As long as the speech signal is the most correlated one, you can find it. The cocktail party problem (to isolate one speech signal in a crowd of speakers) is of course more difficult. The technique can be extended to separate more sources if one adds more microphones/ears (see independent component analysis), one extra microphone per source you want to isolate, but that would be to cheat, wouldn't it?
... (Score:2)
Of course, I could be mistaken, and that drawing is really a graphical representation of the most sophisticated neural net ever made. *g*
--
About time they got it right. (Score:1)
I'm a little surprised at how few neurons and links it took, though - and how general purpose (as in different languages) it is. Different human languages contain somewhat different sets of phonemes - what may be two distinct phonemes in one language are considered the same in another. (E.g., Chinese has a sound between the "p" and "b" of English, considered different from either. Hence the difficulty anglicizing the name of the city Peking/Beijing.)
Not Going to Change the World (Score:3)
Voice recognition only becomes useful to me if natural language parsing and enough cognition power are available for me to command my computer in plain English to a fair degree of abstraction.
In mobile computing, it might be a lot more useful, especially for a device, say the size of the Palm Pilot, where various factors make voice far more convenient and less difficult than other forms of input.
There are a lot of human use factors that complicate voice recognition (making the computer recognize when you want it to parse your speech and when you don't want it listening). Human interface issues often make these things less wonderful than they appear.
Not that I'm saying this isn't a wonderful development and there aren't people out there who could really use this (in specialized environments or people who have mechanical difficulties), but I don't think voice recognition is going to change the world the way some people think it will.
How to extend this: (Score:2)
Some people are saying that you can't make a really big neural net efficiently (at least without specialized hardware), but I don't see why you couldn't have hundreds of separate neural nets each reporting on whether one word was said.
A very tiny, very simple computer could handle the task of managing a few neural nets. You could make it out of a few thousand surface features on a chip, so you could pack thousands of these processors on a chip. For that matter, they probably don't need to be terribly fast, so you could make them like memory chips. Imagine a megabyte chip, but instead of 1024K dumb memory, with 1024 minimal neural processors, each with 512 bytes of RAM.
Broadcasting the incoming data is pretty simple, and I don't think the networking issues of one or two of these processors reporting every few seconds would be too severe.
Training wouldn't be all that hard, either. You need a few man-years of samples, but the training could be done in parallel. It would cost a few million dollars (unless there was a dedicated online effort, which is entirely possible), but not billions. Imagine going down to the mall and asking people if they would read a few hundred words for $20; no problem, just repeat it all over the place so it deals well with accents.
There has never been a task better suited to massive parallel processing.
Oh yeah, I suppose I have to say: hey, we can do it with a Beowulf cluster, |)00|)Z!
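More seriously, here's a rough sketch of that fan-out arrangement: many tiny per-word detectors scored in parallel over the same input, with a manager that just reports whichever ones fire. The WordDetector below is a stub (it would be an 11-neuron net in the scheme above); only the wiring is the point, and it parallelizes trivially.

```python
# Sketch of the architecture proposed above: many tiny per-word detectors run
# in parallel over the same input, and a manager reports whichever detectors
# fire. The detector itself is a placeholder for a small trained net.
from concurrent.futures import ThreadPoolExecutor

class WordDetector:
    def __init__(self, word):
        self.word = word

    def score(self, audio_frame) -> float:
        # Placeholder: a real detector would evaluate its trained net here.
        return 1.0 if self.word in audio_frame else 0.0

def recognize(audio_frame, detectors, threshold=0.5):
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda d: (d.word, d.score(audio_frame)), detectors))
    return [word for word, s in scores if s >= threshold]

detectors = [WordDetector(w) for w in ["yes", "no", "stop", "fire"]]
print(recognize("... stop ... fire ...", detectors))   # -> ['stop', 'fire']
```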
Re:moderation (Score:1)
Now, this debate _is_ offtopic... But perhaps there's no better place to debate it. The problem with mass moderation is that too many people suddenly find themselves with power they're not experienced with.
/* Steinar */
No usefulness? Sha. Right. (Score:2)
Re:All this and a counting horse (Score:2)
Looks like I have a bad posting day today... (Score:1)
"If any choice was to be removed..." You can't moderate a choice
/* Steinar */
Re:/.'ed ? (Score:1)
Of course computer control via voice would generally happen in a controlled environment and would probably not have to involve a huge vocabulary as long as the computer could be trained on basic phonics and cross reference against a good dictionary.
-Rich
Re:I get the impression (Score:3)
I can see it now:
"Joe, I'm reading 14% hubbub coming over this line--can you try to reduce it to 5%?"
Or even make it an actual unit of measure:
"Man, the rating on that party must have been 23.6 Khb." (Kilohubbubs)
Of course, that's assuming it'd be a metric measure. If it gets adopted here in the U.S. of A first, the above example might be 8 11/16 hb.
We need more technological terms like this
--
The military and tech (Score:1)
/* Steinar */
Recognizes better than who? (Score:1)
Her: Honey, where are you at? It's so noisy. And who is that with you?
Me: Um, nowhere and nobody, it's just a business meeting...
Her: Oh? Does she work with you?
Me: Um, who?
Her: The 26 year old brunette wearing the green dress who just said your name two tables away. I'm not deaf, you know...
Deosyne
Re:I get the impression (Score:1)
ViaVoice & Xvoice will do it for you... (Score:1)
Is your NN project online? (Score:2)
I like following the progress of projects around the world --- I was in academia myself a decade ago, in a department where colleagues who were working with NNs would discuss their processing requirements and architectures with me. The work you describe sounds interesting.
White noise? (Score:1)
White noise is certainly random - but background noise in real-world situations is hardly going to be that random. Rather, it's going to be a chaotic blend of non-random signals - each of which may (or may not) be a valid speech signal in its own right.
--
Re:AAAAAAAAAARRRRRRRGGGGGHHHHHHH! (Score:1)
This kind of science is littered with cool ideas that worked for the simple problems, but just didn't work for anything bigger. We already HAVE good systems that have high accuracy (> 85%) for speaker independent recognition in the 60k vocabulary range. These guys should publish something when they get half that good on any axis.
Re:Long way to go, but cool for AI (Score:1)
And I bet it could be done in 2 lines of perl!
Re:moderation (Score:1)
I guess that's not really a solution though; people without a sense of humor would just moderate another way. So maybe what I'm saying is all moderators should be required to have a sense of humor. (just kidding, if I needed to say it)
Things are not so simple. (Score:2)
While you say there are only a few dozen phonemes in most languages, what you are missing is the fact that each phoneme is context sensitive. So if I say "See" and "Sue", the 's' sound in each morpheme is spectrally quite different. They are both the same phoneme, but acoustically they are not the same sound.
Really, if you think about it, humans do not learn to understand words by rote memorization of the acoustic properties of each word. That would be far, far too inefficient. Think about the fact that you could still understand someone's voice, even if they inhaled helium. That skews the spectral/acoustic properties of the person's voice into a very high frequency range compared to their normal voice. Also, if you tried to listen to non-native speakers who are missing phonemes or substituting phonemes, how could you possibly understand them? What you do is you figure out the missing or corrupted phonemes from the context of the morpheme. Some research supports the addition of other, extraneous acoustic information (such as the spectral shift of
There is an awful lot that speech research has not yet uncovered. One of the problems that I see in the field of computer speech recognition/perception/production is the lack of solid speech research, and the failure to fold the trickier research into these projects. Training neurons to recognize individual morphemes doesn't work. It's like brute-force calculation of chess; the system is too complex to tackle with such a simple model. It's just too damned inefficient.
Besides, homophones will always be a problem with speech research, until language makes an appearance. How many times do you want to have to correct "their", "there" and "they're" in a document?
Um, this has already happend. Was: Oh great... (Score:3)
Re:Micheal is going to get you!#$^ (Score:2)
Voice recognition technology does not suddenly mean government agencies can now effect wiretaps of all the people in the United States on a whim.
When the FBI rammed through a law requiring telecom providers to provision tapping of 10% of all communication (knowing very well that it's not possible for all of the judges in the U.S. to even rubber-stamp that many court orders), some people said 'don't worry, they don't have the manpower to listen to that many conversations'.
Here is the 'man' power to produce the transcripts. So tell me again, Why shouldn't I worry? Keep in mind, machine parsing of English text for meaning is available now as well.
Re: (Score:2)
Re:I get the impression (Score:2)
Even the best existing systems fail completely when as little as 10 percent of hubbub masks a speaker's voice. At slightly higher noise levels, the likelihood that a human listener can identify spoken test words is mere chance. By contrast, Berger and Liaw's system functions at 60 percent recognition with a hubbub level 560 times the strength of the target stimulus.
With just a minor adjustment, the system can identify different speakers of the same word with superhuman acuity.
I see where konstant is going with his bit about the computer listening for cues, hence the "minor adjustment" mentioned above. I cannot agree one way or the other without seeing, uh, hearing the actual tests.
But I can theorize without any proof whatsoever =)
But since neural nets are trained, wouldn't it make sense to train the net to listen at low noise levels, and then steadily increase the level of white noise as performance improves? Baby steps. The net has to know what it is listening for inside of the noise before it can actually pick it out.
Anyone know about neural net training, or have more info on this project? Maybe saw it in a lab?
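Just to sketch the "baby steps" schedule (what would nowadays be called curriculum learning): train at a low noise level and only raise the noise once the net clears an accuracy bar. The Net/training/eval pieces below are trivial stand-ins; only the schedule itself is the point.

```python
# Sketch of a curriculum schedule: increase noise only after the net performs
# well at the current level. The "network" here is a stub whose skill just
# grows with each training pass, purely to make the loop runnable.
class Net:
    """Stub network whose 'skill' improves with each training pass."""
    def __init__(self):
        self.skill = 0.0

    def train_pass(self, noise_level):
        self.skill += 0.05

    def accuracy(self, noise_level):
        # Pretend accuracy falls with noise and rises with skill.
        return max(0.0, min(1.0, self.skill - noise_level))

def curriculum_train(net, noise_levels=(0.0, 0.1, 0.3, 0.6),
                     target=0.9, max_passes=200):
    for noise in noise_levels:
        for _ in range(max_passes):
            net.train_pass(noise)
            if net.accuracy(noise) >= target:
                break                      # graduate to the next noise level
        print(f"noise {noise}: accuracy {net.accuracy(noise):.2f}")
    return net

curriculum_train(Net())
```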
Article misses biggest (and scariest) use ... (Score:3)
The article misses another interesting, albeit scary, use of this technology. If these could be made small enough and cheap enough, they could be placed in key locations across the country, forever listening in on passers-by.
Avoiding all the issues of privacy, consider the following scenario. The police want to arrest a suspect for some crime (drug trafficking, conspiracy, etc.) but have no proof and can't tap his phone lines since he encrypts all his phone conversations. Through some method, they train this speech-recognition device to the suspect's voice and either have someone with the device planted on them track the suspect, or have an array of said devices placed in public areas where the suspect is known to hang out (bus terminals, bars, etc.). Sooner or later, the suspect might slip up and the authorities have the evidence needed for an arrest.
Regarding privacy concerns, it seemed to me that this device could only track a handful of known voices ... probably requiring vast processing power to track every voice in a room. So it might be a while yet before everybody's conversations in bugged places get transcribed.
Damned cool technology, though.
Re:Only 11 neurons? (Score:2)
Ringworm (tinea) is a fungus that covers the skin, causing discomfort, itching, and leaving an unsightly rash. Microsoft has managed to reproduce this behavior in software without using neural net technology at all.
My puters playing hooky (Score:2)
I have to wonder: one of the major bases for the success of neural networks is that they are trained, rather than programmed in the traditional sense. This works fine while you're researching and developing a singular system. But how do you mass-produce these systems? You can't just apply the same code across millions of them. Will there be classrooms filled with little computers learning how to be computers? What happens if one becomes a bully? What if one can't do math? And will there be trauma counselors on hand should one Blue Screen?
Dear Sir/Madam
I am writing to inform you that your network failed to show up for English class today. We cannot stress enough how important regular attendance is in achieving a proper education.
Please attend to this matter as this is its fourth missed class.
Thank you,
011100110
Principal - School of Advanced Network Training
Re:My puters playing hooky (Score:2)
Of course you can. The information "learned" by a neural net is contained in a big list (really one or more matrices) of numbers. Training can be performed once to get a set of numbers or parameters that performs a task well, and then a product can be mass-produced with that specific configuration. Sometimes, the product may use a known-good configuration as a starting point and allow more learning. But neural network learning certainly can be reproduced.
[I realize the original comment was meant to be more funny than correct, but I think I should point out faulty premises.]
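A minimal illustration of the point (the tiny "trained" net here is a placeholder; numpy's save/load is the only real machinery used):

```python
# A trained net is just arrays of numbers, so one training run can be copied
# onto every manufactured unit. The weights below are random stand-ins for the
# result of an expensive training run.
import numpy as np

rng = np.random.default_rng(4)

trained_weights = {"W1": rng.standard_normal((11, 16)),
                   "b1": rng.standard_normal(11)}
np.savez("speech_net_v1.npz", **trained_weights)       # done once, at the factory

# Every shipped unit just loads the same parameters.
unit_params = np.load("speech_net_v1.npz")
print(np.array_equal(unit_params["W1"], trained_weights["W1"]))   # True
```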
Still got a ways to go. (Score:2)
The real components. (Score:4)
"The network was configured with just 11 artificial neurons, and in a sub-stage a live goat brain. The brain was activated through a patented process involving a castle and a lightning storm.
The researchers said that one day they hoped all humanity could benefit from the power of lightning.
Then they laughed kind of ominously."
Hotnutz.com [hotnutz.com]
Skins (Score:2)
"You have 3 tasks left incompleted on your to-do list, you Naughty little boy! This calls for a vigorous spanking!"
(whipcrack) GrrrrrrOWl!
Is everybody a pessimist?? (Score:2)
Remember, it's only a few words... (Score:4)
Re:Only 11 neurons? (Score:2)
Of course the task of the net could also be to separate the noise signal from the speech, aka blind separation, a problem that has been solved before (for instance by independent component analysis [cis.hut.fi]).
If this is merely ICA with a time coded neural net, it is IMHO still pretty cool, and much more impressive than all those commercial systems that rely on dumb correlation and processing power.
Anyway, instead of just having me guessing, could someone please point to their paper?
Long way to go, but cool for AI (Score:2)
On the other hand, this could be a great leap for neural networks in general. Realizing that the timing of synapse signals is a critical factor in neuron firing is going to shake up some things in AI. (At least, I was never familiar with neural networks that used timing cues. If I am wrong, please let me know.) Of course in a large neural network, you're going to have lots of propagation latencies as signals bounce around the net, and it makes sense that even more important than which neurons fire is when neurons fire. It actually seems to justify the complexity of neural nets because the timing data can represent a much larger data/search space than the simple fire/dormant state of each neuron.
This could be exciting.
Re:I get the impression (Score:2)
If you look at the chart provided in the video [usc.edu] you'll see the 'Dynamic Synapse' ALWAYS beat the human subject pool. In the zero-background-noise test, the net was 100% accurate, while the humans were right only 90% of the time. However, to be fair they should create the same number of 'Dynamic Synapse' listeners as humans in the pool and then compare the average results of the 'Dynamic Synapse' pool to the average results of the human pool.
worthless without peer review (Score:3)
While the press release doesn't say much about neural networks or whether the state of the art in speech recognition has improved, it tells us something about a disregard by USC for standards of scientific conduct: scientific publication by press release is improper.
Re:How to extend this: (Score:2)
You could do this, but you would be diverging considerably from the way the human brain actually works. And considering that the human brain is currently the best speech-processing device we know of (notwithstanding this experiment, which sounds awfully limited to me), that's probably a bad idea.
Think about it: humans have, what, about a 10,000 word vocabulary? (Yes, there are a lot of different ways of measuring vocabulary, but that's a reasonable figure.) I'm willing to accept that somebody could combine 10,000 eleven-node neural nets to approximate the same vocabulary. But the average human would have no trouble recognizing a word like "picklemobile", or "Vulcanophone", or "Rodmania", even though he has never heard these words before. (Hopefully.) Or any of the millions of possible proper names, although I'm not sure that that's a fair example. (And no, the examples that I gave cannot be dismissed as simple compounding or affixation, as far as I remember from linguistics. As a matter of fact, if anybody can explain to me what the hell is going on in "picklemobile", please let me know...)
I don't mean to knock neural nets. I think they're on the right track, but they need to be moving towards more complexity and structure, not less. Maybe have one net for phonology, one for syntax processing, one for vocabulary, etc., and then link them using conventional computation. In other words, more like the way we do it.
A small dose of insight. (Score:2)
If you have two hypotheses e.g. A and B, corresponding to 'two words' which were said, then it is easy to build systems which can recognize signals corresponding to A and those corresponding to B embedded in lots of noise. Basically you measure the likelihood ratio p(B)/p(A) using some sort of estimators that you've trained to light up with either A or B. If you gave me the data, I could do this with a number of different semi-conventional numerical techniques on a digital computer. I've seen similar things presented at conferences a few years ago---recognition of specific chaotic waveforms (specifically dolphin and whale song) embedded in lots of noise.
This is known as a "simple hypothesis test".
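Here's a toy of that simple hypothesis test: the observation is one of two known templates plus heavy Gaussian noise, and with equal priors the likelihood-ratio decision reduces to picking the closer template. Even with noise well above the signal amplitude it stays very accurate, because there are only two alternatives (the templates and noise level below are invented for illustration):

```python
# Toy "simple hypothesis test": decide between two known templates A and B in
# Gaussian noise by likelihood ratio, which here reduces to nearest-template.
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = np.arange(n)
template_A = np.sin(0.2 * t)
template_B = np.sin(0.35 * t)
sigma = 3.0                                 # noise well above signal amplitude

def classify(x):
    # With equal priors and known Gaussian noise, the likelihood ratio test
    # reduces to picking the template with the smaller squared error.
    err_A = np.sum((x - template_A) ** 2)
    err_B = np.sum((x - template_B) ** 2)
    return "A" if err_A < err_B else "B"

correct = 0
for _ in range(1000):
    truth = rng.choice(["A", "B"])
    clean = template_A if truth == "A" else template_B
    x = clean + sigma * rng.standard_normal(n)
    correct += classify(x) == truth
print(f"accuracy with only two hypotheses: {correct / 1000:.2f}")
```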
The more general circumstance, however, is that the alternative is not A vs B, but A vs a huge multitude of other possibilities. This task is much more difficult, and corresponds to the actual large-vocabulary speech recognition task. Now it becomes much harder to set a reliable threshold which will come on only when A is actually present, and not when A is absent. There is a tradeoff of false negative and false positive errors depending on your choice of threshold.
There is no possible way that this thing can recognize 50,000 words. There are only 30 connections, there is fundamentally not enough information processing power intrinsically in there.
What you would do is to have all sorts of these subunits lighting up their own 'word finder lights', and the results of *those* (i.e. the p(A) detectors) would then be inputs into higher-level semantic networks of perhaps a similar type. These networks or hidden Markov models or whatever are the ones that know which sorts of words follow other sorts of words, and thus let you get better recognition than the individual word finders themselves.
So, what is the accomplishment of this paper?
That they've apparently found an extremely efficient and well-performing low-level subunit using this time-domain information. From our own experimental observations (not on speech but on real live neurons from recently-living animals) this is very important. The fact that it is only 30 connections might mean that it is quite feasible to put 10 or 20 thousand of these subunits on a single chip, running in hardware. Given the factor-of-a-thousand speed increase of electronics over neurons, if you could time-division multiplex different recognizers (blue-sky dreaming here!) you could have many more of them running during the milliseconds to seconds of audio-frequency processing time that we speak at.
If you notice, Professor Berger said that no other speaker-independent system outperformed humans, even on small test sets. Presumably that means in the small Bayesian post-hoc sorts of likelihood test regimes that I described before. And in addition, it appears that this is not a simulation but that they built it on an actual physical computer chip, another very substantial advance.
My colleagues are going to ask the authors for the actual paper. The title and press release may be overblown, but this smells like real science and a significant advance here to me.
Take home message: even small groups of good neurons can do interesting and useful things. With the right architecture, a small group of neurons can outperform conventional "neuroid networks" of hundreds or thousands of nodes linked by linear transformations of sigmoidal basis functions. We may just be beginning to crack real-AI.
We see major body functions of lower animals being regulated by say ten neurons. Real neurons are much smarter than you think. :)
If small groups of neurons can do this, it makes you appreciate what a hundred billion might be able to do.
Re:Micheal is going to get you!#$^ (Score:2)
If you're worried that "cheaper" also means "easier to convince a judge of the need", then perhaps you need to oust your current judges.
This should in no way affect the requirements to obtain a court order/wiretap order from a judge.
Re:The military and tech (Score:2)
This allows the military to wait until some bright young entrepreneur comes up with a great solution; then they swoop down and tell the poor sap he can't talk about his patent for 10-15 years, and the next thing you know the military comes out with some really cool speech recognition device.
So while there are brilliant people outside of The Man's Territory, their ideas can be and are stolen, and no one can talk about it.
I can think of better ways for the world to work...
Re:Article misses biggest (and scariest) use ... (Score:2)
I for one want a small device to listen for the sounds of coworkers down the hall muttering, "Hmmm, maybe phil knows about this," and report it to me, so I can hide.