New Algorithm for Learning Languages
An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."
just thought.. (Score:3, Interesting)
would probably help with the problem of having to choose between downloading a small, incomplete dictionary, a dictionary with errors, or a massive dictionary file.
Didn't Google already do this? (Score:5, Interesting)
SCIgen (Score:5, Interesting)
Speaking as someone working on NLP (Score:5, Interesting)
It's not going to be right. The algorithm is described as statistically based, which, while similar to the way children learn languages, is not exactly it. Children learn by hearing correct native language from their parents, teachers, friends, etc. The statistics come in when children produce utterances that either do not conform to the speech they hear, or when people correct them. However, statistics does not come into it at all with what they hear.
With respect to the algorithm learning the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically, all modern views of syntax are unscientific, and we're not going to get anywhere until Chomsky dies. Think about the word "do" in English. No view of syntax explains where it comes from. Rather, languages are shoehorned into our constructs.
So, either they're using a flawed view of syntax, or they have a new view of syntax and for some reason aren't publishing it in any linguistics journal, as far as I know.
Grammar depends on the input (Score:3, Interesting)
Protein sequences? (Score:1, Interesting)
Hieroglyphics? (Score:3, Interesting)
Re:Isn't This the Universal Translator Idea (Score:2, Interesting)
But for this, I have one word: Dolphins.
Incredible (Score:1, Interesting)
Re:Speaking as someone working on NLP (Score:3, Interesting)
Native speakers by definition speak correctly, and that is all the child is hearing.
Programming Language (Score:2, Interesting)
No the didn't (Score:5, Interesting)
I will believe this new program when I see it.
Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.
Finally (Score:2, Interesting)
Re:Speaking as someone working on NLP (Score:5, Interesting)
Chomsky is to linguistics as Freud is to psychology. He had great ideas for the time (many still stand), and the science would be nowhere close to where it is without him. However, A) he's backed off a lot from supporting his own theories and B) he's published papers contradicting his original ideas, so there is some question about their veracity. Since so many linguistics undergrads hold him up as the pinnacle of syntax, none are really deviating drastically from him.
As for the unscientific part: to make his view fit English, there has to be "do-support", which basically means that when forming an interrogative, "do" just comes in to make things work, without any explanation. In other words, it is in our grammar, but our view of syntax does not account for it.
Universal Translator? (Score:2, Interesting)
Electronic babelfish anyone?
Re:Speaking as someone working on NLP (Score:5, Interesting)
And I agree that this algorithm doesn't seem like it would be entirely successful in learning grammar. But this is not because it's statistical. I don't understand how you can look at something as complicated as the human brain and say "statistics does not come in at all".
If this algorithm worked, then it could be statistical, symbolic, Chomskyan, or magic voodoo and I wouldn't care. There's no reason that computers have to do things the same way the brain does, and I doubt they'll have enough computational power to do so for a long time anyway.
No, the flaws in this algorithm are that it is greedy (so a grammar rule it discovers can never be falsified by new evidence), and it seems not to discover recursive rules, which are a critical part of grammar. Perhaps it's learning a better approximation to a grammar than we've seen before, but it's not really doing the amazing, adaptive, recursive thing we call language.
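For anyone wondering what a recursive rule actually buys you, here's a toy CFG in Python (my own illustration, nothing to do with the paper's representation). The rule NP -> NP REL refers to itself, which is what lets a finite rule set license unboundedly nested sentences; a greedy pattern-merger that never posits self-referential rules can't capture that.

```python
import random

# Toy context-free grammar (my own example, not the paper's representation).
# The rule NP -> NP REL is recursive: a finite grammar licensing unboundedly
# deep nesting, which a flat, non-recursive pattern set cannot express.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["the", "dog"], ["the", "cat"], ["NP", "REL"]],
    "REL": [["that", "chased", "NP"]],
    "VP":  [["slept"], ["barked"]],
}

def generate(symbol, depth=0, max_depth=4):
    """Expand a symbol; past max_depth, prefer non-recursive rules so we halt."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal word
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        rules = [r for r in rules if symbol not in r] or rules
    rule = random.choice(rules)
    return [word for part in rule for word in generate(part, depth + 1, max_depth)]

if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(generate("S")))
    # e.g. "the dog that chased the cat that chased the dog slept"
```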
Re:No the didn't (Score:3, Interesting)
"I'm leaving you."
"Who is she?"
However, in written text, where the author cannot assume the reader brings any shared context, nor rely on any feedback, "speakers" usually do a good job of including all necessary information in one way or another -- especially in texts meant to convince or promote a particular viewpoint. I'll bet these kinds of texts are more easily translatable than conversation.
Re:Grammar depends on the input (Score:3, Interesting)
I guess the missing thing is that a human can envision the meaning of the words as a concept or image, while the computer simply sees the words as, well, just words (or binary, to be specific).
Re:Didn't Google already do this? (Score:5, Interesting)
IIRC, Google's translator works from a corpus of documents from the UN. By cross-referencing the same set of documents in all kinds of different languages, it is able to do a pretty solid translation built on the work of goodness knows how many professional translators.
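The core trick behind that kind of system can be sketched in a few lines, though this is just the classic word-alignment idea (IBM Model 1 trained with EM) and emphatically not Google's actual pipeline; the toy parallel corpus below is made up.

```python
from collections import defaultdict

# Toy word alignment from sentence-aligned parallel text (IBM Model 1 + EM).
# Illustrative only: real systems add phrase extraction, language models,
# NULL alignment, and vastly more data.
parallel = [
    ("the house is small".split(), "das haus ist klein".split()),
    ("the house is old".split(),   "das haus ist alt".split()),
    ("the book is small".split(),  "das buch ist klein".split()),
]

src_vocab = {w for e, _ in parallel for w in e}
tgt_vocab = {w for _, f in parallel for w in f}

# t[f][e] = P(target word f | source word e), initialised uniformly
t = {f: {e: 1.0 / len(tgt_vocab) for e in src_vocab} for f in tgt_vocab}

for _ in range(20):                         # EM iterations
    count = defaultdict(float)              # expected co-occurrence counts
    total = defaultdict(float)
    for e_sent, f_sent in parallel:
        for f in f_sent:
            norm = sum(t[f][e] for e in e_sent)
            for e in e_sent:
                frac = t[f][e] / norm
                count[(f, e)] += frac
                total[e] += frac
    for f in tgt_vocab:
        for e in src_vocab:
            t[f][e] = count[(f, e)] / total[e]

for e in sorted(src_vocab):
    best = max(tgt_vocab, key=lambda f: t[f][e])
    print(f"{e} -> {best}  (P = {t[best][e]:.2f})")
```

With enough parallel text the alignment probabilities get sharp; with three toy sentences, words like "the" and "is" stay ambiguous because they always co-occur.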
What is a little more confusing to me is how machine translation can deal with finer points in language, like different words in a target language where the source language has only one. English for example has the word "to know" but many languages use different words depending on whether it is a thing or a person that is known. Or words that relate to the same physical object but carry very different cultural connotations -- the word for female dog is not derogatory in every language, for example, but some other animals can be extremely profane depending on who you talk to.
Or situations where two entirely different real-world concepts mean similar things in their respective language -- in English, for example, you're up shit creek, but in Slavic languages you're in the pussy.
I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate. English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.
Consider this:
I was having breakfast when my sister, whom I hadn't seen in five years, called and asked if I was going to the county fair this weekend. I told her I wasn't because I'm having the painters come on Saturday. They'll have finished by 5:00, I told her, so we can get together afterwards.
These three sentences use six different tenses: past continuous, past perfect, past simple, present continuous, future perfect, and present simple, and are further complicated by the fact that you have past tenses referring to the future, present tenses referring to the future, and the wonderful future perfect tense, which refers to something that will be in the past from an arbitrary future perspective but hasn't actually happened yet. Still following?
On the other hand, English is much less precise in things like prepositions and objects, and utterly inexplicable when it comes to things like articles, phrasal verbs, and required word order -- try explaining why:
I'll pick you up after work
I'll pick the kids up after work
I'll pick up the kids after work
are all OK, but
I'll pick up you after work
is not.
Machine translation will be a wonderful thing for a lot of reasons, but because of these kinds of differences in languages, it will be limited to certain types of writing. You may be able to get a computer to translate the words of Shakespeare, but a rose, by whatever name, is not equally sweet in every language.
Dolphins? (Score:2, Interesting)
Re:No the didn't (Score:2, Interesting)
Being more serious, how do you think humans learn the rudiments of language? It's pattern analysis, i.e. precisely the technique this algorithm tries to replicate. It is true that the algorithm won't then progress onto the next stage, which is using that rudimentary grasp of the language to be taught its finer points, but if you genuinely doubt the capacity of this method to produce an understanding of language you are contesting the experiences of every human on the planet.
Returning to your example, "I really wanted to see you tonight" is what you discerned that sentence meant from its context. You can hardly expect a machine translator to know that it was a woman you were out with at night who said it (which seems to be the basis for your insertion of "tonight", "really" and "you"); fortunately, this algorithm is intended to translate written, not spoken, language. Since writing would have to include that detail (in order to be independent of its context), the problem you identified is not even relevant.
Give it a real challenge (Score:4, Interesting)
Pug
Re:No the didn't (Score:4, Interesting)
The idea being that you take any input language, Japanese for instance, and get a working Japanese-to-Esperanto translator. Since Esperanto is so consistent and reliable in how it is designed, that should be easier to do than a straight Japanese-to-English translator.
To finish, you write an Esperanto-to-English translator. By leveraging the consistent language of Esperanto, researchers thought they could write a true universal translator of sorts.
Don't know what ever came of it, but it was an interesting idea.
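The plumbing for that pivot idea is trivial to sketch; the hard part is the two translators themselves. Everything below is a hypothetical stand-in, not a real library.

```python
# Pivot-translation architecture: N languages need 2N translators (into and
# out of Esperanto) instead of N*(N-1) direct pairs. The two inner functions
# are hypothetical placeholders, not real APIs.

def translate_to_esperanto(text: str, source_lang: str) -> str:
    raise NotImplementedError("one of these per source language")

def translate_from_esperanto(text: str, target_lang: str) -> str:
    raise NotImplementedError("one of these per target language")

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Route any language pair through the regular, consistent pivot."""
    pivot = translate_to_esperanto(text, source_lang)
    return translate_from_esperanto(pivot, target_lang)
```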
Can it decipher ancient languages? (Score:1, Interesting)
Spam filter? (Score:3, Interesting)
Re:No the didn't (Score:1, Interesting)
My tutor got his doctorate in machine translation, and that was, erm, the early-to-mid 80s? His "not for a long time" prediction (as seems to apply to AI in general) likely remains correct -- I'll believe these techniques (like AI in general) bring us more than extremely specialised uses when I see more than press releases and claims about software that isn't available for me to test.
In fact, fellow nerds, just give me a link to ONE impressive piece of AI software (that isn't a chess player) and I'll be bowled over. PS I'm posting this using Dragon NaturallySpeaking, which is one of the only examples of vaguely AI research reaching the home/office...
O(n^n^n...)????? (Score:4, Interesting)
From TFA: The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.
If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches, and then gosh only knows how many comparisons thereafter to find all contiguous stretches of matches.
But if you were to take some set of embedded strings, and rotate them against a second set of global strings [where, in a worst case scenario, the set of embedded strings would consist of the set of all substrings of the set of global strings], then you would need to perform a staggeringly large [for all intents and purposes, infinite] number of byte comparisons.
What did they do to shorten the total number of comparisons? [I've got some ideas of my own in that regard, but I'm curious as to their approach.]
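One obvious way to cut the comparisons down (my guess, not necessarily what the researchers did) is to index fixed-length n-grams in a hash table, so only positions that actually share an n-gram ever get compared, and those shared chunks can then be extended.

```python
from collections import defaultdict

def shared_ngrams(sentences, n=3):
    """Index word n-grams by content, so candidate overlaps are found by
    dictionary lookup instead of comparing every position against every other."""
    index = defaultdict(list)              # n-gram -> [(sentence_id, position)]
    for sid, sent in enumerate(sentences):
        words = sent.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].append((sid, i))
    # keep only n-grams seen in more than one sentence
    return {g: locs for g, locs in index.items()
            if len({sid for sid, _ in locs}) > 1}

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird sat on the fence",
]
for gram, locs in shared_ngrams(corpus).items():
    print(" ".join(gram), "->", locs)   # e.g. "sat on the" appears in all three
```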
PS: Many languages are read backwards, and I assume they re-oriented those languages before feeding them to the algorithm [it would be damned impressive if the algorithm could learn the forwards grammar by reading backwards].
Is that really a big problem? (Score:1, Interesting)
Say I would normally have typed "stroll" to mean "walk", and I noticed that when I press 787655 on my phone's keypad, the T9 dictionary misunderstands me; I would just start typing 9255 for "walk" instead. I think the same would happen here. If the person typing the messages somehow got instantaneous feedback from the system about a "commonly misunderstood" structure, he would quickly learn to avoid those structures while typing.
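For reference, the keypad arithmetic checks out on a standard T9 layout:

```python
# Standard phone keypad mapping, just to check the digit sequences above.
KEYPAD = {'a': '2', 'b': '2', 'c': '2', 'd': '3', 'e': '3', 'f': '3',
          'g': '4', 'h': '4', 'i': '4', 'j': '5', 'k': '5', 'l': '5',
          'm': '6', 'n': '6', 'o': '6', 'p': '7', 'q': '7', 'r': '7',
          's': '7', 't': '8', 'u': '8', 'v': '8', 'w': '9', 'x': '9',
          'y': '9', 'z': '9'}

def t9(word: str) -> str:
    return "".join(KEYPAD[c] for c in word.lower())

print(t9("stroll"))   # 787655
print(t9("walk"))     # 9255
```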
On a related note, idioms like "fly like an arrow" are, in my opinion, the most difficult things to learn in a language, and thus foreign speakers do not use or know them. And still, "badly spoken English" can be comprehensible among the people speaking it. One thing I have noticed myself is that it is the British who have the most trouble understanding a foreigner speaking English badly. Other foreigners would understand the same person just fine. Something to do with the way the brain is wired to expect certain words after others, I guess.
Of course, the problem is that we would get rid of all the things that make language "alive". But here I am, typing a message in a language other than my own, and still many people can to some extent understand what I mean...
Re:Noam Chomsky (Score:3, Interesting)
Even then, Gold showed a long, long time ago (1967) that the task of inducing an arbitrary CFG using only generated strings from the language is basically hopeless [Gold, E. Mark. 1967. Language Identification in the Limit. Information and Control, 10:447-474].
That said, this doesn't even seem to be that novel (to me). Andreas Stolcke wrote a very nice PhD dissertation in 1994 on learning arbitrary PCFGs from language strings [Stolcke, Andreas. 1994. Bayesian Learning of Probabilistic Language Models. PhD Dissertation. University of California at Berkeley.]
This is probably a better, more efficient method than Stolcke produced back in '94, but I would be *very* surprised if it revolutionized the way computers interact with language, or anything else of the sort. People working in computational linguistics have a nasty habit of making grand pronouncements, only to fall far short of what they claimed.
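For flavour: the easy half of the problem, estimating rule probabilities once you already have parses, is just relative-frequency counting; what Stolcke (and apparently this new work) tackles is the much harder step of inducing the rules themselves from raw strings. A minimal sketch of the easy half, with a made-up bag of rule instances:

```python
from collections import Counter, defaultdict

# Relative-frequency estimation of PCFG rule probabilities from a made-up set
# of observed rule instances (the supervised case, where parses are given).
rule_instances = [
    ("S",  ("NP", "VP")),
    ("S",  ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N")),
    ("NP", ("Pronoun",)),
    ("VP", ("V", "NP")),
    ("VP", ("V",)),
]

rule_counts = Counter(rule_instances)
lhs_counts = Counter(lhs for lhs, _ in rule_instances)

pcfg = defaultdict(dict)
for (lhs, rhs), c in rule_counts.items():
    pcfg[lhs][rhs] = c / lhs_counts[lhs]      # P(rule) = count(rule) / count(LHS)

for lhs, rules in pcfg.items():
    for rhs, p in sorted(rules.items(), key=lambda kv: -kv[1]):
        print(f"P({lhs} -> {' '.join(rhs)}) = {p:.2f}")
```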
For the record: IANAL, but I play one on TV, by which I mean I'm an applied mathematician with a couple of published papers in computational linguistics.
Re:No the didn't (Score:3, Interesting)
There's one flaw in your analysis: humans learn language/grammar faster when they're young, and it becomes a lot harder when they get older. There are many different speculations on why that happens, from children starting from a clean slate to children learning languages better as their brains develop. I mean, pattern analysis would definitely be an advantage for grown-ups, no? So why is children's pattern analysis better in this case, if what you're saying is true?
From what I've seen, to actually learn grammar and a foreign language, there are two requirements. The first is that you must have a passion for it. The second is that you must be constantly practicing. I've noticed that if you attend classes but never use the language in your real life, you'll never learn it. Find a group of people who are also learning and try communicating only in that language, and you'll see how much faster you pick it up. It also helps to have a friend who's fluent in the language to correct you (though it might not be that good for your pride). What I've noticed is that grammar nazis are the best for learning a new grammar. They pick on EVERY SINGLE MISTAKE YOU MAKE, so you'd think twice before making the same mistake again.
At college, I've actually seen flyers asking for help with English and offering in return help with the language the poster is fluent in, be it French, German, Chinese, Japanese, etc. Those people would meet maybe three times a week and spend an hour on each language each time, which I thought was a really neat idea. Here you're helping a foreigner with English, and there they are helping you with a foreign language you want to learn.
Re:just thought.. (Score:5, Interesting)
Re:Random test ... (Score:3, Interesting)
But as things stand, I'd spend more time knocking the bad translation into shape than if I translated the whole thing from scratch.
Translators are often asked to copy edit other translators' work (customers tend to call this "proof reading", presumably to devalue it and get it done on the cheap, but it involves much more than hunting typos). That's fair enough if you want a quality check. But some smart-arse people try sending machine translations for copy editing. And you can bet they get sent straight back!
DNA Analysis (Score:2, Interesting)
If they can use it for analysing protein sequences, maybe they can tackle "the grammar of Life" and kickstart the whole bioengineering sector into a new life...
OTOH, the fundamentalist Christians will probably denounce this as an evil thing...
Re:grammar isn't enough (Score:3, Interesting)
* The clown threw a ball.
(Probably a tennis ball or basketball.)
* The clown threw a ball,....for charity.
(Okay, sorry, a ball as in a party.)
* The clown threw a ball,....for charity...., and hit the target.
(Okay, sorry again, the tennis ball hit the dunking target and someone fell in the water. Got it. We're in a carnival.)
* The clown threw a ball,....for charity...., and hit the target....of 1 million dollars.
(Scratch that. It really is a charity party and we've collected 1 million in donations. There's no way the meaning can change again.)
* The clown threw a ball,....for charity...., and hit the target....of 1 million dollars....by striking out Babe Ruth.
(Oops again. The clown got 1 million dollars in pledges if he could strike out Babe Ruth, and he succeeded. We're talking about a baseball again. I give up.)
Re:Noam Chomsky (Score:3, Interesting)
There have been basically two prongs of argument in favor of the existence of a Universal Grammar in the debate. The first is that the task of learning an infinite grammar from a finite subset of sentences (and then only from positive evidence) appears to be too difficult to accomplish solely through statistical means. The second is an effort to show that language learning is biologically- rather than experience-based. This is the effort to show that there is a critical period in language development, which would suggest that there is a strong biological (i.e., genetic) component to language learning.
In my opinion, the first prong isn't very strong, since it relies on assumptions about statistical learning to make its claims. Its claims seem to me to stem more from a lack of imagination than from anything we can pin down as logically necessary. Shimon Edelman's work would count against this prong, showing that yes, it is possible to learn a language via statistical means. (It would still have to be shown that the knowledge the computer possesses is qualitatively similar to that learned by humans... it may learn languages in a completely different way.)
His findings wouldn't affect the second prong at all, though, which to my mind is the stronger of the two approaches. There have been lots of studies which suggest that there is a biological timecourse for language acquisition, suggesting that we do have an innate capacity for it.
So to sum up, while I find it a very exciting and important finding, I don't believe it by itself will disprove the theory of Universal Grammar.
Re:O(n^n^n...)????? (Score:2, Interesting)
Re:Speaking as someone working on NLP (Score:1, Interesting)
This much you can easily prove yourself. But being unscientific doesn't mean being wrong; you could reason that the Earth must be a sphere because the most beautiful shape is a sphere, and you would be right about the shape of the Earth, even though your reasoning is unscientific junk. If you want proof that Chomsky is wrong, rather than just that he uses a useless methodology, I'm afraid you won't find it without spending a lot of time on it.
It takes quite a bit of time to give background information on developmental psychology, psycholinguistics, neuroscience, and biology in general to someone outside the field (not that I'm an expert, but I have a degree in CogSci). It takes at least as much time to establish which flavor of Chomskyan linguistics is rubbish and why (Chomsky made so many contradicting models of language and mind that almost anything you say against him can be countered with a simple "ah, but you don't know his X theory"). So you very probably won't get any such response unless you claim a specific Chomskyan theory is untrue and sound like you know what you are talking about.