New Algorithm for Learning Languages 454
An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."
Re:Speaking as someone working on NLP (Score:2, Insightful)
However, statistics does not come in at all with what they hear.
Utterance in pattern A is heard more often than utterance in pattern B; utterances in patterns C and D are not heard at all. How is that not statistics?
Re:Finally some progress (Score:2, Insightful)
Patent Pending (Score:1, Insightful)
Re:Let's see what it thinks of this (Score:2, Insightful)
Markov models are perhaps the easiest language acquisition model to implement, but also one of the worst at coming up with valid speech or text.
Interestingly, they do much, much better as recommender systems.
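To make the point about Markov models concrete, here is a minimal sketch of a word-level bigram chain. The function names, the toy corpus, and the fixed seed are all assumptions made for illustration; nothing here comes from the researchers' method.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model, start, length, seed=0):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)  # fixed seed only so runs are repeatable
    out = [start]
    # Stop early if the current word was never seen with a successor.
    while len(out) < length and model.get(out[-1]):
        out.append(rng.choice(model[out[-1]]))
    return " ".join(out)

# Toy corpus, invented for this example.
corpus = "the cat sat on the mat and the dog sat on the cat"
model = build_bigram_model(corpus)
print(generate(model, "the", 8))
```

Every adjacent word pair in the output was seen in the corpus, yet nothing constrains the sentence as a whole, which is roughly why the results tend to read as locally plausible and globally meaningless.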
Re:just thought.. (Score:5, Insightful)
This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just ridiculous, for obvious reasons :).
How "intricate"? (Score:2, Insightful)
And the "rules" of a language are NOT what children "learn". First of all, children acquire a language, they do not "learn" it. That accounts for much of the child's ability to speak it, not whether or not they understand gerunds and the pluperfect.
Second, in a language such as English, whose words for the most part lack any necessity to the order in which they're placed to understand their meaning and, even worse, lack declension forms to distinguish subject from object of the preposition, what success can a language recognition program have "learning" such a language when prepositions themselves mainly can be omitted? To teach a computer Latin is easy.
Third, what's the hope of the computer ever understanding something like Shakespeare, Joyce, or Dante, whose uses of language rely extensively on erudition for word placement as opposed to typical usage? While a computer might be able to learn Latin because of its rigorous rules, I doubt it could faithfully render a text from Ovid.
Re:Noam Chomsky (Score:5, Insightful)
Instead of a language module with specialized abilities tuned to learn rule-based grammar, we have an unsupervised learning system that has surmised the grammar of the language merely from the patterns inherent in the data it is given. That a system can do this is evidence against the notion that an innate grammar module in the brain is necessary for language.
Re:just thought.. (Score:2, Insightful)
It's going to come from large bodies of text that exist in multiple languages. Things like the Bible, the constitution, etcetera. The whole point of this technology is that by drawing conclusions from those texts, the program infers the underlying rules of the language and can therefore translate other things. Google was doing something similar. An online dictionary is completely different. First, it has to be compiled by someone. Second, it only helps for translating words verbatim. This technology would teach itself to translate languages, even if none of the researchers working on the project could speak those languages themselves. That's the beauty of it.
Re:Noam Chomsky (Score:5, Insightful)
If there were no rules, I could write a post using random letters for random sounds in a random order, or just using a bunch of non-letters. That wouldn't convey anything. Saying "I'm writing on slashdot" is more effective than writing "(*&$@(&^$)(#*$&"
Re:Didn't Google already do this? (Score:3, Insightful)
I'll pick up you after work
is not.
It can be, depending on context or emphasis. "I'll pick up the kids after lunch. I'll pick up you after work."
Re:Finaly (Score:3, Insightful)
Ov brug termat akti mak lejna trovterna.
And tell you that "termat" and "lejna" are nouns, "akti mak" is a 'composite' verb, "brug" and "trovterna" are adjectives... it still doesn't say anything about the actual meaning.
Re:Noam Chomsky (Score:5, Insightful)
Sorry about the rant, but like I said, my prof did *not* like the Chomskyan view of linguistics.
Oh, and as far as the notion of the "language module" goes, it might be premature to call it a module, but there *is* neurophysiological evidence to suggest that humans are physically predisposed towards learning language from birth, so that much at the very least is tenable.
I don't think we disagree much (Score:2, Insightful)
In order for a program to translate accurately, it needs to know who is speaking/writing, who is the audience, what their relationship is, and their location. Some of this may be given to the computer explicitly, or easily found in the text/speech (for a human at least) but some of it may not. This is not going to be an easy problem to solve.
Writing is never free from its context. I know before I even start whether I am reading a fiction novel, a satire, a scientific journal, an email from my boss, or a text message from my date this Saturday. The meaning of the words can change a lot in those cases.
Even Google translator, which was trained on multi-lingual UN reports, could not produce comprehensible English from simple Japanese business emails.
As for my chinko, that's a long story.
It's actually a new language study (Score:4, Insightful)
For example, a classical Pragmatics scenario:
John is interested in a co worker Anna, but is shy and doesn't want to ask her out if she's taken. He asks his friend Dave if he knows if Anna is available to which Dave replies "Anna has two kids."
Now, taken literally, Dave did not answer John's question. What he literally said is that Anna has at least two children, and presumably exactly two children. That says nothing of her availability for dating. However, there's nobody who reads that scenario who doesn't get what Dave actually meant to communicate: That Anna is married, with children.
So that's a major problem computers hit when trying to really understand natural language. You can write a set of rules that completely describes all the syntax and grammar. However that doesn't do it, that doesn't get you to meaning, because meaning occurs at a higher level than that. Even when we are speaking literally and directly, there's still a whole lot of context that comes into play. Since we are quite often at least speaking partially indirectly, it gets to be a real mess.
Your example is a great one of just how bad it gets between languages. The literal meaning in Japanese was not the same as the intended meaning. So first you need to decode that, however even if you know that, a literal translation of the intended meaning may not come out right in another language. To really translate well you need to be able to decode the intended meaning of a literal phrase, translate that into an appropriate meaning in the other language, and then encode that in a phrase that conveys that intended meaning accurately, and in the appropriate way.
It's a bitch, and not something computers are even near capable of.
Re:Ah, you don't know Chomsky. (Score:2, Insightful)
Re:It's actually a new language study (Score:3, Insightful)
So how does that sort of thing work? Well, in mathematics you can have something like y=f(x) and substitute f(x) whenever you see y or vice versa. You also know various other rules that are of the same form, e.g. a(b+c) = ab+ac. Then, you can brute-force trying different combinations (or be smart about it and modularize some set of translations to create a new compound rule which is true, e.g. a lemma).
It may not be so easy in languages, but there are transformations you can apply to sentences. For instance, you can do some rearrangements like:
A is under B -> Under B, there is A.
And there are ways that these relations (spatial relations especially) distribute:
A is in B, B is under C -> A is under C.
So to understand 'Anna has two kids' you have to know: 1. That you want to evaluate the truth/falseness of 'is Anna available to go out' and 2. Various pieces of social information about 'going out', people who are married, people who have kids, etc.
If you have 2 you should be able to use a method in the same vein as a computer algebra system to determine how what was just said applies to your question.
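As a rough sketch of the rule-application idea, the spatial rule "A is in B, B is under C -> A is under C" can be applied repeatedly to a set of facts until nothing new turns up. The triple representation, the function name `infer_under`, and the key/box/table facts are assumptions invented for this example; encoding social knowledge about dating and kids would be far harder and is not attempted here.

```python
def infer_under(facts):
    """Apply 'A in B, B under C -> A under C' until no new facts appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over snapshots so the set can grow inside the loop.
        for (a, rel1, b) in list(known):
            for (b2, rel2, c) in list(known):
                if rel1 == "in" and rel2 == "under" and b == b2:
                    new_fact = (a, "under", c)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known

# Hypothetical facts, stored as (subject, relation, object) triples.
facts = {("key", "in", "box"), ("box", "under", "table")}
derived = infer_under(facts)
print(("key", "under", "table") in derived)
```

This is brute-force forward chaining with a single hard-coded rule. A system in the vein of a computer algebra system would presumably hold many such rules as data and search over which ones to apply.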
How about Dolphinese? (Score:3, Insightful)
How about Dolphinese? Research shows that they seem to be able to scout and transfer information from one individual to his/her pod. If there's some grammar it would be a pretty good nut to crack.
Re:O(n^n^n...)????? (Score:3, Insightful)
If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches,...
No you don't :-)
If you want to find all singleton matches, it's enough to sort the string into ascending order (order n.log(n)), and then scan through for adjacent matches (order n). For example, sorting "the cat sat on the mat" gives "cat mat on sat the the", where the two "the"s are now adjacent and so easily discovered.
For finding longer matches the sorting method still works, except that you sort fragments of the sentence rather than individual words. Clearly there is more work involved, but (depending on exactly what you're counting) there are still order n.log(n) comparisons to be performed.
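The sort-then-scan idea can be sketched in a few lines. The function name `repeated_fragments` and its `size` parameter are made up for this illustration, and fragments are taken to be runs of consecutive words.

```python
def repeated_fragments(text, size=1):
    """Return the word fragments of length `size` that occur more than once."""
    words = text.split()
    # Every run of `size` consecutive words is one fragment.
    grams = [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]
    grams.sort()  # order n.log(n) fragment comparisons
    repeats = []
    # After sorting, equal fragments sit next to each other: one linear scan.
    for prev, cur in zip(grams, grams[1:]):
        if prev == cur and (not repeats or repeats[-1] != cur):
            repeats.append(cur)
    return [" ".join(g) for g in repeats]

print(repeated_fragments("the cat sat on the mat"))          # ['the']
print(repeated_fragments("the cat sat on the cat mat", 2))   # ['the cat']
```

Each fragment comparison can itself cost up to `size` word comparisons, which is the "depending on exactly what you're counting" caveat.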
This means that searching for substring matches can be performed relatively efficiently. I don't know about how the language-learning algorithm works, but you may be interested to know that the compression algorithm used by "bzip2" works in exactly this way (google for "Burrows-Wheeler transform" for more details!)
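For the curious, a naive version of the Burrows-Wheeler transform looks like this. It is only a sketch: the `$` end marker is an assumption of this example, and sorting whole rotations as strings is far slower than what a real compressor would do.

```python
def bwt(text, end="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    assert end not in text, "the end marker must not occur in the input"
    s = text + end
    # Build every rotation of the string and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # annb$aa
```

Sorting the rotations pulls repeated substrings together, so the output tends to contain runs of identical characters, which later compression stages can exploit.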
Re:No the didn't (Score:3, Insightful)
Re:It's actually a new language study (Score:2, Insightful)
Funny, I read that answer as a "yes, she's available," but added additional information: don't ask her out unless you are willing to accept the entire package.
In a different language, I could still see a literal translation of the question and answer as communicating the same information. The "higher level meaning" is not embedded in the words or language. The exchange, "available?" "kids." does not mean "not available," but is more of a trinary response.