Slashdot is powered by your submissions, so send in your scoop


Forgot your password?
Software Science

New Algorithm for Learning Languages 454

An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."
This discussion has been archived. No new comments can be posted.

New Algorithm for Learning Languages

Comments Filter:
  • PDF of paper (Score:5, Informative)

    by mattjb0010 ( 724744 ) on Thursday September 01, 2005 @12:10AM (#13451536) Homepage
    Paper here [] for those who have PNAS access.
  • Noam Chomsky (Score:2, Informative)

    by MasterOfUniverse ( 812371 ) on Thursday September 01, 2005 @12:12AM (#13451547)
    This is a perfect apportunity to remind that its Chomsky's contribution to Linuguistics which enabled this amazing (if true) achievement. For those of you don't know Chomsky, he is the father of modern linguistics. Many would also know him as a political activist. Very amazing character. []
  • Dupe (Score:3, Informative)

    by fsterman ( 519061 ) on Thursday September 01, 2005 @12:27AM (#13451629) Homepage
    We just had an article on this. There was a shootout by NIST. At least I think, /. search engine blows, hard. Either way, here a link [] to the tests. This is one that wasn't covered by the tests, so I guess its front page news.
  • by ImaLamer ( 260199 ) <john,lamar&gmail,com> on Thursday September 01, 2005 @12:33AM (#13451661) Homepage Journal []

    Used this (easy to compile) C program: []

    to create these: []

    Mod points to whomever can tell us what texts they use. (No mod points can actually be given)
  • by dmaduram ( 790744 ) on Thursday September 01, 2005 @12:33AM (#13451664) Homepage
    Unsupervised learning of natural languages

    Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman
    School of Physics and Astronomy and School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; and Department of Psychology, Cornell University, Ithaca, NY 14853

    We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

    Many types of sequential symbolic data possess structure that is (i) hierarchical and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences defined over a finite lexicon of symbols such as words. Linguists traditionally analyze the sentences into recursively structured phrasal constituents (1); at the same time, a distributional analysis of partially aligned sentential contexts (2) reveals in the lexicon clusters that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to the natural languages; recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome. We introduce an unsupervised algorithm that discovers hierarchical structure in any sequence data, on the basis of the minimal assumption that the corpus at hand contains partially overlapping strings at multiple levels of organization. In the linguistic domain, our algorithm has been successfully tested both on artificial-grammar output and on natural-language corpora such as ATIS (3), CHILDES (4), and the Bible (5). In bioinformatics, the algorithm has been shown to extract from protein sequences syntactic structures that are highly correlated with the functional properties of these proteins.

    The ADIOS Algorithm for Grammar-Like Rule Induction

    In a machine learning paradigm for grammar induction, a teacher produces a sequence of strings generated by a grammar G0, and a learner uses the resulting corpus to construct a grammar G, aiming to approximate G0 in some sense (6). Recent evidence suggests that natural language acquisition involves both statistical computation (e.g., in speech segmentation) and rule-like algebraic processes (e.g., in structured generalization) (7-11). Modern computational approaches to grammar induction integrate statistical and rule-based methods (12, 13). Statistical information that can be learned along with the rules may be Markov (14) or variable-order Markov (15) structure for finite state (16) grammars, in which case the EM algorithm can be used to maximize the likelihood of the observed data. Likewise, stochastic annotation for context-free grammars (CFGs) can be learned by using methods such as the Inside-Outside algorithm (14, 17).

    We have developed a method that, like some of those just mentioned, combines statistics and rules: our algorithm, ADIOS (for automatic distillation of structure) uses statistical information present in raw sequential data to identify significant segments and to distill rule-like regularities that support structured generalization. Unlike
  • by Comatose51 ( 687974 ) on Thursday September 01, 2005 @12:39AM (#13451685) Homepage
    Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chompsky dies.

    I really don't understand that. How are modern views of syntax unscientific? Also, if Chomsky is such an influence on linguistics, then maybe he's right about it. Aren't you essentially saying that we have no way of arguing with him so let's wait til he dies so he can't argue back? I would think the correct view should win out regardless of the speaker.

    Other than what I've studied in cognitive science, I am not in any way or form a linguist. However, what you say really confuses me and contradicts what I've learned. I can only assume that what you say make sense because of your deeper knowledge. So can you please explain what you mean for the rest of us?


  • by Anonymous Coward on Thursday September 01, 2005 @12:45AM (#13451711)
    Here's the link [] at Google's blog.
  • Not really (Score:1, Informative)

    by Anonymous Coward on Thursday September 01, 2005 @12:50AM (#13451730)
    Markov chains aren't the same as context free grammars.
    (CFGs can generate ((multiply) nested) bracket structures (and are like finite automata with stacks).) Markov chains are just finite automata without stacks, that generate random walks through vocabulary space.
  • by lawpoop ( 604919 ) on Thursday September 01, 2005 @01:05AM (#13451787) Homepage Journal
    IIRC, the part of Chomsky's theory that is relevant to this application is that universal grammar is a series of choices about grammar -- i.e. adjectives either come before or after nouns, there are or are not postpositions, etc. I think the actual 'choices' are more obscure, but I'm trying to make this understandable ;)

    According to the theory, children come with this universal grammar built-in to their mind (for some reason, Chomsky seems against genetic arguments, but good luck understanding his reasoning), and they only need to hear just a little bit of language in order to throw the choose the proper alternatives in the mind, and start building grammatically correct sentences -- the rest is just building vocabulary. What seems like a child learning language is actually the language part of the brain growing during development. I believe that these choices are called 'switches' by Chomsky.

    (An easy argument for universal grammar is that children make mistakes that are more rule-following than the accepted grammar -- words such as 'breaked', 'speaked', 'foots' or 'mouses' are in a sense rule-based corrections of exceptions in the spoken language. So the children follow the rules more closely than the adults -- they certainly didn't learn them from adults, so the must be applying the rules in their minds.)

    Anywho, to make a program like this, you would just have to put together the switches of universal grammar and then feed in sample data -- probably text spelled with those linguistic homophonic characters, instead of the horrendous English spelling non-system. Chinese characters might be a better sample data in that respect; I don't know.

    Note that, contrary to other posters, this would not be a system for building grammatical rules for any 'language' or formal system, such as C++ or xml. It is based on universal grammar, which is a set of option for constructing a human language. So there are built in assumptions that there will be subject, verbs, objects, indirect objects, etc. in the language that this program is decoding.

  • Re:PDF of paper (Score:3, Informative)

    by downbad ( 793562 ) on Thursday September 01, 2005 @01:10AM (#13451799)
    The project also has a website [] where you can download crippled implementations of the algorithm for Linux and Cygwin.
  • Better link for PDF (Score:2, Informative)

    by Anonymous Coward on Thursday September 01, 2005 @01:14AM (#13451816)

    PNAS wants you to subscribe to download the PDF.

    Or you could just go to the authors' page and download it for free: []

  • by t35t0r ( 751958 ) on Thursday September 01, 2005 @01:22AM (#13451848)
    In analyzing proteins, for example, the algorithm was able to extract from amino acid sequences patterns that were highly correlated with the functional properties of the proteins.

    NCBI BlastP [] already does this for proteins. Similarities and rules for things can be found but if the meaning of the sequence is not known then what good is it? In the end you need to do experiments involving biology/biochemistry/structural biology to determine the function of a protein or nucleotide sequence. Furthermore in language as well as in biology/chemistry things which have similar vocabulary (chemical formula) may in the end be structurally very different (enantiomers), which leads to vastly different functionality.
  • Re:No the didn't (Score:5, Informative)

    by superpulpsicle ( 533373 ) on Thursday September 01, 2005 @01:25AM (#13451855)
    Try this free website out. []

    I know it is fairly accurate because I have fooled my spanish speaking friends once in an IM conversation. I told them I learned spanish via hypnosis and basically just copy/pasted everything spanish into IM. The conversation went on for like 15 minutes full spanish before I told them I was using the website. They were pissing their pants.

  • by proteonic ( 688830 ) on Thursday September 01, 2005 @01:44AM (#13451923)

    If you take young children and expose them to rubbish for four or five years while they're learning to speak, they'll speak rubbish too. That's the problem with young children, they can't sort the good from the bad.

    But if you expose them to well strucutred language, they'll learn to speak it, without being EXPLICITLY TAUGHT THE RULES. Which is exactly what this paper is about. Unsupervised natural language learning. That's what makes the system good. It's able to build equivalency classes of verbs, nouns, adjectives, etc, with relatively few examples. The paper gives an example of the algorithm using 8 sentences to trian and be able to produce over 500 new, sensible sentences. Even a 4th order Markov chain can't do that (Megahal). The algorithm is really quite impressive.

    Your comment begs the question.. why would you train a system on garbage? Finding good quality written language is a non issue. Train it on good data and it'll probably do as well as a Markov model for distinguishing good vs bad language.

  • Re:just thought.. (Score:5, Informative)

    by Mac Degger ( 576336 ) on Thursday September 01, 2005 @01:46AM (#13451931) Journal
    What they've develloped is something which interprets grammar; the ruleset behind the organisation of buildingblocks, apparently buildingblock agnostic.

    A dictionary is just words. This algorythm cant assign meaning to the buildingblocks, it can only dicide how and in what order the buildingblocks go together.
  • grammar isn't enough (Score:5, Informative)

    by JoeBuck ( 7947 ) on Thursday September 01, 2005 @02:32AM (#13452081) Homepage
    The classic problem example is:
    • Time flies like an arrow.
    • Fruit flies like a banana.
    There are other, similar examples. Computer systems tend to deduce either that there's a type of insect called "time flies", or that the latter sentence refers to the aerodynamic properties of fruit.
  • Do-support, in brief (Score:1, Informative)

    by Anonymous Coward on Thursday September 01, 2005 @02:43AM (#13452105)
    You're making several mistakes there:
    1. First, you have to distinguish between what I'll call lexical verbs and auxiliary verbs. A lexical verb is the verb in the sentence that actually tells you what sort of action (or experience, or state, or whatever) the sentence is describing. An auxiliary verb is a verb that doesn't do that, but expresses some combination of information about modality (possibility, necessity, obligation), tense (past, future), aspect (progressive, perfect), negation and agreement.

      In the sentence Do you like me?, like is the lexical verb, and do is the auxiliary.

    2. The sentence Be you man or mouse? is archaic, and thus does not count as part of the data when one is analyzing the grammar of contemporary English.
    Now, what's do-support? It's essentially that the grammar of contemporary English is so that the only verbs that can be negated with -n't, or inverted with the subject (e.g., to form a question), are auxiliaries. You don't negate the sentence You like John in contemporary English by saying *You like not John, nor form the yes-no question as *Like you not John? (The asterisks in front of the sentences are linguistese for "the following is not a grammatical sentence." As another note, in more archaic English, on the other hand, do-support did not exist, so that did use to be the normal way of forming the negation and the question.)

    So, essentially, to form the negated sentence or the question, you need some auxiliary. If the basic declarative sentence already has one, you can just use that: from You will like John, you can form You won't like John or Will you like John? If the basic declarative doesn't have an auxiliary, then you need to use the auxiliary do in order to "support" the negation or question. In these sentences, the auxiliary do is otherwise a dummy word.

  • by ericbg05 ( 808406 ) on Thursday September 01, 2005 @03:27AM (#13452213)
    I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate.

    Yes! I'd have thrown a mod point at you just for this paragraph if I could.

    English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.

    Not really. Firstly, English only has two or three tenses. (Depending upon which linguist you ask, English either has a past/non-past distinction or past/present/future distinctions. See [1], [2]. The general consensus seems to be in favor of the former, although I humbly disagree with the general consensus.) It maintains a variety of aspect [] distinctions (perfective vs imperfective, habitual vs continuous, nonprogressive vs progressive). See [3]. Its verbs also interact with modality [], albeit slightly less strongly.

    It's a very common mistake to count the combinations of tense, aspect, and modality in a language and arrive at some astronomical number of "tenses". It's an even more common mistake (for native English speakers, anyway) to think that English is special or different or strange compared to other languages. In most cases, it's not -- especially when compared with other Indo-European languages.

    Secondly, and more interestingly IMHO, most languages do not have three distinct tenses. The most common cases are either to have a future/non-future distinction or a past/non-past distinction. In any case, the future tense, if it exists, is normally derived from modal or aspectual markers and is diachronically weak (which is linguist-babble meaning "future tenses forms don't stick around for very long"). See [3].

    English is a perfect example: will, of course, used to refer to the agent's desire (his or her will) to do something. Only recently has it shifted to have a more temporal sense, and it still maintains some of its modal flavor. In fact, the least marked way of making the future (in the US, at least) is to use either gonna or a present progressive form: I'm having dinner with my boss tonight. I'm gonna ask him for a raise. See Comrie [1] again.

    So as not to be anglo-centric, I'll give another example. Spanish has three widespread means of forming the future tense. Two of these are periphrastic and are exemplified by he de cantar 'I've gotta sing' and voy a cantar 'I'm gonna sing'. The last is the synthetic form, cantaré 'I'll sing'.

    Most high school or college Spanish teachers would tell you that the "pure" future is cantaré. Actually, it's historically derived from the phrase cantar he 'I have to sing' (from Latin cantáre habeo), and is being displaced by the other two forms all across the Spanish-speaking world. I'm told, for example, that cantaré has been largely lost in in Argentina and southern Chile (see [4]).

    In any case, the parent's main point still holds. It's a b?tch to deal with cross-linguistic differences in major semantic systems computationally. But good lord, it's fun to try. :)


    1. Comrie, Bernard. Tense. [] Cambridge, UK: Cambridge University Press, 1985.
    2. Davidsen-Nielsen, Niels. "Has English a Future?" Acta Linguistica Hafniensia 21 (1987): 5-20.
    3. Frawley, William.
  • Re:just thought.. (Score:2, Informative)

    by psm321 ( 450181 ) on Thursday September 01, 2005 @04:42AM (#13452362) Journal
    From what I understand, google's thing was using purely statistics (i.e. a matches with b all the time in translations so when you see a, translate it to b), while this one actually "understands" the underlying grammer.

"If you lived today as if it were your last, you'd buy up a box of rockets and fire them all off, wouldn't you?" -- Garrison Keillor