AI Communications Software Science Technology

AI Goes Bilingual -- Without a Dictionary (sciencemag.org) 99

sciencehabit shares a report from Science Magazine: Automatic language translation has come a long way, thanks to neural networks -- computer algorithms that take inspiration from the human brain. But training such networks requires an enormous amount of data: millions of sentence-by-sentence translations to demonstrate how a human would do it. Now, two new papers show that neural networks can learn to translate with no parallel texts -- a surprising advance that could make documents in many languages more accessible.

The two new papers, both of which have been submitted to next year's International Conference on Learning Representations but have not been peer reviewed, focus on another method: unsupervised machine learning. To start, each constructs bilingual dictionaries without the aid of a human teacher telling them when their guesses are right. That's possible because languages have strong similarities in the ways words cluster around one another. The words for table and chair, for example, are frequently used together in all languages. So if a computer maps out these co-occurrences like a giant road atlas with words for cities, the maps for different languages will resemble each other, just with different names. A computer can then figure out the best way to overlay one atlas on another. Voila! You have a bilingual dictionary.
The studies -- "Unsupervised Machine Translation Using Monolingual Corpora Only" and "Unsupervised Neural Machine Translation" -- were both submitted to the e-print archive arXiv.org.
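
The atlas-overlay idea in the summary can be sketched in a few lines. This is a toy illustration, not the method of either paper: the vectors are synthetic 3-D points, the word lists are made up, and the sketch cheats by fitting the rotation from four known word pairs (`seed_pairs`) with orthogonal Procrustes, whereas the papers aim to find the overlay with no such pairs. All function and variable names here are hypothetical.

```python
import numpy as np


def procrustes_rotation(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def induce_dictionary(src_words, src_vecs, tgt_words, tgt_vecs, seed_pairs):
    """Fit a rotation on the seed pairs, then match every source word
    to its nearest target word by cosine similarity."""
    src_idx = [src_words.index(s) for s, _ in seed_pairs]
    tgt_idx = [tgt_words.index(t) for _, t in seed_pairs]
    w = procrustes_rotation(src_vecs[src_idx], tgt_vecs[tgt_idx])

    mapped = src_vecs @ w
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt_unit = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    best = (mapped @ tgt_unit.T).argmax(axis=1)
    return {s: tgt_words[j] for s, j in zip(src_words, best)}


# Synthetic "English" map: six words at random positions (assumed data).
rng = np.random.default_rng(0)
en_words = ["table", "chair", "dog", "cat", "house", "water"]
en_vecs = rng.normal(size=(6, 3))

# Synthetic "French" map: the same layout, rotated and listed in another order.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
fr_words = ["eau", "chien", "table", "maison", "chat", "chaise"]
fr_vecs = en_vecs[[5, 2, 0, 4, 3, 1]] @ rotation

# Four seed pairs fit the overlay; "house" and "water" are left to be induced.
seed_pairs = [("table", "table"), ("chair", "chaise"),
              ("dog", "chien"), ("cat", "chat")]
dictionary = induce_dictionary(en_words, en_vecs, fr_words, fr_vecs, seed_pairs)
print(dictionary)
```

The two held-out words line up here only because the toy maps differ by an exact rotation. Real embedding spaces are at best approximately the same shape, so real results are noisier.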
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward

    Yet published on Slashdot because it centers around a buzzword.

  • No, it does not (Score:4, Insightful)

    by gweihir ( 88907 ) on Wednesday November 29, 2017 @12:34AM (#55641741)

    In order to go "bilingual", it would have to be able to understand one language first. However, understanding natural language is so far beyond the demented automation ("weak AI") available today, it is not even funny anymore. May as well claim a squirrel is a "gourmet chef", because it can bury nuts, i.e. "process food". Whether actual intelligence is going to be available on machines, ever, is at this time completely unknown, because nobody knows what it is. It is pretty clear though that the only natural computing hardware known (the human brain) is not powerful enough to create the intelligence observable at the interface of the smartest instances, at least if any known computing paradigm is assumed to be how it works. So either a completely new computing paradigm is needed (and no, "neural" nets will not cut it and they are really old), or the problem is even more complicated.

    The real problem here is that most people are not smart enough to recognize a moron if the moron is dressed up prettily and spews pseudo-profound bullshit. Just look at who people vote for.

    • Re:No, it does not (Score:5, Insightful)

      by ShanghaiBill ( 739463 ) on Wednesday November 29, 2017 @01:11AM (#55641825)

      In order to go "bilingual" ...

      The headline says "bilingual". Neither paper uses that term.

      it would have to be able to understand one language first.

      It is not clear if this is true. Translation accuracy has greatly improved, and is continuing to improve, despite the NNs having no understanding of how the languages map to reality. They only learn how the languages map to each other.

      "neural" nets will not cut it and they are really old

      What does age have to do with anything? Biological neural nets have been around for 600 million years.

    • by Anonymous Coward

      The real problem here is that most people are not smart enough to recognize a moron if the moron is dressed up prettily and spews pseudo-profound bullshit

      This definitely applies to comments on Slashdot, where "dressed up prettily"= scare quotes and overconfidence with a sprinkling of jargon.

    • Google Translate? (Score:4, Interesting)

      by Roger W Moore ( 538166 ) on Wednesday November 29, 2017 @02:16AM (#55642025) Journal

      In order to go "bilingual", it would have to be able to understand one language first.

      Google Translate can map between multiple languages without understanding any of them... which, admittedly, is why it does not do a great job, but it is usually good enough to be reasonably understandable.

      • Re:Google Translate? (Score:5, Interesting)

        by jouassou ( 1854178 ) on Wednesday November 29, 2017 @06:12AM (#55642519)
        It's good as long as all the languages are in the same language family, meaning that they share grammatical logic but have different vocabulary. But try translating English into a non-Indo-European language like Korean, with a fundamentally different way of expressing ideas, and it fails miserably. It's often not understandable at all.

        (For instance: English sentences require a subject in every sentence to be complete, meaning that you say "John is growing up" even though it's obvious who we're talking about. In Korean, you mention who you're talking about at the beginning, and then it's implicit from context until you start talking about someone else, so you drop the subject in the following sentences. Machine learning systems so far don't understand this distinction, so when translating from Korean to English they keep inventing people in the sentences, so that "is growing up" might become "Dave is growing up" or "Alice is growing up", even though no Dave or Alice has been mentioned in the previous sentences, while they were mentioned a few times in the training material.)
        • I completely agree with this. Languages in Asia (especially in Southeast Asian countries) have different roots from Western languages. Culturally, the way people use the language is different from Westerners, even in the written style, which is more formal and uses complete sentences. It is even worse in the spoken style, because people often don't follow the grammar exactly yet still understand one another.

          Another point is that these languages usually have their own politene

        • Have you tried it recently? Their old phrase based translations were terrible for Asian languages. Ask it to translate Japanese into English and you'd get garbage. Then they rolled out their new system based on neural networks, and it suddenly got a lot better. Not perfect, but now you can tell what it's saying. It's always easier translating between closely related languages, but the NNs are surprisingly good even for distant ones.

        • by DRJlaw ( 946416 )

          (For instance: English sentences require a subject in every sentence to be complete...)

          Like hell.

          'eff you. /s

      • Meanwhile, if you use Google to translate "He is warm" into Czech, you still get the blind idiot translation which actually means "He is gay".
      • by gweihir ( 88907 )

        It can map between words and sentences. It cannot map between languages. It has no grasp of semantics.

    • May as well claim a squirrel is a "gourmet chef", because it can bury nuts, i.e. "process food".

      Or similarly a rat, because it can control a human in a kitchen by pulling on its hair -- possibly with some assistance from the food processor in your example.

    • Whether actual intelligence is going to be available on machines, ever, is at this time completely unknown, because nobody knows what it is.

      We got human level intelligence from old monkey brains by just fucking around for 100,000 generations.

      • by gweihir ( 88907 )

        That is actually unknown. Physicalism is a belief, not science. Actual science finds that the questions of intelligence and consciousness are currently getting more mysterious, not less so, as more data and facts become known.

    • Understanding (Score:4, Informative)

      by DrYak ( 748999 ) on Wednesday November 29, 2017 @06:17AM (#55642525) Homepage

      "Understanding" has multiple levels.

      Even you, dear snowflake, don't have the level of understanding of a language that a renowned writer and poet could have of its intricacies.
      Or, you only have a vague grasp of some concepts in a field of work outside of yours, whereas somebody expert in the field has a much better understanding.
      Even the pets (cats, dogs) in your house can have some basic understanding of things around them, even if they don't think in such abstract concepts as you.

      This software, due to the way it's built (basically word2vec and a deep neural net), has some very basic form of understanding of the language.
      It's a very simple artificial brain that is entirely optimised for one specific subdomain (language) and thus completely lacks other forms of thinking (it cannot discuss a scientific article written in said language).

      But the way this system works is that it is able to implicitly and autonomously build relationships between things.
      The kind of knowledge built into some ontology databases, except that here, the knowledge isn't manually constructed by the scientist filling the database, the knowledge is discovered on the go, not unlike how very young babies would discover the world around them.
      Okay, it's a very stupid and limited baby in this case, but still.
      It's good enough to catch and understand links between concepts.
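
As a rough illustration of relationships being discovered rather than entered by hand, here is a minimal count-based sketch. It is not word2vec and not the system from the papers: it only counts which words share a sentence in a made-up four-sentence corpus and compares the resulting count vectors with cosine similarity. The corpus and all names are assumptions for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Tiny made-up corpus; a real system would read millions of sentences.
corpus = [
    "the table and the chair are in the kitchen",
    "she put a chair next to the table",
    "the cat chased the dog",
    "a dog barked at the cat",
]


def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words sharing its sentences."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(corpus)
table_chair = cosine(vectors["table"], vectors["chair"])
table_dog = cosine(vectors["table"], vectors["dog"])
print(table_chair > table_dog)
```

Nobody told the script that tables and chairs go together; the higher similarity falls out of the counts alone.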

      • by gweihir ( 88907 )

        And fail. (Well, what do you expect from a cretin that calls people "snowflake" without any good reason...)

        Even a smarter pet (a dog, for example) has some understanding and model of the real world and can map language to that model and can make (limited) predictions because it feels like it. An artificial neural net has nothing like that. It just has statistical classification and that is not enough for a world-model of even the most simple type, regardless of how "deep" you make it.

      • Yes, but it's still being done by a computer... so it will never be "Real A.I. (tm)"

        In 10 years from now when we are composing A.I.'s out of multiple A.I.'s like this one, it still won't be "real A.I. (tm)" even if it can completely replace 38% of human workers leaving them unemployable because they are not smart enough or lack the willpower to outperform "Not Real A.I.'s (tm)" even with additional- completely free- training.

        Right now, today with "not real A.I.'s" we are looking at 38% of jobs going away i

    • The real problem here is that most people are not smart enough to recognize a moron if the moron is dressed up prettily and spews pseudo-profound bullshit.

      Oh, I think I've just spotted one..

  • by PPH ( 736903 ) on Wednesday November 29, 2017 @01:16AM (#55641855)

    n/t

  • I've been learning Japanese for about 2 years, using SRS and reading. I can tell you these systems will be great for instructions on assembling a desk, or how to check your oil. Totally useless for storytelling. Anything containing references, jokes, wordplay, hell, even pronouns, where English just doesn't have as many, will always be a compromise.

    • by Actually, I do RTFA ( 1058596 ) on Wednesday November 29, 2017 @02:24AM (#55642041)

      That would be fine. The number of times I wanted a machine translated story in the past... I dunno, ever. 0. The number of times I wanted a technical paper, or instructions or tech specs, is significant. Or even news. Storytelling, jokes and wordplay are the least interesting things to translate, because there are people who actually already do that.

      • by Anonymous Coward
        Clearly you've never lived in a non-English speaking country.
      • by Anonymous Coward

        That would be fine. The number of times I wanted a machine translated story in the past... I dunno, ever. 0.

        I guess it just means you are an uninteresting person with no taste or desire to know other cultures.

      • by Kjella ( 173770 )

        Well, if you mean stories like novels, not news stories, I agree. For any language the nuances and particularities will be lost in translation; even in human translations they sometimes have to explain some untranslatable words or concepts in a footnote. But I think they could do a lot better translating articles and blogs about subjects that address a broad audience and speak rather plainly in the native language. Often it still ends up being very awkward Yoda-isms and strange or incorrect choices of words,

    • That's what I was also thinking. Sure, there are areas where these word maps look the same. I would only expect this, though, if the area developed similarly, e.g. technical areas in the recent past, where we had world-wide communication. Also it should work for the base of the language as far as the languages have common roots.
      I would not expect it to work for idioms or anything where languages developed different concepts to describe things. It won't understand an Eskimo that talks about snow (they have
  • A neat idea, but this is how you get things like The Jedi Council turning into The Presbyterian Church.
  • by Jezral ( 449476 ) <mail@tinodidriksen.com> on Wednesday November 29, 2017 @04:12AM (#55642291) Homepage

    These are very cool advances, but they don't solve the major problem of machine learning (ML): Having lots of data.

    While these approaches don't need bilingual corpora, they still need big monolingual corpora. Very few languages have those, and those that do usually also have bilingual corpora to one or more of the major world languages.

    This does lower the barrier to entry significantly for those doing ML machine translation. But, if one took the resources spent on gathering and curating corpora and instead invested in rule-based systems, you could get much further in less time.

    • by serviscope_minor ( 664417 ) on Wednesday November 29, 2017 @04:54AM (#55642375) Journal

      Depends what you mean by "lots of data".

      This weakly supervised stuff is especially nice for NLP, since there are almost no large, general bilingual corpora. A few exist, but they're often the result of some legalistic process, so they cover something of a subset of language.

      There are a lot more languages with a lot of written text than there are language pairs with large amounts of correlated text.

      Also, do you have any reason to think that rule-based systems would be better? A huge amount of work went into those in the past, and their capabilities seem tapped out. The other thing is what you mean by "much further". The point of this paper seems to me to be to push the bar on weakly supervised learning, rather than to get the best translation software ever.

      Very weakly supervised learning can do all sorts of cool things. See for example cyclegan the zebrifier (it turns pictures of horses into pictures of zebras).

    • While these approaches don't need bilingual corpora, they still need big monolingual corpora.

      Except that we have terabytes of unstructured and unlabeled monolingual text. You could train it on Wikipedia pages. In fact, there is an entire Library of Congress of data in ... the Library of Congress.

    • But, if one took the resources spent on gathering and curating corpora and instead invested in rule-based systems, you could get much further in less time.

      Really? Why do you think that? Rule-based is how all machine translation systems worked until just a few years ago. They worked, but not that well. And that's after decades of optimizing. Then the NMT systems came out and blew them out of the water.

      And building a monolingual corpus is pretty easy. Have a shelf of books written in that language? Great, scan them in. Maybe there's a newspaper with an archive of back issues. There you go, you're set. Way easier than a bilingual corpus, where someone

  • by Anonymous Coward

    Can it translate Linear A? Cretan hieroglyphic?

    • That is what I was wondering. I'm betting the answer is "no". When you have very limited source material, and the correct translation of the source material is probably long lists of items like "3rd year, Nowhereville, 5 bushels wheat" I doubt this approach would get you anywhere.

      In every case of which I am aware (hieroglyphs, Linear B, Mayan), decipherment of ancient scripts required that a close relative of the script language was known to the decipherers. (If anyone has counter examples, I'd love to know

      • by Anonymous Coward

        It wasn't until the discovery of the Rosetta Stone, that they were able to decipher Ancient Egyptian with confidence. They could guess what the symbols and glyphs meant but until there was some anchor point with all languages, they couldn't say for certain.

        In every case of which I am aware (hieroglyphs, Linear B, Mayan), decipherment of ancient scripts required that a close relative of the script language was known to the decipherers. (If anyone has counter examples, I'd love to know about them.) If the language of the script is completely extinct, we may never be able to decipher it.

        Sumerian: Language isolate. Deciphered through Akkadian (Semitic language, related to modern Arabic and Hebrew) because both languages used the same cuneiform script, which is (mostly) phonetic in nature.

        Etruscan: Believed to be part of the extinct Tyrsenian language family. Deciphered through Latin and Greek (both Indo-European languages) because the Etruscan alphabet is the intermediate step between the Greek and Latin alphabets.

        You don't need a related language, you only need some reference point for the phonol

  • I call 'fake news' (Score:5, Insightful)

    by mrthoughtful ( 466814 ) on Wednesday November 29, 2017 @05:01AM (#55642395) Journal

    The assumption, that the world is the same, and languages are attached to it, lies at the bottom of the idea of this learning strategy. The example given - of 'table and chairs' demonstrates this. Most of these ideas belong to a 19th century eurocentric understanding of the world we live in. Modern neuroscience and other work points to the fact that the world we perceive is very much dominated by the language we use, and not the other way around.

    Concrete Example: For a large portion of the 19th-20th Century many Greeks measured distance in cigarettes - how many cigarettes I will smoke while travelling from one place to another. There is no cognate in English for this. Not only that, but the language usage indicates a specific timespan as well as cultural differences.

    "Idiom!" I hear you say. Consider cultures where there are many more tables than there are chairs - such as in Asia where most people sit on the floor or on cushions.

    "But there are some universals - we can still use those!" - generally, there are no universals, or so few that they are not worth talking about. Talk to an anthropologist about it. Not even the concept of 'mother' is a universal.

    • by Anonymous Coward

      Said someone who's probably never tried to create new knowledge. What you say is that it's not perfect. Indeed, the accuracy is much lower than the best attempt that has good data to learn from. But sure it's a new result, and something that can be useful.

    • Everything you describe sounds like a feature to me, not a bug. Such a system would not only translate language, but culture.

      For common speech, this is an incredible advancement. Sure, you'll run into trouble when you specifically want a chair and the local custom is to sit on cushions... but when you're asking which 'chair' to sit on it'll work just fine and you'll figure it out when you're about to sit.

      For a large portion of the 19th-20th Century many Greeks measured distance in cigarettes - how many c

  • Who is Al, and why does it matter if he's bilingual?

    #serifisimportant

  • Anyone who understands that there was a lot more to Bletchley Park than rotor combinatorics can't honestly say they find this result surprising.

    Especially when the languages chosen have a shocking degree of family resemblance.

    No word for "I" or "me" or "mine" [upenn.edu]

    It isn't because the Vietnamese are not passionate. Rather, there is no word for "I" or "you" in colloquial Vietnamese.

    People address each other according to their relative ages: "anh" for older brother, "chi" for older sis

  • What's exciting (to me) is that this method is what's necessary for the universal translators in Star Trek / other Sci-Fi to actually work. In Star Trek: Enterprise, for example, their universal translator had to listen to a lot of alien speech as it would gradually make phrases more and more understandable. We still have a long way to go, but this methodology brings that dream closer.
