
Machine Learning Has Been Used To Automatically Translate Long-Lost Languages (technologyreview.com) 111

Jiaming Luo and Regina Barzilay from MIT and Yuan Cao from Google's AI lab in Mountain View, California, have developed a machine-learning system capable of deciphering lost languages, and they've demonstrated it on a script from the Mediterranean island of Crete. The script, Linear B, appeared after 1400 BCE, when the island was conquered by Mycenaeans from the Greek mainland. MIT Technology Review reports: Luo and co put the technique to the test with two lost languages, Linear B and Ugaritic. Linguists know that Linear B encodes an early version of ancient Greek and that Ugaritic, which was discovered in 1929, is an early form of Hebrew. Given that information and the constraints imposed by linguistic evolution, Luo and co's machine is able to translate both languages with remarkable accuracy. "We were able to correctly translate 67.3% of Linear B cognates into their Greek equivalents in the decipherment scenario," they say. "To the best of our knowledge, our experiment is the first attempt of deciphering Linear B automatically."
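The cognate matching being scored here can be illustrated with a toy sketch. The real system is a neural sequence model with linguistically motivated constraints; the version below, which uses plain character similarity over invented Latin transcriptions, is only a stand-in to make the "translate cognates into their Greek equivalents" task concrete.

```python
from difflib import SequenceMatcher

def best_cognate(token, candidates):
    """Pick the candidate whose spelling is closest to `token` by
    character-level similarity (a crude stand-in for the learned model)."""
    return max(candidates, key=lambda c: SequenceMatcher(None, token, c).ratio())

# Invented Latin transcriptions, for illustration only: "ti-ri-po" is the
# syllabic Linear B spelling of Greek "tripos" (tripod).
greek_words = ["tripos", "kouros", "wanax"]
print(best_cognate("ti-ri-po", greek_words))  # → 'tripos'
```

A real decipherment model also has to learn the character-to-character mapping between the two scripts, which is exactly what this toy skips.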

That's impressive work that takes machine translation to a new level. But it also raises the interesting question of other lost languages -- particularly those that have never been deciphered, such as Linear A. In this paper, Linear A is conspicuous by its absence. Luo and co do not even mention it, but it must loom large in their thinking, as it does for all linguists. Yet significant breakthroughs are still needed before this script becomes amenable to machine translation. For example, nobody knows what language Linear A encodes. Attempts to decipher it into ancient Greek have all failed. And without the progenitor language, the new technique does not work.

  • by An Ominous Cow Erred ( 28892 ) on Tuesday July 16, 2019 @06:11AM (#58932940)

    Machine learning benefits from having large data sets to train with. This is a problem with Linear A, as we have almost no corpus to work with. There are almost no long, coherent stretches of Linear A to analyze -- most of it is short labels of a few words written on small objects, presumably notes about who made an item, what it was made for, or who owned it.

    It would be pretty hard to decipher English if all you had to work with were labels like "Made in China", "Not a Toy", "For External Use Only", "Ronco Corporation", or "Ralph Stevens". You might well note patterns in what was written but still have no idea what was being said.

    • by Sique ( 173459 ) on Tuesday July 16, 2019 @06:59AM (#58933022) Homepage
      The last paragraphs in the article explain it:

      For example, nobody knows what language Linear A encodes. Attempts to decipher it into ancient Greek have all failed. And without the progenitor language, the new technique does not work.

      But the big advantage of machine-based approaches is that they can test one language after another quickly without becoming fatigued. So it’s quite possible that Luo and co might tackle Linear A with a brute-force approach—simply attempt to decipher it into every language for which machine translation already operates.
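The brute-force idea in that last paragraph can be sketched in outline: score a decipherment attempt against each candidate language's lexicon and keep the best fit. Everything here -- the scoring function, the language names, the word lists -- is invented for illustration; a real score would come from the trained model, not string similarity.

```python
from difflib import SequenceMatcher

def fit_score(tokens, lexicon):
    """Crude stand-in for a real decipherment score: the mean, over tokens,
    of each token's best character-similarity against the lexicon."""
    best = lambda t: max(SequenceMatcher(None, t, w).ratio() for w in lexicon)
    return sum(best(t) for t in tokens) / len(tokens)

def brute_force(tokens, lexicons):
    """Try every candidate language's lexicon; return the best (name, score)."""
    return max(((name, fit_score(tokens, lex)) for name, lex in lexicons.items()),
               key=lambda pair: pair[1])

# Toy corpora; the transcriptions and word lists are invented for illustration.
candidates = {"greek": ["tripos", "wanax"], "hittite": ["watar", "ekuzzi"]}
name, score = brute_force(["ti-ri-po", "wa-na-ka"], candidates)
print(name)  # → 'greek'
```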

      • by crunchygranola ( 1954152 ) on Tuesday July 16, 2019 @10:40AM (#58933822)

        That would be our best, and probably last, attempt (though the same thing could be repeated with tweaks indefinitely). One thing to remember is that this language has a fair likelihood of being an "isolate": a language with no relatives among either modern or documented ancient languages. The best-known extremely ancient language, Sumerian, is like that.

        For this period, and this place in the world, the candidate related languages are a very limited set. It is not as if there were a large number of known candidate languages it might be related to.

        And then there is the fundamental, unfixable problem of insufficient Linear A source material. Perhaps all it will be able to do is make hypothetical assignments of values with various probabilities.

        • Etruscan is often considered an isolate, too.
          • by Sique ( 173459 )
            But the known amount of text in Etruscan is huge compared to Linear A.
        • Your points are all good (I'm a linguist). But Sique seems to be suggesting using languages for which MT systems already exist--which is to say, modern languages, all of which are some 3000 years removed from Linear A. I doubt that would work; with scattered exceptions (Hebrew, maybe, and possibly written Greek or written Tamil), modern languages bear very little resemblance to their ancient progenitors. Try reading Beowulf, which is less than half that age. It starts out (I had to leave off the macrons

          • by Sique ( 173459 )
            That's not what the article suggests.

            The article is not talking about modern languages; it talks about every language for which machine translating already operates. Ugaritic and Linear B are mentioned as examples. Here, the program could decipher about two thirds of the texts without first being fed long translation lists, purely from its own analysis. Hittite, Sumerian or Akkadian are easily possible. Why not train it on the large bodies of text in the (still unknown) languages of Mohenjo-daro and Harappa?

            • "every language for which machine translating already operates" is basically modern languages, plus Latin

              "it tries to find abstract concepts within the text, not a 1:1 translation to a modern language." 1:1 translation is not how MT systems (or indeed, human translators) work; and the system they describe is not capable of finding "abstract concepts within the text" in monolingual text (which texts in undeciphered languages are, by definition).

              The best way of understanding what this system really does is n

      • So....... Akkadian and Sumerian and Egyptian? Should they add Japanese to the mix?
      • it's a mule, right ... it's like having a sweatshop full of little kids you don't have to pay without the human rights to it, lol, they just keep going at it, transistor by transistor ... chip by chip, without literally breaking a sweat ... i still fail to see current a.i. as more than a glorified database and search engine (which maybe explains why google seems to be good at it hahah) Because it's lacking about everything xept just that (a bit like people get shiney starry eyes on the word 'blockchain' or
    • by arglebargle_xiv ( 2212710 ) on Tuesday July 16, 2019 @07:47AM (#58933144)

      This is a problem with Linear A, as we have almost no corpus to work with. There are almost no long, coherent stretches of Linear A to analyze -- most of it is short labels of a few words written on small objects

      That's pretty easy to translate then, it'll be "me and my besties at the pyramid-raising", "lol funny", "most amazing chariot-race ever", "OMG Assyrians!", and so on.

    • by Anonymous Coward

      What I really wonder is what will happen if this is turned on the Voynich manuscript
      which should have plenty of translatable context. This is pretty cool if it works.

      CAP === 'nonsense'

      • What I really wonder is what will happen if this is turned on the Voynich manuscript which should have plenty of translatable context.

        IF - and this is a big thing to assume - the Voynich manuscript genuinely is a script in an unknown (or unrecognised) language. The null hypothesis, that it is a fake or a manufactured (granted, elaborate) joke whose context has been lost and which actually contains no coherent content, remains on the table.

    • However, we can apply some filters for ancient languages. For one, literacy was a special skill back then, so it was rarely used for menial things. We would expect it to be official documentation, so things like "Not a Toy" wouldn't be posted, or even lazy posts like this one.

      • For one Literacy was a special skill back then.
        That is the common mantra of historians who never really think about the topic.

        I assume that most "free men" could read, at least enough to check what the scribe wrote, and most likely the same is true for women. The idea that everyone in ages where scripts were common was illiterate is idiotic. Hint: street signs, price signs, contracts, maps for travel ...

        Considering that reading is perhaps ten times easier than writing, I really doubt such assumptions are

  • There are languages that are alive today that don't have actually decent auto-translation you could rely on even a little teensy bit.

    or even OCR.

    • "Cuz I iz keeping it real wi ma ho ain't it blud yeah?"

      Anyone?

    • It would help if Thai wrote spaces (or *something*) between words. Finding word boundaries in languages that don't mark them does, I believe, make it harder to build an MT system (or any other NLP system, for that matter).

      • by q4Fry ( 1322209 )

        It would help if Thai wrote spaces (or *something*) between words. Finding word boundaries in languages that don't mark them does, I believe, make it harder to build an MT system (or any other NLP system, for that matter).

        Particularly when the language doesn't have linear phonemes. Thai syllables are written left to right, but the vowels for a syllable might be on any side of it. Which is (almost) fine until you hit your first opportunity for a consonant cluster. Then you have a decision: does the vowel parse with the first consonant, followed by the second? Or does it parse with both together, followed by whatever the hell comes after it? Then, as McSwell alludes, you don't even know if you've started reading the next word yet.

        Add to th
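The segmentation problem described above has a classic baseline: dictionary-driven greedy longest match ("maximum matching"). A minimal sketch, using English run together as a stand-in for an unsegmented script, with an invented toy lexicon:

```python
def max_match(text, lexicon, max_len=10):
    """Greedy longest-match segmentation: repeatedly take the longest
    dictionary word that prefixes the remaining text, else one character."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character: emit it alone
            i += 1
    return words

lexicon = {"the", "cat", "sat", "on", "mat"}
print(max_match("thecatsatonthemat", lexicon))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

The known failure mode is also instructive: greedy matching takes the longest prefix even when that forces a bad parse later, which is one reason modern segmenters use statistical models instead.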

  • If it turns out Jesus was really just selling reverse mortgages, the world probably doesn't need to know that.
    • by Anonymous Coward on Tuesday July 16, 2019 @07:15AM (#58933062)

      Nah, but we do need to stop at ancient Egypt, because it turns out all the ancient religions ripped off the Sumerians. The world doesn't need to know Noah used to be called Utnapishtim, for example.

      • It is neat reading Herodotus and seeing things like "the people who live here worship Zeus but call him [local sky god]", the ancients were cooler with that sort of thing than the monotheists tend to be.
        • You do know that the Old Testament (the Hebrew Bible) dates from back then, right? Indeed the third chapter of the book of Daniel recounts an incident in the Babylonian empire where the wrong kind of worship was punished with a particularly horrific death. (The story of course says that the sentence couldn't be carried out, but the Babylonians did try.)

          • You're right, I should have specified "the ancient polytheists" rather than the ancients, plenty of people got knocked off for worshipping the wrong gods throughout history, just some religions lend themselves to that sort of thing more than others. As I recall Zoroastrianism was one of the best known early exclusivist religions.
      • The world doesn't need to know Noah used to be called Utnapishtim, for example.
        Well, there is a small chance they are two different people, and probably survivors of two different floods, even.

    • Jesus was the first doomsday prepper

      • Unless my very old Sunday-school indoctrination is failing me, Noah came well before Jesus.
    • I know a pastor who used to teach that. Of course the pastor got his cut, every time 'Jesus' sold one. Good business until he went to jail.
  • Anything linear should be easy for machine learning. It's the nonlinear languages I'm worried about.
  • by captbollocks ( 779475 ) on Tuesday July 16, 2019 @07:20AM (#58933076)

    Why can't it even do Spanish properly? And I mean proper Spain Spanish (es-ES), which has a huge amount of data to work with.

    • ÂPor qué, el traductor de Google es perfecto! ;-)

      • Porque el traductor de Google es perfecto :P

        Por qué: why
        Porque: because

        • Ok, thanks!

          It seems like Slashdot's character handling is even worse than Google translate! :)

          • You are welcome :)

            Slashdot's lack of Unicode handling is legendary at this point; it's amazing that a tech site like this is still stuck in ASCII land.

        • by mjwx ( 966435 )
          Google Translate is good enough to convey complex ideas in a language you don't know well, but you need to know to steer away from slang, idioms, colloquialisms and set phrases, as the functional translation is not going to be the same as the literal translation, given that slang and idioms tend to differ from culture to culture even within the same language. E.g. "Que rico" is literally translated as "what rich"; in context Google may translate it as "delicious", however in some Latin American countries it
    • Let's be honest, 68% is pretty near unintelligible. Imagine you randomly dropped 30% of the words out of your sentences. It's not easy to understand.
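That claim is easy to make concrete. A deterministic sketch that deletes every third word (a stand-in for randomly losing roughly 32% of them):

```python
def drop_every_nth(sentence, n=3):
    """Delete every n-th word: a deterministic stand-in for losing
    ~1/n of the words, as with a ~67% translation rate."""
    words = sentence.split()
    return " ".join(w for i, w in enumerate(words) if (i + 1) % n != 0)

text = "Imagine you randomly dropped a third of the words out of your sentences"
print(drop_every_nth(text))
# → Imagine you dropped a of the out of sentences
```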
  • by Anonymous Coward on Tuesday July 16, 2019 @07:22AM (#58933084)

    Let's see how good this algorithm is and try it on the Voinich Manuscript.

    • by tattooed_pariah ( 1800040 ) on Tuesday July 16, 2019 @07:47AM (#58933140)

      Let's see how good this algorithm is and try it on the Voinich Manuscript.

      That was my first thought too! but I'm also a huge Lovecraft fan, and I could totally see this going Black Mirror style where now a machine has knowledge of other worlds/dimensions and how to merge them with ours or something ;)

      • Re: (Score:3, Informative)

        also, it's Voynich... no "i"
      • Perhaps this will feature in a future installment of The Laundry Files
      • The Voynich manuscript was translated 20 years ago. No idea why it keeps popping up as a "still not solved mystery" on /. and other sites for the last 10 years.

        FYI: it was written by a Portuguese monk/priest in a South American/Mesoamerican Indian script/language.

        • No, it is not. If it were, the language would be identified, and one could decipher the manuscript based on the modern descendants of that language.

          The claim that you're probably referring to is that it was written in a Nahuatl language. There are several modern Nahuatl languages, plus we have written sources for Nahuatl shortly after the conquest--i.e. from the exact time this manuscript was written, according to this theory. In short, we have a good handle on what Nahuatl was like back then, and some i

    • by Anonymous Coward

      Let it go. The Voynich is obviously a dangerous hoax.

    • Start it off with the Codex Seraphinianus; it's probably easier, as it was written by someone who would naturally think in 20th-century Italian.

    • Quite possibly the Voynich Manuscript does not contain a real language.

  • by LordHighExecutioner ( 4245243 ) on Tuesday July 16, 2019 @07:25AM (#58933094)
    ...Youtube automatic subtitles written in Linear B ?!?
  • by Harold Halloway ( 1047486 ) on Tuesday July 16, 2019 @07:28AM (#58933104)

    It's a language so old that they still don't have their own words for 'computer,' 'mobile phone,' and 'my round.'

  • by Anonymous Coward

    Try it against everything.

  • I came to see if Linear A had been cracked. Maybe someday. At least there is hope, short of finding a Rosetta Stone equivalent.
  • Not Progenitor (Score:4, Insightful)

    by crunchygranola ( 1954152 ) on Tuesday July 16, 2019 @10:24AM (#58933720)

    The summary's statement that they need "the progenitor language" of Linear A is a bit odd, since the founder language of Linear A is of necessity an extinct one. The Greek we know, for example, is a successor, not a progenitor, of Linear B. I suppose they mean a reconstructed language, as has been done with Proto-Indo-European and Proto-Slavic but not for some other language groups. What they should have said, I think, is simply the language family to which it belongs.

  • I had to say it.

  • That's impressive work that takes machine translation to a new level.

    It is extremely unclear whether this is impressive at all. And it doesn't take MT to a "new level"; it sets an initial bar where none previously existed, because the prior state of the art was inapplicable.

    Seriously, I think Turing would have scoffed at this so hard, the guy at the next desk ends up huffing, "careful with your goddamn tea, Alan—and what is it this time, do tell!" And then the whole of Bletchley Park passes a very pleasant

  • Now use it on the Voynich manuscript and see if it spits anything readable back out.

  • by Major_Disorder ( 5019363 ) on Tuesday July 16, 2019 @07:00PM (#58936720)
    Let me know when it can translate bug reports into English. That would be useful.
