Machine Learning Has Been Used To Automatically Translate Long-Lost Languages (technologyreview.com)
Jiaming Luo and Regina Barzilay from MIT and Yuan Cao from Google's AI lab in Mountain View, California, have developed a machine-learning system capable of deciphering lost languages, and they've demonstrated it on a script from the Mediterranean island of Crete. The script, Linear B, appeared after 1400 BCE, when the island was conquered by Mycenaeans from the Greek mainland. MIT Technology Review reports: Luo and co put the technique to the test with two lost languages, Linear B and Ugaritic. Linguists know that Linear B encodes an early version of ancient Greek and that Ugaritic, which was discovered in 1929, is an early form of Hebrew. Given that information and the constraints imposed by linguistic evolution, Luo and co's machine is able to translate both languages with remarkable accuracy. "We were able to correctly translate 67.3% of Linear B cognates into their Greek equivalents in the decipherment scenario," they say. "To the best of our knowledge, our experiment is the first attempt of deciphering Linear B automatically."
That's impressive work that takes machine translation to a new level. But it also raises the interesting question of other lost languages -- particularly those that have never been deciphered, such as Linear A. In this paper, Linear A is conspicuous by its absence. Luo and co do not even mention it, but it must loom large in their thinking, as it does for all linguists. Yet significant breakthroughs are still needed before this script becomes amenable to machine translation. For example, nobody knows what language Linear A encodes. Attempts to decipher it into ancient Greek have all failed. And without the progenitor language, the new technique does not work.
Linear A is a tough nut to crack because... (Score:5, Interesting)
Machine learning benefits from having large data sets to train with. This is a problem with Linear A, as we have almost no corpus to work with. There are almost no long coherent bits of Linear A to analyze -- most of it is short labels of a few words written on small objects, presumably notes about who made the object, what it was made or used for, or who owned it.
It would be pretty hard to decipher English if all you had to work with were labels like "Made in China", "Not a Toy", "For External Use Only", "Ronco Corporation", or "Ralph Stevens". You might well notice patterns in what was written but still have no idea what was being said.
Re:Linear A is a tough nut to crack because... (Score:5, Informative)
For example, nobody knows what language Linear A encodes. Attempts to decipher it into ancient Greek have all failed. And without the progenitor language, the new technique does not work.
But the big advantage of machine-based approaches is that they can test one language after another quickly without becoming fatigued. So it’s quite possible that Luo and co might tackle Linear A with a brute-force approach—simply attempt to decipher it into every language for which machine translation already operates.
Re:Linear A is a tough nut to crack because... (Score:5, Informative)
That would be our best, and probably last, attempt (though the same thing could be repeated with tweaks indefinitely). One thing to remember is that this language has a fair likelihood of being an "isolate", a language with no relatives among either modern or documented ancient languages. The best known extremely ancient language, Sumerian, is like that.
For this period, and this place in the world, the candidate related languages are a very limited set. It is not at all like there are a large number of known candidate languages it might be related to.
And then there is the fundamental, unfixable problem of insufficient Linear A source material. Perhaps all it will be able to do is make hypothetical assignments of values with various probabilities.
Re: (Score:2)
Re: (Score:2)
Etruscan relationships (Score:2)
There have been lots of suggestions for a relation with all sorts of languages of the Old World (none for a relationship with languages of the New World, afaik--but just wait). Evidence, less. You might look at the Wikipedia article about Etruscan.
Re: (Score:2)
Your points are all good (I'm a linguist). But Sique seems to be suggesting using languages for which MT systems already exist--which is to say, modern languages, all of which are some 3000 years removed from Linear A. I doubt that would work; with scattered exceptions (Hebrew, maybe, and possibly written Greek or written Tamil), modern languages bear very little resemblance to their ancient progenitors. Try reading Beowulf, which is less than half that age. It starts out (I had to leave off the macrons
Re: (Score:2)
The article isn't talking about modern languages; it talks about every language for which machine translation already operates. Ugaritic and Linear B are mentioned as examples. Here, the program could decipher about two thirds of the texts without being fed long translation lists first, just because of its own analysis. Hattite, Sumerian or Akkadian are easily possible. Why not train it with the large bodies of text in the (still unknown) languages of Mohenjo-daro and Harappa
Re: (Score:2)
"every language for which machine translating already operates" is basically modern languages, plus Latin
"it tries to find abstract concepts within the text, not a 1:1 translation to a modern language." 1:1 translation is not how MT systems (or indeed, human translators) work; and the system they describe is not capable of finding "abstract concepts within the text" in monolingual text (which texts in undeciphered languages are, by definition).
The best way of understanding what this system really does is n
Re: Linear A is a tough nut to crack because... (Score:2)
Re: (Score:1)
Re:Linear A is a tough nut to crack because... (Score:5, Funny)
This is a problem with Linear A, as we have almost no corpus to work with. There are almost no long coherent bits of Linear A to analyze -- most of it is short labels of a few words written on small objects
That's pretty easy to translate then, it'll be "me and my besties at the pyramid-raising", "lol funny", "most amazing chariot-race ever", "OMG Assyrians!", and so on.
Re: (Score:1)
What I really wonder is what will happen if this is turned on the Voynich manuscript, which should have plenty of translatable context. This is pretty cool if it works.
CAP === 'nonsense'
Re: (Score:2)
IF - and this is a big thing to assume - the Voynich manuscript is genuinely a script in an unknown (or unrecognised) language. The null hypothesis, that it is a fake or a manufactured (granted, elaborate) joke whose context has been lost and which actually contains no coherent content, remains on the table.
Re: (Score:2)
However, we can apply some filters to ancient languages. For one, literacy was a special skill back then, so it was rarely used for menial things. We would expect mostly official documentation. Things like "Not a Toy" wouldn't be written down, nor would lazy posts like this one.
Re: (Score:2)
For one, literacy was a special skill back then.
That is the common mantra of historians who never really think about the topic.
I assume that most "free men" could read, at least enough to check what the scribe wrote. And most likely the same is true for women. The idea that everyone in ages where scripts were common was illiterate is idiotic. Hint: street signs, price signs, contracts, maps for travel ...
Considering that reading is perhaps 10 times easier than writing, I really doubt such assumptions are
I just wish for an OCR and translator for Thai. (Score:2)
There are living languages that don't have auto translation decent enough to rely on even a little teensy bit.
Or even OCR.
Translation for rapper/ghetto speak would be nice (Score:1)
"Cuz I iz keeping it real wi ma ho ain't it blud yeah?"
Anyone?
Re: (Score:2)
It would help if Thai wrote spaces (or *something*) between words. Finding word boundaries in languages that don't mark them does, I believe, make it harder to build an MT system (or any other NLP system, for that matter).
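To make the word-boundary problem concrete, here is a minimal sketch of greedy longest-match segmentation, one simple textbook approach and not necessarily what any production Thai system uses. The function name `max_match` and the toy English lexicon are made up for illustration; the input is English with the spaces removed, standing in for a script that doesn't mark word boundaries.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches.
    If nothing matches, emit a single character and move on.
    """
    longest = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; the intended reading is "the men dine here".
lexicon = {"the", "them", "me", "men", "end", "dine", "in", "here"}
segments = max_match("themendinehere", lexicon)
print(segments)
```

With this lexicon the greedy pass grabs "them" first and never recovers the intended reading, which is the kind of ambiguity that written spaces would remove.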
Re: (Score:2)
It would help if Thai wrote spaces (or *something*) between words. Finding word boundaries in languages that don't mark them does, I believe, make it harder to build an MT system (or any other NLP system, for that matter).
Particularly when the language doesn't have linear phonemes. Thai syllables are written LTR, but the vowels for a syllable might be on any side of it. Which is (almost) fine until you hit your first opportunity for a consonant cluster. Then, you have a decision: does the vowel parse with the first consonant, followed by the second? Or does it parse with both together, followed by whatever the hell comes after it? Then, as McSwell alludes, you don't even know if you've started reading the next word yet.
Add to th
Just stop at Aramaic please (Score:2)
Re: Just stop at Aramaic please (Score:4, Informative)
Nah, but we do need to stop at ancient Egypt because it turns out all the ancient religions ripped off the Sumerians. The world doesn't need to know Noah used to be called Utnapishtim, for example.
Re: (Score:2)
Re: (Score:2)
You do know that the Old Testament (the Hebrew Bible) dates from back then, right? Indeed the third chapter of the book of Daniel recounts an incident in the Babylonian empire where the wrong kind of worship was punished with a particularly horrific death. (The story of course says that the sentence couldn't be carried out, but the Babylonians did try.)
Re: (Score:2)
Re: (Score:2)
The world doesn't need to know Noah used to be called Utnapishtim, for example.
Well, there is a small chance they are two different people, and probably survivors of two different floods even.
Re: (Score:2)
Unlike Noah, no one claims such a story about Utnapishtim.
He is simply referred to as "the flood survivor". The myth is that when his house lost its roof, it landed upside down in the water and he used it as a boat. Not so implausible if you know that e.g. the Vikings, when they settled for good, used their longboats as roofs for the newly built houses.
But perhaps you are a Christian/Jew/Muslim and cannot tolerate that your flood myth was simply a remake of an older one :D
Re: (Score:3)
Jesus was the first doomsday prepper
Re: (Score:2)
Re: (Score:2)
Well, there were lots more people back then.
You ask me how I know that? How many parents did you have? 2 And how many grandparents? 4 And how many great-grandparents? 8, I'm sure.
So clearly the number of people on the Earth doubles for each generation we go back. You couldn't deny that if you tried with both hands!
Re: (Score:2)
Re: Just stop at Aramaic please (Score:2)
Linear A? Pfft. (Score:2)
Re: (Score:2)
As in the heptapods' language?
--
Every silver lining has a cloud around it.
If this tech is so great (Score:4, Interesting)
Why can't it even do Spanish properly? And I mean proper es-ES Spanish, which has a huge amount of data to work with.
Re: (Score:1)
ÂPor qué, el traductor de Google es perfecto! ;-)
Re: (Score:2)
Porque el traductor de Google es perfecto :P
Por qué: why
Porque: because
Re: (Score:1)
Ok, thanks!
It seems like Slashdot's character handling is even worse than Google translate! :)
Re: (Score:2)
You are welcome :)
Slashdot's lack of Unicode handling is legendary at this point. It's amazing that a tech site like this is still stuck in ASCII land.
Re: (Score:2)
Re: If this tech is so great (Score:2)
Re: (Score:1)
English with a heavy Yorkshire accent is damn near impossible for a Yankee to understand. I know, I've tried.
Re: (Score:3)
na I reck nany Stralian cud cop ya lingo orright, no wuckers.
Re: (Score:2)
Re: (Score:2)
For the same reason Americans, Canadians, Australians and New Zealanders speak English...
Since when do Americans speak English?
Re: (Score:2)
"There are even places where English completely disappears. In America, they haven't used it for years." --Prof. Henry Higgins
Re: (Score:2)
For the same reason Americans, Canadians, Australians and New Zealanders speak English...
Since when do Americans speak English?
Since 1776... Unfortunately for the rest of the English-speaking world, they haven't installed any of the updates since then.
Re: (Score:2)
I don't know what New Zealanders speak, but it isn't English.
"Unglesh", maybe...
Re: (Score:2)
Some such languages are still spoken. Quechua (actually a family of languages) is the biggest, around 25 million speakers. Smaller ones include Shuar (southeastern Ecuador) and Tzeltal (Mayan, southern Mexico), both of which I used to know a little. Most of these languages are dwindling, but in my view primarily because the chances for bettering yourself and your family economically are far better in Spanish (or Portuguese) than in one of the indigenous languages. And of course few books are published i
Re: (Score:2)
Maybe the question should be, Why do so many people in Central and South America continue to speak the language of their alleged "colonizers" they complain so much about? Why do they continue to use the last names of these alleged "colonizers", as well? Why don't they return to the supposed "indigenous" languages and names supposedly used by their non-European ancestors?
The simple answer is because those languages didn't evolve.
Tagalog in the Philippines is a hybrid of the original Tagalog, Spanish and English because Tagalog didn't develop... or, more accurately, did develop by incorporating other languages. Pre-Columbian languages haven't been so adaptable. The fact that large numbers of the native populations were wiped out didn't help. There simply isn't another language to go back to, unlike the Philippines, which had Tagalog and a variety of other languages.
Besides thi
Re: (Score:2)
"The simple answer is because those languages didn't evolve." Not true, if by "evolve" you mean "change over time." Where there is more than one language in a language family, it's usually easy (unless the time depth is too great) to tell that they have diverged (that is, changed) from some common ancestor. Quechuan languages, or Mayan languages, to take two obvious examples. (Where a language is an isolate, like Waorani of Ecuador, it's harder to tell that it's changed over time, but there's no reason
Do the Voynich manuscript. (Score:3, Funny)
Let's see how good this algorithm is and try it on the Voynich Manuscript.
Re:Do the Voynich manuscript. (Score:5, Funny)
Let's see how good this algorithm is and try it on the Voynich Manuscript.
That was my first thought too! But I'm also a huge Lovecraft fan, and I could totally see this going Black Mirror style, where a machine now has knowledge of other worlds/dimensions and how to merge them with ours or something ;)
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:2)
The Voynich manuscript was translated 20 years ago. No idea why it has kept popping up as a "still not solved mystery" on /. and other sites for the last 10 years.
FYI: it was written by a Portuguese monk/priest in a South American/Mesoamerican Indian script/language.
Re: (Score:2)
No, it is not. If it were, the language would be identified, and one could decipher the manuscript based on the modern descendants of that language.
The claim that you're probably referring to is that it was written in a Nahuatl language. There are several modern Nahuatl languages, plus we have written sources for Nahuatl shortly after the conquest--i.e. from the exact time this manuscript was written, according to this theory. In short, we have a good handle on what Nahuatl was like back then, and some i
Re: Do the Voynich manuscript. (Score:1)
Let it go. The Voynich is obviously a dangerous hoax.
Re: (Score:2)
Dangerous? I don't think it's covered in spikes. Did it give someone a papercut? What makes it dangerous?
Re: (Score:1)
Start it off with the Codex Seraphinianus; it's probably easier, as it was written by someone who would naturally think in 20C Italian.
Re: (Score:2)
Quite possibly the Voynich Manuscript does not contain a real language.
So now we can have... (Score:3)
What about Welsh? (Score:5, Funny)
It's a language so old that they still don't have their own words for 'computer,' 'mobile phone,' and 'my round.'
Re: (Score:2)
I believe the ancient Greeks (who spoke the language written in Linear B) were quite fond of wine. It's entirely possible they had a concept of "my round."
So brute force it (Score:1)
Try it against everything.
Linear A - Not yet (Score:1)
Not Progenitor (Score:4, Insightful)
The summary's statement that they need "the progenitor language" of Linear A is a bit odd, since the parent language of whatever Linear A encodes is necessarily extinct. The Greek we know, for example, is a successor, not a progenitor, of the language written in Linear B. I suppose they mean a reconstructed language, as has been done with Proto-Indo-European and Proto-Slavic but not for some other known language groups. What they should have said, I think, is simply the language family to which it belongs.
Re: Not Progenitor (Score:2)
Multi-pass (Score:2)
I had to say it.
baa baa Bayesian (Score:2)
It's extremely unclear that this is impressive at all. And it doesn't take MT to a "new level"; it sets an initial bar where no bar previously existed, because the prior state of the art was inapplicable.
Seriously, I think Turing would have scoffed at this so hard, the guy at the next desk ends up huffing, "careful with your goddamn tea, Alan—and what is it this time, do tell!" And then the whole of Bletchley Park passes a very pleasant
good (Score:2)
Now use it on the Voynich manuscript and see if it spits anything readable back out.
That's nothing. (Score:3)
Re: (Score:2)
FWIW, the authors of the original paper (at least one of whom is a native speaker of modern Hebrew) did not make that mistake.