Science

XML for Ancients 118

Andrew writes: "More than 5,000 years ago, the very first information revolution occurred when some unknown research team in Mesopotamia found a way to download and store language through a killer application called "writing." The cuneiform digital library will have 60,000 texts ready in a couple of years, using SVG and XML to represent their documents. Similar efforts are underway for hieroglyphics."
This discussion has been archived. No new comments can be posted.

  • by MisterPo ( 520698 ) on Thursday November 08, 2001 @01:09AM (#2536470)
    I have been working in IT since 1997, yeah I know a mere blink of an eye for some Unix Wizards (ie. beards, strange clothing and their own arcane language). What I have noticed is that every year my handwriting has been getting progressively worse. What with my PDA, laptop, PCs etc. I just have no need to wield a pen no more :)

    Apart from signing my name on credit card chits, the only time I am required to write is for birthday/Christmas and other assorted cards. It's getting so bad now that I start to write a long word and just give up. My once pristine handwriting now looks like a doctor's prescription scrawl.

    Anyone else get this too?

    Po
    • After 8 years as a developer, my handwriting is fine (well, my printing... I never really was one for cursive.) After four years in the Navy, though, while all my other handwriting skills remained more or less consistent, my signature went from something readable to an almost completely illegible scrawl. At about the same time, the exact same thing happened to my wife's signature - she was working as a social worker in a nursing home, and signing something every five minutes.

      I don't think either one of us could actually produce a readable signature anymore even if we tried.

    • I've always assumed it was the other way around - people with terrible handwriting start using computers just so they can use a keyboard rather than a pen. I don't think I've ever met a proper computer geek who can write legibly.
    • There couldn't possibly be any other computer geeks out there with bad handwriting ;) My signature has become the first letter of my first name and then a somewhat recognizable first letter of my last name (which has 8 letters in it) followed by a line. This is probably why my CS professor has started more and more to require assignments to be typed. Why anyone would turn in a *written* assignment anyhow is beyond me :P

      Hmm...that was pointless..

      Cheers,
      jw
  • Site appears to be slash-dotted already...

    So.. Are these 5000 year old documents going to be freely available or will the database of texts be copyrighted/restricted?

    • Why do I always see people saying things like: "Slashdotted already! What a pity... It should have been cached."

      But when I click on the link anyway, the site loads with no problem. This is the rule, not the exception. The number of times I can't get to a link from Slashdot is surprisingly low.
      • by Anonymous Coward
        If you are behind a proxy server you might see the site even if it is already down. Even if you haven't set it up in your browser, there are ways an ISP can set up a proxy so that it is completely transparent to the users.
        • there are ways an ISP can set up a proxy so that it is completely transparent to the users.

          Actually, that is not true. I can testify to that, since I'm one of the victims of the so-called "Transparent Proxy". The only thing transparent about it is that you don't have to configure your browser to use it. You also have no option NOT to use it. So we have problems trying to check whether a site is up whenever the proxy server overloads, or even crashes.
          I, for one, am totally against these monsters.

      • Why do I always see people saying things like: "Slashdotted already! What a pity... It should have been cached."

        But when I click on the link anyway, the site loads with no problem. This is the rule, not the exception. The number of times I can't get to a link from Slashdot is surprisingly low.

        That's because those people are the ones who do the actual slashdotting. Usually by the time normal people like you and me click on the link, somebody at the other end has noticed that their site is down due to a DBS (Denial by Slashdot) attack and has set up a couple of mirrors that future requests can be redirected to. After all, it's not like somebody would lie about a thing like that.
    • So a 5000-year-old original text should be no problem.

      The case will happen if you ask for the translation (What, you are not Cuneiform literate? Talk about education 8)
      • Copyright is 70 years on books

        No, 95 years on all works first published on or after January 1, 1923. See also Sonny Bono Copyright Term Extension Act [everything2.com]. And it'll get even longer before 2020 as Di$ney frantically bribes Congre$$ to pass yet another corporate-welfare copyright extension.

        The case will happen if you ask for the translation

        ...even into XML.

    • >>So.. Are these 5000 year old documents going to be freely available or will the database of texts be copyrighted/restricted?

      If you can read cuneiform you have access. If you can't, you'd better start learning. This is not a project for non-professionals - like Linux people, epigraphers would tell you to RTFM before you complain about not understanding what is written.
  • With all these ancient language/hieroglyphic texts being archived, I have a feeling that we'll be hitting that 65536 character wall very shortly, since someone in the future might need that Cuneiform version of M$ Word (hey, it could happen). Is it time for UTF-32?
    • Actually... (Score:4, Interesting)

      by recursiv ( 324497 ) on Thursday November 08, 2001 @01:19AM (#2536508) Homepage Journal
      Unicode is often referred to as a 16-bit system, which would allow for only 65,536 characters, but by reserving some code points for mapping into additional 16-bit planes, it has the potential to cope with over one million unique characters.

      The current version (3.1) of the Unicode Standard, developed by the Unicode Consortium, assigns a unique identifier to each of 94,140 characters
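
The surrogate mechanism described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the usual UTF-16 ranges (high surrogates from 0xD800, low surrogates from 0xDC00); the function names `to_surrogates` and `from_surrogates` are made up for this example, and U+12000 is just a sample code point above U+FFFF.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    offset = cp - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 1,024 high surrogates times 1,024 low surrogates: the "over one million"
# extra code points mentioned in the comment above.
SUPPLEMENTARY_COUNT = 0x400 * 0x400

if __name__ == "__main__":
    high, low = to_surrogates(0x12000)
    print(hex(high), hex(low), SUPPLEMENTARY_COUNT)
```

The pair round-trips through `from_surrogates`, which is why 16-bit software can carry these characters as two units without losing information.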
    • We've always had UTF-32. Due to some hacks in UTF-16, Unicode can include up to a million characters, more than anyone anticipates needing. Cuneiform has already been (very) tentatively allocated to U+12800-U+12C80. Apparently, no one has come up with a complete proposal for including cuneiform, though.
    • See "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations" [hastingsresearch.com] for more information on this. It argues that even Unicode 3.1 will not contain enough characters for just East Asian languages, never mind dead, Middle Asian ones.

      The main reason seems to be that in East Asia, there are reduced character sets in daily use which contain only a couple of hundred or thousand glyphs, but to read and study classical texts, the number required quickly goes up into the tens of thousands, for each of a number of languages. Not having these glyphs in the Unicode set would be like asking English-speakers to use alphabets reduced by five or six characters (M and N are similar, X, Q, C and Z could be replaced by one character as well) and dictionaries from which three out of four words have been deleted due to redundancy or age.

      The reason for this mis-design, the article argues, is political: the nationalities in question have never been asked how many characters they would need together -- for each single language, Chinese, Korean, or Japanese, a scholar would say "Sure! 50,000 characters is enough for us!"

      • The reason for this mis-design...is political: the nationalities in question have never been asked how many characters they would need...

        This is certainly a true statement, but it gets at a basic engineering tradeoff: performance versus inclusiveness.
        Total inclusiveness isn't desirable for two reasons.
        a) When it comes to dead languages, you have scholars who make their living arguing over fine points pertaining thereto, thus making a 'standard' a moving target. Attempts at total inclusiveness are an exercise in windmill jousting.
        b) Even in a "broadband for all my friends" environment, the market (where the loot is) favors svelte technologies.
        Prediction: the market partitions itself with the low end covered by Unicode, and more exotic technologies to favor the scholarly crowd.
      • Not having these glyphs in the Unicode set would be like asking English-speakers to use alphabets reduced by five or six characters (M and N are similar, X, Q, C and Z could be replaced by one character as well)

        Spelling reform. China (outside Taiwan) has had it. It's perfectly possible to write English with only 18 letters [everything2.com].

        and dictionaries from which three out of four words have been deleted due to redundancy or age

        So? Desk dictionaries aren't nearly as comprehensive as Oxford English Dictionary or even the unabridged Webster's Third New International Dictionary.

      • See "Why Unicode Will Work on the Internet" [slashdot.org]. Basically, Unicode has more characters than just about any other character set - it includes 70,000 Han ideographs. All unified by a Japanese unification principle agreed to by all the pertinent Asian countries. All the Asian classics have been published in Unicode with their characters. This all, with over 800,000 code points to add new characters, if needs be.
      • That article is complete crap. I can't believe anyone takes it seriously.

        The author of that article doesn't seem to understand the fact that Unicode is a character set, not a font. He also doesn't seem to understand how Unicode's surrogate pairs work (which allow for encoding of more than 1 million characters). He doesn't seem to understand that Unicode is an evolving standard (i.e., 3.1 is hardly the final version). And he doesn't seem to understand that UTF-8, UTF-16, UTF-32, etc. are all just different formats, and they actually represent the exact same character set.

        But most importantly, he is flat-out wrong about how and why the decisions were made regarding encoding of East Asian languages. He needs to learn about the history of Han unification for CJK characters. If he did, he would know that linguists and computer scientists from East Asian countries have been involved in Unicode since the beginning. The unification of East Asian characters was done on purpose, and has the full support of linguists, scholars, and computer scientists from those countries.

        If the author of that article had just spent a few minutes reading a copy of The Unicode Standard, he would not have made those mistakes. He didn't even have to read the whole thing! Just the Introduction and Appendix A would have set him straight on the issues I just mentioned. The fact that he didn't means this guy really shouldn't be doing work for a company with the word "Research" in the title.

        Oh, and even though that page says the article has not been modified since June 4, you can see from the google cache [google.com] that they have since removed their promise of responding to criticism.

        And one more thing: Since he derides those mean old Westerners on the Unicode committee for being insensitive towards the peoples of East Asian countries, perhaps he should ask himself if it is considered impolite or insensitive to sweepingly refer to such peoples as "Oriental", which he does in the first few paragraphs.

    • There are Unicode character sets in the 32-bit range; the first 16 bits are only supposed to be used for current languages in active use. So cuneiform, along with Linear B, runic, and possibly Tolkien's runes (and, unofficially, Klingon), will probably end up in the 0x1xxxx range.

      UTF-8 is actually perfectly sufficient for 32-bit characters. (And you probably meant UCS-4; UTF-n is an encoding built from n-bit units that can represent characters wider than n bits, while UCS-2 and UCS-4 are the fixed-width 2-byte and 4-byte forms of the character set.)
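
To make the point about encodings concrete, here is a small Python sketch that encodes one code point above U+FFFF in three encoding forms using the interpreter's built-in codecs. The helper name `encodings` is invented for this example, and U+12000 is used only as a sample value in the 0x1xxxx range.

```python
def encodings(cp):
    """Return the UTF-8, UTF-16 and UTF-32 (big-endian) bytes for one code point, as hex."""
    ch = chr(cp)
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }


def round_trips(cp):
    """Check that every encoding form decodes back to the same single character."""
    ch = chr(cp)
    return all(
        ch.encode(codec).decode(codec) == ch
        for codec in ("utf-8", "utf-16-be", "utf-32-be")
    )


if __name__ == "__main__":
    print(encodings(0x12000))
    # {'utf-8': 'f0928080', 'utf-16': 'd808dc00', 'utf-32': '00012000'}
```

All three byte sequences name the same character; only the packaging differs, which is the sense in which the UTF-n forms share one character set.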
  • Remember when the pictures of Mars came out, and someone found the "face on Mars" in one of the prints?

    Wonder how long it will be before someone finds something interesting here, and how long it will take to "doctor" it?

    Alternatively, how long will it take for someone to fake something?

  • "Using SVG and XML to represent their documents. Similar efforts are underway for hieroglyphics."

    They're using XML? They could integrate this with some sort of retrieval language and couple it with Jabber [jabber.org] clients. That way you could send some sort of command-line search/retrieval command to the database using a regular Jabber client and have the XML data sent back, since Jabber natively supports the standard.

    • Well, the best answer to this one was provided on the hieroglyphics page. I'm not sure if it was slashdotted after I got it (the UCLA one was down just after the first comment was posted) so I'll post the majority here.

      XML is a format which allow both to describe an encoding and to write encoded files. It was chosen for a number of reasons. First, it's easy to extend an XML format. Second, it's easy to parse an XML file, an there are a lot of tools for it: people will be able to manipulate XMLMCD files without being graduate in Computer Science. Third, XML is being used for a growing number of applications --- for instance web browsers. Fourth, there's a user community for XML in the philological world : two interesting examples are the Text Encoding Initiative and the recent conference on XML and Ancient Near East.
  • ... (Score:4, Funny)

    by evel aka matt ( 123728 ) on Thursday November 08, 2001 @01:13AM (#2536483)
    How Snowcrash.
  • by Teancom ( 13486 ) <`david' `at' `gnuconsulting.com'> on Thursday November 08, 2001 @01:13AM (#2536484) Homepage
    they are also writing their tcp packets on clay tablets, and attempting to send them down the wire. That was the quickest /.'ing I've *ever* seen.
  • Cuneiform writing (Score:5, Informative)

    by Alien54 ( 180860 ) on Thursday November 08, 2001 @01:16AM (#2536495) Journal
    Slashed already

    [smile]

    Scientific American [sciam.com] has this article on Information Technology, 2500 B.C. [sciam.com] on what life was like for the information worker of that day.

    As many as half a million cuneiform tablets, hand size up to book-page size, are now available around the world. Surely many more are waiting to be found. Those samples are of every quality: once prized accounts and receipts, schoolboys' lessons, litigation profound or droll, literary essays, erotica, mathematics--and entire ancient epics, centuries older than Father Abraham's. A mostly unread treasury, comprising the equivalent of tens of thousands of large printed volumes.

    Looks like there could be a lot of fun and good stuff there.

  • "640 clay tablets is enough for anyone!"


    -- William "Scorpion King" Gates

  • Wow (Score:2, Funny)

    by Anonymous Coward
    "More than 5,000 years ago, the very first information revolution occurred when some unknown research team in Mesopotamia found a way to download and store language through a killer application called "writing." The cuneiform digital library will have 60,000 texts ready in a couple of years, using SVG and XML to represent their documents."


    Sooo... this project has been going on for about 5,000 years, they're finally going to be making a large release in a few years, and we're *JUST NOW* hearing about this?

    My *god*, talk about keeping the PR lid on tight!
  • by Anonymous Coward
    ...or else the uhh.. because... uhmm..
    Oh, what the hell.

    Micro$oft sucks.
  • by Waffle Iron ( 339739 ) on Thursday November 08, 2001 @01:27AM (#2536535)
    IIRC, cuneiform writing is composed entirely of angle brackets. To write this in XML, every character is going to have to be escaped!
    • That's why.. They are using SVG! The XML can be used to store metadata or even the document with the SVG references for the cuneiform characters.

      Run through an XSLT transformation.. Voila... HTML or PDF representing the cuneiform document (Do texts written in cuneiform qualify as documents??!? ;).

      Jeremy
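
For anyone taking the angle-bracket joke literally: escaping is a solved problem in any standard XML library. A minimal sketch with Python's standard library follows; the `wedges` string is a made-up stand-in for "text made of angle brackets", not real cuneiform data.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical stand-in for a line of wedge-shaped marks.
wedges = "<< > <"

# escape() replaces &, < and > with the corresponding XML entities.
escaped = escape(wedges)

# unescape() reverses the substitution, so nothing is lost in storage.
restored = unescape(escaped)

if __name__ == "__main__":
    print(escaped)   # &lt;&lt; &gt; &lt;
    print(restored)  # << > <
```

In practice, as the reply above says, the shapes themselves would live in SVG rather than in escaped character data.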
  • by rfsayre ( 255559 ) on Thursday November 08, 2001 @01:32AM (#2536551) Homepage
    <!DOCTYPE JAMM SYSTEM "justified.dtd" [xmission.com] >

    The cuneiforms are justified and ancient.
    and well formed.

    XML is gonna rock you.
  • I believe the ancient Egyptians avoided using XML at the time because of concerns over RAND licencing and preferred the patent-free ideograms.

    No, really.
  • I was worried I might end up here [goatazte.cx] instead...
  • Correct me if I'm wrong, but what is XML doing that some homegrown solution couldn't? Obviously clients would have to know the protocol, but with XML that is also the case.

    I use XML all the time, mainly because of XSLT, but I think it's less functional and more hype. Feel free to enlighten me.

    • by xant ( 99438 )
      Clients would have to know and implement the protocol. But since XML always looks the same, implementing the protocol is just a matter of linking the standard XML library in the language of your choice and using the DTD to decide what you want your client to understand.

      There's other advantages, but that's a big one.
    • Re:XML Overrated? (Score:2, Interesting)

      by ukryule ( 186826 )
      When you're coding up ancient writing, you want to store much more information about each character or word than with normal text (colour, angle, depth etc.). XML is quite good at storing these attributes, so it makes sense to use it.

      Taking a quote from the hieroglyphics link [univ-paris8.fr] (can't comment on the cuneiform link as it's /.ed):

      Let's illustrate these points. In the current MCD, data about an individual sign is scattered around it. Look for example at :

      =A1\\r1 -i

      It means "Sign Gardiner A1", as both grammatical and word ending, reversed, rotated. Fine positional data, colour data, and more are hard to add. On the other hand, the current proposal would represent the same sequence as

      <hieroglyph code="A1" gramend="y" wordend="y" rot="90" reversed="y">
      <hieroglyph code="i">

      Of course, as with any use of XML, you could do it with a 'homegrown' solution, but the point is that using XML gives you a well known (and well supported) framework which everyone can standardise on. (And yes I know the XML in the example is malformed ...)
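
As a rough illustration of the "well supported framework" point, the quoted fragment can be made well-formed and read back with a stock parser. This sketch uses Python's `xml.etree.ElementTree`; the `<text>` wrapper element and the helper names are assumptions added for the example (an XML document needs a single root), and the attribute names are copied from the quote rather than from any published schema.

```python
import xml.etree.ElementTree as ET

# Well-formed variant: self-closing elements inside a hypothetical <text> root.
WELL_FORMED = """<text>
  <hieroglyph code="A1" gramend="y" wordend="y" rot="90" reversed="y"/>
  <hieroglyph code="i"/>
</text>"""

# Shortened form of the quoted fragment: the elements are never closed.
MALFORMED = '<hieroglyph code="A1" gramend="y"><hieroglyph code="i">'


def is_well_formed(xml_text):
    """Return True if a standard XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def sign_attributes(xml_text):
    """Return one attribute dictionary per <hieroglyph> element, in document order."""
    root = ET.fromstring(xml_text)
    return [dict(el.attrib) for el in root.iter("hieroglyph")]


if __name__ == "__main__":
    print(is_well_formed(MALFORMED))
    for attrs in sign_attributes(WELL_FORMED):
        print(attrs)
```

No custom parser was written here, which is the practical advantage over a homegrown format such as `=A1\\r1 -i`.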
    • Hmm, how about because XML and SVG are well defined standards that already have a huge amount of software available for them?

      Or, because XML is increasingly used in other applications, hence interoperability is not only high right now, but is also getting higher?

      But perhaps it is because XML is very well suited to representing diverse forms of data.

      I dunno..
  • the very first information revolution occurred when some unknown research team in Mesopotamia found a way to download and store language through a killer application called "writing."

    Talk about dead projects. I mean, freshmeat has nothing on these guys. 5,000 years, and how many upgrades? I'm STILL using writing 1.0, for chrissakes, not because it's better, but because there are no other versions!
    • I know it's bad form to criticise someone's writing on /. but it really is time for you to catch up with modern developments ... Version 1.0 (codename cuneiform) has long been superseded by 2.0 (codename hieroglyphics), 3.0 (greek), and 3.1 (latin).

      While there is still some support for all sub-releases of version 3, I suggest you upgrade to the latest release (3.1.27 - 'joined up alphanumeric').

      Of course there has been some criticism of the 'open source' nature of the writing project with claims that it leads to too many active branches (most notably with interoperability issues with the popular 'Chinese', 'Arabic' and 'Roman' branches).
  • Unfortunately, the documents must be transcribed, which means that we may well miss out on the doodles and other things that get written with writing.

    Consider, for example, the carry dots that some people use to add up numbers. Dots and things like that in the text may well uncover the way that calculations were done.

    • The marks need not be lost in transcription if they're using the Text Encoding Initiative [tei-c.org] guidelines. TEI allows extra-linguistic marks to be captured alongside the text.
    • Considering that these materials were typically baked or kiln-fired to ensure permanency, it is unlikely that there is much in the way of doodles and annotation. Such ephemera were lost with the next rains.

      Interestingly, the developers of cuneiform also developed the first envelopes. The main message was kiln-fired and then wrapped with a new layer of clay, the address incised and the result merely air-dried. The recipient then gave the lot a crack against a nearby stone and brushed away the *envelope* to read his mail.
  • I haven't looked in almost a year now, but the last time I did, there was an alpha (rendered lots of graphics correctly, lots incorrectly) patch for Mozilla and no SVG support for IE or any other browser. Did everybody catch up while I wasn't looking?
    • Adobe has a plug-in [adobe.com] for IE and many nice SVG demos [adobe.com]. Unfortunately the plug-in is not integrated into IE, so you have to download it.



      IE directly supports VML (try it here [microsoft.com] if you are using IE), which does more or less the same as SVG except that it's older, not standardized, and only supported by Microsoft.

      • The Adobe SVG Viewer plug-in is included with the Acrobat Reader download now, so it should be on a lot more computers soon.

        Also, there's Batik, which is a Java-based SVG viewer plus some other tools.

        VML does much less than SVG; it's pretty primitive in comparison. And it seems to have stagnated -- MS hasn't updated their support for it in IE for a long time.

  • SVG is XML (Score:1, Troll)

    by joonasl ( 527630 )
    SVG (Scalable Vector Graphics) is a subset of XML. Stating that something is stored in SVG AND XML format is a tautology.
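
Strictly speaking SVG is a vocabulary defined on top of XML rather than a subset of it, but the practical consequence is the same: any generic XML parser can read an SVG file. A small Python sketch follows; the triangle is an invented wedge-like shape for illustration and is not taken from the cuneiform project.

```python
import xml.etree.ElementTree as ET

# The SVG namespace URI defined by the W3C.
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical one-shape drawing: a triangle standing in for a single wedge.
svg_doc = (
    f'<svg xmlns="{SVG_NS}" width="20" height="20">'
    '<polygon points="0,0 20,10 0,20"/>'
    "</svg>"
)

# A plain XML parser handles it; no SVG-specific library is involved.
root = ET.fromstring(svg_doc)
shape = root[0]

if __name__ == "__main__":
    print(root.tag)             # {http://www.w3.org/2000/svg}svg
    print(shape.get("points"))  # 0,0 20,10 0,20
```

So a project can keep its metadata in its own XML elements and its glyph outlines in SVG elements, and process both with the same tools.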
  • by brad-d ( 30038 ) on Thursday November 08, 2001 @05:10AM (#2536962)
    All I can think of now is the new book series:

    "XML for Mummies"

    At least in this case when you see the reviews "this book will put you to sleep" it really doesn't matter.
  • ICE (Score:2, Troll)

    by zephc ( 225327 )
    the xml.org link for cuneiform encoding initiative is at http://www.jhu.edu/ice/ [jhu.edu]

    There is an initiative for almost every ancient language that is known (and decipherable). I'm sure digging thru xml.org will turn up a bounty of results =]
  • Gosh, here's the perfect topic for Ye Olde Curmudgeon!

    Weren't the old drum storage systems used in the 1960s a ceramic structure coated with a magnetic surface? And that was an improvement on those birch bark 80 column cards.

    But now we have advanced ceramics used in various other electronic media. And we measure our mean time between failure in hours.

    So, how far have we really come in the last 5,000 years? They had fire and clay and their data remains readable after 5,000 years. We have lightning and clay and can't read data from 15 years ago and hard drives can fail in a flash.

    Why aren't we planning storage and retrieval systems that can last thousands of years? Is it because our technical culture only values the last 2 to 3 years? How will we answer to our children when they can't figure out what we did 25 or 50 years from now? And I don't think we can blame it all as a planned obsolescence feature of Microsoft...well, maybe not all of it!

  • i certainly hope it's a freely available public resource... i've been studying cuneiform texts (mainly Sumerian myths) for a couple years now and an archive like that would.. well the idea loosens my bowels and excites my senses! in other words i damn near crapped myself with joy when i read that.
  • Then I can write a washing bill in Babylonic cuneiform
    • Then I can write a washing bill in Babylonic cuneiform

      But it still won't help you learn about Caractacus's uniform. You've got to keep these things in perspective.

  • Did they find exactly who invented writing? What are the earliest statements they've found? What do they say?
