Science

A Genome Mark-up Language

There's an interesting story running about the need for, and development of, a genetic mark-up language. It's called GEML - Gene Expression Mark-up Language - and is basically a DTD. Obviously, when working with things like genes, GEML is useful - and a good example of why DTD is muy bein.

  • by Anonymous Coward
    They are not for text formatting. HTML was _misused_ for text formatting - the original idea was that you tell the computer what the thing _is_ and the _computer_ formats it. Like a simplified LaTeX (if you didn't know, in LaTeX, you basically say "I'm writing a book now", "This is the first chapter", "This is a footnote", etc., and the computer decides the formatting. That is to say, where to put everything. Or are you confused as to what formatting means?). Then XML came along. Its main use is telling the computer, in an extensible manner, the meaning behind a piece of data - e.g. "This is a spec for a fireplace", "This is an address", "This is how you make a fruit fly". The computer can then take appropriate action based on that (currently via XSLT and CSS to format + present the data appropriately) - but it can do other things too. It's a way of processing semantic content. It's a step towards AI.
  • by Anonymous Coward
    Mark Pesce ought to spend more time researching what he's writing about rather than plugging VRML. From the article:

    The "reporter" tag defines a sequence of codons (the four amino acids that comprise DNA) -- TACAGTGTCAGAATTAACTGTAGTC --




    Elementary Grade 9 biology here, Mark. A codon is a sequence of three nucleotides (ex: GCC) that are in turn expressed into the 20 amino acids that constitute the building blocks of all our proteins. Don't just regurgitate what was in the press release!

    Anyway, GEML is useless for real exchange and analysis of genetic information. For that purpose, I agree with a previous poster about packing 2 nucleotides per byte. It's an optimization that must be accepted as a standard before we can start doing on-demand heavy processing of genetic results.
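For anyone curious what "2 nucleotides per byte" could look like in practice, here is a minimal Python sketch. The 4-bit codes are an assumption made for illustration (one bit per base, which would leave room for ambiguity codes); nothing here comes from GEML or from any agreed standard.

```python
# Hypothetical 4-bit codes: one bit per base, so ambiguity codes could be
# added later by OR-ing bits together (an assumption, not part of GEML).
CODES = {"A": 0x1, "C": 0x2, "G": 0x4, "T": 0x8}
BASES = {code: base for base, code in CODES.items()}


def pack(sequence: str) -> bytes:
    """Pack two nucleotides into each byte, high nibble first."""
    nibbles = [CODES[base] for base in sequence.upper()]
    if len(nibbles) % 2:
        nibbles.append(0)  # pad an odd-length sequence with an empty nibble
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack(data: bytes) -> str:
    """Reverse pack(); a zero nibble is treated as padding and skipped."""
    out = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble:
                out.append(BASES[nibble])
    return "".join(out)
```

With this scheme the 25-base sequence quoted from the article fits in 13 bytes instead of 25. A 2-bit code would halve that again, at the cost of having no room for anything beyond A, C, G and T.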

  • "I would agree that bioxml serves as a much better licensing model for the community than GEML, but it's worth mentioning that at the current time they do not compete. GEML appears to be about gene expression, and bioxml has no DTDs addressing this."

    True. I do think that bioxml's goal is the same as GEML, but they're just not as far along as GEML (yet). It's just bothersome to me that a company-owned and controlled format like GEML could become very prevalent. I would still much rather see something like bioxml succeed instead. I hope they don't give up because of this...
  • "there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - it's the same in this case."

    Apples and Oranges.

    HTML is controlled by the w3c--a standards body more or less independent of any particular company. Sure, M$ and Netscape had a lot of pull on HTML, but they *should* have, given that they *were* the browser market for a long time.

    In this case, we have a particular bioinformatics company graciously offering up their own "public domain" DTD as a standard for the rest of the industry (how generous). And a major scientific journal latching on to it. The only problem is, that same bioinformatics company must approve any and all changes to the "standard"! It would be the same if HTML were a copyrighted property of Netscape, Inc.

    It would be nice if the bioinformatics community could organize and form its own XML standards body, a la the W3C. An agreed-upon standard is almost always better than a legislated standard.
  • by Tim ( 686 ) <timr AT alumni DOT washington DOT edu> on Thursday January 11, 2001 @10:34PM (#512254) Homepage
    The bioxml [bioxml.org] project has been trying to do this very thing for quite a while now. Previous to that, there was the biomolecular sequence markup language (BSML), and I don't think it ever came close to becoming a standard. The problem that these efforts always run into is the sheer diversity of opinion on how biological data should be represented. Molecular biologists and computational biologists can't even agree on the basic things, like how to represent sequence regions, let alone more complex issues, like annotation syntax.

    Why Nature chose GEML as a standard is unclear--the article doesn't present a compelling argument for it over the alternatives, and the choice seems a little arbitrary. It'll be interesting to see what impact this has on the other projects, and how open the standard will be to extension and modification.
  • by Tim ( 686 ) <timr AT alumni DOT washington DOT edu> on Thursday January 11, 2001 @11:37PM (#512255) Homepage
    From the GEML terms of use [geml.org]:

    The GEML Format is a free, public-domain, open standard created and licensed by Rosetta Inpharmatics, Inc. ("Rosetta") in order to define a single, distinct format for handling gene expression data and avoid proliferation of incompatible variations.
    ...
    You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML Format or documentation without written permission from Rosetta.


    So nobody can fork the standard without first consulting with Rosetta Inpharmatics. Wonderful. I just love their definition of "open standard."

    This looks like another corporate-buddy move by a major scientific journal, much like the Science/Celera deal a few weeks back...

    Go see bioxml [bioxml.org] for a truly open alternative.
  • we can all talk to Miguel de Icaza in his own language.

    You mean object-oriented C?
    __
  • the United States of America, whose official language is English

    Really? Does it appear in the US Constitution?

    And aren't Linux and Perl Slashdot-official? Should we limit ourselves to discussing these?
    __
  • You could also inform that the proper Spanish phrase is "muy bien".
    __
  • WTF is "muy bein"? Haha.

    --
  • Hemos, when comparing things, use than, not then. For instance, this article should've been from the "it's-better-than-the-web!" dept. The word then is used to describe a time sequence or other ordering, as in "first this, then that." The word than is used to compare things, as in "this is better than that." Got it?

    *sigh*

    --Joe
    --
  • That reminds me of this grammar puzzler. Add punctuation to the following to make it grammatically correct:

    JIM HAD HAD HAD WHILE JOHN HAD HAD HAD HAD HAD HAD HAD HAD A BETTER EFFECT THAN HAD HAD HAD
    --Joe
    --
  • Uh, no. Bad grammar bothers me.

    --Joe
    --
  • The genome is much like human language-
    a fair amount of regularity plus a lot of special
    cases. In fact the latter throws off decoding
    robots and you see statistics like 98% decoded, etc.
    The scientific papers are full of nifty
    exceptions to what was believed before.

    The markup language would have to be flexible
    enough to encode all the exceptions- perhaps as
    a procedural attachment.
  • by weston ( 16146 ) <westonsd@@@canncentral...org> on Thursday January 11, 2001 @09:19PM (#512264) Homepage
    While all of this is fairly unreadable -- even by geneticists -- it is easily read by a computer

    GEML? Hard to read? Bah! What we should *REALLY* do is figure out a quadrary (you know, after binary and trinary) encoding scheme for all the other info and just pre-pend it to the beginning of the amino acid sequence. Maybe even insert it at some points, with some sort of delimiting sequence, of course. None of this wimpy markup language stuff.



    --
  • Unfortunately, they often tend not to do that :(
    At least life scientists do not.

    Instead, they use (the much dreaded) Word and wonder why all their betas, gammas, indices etc. tend to always disappear at the wrong moment...

    I once wrote a web application where people could submit an abstract for a congress on developmental neurobiology. I allowed for subsets of HTML or simplified LaTeX for text formatting. It was hell - even the brightest people in their field failed to understand the concepts. I believe I spent more time searching texts for missing tags or closing braces than for anything else...
  • Hehheheh =:-) I prefer the _time honored_ method of exchanging genetic material =:-) [sorry, couldn't resist...]
  • The GEML Format is a free, public-domain[...]

    You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML[....]

    It seems somebody doesn't understand the legal meaning of "public domain": that anybody can modify what is in the public domain, without restriction. That is why free software and Open Source Software AREN'T "public domain"!
  • Yes. That is true. But I am studying Spanish and like to practice whenever I can.
  • The previous poster suggests that the incorrect muy bein should be spelled muy bien. This is a correct spelling, but misses the grammatical error (hey, this is Slashdot!). bien is (generally) adverbial (meaning "well"), and since we're talking about a DTD, we want to use an adjective ("good"). In other words, the sentence should read "DTD is muy bueno."
  • <genes>
    ttaacattgagctaacgataggatacgattacattgagctaacgatagga
    tacgattacattgagctaacgataggatacgattacattgagctaacgat
    </genes>

  • by upstateguy ( 90019 ) on Friday January 12, 2001 @03:47AM (#512271)
    As a molecular geneticist, I am familiar with the markup languages that *already* exist for annotating genome sequences. Free software from NCBI even helps you format your sequences for submission to databases.

    Sorry, I'm too lazy to annotate this myself :-):

    Link to NCBI [nih.gov]

    FASTA looks remarkably like the example given in the article.

    Quick description of FASTA [cornell.edu] (just one of many schemes, but one of the most popular and oldest).

    Perhaps rather than writing a trendy article trying to get buzzwords like genomics and bioinformatics together with geek speak, he should have done a tad more research.

    That's not to say there can't be huge improvements, such as trying to show the interplay (temporally AND physically) between genes. But don't do a half-assed job by ignoring what has already been used for decades.
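To make the FASTA comparison concrete: a FASTA file is a ">" header line followed by one or more lines of sequence. Below is a minimal Python sketch of a reader. The record name and the way the sequence is split across lines are made up for illustration, and real files may carry conventions this sketch ignores.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (text after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical record; the sequence is the one quoted from the article.
example = """>seq1 hypothetical reporter fragment
TACAGTGTCAGAA
TTAACTGTAGTC
"""
records = parse_fasta(example.splitlines())
```

The whole format fits in a dozen lines of code, which is both its appeal and its limit: anything beyond a name and a sequence has to be squeezed into the header line as free text.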

  • GEML just sounds better to the kind of people who would be in charge of this kind of thing. bioxml has no capital letters, is half-pronounceable and half-gotta-be-spelled-out, etc. GEML is all capital letters, can be spelled out or pronounced as a whole, etc. I think that why they chose GEML as a standard is far from unclear; rational is another matter.
  • But those capital letters! And just because most Slashdotters can pronounce every imaginable acronym smoothly doesn't mean anything for other communities...Imagine being an average Joe and driving down the street, seeing a big sign with GEML on it and another one that just says 'bioxml'. I'd think...wow, those idiots don't even have the self-esteem to capitalize their own name, whereas these other guys...wow, they must have a great product if they can handle four capital letters.
  • I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help.

    Admittedly RDF hasn't been used much YET. After all - it's only a year since bog-standard XML took off. I'm a contractor; Dec '99 I couldn't sell XML skills to anyone, Jan 2000 my phone melted. By Easter 2000 everyone else was an XML "guru".

    Wrox don't shift their first RDF book until October. You can't store production-grade quantities of RDF in a database yet. How can you say it's "past", when we haven't even finished building the infrastructure tools yet ?

    OTOH, the one widely distributed RDF app that is out there (RSS) is even part of Slash. Take a look at those Slashboxes - they aren't running DocBook.

    Added to which you can employ namespaces to form compound documents from many schemas,

    That's just a quicker recipe for tag soup. The ability to have five different ways to express an author's address doesn't make it any easier to move data between applications or avoid "Dear Mr. Occupier" errors.

    "It's the Semantics, Stupid"

    Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics.

    What document semantics ? DocBook doesn't do semantics, and it has a structure that thinks everything is a computer manual. A schema that has a <GUIMenuItem> element, but doesn't have a means of expressing a target readership age ? Rights management that's a bare copyright element with an implied recommendation to attach generated text of "All Rights Reserved" when you render it ? (What if the rights _aren't_ all being reserved ?)

    DocBook is a pile of bodges and hacks, and I only use it because I don't know anything else that's out there, and I'm reluctant to roll my own and add another one to the pile.

    DocBook is Perl for text documents; lots of "There's More Than One Way To Do It", and not a lot of "Done. Sorted.".

    My current project (the next version of ARKive [arkive.org.uk]) is a huge graph of linked nodes, most of which are either text or rich-media. The directed nature of the graph blows plain XML out of the water - there's just no way to handle the referencing problem in XML; you're either fooling around with the inadequate ID & IDREF, or you do it through either XLink, or your own href attributes and lose support for any notion of document structure based on these links, unless you code it yourself at the application level. With RDF, I just talk to an API like Jena and when I make things related, they stay related (and the underlying engine will hand them back to me on demand, as whatever relevant fragment of the document I might need).

    I am using DocBook to represent the text content nodes. It's not much more advanced than HTML though - I need a huge amount of markup on each node to select the appropriate set (what it refers to, what it says about it, whether it's written for 7 or 17 year olds) and I hold this trivially in RDF, with DocBook under a content property.

    There's simply no way I could express this in DocBook alone. I could express it in DocBook with embedded LOM markup, and I could do that very easily just by namespacing two schemas as you suggest. The trouble with that approach though is that the only code that could ever make sense of it would be my own. With RDF, any RDF app (like the Redland app framework) can wander through it and make pretty good use of it, even if it hasn't seen the documents before.

    XML has no mechanism for a semantic schema. Attempting to use the structural schema it does have, as one, doesn't work well and it certainly doesn't travel.

  • I see your point, but semantics are never enforceable anyway.

    Who cares ? If you're publishing the latest fat stock prices, then it's in the user's interests to get it right. Semantic publishing needs a reliable means of making them available to those who want them, it doesn't need to follow them up and enforce getting it right.

    you haven't told me how RDF gets around this

    Take a look at RDF Schema [w3.org].

    Of course, semantics aren't enough on their own. It's not too useful to know where the "creator" value is in two schemas, if you can't distinguish between one's "author" and the other's "translator". This is where an ontological understanding is needed, and there's a couple of projects out there working on that too; DAML [daml.org] & OIL [ontoknowledge.org].

  • DTDs are going to be required for defining new XML grammars

    Rubbish. I haven't written a DTD in over 18 months. Tool support is better than DTD, mainly because Schemas also use XML as their expression syntax and so it's trivial to build tools (often with XSLT) for them.

    Schemas are still brand new, and tool support is weak to nonexistent.

    Schema has been a Candidate Recommendation since October. Maybe it's not signed off yet, but it's pretty stable and usable out in the "real world".

    I thank M$oft for this one. Dropping early versions of XSL and Schema onto developers a long time ago put a rocket under the W3C. This might have ended badly, except M$oft then did something unusual for them and fell back into line with a developing standard. Credit where credit's due...

  • This is another example of What's Wrong With XML (and particularly, what's wrong with proliferating schemas all over the place).

    A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.

    Look at a real user of RDF [purl.org] for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.

    A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.

  • by jrg ( 98378 )
    a quote from the article:

    "The 'reporter' tag defines a sequence of codons (the four amino acids that comprise DNA)"

    sheeesh! can't they even get the basics right? a codon is a unit of three nucleotides that encodes a single amino acid (there are three out of the 64 that do not code for an amino acid; rather, they code for the translation stop signals).

    four nucleotides comprise DNA. there are 20 amino acids.

    this type of error is shameful.

    james
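The arithmetic above is easy to check in a few lines of Python. This is only a sketch: the stop codons are written in the DNA alphabet under the standard genetic code, and `split_codons` is a hypothetical helper that assumes the reading frame starts at the first base.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet

# Every possible triplet of the four nucleotides: 4 ** 3 = 64 codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
coding_codons = [c for c in all_codons if c not in STOP_CODONS]


def split_codons(sequence: str) -> list:
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]
```

Applied to the 25-base string from the article, `split_codons` yields eight codons (TAC, AGT, GTC, ...) and drops the one leftover base. Since 61 coding codons map onto only 20 amino acids, most amino acids are encoded by more than one codon.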

  • "Any [sic] just why not is this amazing?"

    because it's XML.

    james

  • It will have the additional benefit that you could do gene therapy by applying an XSL stylesheet in the transporter.

    Oh dear, this is beginning to sound like a Voyager plot.

  • I think you mean "quaternary"?

    What would be REALLY cool would be code for a protein which decompressed the rest of the stream.... ;)
  • There are lots of ways to extend and modify the behavior of an XML dialect and an associated DTD/schema without touching the core standard. That's the Xtensible part. They are merely holding veto power over back-propagation of enhancements into the original work.

    The point of XML is to standardize the manner of extension. Even SGML allowed for internal subsets of markup declaration to extend the core DTD. The goal of such a standard is not to eliminate incompatibility but to minimize the pain of dealing with it.

    Forking a DTD is like forking pudding, it doesn't do anything.
  • Their license would appear to prohibit that which their chosen technology is intended to facilitate.
  • If GCC is a codon, what is bash ?

    [Assume mandatory smiley here]

  • by myc ( 105406 ) on Thursday January 11, 2001 @09:11PM (#512285)
    Since classical genetics has been around for a lot longer than computers and ASCII, many classical genetic nomenclatures use notoriously ASCII-unfriendly symbols. For instance, as many of you know, Drosophila (fruit fly) geneticists can basically name genes anything they want to, and the nomenclature used to denote specific mutant alleles of genes uses all sorts of evil things like subscripts, superscripts, Greek letters, etc etc. In short, it's just a total mess. Similarly, although yeast geneticists do have a standardized nomenclature, it's very ASCII-unfriendly, due to things like Greek letters, superscripts, subscripts, etc etc. Nomenclature for mammalian systems such as mouse and human is even worse; there is basically no standard. For instance, some gene names use all CAPS while others only capitalize the first letter, and some use the common three-letter convention plus a number (BMP1, BMP2, BMP3, etc etc), while others use a Drosophila-type naming scheme (e.g., agouti and shaker are mouse mutant names). There is some uniformity in the gene assignments from large sequencing projects, but those are just alphanumeric sequences and not very descriptive.

    Contrast this with a relatively more recent model genetic organism, the roundworm Caenorhabditis elegans. Standards were set early whereby all gene names were standardized on the basis of their phenotype (eat-4 is a worm with a mutant feeding behavior, unc-6 describes a worm with uncoordinated movement, lin-41 describes a mutant with mutant cell development lineage, etc etc), and are ASCII-friendly. As a result, C. elegans people have enjoyed standardized and searchable computerized gene databases for much longer than geneticists in other fields.

    I hope that a standard becomes set and rapidly adopted; lab chiefs (to us grad student peons anyway) can often seem like PHBs in IT when it comes to adopting new methods and paradigms.

  • There's a bigger syntax error. The tag doesn't have any closure:

    <body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties="endowed"/>

    It's probably not useful to express hair color as full RGB values, though.

    You were being serious, right? Oh.
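Joke aside, the two syntax problems in this sub-thread (an unquoted attribute value and an element that is never closed) are exactly what any XML parser rejects. Here is a small Python sketch using the standard library's `xml.etree.ElementTree`; the element and attribute names are trimmed-down, hypothetical stand-ins for the ones in the parent post.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments modelled on the posts above.
unquoted = '<GEML><body eyes="#00FF00" hair=dark></GEML>'    # value not quoted
unclosed = '<GEML><body eyes="#00FF00" hair="dark"></GEML>'  # body never closed
fixed = '<GEML><body eyes="#00FF00" hair="dark"/></GEML>'    # self-closing tag
```

Only `fixed` passes `is_well_formed`. Note that this checks well-formedness only; validating against a DTD such as GEML's is a separate step that `ElementTree` does not perform.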

  • XML Schemas have the benefit of being written in XML. That should make XML Schema support fairly easy to manage. Of course, the parser has never been the hard part with XML.
  • I pronounce bioxml as by-ox-mil. The more I try to pronounce GEML as a word, the more it sounds like GML.
  • Can't wait for Lincoln Stein's GEML.pm module, with handy shortcuts and image creating functions :)

    I bet I can make a script that then creates a life form. aww yeah.
  • From the Feed article [feedmag.com]:

    GEML ISN'T alone. It has a competitor, another DTD known as CellML, used to define the complex interactions that take place within cells. CellML takes an integrated approach to describing all of the processes within a living cell -- its genes, proteins, enzymes, and chemical reactions, the pathways and connections between each part of the whole. CellML seems well suited to the kinds of work that supercomputers do -- creating simulations of incredibly complex systems -- while GEML only defines the genetics that create the cell.

    Doesn't this seem a more apt way of describing a living organism? Sure, it's undoubtedly more complex and expensive (financially and computationally), but if you were to set an E10000 or Cray (or maybe a high-end Sun farm) to work on CellML, wouldn't it do more in less time than having to work everything out manually with GEML?

    --
  • Yes, but only in binary-transmitted files; text files would be uneditable because of all the control characters. But 3 nucleotides could easily be packed into a base64 encoding, and each one would match an amino acid.
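The base64 idea works out neatly because there are 4^3 = 64 possible triplets and 64 symbols in the base64 alphabet. A minimal Python sketch follows; the A/C/G/T ordering is an arbitrary assumption, and this borrows the base64 alphabet only (it is not a real base64 encoding of bytes).

```python
import string

# Standard base64 alphabet (RFC 4648): exactly 64 symbols, one per codon.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical ordering
VALUE_BASE = "ACGT"


def encode(sequence: str) -> str:
    """Encode each triplet of nucleotides as one printable character."""
    if len(sequence) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    chars = []
    for i in range(0, len(sequence), 3):
        a, b, c = (BASE_VALUE[x] for x in sequence[i:i + 3])
        chars.append(ALPHABET[a * 16 + b * 4 + c])
    return "".join(chars)


def decode(text: str) -> str:
    """Expand each character back into its three nucleotides."""
    out = []
    for ch in text:
        index = ALPHABET.index(ch)
        out.append(
            VALUE_BASE[index // 16] + VALUE_BASE[(index // 4) % 4] + VALUE_BASE[index % 4]
        )
    return "".join(out)
```

Each character here stands for a codon rather than an amino acid: several codons translate to the same amino acid, so the text stays editable but is still three times shorter than the raw sequence.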
  • I'm not fully up on the XML scene, but aren't DTDs being replaced in the very near future by XSDs (XML Schema Definitions)? They at least are a dialect of XML, so to use XML you only have to learn one (easy) language.
  • over a year ago I described a human with XML tags for fun, something like this:
    <XML>
    <HUMAN GENDER="m/f">
    <HEAD>
    <BRAIN></BRAIN>
    </HEAD>
    <BODY></BODY>
    <LEGS></LEGS>
    </HUMAN>
    </XML>

    etc etc etc, maybe at some point null transportation technology will describe a human completely with his genetics, memory and personality with XML, and transport the person as energy over wireless media to put it all together at the other end.

    Hopefully fast XSLT engines will exist by then and hopefully the whole thing will not be based on MS implementation of XML document.

  • Answer: Just in case we ever need to view our genome sequence on IE

    And if the human genome has about 3 gig wouldn't wrapping quaint bits of information blow it up by quite a bit? sorry but the idea seems to rank on the same idiocy level as XML :)
  • Yes, CellML is a very nice way to describe all the processes within a cell. Also very important. But my understanding is that even with immense supercomputers (today) it still takes significant time to calculate something as banal and commonplace as protein folding. So CellML has its place (say for the AIDSResearch@Home project, in some future incarnation) but it's a bit too much information for geneticists. They'll both have their place. I should hope...
  • That's a good question. I don't know why Nature adopted GEML - it may be a case of a "tipping point" where enough geneticists and genomics firms finally realized that there was a need for a standard and some "cheerleader" got out there and started waving the GEML flag. If anyone knows the whys of this, I'd be interested to hear...
  • Yes, you're right, a codon is a 3 amino acid sequence. I should have used the words "base pairs" there instead.

    Mea culpa.

  • That's not a bad thing. Standards should not be arbitrarily pulled apart - particularly by competing commercial organizations (reference my XML article on FEED from a few years ago for points on this matter). The VRML97 ISO spec is "owned" by the Web3D consortium, in fact to make spec changes basically "illegal". Whatever that means.
  • Very well; it's Spanish. They use it to say good or very good.


    ________

  • Open standard or not - there's absolutely no value in forking a DTD. Unless you think there was maybe some value in all of the "modifications" Netscape and Microsoft made to the HTML DTD, for a simple example - it's the same in this case.
  • For the time being, DTDs are going to be required for defining new XML grammars - Schemas are still brand new, and tool support is weak to nonexistent.

    DTDs will probably stick around in one form or another for the next few years - it's unfortunate that Schemas couldn't have been part of XML 1.0 - unfortunately the coexistence of DTDs and Schemas will cause code bloat as tools will basically need to support both.

  • A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.

    Are you telling me that someone who doesn't have my data doesn't have it? Your astounding conclusion seems to be some sort of convoluted identity function.

    Look at a real user of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.

    No one is doubting that poorly implemented schemas will degrade productivity, but I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help. Added to which you can employ namespaces to form compound documents from many schemas, so your limitation doesn't exist in any case.

    A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.

    You're vastly overestimating the dynamic nature of these schemas - this isn't the HTML DTD we're talking about. Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics. Once again proof that a properly designed format weakens your counterarguments, and in any case, RDF isn't going to ever, EVER take off, so it's probably time to quit flogging it.

  • I see your point, but semantics are never enforceable anyway. At the end of the day, if people want to take your document and completely invert your semantics, they are going to do it.

    Added to which, you haven't told me how RDF gets around this, or are you saying that the issue should be avoided altogether?

  • It's nice that the genome has been "sequenced in its entirety" and is presently undergoing "error checking" which should "continue for the next year".
    Last time I checked at ncbi [nih.gov] the genome was 30.4% finished, and the rough draft assembly is in 148307 pieces according to the golden path [ucsc.edu].
    And of course the finished target for the human genome is three years from now!
  • So, how do you write Hello World (or its equivalent) in GEML?
  • by Fervent ( 178271 ) on Thursday January 11, 2001 @09:10PM (#512306)
    Insurance provider: Well Mr. Johnson, I'm afraid you have the <stupid person> tag.
    Mr. Johnson: No!
    Insurance provider: Yup. It's right between the <bald ugly-looking guy> tag and the <most likely to drink beer after finding out his wife gets fatter with age> tag.
    Mr. Johnson: Oh God.
    Insurance provider: I'm sorry.
    Mr. Johnson: Is this hereditary? What can be done about my kids?
    Insurance provider: Well, we can comment out the little buggers if we try. Some GScript may work to prevent them from passing the traits onto their children. Hell, we may even be able to use some Gava to touch up their faces so they won't be as ugly as you.
    Mr. Johnson: And as for me?
    Insurance provider: Your body is 2.0, Mr. Johnson. As far as we're concerned, no one supports you anymore.
  • The article mentions that the GEML fragment on display may be incomprehensible even to geneticists, but is readable by computers. It goes on to say that the value of GEML is in allowing computers to share data. I am confused, since XML is either offering a verbose definition of computer data that even humans can understand, or allowing human data (David is an Employee of IBM) to be expressed in self-describing, computer-accessible form.

    Since the genetic code is already digital, transforming it into something that computers can process seems rather pointless; what is wrong with AGTCTTCGADC? Making it verbose for humans is also not very useful, because what GEML seems to offer is very raw data, essentially a wrapper around raw sequences.

    Maybe the issue is really hype. I.e. a clever gimmick to drive companies to share information by offering them bandwagons they can't refuse to climb?

  • <GEML>
    <body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties=endowed>
    </GEML>

    You have a syntax error in 'crotchproperties', 'crotchproperties' set to "0"

    Coming soon, MS GenomePage 2006, so you can really start screwing things up.
  • We are really close to being able to modify human genome.

    From CNN [cnn.com]: Genetically modified monkey - named ANDi carries in him an extra bit of DNA from a jellyfish. ANDi is the first primate to be similarly modified.

    See CNN story [cnn.com] for full details.

  • The problem with standards is that you cannot just declare them. They have to be built with community agreement. At the moment there are "standards" in biological information. Lots and lots of them. Yesterday, for instance, I was struggling with sequence file formats. I can think of at least 15 different formats, all slightly different.

    The scope of GEML seems quite limited. It's about gene expression data, which is currently very sexy. It's also been licensed in a fairly restrictive manner. Not the way to go if you ask me.

    Phil

  • Incidentally, can anyone find any statement from Nature about this? I can't!

    Phil

  • "sorry but the idea seems to rank on the same idiocy level as XML "

    If you can not see the value of structuring data into a format which is easily parsable, and whose semantics are formally defined in a standard format, then I fear that your own idiocy level is fairly high.

    XML is potentially about a lot more than viewing web pages.

    GEML, incidentally, is pretty much useless for viewing genome sequences. Whilst this is no doubt mainly the fault of the bloke who wrote the article for getting it totally wrong, GEML is not designed to represent genomic information, but gene expression data. Two very different things.

    Phil

  • "Go see bioxml for a truly open alternative."

    I would agree that bioxml serves as a much better licensing model for the community than GEML, but it's worth mentioning that at the current time they do not compete. GEML appears to be about gene expression, and bioxml has no DTDs addressing this.

    As for Nature, well, I expect that their publishers are worried. Sooner or later paper journals are going to disappear. Perhaps they are diversifying, and have a stake in the company. This is not necessarily a problem. Even Nature does not have the power to make a standard.

    Phil

  • The problem with most of the markup languages used in biology is that they are simple schemes with a two-letter code at the beginning of each line. They tend to be very inexpressive, as there are no relations between the tags (a line is one thing or another, and each line is independent of the last). The main problem with this inexpressiveness is that it means "all the biology is in the comment field", or in other words unstructured free text. To extract this information in a machine-readable way, you get straight into natural (or, as this is biology, fairly unnatural) language parsing, and hit the same brick wall that AI has for the last 30 years.

    I agree that the article linked to is half-assed and badly researched. But the sad fact is that most of the database formats in existence also seem to be fairly half-assed. I think that XML might help us to get around some of these problems.
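
    To make the contrast concrete, here is a rough sketch in Python. The record, the two-letter codes as used here, and every XML element and attribute name are invented for illustration; none of it is taken from a real database entry or from the GEML or bioxml DTDs.

```python
import xml.etree.ElementTree as ET

# Made-up record in the style of a two-letter line-prefix flat file.
FLAT = """\
ID   EXAMPLE1
DE   Hypothetical example protein
CC   Expressed in liver; induced by heat shock.
"""


def parse_flat(text):
    """Return (code, free text) pairs; each line stands alone, nothing nests."""
    return [(line[:2], line[5:].strip())
            for line in text.splitlines() if line.strip()]


# The same facts with the relations made explicit (element names invented).
STRUCTURED = """\
<entry id="EXAMPLE1">
  <description>Hypothetical example protein</description>
  <expression tissue="liver">
    <inducer>heat shock</inducer>
  </expression>
</entry>
"""

records = parse_flat(FLAT)

# With the flat file, the biology is still locked inside the CC string.
comment = dict(records)["CC"]

# With the structured form, the same facts are addressable without text parsing.
entry = ET.fromstring(STRUCTURED)
tissue = entry.find("expression").get("tissue")
inducer = entry.find("expression/inducer").text
```

    Getting "liver" out of the flat record would mean parsing the English in the comment line; in the structured version it is just an attribute lookup.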

    Phil

  • Yes, you're right, a codon is a 3 amino acid sequence. I should have used the words "base pairs" there instead.

    Actually, a codon is just what they had "TACAGTGTCAGAATTAACTGTAGTC". A tri-base codon is special and is called a "triplet codon". That is TAC, etc...

  • pure luck my friend :P id like to give props to pieceofshit and anyone else who knows me
  • For that purpose, I agree with a previous poster about packing 2 nucleotides per byte. It's an optimization that must be accepted as a standard before we can start doing on-demand heavy processing of genetic results.

    There being four possible nucleotides (unless you're looking at something real exotic), surely you can get 4 per byte? Sticking to a base64 ASCII encoding you can still get 3 nucleotides, so a single codon, per character, which is possibly a more elegant optimization.
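
    A minimal sketch of both packings in Python, assuming only the four plain bases A, C, G and T. The 2-bit code assignment and the function names are arbitrary choices made for this example, and the per-codon version merely borrows the 64-character base64 alphabet; it is not the standard base64 byte encoding.

```python
import string

# Assumed 2-bit codes; the ordering A, C, G, T is an arbitrary choice.
BASES = "ACGT"
CODES = {base: i for i, base in enumerate(BASES)}
# The 64 printable characters of the usual base64 alphabet, 6 bits each.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def pack_bytes(seq):
    """Pack four nucleotides per byte; the length must be stored separately."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bytes(data, length):
    """Inverse of pack_bytes; `length` trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def codons_to_chars(seq):
    """Map each complete codon (3 bases = 6 bits) to one printable character."""
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    chars = []
    for i in range(0, len(seq), 3):
        value = 0
        for base in seq[i:i + 3]:
            value = (value << 2) | CODES[base]
        chars.append(B64[value])
    return "".join(chars)


def chars_to_codons(text):
    """Inverse of codons_to_chars."""
    out = []
    for ch in text:
        value = B64.index(ch)
        out.append("".join(BASES[(value >> s) & 3] for s in (4, 2, 0)))
    return "".join(out)
```

    The sequence quoted in the article is 25 bases long, so it packs into 7 bytes, while the per-codon form only covers its first 24 bases (8 characters). Ambiguity codes would need more than 2 bits per symbol.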

    Anyway, this shouldn't be necessary, and it goes against the XML philosophy. Although humans on the whole aren't meant to read XML directly (computers should be doing that), it should always remain *possible* to do so, and I think this would muddy the human-eye view somewhat. It is accepted (by the people setting the standards) that this results in a larger raw stream, but the correct way of dealing with that is to layer XML over storage-level and transport-level compression schemes to recover some of the entropy wastage. See REC-xml [w3.org], section 1.1, and points 3 and 5 of XML in 10 points [w3.org].

    Heavy processing won't be done directly on markup - it'll be done on the in-memory representation after the markup is loaded, which can be assumed to be more compact than the markup if required (or less compact if there is a neat time/space tradeoff in the processing.)
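
    As a rough illustration of that layering, the Python sketch below keeps the markup readable, compresses it with gzip for storage or transport, and only works on the parsed in-memory value. The element and attribute names are hypothetical (not from the GEML DTD), and the sequence is random filler rather than real data.

```python
import gzip
import random
import xml.etree.ElementTree as ET

# Random filler sequence; a fixed seed keeps the example repeatable.
rng = random.Random(0)
sequence = "".join(rng.choice("ACGT") for _ in range(10_000))

# Hypothetical element and attribute names, not taken from the GEML DTD.
root = ET.Element("sequence_record", {"id": "example-1"})
ET.SubElement(root, "bases").text = sequence
markup = ET.tostring(root, encoding="utf-8")

# Storage/transport layer: compress the readable markup as-is.
compressed = gzip.compress(markup)

# Receiving side: decompress, parse, then process the in-memory value.
restored = ET.fromstring(gzip.decompress(compressed))
loaded = restored.find("bases").text

print(len(markup), "bytes of markup,", len(compressed), "bytes compressed")
```

    The markup stays human-readable at rest, and the compression step is invisible to anything that consumes the parsed document.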

  • that is pathetic



  • Yeah, the human genome is several gigs; but the vast majority of it isn't "coding." If you were going to present the whole human genome (as opposed to, more realistically, a short sequence of particular interest to your research) you'd be able to convey a LOT more information by presenting the 1% of it that codes for amino acids, along with markups to provide links to crystal structures of the proteins, little sub-charts showing the frequency of medically relevant site-specific substitutions (recall, there isn't _a_ human genome, there are many different ones) and so on and so forth. Yeah, it might blow up past the size of the raw genome, but it would contain actually useful information.
    That said, I can't think of any features you could want in such a language that you can't get with plain old HTML. Shrug.

    UCSC Molecular Biology
  • There are already great Perl tools for bioinformatics; check out www.bioperl.org
  • I have my Gene Expression Mark-up Language, my HTML, and XML. I can express any form of text in the world. I can now die happy.
  • GeNeTeX

    On Monday mornings I am dedicated to the proposition that all men are created jerks. -- H. Allen Smith, "Let the Crabgrass Grow"
  • Yeah and CaML. Wonder what sort of genes that has...

  • The Span shell.

    But then again, WTF is "muy bein"?

  • I think we should encode HTML documents into the
    genes of living people so that they can go
    travelling and when they get to the destination
    they can be re-assembled with all the other
    people-fragments and viewed with some sort of
    CAT-scan-browser. HTTP-over-Humans.
    And if you want better bandwidth, you can sign up
    for Broad-band: Only Females will be used to
    transport the data.

  • What kind of nonsense is this? Everyone knows that we will use the safe, reliable Microsoft standard GEML to encode our genes that safely and reliably allow us to live. We wouldn't have it any other way!

  • Actually it makes me wonder: Instead of literate programming, are we getting programmable literature? I mean come on, why are these Markup Languages being used so damn widely in so many damn silly ways? They're for Text Formatting, not for making complex simulations! Next we'll be seeing editorial-based programming instead of functional or object-based. People!

  • So long as my child doesn't turn into a Javascript popup window.

    And then there's the parallel between reproduction of the species and that damn close-browser-window-makes-more-windows-popup trick that some sites pull on you. And I don't mean the fact that it's usually a porn site that does it.

  • This is why scientists write documents in LaTeX, not ASCII.

  • Yet another slashdot spelling mistake... If you're going to try to be witty and use other languages to try to increase people's perception of your intelligence or chic-ness, at least do it right. And this is a first post- MY first post, not the story's first post...


    --
  • MSXML3.DLL supports both old-style and new-style schemas; ditto with XSL (the "/1999" and "/TR" versions). Sucks that tools will still support DTD's; I hate writing them.

