A Genome Mark-up Language
There's an interesting story running about the need for, and development of, a genetic mark-up language. It's called GEML (Gene Expression Mark-up Language) and is basically a DTD. Obviously, when working with things like genes, GEML is useful - and a good example of why DTD is muy bein.
Re:CellML (Score:1)
Pathetic research by the author. (Score:2)
The "reporter" tag defines a sequence of codons (the four amino acids that comprise DNA) -- TACAGTGTCAGAATTAACTGTAGTC --
Elementary Grade 9 biology here, Mark. A codon is a sequence of three nucleotides (ex: GCC), and each codon is in turn translated into one of the 20 amino acids that constitute the building blocks of all our proteins. Don't just regurgitate what was in the press release!
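To make the correction concrete, here is a minimal Python sketch that reads the article's "reporter" string three nucleotides at a time. The function name `split_into_codons` is made up for illustration, and reading from the first base is an assumption; real sequences also need a reading frame and strand, which this ignores.

```python
def split_into_codons(sequence: str) -> list[str]:
    """Split a nucleotide string into consecutive three-letter codons.

    Trailing bases that do not fill a whole codon are dropped.
    """
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


# The sequence quoted from the article (25 bases, so one base is left over).
reporter = "TACAGTGTCAGAATTAACTGTAGTC"
print(split_into_codons(reporter))

# Four possible bases in each of three positions gives 4**3 possible codons.
print(4 ** 3)
```

The 25-base string yields eight whole codons with one base to spare, which is itself a hint that the quoted sequence is not simply "a sequence of codons".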
Anyway, GEML is useless for real exchange and analysis of genetic information. For that purpose, I agree with a previous poster about packing 2 nucleotides per byte. It's an optimization that must be accepted as a standard before we can start doing on-demand heavy processing of genetic results.
Re:It's a closed standard. (Score:2)
True. I do think that bioxml's goal is the same as GEML, but they're just not as far along as GEML (yet). It's just bothersome to me that a company-owned and controlled format like GEML could become very prevalent. I would still much rather see something like bioxml succeed instead. I hope they don't give up because of this...
Re:DTDs shouldn't be forked - thats the point (Score:3)
Apples and Oranges.
HTML is controlled by the w3c--a standards body more or less independent of any particular company. Sure, M$ and Netscape had a lot of pull on HTML, but they *should* have, given that they *were* the browser market for a long time.
In this case, we have a particular bioinformatics company graciously offering up their own "public domain" DTD as a standard for the rest of the industry (how generous). And a major scientific journal latching on to it. The only problem is, that same bioinformatics company must approve any and all changes to the "standard"! It would be the same if HTML were a copyrighted property of Netscape, Inc.
It would be nice if the bioinformatics community could organize and form its own XML standards body, a la the w3c. An agreed-upon standard is almost always better than a legislated standard.
Not the first... (Score:5)
Why Nature chose GEML as a standard is unclear--the article doesn't present a compelling argument for it over the alternatives, and the choice seems a little arbitrary. It'll be interesting to see what impact this has on the other projects, and how open the standard will be to extension and modification.
It's a closed standard. (Score:5)
The GEML Format is a free, public-domain, open standard created and licensed by Rosetta Inpharmatics, Inc. ("Rosetta") in order to define a single, distinct format for handling gene expression data and avoid proliferation of incompatible variations.
You may not modify, lease, loan, sell, charge for, or create derivative works of the GEML Format or documentation without written permission from Rosetta.
So nobody can fork the standard without first consulting with Rosetta Inpharmatics. Wonderful. I just love their definition of "open standard."
This looks like another corporate-buddy move by a major scientific journal, much like the Science/Celera deal a few weeks back...
Go see bioxml [bioxml.org] for a truly open alternative.
Miguel's language (Score:1)
You mean object-oriented C?
__
Flamebait (Score:1)
Really? Does it appear in the US Constitution?
And aren't Linux and Perl Slashdot-official? Should we limit ourselves to discussing these?
__
Mejor (Score:2)
__
muy bein? (Score:1)
--
From the it's-bad-grammar-time! dept. (Score:1)
Hemos, when comparing things, use than, not then. For instance, this article should've been from the "it's-better-than-the-web!" dept. The word then is used to describe a time sequence or other ordering, as in "first this, then that." The word than is used to compare things, as in "this is better than that." Got it?
*sigh*
--Joe--
[OT] Grammar (Score:1)
That reminds me of this grammar puzzler. Add punctuation to the following to make it grammatically correct:
--Joe--
Re:From the it's-bad-grammar-time! dept. (Score:1)
Uh, no. Bad grammar bothers me.
--Joe--
lots of "exceptions" to the coding rules (Score:2)
a fair amount of regularity plus a lot of special cases. In fact the latter throws off decoding robots and you see statistics like 98% decoded, etc. The scientific papers are full of nifty exceptions to what was believed before. The markup language would have to be flexible enough to encode all the exceptions, perhaps as a procedural attachment.
GEML? Bah! Quadrary Encoding! (Score:3)
GEML? Hard to read? Bah! What we should *REALLY* do is figure out a quadrary (you know, after binary and trinary) encoding scheme for all the other info and just pre-pend it to the beginning of the amino acid sequence. Maybe even insert it in some points, with some sort of delimiting sequence, of course. None of this wimpy markup language stuff.
--
Re:standards are important esp. for biologists (Score:2)
At least life scientists do not.
Instead, they use (the much dreaded) Word and wonder why all their betas, gammas, indices etc. tend to always disappear at the wrong moment...
I once wrote a web application where people could submit an abstract for a congress on developmental neurobiology. I allowed for subsets of HTML or simplified LaTeX for text formatting. It was hell - even the brightest people in their field failed to understand the concepts. I believe I spent more time searching texts for missing tags or closing braces than for anything else...
Re:exchange of genetic information [sorry =:-)] (Score:1)
Re:It's a closed standard. (Score:2)
It seems somebody doesn't understand the legal meaning of "public domain": that anybody can modify what is in the public domain, without restriction. That is why free software and Open Source Software AREN'T "public domain"!
Re:Sí! (Score:1)
Re:Muy Bein... wow (Score:1)
What it looks like. (Score:1)
ttaacattgagctaacgataggatacgattacattgagctaacgata
tacgattacattgagctaacgataggatacgattacattgagctaac
</genes>
Article ignored what is already used! (Score:5)
Sorry, I'm too lazy to annotate this myself :-):
Link to NCBI [nih.gov]
FASTA looks remarkably like the example given in the article.
Quick description of FASTA (just one of many schemes, but one of the oldest and most popular). [cornell.edu]
Perhaps rather than writing a trendy article trying to get buzzwords like genomics and bioinformatics together with geek speak, he should have done a tad more research.
That's not to say there can't be huge improvements, such as trying to show the interplay (temporally AND physically) between genes. But don't do a half-assed job by ignoring what has already been used for decades.
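For readers who have not seen FASTA, the format is simple enough to read in a few lines: a header line beginning with ">" followed by one or more lines of sequence. The Python sketch below assumes only that much; the function name `parse_fasta` and the sample records are invented for illustration, and real files carry conventions (ID syntax in headers, ambiguity codes) that this ignores.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a mapping of header -> sequence.

    A record starts with a '>' header line; the sequence lines that follow
    are concatenated until the next header.
    """
    records: dict[str, list[str]] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical two-record sample, not taken from any real database.
sample = """>seq1 demo record
ACGTAC
GTTGA
>seq2
GGCC
"""
print(parse_fasta(sample))
```

The whole header line is kept as the dictionary key here; splitting it into an identifier and a free-text description would be the obvious next step.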
Abbreviations... (Score:1)
Re:Abbreviations... (Score:1)
RDF hasn't woken up yet. (Score:1)
I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help.
Admittedly RDF hasn't been used much YET. After all - it's only a year since bog-standard XML took off. I'm a contractor; Dec '99 I couldn't sell XML skills to anyone, Jan 2000 my phone melted. By Easter 2000 everyone else was an XML "guru".
Wrox don't shift their first RDF book until October. You can't store production-grade quantities of RDF in a database yet. How can you say it's "past", when we haven't even finished building the infrastructure tools yet ?
OTOH, the one widely distributed RDF app that is out there (RSS) is even part of Slash. Take a look at those Slashboxes - they aren't running DocBook.
Added to which you can employ namespaces to form compound documents from many schemas,
That's just a quicker recipe for tag soup. The ability to have five different ways to express an author's address doesn't make it any easier to move data between applications or avoid "Dear Mr. Occupier" errors.
"It's the Semantics, Stupid"
Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics.
What document semantics ? DocBook doesn't do semantics, and it has a structure that thinks everything is a computer manual. A schema that has a <GUIMenuItem> element, but doesn't have a means of expressing a target readership age ? Rights management that's a bare copyright element with an implied recommendation to attach generated text of "All Rights Reserved" when you render it ? (What if the rights _aren't_ all being reserved ?)
DocBook is a pile of bodges and hacks, and I only use it because I don't know anything else that's out there, and I'm reluctant to roll my own and add another one to the pile.
DocBook is Perl for text documents; lots of "There's More Than One Way To Do It", and not a lot of "Done. Sorted.".
My current project (the next version of ARKive [arkive.org.uk]) is a huge graph of linked nodes, most of which are either text or rich-media. The directed nature of the graph blows plain XML out of the water - there's just no way to handle the referencing problem in XML; you're either fooling around with the inadequate ID & IDREF, or you do it through either XLink, or your own href attributes and lose support for any notion of document structure based on these links, unless you code it yourself at the application level. With RDF, I just talk to an API like Jena and when I make things related, they stay related (and the underlying engine will hand them back to me on demand, as whatever relevant fragment of the document I might need).
I am using DocBook to represent the text content nodes. It's not much more advanced than HTML though - I need a huge amount of markup on each node to select the appropriate set (what it refers to, what it says about it, whether it's written for 7 or 17 year olds) and I hold this trivially in RDF, with DocBook under a content property.
There's simply no way I could express this in DocBook alone. I could express it in DocBook with embedded LOM markup, and I could do that very easily just by namespacing two schemas as you suggest. The trouble with that approach though is that the only code that could ever make sense of it would be my own. With RDF, any RDF app (like the Redland app framework) can wander through it and make a pretty good use of it, even if it hasn't seen the documents before.
XML has no mechanism for a semantic schema. Attempting to use the structural schema it does have, as one, doesn't work well and it certainly doesn't travel.
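The graph-of-statements idea can be sketched without any RDF toolkit. The Python below is not the Jena or Redland API; it is a toy in-memory store of (subject, predicate, object) triples, and every name in it (`TripleStore`, the `ex:` properties, the node identifiers) is hypothetical. It only illustrates how metadata such as target age can sit beside a content node and be queried generically.

```python
Triple = tuple[str, str, str]


class TripleStore:
    """Toy store of (subject, predicate, object) statements."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None) -> list[Triple]:
        """Return all triples matching the given pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Two text nodes about the same subject, written for different ages.
store.add("node:text-age7", "ex:about", "species:otter")
store.add("node:text-age7", "ex:audienceAge", "7")
store.add("node:text-age17", "ex:about", "species:otter")
store.add("node:text-age17", "ex:audienceAge", "17")

# Any code that understands triples can ask "what is about otters?"
# without knowing how the content nodes themselves are structured.
about_otters = {s for s, _, _ in store.match(predicate="ex:about", obj="species:otter")}
print(about_otters)
```

The point of the sketch is that the links live in the data model itself rather than in application code layered over ID/IDREF or href attributes.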
Re:RDF hasn't woken up yet. (Score:1)
I see your point, but semantics are never enforceable anyway.
Who cares ? If you're publishing the latest fat stock prices, then it's in the user's interests to get it right. Semantic publishing needs a reliable means of making them available to those who want them, it doesn't need to follow them up and enforce getting it right.
you haven't told me how RDF gets around this
Take a look at RDF Schema [w3.org].
Of course, semantics aren't enough on their own. It's not too useful to know where the "creator" value is in two schemas, if you can't distinguish between one's "author" and the other's "translator". This is where an ontological understanding is needed, and there's a couple of projects out there working on that too; DAML [daml.org] & OIL [ontoknowledge.org].
Re:No tool support, yet (Score:1)
DTDs are going to be required for defining new XML grammars
Rubbish. I haven't written a DTD in over 18 months. Tool support is better than DTD, mainly because Schemas also use XML as their expression syntax and so it's trivial to build tools (often with XSLT) for them.
Schemas are still brand new, and tool support is weak to nonexistent.
Schema has been a Candidate Recommendation since October. Maybe it's not signed off yet, but it's pretty stable and usable out in the "real world".
I thank M$oft for this one. Dropping early versions of XSL and Schema onto developers a long time ago put a rocket under the W3C. This might have ended badly, except M$oft then did something unusual for them and fell back into line with a developing standard. Credit where credit's due...
XML considered harmful (Score:2)
This is another example of What's Wrong With XML (and particularly, what's wrong with proliferating schemas all over the place).
A schema isn't a means of publishing your data to a wider audience, it's a means of locking-out everyone who doesn't have a copy of it.
Look at a real user of RDF [purl.org] for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.
A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.
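The n^2 worry is easy to put numbers on. The sketch below counts one-way converters needed to translate directly between every ordered pair of n schemas, and compares that with routing everything through a single shared model. The shared-model alternative is an illustrative assumption here, not something the comment spells out, and both function names are invented.

```python
def pairwise_transforms(n: int) -> int:
    """One-way converters needed to translate directly between every
    ordered pair of n distinct schemas: n * (n - 1)."""
    return n * (n - 1)


def hub_transforms(n: int) -> int:
    """One-way converters needed if every schema only maps to and from
    a single shared intermediate model: 2 * n."""
    return 2 * n


for n in (3, 10, 50):
    print(n, pairwise_transforms(n), hub_transforms(n))
```

With 10 schemas that is 90 direct transforms against 20 through a hub, and the gap widens quadratically from there.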
yeeesh (Score:1)
"The 'reporter' tag defines a sequence of codons (the four amino acids that comprise DNA)"
sheeesh! can't they even get the basics right? a codon is a unit of three nucleotides that encodes a single amino acid (three of the 64 codons do not code for an amino acid; rather, they code for the translation stop signals).
four nucleotides comprise DNA. there are 20 amino acids.
this type of error is shameful.
james
Re:why not? (Score:1)
because it's XML.
james
Re:Human Markup Language (Score:2)
Oh dear, this is beginning to sound like a Voyager plot.
Re:GEML? Bah! Quadrary Encoding! (Score:1)
What would be REALLY cool would be code for a protein which decompressed the rest of the stream....
Re:DTDs shouldn't be forked - thats the point (Score:1)
The point of XML is to standardize the manner of extension. Even SGML allowed for internal subsets of markup declaration to extend the core DTD. The goal of such a standard is not to eliminate incompatibility but to minimize the pain of dealing with it.
Forking a DTD is like forking pudding, it doesn't do anything.
I forked up. (Score:1)
Very silly question, couldn't help, sorry (Score:1)
[Assume mandatory smiley here]
standards are important esp. for biologists (Score:5)
Contrast this with a relatively more recent model genetic organism, the roundworm Caenorhabditis elegans. Standards were set early whereby all gene names were standardized on the basis of their phenotype (eat-4 is a worm with a mutant feeding behavior, unc-6 describes a worm with uncoordinated movement, lin-41 describes a mutant with an abnormal cell lineage, etc etc), and are ASCII-friendly. As a result, C. elegans people enjoyed standardized and searchable computerized gene databases for much longer than geneticists in other fields.
I hope that a standard becomes set and rapidly adopted; lab chiefs (to us grad student peons anyway) can often seem like PHB's in IT when it comes to adopting new methods and paradigms.
Re:Hmmm.... (Score:1)
It's probably not useful to express hair color as full RGB values, though.
You were being serious, right? Oh.
Re:No tool support, yet (Score:1)
Re:Abbreviations... (Score:1)
Always something for Perl to do next (Score:1)
I bet I can make a script that then creates a life form. aww yeah.
CellML (Score:2)
GEML ISN'T alone. It has a competitor, another DTD known as CellML, used to define the complex interactions that take place within cells. CellML takes an integrated approach to describing all of the processes within a living cell -- its genes, proteins, enzymes, and chemical reactions, the pathways and connections between each part of the whole. CellML seems well suited to the kinds of work that supercomputers do -- creating simulations of incredibly complex systems -- while GEML only defines the genetics that create the cell.
Doesn't this seem a more apt way of describing a living organism? Sure, it's undoubtedly more complex and expensive (financially and computationally), but if you were to set an E10000 or Cray (or maybe a high-end Sun farm) to work on CellML, wouldn't it do more in less time than having to work everything out manually with GEML?
--
Re:Pathetic research by the author. (Score:1)
XSD? (Score:1)
Human Markup Language (Score:1)
<XML>
<HUMAN GENDER="m/f">
<HEAD>
<BRAIN></BRAIN>
</HEAD>
<BODY></BODY>
<LEGS></LEGS>
</HUMAN>
</XML>
etc etc etc, maybe at some point null transportation technology will describe a human completely with his genetics, memory and personality with XML, and transport the person as energy over wireless media to put it all together at the other end.
Hopefully fast XSLT engines will exist by then and hopefully the whole thing will not be based on MS implementation of XML document.
Oh, but why? (Score:1)
Answer: Just in case we ever need to view our genome sequence on IE
And if the human genome has about 3 gig, wouldn't wrapping quaint bits of information blow it up by quite a bit? Sorry, but the idea seems to rank on the same idiocy level as XML.
Re:CellML (Score:1)
Re:Not the first... (Score:1)
My mistake (Score:1)
Mea culpa.
And a closed standard ain't a bad thing... (Score:2)
Re:muy bein? (Score:1)
________
DTDs shouldn't be forked - thats the point (Score:2)
No tool support, yet (Score:2)
DTDs will probably stick around in one form or another for the next few years. It's unfortunate that Schemas couldn't have been part of XML 1.0, because the coexistence of DTDs and Schemas will cause code bloat as tools will basically need to support both.
Wake up, RDF is dead (Score:2)
Are you telling me that someone who doesn't have my data doesn't have it? Your astounding conclusion seems to be some sort of convoluted identity function.
Look at a real user of RDF for how to do this in a better way. XML is great, but the coupling between structure and semantics that comes from using an XML schema to represent both is a nightmare for interworking between teams that overlap, but aren't identical enough to use exactly the same schema.
No one is doubting that poorly implemented schemas will degrade productivity, but I don't see how a dead, unused (sorry, never was used, ever) standard like RDF is going to help. Added to which you can employ namespaces to form compound documents from many schemas, so your limitation doesn't exist in any case.
A couple of years ago, we watched a bunch of old guys slaving over COBOL legacy conversion programs, desperately trying to suck the data out and into SQL, before Cinderella's glass computer turned back into the Y2K pumpkin. I don't want my future to turn into the same thing, scratching together n^2 XSL transforms to convert fooML into foo'ML.
You're vastly overestimating the dynamic nature of these schemas - this isn't the HTML DTD we're talking about. Look at DocBook, as an example - people have been able to use it for years without concern that the next revision would destroy their document semantics. Once again proof that a properly designed format weakens your counterarguments, and in any case, RDF isn't going to ever, EVER take off, so it's probably time to quit flogging it.
Re:RDF hasn't woken up yet. (Score:2)
Added to which, you haven't told me how RDF gets around this, or are you saying that the issue should be avoided altogether?
Error Checking the Human Genome? (Score:1)
Last time I checked at ncbi [nih.gov] the genome was 30.4% finished, and the rough draft assembly is in 148307 pieces according to the golden path [ucsc.edu].
And of course the finished target for the human genome is three years from now!
Hello World (Score:1)
HTML-like tags (Score:5)
Mr. Johnson: No!
Insurance provider: Yup. It's right between the <bald ugly-looking guy> tag and the <most likely to drink beer after finding out his wife gets fatter with age> tag.
Mr. Johnson: Oh God.
Insurance provider: I'm sorry.
Mr. Johnson: Is this hereditary? What can be done about my kids?
Insurance provider: Well, we can comment out the little buggers if we try. Some GScript may work to prevent them from passing the traits onto their children. Hell, we may even be able to use some Gava to touch up their faces so they won't be as ugly as you.
Mr. Johnson: And as for me?
Insurance provider: Your body is 2.0, Mr. Johnson. As far as we're concerned, no one supports you anymore.
inconsequential reporting? (Score:1)
Since the genetic code is already digital, transforming it into something that computers can process seems rather pointless. What is wrong with AGTCTTCGADC? Making it verbose for humans is also not very useful, because what GEML seems to offer is very raw data, essentially a wrapper around raw sequences.
Maybe the issue is really hype. I.e. a clever gimmick to drive companies to share information by offering them bandwagons they can't refuse to climb?
Re:Hmmm.... (Score:1)
<body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties=endowed>
</GEML>
You have a syntax error in 'crotchproperties', 'crotchproperties' set to "0"
Coming soon, MS GenomePage 2006, so you can really start screwing things up.
We are very closed to this. (Score:2)
From CNN [cnn.com]: Genetically modified monkey - named ANDi carries in him an extra bit of DNA from a jellyfish. ANDi is the first primate to be similarly modified.
See CNN story [cnn.com] for full details.
Re:standards are important esp. for biologists (Score:1)
The scope of GEML seems quite limited. It's about gene expression data, which is currently very sexy. It's also been licensed in a fairly restrictive manner. Not the way to go if you ask me.
Phil
Re:It's a closed standard. (Score:1)
Phil
Re:Oh, but why? (Score:1)
If you can not see the value of structuring data into a format which is easily parsable, and whose semantics are formally defined in a standard format, then I fear that your own idiocy level is fairly high.
XML is potentially about a lot more than viewing web pages.
GEML incidentally is pretty much useless for viewing genome sequences. Whilst this is no doubt mainly the fault of the bloke who wrote the article for getting it totally wrong, GEML is not designed to represent genomic information, but gene expression data. Two very different things.
Phil
Re:It's a closed standard. (Score:2)
While I would agree that bioxml serves as a much better licensing model for the community than GEML, it's worth mentioning that at the current time they do not compete. GEML appears to be about gene expression, and bioxml has no DTDs addressing this.
As for Nature, well, I expect that their publishers are worried. Sooner or later paper journals are going to disappear. Perhaps they are diversifying, and have a stake in the company. This is not necessarily a problem. Even Nature does not have the power to make a standard.
Phil
Re:Article ignored what is already used! (Score:2)
I agree that the article linked to is half-assed, and badly researched. But the sad fact is that most of the database formats in existence also seem to be fairly half-assed. I think that XML might help us to get around some of these problems.
Phil
Re:My mistake (Score:1)
Actually, a codon is just what they had "TACAGTGTCAGAATTAACTGTAGTC". A tri-base codon is special and is called a "triplet codon". That is TAC, etc...
Re:num1 (Score:1)
Re:Pathetic research by the author. (Score:1)
There being four possible nucleotides (unless you're looking at something real exotic) surely you can get 4 per byte? Sticking to a base64 ascii encoding you can still get 3 nucleotides, so a single codon, per character, which is possibly a more elegant optimization.
Anyway, this shouldn't be necessary and goes against the XML philosophy. Although humans on the whole aren't meant to read XML directly (computers should be doing that), it should always remain *possible* to do so, and I think this would muddy the human-eye view somewhat. It is accepted (by the people setting the standards) that this results in a larger raw stream, but that the correct way of dealing with that is to layer XML over storage-level and transport-level compression schemes to recover some of the entropy wastage. See REC-xml [w3.org], section 1.1, and points 3 and 5 of XML in 10 points [w3.org].
Heavy processing won't be done directly on markup - it'll be done on the in-memory representation after the markup is loaded, which can be assumed to be more compact than the markup if required (or less compact if there is a neat time/space tradeoff in the processing.)
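As a rough illustration of the "four per byte" arithmetic and of a compact in-memory form, here is a Python sketch that stores each nucleotide in two bits. The bit assignment (A=00, C=01, G=10, T=11) and the names `pack` and `unpack` are arbitrary choices made for this example, not part of GEML or any standard; the sketch also assumes uppercase A/C/G/T only, so ambiguity codes would raise a `KeyError`.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit assignment
BASES = "ACGT"


def pack(sequence: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking reads bases in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` nucleotides from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


reporter = "TACAGTGTCAGAATTAACTGTAGTC"  # the 25-base string from the article
packed = pack(reporter)
print(len(reporter), len(packed))
print(unpack(packed, len(reporter)) == reporter)
```

The original length has to be kept alongside the packed bytes, because padding bits in the last byte are indistinguishable from a run of A's.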
Re:num1 (Score:1)
Hmmm.... (Score:2)
<body eyes="#00FF00" hair="#4F1F5F" height="74in" weight="175lb" crotchproperties=endowed>
</GEML>
Re:Oh, but why? (Score:1)
That said I can't think of any features you could want in such a language you can't do with just old html. Shrug.
UCSC Molecular Biology
Re:Always something for Perl to do next (Score:1)
My life is complete (Score:1)
One word (Score:1)
On Monday mornings I am dedicated to the proposition that all men are created jerks. -- H. Allen Smith, "Let the Crabgrass Grow"
Re:Goes to show... (Score:1)
Re:muy bein? (Score:1)
But then again, WTF is "muy bein"?
Re:GEML? Bah! Quadrary Encoding! (Score:1)
genes of living people so that they can go travelling and when they get to the destination they can be re-assembled with all the other people-fragments and viewed with some sort of CAT-scan-browser. HTTP-over-Humans.
And if you want better bandwidth, you can sign up for Broad-band: Only Females will be used to transport the data.
Re:EMCAScript and backwards compatibility - with D (Score:1)
Re:CellML (Score:1)
Re:HTML-like tags (Score:2)
And then there's the parallel between reproduction of the species and that damn close-browser-window-makes-more-windows-popup trick that some sites pull on you. And I don't mean the fact that it's usually a porn site that does it.
Re:standards are important esp. for biologists (Score:2)
Goes to show... (Score:1)
Muy Bein... wow (Score:2)
--
Re:No tool support, yet (Score:1)