
Freeing and Forgetting Data With Science Commons

Posted by Soulskill
from the bringing-it-all-together dept.
blackbearnh writes "Scientific data can be both hard to get and expensive, even if your tax dollars paid for it. And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher such papers. That's the argument that John Wilbanks makes in a recent interview on O'Reilly Radar, describing the problems that led to the creation of the Science Commons project, which he heads. According to Wilbanks, scientific data should be easy to access, in common formats that make it easy to exchange, and free for use in research. He also wants to see standard licensing models for scientific patents, rather than the individually negotiated ones that now make research based on an existing patent so financially risky." Read on for the rest of blackbearnh's thoughts.
"Wilbanks also points out that as the volume of data grows from new projects like the LHC and the new high-resolution cameras that may generate petabytes a day, we'll need to get better at determining what data to keep and what to throw away. 'We have to figure out how to deal with preservation and federation, because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is, well, if it's meaningful, it'll be in the Google cache or the Internet Archive. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?'"
  • by repepo (1098227) on Friday February 20, 2009 @11:28PM (#26938191)
    It is a basic assumption in science that given some set of conditions (or causes) you get the same effect. For this to happen, it is important to properly record how to set up the conditions. This is the kind of thing that scientific papers describe (in principle, at least!).
  • by MoellerPlesset2 (1419023) on Friday February 20, 2009 @11:39PM (#26938239)

    At what cost? Would you suggest discarding the data sets of nuclear bomb detonations since they are easily reproduced?

    Nobody said results are easily reproduced. But a-bomb tests are hardly representative of the vast majority of scientific results out there.

    How about other data sets that may need to be reinterpreted because of errors in the original processing?

    That's a scenario that only applies when the test is difficult to reproduce and the results are limited by processing power rather than measurement accuracy. That's a relatively unusual scenario, since, first, most experiments are easier to reproduce than that, and second, methods and measurements improve over time. The much more common scenario is that it's more efficient to simply re-do the experiment with modern equipment and get both more accurate measurements and better processing.

  • by Vectronic (1221470) on Friday February 20, 2009 @11:46PM (#26938265)

    Although likely, not necessarily...

    I'd be happy with a Wiki-Style, where the actual article can be as complex (in the know) as desired, but with a glossary of sorts.

    There are geniuses of all sorts, someone might be completely lost trying to understand it linguistically, but may find a fault in it instantly visually, or audibly.

    However, that is somewhat redundant, as the original (as it is now) can be converted into that by people; but a mandate saying it must contain X, Y and Z will open it up to more people, quicker.

  • by TapeCutter (624760) on Saturday February 21, 2009 @12:20AM (#26938429) Journal
    "You don't usually want to reproduce the exact same experiment with the exact same conditions."

    That's right, I want an independent "someone else" to do that in order to make my original result more robust. If I were an academic I would rely on post-grads to take up that challenge; if they find a discrepancy, all the better, since you now have another question! To continue your software development analogy: you don't want the developer to be the ONLY tester.
  • by Mr Z (6791) on Saturday February 21, 2009 @12:24AM (#26938445) Homepage Journal

    With a large and expensive dataset that can be mined many ways, yes, it makes sense to keep the raw data. This is actually pretty similar to the raw datasets that various online providers have published over the years for researchers to datamine. (AOL and Netflix come to mind.) Those data sets are large and hard to reproduce, and lend themselves to multiple experiments.

    But, there are other experiments where the experiment is larger than the data, and so keeping the raw data isn't quite so important as documenting the technique and conclusions. The Michelson-Morley interferometer experiments (to detect the 'ether'), the Millikan oil-drop experiment (which demonstrated quantized charges)... for both of these the experiment and technique were larger than the data, so the data collected doesn't matter so much.

    Thus, there's no simple "one size fits all" answer.

    When it comes to these ginormous data sets that were collected in the absence of any particular experiment or as the side effect of some experiment, their continued existence and maintenance is predicated on future parties constructing and executing experiments against the data. This is where your LHC comment fits.

  • Is storage an issue? (Score:2, Interesting)

    by blue l0g1c (1007517) on Saturday February 21, 2009 @12:34AM (#26938489)
    Data storage is something we've gotten very good at and we've made it very cheap. A Petabyte a day is not as staggering as it was even five years ago.
  • by onionlee (836083) on Saturday February 21, 2009 @12:56AM (#26938557)
    agreed. most sciences that have been around for a long time and have developed their own specializations, such as physics, have specific journals that target their "demographics" (such as the journal of applied physics a, b, c, d, letters). anything outside of those journals has most likely been rejected by them and is irrelevant. furthermore, the relatively young sciences such as linguistics use (what i personally think is lame) a system of keywords so that anyone can easily find articles that they're interested in. truly, i have yet to find any researcher who has complained about this "problem".
  • by DerekLyons (302214) <`fairwater' `at' `'> on Saturday February 21, 2009 @01:17AM (#26938633) Homepage

    Not as staggering as it was five years ago only means it is less staggering than it was then - not that it still isn't staggering. Especially when you consider that a petabyte a day adds up to 365 petabytes, over a third of an exabyte, a year.
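    The conversion in the comment can be sanity-checked with simple arithmetic. A minimal sketch, assuming decimal (SI) units where 1 exabyte = 1000 petabytes:

    ```python
    # Back-of-the-envelope: accumulate one petabyte per day for a year.
    PB_PER_DAY = 1
    DAYS_PER_YEAR = 365
    PB_PER_EB = 1000  # decimal SI prefix: 1 EB = 1000 PB

    pb_per_year = PB_PER_DAY * DAYS_PER_YEAR   # total petabytes in a year
    eb_per_year = pb_per_year / PB_PER_EB      # same quantity in exabytes

    print(f"{pb_per_year} PB/year ~= {eb_per_year:.3f} EB/year")  # 365 PB/year ~= 0.365 EB/year
    ```

    Using binary units (1 EiB = 1024 PiB) shifts the figure only slightly; either way the total is a few hundred petabytes per year per such instrument.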

  • by Fallingcow (213461) on Saturday February 21, 2009 @02:24AM (#26938877) Homepage

    I'd be happy with a Wiki-Style, where the actual article can be as complex (in the know) as desired, but with a glossary of sorts.

    Don't count on that being at all helpful.

    Take the math articles on Wikipedia: I can read one about a topic I already understand and have no idea what the hell they're talking about in entire sections. It's 100% useless for learning new material in that field, even if it's not far beyond your current level of understanding. Good luck if you start on an article far down a branch of mathematics--assuming they bother to tell you the source of the notation in that article, it'll take you a half-dozen more articles to find anything that sort-of translates some of it for you.

    Some sort of mouseover tool-tip hint thing or a simple glossary is all I ask, but I think the people writing that stuff don't even realize how opaque it is to people who majored in something other than math.

  • by Grym (725290) * on Saturday February 21, 2009 @02:52AM (#26938979)

    In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?

    I'm a poor medical student, but a medical student with--quite frequently--interdisciplinary ideas. I can't tell you the number of times I have been interested in pursuing a subject for independent research and have been stymied or effectively stopped in my tracks because of my inability to pay or lack of online access to experimental data and results. You might think that modern science is highly specialized and, for most bleeding-edge topics, you're probably right. In these cases, the affected researchers can all afford the one or two subscriptions they need to stay up to date. However, in overlapping areas, non-specialists (or specialists in other fields) might have a unique perspective and possibly insightful findings to add. What harm could be done by letting them take a look?

    Take for example one of my hare-brained ideas. There is a disease called Pellagra [], which is caused by diets deficient in certain amino acids. These amino acids are lacking in corn. In the United States, corn is, by far, the largest cash crop. Now, diets in the U.S. are varied enough that modern Americans do not get Pellagra, but this isn't the case in developing nations, where Pellagra can sadly be endemic. So, my idea is this: why not introduce conservative substitutions [] into the genetic sequence of the gene encoding the major structural protein of corn ( zein [] ) in such a way as to make corn a (more) complete amino acid food source? By doing this, you'd be turning one of the world's most abundant and cheap foodstuffs into an effective cure for a common, debilitating disease.

    Now, to me, as an outsider to agriculture, this seems like a rather basic idea. I was convinced that someone had to have tried something similar. But you'd be surprised. I have yet to find a single paper that has ever attempted such a thing. Almost all of them focus on crop yields or the use of zein in commercial products. Now, maybe (for reasons unbeknown to me) my idea is untenable, such that people in the field have never given it a thought. But what if that isn't the case? What if the leaders of the field (or at least the emergent behavior of scientists and scientific institutions) are pushing so hard in one direction that an obvious area for research or advancement was overlooked? Let's hope it's not the latter...

    Regardless, it's a travesty how petty scientific institutions are in this regard, considering how often they talk to the public about high-minded ideals when extolling the virtues of public funding of science. This information should be available to all: specialists and non-specialists alike.


    P.S. Oh yeah, and in case any of you were wondering: somebody already patented the general idea [] described in my post. So don't get any wild ideas about trying to use it to help the poor, now! (/facepalm)

  • by Patch86 (1465427) on Saturday February 21, 2009 @06:04PM (#26944089)

    5 Insightful?

    Seriously, read the OP again.

    "What's most important to keep is quite simple and obvious really: The results. The published papers, etc."

    He never suggested you throw out the results. No-one is going to throw out the results. Why would anybody throw out the results? Whichever body owns the equipment is bound to keep the results indefinitely, any papers they publish will include the results data (and be kept by the publishers), and copies will end up in all manner of libraries and file servers, duplicated all over the place.

    The most important things to keep from any experiment are 1) the results (no point in doing it if you don't keep the results) and 2) the methodology (if they don't know how you got the data, it's worthless). What you could throw away without too much harm is the analysis and interpretations, since you can always reanalyze and reinterpret (and any interpretations made now may prove wrong in the future anyhow). Even then, anything interesting is likely to be kept in the grand scheme of things anyway.

    What TFA is actually talking about is less dramatic, lower-budget science. It's still important (it's the bread and butter of science and technology), but it will be found in the vaults of far fewer publishers, libraries and web servers. And it's lower-budget science where it's far easier to reproduce results, as in GP.
