Freeing and Forgetting Data With Science Commons

blackbearnh writes "Scientific data can be both hard to get and expensive, even if your tax dollars paid for it. And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them. That's the argument that John Wilbanks makes in a recent interview on O'Reilly Radar, describing the problems that have led to the creation of the Science Commons project, which he heads. According to Wilbanks, scientific data should be easy to access, in common formats that make it easy to exchange, and free for use in research. He also wants to see standard licensing models for scientific patents, rather than the individually negotiated ones used now, which make research based on an existing patent so financially risky." Read on for the rest of blackbearnh's thoughts.
"Wilbanks also points of that as the volume of data grows from new projects like the LHC and the new high-resolution cameras that may generate petabytes a day, we'll need to get better at determining what data to keep and what to throw away. We have to figure out how to deal with preservation and federation because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is well, if it's meaningful, it'll be in the Google cache or the internet archives. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?'"
  • Again with the IP (Score:1, Insightful)

    by Anonymous Coward

    Newton said "If I have seen farther than most it is because I have stood on the shoulders of giants."
    Where does that begin to apply in a society of lawyers, profiteers, and billion dollar industries based on exploiting shortsighted IP management?

  • I don't know! (Score:2, Insightful)

    by blue l0g1c ( 1007517 )
    I was reading through the summary quickly and almost had a panic attack at the deluge of questions at the end. We get the point already!
  • by MoellerPlesset2 ( 1419023 ) on Friday February 20, 2009 @10:11PM (#26938085)

    What's most important to keep is quite simple and obvious really:
    The results. The published papers, etc.

    It's an important and distinctive feature of Science that results are reproducible.

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      How can the results be reproducible if you don't keep the original data?

      • by MoellerPlesset2 ( 1419023 ) on Friday February 20, 2009 @10:27PM (#26938185)

        How can the results be reproducible if you don't keep the original data?

        The relevant results are supposed to be included in the paper, as well as the information necessary to reproduce the work. Most data doesn't fall into that category.

        To make an analogy the computer geeks here can relate to: all you need to reproduce the output of a program is the source code and the parameters. You don't need the executable, the program's debug log, the compiler's object files, etc, etc.

        The point is you want to reproduce the general result. You don't usually want to reproduce the exact same experiment with the exact same conditions. Supposedly you already know what happens then.
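        To make that analogy concrete, here is a minimal, hypothetical sketch (not anything from TFA, just an illustration): a toy analysis whose output is determined entirely by its source and its parameters, so none of the intermediate artifacts have to be archived in order to reproduce the result.

        # reproduce.py -- a toy "analysis" whose result depends only on source + parameters.
        # Re-running with the same seed and sample size regenerates the same output, so the
        # executable, debug logs and intermediate files add nothing to reproducibility.
        import argparse
        import random
        import statistics

        def run(seed: int, n: int) -> float:
            rng = random.Random(seed)                      # fixed seed => same "raw data" every run
            data = [rng.gauss(0.0, 1.0) for _ in range(n)]
            return statistics.mean(data)                   # the published "result"

        if __name__ == "__main__":
            p = argparse.ArgumentParser()
            p.add_argument("--seed", type=int, default=42)
            p.add_argument("--n", type=int, default=1000)
            args = p.parse_args()
            print(run(args.seed, args.n))                  # same source + parameters => same output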

        • by mako1138 ( 837520 ) on Friday February 20, 2009 @10:47PM (#26938275)

          Let's say the LHC publishes its analysis, and then throws away the data. What happens when five years later it's discovered that a flawed assumption was used in the analysis? Are we going to build another LHC any time soon, to verify the result?

          For a billion-dollar experiment like the LHC, that dataset is the prize. The dataset is the whole reason the LHC was built. Physicists will be combing the data for rare events and odd occurrences, many years down the road.

          • Re: (Score:1, Offtopic)

            by Sanat ( 702 )

            Mod up this important position please.

          • Re: (Score:3, Insightful)

            Let's say the LHC publishes its analysis [..]

            Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule.
            First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility. They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.
            Second: Primary data, actual measurement results, are already kept, as a rule.
            Third: The vast majority of

            • by oneiros27 ( 46144 ) on Friday February 20, 2009 @11:52PM (#26938547) Homepage

              Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule. First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility. They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.

              There are two types of science. What you're referring to is called 'Little Science' (not to be derogatory), but it's the type of thing that a small lab can do, with a reasonable amount of funding. And then there's what we call "Big Science" like the LHC, Hubble Space Telescope, Arecibo Observatory, Large Synoptic Survey Telescope, etc.

              Second: Primary data, actual measurement results, are already kept, as a rule.

              I wish. Well, okay, it might be kept, but the question is by who, and have they put it somewhere that people can analyze it?

              I was at the AGU last year, and there was someone from a solar observatory that I wasn't familiar with. As I do work for the Virtual Solar Observatory, I asked them if we could put up a web service to connect their repository to our federated search. They told me there was no repository for the observatory -- the data walks out the door with whoever the observer was.

              Then there's the issue of trying to tell from the published research exactly what the original data was. But then, I've been harping on the need for data citation for years now ... it's an issue that's starting to get noticed.

              Third: The vast majority of experiments are never ever reproduced to begin with. You're lucky enough to get cited, really. Most papers don't even get cited apart from by those who wrote them.

              For the type of data that I deal with, none of it is technically reproducible, because it's observations, not experiments. And that's precisely why it's important to save the data.

              Fourth: Very little science is done by re-interpreting existing results. That only applies to the unique cases where the actual experiment can't be reproduced easily.

              In your field, maybe. But we have folks who try to design systems to predict when events are going to happen and need training data. Others do long-term statistical analysis with years or decades of data at a time. Still others find a strange feature that hadn't previously been identified as important (e.g., coronal dimmings) and want to go back through all of the data to try to identify other occurrences.

            • Re: (Score:3, Informative)

              by mako1138 ( 837520 )

              Let's say the LHC publishes its analysis [..]

              Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule.

              First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility.
              They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.

              I admit that I jumped on the LHC as an extreme example. But even in an "ordinary" lab these days, you'll find some specialized and complex equipment. This is true for the cutting edge of any field.

              Second: Primary data, actual measurement results, are already kept, as a rule.

              As oneiros27 notes, this is not guaranteed, either by design or circumstance.

              Third: The vast majority of experiments are never ever reproduced to begin with. You're lucky enough to get cited, really. Most papers don't even get cited apart from by those who wrote them.

              Not sure what kind of point you're trying to make here.

              Fourth: Very little science is done by re-interpreting existing results. That only applies to the unique cases where the actual experiment can't be reproduced easily.

              It's not necessarily a matter of re-interpreting existing results. You may be adding an old dataset to a new dataset, and finding new results in the combined set, or finding a glimmer

          • by Mr Z ( 6791 ) on Friday February 20, 2009 @11:24PM (#26938445) Homepage Journal

            With a large and expensive dataset that can be mined many ways, yes, it makes sense to keep the raw data. This is actually pretty similar to the raw datasets that various online providers have published over the years for researchers to datamine. (AOL and Netflix come to mind.) Those data sets are large and hard to reproduce, and lend themselves to multiple experiments.

            But, there are other experiments where the experiment is larger than the data, and so keeping the raw data isn't quite so important as documenting the technique and conclusions. The Michelson-Morley interferometer experiments (to detect the 'ether'), the Millikan oil-drop experiment (which demonstrated quantized charges)... for both of these the experiment and technique were larger than the data, so the data collected doesn't matter so much.

            Thus, there's no simple "one size fits all" answer.

            When it comes to these ginormous data sets that were collected in the absence of any particular experiment or as the side effect of some experiment, their continued existence and maintenance is predicated on future parties constructing and executing experiments against the data. This is where your LHC comment fits.

            • Re: (Score:3, Insightful)

              by mako1138 ( 837520 )

              I agree that there is no simple answer, but I am uneasy with your "experiment is larger than the data" concept. Today we think of the Michelson-Morley and Millikan experiments as canonical and definitive investigations in Physics. But we do not often remember that each was preceded by a string of less-successful experiments, and followed by confirmations. It is the accumulation of a body of data that leads to the gradual acceptance of a physical concept.

              See chart:
              http://en.wikipedia.org/wiki/Michelson-Morley_e [wikipedia.org]

          • Re: (Score:2, Interesting)

            by Patch86 ( 1465427 )

            5 Insightful?

            Seriously, read the OP again.

            "What's most important to keep is quite simple and obvious really: The results. The published papers, etc."

            He never suggested you throw out the results. No-one is going to throw out the results. Why would anybody throw out the results? Whichever body owns the equipment is bound to keep the results indefinitely, any papers they publish will include the results data (and be kept by the publishers), and copies will end up in all manner of libraries and file servers, du

            • Re: (Score:3, Insightful)

              by mako1138 ( 837520 )

              You seem to be using "results" in a wider sense than "published papers". Yes, nobody is going to throw out papers. But the raw data from instruments? It is not clear whether those will be kept.

              You say that the analysis and interpretations can be thrown out, but those portions are precisely what go into published papers. And for small-scale science, it makes little sense to throw away anything at all.

        • Re: (Score:3, Interesting)

          by TapeCutter ( 624760 )
          "You don't usually want to reproduce the exact same experiment with the exact same conditions."

    That's right, I want an independent "someone else" to do that in order to make my original result more robust. If I were an academic I would rely on post-grads to take up that challenge; if they find a discrepancy, all the better, since you now have another question! To continue your software development analogy: you don't want the developer to be the ONLY tester.
      • Re: (Score:3, Interesting)

        by repepo ( 1098227 )
        It is a basic assumption in science that given some set of conditions (or causes) you get the same effect. For this to happen it is important to properly record how to set up the conditions. This is the kind of thing that scientific papers describe (in principle, at least!).
        • Maybe you haven't noticed, but quantum mechanics seems to indicate there is not always one outcome for one set of conditions. This works on the macro scale, but not necessarily always on the subatomic level.

      • Re: (Score:2, Informative)

        by Rockoon ( 1252108 )
        On the subject of reproducibility, I am reminded of a situation with Wei-Chyung Wang, a climate scientist.

        He was involved in the paper Jones et al (1990), which is where the situation begins.

        After *17 YEARS* of requests, Jones FINALLY released some of the data used in Jones 1990 through demands under the terms of the U.K. Freedom of Information policy on publicly funded research.

        Wang himself is free from FOI requests because Wang is an American and operates in America, where FOI requests regarding pub
      • Re: (Score:3, Informative)

        by jschen ( 1249578 )

        How can the results be reproducible if you don't keep the original data?

        As others noted, there are cases where raw data is king, and others where raw data is virtually useless. LHC raw data will be invaluable. Raw data from genetic sequencing is a waste of time to keep. Why store huge graphics files when the only thing we will ever want from them is the sequence of a few letters? One must be able to distinguish between these two possibilities (and more subtle, less black and white cases, too), and there is no one size fits all solution.

        That said, you may be surprised how well r

    • What's most important to keep is quite simple and obvious really:
      The results. The published papers, etc.

      It's an important and distinctive feature of Science that results are reproducible.

      At what cost? Would you suggest discarding the data sets of nuclear bomb detonations since they are easily reproduced? How about other data sets that may need to be reinterpreted because of errors in the original processing?

      • Re: (Score:3, Interesting)

        At what cost? Would you suggest discarding the data sets of nuclear bomb detonations since they are easily reproduced?

        Nobody said results are easily reproduced. But a-bomb tests are hardly representative of the vast majority of scientific results out there.

        How about other data sets that may need to be reinterpreted because of errors in the original processing?

        That's a scenario that only applies when the test is difficult to reproduce, and the results are limited by processing power rather than measureme

    • The results. The published papers, etc. It's an important and distinctive feature of Science that results are reproducible.

      Having worked around academic groups that do medical research for three years now, I can tell you that is absolutely not what drives research.

      Researchers will love to tell you about how it is the quest for knowledge and other pie-in-the-sky ideals, but when it comes down to it- it's mostly about making a living (or more than a living), and fame/prestige.

      See, journals have what's

      • by smallfries ( 601545 ) on Saturday February 21, 2009 @06:09AM (#26939781) Homepage

        What incentive does a massive industry have to solve cancer, when it would put them out of business? Tens of thousands of people have dedicated most of their adult lives, usually to studying specific mechanisms and biological functions so narrow that if cancer were cured tomorrow, they would be useless- their training and knowledge is so focused, so narrow- they cannot compete with the existing population of researchers in other biomedical fields. Journals which charge big bucks for subscriptions also would be useless. Billions of dollars of materials, equipment, supplies, chemicals- gone. "Centers", hospitals, colleges, universities which each rake in hundreds of millions of dollars in private, government, and non-profit sourced money would be useless.

        That's an old argument and although it sounds reasonable it is completely unsound. An industry does not function as a single cohesive entity with wants and desires. It is composed of many different individuals with their own wants and desires.

        I know enough academics to say for certain that if any one of those individuals could discover a cure that would put their entire employer out of business then they would leap at the chance. The fame that would follow would make another job easy enough to get, and the recognition is what they're really in it for anyway.

        • by cmaley ( 104167 )

          I'm a cancer researcher and I agree. Though I'm in it more for the good of society and because it is an engaging problem. I would jump at the chance to cure cancer even if it put my institution out of business and I didn't get the recognition. The reality (of this fantasy) is that most institutions and researchers could easily move on to other diseases/problems. We do it all the time.

          In addition, there is BIG money to be made from a drug that cures cancer. Even the ones that cure a small percent of cancer c

      • Re: (Score:3, Informative)

        I've been doing research in the biological sciences for 12 years now, including some work that was at least tangentially related to human health. I am not in it for the paycheck--if that's all I wanted, my friends and I joke that we'd go to KFC School of Business Management and be assistant managers at fast food restaurants making more than we do in science. I, and the majority of the people I know, don't want to be professors either. It's extremely rare for a professor to actually do any lab work themse
  • eh (Score:1, Informative)

    by Anonymous Coward

    That's not true. Any tax-funded study requires more documentation and publication than a private one; anyone who reads them knows this.
    All studies worth anything are aimed at an audience proficient in the subject; they are not meant for general audiences. And since they are often proven wrong, you need repeatable results.

  • "And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them. "

    I predict the dumbing down of science.

    • Re: (Score:3, Interesting)

      by Vectronic ( 1221470 )

      Although likely, not necessarily...

      I'd be happy with a Wiki-Style, where the actual article can be as complex (in the know) as desired, but with a glossary of sorts.

      There are geniuses of all sorts, someone might be completely lost trying to understand it linguistically, but may find a fault in it instantly visually, or audibly.

      However that is somewhat redundant, as the original (as it is now) can be converted into that by people, but a mandate saying it must contain X, Y and Z, will open it up to more peopl

      • Re: (Score:3, Interesting)

        by Fallingcow ( 213461 )

        I'd be happy with a Wiki-Style, where the actual article can be as complex (in the know) as desired, but with a glossary of sorts.

        Don't count on that being at all helpful.

        Take the math articles on Wikipedia: I can read one about a topic I already understand and have no idea what the hell they're talking about in entire sections. It's 100% useless for learning new material in that field, even if it's not far beyond your current level of understanding. Good luck if you start on an article far down a branch o

    • Re: (Score:3, Insightful)

      by wisty ( 1335733 )

      Why should science be more complex than necessary? For every String Theory area (where complexity is unavoidable) there are plenty of theories like economics, which just rely on weird jargon to fence out the interlopers.

    • Or the scientists just stop writing in third person passive, and start writing in a manner people outside of the scientific community are used to. Though I think the summary refers more to trying to extract data you do understand from complicated papers that talk a lot about things you neither understand nor care about.

  • Has nobody ever read The Tragedy of the Commons? [wikipedia.org]

    However, in the case of the non-physical, I guess no one can "waste" or "steal" it, only copy and use it.
  • I have been waiting all my life to see how my simulation data would look in Excel. And everyone is supposed to have it (damn you Linux users! damn you people with not enough money to buy it!).

    On a more serious note, a common ground for data formats would be nice. There are already some generic formats, like HDF5 and others, but I must admit that right now it is a bit of a jungle in the astrophysics department, and it is not going to change anytime soon (unless someone makes an awesome generic, one-size-fits-all library

    • Linux reads Excel files in OpenOffice, so .xls is pretty universal; one can also install CrossOver Office or use Wine to install Office (though I don't know how well that works).
    • It is an almost trivial exercise to convert one format to another.

      What is a lot harder is knowing how the data sets were measured and whether it is valid to combine them with data sets measured in other ways.

      At least half the Global Warming bun-fight is about the validity of comparison between different data sets and the same goes for pretty much any non-trivial data sets.
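      As a minimal sketch of that point (hypothetical file and attribute names, using h5py): pouring numbers into a common container like HDF5 is the trivial part; the measurement metadata you attach is what decides whether the set can be validly combined with anyone else's.

      # Hypothetical example: converting the numbers is easy; recording how they
      # were measured is the part that has to be done deliberately.
      import numpy as np
      import h5py

      # Stand-in for real measurements (a year of hourly temperature readings).
      temps = np.random.default_rng(0).normal(15.0, 5.0, size=24 * 365)

      with h5py.File("station_42.h5", "w") as f:
          dset = f.create_dataset("temperature", data=temps)
          # Provenance attributes -- without these, combining this data set with
          # one measured differently is guesswork.
          dset.attrs["units"] = "degC"
          dset.attrs["instrument"] = "PT100 probe (hypothetical)"
          dset.attrs["sampling"] = "hourly mean, station 42, 2008"
          dset.attrs["calibration"] = "lab-calibrated 2007-11-03"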

  • by Rostin ( 691447 ) on Friday February 20, 2009 @11:24PM (#26938443)

    I'm a working scientist (ok, PhD student), so I read journal articles pretty often. I can understand the rub in principle, but let's say that we come up with some way for all scientific data to be freely shared. So what? In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?

    It reminds me of the XKCD this morning...

    • Re: (Score:2, Interesting)

      by onionlee ( 836083 )
      agreed. most sciences that have been around for a long time and have developed their own specializations within them, such as physics, have specific journals that target their "demographics" (such as the journal of applied physics a, b, c, d, letters). anything outside of those journals has most likely been rejected by them and is irrelevant. furthermore, the relatively young sciences such as linguistics use (what i personally think is lame) a system of keywords so that anyone can easily find a
    • Re: (Score:2, Insightful)

      by TapeCutter ( 624760 )
      "I'm a working scientist (ok, PhD student), so I read journal articles pretty often."

      And how would you read them if your institution did not foot the bill for subscriptions?

      "In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists."

      When you amalgamate "almost all cases" you end up with "almost all publications". The rest of your post smacks of elitism, trivializes scientific curiosity and completely ignores the social and scientifi
      • by robbyjo ( 315601 )

        To be honest, if your institution does not foot the bill for subscription, try inter-library loans. That's easy. Most credible institutions in the US do have some subscription for more mainstream journals. Unless you're in third world countries.

        The problem with scientific publication is that you need to be terse. Papers are limited to 8-12 pages. If you are required to spend time on background knowledge for the uninitiated, you'll produce a 1000-page book instead. Moreover, the reviewers will think that you s

        • "If you are required..."

          I don't think anyone in TFA is seriously suggesting that hand holding noobs be a requirement for publication and this is probably where the confusion sets in. I also understand that you may want to keep your own data close to your chest until you have extracted a paper out of it (ie: publish or perish).

          "To be honest, if your institution does not foot the bill for subscription, try inter-library loans...[snip]...The problem with scientific publication is that you need to be ters
          • by robbyjo ( 315601 )

            Einstein managed to get away with three elegant pages and zero references

            Science has evolved much since 1905. Even with his zero references, he is still implicitly citing the results of Lorentz. By today's standards, publishing with no citations like that would be unacceptable.

            Let me ask you this: Can you honestly ask a high school student or a freshman to understand even that paper without grasping the concepts of differential equations (DEs)? They can't. Sure, you can understand the motivation and introduction of that paper, just li

            • "Science has evolved much from 1905."

              Documentation procedures have evolved (precisely what TFA is banging on about); the philosophy and methodology of science are pretty much the same, no?

              "Let me ask you this: Can you honestly ask a high school student or a freshman to understand ..."

              I could, but as you say they may have difficulty understanding. More puzzling is why you are asking me - I'm 50 and I am talking about myself and other educated laymen (particularly those in the less developed countries), w
              • by robbyjo ( 315601 )

                The original post made a point that "In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists." I completely agree with him. The public mostly has no use for such data unless they know how to process the data and all the rationale behind it (which implies that they must know all the underlying scientific process). I agree with that as well. However, you stressed the communication issue to the uninitiated--which I think is misleading. And t

                • "worthy of the data" - Thank you for confirming my suspicions of elitisim or is it just plain arrogance? Either way the rest of the post that precedes your conclusion of who is "worthy" reads as an attempt to define what others should or should not be interested in. If you don't want to take part in open access then fine, nobody is forcing you to do so. Please do not obstruct the efforts of others just because it does not fit your worldview as this would imply you are not only elitist but also a control fre
        • To be honest, if your institution does not foot the bill for subscription, try inter-library loans. That's easy. Most credible institutions in the US do have some subscription for more mainstream journals. Unless you're in third world countries.

          Anything that complicates the retrieval of knowledge ends up reducing access to that knowledge. Why should someone have to put up with a manual process when we have this thing called the internet? The internet is designed to facilitate access to knowledge, so it is the tool of choice.

          • by robbyjo ( 315601 )

            Anything that complicates the retrieval of knowledge ends up reducing access to that knowledge. Why should someone have to put up with a manual process when we have this thing called the internet? The internet is designed to facilitate access to knowledge, so it is the tool of choice.

            Yes, and there are open-access journals already. Guess what? The scientists (i.e. the paper authors) are required to pay much more for the open access. Heck, they're required to pay for non-open journals as well. Don't believe

    • by Beetle B. ( 516615 ) <beetle_b@@@email...com> on Saturday February 21, 2009 @01:13AM (#26938837)

      Typical comments from someone in the first world.

      First, just on the side, I know lots of people who got PhDs but did not really stay in research and academia. They still want to read papers, though, as they still maintain an interest.

      But the main benefit of opening up journal papers is for the rest of the world to benefit. Yes, if you have a very narrow perspective, you could just dismiss that as charity. If you're open minded, you'll realize that shutting out most of the world to scientific output means much less science globally, and much less benefits to you as a result.

      Imagine if all researchers in Japan published papers only in Japanese, and the journals had a copyright condition that prevented the content from ever being translated to another language, and you'll see what I mean. Whereas current journals require a lot of money for access, these ones also have a price: Just learn Japanese. It's not exactly promoting science.

      Then again, of course, journals do need a base amount of money to operate. It's just that companies like Elsevier charge far more than is needed to make a profit.

      • I know people who'd have an easier time learning Japanese than C++; does that mean we should write computer code in plain English? The same thing applies to science. Non-technical science isn't science, so when scientists publish something for each other to read, they publish it in their own language. There are people who translate it back to English, such as teachers and writers, and those of us who don't have the background to compile science code in our minds need to find the binary version or learn the languag
      • by ceoyoyo ( 59147 )

        Sucks to live in the developing world and be told that if you want to publish your results it's $1000 a paper.

    • The real problem is hoarding knowledge, which over time leads to elitism, then guilds, and finally priesthoods. The fix is literally trivial: open access to electronic publications for everybody, i.e., bypass all the elaborate subscription checks. This isn't rocket science. The only thing stopping it from happening is the greedy publishing companies who like the status quo.

      You're right that every single modern scientific publication has a very small intended readership, yet the argument for opening up everythin

    • Re: (Score:3, Interesting)

      by Grym ( 725290 ) *

      In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?

      I'm a poor medical student, but a medical student with--quite frequently--interdisciplinary ideas. I can't tell you the number of times I have been interested in

    • by lbbros ( 900904 )
      An example would be re-analyzing or reviewing the data when new methods for looking at it come along, or simply to try out new ideas. I work with high-throughput data (DNA microarrays) and about half of my work is applying my ideas to data that others have published, to validate an approach in an independent data set.

      Some fields require access to the data more than others. In the case I'm talking about, you should take a look at the MIAME (Minimal Information About a Microarray Experiment) checklist [mged.org] publish

    • by smallfries ( 601545 ) on Saturday February 21, 2009 @06:22AM (#26939827) Homepage

      Trickle-down. Dissemination of knowledge.

      You don't know it yet (not meant as a jibe but it is something that clicks in after your PhD) but your primary function as a scientist is not to make discoveries. It is spreading knowledge. Sometimes that dissemination will occur in a narrow pool, through journal papers between specialists in that narrow pool of talent.

      This is not the primary goal of science, although it can seem like it when you are slogging away at learning your first specialisation well enough to get your doctorate. Occasionally a wave from that little pool will splash over the side - maybe someone will write a literature review that is read by a specialist in another field. A new idea will be found - after all sometimes we know the result before we know the context that it will be applied to.

      The pools get bigger as you move further downstream. Journal articles pass into conference publications, then into workshops. Less detail, but carried through a wider audience. Then after a time, when the surface seems to have become still, textbooks are written and the knowledge is passed on to another generation. We tend to stick around and help them find the experience to use it as well. This is why all PhD students have an advisor to point out the best swimming areas.

      That was the long, detailed answer to your question. The simple version is that you don't know who your target audience is yet. And limiting it to people in institutions that pay enormous access fees every year is not science. As a data point: a lot of European institutes don't bother with IEEE fees. They run to about £50k/year, which simply isn't worth it. As a consequence, results published in IEEE venues are cited less in Europe. So even amongst the elite, access walls have an effect.

    • by radtea ( 464814 )

      Could someone explain to me why this is a real problem

      I'm a physicist who runs a business that amongst other things does data analysis in the life sciences, mostly genomics [predictivepatterns.com]. In this area data collection is relatively expensive (hundreds or thousands of dollars per sample) and disease states are relatively generic--follicular lymphoma is pretty much the same regardless of whether you are in Kansas or Karachi.

      I recently invented a new algorithm for combing gene expression data for patterns of expression tha

    • I'm a working scientist (ok, PhD student), so I read journal articles pretty often. I can understand the rub in principle, but let's say that we come up with some way for all scientific data to be freely shared. So what? In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?

      Replace "scientific data" with "satellite imagery".
      There's nothing to gain by letting anyone look at it? Only highly trained experts can decipher it?

      People have found hidden forests, ancient ruins, and a few meteor impacts. You don't know what there is to find in the data until you let people look.

    • by cmaley ( 104167 )

      Actually, the few times I know of that a good data set was put up on the web, it generated a lot of research and progress. I'm thinking of Pat Brown putting up some of the first data on gene expression arrays. Probably hundreds of people worked on that data - everything from statistical methods, to reverse engineering the gene network. It was great. This is probably most valuable when the data is from a new type of experiment that is likely to be widely used.

      I hope to do something similar but there is a big

    1. It would be good to share on the web, for each paper, all of the data as well as the conclusions. Then the reasoning is easier to check. More importantly, other people can query the data differently to produce other conclusions without special requests.
    2. "Reproducibility" doesn't mean using the same data; it means using the same procedures on different data and seeing if the same conclusion is reached.
    3. If data were commonly provided as experiments are done, this would be of value to others even if the exper
    • The data is reviewed by people qualified to review the data; that's what peer reviewed journals are for.
      • Peer review doesn't require reviewing the data. In most cases, peer review simply has a peer signing off as if to say "yeah, if he went through the steps he claims to have gone through, then his conclusion is probably reasonable."

        Do you HONESTLY think that publications such as Nature and Science have teams of people sifting through supplied data?

        Look into what these publications require of the researchers sometime. They do not require the data. They instead require that the data be made available upon reque
  • Data storage is something we've gotten very good at and we've made it very cheap. A Petabyte a day is not as staggering as it was even five years ago.
    • Re: (Score:2, Interesting)

      by DerekLyons ( 302214 )

      Not as staggering as it was five years ago only means it is not as staggering as it was five years ago - not that it isn't still staggering. Especially when you consider that a petabyte a day works out to roughly 365 petabytes (about a third of an exabyte) a year.

    • by dkf ( 304284 )

      Data storage is something we've gotten very good at and we've made it very cheap. A Petabyte a day is not as staggering as it was even five years ago.

      It still has to be paid for. It still has to be actually stored. It still has to be backed up. It still has to be kept in formats that we can actually read. It still has to have knowledge about what it all means maintained. In short, it still has to be curated, kept in an online museum collection if you will. And this all costs, both in money and effort by knowledgeable people.

      The problem doesn't stop with copying the data to a disk array.

  • Wilbanks also points of that as the volume of data grows from new projects...

    I'm sorry, but that makes no sense. 'Points of'???? Come on.

  • by jstott ( 212041 ) on Saturday February 21, 2009 @12:36AM (#26938705)

    And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them.

    I know that this is a real shock to you humanities majors, but science is hard. And yes, for the record, I do have degrees in both [physics and philosophy, or will as of this May — and the physics was by far the harder of the two].

    Here's another shocker. If you think the papers are hard to read, you should see the amount of work that went into processing the data until it's ready to be written up in an academic journal. Ol' Tom Edison wasn't joking when he said it's "1% inspiration and 99% perspiration." If you think seeing the raw data is going to magically make everything clear, well, I'm sorry, the real world just doesn't work that way. Finally, if you think professional scientists are going to trust random data they downloaded off the web of unknown provenance, well, I'm sorry but that isn't going to happen either. I spend enough time fixing my own problems; I certainly don't have time to waste fixing other peoples' data for them.

    -JS

    • Re: (Score:1, Insightful)

      by Anonymous Coward

      I fully agree.

      Furthermore, I've read the entire, long interview and get the feeling this is a person looking for a problem. Yes, taxpayer-funded research should be freely available. Yes, we could all benefit from more freely available data. But he builds up a massive and poorly defined manifesto with very little meat around a few good points.

      I'd love to have access to various data sets that I know exist, because others have published their results and described the data collection. But they likely invested

    • by ceoyoyo ( 59147 )

      Bravo.

      There ARE big shared datasets, when it makes sense, from trustworthy sources. They tend to cost a lot to assemble, make available, and maintain. I'm starting a post doc at a new lab and they showed me one they're working on: the price tag was $40 million.

      We also have a mechanism by which anybody can read scientific papers, for free, if they choose to put in a little effort. They're called libraries.

      Yes, the journal publishers probably need to cut their prices now that nobody actually wants the prin

  • Euclid said once to a king "there is no royal road to geometry". The nature of some things is in fact complex, and there is no way to represent it that is both easy and accurate at the same time.

    Is it a goal of science, or of religion, that the universe be made in such a way that it is easy to explain to humans?
  • This is just as likely to add burden as to remove it.

    I can't count the number of times I've seen attempts to 'standardize' data, or even just notation, in a given field. It all works very well for data to that point, but then the field expands or changes, or new assumptions become important, and the whole thing becomes either unwieldy or obsolete. This is one reason why every different field, it seems, has their own standards in their literature.

    Speaking of the literature, most of these proposals
  • Excluding experimental data, those fields don't really have the problem that this guy is talking about. Perhaps someone should give him/her a lesson in the Scientific Method. Then maybe his/her words would reflect some rigour. Well, that and a link to the arXiv (http://arxiv.org/).

    Why is this so? Because these communities are so small that just about everyone knows, or knows of, everyone else('s work). Of course, that's a slight hyperbole. BUT, /just/ a *slight* one.

    This sort of project only really a

    • Finally somebody mentioned the arXiv.
      By the way, it's quite funny to see all these guys telling somebody how to do his job better, mostly when they have absolutely no idea what they're talking about.
      Some nice sentences from the article:
      -"It's taken me some time to learn how to read them"... what!!??
      -"Because you're trying to present what happened in the lab one day as some fundamental truth", hahaaha, that one is good.
      -"So what we need to do is both think about the way that we write those papers, and t
  • Does anyone, -- I mean there's me obviously -- think that the way the structure of the articles doesn't, in the sense that it's sort of an exact word for word -- transcription of someone *speaking* -- is extremely jarring when you see it -- by that I mean in the written form?
  • by w0mprat ( 1317953 ) on Saturday February 21, 2009 @09:09AM (#26940447)
    Research data is typically large. In the mid-late 90s I recall a researcher planning to move 10 TB of data internationally. It wasn't exactly unprecedented either. The internet was simply not capable of such a transfer. Eventually they had to ship it on many disks.

    The problem with such raw data, e.g. from a radio telescope, is that you need all of it; you can't really cut any of it out before it's even processed.

    This is a lot less of an issue today with research networks all hooked into multi-gigabit pipes. But there are still very large datasets researchers are attempting to work with that are simply not cheap to handle.

    I think this is a great idea, and it's nice being able to share, but as far as the really sexy big research going on these days goes, I don't see it being much of a point-click-download service!
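    For a rough sense of the scale (the link speed here is my assumption, not a figure from the post): even a sustained 10 Mbit/s international link in that era would have needed months to move 10 TB.

    # Back-of-envelope sketch: time to move 10 TB over an assumed 10 Mbit/s link.
    data_tb = 10                          # data set size, terabytes
    link_mbit_s = 10                      # assumed sustained throughput, Mbit/s
    bits = data_tb * 8e12                 # 10 TB ~= 8e13 bits
    seconds = bits / (link_mbit_s * 1e6)
    print(f"~{seconds / 86400:.0f} days") # ~93 days, before any protocol overhead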
    • by ceoyoyo ( 59147 )

      Grant applications in my field typically have at least one line item for "storage." It's not cheap.

  • From the article, regarding scientific literature: "Because you're trying to present what happened in the lab one day as some fundamental truth. And the reality is much more ambiguous. It's much more vague. But this is an artifact of the pre-network world. There was no other way to communicate this kind of knowledge other than to compress it."

    A statement like this suggests that the speaker is either unfamiliar with the way scientific data is actually turned into papers, or inappropriately optimistic about the

  • What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data?

    The original data is of paramount importance, software for processing and analysis not so much... Science requires the ability to independently redo experiments and analyze data... getting the same result IS the method of verification that makes the "Scientific Method" valid. Getting the same result using different tools for analysis is even better... Mann's "Hockey Stick" graph is one of the failures o
