Major Scientific Journal Publisher Requires Public Access To Data

An anonymous reader writes "PLOS — the Public Library of Science — is one of the most prolific publishers of research papers in the world. 'Open access' is one of their mantras, and they've been working to push the academic publishing system into a state where research isn't locked behind paywalls and subscription services. To that end, they've announced a new policy for all of their journals: 'authors must make all data publicly available, without restriction, immediately upon publication of the article.' The data must be available within the article itself, in the supplementary information, or within a stable, public repository. This is good news for replicating experiments, building on past results, and science in general."
  • Good policy (Score:5, Interesting)

    by MtnDeusExMachina ( 3537979 ) on Tuesday February 25, 2014 @05:40PM (#46339223)
    It would be nice to see this result in pressure on other publishers to require similar access to data backing the papers in their journals.
    • Practicalities (Score:5, Interesting)

      by Roger W Moore ( 538166 ) on Tuesday February 25, 2014 @06:17PM (#46339525) Journal
      Open data is a great idea but it is not always practical. Particle physics experiments generate petabytes of extremely complex, hard-to-understand data. Making this publicly accessible is extremely expensive and ultimately useless: unless you understand the innards of the detector and how it responds to particles, and spend the time to really understand the complex analysis and reconstruction code, there is nothing useful you can do with the data. In fact, one of the previous experiments I worked on went to great trouble to put its data online in a heavily processed and far easier to understand format, in the hope that theorists or interested members of the public would look at it. IIRC they got about 10 hits on the site per year and 1 access to the data.

      So I agree with the principle that the public should be able to access all our data, but for experiments with massive, complex datasets there needs to be a serious discussion about whether this is practical given the expense and complexity of the data involved. Do we best serve the public interest if we spend 25% of our research funding on making the data available to a handful of people outside the experiments with the time, skills and interest to access it, given that this loss in funds would significantly hamper the rate of progress?

      Personally I would regard data as something akin to a museum collection. Museums typically own far more than they can sensibly display to the public, so they select the most interesting items and display these for all to see. Perhaps we should take the same approach with scientific data: treat it as a collection of which only the most interesting selections are displayed to/accessible by the public, even though the entire collection is under public ownership.
      • So, are you worried that everyone is going to download petabytes of data? To where, their desktops?

        Shit, that's the monthly volume of third world countries these days.

        • Re: (Score:3, Insightful)

          by Anonymous Coward

          Uploading and hosting it in the first place to meet such a requirement would be an extremely difficult & costly endeavor.

          Perhaps the compromise is to include a clause that requires the author to permit others to obtain a copy and/or access the data, but only if the receiver of the data pays for the cost to transfer/access it. This is similar to state open-records laws, where you must pay for things like the cost of making copies of documents. So in the above case, satisfying the "must permit access" requirement could simply mean letting requesters cover those costs.

        • So, are you worried that everyone is going to download petabytes of data?

          No, I am worried about the cost of setting up an incredibly expensive system which can serve petabytes of data to the world, and then having it sit there almost unused while the hundreds of graduate students and postdocs the money could have funded move on into careers in banking instead of going on to make a major scientific breakthrough which might benefit all of society.

            • Why would a website front end be incredibly expensive? It doesn't need to be highly available or have gigabits of bandwidth.

            Hell, pay a grad student to build it.

            • Why would a website front end be incredibly expensive?

              It wouldn't, so long as all you expected was a simple file system with data files. But without some explanation of the data format, where to find the associated calibration database, the geometry database, etc., it will be of no use to anyone. So you will need to hire someone to nicely format the data, write documentation on where to find the calibration and geometry databases, etc. This is before you even start to look at the cost of storing the hundreds of petabytes of data - you are looking at about $5 million.
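              (For scale, a back-of-envelope using commodity disk at roughly $50/TB - my assumption, not a figure from the comment above:

                  100 PB = 100,000 TB, and 100,000 TB x $50/TB = $5,000,000

              and that is raw disk only, before redundancy, servers, power, and staff.)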

      • Unlike a museum, data doesn't require anyone to physically interact with it in order for it to be available. Whether or not you make the data publicly available, you have to store it and make it privately available anyway; putting in public access is a matter of creating a read-only user and opening a firewall port.

        The sad thing is that most scientists don't actually store their data properly; it sits on removable hard drives, CDs, or an older variant of portable media (zip drive, tape) until it's forgotten about, lost, or thrown away.

        • Re: Practicalities (Score:5, Informative)

          by Obfuscant ( 592200 ) on Tuesday February 25, 2014 @07:19PM (#46340137)

          Whether or not you make the data publicly available, you have to store it and make it privately available anyway;

          I have boxes and boxes of mag tapes with data on them from past experiments. That's privately available. It will never be publicly available.

          putting in public access is a matter of creating a read-only user and opening a firewall port.

          It is clear that you have never done such a thing yourself. There is a bit more to it than what you claim. I've been doing it for more than twenty years, keeping much of the data we have publicly available (but not all -- tapes are not easily made public that way), and there is a lot more to dealing with a public presence than just "a read-only user and a firewall port".

          The sad thing is that most scientists don't actually store their data properly; it sits on removable hard drives, CDs, or an older variant of portable media

          And now you point out the biggest issue with public access to data: the cost of keeping it online 24/7 so the "public" can maybe someday come look at it. Removable hard drives are perfectly good for storing old data, and they cost a lot less than an online RAID system. For that data, that is storing it "properly".

          If you want properly managed, publicly open data for every experiment, be prepared to pay more for the research. And THEN be prepared to pay more for the archivist who has to keep those systems online for you after the grants run out. And by "you", I'm referring to you as the public.

          Researchers get X amount of dollars to do an experiment. Once that grant runs out there is no more money for maintenance of the online archive, if there was money for that in the first place. For twenty-two years our online access has been done using stolen time and equipment not yet retired. When the next grant runs out, the very good question will be who is going to maintain the existing systems that were paid for under those grants. Do they just stop?

          • by guruevi ( 827432 )

            I actually do this for a living. Having data available for projects does require it to be on large data systems which are properly backed up, etc. Heck, any halfway decent staged system (Sun used to make really good ones) will allow you to read tapes as if they were a regular network share. The problem will be (and it is inevitable) that your PI is going to ask for the data 3 years after they left the institute, and your tapes will be unreadable (either because they degrade or because you can't find a reader anymore).

            • by Bongo ( 13261 )

              Thanks, I've been wondering about this problem for a while. I'd seen ZFS as the technical part, but didn't know what to do about the "no money" part.

            • I actually do this for a living. Having data available for projects does require it to be on large data systems which are properly backed up, etc.

              So do I, and have for more than twenty years. If you did, you'd know it is quite a bit more than just a hole in a firewall and a read-only login. It requires an organization that the public can navigate and understand and actually find things in. That's different from the organization that the local users need, since local users get a much larger view of the data and need it in faster and more direct ways. I.e., local users see a lot of files; public users see links on a web page.

              If your "public" access is just a

              • by guruevi ( 827432 )

                The "read only user" was hyperbole but it's very close to a technical solution. To "open your data" all you need is a system that you can point to and will resolve externally. Usually, that link will be a very specific data set which is included in the paper and which will be available. How you organize it internally doesn't matter, as long as you can point to say an HTTP page with all the data in read only. There are no major security issues there because the data should already be open, it doesn't matter

      • Re:Practicalities (Score:4, Informative)

        by RDW ( 41497 ) on Tuesday February 25, 2014 @06:47PM (#46339837)

        There could be significant issues with biomedical data, too. The policy gives the example of 'next-generation sequence reads' (raw genomic sequence data), but it's hard to make this truly anonymous (as legally and ethically it may have to be). Indeed, some researchers have identified named individuals from public sequence data with associated metadata: http://www.ncbi.nlm.nih.gov/pu... [nih.gov]

        • It goes way beyond just genes and patient data. First, there's the issue of regulation. In most biology/psychology-related fields, there's a raft of regulations from funding sources, institutional review boards, the Dept. of Agriculture (which oversees animal facilities) and IACUCs, for example, that make it impossible to comply with this requirement, and will continue to do so for a long time. No study currently being conducted using animal facilities can meet these criteria, because many records related to animal subjects cannot be publicly disclosed.

      • Re: (Score:2, Insightful)

        Well, but.

        I think there's an arguable line to draw between "the entire body of data available", and the statistical sampling data that your typical paper is based on, or the specific data about a newly discovered phenomenon, for example.

        Exactly where that line is, I don't claim to know. But it behooves us to be reasonable, and not draw UNreasonable fixed lines in the sand.

        My personal opinion is: petabytes or not, if the research is publicly funded then the data belongs to the public, and must be made available in some fashion.
        • My personal opinion is: petabytes or not, if the research is publicly funded then the data belongs to the public, and must be made available in some fashion.

          The public is currently not paying for this access. Do you want to massively increase the research funding system in the US (or whatever country) to pay for long-term management of all publicly-funded data? Or do you expect to get it for free?

          Your desire to access any and all data that was created using public money means that every research grant would need to be extended from the current length (one to three years for many of them) into decades. Someone has to pay for the system administrator, the network access, the electricity, the replacement compute/server hardware, the maintenance contracts, etc. Are you willing? Are you willing to forgo your free access when the funding agencies don't pay? I can tell you, I MIGHT work for free to keep some of the systems I created running, but I wouldn't work for free to maintain the access to the public for that data.

          • by aurizon ( 122550 )

            A lot of people ignore the collateral functions of the so-called 'peer review' system administered by the publisher.
            The publication must be read by someone who knows the subject passably. If his first pass finds it acceptable, he must then select from a number of true experts in these matters (the peers, or equals, of the writer of the paper). He works for a living as a competent editor for that area of research. The peers he chooses are sent a copy of the paper to review and criticize; if it is not acceptable, the paper is sent back for revision or rejected.

            • "A lot of people ignore the collateral functions of the so-called 'peer review' system administered by the publisher."

              I don't see this as a stumbling block, though. There are already public-access peer-reviewed journals [peerj.com]. They may have a way to go yet but I expect them to get better and their number to expand in the near future.

              • by aurizon ( 122550 )

                Too many badly reviewed articles are published by them.

                • "Too many badly reviewed articles are published by them."

                  Well, that's a pretty broad statement and I haven't seen any evidence. In any case, I repeat:

                  "They may have a way to go yet but I expect them to get better"

                  • by aurizon ( 122550 )

                    I had not seen PeerJ; it looks better than some of the others, and their $99 fee is encouraging, even if optimistic - what happens when the workload gets large, which can happen if they attract many authors? There are other journals of easy access and low editorial standards, which is the 'them' I referred to. By the use of a pool of reviewers PeerJ has a shot at kicking the established journals to the curb = good. In so doing PeerJ will improve the ecology, and hopefully the lower-grade journals will smarten up.

          • "The public is currently not paying for this access."

            I know it isn't. That was an aside, slightly off-topic, I admit.

            "Your desire to access any and all data that was created using public money means that every research grant would need to be extended from the current length (one to three years for many of them) into decades."

            Not if such a program were to affect only future research. After all: ex post facto laws are forbidden in the United States.

            "Someone has to pay for the system administrator, the network access, the electricity, the replacement compute/server hardware, the maintenance contracts, etc. Are you willing? "

            I am aware that it would cost somewhat more. But it is arguable that the benefit to society would be worth far more.

            "Are you willing to forgo your free access when the funding agencies don't pay?"

            If they don't pay, then it wasn't publicly funded, was it?

            "I can tell you, I MIGHT work for free to keep some of the systems I created running, but I wouldn't work for free to maintain the access to the pubic for that data."

            If you are profiting on my dime, then yeah. Cough it up, bud.

            I didn't say the researchers should pay for it. The public (meaning, of course, government) should.

            • Not if such a program were to affect only future research.

              I don't know what would be magic about future research that would allow a three-year grant to pay for extended, stable, long-term public access to the data collected under that grant. If you want someone to provide the data to you, someone needs to pay for the systems and people required to store and distribute it. That requires a source of funding for the long term. That would mean the three-year grant would need to be twenty years long or more, even if it is just paying for maintenance.

          • We are paying for that access.

            I've been a government employee overseeing research grants. Nearly every single one of them has a clause built in that the data is to be organized and shared with the government and the government has unlimited rights to that data, including all publications. Almost all of them have to have a data management plan and have to describe how the grantee will ensure access to the data.

            Almost every single PI simply says "We will follow a standard data management plan," or some other boilerplate.

        • But if I have to spend $100k on lobbying before I get public funding, I don't want to have to share the results with freeloaders who didn't pony up the lobbying cash and didn't put the manpower into the research. The rest of society benefits from the public funds after they have bought my product. Take Google, for instance.

          • "But if I have to spend $100k on lobbying before I get public funding, I don't want to have to share the results with freeloaders who didn't pony up the lobbying cash and didn't put the manpower into the research."

            You are describing exactly why the current system is broken.

            First off, if the research is worthwhile you shouldn't have to spend $100,000 to lobby for it. And I would argue that is an unethical practice: what about the little guy who is doing promising research but doesn't have the funds to lobby?

            Second: quite frankly I don't give a flying fuck how much you spent to get the grant. Public money is public money. If I'm paying for it, it belongs to me. Period. And I don't care even a little if you don't like that.

      • Re:Practicalities (Score:4, Insightful)

        by Crispy Critters ( 226798 ) on Tuesday February 25, 2014 @07:18PM (#46340119)
        "petabytes of extremely complex, hard to understand data"

        The point seems to be missed by a lot of people. RAW DATA IS USELESS. You can make available a thousand traces of voltage vs. time on your detector pins, but that is of no value whatsoever to anyone. The interpretation of these depends on the exact parameters describing the experimental equipment and procedure. How much information would someone require to replicate CERN from scratch?

        Some (maybe most, but not all) published research results can be thought of as a layering of interpretations. Detector output is converted to light intensity, which is converted to frequency spectra; the integrated amplitudes of the peaks are calculated and fit to a model, and the fitted parameters give you a result such as "the amplitude of a certain emission scales with temperature squared." Which of these layers is of any value to anyone? Should the sequence of 2-byte values that comes out of the digitizer be made public?
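        A minimal sketch of that layering, with made-up numbers (the gain, offset, and peak window below are hypothetical, not from any real detector), just to show how each layer discards information the previous one carried:

            import numpy as np

            # Layer 0: raw digitizer output -- 2-byte ADC counts, meaningless without context.
            rng = np.random.default_rng(0)
            adc_counts = rng.integers(0, 4096, size=4096)      # stand-in for a detector trace

            # Layer 1: calibration -- requires instrument-specific constants (hypothetical).
            GAIN, OFFSET = 0.75, 112.0
            intensity = GAIN * (adc_counts - OFFSET)           # e.g. light intensity vs. time

            # Layer 2: transform -- frequency spectrum of the trace.
            spectrum = np.abs(np.fft.rfft(intensity))

            # Layer 3: reduction -- integrated amplitude of one peak region (window made up).
            peak_amplitude = spectrum[100:120].sum()

            # Layer 4: model fit -- the published number, e.g. fitting amplitude vs.
            # temperature to A = c*T**2 across many such runs (fit omitted here).
            print(peak_amplitude)

        Publishing layer 0 without the constants behind layers 1-3 reproduces the "raw data is useless" problem; publishing only layer 4 leaves nothing to audit.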

        It is not possible to make a general statement about which layer of interpretation is the right one to be made public. Higher levels, closer to the final results, are more likely to be reusable by other researchers. However, higher levels of interpretation provide the least information for someone attempting to confirm that the total analysis is valid.

        • It is not possible to make a general statement about which layer of interpretation is the right one to be made public. Higher levels, closer to the final results, are more likely to be reusable by other researchers. However, higher levels of interpretation provide the least information for someone attempting to confirm that the total analysis is valid.

          You're wrong. It is perfectly clear what needs to be published openly: whatever is necessary for someone to confirm that the total analysis is valid.

          • "You're wrong. It is perfectly clear what needs to be published openly: whatever is necessary for someone to confirm that the total analysis is valid."

            This is not what is under discussion. To confirm the total analysis, you need access to all the raw bits, all the calibration data underlying the analysis, all the computer codes used, copies of any written information in logs and lab books, and all the laboratory equipment as it was at the time the data was collected. Plus, you need to have all the knowledge of the people who ran the experiment.

      • There's precedent for this. In many biology experiments, the "raw data" is an actual organism, like a colony of bacteria or something. There are scientific protocols for accessing that "data", but you have to be able to prove that you are an institution that can handle it. Even if the public "owns" it, technically speaking, no reputable scientist is going to send an E. coli sample to just anyone.

        So I think we all understand that, in practice, we mean different things by "public access". Sometimes that means access for qualified institutions rather than downloads for anyone.

        • What? The organism is not the data - the data is all the measurements you took of that organism and all the situations you subjected them to in order to reach the conclusions that you are publishing.

          • The idea is not mine; I'm actually paraphrasing Richard Lenski.

          • "What? The organism is not the data - the data is all the measurements you took of that organism and all the situations you subjected them to in order to reach the conclusions that you are publishing."

            You simply don't understand and have a very naive view of biology and the complexity of life on planet Earth. If you don't have voucher material available to confirm the identity of the organisms under study, then there are no definite subsequent statements that one can make about any of the measures, observations, or experiments.

            • Ah. Good point. Fortunately DNA sequencing is getting cheaper by the day - that's about as unambiguous an identification as you can get, and it can't reproduce on its own.

              • More rapid sequencing and hopefully much less expensive sequencing will greatly improve our knowledge, but the reality is that species identification will always be an issue, since there are so many similar species that are often difficult to tell apart. So care must be taken with the identifications to ensure that the correct name is being attached to the sequences generated. The need for vouchers will be with us for a very long time to come, and this may be a good thing, since it will shift the focus from simply obtaining sequences to correctly identifying the organisms they came from.

                • I would suggest that we are discovering that the entire concept of species is itself rather poorly defined. It's clear enough once two organisms have diverged too far to permit reproduction, but that doesn't really address the transitional phases or "intermediate" species (A+B or B+C can reproduce, but not A+C). And that traditional heuristic completely falls down in the face of asexuals who can exchange genes with other organisms that are barely related at all.

                  At some point I think we're going to have to rethink the concept entirely.

      • How hard would it be to grant exceptions to the policy? It's a good policy, no reason it can't be flexible too.
      • Good point. However, even for data that only comes to gigabytes, all such data and the resources necessary to set up and maintain such repositories are going to cost a lot of money. Journals can demand it, but it's not clear that authors will be able to pay to put it in the form that journals might like to see. There is also the question of archival costs. Any organization that accumulates such data is going to require a revenue stream to pay for it. This could well be yet another cost that needs to be covered by grants.

      • There would seem to be a relatively easy solution to this problem - make the raw data available from the article itself, or at least as an attachment. If that requires petabytes of storage, then presumably PLOS will provide the necessary infrastructure. That way they can ensure that as long as the article is being offered, all data used is also available. Does that sound unreasonable considering their requirement?

      • Making this publicly accessible is extremely expensive and ultimately useless: unless you understand the innards of the detector and how it responds to particles, and spend the time to really understand the complex analysis and reconstruction code, there is nothing useful you can do with the data.

        If the format/origin of the data is not understood by your readers, perhaps your publications need more work.

        • If the format/origin of the data is not understood by your readers, perhaps your publications need more work.

          Sadly, "the public" is a much larger superset of people than "your readers". And journals are not the way to teach people all the various formats and origins of data, so even considering a limited subset of "your readers" your statement is wrong.

      • But if a researcher -- or an interested enthusiast -- contacted you and asked questions and then wanted to see the data, you would give it to them, right? I strongly applaud the spirit, if not the implementation, of the idea. Science is supposed to be publicly verifiable evidence that any interested party could reproduce. And that is what it is, as long as you have the grant money and access to materials such as particle accelerators or MRI magnets. Now I realize that this type of equipment is expensive to buy and operate.
    • Re: (Score:3, Interesting)

      by Pseudonym ( 62607 )

      You know who needs to introduce this rule? The ACM.

      I'm fed up with so-called scientific papers with results based on proprietary software. It doesn't even have to be open source, though that would clearly be good for peer review. If I can't (given appropriate hardware and other relevant caveats) run your software, I can't replicate your results. If I can't replicate your results, it's not science.

    • I'd say that if they want the data to be publicly accessible without restriction, they should make the published journals publicly accessible without restriction.

  • by jpellino ( 202698 ) on Tuesday February 25, 2014 @05:40PM (#46339225)

    Will cut a lot of nonsense out of reading stuff into the results.

    • Yes, but it will increase the amount of reading necessary for each paper by orders of magnitude.

  • Tables with improvement percentage readings will be less excellent.

  • by RogueWarrior65 ( 678876 ) on Tuesday February 25, 2014 @05:49PM (#46339307)

    And not just the data that was cherry-picked to support the hypothesis?

  • by Kaz Kylheku ( 1484 ) on Tuesday February 25, 2014 @05:52PM (#46339333) Homepage

    Public results? Anyone can take your work and use it for something profitable, while you scrape for grants to continue.

  • by PvtVoid ( 1252388 ) on Tuesday February 25, 2014 @05:53PM (#46339345)
    Actually, the Obama administration has mandated open data [whitehouse.gov] for all federally supported research. Good news indeed.
  • Awesome. Simply awesome
  • by Anonymous Coward

    It would be nice also if journals got on the bandwagon and accepted open formats (OpenDocument) instead of proprietary file formats like .doc and not fully open formats like .docx.

  • good and bad (Score:4, Interesting)

    by eli pabst ( 948845 ) on Tuesday February 25, 2014 @06:04PM (#46339447)
    Will be interesting to see how this is balanced with patient privacy, in particular with the increasing number of human genomes being sequenced. I know a large proportion of the samples I work with in the lab have restrictions on how the data can be used/shared due to the wording of the informed consent forms. Many would certainly not allow public release of their genome sequence, so publishing in PLOS (or any other journal with this policy) would be impossible. So while I think the underlying principle is good, an unintended consequence might be less privacy for patients wanting to participate in research (or fewer patients electing to participate at all).
    • Re:good and bad (Score:4, Informative)

      by canowhoopass.com ( 197454 ) <rod.canowhoopass@com> on Tuesday February 25, 2014 @06:13PM (#46339505) Homepage
      The linked blog specifically mentions patient privacy as an allowable exception. They also have exceptions for private third-party data and endangered-species data. I suspect they want to keep the GPS locations of white rhinos hidden.
    • I work with data collected by others, and those others are typically rather protective of their data for commercial reasons. I can use them for scientific purposes, but I'm not allowed to publish them in raw form. For most of these data there are no alternatives. I'd much rather publish everything of course, but that's impossible in this case, so I wonder if that means that I can't publish in PLOS any more now?

      Just to be clear, I applaud this move; we should be publishing the data, plus software and such, with our papers.

  • by Bueller_007 ( 535588 ) on Tuesday February 25, 2014 @06:10PM (#46339489)

    This is bad news for ecologists and others with long-term data sets. Some of these data sets require decades of time and millions of dollars to produce, and the primary investigators want to use the data they've generated for multiple projects. Current data licensing for PLOS ONE (and, as far as I know, all others who insist on complete data archiving) means that when you publish your data set, it is out there for anyone to use for free for any purpose they wish, not just for verification of the paper in question. There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

    Requiring full accessibility of data makes many people reluctant to publish in such a journal, because it means giving away the data they were planning to use for future publications. A scientist's publication list is linked not only to their job opportunities and their pay grade, but also to the funding they can get for future grants. And of course those grants are linked to continuing the funding of the long-term project that produced the data in the first place.

    What is needed is a new licensing model for published data that says "anyone is free to use these data to replicate the results of the current study, however it CANNOT be used as a basis for new analyses without written consent of the primary investigator of this paper or until [XX] years after publication." Journals would also need to agree that they would not accept any publications based on data that was used without consent.

    It seems to me that this arrangement would satisfy the need to get data out into the public domain while respecting the scientists who produced it in the first place.

    • by JanneM ( 7445 ) on Tuesday February 25, 2014 @06:29PM (#46339669) Homepage

      On the other hand, if I don't have your data I can't check your results. If you want to keep your data secret for a decade, you really should plan to not publish anything relying on it for that time either. Release all the papers when you release the data.

      Also, who gets to decide when a study is a replication and when it is a new result? Few replication attempts are doing exactly the same thing as the original paper, for good reason. If you want to see if it holds up you want to use different analysis or similar anyway. And "use" data? What if another group produces their own data and compares with yours? Is that "using" the data? What if they compare your published results? Is that using it?

      A partial solution, I think, is for a group such as yours to pre-plan the data use already when collecting it. So you decide from start to publish a subset of that data early and publish papers based on that. Then publish another subset for further results and so on.

      But what we really need is for data to be fully citable: a way to publish the data as a research result by itself - perhaps the data together with a paper describing it (but not any analysis). Anyone is free to use the data for their own research, but will of course cite you when they do. A good, serious data set can probably rack up more citations than just about any paper out there. That will give the producers the scientific credit they deserve.

      • Release all the papers when you release the data.

        Not going to happen. You need to publish during the data collection period in order to continue getting the funding you need for data collection.

        Few replication attempts are doing exactly the same thing as the original paper, for good reason.

        Right, but replication of the experiment is the EXACT reason that we're making the data available. If you want to use the data for something else, that's fine, but if it's data that the original author is still using, then you should contact them about it first.

        A partial solution, I think, is for a group such as yours to pre-plan the data use already when collecting it. So you decide from start to publish a subset of that data early and publish papers based on that. Then publish another subset for further results and so on.

        Again, this is not realistic in the overwhelming majority of cases. One of the benefits of long-term studies is precisely that you cannot predict in advance which analyses the data will eventually support.

    • by Arker ( 91948 )
      "What is needed is a new licensing model for published data that says "anyone is free to use these data to replicate the results of the current study, however it CANNOT be used as a basis for new analyses without written consent of the primary investigator of this paper or until [XX] years after publication." "

      I could not disagree more.

      What is needed here is to deal with the real problem - the issues that force working scientists into a position where doing good science (publishing your data) can harm your career.
      • hear hear!!!

      • This is a tall order, since scientists are held to a much higher standard than capitalists, and are consequently always at a disadvantage. Scientists are expected to give away the product of their labor for free for all to use as they wish, but others are permitted to extract all the profits they can from the scientist's work, without any of the funds flowing directly to the scientist who generated the data in the first place. One might ask why government contractors aren't likewise expected to give away their work.

        • Other people give you money for research. You want the profit? Fund your own work, then.
          • You missed my point entirely. Many scientists work for the public and it is the public who pays, yet it is typically a handful of capitalists who profit. Why should scientists be expected to share, but not those capitalists who get to profit from the work of others?

            Perhaps the solution is for scientists to simply patent and copyright everything themselves. Now that there is electronic publishing, except for reviews is there really a need to pay publishers seven-figure salaries just to gather up the work of others?

    • "There are plenty of scientists out there who poach free online data sets and mine them for additional findings."

      Right. This leads to a two-class system where the scientists that collect the data (and understand the techniques and limitations) are treated as technicians while those that perform high-level analysis of others' results get the publications. This can lead to unsound, unproductive science in many cases: those who understand the details are not motivated, and the superficial understanding of those doing the analysis leads to mistakes.

      • This leads to a two-class system where the scientists that collect the data (and understand the techniques and limitations) are treated as technicians while those that perform high-level analysis of others' results get the publications.

        Maybe in some fields, but in genomics and molecular biology, the result tends to be exactly the opposite: the experimentalists (and their collaborators) get top-tier publications, while the unaffiliated bioinformaticists mostly publish in specialty journals.

    • by the gnat ( 153162 ) on Tuesday February 25, 2014 @07:36PM (#46340261)

      Some of these data sets require decades of time and millions of dollars to produce, and the primary investigators want to use the data they've generated for multiple projects. . . There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

      I work in a field (structural biology) that had this debate back when I was still in grade school: the issue was whether journals should require deposition of the molecular coordinates in a public database, and later, whether these data should be released immediately on publication or the authors could keep them private for a limited time. The responses at the time were very instructive: one of the foremost proponents of data sharing was accused of trying to "destroy crystallography as we know it", to which his response was yes, of course, but how was that a bad thing? Skipping to the punchline: nearly every journal now requires release of coordinates and underlying experimental data immediately upon publication, and in the meantime the field has grown exponentially and there have been at least six Nobel prizes awarded for crystallography (at least one of which went to an early opponent of data sharing). The top-tier journals (Science, Nature) average about a paper per week reporting a new structure. Not only did the predicted dire consequences never happen, the availability of a large collection of protein structures has actually accelerated the field by making it easier to solve related structures (and easier to test new methods), and facilitated the emergence of protein structure prediction and design as a major field in its own right.

      The question I'm worried about: what form do the data need to take? Curating and archiving derived data (coordinates and structure factors) is already handled by the Protein Data Bank, but the raw images are a few orders of magnitude larger, and there is no public database available. Most experimental labs simply do not have the resources to make these data easily available. (The exceptions are a few structural genomics initiatives with dedicated computing support, but those are going away soon.)
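      For what it's worth, this is what "derived data in a stable public repository" already looks like in practice - a minimal sketch fetching a deposited PDB entry (4HHB, hemoglobin, an arbitrary well-known example; URL scheme per files.rcsb.org as of this writing):

          import urllib.request

          # Fetch the deposited coordinates for a published structure from the PDB.
          url = "https://files.rcsb.org/download/4HHB.pdb"
          with urllib.request.urlopen(url) as resp:
              pdb_text = resp.read().decode("utf-8")
          print(pdb_text.splitlines()[0])   # HEADER line of the entry

      Raw diffraction images have no equivalent of this today, which is exactly the gap the parent comment worries about.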

    • This is preposterous. Unless you self-funded your work, you don't own the data. The people who give out grants don't intend it for you to spend for your own benefit.
    • There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

      And this is a good thing, despite your word "poach". Analyses which would not have occurred to the original experimenters get done, and we get more science for our money. For many big data projects (e.g. the human genome project, astronomical sky surveys), giving 'poaching' opportunities is the primary purpose of the project.

      A former boss of mine once, when reviewing a paper, sent a response which

    • by g01d4 ( 888748 )

      There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

      I think the additional findings are part of what science is all about. How do scientists 'poach' something that's free? Did you think waiting many decades [wikipedia.org] for the Dead Sea Scroll results was acceptable?

      If data is that expensive to collect, then its collection and publication should rank as an end in itself.

    • This may seem naive, but are you seriously telling us, the Slashdot crowd, that you shouldn't have to release your data because *gasp* someone might do SCIENCE with it? And your suggestion is to impose a copyright/licensing scheme on it? I'm a bit surprised I'm the only person commenting on this. I do see your "continuing funding, job opportunities and pay grade" - but if everyone is doing it, then PERHAPS things might change?
    • by hsu ( 970167 )

      The whole point of science is creating new information, and digging further into raw data, correlating it with other data, and finding relationships and further information is very important. Monopolizing the data for one party will reduce the amount of research that would benefit from that particular data set, as no one beyond the original creators will be allowed to work on it. Keeping it to yourself might bring the short-sighted benefit of getting another grant, but it definitely hurts science in the long run.

  • by umafuckit ( 2980809 ) on Tuesday February 25, 2014 @06:18PM (#46339535)
    Standard policy. Nature have been doing this [nature.com] for some time. They state: "authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications." So have Cell Press [cell.com] and Science [sciencemag.org]. I stopped searching at this point, but I'm sure other major journals do the same thing.
    • Ok, sorry, I see they want the data deposited upon publication.
    • by Anonymous Coward

      And many scientists that get published in these high profile journals are scofflaws when it comes to sharing... It's been covered many times but compliance is near zero.

      • I'd agree with that. I once tried, very politely, to get data from the authors of an NPG paper. They stalled and it became awkward. In the end I gave up because my interest was purely motivated by curiosity and I didn't want to make an enemy (even if the person in question was in a different field). Glad I backed off now, as I've ended up moving into that field...
  • by hubie ( 108345 ) on Tuesday February 25, 2014 @06:53PM (#46339895)
    one of the most prolific publishers of research papers in the world.

    Their journals aren't in my field (they are all bio journals), so I have not heard of them, but is it true that they are that big? Their web site [plos.org] wasn't much help in terms of information on subscriptions or article numbers, or I simply missed it. Can anyone familiar with them provide any input?

    Their data policy might work for the biosciences, but good luck requiring all the many TB of raw data from a particle physics experiment to be put up somewhere. And in some instances, like that one, the raw data will most likely be useless without knowing what it all means, what the detectors were, what the detector responses are, etc. etc. etc. For experiments where it takes man-months or man-years to collect and process the data, making it all available in raw format will largely be a waste of time.

    In general, at least for experiments done in the lab with specialized equipment, raw data will not be very useful if you don't understand what you're collecting or aren't familiar with the equipment. You can end up with situations like that guy who took the Mars rover images and kept zooming in until he saw a life form.

  • Well, that really wraps it up for the global warming crowd.
    If their source data has to be publicly accessible, it'll be laughed off the stage before their "studies" get any traction.

    • Yup because IT'S A CONSPIRACY!

      Right? That's what Exxon Mobil and Fox News tell me...

    • We won't know the result BUT yeah, finally researchers will have to really provide transparency on their work.

      That works both ways though. Now Exxon et al will also have to show their justifications with hard numbers whose origins are clearly replicable.

    • Hardly, when you consider that the results are typically posted every day online and in the newspapers. It's not as if world temperature data is being kept secret. Are you suggesting that scientists are hiding temperature data from the public? Surely you must be joking.

  • There is a great deal of science, and public policy, that would benefit from public exposure. But medical and sociological research benefits from the privacy of the subjects, who then feel more free to be truthful. The same is true of political survey data, and "anonymizing" it can be a lengthy, expensive, and uncertain process, especially when coupled with the various metadata collected with the experiments or in parallel with them. It can also be very expensive to make data public, even without privacy concerns.

  • "This is good news for replicating experiments, building on past results, and science in general."

    It is, unless the data can't be made "publicly available, without restriction" (very important emph. added), in which case you can't publish there. Yes, there are others, but demanding that all restrictions be dropped in all cases is simply an approach blind to reality. Also, if they demand this, they must provide free storage, which in some cases could run to many GB of data - and you won't want to pay for indefinite hosting yourself.
  • Rather than publishing on proprietary data of uncertain characteristics, this will essentially force researchers to use common, known, and available data sets. A smattering of what's available and reputable:

    http://www.itl.nist.gov/div898... [nist.gov]
    http://www.keypress.com/x2814.... [keypress.com]
    http://lib.stat.cmu.edu/DASL/ [cmu.edu]
    http://www.statsci.org/dataset... [statsci.org]
    http://data.gc.ca/eng/facts-an... [data.gc.ca]
    http://library.med.cornell.edu... [cornell.edu]
