
Scientific Data Disappears At Alarming Rate, 80% Lost In Two Decades

Posted by samzenpus
from the here-today-gone-tomorrow dept.
cold fjord writes "UPI reports, 'Eighty percent of scientific data are lost within two decades, disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated. The finding comes from a study tracking the accessibility of scientific data over time, conducted at the University of British Columbia. Researchers attempted to collect original research data from a random set of 516 studies published between 1991 and 2011. While all data sets were available two years after publication, the odds of obtaining the underlying data dropped by 17 per cent per year after that, they reported. "Publicly funded science generates an extraordinary amount of data each year," UBC visiting scholar Tim Vines said. "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate."' — More at The Vancouver Sun and Smithsonian."
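To make the headline numbers concrete: a 17 per cent per year decline in the *odds* of obtaining a data set is not the same as losing 17 per cent of the data each year. The sketch below converts decaying odds back into a probability. The initial odds value is a made-up assumption chosen so the curve roughly matches the "80% lost in two decades" headline; the study reports the decline rate, not a starting value.

```python
# Back-of-the-envelope sketch of the decay model in the summary:
# the odds of a data set still being obtainable fall 17% per year
# after the first two years post-publication.

def availability(years_after_publication, initial_odds=7.0, annual_decline=0.17):
    """Probability a data set is still obtainable, assuming odds that
    shrink by `annual_decline` each year after year two.
    `initial_odds` is an illustrative guess, not a figure from the paper."""
    decay_years = max(0, years_after_publication - 2)
    odds = initial_odds * (1 - annual_decline) ** decay_years
    return odds / (1 + odds)  # convert odds back to a probability

for t in (2, 5, 10, 15, 20):
    print(f"year {t:2d}: {availability(t):.0%} of data sets still obtainable")
```

With these assumed starting odds, availability falls to roughly one in five data sets by year 20, which is where the "80% lost" figure comes from; the exact curve depends on the starting odds, which the summary does not give.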
  • Lifecycle management (Score:5, Interesting)

    by FaxeTheCat (1394763) on Friday December 20, 2013 @04:15AM (#45743887)
    So the institutions do not have any data lifecycle management for research data. Are we supposed to be surprised? Ensuring that data are not lost is a huge undertaking and cannot be left to the individual researcher. It may also require a change in the research culture at many institutions. As long as research is measured by the publications, that is where the resources go and where the focus will be.

    Will this change? Probably not.
  • by ron_ivi (607351) <[moc.secivedxelpmocpaehc] [ta] [ontods]> on Friday December 20, 2013 @04:18AM (#45743897)
... poorly collected unreliable data also vanishes at at least the same rate (hopefully faster). If shoddy data disappears faster than good data, then the quality of the data that remains should continually increase.
  • is/are (Score:5, Interesting)

    by _Ludwig (86077) on Friday December 20, 2013 @05:17AM (#45744029) Journal

    Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.

    Whichever side of the "data is" vs. "data are" argument one falls on, I hope we can all agree that mixing both forms within the same sentence is definitely wrong.

  • Re:Concerning... (Score:4, Interesting)

    by clickclickdrone (964164) on Friday December 20, 2013 @05:18AM (#45744031)
    That's still 100 years which is a lot better than the data being talked about here.

There was a documentary on the radio this week about the loss of letter writing as a form, and how alarmed biographers are getting because it's becoming very hard to trace someone's life, thoughts, actions etc. without a paper trail, as emails, digital photos and the like generally get lost when someone dies.

Personally, I find the increasing rate of loss quite alarming - so much of our lives is digital and so little is properly curated with a view to future access. We know so much about the past from old documents, often hundreds if not thousands of years old, but these days we're hard pushed to find something published ten years ago.
  • Re:Concerning... (Score:5, Interesting)

    by serviscope_minor (664417) on Friday December 20, 2013 @05:32AM (#45744057) Journal

    Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing?

Probably not that much. I'm not claiming this is good, but I don't think it's as bad as it appears.

If a paper is unimportant and more or less sinks without a trace (perhaps a handful of citations), then the data is probably of no importance, since someone is unlikely to ever want it. Papers tend to get more obscure over time and also get superseded.

For important papers, losing the raw data matters less than it seems: if a paper is important, it will establish some technique or result. Within 20 years people will generally have reanalysed the data and, if the result matters enough, independently verified it. After 20 years I think the community will have moved on and the result will either be established or discredited.

I think the exception is for things that are "hard" to find or non-repeatable, such as fossil finds. Then again, the Natural History Museum has boxes and boxes and boxes of the things in the back room. They still haven't gotten round to sorting all the fossils from the Beagle yet (this is not a joke or rhetoric: I know someone who worked there).

    So my conclusion is that it's not really great that the data is being lost, but it's not as bad as it initially sounds.

  • Lost forever (Score:3, Interesting)

    by The Cornishman (592143) on Friday December 20, 2013 @05:53AM (#45744127)
    > many other data sets are expensive to regenerate...
    Or maybe impossible to regenerate (for certain values of impossible). I remember reading a classified technical report (dating from the 1940s) related to military life-jacket development, wherein the question arose as to whether a particular design would reliably turn an unconscious person face-up in the water. The experimental design used was to dress some servicemen (sailors, possibly, but I don't recall) in the prototype design, anaesthetise them and drop them in a large body of water, checking for face-down floaters to disprove the null hypothesis. Somehow, I don't think that those data are going to be regenerated any time soon. I hope to God not, anyway.
  • Re:Concerning... (Score:5, Interesting)

    by Anonymous Coward on Friday December 20, 2013 @06:42AM (#45744251)

I designed and built the equipment for scientific experiments that will never be repeated: cochlear implant stimulation of one ear, done in an MRI. This was safe because the older implant technology had a jack that stuck out of the subject's head, which we could connect to electronics outside the MRI itself. But the old "Ineraid" implants have been replaced, clinically, with implants using embedded electronics and usually magnets. Those are hideously unsafe to even bring into the same *room* as an MRI, much less actually scan the brain of a person wearing one.

So that experiment is unlikely to ever be repeated. Losing the data, and losing the extensive clinical records of those subjects, would be an immense loss to science. Especially valuable is the historical data from decades of testing on these subjects, which shows the long-term effects of their implants and of different redesigned external stimulators. That data is scientifically priceless. When I started that work, we used mag-tape for data and scientific notebooks for recording measurements. I helped reformat and transfer that data to increasingly modern storage devices several times. We went through 3 different types of storage media in 10 years, and I remember having to write software to allow Exabyte drives to find the end of the tape and add data. (Exabytes had no End-Of-Tape marker.) Preserving that data.... was a lot of work.

  • Re:Concerning... (Score:2, Interesting)

    by jabuzz (182671) on Friday December 20, 2013 @07:17AM (#45744355) Homepage

No it won't, because in 20-30 years we will be able to use gene therapy to "grow" or "regrow" the stereocilia, and hence cochlear implants will be considered as barbaric as medieval bloodletting. Consequently the data will only be of obscure historical interest.

  • by QilessQi (2044624) on Friday December 20, 2013 @09:16AM (#45744787)

    The Long Now Foundation has devised an interesting mechanism for storing important information which, although not optimal for machine readability, is dense and has an obvious format: a metal disk etched with microprinting, whose exterior shows text getting progressively smaller as an obvious way of saying "look at me under a microscope to see more":

    http://rosettaproject.org/ [rosettaproject.org]

    I highly recommend reading The Clock of the Long Now if you're interested in the theory and practice of making things last.

  • by n1ywb (555767) on Friday December 20, 2013 @09:29AM (#45744847) Homepage Journal
No, but it is amazing what NEW science you can do with OLD data. I've worked with the Transportable Array project, for example: http://www.usarray.org/researchers/obs/transportable [usarray.org]. It's over a decade old, and scientists are still discovering new ways to take advantage of the data and will likely be doing so for decades to come. On the other hand, a lot of data is just junk due to poor-quality metadata; when was that instrument calibrated? I dunno. Damn. At least in geophysics we have the National Geophysical Data Center to curate this stuff: http://www.ngdc.noaa.gov/ [noaa.gov]. At least until Congress cuts its funding.
