Science

Scientific Data Disappears At Alarming Rate, 80% Lost In Two Decades 189

cold fjord writes "UPI reports, 'Eighty percent of scientific data are lost within two decades, disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated. The finding comes from a study tracking the accessibility of scientific data over time, conducted at the University of British Columbia. Researchers attempted to collect original research data from a random set of 516 studies published between 1991 and 2011. While all data sets were available two years after publication, the odds of obtaining the underlying data dropped by 17 per cent per year after that, they reported. "Publicly funded science generates an extraordinary amount of data each year," UBC visiting scholar Tim Vines said. "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.' — More at The Vancouver Sun and Smithsonian."
This discussion has been archived. No new comments can be posted.

  • And in 20 years... (Score:5, Insightful)

    by Anonymous Coward on Friday December 20, 2013 @03:04AM (#45743845)

    And in 20 years, these results too shall be lost.

    • by Z00L00K ( 682162 )

      Unless it's published in a newspaper or magazine that is widespread. But printed matter seems to be in decline.

      • by queazocotal ( 915608 ) on Friday December 20, 2013 @07:12AM (#45744523)

        That's not the point.
        The actual published results - even if published in an obscure journal - tend to stick around _much_ longer.

        Even old journals that go out of publication get their archives and the rights to distribute them bought up, as there is some small amount of value there, in addition to the copies in the various reference libraries around the world.

        The problem is that if you are wondering about the graph on page 14 that the whole paper rests on, you can't get the original data to recreate it.

        This is a major problem because the only way to check that graph is now to redo the whole experiment.

    • Well, they're currently behind a paywall, so I don't see how most of us were even supposed to find them in the first place.

  • lulz (Score:3, Funny)

    by Anonymous Coward on Friday December 20, 2013 @03:05AM (#45743853)

    thats okay, the nsa has a backup

  • Concerning... (Score:5, Insightful)

    by Adam Colley ( 3026155 ) <mog@ k u p o .be> on Friday December 20, 2013 @03:09AM (#45743873)

    Trying to ignore that a paper about the unavailability of scientific data is locked behind a paywall.

    This is nothing new, though. I do occasional conversions from ancient data formats; people need to pay better attention. Imagine trying to read an 8" CP/M floppy today.

    As libraries move to digital storage rather than the dead tree that's been fine for thousands of years, they are inviting a catastrophe - possibly only one well-aimed coronal mass ejection away from massive data loss.

    • Re:Concerning... (Score:5, Insightful)

      by Dutch Gun ( 899105 ) on Friday December 20, 2013 @03:26AM (#45743915)

      Paper has its own issues. Talk to me about the durability of paper after you recover the books lost throughout time to natural decay, burning (intentional or otherwise), floods, wars, and social forces (politics, religion, etc.). Digital data can be easily copied and archived (when not behind a paywall, of course). It seems to me that redundancy is the best form of insurance against data loss. A coronal mass ejection is not going to wipe out every computer with a copy of important data on it, and all the relevant backups. And if it does, we're probably in a lot more trouble for reasons other than losing some scientific research.

      Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing? I'm only half joking, of course, since it's difficult to ascertain the value of research immediately in some cases, but wouldn't it stand to reason that any important or groundbreaking research will naturally be widely disseminated, and thus protected against loss?

      • Re:Concerning... (Score:5, Insightful)

        by Eunuchswear ( 210685 ) on Friday December 20, 2013 @04:03AM (#45743993) Journal

        Digital data can be easily copied and archived

        Can be. But mostly isn't.

      • Re:Concerning... (Score:5, Interesting)

        by serviscope_minor ( 664417 ) on Friday December 20, 2013 @04:32AM (#45744057) Journal

        Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing?

        Probably not that much. I'm not claiming this is good, but I don't think it's as bad as it appears.

        If a paper is unimportant and more or less sinks without a trace (perhaps a handful of citations), then the data is probably of no importance, since someone is unlikely to ever want it. Generally this is because papers tend to get more obscure over time and also get superseded.

        For important papers, the data just isn't as crucial: if a paper is important then it will establish some technique or result. In 20 years people will generally have already reanalysed the data, and likely also independently verified the result if it is important enough. After 20 years I think the community will have moved on and the result will either be established or discredited.

        I think the exception is for things that are "hard" to find or non-repeatable, such as fossils. Then again, the Natural History Museum has boxes and boxes and boxes of the things in the back room. They still haven't gotten round to sorting all the fossils from the Beagle yet (this is not a joke or rhetoric: I know someone who worked there).

        So my conclusion is that it's not really great that the data is being lost, but it's not as bad as it initially sounds.

        • Re:Concerning... (Score:5, Interesting)

          by Anonymous Coward on Friday December 20, 2013 @05:42AM (#45744251)

          I designed and built the equipment for scientific experiments that will never be repeated: cochlear implant stimulation of one ear, done in an MRI. This was safe because the older implant technology had a jack that stuck out of the subject's head, which we could connect to electronics outside the MRI itself. But the old "Ineraid" implants have been replaced, clinically, with implants using embedded electronics and usually magnets. Those are hideously unsafe to even bring in the same *room* as an MRI, much less actually scan the brain of a person wearing one.

          So that experiment is unlikely to ever be repeated. Losing the data, and losing the extensive clinical records of those subjects, would be an immense loss to science. Especially valuable is the historical data from decades of testing on these subjects, which shows the long-term effects of their implants and of different types of redesigned external stimulators. That data is scientifically priceless. When I started that work, we used mag-tape for data, and scientific notebooks for recording measurements. I helped reformat and transfer that data to increasingly modern storage devices several times. We went through 3 different types of storage media in 10 years, and I remember having to write software to allow Exabyte drives to find the end of the tape and add data. (Exabytes had no End-Of-Tape marker.) Preserving that data.... was a lot of work.
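          For the curious, the end-of-tape search described above boils down to skipping forward until the drive stops returning data. A minimal sketch of the idea, assuming a Linux st-style tape device where reads at a filemark return zero bytes; the device path and block size are hypothetical, and this is not the poster's actual software:

```python
import os

# Sketch only: on a tape format with no end-of-tape marker, scan
# forward until two consecutive filemarks (zero-length reads on the
# Linux st driver), then append after the last file.
# /dev/nst0 is a hypothetical non-rewinding tape device.
BLOCK = 64 * 1024

fd = os.open("/dev/nst0", os.O_RDWR)
empty_reads = 0
while empty_reads < 2:               # two filemarks in a row ~ end of data
    empty_reads = empty_reads + 1 if not os.read(fd, BLOCK) else 0

# A real tool would backspace over the final filemark (the MTBSF ioctl)
# before writing, so the new records extend the existing data area.
os.write(fd, b"appended record")
os.close(fd)
```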

          • Re: (Score:2, Interesting)

            by jabuzz ( 182671 )

            No it won't, because in 20-30 years we will be able to do gene therapy to "grow" or "regrow" the stereocilia, and hence cochlear implants will be considered as barbaric as medieval bloodletting. Consequently the data will only be of obscure historical interest.

            • Re:Concerning... (Score:5, Insightful)

              by Lisias ( 447563 ) on Friday December 20, 2013 @07:06AM (#45744497) Homepage Journal

              Wishful thinking.

              Let's make a deal: *first*, the gene therapy works. *THEN* we assume we can afford to lose the data the grandparent talks about.

            • Why would a cochlear implant ever be considered as barbaric as medieval bloodletting? The implants aren't perfect, but they provide a huge increase in the quality of life for a large number of patients. A potentially better solution that's decades down the line doesn't make a currently effective treatment barbaric...
          • by dj245 ( 732906 )

            We went through 3 different types of storage media in 10 years, and I remember having to write software to allow Exabyte drives to find the end of the tape and add data. (Exabytes had no End-Of-Tape marker.) Preserving that data.... was a lot of work.

            Do what everybody else does. Encrypt it using a strong password, then upload it to The Pirate Bay or the Semi-centralized Filesharing Platform Which Shall Not Be Named, and call it "insurance file xxxxx".

        • Re:Concerning... (Score:5, Insightful)

          by Teun ( 17872 ) on Friday December 20, 2013 @07:02AM (#45744491)
          In the nineties I had a friend working for a company that bought a lot of old Soviet geophysical data.

          It needed some very special transcription technology but once in the clear and fed to modern 3D seismic software it revealed a lot more than the original reports gave.

          Retaining old reports is nice, retaining old raw data even nicer.

      • Problem is that this data is funded and paid for while the research is active. After it's done, the data is put in the closet, so to speak. No one is being paid to go back every 5 years and re-read the data to convert it to a new media format. Paper doesn't help because it can't store all this data anyway.

        Ie, satellite collects lots of data on solar activity, it all gets stored onto 9 track tape drives, which then end up on a minicomputer for data analysis, image enhancement, etc. Project funded through gove

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      The problem is not just an issue of digital storage, but also a problem of redundancy.

      In the "old days", people understood and accepted the risk that a paper copy would be lost. In fact, it was a GIVEN that they would eventually be lost (or damaged or misplaced or stolen or checked out and simply never returned). So multiple copies were kept because centuries of experience dictated that some copies would be lost no matter how strong, carefully maintained and well preserved the originals were.

      Nowadays, peopl

    • Well, dead tree has its own issues. Try finding a book written and published in 1910. Most likely you won't. The paper is so fragile that it has to be specially sealed to survive. Rag paper, on the other hand, still looks good for its age.
      • Re:Concerning... (Score:4, Interesting)

        by clickclickdrone ( 964164 ) on Friday December 20, 2013 @04:18AM (#45744031)
        That's still 100 years which is a lot better than the data being talked about here.

        There was a documentary on the radio this week about the loss of letter writing as a form, and how alarmed biographers are getting, because it's getting very hard to trace someone's life, thoughts, actions etc. without a paper trail, as stuff like emails and digital photos generally gets lost when someone dies.

        Personally, I find the increasing rate of loss quite alarming - so much of our lives is digital, and so little is properly curated with a view to future access. We know so much about the past from old documents, often hundreds if not thousands of years old, but these days we're hard pushed to find something published ten years ago.
      • Re:Concerning... (Score:5, Informative)

        by _Shad0w_ ( 127912 ) on Friday December 20, 2013 @04:24AM (#45744045)

        I'd go to one of the British deposit libraries and ask to see their copy; deposit libraries have existed since the Statute of Anne in 1710. The British Library has 28,765 books and 1,480 journals in its catalogue from 1910...

      • Re:Concerning... (Score:5, Informative)

        by clickclickdrone ( 964164 ) on Friday December 20, 2013 @05:10AM (#45744165)
        As an extreme case, the BBC has reported that scrolls from Pompeii and Herculaneum 'destroyed' by Vesuvius are now starting to reveal their secrets, using some pretty impressive techniques. http://www.bbc.co.uk/news/magazine-25106956 [bbc.co.uk]
      • I have a collection of books on my shelf that date from the 1870s, all in top condition and never been stored in any special way.

    • by dbIII ( 701233 )

      imagine trying to read an 8" CP/M

      If I didn't already know of two companies that could do that I'd look in the yellow pages. I get your point though and there are older or rarer formats than that which would require a bit of legwork or possibly even reverse engineering.

    • Re:Concerning... (Score:5, Insightful)

      by martin-boundary ( 547041 ) on Friday December 20, 2013 @05:10AM (#45744171)

      This is nothing new, though. I do occasional conversions from ancient data formats; people need to pay better attention. Imagine trying to read an 8" CP/M floppy today.

      It's not that it's a new problem as such; it's that for the first time in history we have a simple way to solve it, yet we have stupid greedy rich people who sponsor and enact laws to stop us from solving it.

      The way to solve the problem is through massive duplication of all the data, over and over again through time. We have the technical means to do this on an unprecedented scale.

      Even 1000 years ago, people had to painstakingly copy books, by hand, one at a time. And after a handful of copies were produced, there still weren't enough to guarantee that most would survive the ages, wars, fires, censorship, etc. So we generally have tiny collections from the past.

      But now it's digital data. Anyone could copy it. We could have millions of copies of some obscure scientific work, all perfect duplicates. If even 0.1% of these copies survive, that's still thousands of copies.

      And what do we do? We let a bunch of 1 percenters, who themselves barely know how or care to read, sponsor draconian copyright laws to stop everyone from copying all that stuff, just on the off chance that they might copy a bunch of songs or movies that are outmoded within two years. And the commercial scientific publishers are some of the worst.

      It's pathetic.

      • We let a bunch of 1 percenters, who themselves barely know how or care to read, sponsor draconian copyright laws to stop everyone from copying all that stuff, just on the off chance that they might copy a bunch of songs or movies that are outmoded within two years. And the commercial scientific publishers are some of the worst.

        Commercial scientific publishers do indeed tend to be bottom-feeders, but if I'm understanding the article correctly, they're not the root cause here - the issue is not that articles are

    • by Yvanhoe ( 564877 )
      On this fight, Aaron Swartz came very close to making the whole world totally different.
    • Re:Concerning... (Score:4, Insightful)

      by bfandreas ( 603438 ) on Friday December 20, 2013 @05:25AM (#45744203)
      The combination of insane copyright claims and the overreliance on comparatively volatile storage technology is steering us directly into another dark age.
      That's one take on things.
      On the other hand, we have already lost so much stuff over the centuries that perhaps what I just said is idiotic alarmism. After all, we rebuilt western civilisation after the fall of Rome (that just took the Dark Ages), and we didn't all die off after the Great Library of Alexandria burned down. The stuff that gets replicated often will probably not be lost. But let's hope it isn't a retweet of Miley Cyrus' knickers.
    • the dead tree that's been fine for thousands of years

      Not so fine.... The Alexandria library fire was perhaps the most catastrophic loss of human knowledge ever. For example, it destroyed the details of a heliocentric theory postulated by the Greek astronomer Aristarchus of Samos, millennia before Copernicus brought it to the mainstream.

      • Just because it was sitting around in a library is no guarantee anything would have happened with it.

        • Fair enough, it was just one piece of a massive collection of knowledge lost.

          I was just questioning the claim that paper has "been fine for thousands of years".

          While paper (if carefully stored and looked after) is more durable than any digital media invented so far, you can't ignore the many advantages of digital media: extremely easy and quick to copy with no loss of quality, possibility to do that remotely, far less bulky...

          • paper (if carefully stored and looked after) is more durable than any digital media invented so far

            Except for punched cards and maybe other extremely bulky storage devices, now that I think about it.

    • Some years ago I picked up a copy of "Dark Ages II -- When the Digital Data Die" by Bryan Bergeron (2002), but only now have gotten around to finishing it (for some reason I never got past the first chapter at the time). When I bought it I had just had my own experience with the not-so-long life of digital data (some CDs I'd burned a few years earlier were already unreadable). The book's a bit dated (it says that there are many people out there with Zip drives connected to their PCs) as, obviously techn

      • Personally, I think we're just fine if everything is converted to bits and we remember that there is no guarantee that a set of bits won't be damaged or lost.

        Just like books - there's no guarantee those won't fall victim to water damage, or fire, etc. You have to take care of it, and guard against the applicable failure modes. With digital data this is just as possible, but the techniques are as different as the failure modes.

  • Lifecycle management (Score:5, Interesting)

    by FaxeTheCat ( 1394763 ) on Friday December 20, 2013 @03:15AM (#45743887)
    So the institutions do not have any data lifecycle management for research data. Are we supposed to be surprised? Ensuring that data are not lost is a huge undertaking and cannot be left to the individual researcher. It may also require a change in the research culture at many institutions. As long as research is measured by the publications, that is where the resources go and where the focus will be.

    Will this change? Probably not.
    • Re: (Score:2, Troll)

      by TubeSteak ( 669689 )

      Vines is calling on scientific journals to require authors to upload data onto public archives as a condition for publication.

      If authors put their data into the public sphere, people might notice how much of it is fudged.

      • The answer to both problems is to publish everything in the Journal of Irreproducible Results

      • by N1AK ( 864906 )
        Even more reason for us to want it put there. Publishing research based on falsified information should be a pretty major crime and shouldn't be tolerated. It misleads the public, wastes scientists' time trying to build on it, etc.
  • Precisely (Score:2, Insightful)

    by Anonymous Coward

    This is bang on. As a system administrator for a STEM department at a Canadian institution, my budget is 0 for data retention. Long term data retention is just not in the mindset of researchers.

    • One of the places that I've worked did various sorts of science / engineering type project work. Quarterly backups of filesystems were archived indefinitely. Even if the data was staying online, at the completion of every project an archive was made of the data on a minimum of two pieces of backup media along with various bits of metadata regarding the media and data. The archival copies were tested by restore and diffed before actually going into the archive. Of course they kept examples of the differen
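      The restore-and-diff step described above is easy to script. A minimal sketch, assuming one tar archive per project; the paths are hypothetical placeholders:

```python
import filecmp
import subprocess
import tempfile

PROJECT = "/data/project42"
ARCHIVE = "/archive/project42.tar"

# Write the archive, then prove it restores cleanly before the media
# goes on the shelf.
subprocess.run(["tar", "-cf", ARCHIVE, "-C", "/data", "project42"], check=True)

with tempfile.TemporaryDirectory() as scratch:
    subprocess.run(["tar", "-xf", ARCHIVE, "-C", scratch], check=True)
    cmp = filecmp.dircmp(PROJECT, f"{scratch}/project42")
    # Top-level comparison only; a production script would recurse and
    # hash file contents rather than trust stat signatures.
    if cmp.diff_files or cmp.left_only or cmp.right_only:
        raise SystemExit("archive failed verification")
```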

      • by Z00L00K ( 682162 )

        Just increase the disk array size and copy the data as it grows to larger and larger storage systems. Data that's offline is useless.

    • Hello,

      Our mindset at my research institution is very different. We generate a certain amount of data per year (several terabytes), but the cost of storage decreases so fast we just copy old data onto new media and never delete ANYTHING.

      In fact, we consider the cost of actually figuring out what data to delete to be higher than simply buying more storage.

      I would not call it "well-indexed" however.

      Our backup strategy is tailored to the nature

  • ...100% is retained for 2 years, and 17% is lost every year after that, then after 20 years I get about 3.5% of the data still being accessible, not 20%. WTF - or did someone lose the data for this study, and is the article really just a guess?
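    For reference, the compounding behind that 3.5% figure (assuming a flat 17% annual loss that starts at year 2):

```python
# 100% survives the first 2 years; 17% of what's left vanishes each
# year after that, so 20 years out only 0.83^18 of the data remains.
remaining = 0.83 ** (20 - 2)
print(f"{remaining:.1%}")  # -> 3.5%
```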
  • by ron_ivi ( 607351 ) <sdotno@NOSpAM.cheapcomplexdevices.com> on Friday December 20, 2013 @03:18AM (#45743897)
    ... poorly collected, unreliable data also vanishes at at least the same rate (hopefully faster). And assuming shoddy data disappears faster than good data, the quality of the available data should continually increase.
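    Purely as an illustration of that assumption (the loss rates below are invented for the example, not taken from the study), the surviving pool does skew toward good data over time:

```python
# Hypothetical two-population decay: good data lost at 10%/year,
# shoddy data at 25%/year, starting from an even 50/50 mix.
good, shoddy = 0.5, 0.5
for year in range(1, 21):
    good *= 0.90
    shoddy *= 0.75
    if year % 5 == 0:
        print(f"year {year:2d}: {good / (good + shoddy):.0%} of surviving data is good")
```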
    • That is a very unreasonable assumption.

      There are many entities with a vested interest in keeping alive the data that supports their point of view, their profit motive, or their meal ticket. For example, data collected meticulously by an underfunded biology professor about the allopatric speciation of the salamanders around the lake hole-in-the-mud would disappear in a jiffy. But flawed research supporting the efficacy of a patented clot-busting drug would be perpetuated. Epidemiological studies showing the adve

  • So...? (Score:4, Insightful)

    by Anonymous Coward on Friday December 20, 2013 @03:22AM (#45743909)

    I'm a researcher, and I don't have time or space to keep old data, as I'm generating too much new data. We work hard to maximize the use of these data and analyses when we write and publish papers. If this were about the papers (or presentations) that were the product of the data being lost at this rate, it would be one thing, but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities. This just seems like ammunition for the climate change deniers to bitch about. It's unreasonable to keep the old data indefinitely without a massive public repository, which will be poorly indexed and organized.

    • by N1AK ( 864906 )
      You're an AC posting about something not remotely controversial, so you're either lazy or lying, and I'll take your claim with a pinch of salt on those grounds. I don't think anyone is claiming that keeping the data available is either simple or cheap; but those points don't make it any less important. If the data a paper is based on isn't available, then the paper itself loses value, because anyone can write a paper showing anything, and if they don't need to provide the data then it's much harder to investigate.
    • by rnturn ( 11092 )

      ``the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities''

      Wouldn't documenting your experimental method be part of your job? There's really no reason why raw data should be this mysterious entity that nobody can possibly understand unless they were there when it was collected. IMHO, your results -- whatever they are (I only hope it doesn't have anything to do with a drug that physicians might be prescribing to patients) -- are highly qu

    • by mspohr ( 589790 )

      On several occasions I have tried to get data from researchers. Most of them guard their data jealously and will give any number of excuses for not distributing it, including:
      - telling me that I don't have the knowledge or context to properly understand the data
      - fear of me stealing their precious, precious secrets
      - fear of me "misrepresenting" the data
      - (unspoken) fear of me finding problems with their data or analysis
      Unfortunately, most researchers live in a very closed, secretive world and fear exposure.

      • On several occasions I have tried to get data from researchers. Most of them guard their data jealously

        I should note that this almost certainly violates the terms of publication for most journals, and possibly the terms of their research grants as well. I actually had one professor complain to me that it was "his" data and I had no right to it - conveniently ignoring the fact that he (like me) was being funded by taxpayers (albeit in different countries). My views on this subject aren't particularly radic

  • what the hell? (Score:2, Insightful)

    by Anonymous Coward

    I think it is ridiculous that Slashdot keeps posting articles that are behind paywalls. How the hell are we supposed to see them? Do you expect us to pay for subscriptions to services we'd only use once? You, OP, are out of your mind. Articles such as this should be rejected, as most users, if not all, can't even access the story. This site really has gone downhill in the last few years, overpopulated with clueless simpletons, frauds, so-called armchair IT experts and -obvious- subscription-pushing trolls

  • Many things are based on this data... and when the data is gone it cannot be audited, which makes it impossible to verify the findings that are later simply referenced... but the data upon which they are based... *poof*

    This practice also gives free rein to fraudsters, because if you don't catch them quickly they can claim the data was just in their other pair of trousers.


    • This practice also gives free rein to fraudsters, because if you don't catch them quickly they can claim the data was just in their other pair of trousers.

      No, the timespan is 20 years. Within 20 years the results will either be sunk without a trace, disproven or replicated. A fourth option is very unlikely.

      For example, I doubt the original measurements of superconductivity are still around. If they are, they'd be interesting from a historical perspective, but you could replicate the results yourself with

      • In most cases I would agree. However there are some large tables of data that are at least that old that are referred to currently. And they have not been replicated since.

        Further, the tables are themselves not raw data but modified data with the raw data and methodology no longer available.

  • by Chrisq ( 894406 ) on Friday December 20, 2013 @04:04AM (#45743995)
    ... wait what was it again ... it's gone!
  • that I used for my paper 15 years ago. It is on a tape, that is somewhere in a drawer, that I have no tape drive for. On the other hand, the LaTeX file and the C and FORTRAN programs I used to evaluate and create the data and write the paper are still on a hard drive that is running on a computer in my network, and I can access it right now. I probably can't compile the program without change (it was written for Solaris and DEC machines) and maybe not even run LaTeX on it without getting some of the include

  • is/are (Score:5, Interesting)

    by LMariachi ( 86077 ) on Friday December 20, 2013 @04:17AM (#45744029) Journal

    Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.

    Whichever side of the "data is" vs. "data are" argument one falls on, I hope we can all agree that mixing both forms within the same sentence is definitely wrong.

  • Some idiot sub-editor wrote a misleading figure caption here. The article (which I've read) says nothing about how data is lost with age. It only says something about how much data is lost for papers of a given age as of now.

    In other words it does not mean that in 10 years time, 10 year old papers will have such drastic data loss. The world 20 years ago was a very different place in terms of communication, scientific practice, and data storage than it was 10 years ago or is now.

    The Slashdot article repeats

  • a) because it's behind a paywall; and b) how can the original data even hope to be located when a majority of the population can't even read the paper?

  • Lost forever (Score:3, Interesting)

    by The Cornishman ( 592143 ) on Friday December 20, 2013 @04:53AM (#45744127)
    > many other data sets are expensive to regenerate...
    Or maybe impossible to regenerate (for certain values of impossible). I remember reading a classified technical report (dating from the 1940s) related to military life-jacket development, wherein the question arose as to whether a particular design would reliably turn an unconscious person face-up in the water. The experimental design used was to dress some servicemen (sailors, possibly, but I don't recall) in the prototype design, anaesthetise them and drop them in a large body of water, checking for face-down floaters to disprove the null hypothesis. Somehow, I don't think that those data are going to be regenerated any time soon. I hope to God not, anyway.
  • I'm....losing...my..mind..Dave......Dave....Would you like me to sing a song?

  • The very fact that "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.", makes me wonder if this could even be considered "scientific data" anymore. Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process. Given that, should the lack of reproducibility mean that lost scientific data should be redefined as experimental data or hypothesis data? It also brings up the idea in my mind that scientific data has a half life since it can degrade back to hypothesis or experimental data if not properly stored.

    • by dj245 ( 732906 )

      The very fact that "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.", makes me wonder if this could even be considered "scientific data" anymore. Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process. Given that, should the lack of reproducibility mean that lost scientific data should be redefined as experimental data or hypothesis data? It also brings up the idea in my mind that scientific data has a half life since it can degrade back to hypothesis or experimental data if not properly stored.

      Completely incorrect! How can you study "how X has changed over time" if you don't have data from other times? It is also impossible in many, if not most, cases to gather such historical data in the present time.

      • But you can't study "how X has changed over time" if you don't even have the original data that you'd be comparing it to?

        Still, that's not really my point. I'm saying that without the original data (and remember, this is data that cannot be gotten again even with effort), one cannot re-do the study and see if the results are reproducible. Therefore, the entire scientific process is impossible with studies whose data is lost and irretrievable.

    • Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process.

      This gets tricky in some fields, however. I work in a field where generating the data is a notoriously difficult and haphazard process, subject to many non-experimental variables, such that the use of a different pipette or stock solution can make the difference - or even just the speed of the researcher's manual labor. Temperature and humidity play a role too,

  • by QilessQi ( 2044624 ) on Friday December 20, 2013 @08:16AM (#45744787)

    The Long Now Foundation has devised an interesting mechanism for storing important information which, although not optimal for machine readability, is dense and has an obvious format: a metal disk etched with microprinting, whose exterior shows text getting progressively smaller as an obvious way of saying "look at me under a microscope to see more":

    http://rosettaproject.org/ [rosettaproject.org]

    I highly recommend reading The Clock of the Long Now if you're interested in the theory and practice of making things last.

  • Before science gets hot and bothered about the loss of data, scientists need to do something about the quality of the data they produce in the first place. Frankly, given the complete lack of quality controls that a lot of scientists use, the loss of their data is probably for the best. Depending on the field, as much as 60% of all scientific research cannot even be reproduced. Work that cannot be reproduced by another team is far from isolated to one field, either:

    http://online.wsj.com/news/articles/SB10001424052970 [wsj.com]

  • Universities should band together to distribute all data from published material on P2P networks so it's redundantly stored at multiple locations. This has the side benefit of making a legitimate use of P2P obvious.
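    One building block for that kind of redundancy is a checksum manifest that every mirror can verify independently. A minimal stdlib sketch; the directory paths are hypothetical and SHA-256 is an arbitrary choice:

```python
import hashlib
import os

def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    out = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                out[os.path.relpath(path, root)] = hashlib.sha256(f.read()).hexdigest()
    return out

# A mirror's copy is intact iff its manifest matches the published one.
assert manifest("/mirrors/site_a/dataset") == manifest("/master/dataset")
```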

  • As a former paper industry professional (recycled pulp): paper is fine, except that people limit its use to readable font sizes. That is what led to microfiche (which is now being dumped by the truckload at recycling stations as "obsolete tech"). If you printed a hard copy of everything, either to microfiche or in extremely small 1-point font, you could store the data in a type of seed bank or gene bank.

    A salt mine may not be appropriate, but I'd like to start a business where everyone could send their hard drives t

  • by TheRealHocusLocus ( 2319802 ) on Friday December 20, 2013 @09:53AM (#45745431)

    [OP] "disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated

    Well, so much for the study. Money changes everything. Eventually one hundred thousand copies of the abstract will exist on the Internet, but the authors' future descendants will find only one actual link that leads to content, which terminates at a page saying "this domain is for sale".

    You'd think that even science data of extremely low bit rate, such as original weather station temperature data [blogspot.com], should be out there somewhere. A lot of other people did too... but all that is available now might be "value added" adjusted data. Not an evil conspiracy per se; it's human nature at its best and worst.

    A handy chronology of the history of data retention:

    [2500BC] "King Fuckemup boldly slew the enemy and I, Scribe Asskissus, hath inscribed it in stone. He is an asshole who owes me back wages."
    [1500] "With quivering quill [wikipedia.org] I will write mine own data."
    [1866] "Data published at great expense into leather-bound volumes. Dust sold separately."
    [1970] "This is really important. we should print it and store it in a binder."
    [1971] They didn't.
    [1983] "I'll write it to floppy disk with a notsosticky label"
    [1985] "After a long and desperate search, the label has been found!"
    [1987] "Unlabeled floppy disk keeps coffeemaker level."
    [1995] "Roxio CD storage is forever, and Real Scientists don't close their data sessions."
    [2003] "Microsoft Word has experienced a problem updating from an older document format and will now close. Save your work as soon as possible."
    [2005] "I'll just email it to myself and shut the computer off immediately, then pick it up at work."
    [2009] "Yes, three copies! In the safe. There was a fire. Yes, inside the safe. It was a fireproof safe, so no one noticed."
    [2010] "This is really important. I should print it and store it in a binder. But my ink cartridge is dry."
    [2013] "Our data has been uploaded to the Cloud where it will live forever."
    [2500] "King Grapeape slew the primitive humans and buried their statue on the beach. I, Scribe Anthopoapologus hath incribed it in stone."

    Perhaps the most mystifying data retention escapade of Modern Times [youtube.com] is the missing Apollo 11 SSTV moon tapes [wikipedia.org], which contained a multiplexed stream of raw telemetry and the original slow-scan TV signal broadcast from the moon. Not 'missing' really; rather, we know they were re-used and recorded over, because everyone assumed it was someone else's job to ensure that at least one copy was in a safe place. While the earth station operators dutifully sent their tapes to NASA, where the sharpest signal of the moon landing was sure to be preserved for posterity (not), fortunately there were some librarians on duty, and you can acquire DVDs of the moonwalk [honeysucklecreek.net] with better quality than the recordings you've seen in countless movies -- an 8mm film camera pointed at an original SSTV monitor at Honeysuckle Creek, and the best quality scan-converted version.

    In the Foundation series, Asimov envisioned Gaia [wikipedia.org], a world in which a telepathic network of sentient (and sensuous) beings kept a 'working set' of retrievable data in-memory -- but also, via access to progressively less and non-sentient objects such as plants and even rocks, a vast archive. Ask the mountain; it will answer in time, a long time.

    Our own Earth has a Gaia storage mechanism: a record of its magnetic field over geologic time, stored as polarization in crystallized lava flows [youtube.com]. But it i

  • The NIDDK was aware of this years ago and had commissioned a feasibility study on creating a storage mechanism that all grant-paid research would have to use. Unfortunately, after a successful feasibility study, the reviewers for the follow-up real grant responded with "I do not see the scientific value of this research", and the grant went away, with Vanderbilt as the only applicant. I've heard through the grapevine that someone picked up a new similar grant to work on it, but I haven't seen anything from it yet

  • how this happens (Score:4, Informative)

    by Goldsmith ( 561202 ) on Friday December 20, 2013 @10:12AM (#45745567)

    Our scientific research system is built around the process of joining a lab, mastering the work there, and then leaving. There are very few long term research partnerships. The people who stay in place are the professors, who generally do not do the research work.

    So you join a lab, produce a few terabytes of data a year, pull a few publishable nuggets out of that, and then leave. I have a few backup hard drives that move around with me with what I consider my most important data, probably totalling 1/10 of the data I have taken. After a few years, this data is really unimportant to me, as the labs I have left have done a good job of continuing the research and I have to spend my time and money on something else.

    The original data is eventually overwritten by researchers a few "generations" removed from me and that's the end of it.

  • How is that different from the previous state of affairs?

    Before the digital age, scientists would have work booklets that would get lost or destroyed when they changed jobs, or when they became too numerous.

    Drowning in an overflow of data is about as useful as having no data at all. It could be argued that forgetting is actually a good thing that puts forward important matter, the things we care to keep because they are valuable. Sure, some valuables get lost in the process, but anyway, who would go sort through a

  • by RogueWarrior65 ( 678876 ) on Friday December 20, 2013 @10:32AM (#45745759)

    When John Knoll (yes, THE John Knoll, co-creator of Photoshop and VFX wizard extraordinaire) wanted to reproduce the Apollo moon landing in CG, he ran into a small problem. He went to NASA to obtain the telemetry data for altitude and orientation, but apparently the data had been tossed a long time ago. However, he was able to find physical prints of graphs of the telemetry channels. So he scanned them in, made them an underlay in a 3D modeling program, and painstakingly traced them by hand in order to extract the data. The results can be seen in the Magnificent Desolation Apollo 15 landing sequence. And BTW, that's his modeling work for the lander too.
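    Nowadays the hand-tracing step can be roughly automated. A sketch of the idea, assuming numpy and Pillow, a single dark curve on a light background, and a hypothetical file name:

```python
import numpy as np
from PIL import Image

# Column-by-column digitization of a scanned graph: treat dark pixels
# as ink and take their mean row index as the curve's y-value.
img = np.asarray(Image.open("telemetry_scan.png").convert("L"))
ink = img < 128                        # dark pixels count as ink

samples = []
for col in ink.T:                      # one pixel column at a time
    rows = np.nonzero(col)[0]
    samples.append(rows.mean() if rows.size else np.nan)

# 'samples' is still in pixel space; converting to engineering units
# requires calibrating against known axis tick positions on the scan.
```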

  • I am thinking back to one lab I used to work in that had boxes and boxes of old tape spools sitting out in the hallway; it was always sad to wonder what might be on them, since the machine used to create the data had already been disassembled to make space.

    And then I think about the actual project I was working on, which produced something like 1GB/hour every hour every day. Only a fraction of the raw data really made it through cooking, but if there turned out to be a flaw in that initial processing our a
  • between data and information. Information is data which reduces confusion. Data can actually carry negative information value if it increases confusion. Any data which is highly informative survives. And just because money was spent to obtain it, doesn't mean it was fruitful. Research is, almost by definition, a walk in the dark. It attempts to reduce confusion. And, as such, is bound to have misses more often than hits.
  • We Need Legacy Support - I keep saying this and the little kids keep dissing me but we desperately need to maintain legacy support. In 30 more years what else will we have lost through rapid obsolescence?

    Companies like Apple and Microsoft need to reach back and provide it all the way to their earliest systems forward. We need to be able to access our old data and that means being able to run our old applications.

    Congress needs to put forth the legal framework that allows all software to be legal cross comp

  • I have a box with about 200 3.5" floppy disks of facility data. And another box with several laser disks from HP data systems (1980s, that ran RMB), because those floppies could only store four hours of data. The data is not "scientific", but facility pressure, temperature, stresses, etc. I don't know what to do with all this; I don't think it is important like data from Voyager or Pioneer, but one never knows. We don't have the equipment anymore to read it. Maybe we can find it used, eBay perhaps? I remember those HP i
  • A couple of us just rescued some 20-year-old data that had been stored on 3.5 inch floppies. We actually had to go to an old retired colleague's house, because he was the only person we could find who had a computer with a floppy drive capable of reading them. Even so, some of the data was unrecoverable.

    I know probably the best option right now for preservation in digital form would be several copies on CD/DVDs of the proper archival type, but I'm wondering if there are any free online services suc

  • For one example: for one project, let's say I have roughly 300GB of simulation data. Out of that data, how much will be used to generate figures for publication? Maybe 1%? The rest of it is from testing, fine tuning, and exploring the parameter space. The real problem isn't where to save it all, but that there is extremely little incentive to go through the trouble of sifting through and archiving the important stuff. 80% is probably a lower bound, IMHO. Furthermore, let's say you save that important pre
