Data Storage Science

Neglect Causes Massive Loss of 'Irreplaceable' Research Data

Posted by Soulskill
from the store-those-magnets-over-there-by-the-old-hard-drives dept.
Nerval's Lobster writes "Research scientists could learn an important thing or two from computer scientists, according to a new study (abstract) showing that data underpinning even groundbreaking research tends to disappear over time. Researchers also disappear, though more slowly and only in terms of the email addresses and the other public contact methods that other scientists would normally use to contact them. Almost all the data supporting studies published during the past two years is still available, as are at least some of the researchers, according to a study published Dec. 19 in the journal Current Biology. The odds that supporting data is still available for studies published between 2 years and 22 years ago drop 17 percent every year after the first two. The odds of finding a working email address for the first, last or corresponding author of a paper also dropped 7 percent per year, according to the study, which examined the state of data from 516 studies between 2 years and 22 years old. Having data available from an original study is critical for other scientists wanting to confirm, replicate or build on previous research – goals that are core parts of the evolutionary, usually self-correcting dynamic of the scientific method on which nearly all modern research is based. No matter how invested in their own work, scientists appear to be 'poor stewards' of their own work, the study concluded."
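
To get a feel for what a 17 percent yearly drop in odds adds up to, here is a rough sketch in Python. It assumes the drop compounds multiplicatively each year after the first two, which is one reading of the summary and not something the summary spells out. The baseline odds (9 to 1 at the two-year mark) and the function names are made up for illustration and do not come from the study.

```python
def odds_after(years_since_publication, baseline_odds, yearly_drop=0.17, grace_years=2):
    """Odds that the data is still available.

    Assumes the odds stay at baseline_odds for the first grace_years and
    then shrink by a constant fraction (yearly_drop) every year after that.
    """
    decay_years = max(0, years_since_publication - grace_years)
    return baseline_odds * (1.0 - yearly_drop) ** decay_years


def odds_to_probability(odds):
    """Convert odds (for : against) into a probability between 0 and 1."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical baseline: 9-to-1 odds (90%) at two years. Not a figure from the study.
    baseline = 9.0
    for age in (2, 5, 10, 22):
        odds = odds_after(age, baseline)
        print(f"{age:>2} years: odds {odds:.2f}, probability {odds_to_probability(odds):.0%}")
```

Note that the drop applies to odds, not to probability, so the probability falls slowly while the data is still likely to exist and faster once the odds get close to even.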
This discussion has been archived. No new comments can be posted.

  • Just ask somebody to figure out how to build a Battleship, or even the guns off one, heck, you'd have trouble finding people who know the process of firing them.

    Or if you prefer, Greek Fire.

    • by xmundt (415364) on Friday December 20, 2013 @08:36PM (#45750657)

      Or as a slight step up....there is NO chance that America could build a Saturn V rocket these days. It was a great workhorse, but so complicated that the loss of a few percent of the drawings, and the number of engineers that worked on it that have retired or died means that reproducing it is impossible now.
      In any case, as for the loss of data...that IS a problem. Back in the Olden Days, before someone decided that the computer, with its amazingly fluid and ever-changing methods of storage, was the answer to saving data, much of it was printed on paper and tucked away in libraries. Is that still a workable solution? I do not know, but I do know that when one is trying to store information for a long time, it HAS to be in the simplest and most durable medium and format available.

      • by TubeSteak (669689)

        Or as a slight step up....there is NO chance that America could build a Saturn V rocket these days.

        We have at least a couple complete Saturn V rockets lying around if we wanted to reverse engineer 'em.
        I've personally seen the ones in Alabama and Washington D.C.
        http://en.wikipedia.org/wiki/Saturn_V#Saturn_V_displays [wikipedia.org]

        The hardest part of rebuilding old hardware is the metallurgy.
        As long as we can get that right (or use a better quality substitute)
        reverse engineering from existing parts isn't anything we couldn't farm out to China.

  • by Anonymous Coward on Friday December 20, 2013 @07:19PM (#45750211)
    Sounds familiar! [slashdot.org]
  • by Anonymous Coward on Friday December 20, 2013 @07:23PM (#45750239)

    My wife is a wildlife biologist. Her office collects raw field data all year, compiles data, runs stats, writes reports, reads reports, creates a pretty large volume of "product" every year.

    I ask her who exactly reads all the required papers and reports they produce. The federal Fish and Wildlife Service demands product. State demands product. Various agencies with funding ties that would confuse anyone all demand product. The real ass-kicker? Almost none of it is actually READ by those who asked for it. The papers that are read, are rarely read by more than one person.

    In the end, thousands and thousands of offices like hers, producing real scientific data, it is just too much.

    The number of people consuming the product is DWARFED by those producing it. The number of people tasked to archive, organize, store, catalog, and index this torrent of information are even FEWER than those who consume it.

    These are "real life" scientists out there every day. Now throw in academia, including "research academia".

    The bottom line? A true first-world problem. We produce WAY more research than we are prepared to do ANYTHING with.

    • Put it on the web. Who knows who may find it useful? The value of the research might not reveal itself for some time, but if google or someone has archived it, it might sit there waiting to unveil secrets, like the Pillars of Ashoka.

      • by Obfuscant (592200)

        Put it on the web.

        Who pays for that? Disks and servers and networks cost money. Academics rarely have that just sitting unused.

        • Let the Fed expand its balance sheet to buy govt bonds that allow for academics to publish data, and keep the loans rolling over forever while returning the interest to the Treasury. Making the research and data available is in the General Welfare. Like libraries...

          • by Obfuscant (592200)
            An interesting interpretation of the Constitution, where the general welfare clause is part of the preamble and not a proscriptive statement. And an interesting interpretation of how research grants are awarded, and even the general usefulness of vast quantities of research data.

            Academics already write "publish" into the grants they get, or they ought to do so. "Publish" is not the same as "put all the data up in an organized manner for everyone to come use", however. And even being able to put it all up f

            • General Welfare is mentioned twice, in the Preamble and in Article 1, Section 8.

              It isn't printing money, since no physical greenbacks need be involved, just figures in a virtual ledger book. Banks of course use this trick to expand their balance sheets, by issuing loans or otherwise creating assets. UBS for example booked future expected profits right away on AAA mortgage-backed securities, and paid bonuses on those profits. So these type of accounting practices go on all the time in the private sector.

              Perh

              • by khallow (566160)

                It isn't printing money, since no physical greenbacks need be involved, just figures in a virtual ledger book.

                There's so much fail in this sentence. "Printing money" is a saying not a literal description of the act. It means that you create currency without creating value. Inflation takes care of that hubris.

                And how can anyone think that "figures in a virtual ledger book" is an adequate solution for anything productive or vital?

                Perhaps by the time someone comes across your data, they will be smart enough (or have an AI that's smart enough) to figure it out. Or they could become architectural relics, providing valuable information to future societies. I think you discount your own research unfairly.

                Like a room with a thousand Madonna portraits. Someone will be interested.

                • "figures in a ledger book" is what the financial sector busies itself with. I agree, there's no value added. We'd be better off bypassing the financial sector and simply providing liquidity, from the government or a central bank, when it's needed. The financial sector is mired in all sorts of perverse incentives and moral hazards that cause lots of friction and push prices away from their efficient levels. That's why asset prices bubble and crash, because dealers push them away from their efficient levels.

                  A

                  • by khallow (566160)

                    "figures in a ledger book" is what the financial sector busies itself with. I agree, there's no value added.

                    Sometimes you're right. And sometimes those figures represent things of value. A loan to you would be "figures in a ledger book" to the bank. But it'd be a home, a business, or an education to you.

                    We'd be better off bypassing the financial sector and simply providing liquidity, from the government or a central bank, when it's needed.

                    That's what creates these huge bubbles in recent time. Easy money from the Fed gets dumped into dubious investments by the finance sector.

                    Inflation is mostly psychological.

                    Then you don't know what inflation is. For example, if the US government were to secretly "print money", that is, buy things with currency that they don't have the backing for,

                    • The central banks didn't provide the liquidity for the most recent bubble, or for the tech bubble. The Fed was increasing interest rates (which killed dot-com). The credit expansion took place in the private sector, not from the Fed. Your model is deeply flawed, based on an ideology that history doesn't support.

                      Financial "innovations" preceding the most recent crash created what private banks thought of as "risk-free" assets. The banks booked future profits from these riskless, AAA-rated, mortgage-backed

                    • by khallow (566160)

                      The central banks didn't provide the liquidity for the most recent bubble, or for the tech bubble.

                      You are wrong here. The US Federal Reserve had low interest rates going into both bubbles and Fed officials did link money policies to the asset bubbles (for example, Greenspan's "irrational exuberance" speech in 1996).

                      If inflation is tied to the money supply, why didn't we see hyperinflation when the Fed expanded its balance sheet by a factor of at least 2 in a week? Why didn't we see hyperinflation when the private sector was expanding its balance sheet by much larger than a factor of two in the run-up to the crash?

                      Because a mere factor of two isn't hyperinflation. If they were doubling money supply every week for many weeks, then that would result in hyperinflation. And the Fed's "balance sheet" isn't a full measure of inflation since one also has to consider velocity of money which slows greatly durin

                    • Let's look at some data, shall we?

                      http://research.stlouisfed.org/fred2/graph/?g=qip [stlouisfed.org] shows that the Fed was in disciplinary mode, raising interest rates, before both the dot-com and the real-estate crash. Greenspan's "irrational exuberance" attitude was what killed dot-com, because, I think, he's an old fool who didn't understand the potential of technology to make obsolete his feudal economic models.

                      Regarding velocity of money: http://research.stlouisfed.org/fred2/graph/?g=qiq [stlouisfed.org]
                      If velocity of money leads to i

                    • by khallow (566160)

                      http://research.stlouisfed.org/fred2/graph/?g=qip shows that the Fed was in disciplinary mode, raising interest rates, before both the dot-com and the real-estate crash.

                      Interest rates were rather low just the same. Also, your observation is additional support for my argument since the rates were raised just before the asset bubbles crashed. That timing is an important correlation for claiming cause and effect.

                      When I look earlier, I see a 2.75% rate in early 90s (lowest since the 60s) and sub 2% rates after the 9/11 attacks (lowest since after the 1957-1958 recession).

                    • I think you're ignoring far more obvious causes for inflation. In the 1970s, it was oil supply shocks. OPEC raised prices not because of economics of supply and demand, but for purely political, or psychological, reasons.

                      The interest rate profile for the 1960s is similar to that for the 2000s. But inflation consequences were quite different, because there are much more important psychological factors involved.

                      Rates were being raised years before the "bubble" burst. That's a very strange cause theory you hav

                    • Also the obvious point in the interest-rate graph is that 8 of 9 recessions immediately followed a rise in interest rates. Discipline caused the recessions, not too much money.

                      In the dot-com crash, investors started pulling back because they couldn't keep their loans rolling over at the low interest rates. In the real-estate crash, mortgage rates went up because money was becoming tighter. What if interest rates had not gone up? Let's run a simulation to see if people would have been better off.

                    • by khallow (566160)

                      OPEC raised prices not because of economics of supply and demand, but for purely political, or psychological, reasons.

                      As you noted yourself, 70s recessions were triggered by oil shocks, not by OPEC psychology.

                      The interest rate profile for the 1960s is similar to that for the 2000s. But inflation consequences were quite different, because there are much more important psychological factors involved.

                      No, they weren't. For example, the 60s interest rates didn't stick around the lowest interest rate for any length of time while the lowest points of the 2000s interest rates were maintained for more than a year.

                      Then a hiccup occurred when UBS announced it was writing off over $10 billion in MBSes, and groupthink took over and the traders started an emotion-based sell-off. Interest rates and the money supply had little to do with it. Psychology and emotional overreaction were the main causes.

                      Why did UBS write off anything in the first place? They had a margin call (well, the equivalent for banks which are required to maintain a level of reserves). Higher interest rates provided a lot of pressure fo

                    • OPEC psychology created the oil shocks. There was no production capacity problem. There was a psychological issue.

                      Regarding UBS, here's a quote from the Economics of Money and Banking class, Lecture 20 Notes:

                      UBS was doing something it called a Negative Basis Trade in which it paid AIG 11 bp for 100% credit protection on a supersenior CDO tranche, and financed its holding of that tranche in the wholesale money market. In its report to shareholders, to explain why it lost so much of their money, it states tha

                    • by khallow (566160)

                      OPEC psychology created the oil shocks. There was no production capacity problem. There was a psychological issue.

                      You don't get it. The existence of an effective cartel demonstrates in the first place that there were production capacity problems - namely that production capacity was highly concentrated in the hands of the cartel. And the oil shocks were profitable (in addition to increasing the political power of the OPEC members) - providing a straightforward market advantage for that choice.

                      The risk turned out to be liquidity risk, when money market funding dried up and they could not sell their AAA tranche.

                      Here we go. Liquidity risk that originated from the easy Fed money no longer being in the market.

                      So it wasn't a margin call.

                      Then why did they need to "raise

                    • by khallow (566160)
                      For another model, I view recessions as large corrections of market perception. Recent recessions have been asset bubble driven, but there are other kinds of recessions such as the oil crises of the 70s (where suddenly the developed world realized that OPEC could manipulate oil supply and prices a huge amount and that resulted in all sorts of costly economic adjustments from changes in individual behavior up to national investments in alternative energy approaches).

                      In this light, when the central bank se
                    • Regarding OPEC: there was no supply and demand problem with oil. In economic terms, the price should not have risen because there was no production capacity problem. The reason prices rose was purely a matter of politics, of psychology, of policy. Not physical necessity. The proof is that prices later dropped to $10/barrel. So there was no production capacity problem. There was only a psychological problem.

                      Regarding UBS's liquidity risk: according to Prof. Mehrling's story, UBS was getting funding from mon

                    • "In this light, when the central bank sets interest rates, it is actually paying the markets to see interest rates as being in a certain range. This primes the pump for putting money in any available high leverage investments since suddenly there's no low risk investments with good interest payments out there. And once money starts flowing into such a bubble, it develops an attractive short term trend which brings in more money."

                      Your story doesn't take into account that the Fed doesn't set interest rates (e

                    • by khallow (566160)

                      Your story doesn't take into account that the Fed doesn't set interest rates (except the Discount Rate which is set a fixed amount above the natural private rate). It can try to target rates, but the rates are ultimately negotiated by the private institutions themselves.

                      The Fed has very effective tools for targeting rates. A control system doesn't need to be perfect to be an effective control system.

                    • The private banking system evolved of its own accord towards a centralized system where clearinghouses played a role similar to the Fed today. There was a need for a central bank that could provide elasticity in times of crises, and it was convenient for all the banks to settle payments once a day at a clearinghouse, instead of many times with each bank someone had written a check on or cashed a check at. A centralized system made sense.

                      The problem with the centralized system was that it didn't provide enou

      • There's nothing in the world preventing you from donating time and hardware to help them do so.

        That's the point. There's really too much raw data that's not really needed, just produced.
        • Who knows what's really needed? The market is too short-sighted to be a reliable judge. Mendel's research was not needed, until after his death. The research that went into the internet was thought to be unneeded by AT&T. The library of Alexandria was thought to be unneeded and burned. Kafka wanted all his unpublished manuscripts burned after his death.

          If you or I can't afford to help researchers publish their data on the internet, the government can and should.

          • by khallow (566160)

            If you or I can't afford to help researchers publish their data on the internet, the government can and should.

            This is the outcome of government intervention in the scientific process - the generation of scientific activity which can't have long term value merely because it won't be saved. Maybe if we apply more of the poison, we'll save the victim.

            • government intervention in the scientific process...

              Involvement is not the same as intervention.

              Scientific research is a public good.

              • by khallow (566160)

                Scientific research is a public good.

                Except when it's not.

                But by forcing so much research to be a public good, you also create the usual tragedy of the commons situations of overconsumption of the good, such as researchers who research all sorts of things to consume the available public funds, but have no incentive to actually save their work.

                • Why not save their work whether they have the incentive, or not? That way it can be checked by anyone who wants to. It's in the public interest to check research, like Rogoff and Reinhart [theatlantic.com]'s:

                  Thomas Herndon, Michael Ash, and Robert Pollin of the University of Massachusetts, Amherst, have found serious problems with Reinhart and Rogoff's austerity-justifying work.

                  • by khallow (566160)

                    Why not save their work whether they have the incentive, or not?

                    Because they don't get funded to do that.

                    • Create the money to fund them. It's in the public interest, the General Welfare. We, and our grandchildren, will be better off if we can check research by having access to the data it used.

                    • by khallow (566160)
                      You still have the problem that nobody reads most of the research and that even if you did pay people to read research, you'd still be nowhere near use of that research. All this blather about "General Welfare" ignores that it's not really in the public interest to pay brilliant people to spin their wheels.
                    • You don't know what research will be read. Maybe it will become valuable after you're dead. Its value to you in the present may be nothing, but to another it might be great. For example, I like to listen to old jazz tunes on youtube that may have one or two other views. But there's value in them, because value is not a popularity contest. In the same way, research that is not valuable to you, or not popular at this time, can have immense value to the future. Example: piles of trash that are invaluable to a

                    • by khallow (566160)

                      You don't know what research will be read.

                      No, but I have a pretty good idea.

                      Maybe it will become valuable after you're dead.

                      But it probably won't. As a general rule of thumb, if something doesn't prove itself in the first few decades to have value, then it probably never will.

                      For example, I like to listen to old jazz tunes on youtube that may have one or two other views. But there's value in them, because value is not a popularity contest.

                      Sure, because you listen to them, those youtube videos have some value.

                      Example: piles of trash that are invaluable to archaeologists in reconstructing ancient Troy, say.

                      It doesn't have to have universal value to everyone to have value. This is completely irrelevant to my point. Recall we're speaking of research data that will probably never be examined by anyone other than the author and perhaps a few reviewers. You the

                    • I think your approach is like uniformly reporting "negative" on cancer tests, because the incidence of cancer is so low. You can have a very high successful prediction rate (99%, say) by simply saying "no" on every test. But that doesn't help the patients who have cancer. You can boast "I have a great prediction success rate!" but you're not helping anyone.

                      In the same way, saying categorically that no research is valuable because a lot of it isn't valuable is silly. It's precisely the cases that are "thrown

                    • by khallow (566160)

                      I think your approach is like uniformly reporting "negative" on cancer tests, because the incidence of cancer is so low.

                      That is not the case. My approach would be picking up most, if not all cases of useful research in question. Recall that scientific research which results in useful progress over the long term invariably has some usefulness and value even in the short term. This is a universal feature, not a quirk of market-oriented research.

                      In the same way, saying categorically that no research is valuable because a lot of it isn't valuable is silly.

                      Then don't say it. I don't say it either.

                      I do think that this sort of claim indicates that you don't understand my argument. I'm not arguing that publicly funded research can't have

      • There are groups working on this -- the University of California is trying to do it in a consistent way, with its wealth of historical data -- but it's harder than you'd think. It's not very useful if you don't get the metadata reasonable, and that's skilled work and not something we reward. Institutional support (libraries, machine shops, etc) gets pinched because it's constant overhead and hard to point to single high-status payoffs. It takes one year to kill a library (Canada's superb fisheries and lake

    • by nbauman (624611)

      I dunno. There may just be a half a dozen people who are interested in your wife's penguin or whatever, but to them it's really interesting. They might have to make a decision about penguin habitat or whatever.

      And then there's the scientific paper lottery. A few papers turn out to be really important, everybody cites them, and they change the world -- but you can't know in advance which one is going to be important. There were people doing studies of the hearing of fish, and suddenly, when porpoises start g

    • by Anonymous Coward

      Scientific data would not be lost if it was posted on Slashdot... you could just retrieve the next day's dupe.

    • by gandhi_2 (1108023)

      Because without constant refreshing, this article would disappear!

    • Don't blame them, the editors really care, given their apparent short-term memory loss and/or schizophrenia.

      (yes I know about varying medical definitions of schizo)

  • by Anonymous Coward

    They should post their data to slashdot. Who will duplicate that shit so many times it will never vanish.

  • That's why Slashdot is keen on posting all new studies at least twice, thus increasing the chances they are still available for future generations!
  • by Anonymous Coward

    I've found dead links to data in peer reviewed papers published just a week or less prior to reading them, sometimes these links were never valid to begin with.

    • by Obfuscant (592200)

      I've found dead links to data in peer reviewed papers published just a week or less prior to reading them, sometimes these links were never valid to begin with.

      Maybe the peer-review process should be shorter, or you should keep up with current journals and not depend on ten year old articles?

      Seriously though, maintenance of data requires money. I have 22 years worth of data here. Much of it is raw video on VHS tapes. Much of it is on old floppies. Much of it is on TK70 tapes. Much of it is on early versions of magneto-optical disks. I don't have anything that reads any of those formats anymore.

      Who pays to keep copying old data onto new media as new media are deve

  • Options (Score:5, Interesting)

    by jklovanc (1603149) on Friday December 20, 2013 @07:42PM (#45750341)

    Maybe there should be an option to "ignore" an article or "report as duplicate". The second option would require someone to react to it so it may not work.

  • At least (Score:4, Interesting)

    by Nemyst (1383049) on Friday December 20, 2013 @07:56PM (#45750447) Homepage
    Slashdot is doing its part by posting the same data multiple times. Perhaps one copy will survive the test of time!
  • by Anonymous Coward
    Dupity dupe dupe!
  • by Anonymous Coward

    Working in the field, I can pretty much state that far from enough care is taken with data archival and/or transfer to newer storage media when older ones approach obsolescence.

    There's:
    A: not enough staff to take care of it properly or keep a proper archival environment for the various media
    B: not enough money & time to modernize the records/transfer to new mediums
    C: sometimes not enough money to even properly maintain obsolete, long-unsupported and obscure data recording equipment
    (I've seen 'rubber' pi

    • by cusco (717999)

      Sometimes there is also deliberate and/or malicious destruction to take into account as well, like the Bush mAdministration ordering the destruction of the Mariner and Pioneer data.

  • Think of all the family photos that will get deleted or destroyed by hardware failure, and to think I have family photos (on film) from over 100 years ago.

  • Maybe designating the Library of Congress as a repository for scientific data would work. They're pretty good at archiving stuff.

  • Part of the problem with corresponding with authors of papers more than 2 years old is that there is no good way to uniquely identify an author. If you know that you are interested in a "John Smith" who wrote a Nature paper in 1989, good luck figuring out which "John Smith" is the same one today (if he is still alive). Another good example is how many papers are by "Z Huang": currently over 6,000 to date in pubmed [nih.gov].

    Considering how we expect researchers to change institutions multiple times in their car
  • One thing that I lament about scientific publications is that the results are boiled down to a few pages. You rarely see raw data, and generally only the statistical analysis. I would like to see web links in journals that include more of the raw data, the programs that generated that data, etc. We live in a day and age when gigabytes are cheap. It would be a lot easier to duplicate someone's work for peer review if the inherent data & analysis programs were more accessible. Although, there are a fair

  • "Research scientists could learn an important thing or two from computer scientists,..."

    What is the error bar on "a thing or two"?

    As someone with a foot in each camp, I believe it's more like fifty or a hundred. The methods of scientists regarding computing are often built of slow evolutionary changes upon old familiar methods, while incorporating selected cutting edge hardware or algorithms. It is partly the nature of some science projects to carry out observations over many years, ideally with the sa

  • https://en.wikipedia.org/wiki/John_Lott#Disputed_survey [wikipedia.org]

    Disputed survey

    In the course of a dispute with Otis Dudley Duncan in 1999–2000,[55][56] Lott claimed to have undertaken a national survey of 2,424 respondents in 1997, the results of which were the source for claims he had made beginning in 1997.[57] However, in 2000 Lott was unable to produce the data, or any records showing that the survey had been undertaken. He said the 1997 hard drive crash that had affected several projects with co-authors h

  • Perhaps this is an opportunity for journals to update their business models?
    Warehouse and convert data, as well as curate contact lists for papers.

  • ... are condemned to repeat it.

  • I'm late to the party here, but I thought it was worth mentioning that the Purdue University Research Repository (https://purr.purdue.edu) is designed as a Trusted Digital Repository for research data. The default lifetime is 10 years, but the Purdue Libraries will add noteworthy datasets to its permanent digital collection after their default lifetime expires. (And yes, I am a programmer on the project.)
