Science

Digital Preservation Is Not Keeping Up With the Growth of Scholarly Knowledge (nature.com) 43

Nature: Millions of research articles are absent from major digital archives. This worrying finding, which Nature reported on earlier this year, was laid bare in a study by Martin Eve, who studies technology and publishing at Birkbeck, University of London. Eve sampled more than seven million articles with unique digital object identifiers (DOIs), a string of characters used to identify and link to specific publications, such as scholarly articles and official reports. Of these, he found that more than two million were 'missing' from archives -- that is, they were not preserved in major archives that ensure literature can be found in the future.

Eve, who is also a research developer at Crossref, an organization that registers DOIs, carried out the study in an effort to better understand a problem librarians and archivists already knew about -- that although researchers are generating knowledge at an unprecedented rate, it is not necessarily being stored safely for the future. One contributing factor is that not all journals or scholarly societies survive in perpetuity. For example, a 2021 study found that a lack of comprehensive and open archiving meant that 174 open-access journals, covering all major research topics and geographical regions, vanished from the web in the first two decades of this millennium.

A lack of long-term archiving particularly affects institutions in low- and middle-income countries, less-affluent institutions in rich countries and smaller, under-resourced journals worldwide. Yet it's not clear whether researchers, institutions and governments have fully taken the problem on board. [...] At the heart of the problem is a lack of money, infrastructure and expertise to archive digital resources. [...] For institutions that can afford it, one solution is to pay a preservation archive to safeguard content. Examples include Portico, based in New York City, and CLOCKSS, based in Stanford, California, both of which count a raft of publishers and libraries as customers.

  • by MpVpRb ( 1423381 ) on Tuesday December 03, 2024 @03:36PM (#64988491)

    Digital data decays and the formats and readers become obsolete. This is a serious technical problem, but even worse are the attitudes of rights holders. They view preservation as theft and would prefer that old data disappear if they can't get paid for it

    • Digital data decays and the formats and readers become obsolete. This is a serious technical problem, but even worse are the attitudes of rights holders. They view preservation as theft and would prefer that old data disappear if they can't get paid for it

      (The PDFuckin’ Don Father) ”Oh yeah? Pretty damn easy to create a standard if ya ask me.”

      By the time we get done worrying about some format lasting 30 years or more, it’s gonna prove us wrong.txt.

      As far as patented greedy assholes go, fuck ‘em. We didn’t need patents to learn about the Roman Empire. Their history will either be remembered through donation, or forgotten by litigation.

    • Scientific Journals inadvertently contribute to this by pay walling everything. Research should not be pay walled by default.
    • Copyright reform is badly needed but no politician considers it a priority or has the guts to drive it. I

    • by Tony Isaac ( 1301187 ) on Tuesday December 03, 2024 @11:50PM (#64989321) Homepage

      Like biological evolution, it's the survival of the fittest. DRM-protected documents won't survive, publicly accessible common formats like PDF will do better.

      A lot of biological organisms have gone extinct, yet we have a thriving diversity today. One might even say that extinction is a part of the larger process. The same is true for the digital stuff. We fret over what we might be losing now, but in the long run, 99.99% of that research will not be worth anything 1,000 years from now, and yet it will be OK.

      • by AmiMoJo ( 196126 )

        The issue is selecting formats today that will be around in a few decades. PDF has proven durable, as have optical discs starting with CDs... Or at least the format is durable, the actual discs might not last very long.

        LTO tape is a tricky one. The tapes are durable but the drives can only read two generations back, unlike a Bluray drive which can still read CDs pressed in the 1980s. Software is also an issue, because it's somewhat proprietary and will eventually stop working on most computers. Even the fac

    • This has always been a problem. Only one tablet with ten commandments of Moses survived till today. The other 665 tablets broke or were lost in the desert.
    • > formats and readers become obsolete

      Doesn't stop them working way beyond their designed life. You are looking at around 100 years before having serious problems accessing a medium. Not only must you wait for that medium and its associated hardware to no longer be produced, but also for all working hardware to fail to the point of unrepairability.

      Obsolescence doesn't result in non-function, basically. Right now only the very oldest media is an issue, along with media that was never popular in the first p

  • by wierd_w ( 1375923 ) on Tuesday December 03, 2024 @04:07PM (#64988557)

    Might have something to do with the fact that efforts to back up, replicate, and preserve this data are met with strong enforcement efforts from the likes of Elsevier and pals, whose business model REVOLVES around this data being scarce, and only obtainable THROUGH THEM.

    Exclusivity, or Preservation.

    PICK *ONE*.

    • by vivian ( 156520 )

      Copyright holders who have by definition been granted a unique monopoly on information should be obliged to maintain archives of that information in perpetuity, to be made available to anyone who cares to access it for a reasonable fee.
      If they no longer wish to maintain that archive, or go out of business, they should be required to release that information to open public archives and relinquish copyright on that information.
      If they fail to maintain the archive and lose information, or otherwise mak

      • also limit disney vault like locking and only selling in big collections

        • Ban the Disney Vault. The whole point of copyright is a monopoly on distribution. If you aren't distributing to the general public, you should lose your copyright instantly. At bare minimum, your copyright should be held as unenforceable by the courts until such distribution resumes or the rights expire. (Which ever comes first.)
  • by ctilsie242 ( 4841247 ) on Tuesday December 03, 2024 @04:16PM (#64988587)

    After people started streaming/downloading, work on mainstream optical formats has ceased. Even Sony (AFAIK) doesn't have an optical archiving format anymore.

    This is something businesses need. LTO-9 is okay, but expensive, and optical media done right is relatively cheap to make, can hold just as much, if not more than a tape. [tomsguide.com]

    Long term archiving formats are the cornerstone of any real preservation efforts. Yes, one can always run through tapes and copy the data every few years or stuff everything on a NAS and back that up, but when you start getting into exabytes worth of data, you need to have a solid format, as you don't have the bandwidth to keep rereading the data.

    I just wish optical could get some updates. Even a 5 TB disk would make life a lot easier and make home backups a thing again. People use portable drives for this, but hard drives are not archival media.

    From there, it would be nice to have an open source DAM or even an archiver that can store data on the backend with ECC encoding, so if there is a damaged sector or a file got corrupted, there is a high chance that it can be repaired. I have used WinRAR in the past, and the recovery record functionality has saved damaged records. Even something like Borg Backup that supports erasure coding, where one can add an additional percentage to the backend repository for ECC, can be something that can save data.
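
    To make the ECC idea concrete, here is a minimal sketch of the simplest possible scheme: one XOR parity block that can rebuild any single lost block. This is not how WinRAR's recovery record or any real backup tool works internally; the function names (split_blocks, xor_blocks, recover) and the 8-byte block size are made up for illustration, and real archivers use stronger codes such as Reed-Solomon that survive more than one damaged block.

```python
from functools import reduce


def split_blocks(data: bytes, size: int) -> list[bytes]:
    """Split data into equal-sized blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(blocks: list, parity: bytes, missing_index: int) -> bytes:
    """Rebuild the block at missing_index from the survivors plus parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])


# Hypothetical payload; in practice this would be file contents.
data = b"archival data that should survive one damaged block"
blocks = split_blocks(data, 8)
parity = xor_blocks(blocks)  # stored alongside the data blocks

# Simulate losing one block (a bad sector), then repair it.
damaged = list(blocks)
damaged[2] = None
restored = recover(damaged, parity, 2)
```

    The cost here is one extra block of storage for the whole set; adding "an additional percentage" for ECC in a real tool is the same trade of space for repairability, just with a code that tolerates more damage.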

    • by nightflameauto ( 6607976 ) on Tuesday December 03, 2024 @04:25PM (#64988615)

      After people started streaming/downloading, work on mainstream optical formats has ceased. Even Sony (AFAIK) doesn't have an optical archiving format anymore.

      This is something businesses need. LTO-9 is okay, but expensive, and optical media done right is relatively cheap to make, can hold just as much, if not more than a tape. [tomsguide.com]

      Long term archiving formats are the cornerstone of any real preservation efforts. Yes, one can always run through tapes and copy the data every few years or stuff everything on a NAS and back that up, but when you start getting into exabytes worth of data, you need to have a solid format, as you don't have the bandwidth to keep rereading the data.

      I just wish optical could get some updates. Even a 5 TB disk would make life a lot easier and make home backups a thing again. People use portable drives for this, but hard drives are not archival media.

      From there, it would be nice to have an open source DAM or even an archiver that can store data on the backend with ECC encoding, so if there is a damaged sector or a file got corrupted, there is a high chance that it can be repaired. I have used WinRAR in the past, and the recovery record functionality has saved damaged records. Even something like Borg Backup that supports erasure coding, where one can add an additional percentage to the backend repository for ECC, can be something that can save data.

      In an age where every query about archival backup is met with, "Just put it in the cloud, dude," I don't think any of us are getting our wish for decent optical backup solutions anytime soon. I've relegated myself to mirrors at home, once a month swapped disks in a firesafe, and *a* cloud backup, but would love to have a real archival option for those monthly / yearly backups. And this is just for shit that I know nobody but me will ever care about. Real data? Forget it. We're too obsessed with profit to give a shit about base knowledge data that the entire species may want in the future. I'm sure the current technologists believe the AI Gods of the future will fix it all for us.

      • by frdmfghtr ( 603968 ) on Tuesday December 03, 2024 @05:04PM (#64988691)

        There is already an optical form of archiving--print.

        Archive copies of data should not be working copies. Archive copies are meant to be put somewhere safe so they can be recalled if needed. Want searchable digital copies? Those are working copies, meant to be poked and prodded and searched. If they get corrupted or run into a technological dead end where they can't be converted to a new format, you carefully re-scan the archive copies. The archive copies go back into protective storage and the new digital copies go out into the world.

        Printed forms of the data aren't convenient to search that's for sure--but properly made (archive-quality ink and acid-free paper, etc) and preserved they can last centuries. And you don't have to worry about the file format being impossible to read by future technology. We have already seen that digital archives can disappear like a fart in the wind.

        Paper isn't perfect but in my experience proves to be a better archive vehicle of anything that can be preserved on paper than anything digital.

        • There is already an optical form of archiving--print.

          Archive copies of data should not be working copies. Archive copies are meant to be put somewhere safe so they can be recalled if needed. Want searchable digital copies? Those are working copies, meant to be poked and prodded and searched. If they get corrupted or run into a technological dead end where they can't be converted to a new format, you carefully re-scan the archive copies. The archive copies go back into protective storage and the new digital copies go out into the world.

          Printed forms of the data aren't convenient to search that's for sure--but properly made (archive-quality ink and acid-free paper, etc) and preserved they can last centuries. And you don't have to worry about the file format being impossible to read by future technology. We have already seen that digital archives can disappear like a fart in the wind.

          Paper isn't perfect but in my experience proves to be a better archive vehicle of anything that can be preserved on paper than anything digital.

          For "words on a page" media, I agree. For my music and video creations, I'd still like optical for archival purposes.

        • While in general I love 'archiving' on paper - I always prefer to have paper copies of my work, just in case - you have to realize that format has certain serious operational limitations that make it inherently inferior in today's world.

          One of the most serious: storage capacity. Assuming font size readable by humans without microscopes or microdot readers, you would need huge physical libraries to store your 'data'. The full contents of a large physical library could be easily stored (in text form) on a poc

        • > Want searchable digital copies? Those are working copies, meant to be poked and prodded and searched. If they get corrupted

          Not when you burn them to read only optical discs, then they are just the same as paper, only more dense.

          But yes, you are right. In fact there is a company printing stuff to stone tablets that are then archived in their vault inside an old salt mine. You can even upload your own text for free; they will include it if you can give a good reason.

      • > "Just put it in the cloud, dude,"

        It's the same here where I work. Here we must remain offline; we can't ever employ the cloud for security reasons, as sovereignty of data is a REQUIREMENT for the customers.

        Thus it must remain on-site and we need to archive much of it, to tape usually. I just implemented the LTO8 tape system, upgrading us from LTO4/6. I'm now migrating all the old tapes, DDS/DAT tapes from the 90's and every LTO version since LTO1 as well as some optical disks upwards to LTO8.

        I eve

    • Blu-ray disks use a 405 nm diode laser and can write with a spot size down to 150 nm https://en.wikipedia.org/wiki/... [wikipedia.org] To increase density one needs to use the same tricks as in semiconductor production, such as very expensive deep UV lasers (excimer ArF 193 nm) or immersion lithography in water or refractive oils. These solutions are not practical for a consumer product.

    • It's not just the media, it's the files themselves. We've already lost software to read tons of file formats that were in common use only 30 years ago. Good luck opening those old WordPerfect or StarCalc files. You already have to install extra software just to open legacy .doc or .xls files. PDF itself keeps evolving. You can put your bits on long-lasting optical discs, but if you don't have software to read them, it's no simple thing to decode them after the fact.

      • This is a non-issue. If the formats are properly documented, then even if the software that uses them gets lost or inoperable, we can always write another.

        • And...what format is this "proper" documentation? And is that format "properly" documented?

          The DOC and XLS file formats, for example, are documented, including in book form. But the documentation is not comprehensive. Some aspects, such as the way charts are embedded in an XLS file, or other blobs such as images in DOC and XLS files, are not part of the format, but part of the format of other software, some of which may or may not be "open" or have good documentation. When people use Word and Excel, they don't

      • > Good luck opening those old WordPerfect or StarCalc files

        Whilst your statement is sound, you have two problems. First of all, today's formats are far from getting "lost"; all the old formats you talk of were from the early days of computing, when competition was strong and everyone had their own special format. These days much less so, as the formats we use today have been developed specifically to not suffer such a fate. Proprietary exceptions exist, and encryption kills all formats, till you can c

        • My examples aren't perfect because all of digital computer history is still pretty recent, as computers have existed for only about 80 years (ENIAC). As time passes, some of the older, less-used software will fade from memory, people will lose interest and stop updating them. In 100 years, will we even have Windows computers? Maybe, maybe not. In 500? Who knows! Availability of software to read archaic file formats follows a hyperbolic curve, with a long tail, but not infinite, and steadily (if slowly) appr

    • Pioneer to the rescue.

      Pioneer recently, in response to the Japanese stating that business records must be archived for a minimum of 100 years, has developed a burner and companion disc that meet ISO 18630. This standard defines a disc and drive combination, testing parameters, etc. to ensure that such discs are capable of lasting up to 100 years.

      The compliant drive is also made of sturdier stuff. In 100 years there will certainly be a form of demand to access this media so some form of device to read them

  • by methano ( 519830 ) on Tuesday December 03, 2024 @04:17PM (#64988589)
    The truth is that the "good" publishers like Nature, Science, ACS, Elsevier, etc. are publishing so many journals these days that the average quality is plumitting. This does not even include a lot of lessor journals. If a lot of it rots and disappears, it will be easier to find the good stuff. I'm not fretting.
    • The truth is that the "good" publishers like Nature, Science, ACS, Elsevier, etc. are publishing so many journals these days that the average quality is plumitting.

      Just like spelling.
      • by methano ( 519830 )
        Sorry about the spelling. It should be "plummeting". And I felt good about spelling Elsevier correctly.
    • yeah, came here to say similar: Cross reference this article with the article here yesterday saying how X% of scientific publishing (in China... but also pretty much everywhere) is fake, false, or otherwise untrue. Just because you publish doesn't make it valuable. I'm guessing there is a lot of "ai" generated content flooding in...
      • True. But if it gets referenced you'll want to see what was referenced.

      • Also, there's a fair bit of stuff that isn't untrue, but is borderline useless. In math for example, there are a whole bunch of low quality "journals" which publish occasionally false things but most of what they publish are true but largely trivial statements or reprove things which are already well known in their fields. Very little is going to be lost if these journals are not preserved.
  • That nobody gives a shit to do the right thing, the only thing that is obviously correct in this situation, is because of capitalism. The actual solution to the problem, doing things correctly in an open source, actually collaborative way using the original internet intention for the internet and not this corporate whorescape we've created, shall forever be out of reach under capitalism, which seeks only to privatize the commons and to prevent obvious, simple, universal and free solutions from being enacte
    • That nobody gives a shit to do the right thing, the only thing that is obviously correct in this situation, is because of capitalism. The actual solution to the problem, doing things correctly in an open source, actually collaborative way using the original internet intention for the internet and not this corporate whorescape we've created, shall forever be out of reach under capitalism, which seeks only to privatize the commons and to prevent obvious, simple, universal and free solutions from being enacted.

      Show us the profit in cooperative, open-source solutions, and we'd be all about them. Thus far, any attempt gets bastardized and whore-ified, like the Internet has been, because it's more profitable to be a society of assholes hoarding and pillaging than it is to be cooperative. And I'm sorry to say it yet again, but we passed the point of giving a shit about anything other than profit a long, LONG time ago. At least in the West. Hells, profit comes before life and health. Why the fuck would we care about i

  • It feels like IP has more rights than human beings now.

    • Of course it does. IP is the modern chains by which others are bound. To force creators to a particular company, to steal unrelated ideas from workers off the clock, to ban competition, to shakedown those you dislike, to silence critics, to control supply lines and distributors, to buy political favors, to demand perpetual payment for a one time effort. All of these are things that IP allows the rightsholders, and they will fight tooth and nail to prevent anything that curtails that power which makes them m
      • Whatever point in our evolution that allowed us to start believing in artificial concepts like money, religion, and the like? That was our turning point. We could have, at some point, turned that ability into a good thing, but we somehow convinced ourselves the artificial concept of money was the most important thing, and infused the concept of money with religious tendencies until all that matters is profit, profit, and more profit. We'll fucking drown in profit as a species. Well, the top of the top will

  • Yeah we're losing a large number of research papers. But many of them *should* fade into obscurity. Just as with biological evolution, the fittest will survive. That will be a small percentage of the total, but in the end, this is how progress happens.

  • The Sokal Affair showed how easy it is to slip complete gibberish past peer review for publication. Add the explosion of papers that analyze other papers only through statistical review of the number of papers rather than actually checking data, and it's not knowledge that is growing. It's the whimsy of the politics of the authors.

  • So much scientific publication these days is simply noise.
