Data Storage Science

Study Finds That We Could Lose Science If Publishers Go Bankrupt (arstechnica.com) 66

A recent survey found that academic organizations are failing to preserve digital material -- "including science paid for with taxpayer money," reports Ars Technica, highlighting the need for improved archiving standards and responsibilities in the digital age. From the report: The work was done by Martin Eve, a developer at Crossref. That's the organization that organizes the DOI system, which provides a permanent pointer toward digital documents, including almost every scientific publication. If updates are done properly, a DOI will always resolve to a document, even if that document gets shifted to a new URL. But it also has a way of handling documents disappearing from their expected location, as might happen if a publisher went bankrupt. There are a set of what's called "dark archives" that the public doesn't have access to, but should contain copies of anything that's had a DOI assigned. If anything goes wrong with a DOI, it should trigger the dark archives to open access, and the DOI updated to point to the copy in the dark archive. For that to work, however, copies of everything published have to be in the archives. So Eve decided to check whether that's the case.
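The fallback described above can be modelled in a few lines. This is only a toy sketch under stated assumptions: the `DoiRegistry` class, its fields, and the example DOIs and URLs are all hypothetical, and it is not how Crossref or any real archive implements the hand-off. It does show why a missing deposit leaves nothing to fall back to.

```python
from dataclasses import dataclass, field


@dataclass
class DoiRegistry:
    """Toy in-memory model of a DOI registry with a dark-archive fallback."""
    targets: dict = field(default_factory=dict)       # DOI -> current URL
    live_urls: set = field(default_factory=set)       # URLs that still respond
    dark_archive: dict = field(default_factory=dict)  # DOI -> archived copy URL

    def resolve(self, doi):
        """Return a working URL for the DOI, or None if the work is lost."""
        url = self.targets.get(doi)
        if url in self.live_urls:
            return url
        # Primary location is gone: open the archived copy and repoint the DOI.
        archived = self.dark_archive.get(doi)
        if archived is not None:
            self.targets[doi] = archived
            self.live_urls.add(archived)
            return archived
        return None  # never deposited, so there is nothing to fall back to


# Hypothetical scenario: the publisher has gone offline; only one paper was archived.
registry = DoiRegistry(
    targets={"10.1234/a": "https://publisher.example/a",
             "10.1234/b": "https://publisher.example/b"},
    live_urls=set(),
    dark_archive={"10.1234/a": "https://archive.example/a"},
)
print(registry.resolve("10.1234/a"))
print(registry.resolve("10.1234/b"))
```

In this sketch the first DOI ends up pointing at the archived copy, while the second resolves to nothing because no copy was ever deposited.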

Using the Crossref database, Eve got a list of over 7 million DOIs and then checked whether the documents could be found in archives. He included well-known ones, like the Internet Archive at archive.org, as well as some dedicated to academic works, like LOCKSS (Lots of Copies Keeps Stuff Safe) and CLOCKSS (Controlled Lots of Copies Keeps Stuff Safe). The results were... not great. When Eve broke down the results by publisher, less than 1 percent of the 204 publishers had put the majority of their content into multiple archives. (The cutoff was 75 percent of their content in three or more archives.) Fewer than 10 percent had put more than half their content in at least two archives. And a full third seemed to be doing no organized archiving at all. At the individual publication level, under 60 percent were present in at least one archive, and over a quarter didn't appear to be in any of the archives at all. (Another 14 percent were published too recently to have been archived or had incomplete records.)
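The per-publisher cutoffs quoted above can be expressed as a small classifier. This is a rough sketch, not Eve's actual method: the function name, the tier labels, and the rule that "no archiving" means zero works found in any archive are assumptions made for illustration.

```python
def publisher_tier(archive_counts):
    """Classify a publisher from the number of archives holding each work.

    archive_counts: one integer per DOI, the number of archives with a copy.
    """
    total = len(archive_counts)
    if total == 0:
        return "no records"

    def share(min_archives):
        # Fraction of works found in at least `min_archives` archives.
        return sum(1 for n in archive_counts if n >= min_archives) / total

    if share(3) >= 0.75:
        return "75% in three or more archives"
    if share(2) > 0.5:
        return "over half in two or more archives"
    if share(1) == 0:
        return "no archiving detected"  # assumed definition, see lead-in
    return "partial archiving"


print(publisher_tier([3, 3, 3, 1]))
print(publisher_tier([2, 2, 2, 0]))
print(publisher_tier([0, 0]))
```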

The good news is that large academic publishers appear to be reasonably good about getting things into archives; most of the unarchived issues stem from smaller publishers. Eve acknowledges that the study has limits, primarily in that there may be additional archives he hasn't checked. There are some prominent dark archives that he didn't have access to, as well as things like Sci-hub, which violates copyright in order to make material from for-profit publishers available to the public. Finally, individual publishers may have their own archiving system in place that could keep publications from disappearing. The risk here is that, ultimately, we may lose access to some academic research.

This discussion has been archived. No new comments can be posted.

  • We need a Foundation (Score:5, Interesting)

    by Baron_Yam ( 643147 ) on Saturday March 09, 2024 @10:53AM (#64302341)

    The idea's always appealed to me. From time to time we re-codify laws, we need to do the same for all human knowledge.

    Set a baseline year - presumably whenever you start the project - and try to distill everything into a single encyclopedia of all human knowledge, a reference for getting someone from the stone age to the information age in as few steps as possible. We really don't need every paper ever published, and trying to keep it all means we're not in control of what is missing when something inevitably gets lost. It's a massive project, and would take a long time and inevitably be imperfect, but I think it's worth the attempt.

    When you have the best encyclopedia you can churn out, you etch it into quartz tablets that can be human-read with a decent magnifying glass, and you put those tablets into a cave in some very geologically stable place. And then you take your digital copies and give them to whoever wants them.

    • by parityshrimp ( 6342140 ) on Saturday March 09, 2024 @11:24AM (#64302409)

      This is basically a publicly funded version of Wikipedia that is maintained by actual scholars, right? And they would release a new official version every decade or something.

      I like it. I wish the leaders of human societies had a set of values that would lead them to fund something like this.

      • >This is basically a publicly funded version of Wikipedia that is maintained by actual scholars, right?

        More or less, though I would think any 10-year updates should be purely electronic; filling a new cave with etched quartz tablets every decade would be expensive and time consuming. I'd save that for every century.

      • We have this already. It’s called the academic literature. DOIs and the citation structure basically make an academic-driven Wikipedia. But it’s too large for quartz tablets. And it’s not understandable by your average Florida man.
      • I like it. I wish the leaders of human societies had a set of values that would lead them to fund something like this.

        The only thing that matters is profit. Fuck you. I got mine and I will ensure you never get yours. I will monetize your efforts at getting what you think you deserve. You are my slave. Forever.

        This is such a wonderfully constructed society that we live in. I can see why the poors support it so vigorously. :)

    • Sure, sure. And the government needs to provide seed money for this foundation. And publishers will be required by law to archive documents for a fee. And this foundation will charge a small fee for accessing the collected archives. Of course, it will be a 501c3.
      • I would say it would be enough to exempt the foundation from IP law and allow them to copy whatever they need to fulfill their mission. The whole point would be to dump the unnecessary stuff and distill our accumulated knowledge down to the straightest possible path from 'small nomadic family groups banging rocks together' to the present day.

    • We’d probably also want to make it so anyone could make a clone, via torrent or rsync. This way we wouldn’t be vulnerable to a single point of failure.

      Previously I’ve considered the idea of academic and research institutions hosting their own PDFs in a form that facilitates publishing and replication.

    • by HiThere ( 15173 )

      Call it "the Encyclopedia Foundation".

      • by micheas ( 231635 )
        Or maybe the Library of Congress
      • We could put it somewhere on the edge of civilization, someplace unpleasant and barely habitable so it will be beyond political interference. ...maybe Australia.

    • by micheas ( 231635 )

      The idea's always appealed to me. From time to time we re-codify laws, we need to do the same for all human knowledge.

      Set a baseline year - presumably whenever you start the project - and try to distill everything into a single encyclopedia of all human knowledge, a reference for getting someone from the stone age to the information age in as few steps as possible. We really don't need every paper ever published, and trying to keep it all means we're not in control of what is missing when something inevitably gets lost. It's a massive project, and would take a long time and inevitably be imperfect, but I think it's worth the attempt.

      When you have the best encyclopedia you can churn out, you etch it into quartz tablets that can be human-read with a decent magnifying glass, and you put those tablets into a cave in some very geologically stable place. And then you take your digital copies and give them to whoever wants them.

      Isn't that why we have the Library of Congress?

    • We need a laser printer and a Xerox machine.

    • Set a baseline year - presumably whenever you start the project - and try to distill everything into a single encyclopedia of all human knowledge, a reference for getting someone from the stone age to the information age in as few steps as possible.

      LOL, fuck no. How would we monetize the learning process then? It is better to go back to the stone age than lose a penny of profit. Fuck the human race, my experience of existence is purely about ME.

  • Download everything and put it in a GitHub repo immediately.

    It's not perfect but it's way more likely to be there in 100 years than Internet Archive.

    • s/Hub//

      Git is good, GitHub is currently owned by an extremely vile corporation that 1. actively destroys stuff it disagrees with, 2. may shut it down the moment it believes it's not profitable, or a project manager loses a company politics battle.

      Of course, having a GitHub repo mirror the real stuff would be good for exposure.

      • Ok, but the problem is that companies go bankrupt and their servers disappear. GitHub is run by Microsoft. And MS is arguably the company in the world most likely to exist in 100 years.

    • I’d see it more as a collection of web servers that host the PDF and an associated metadata file, that mirror their data. I suppose something like the way DNS operates? Each one would also have a basic REST API (standardised via RFC) on how to query the data.

      You’d then have front ends that leverage that data for discovery.
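A minimal sketch of what such a metadata file and a front-end lookup might look like. Everything here is hypothetical: the field names, the mirror names, and the in-memory dictionaries that stand in for HTTP requests to each server. No such RFC or API exists as far as this sketch assumes.

```python
import hashlib
import json


def make_record(doi, title, pdf_bytes):
    """Build the metadata file that would sit next to a hosted PDF."""
    return {
        "doi": doi,
        "title": title,
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),  # lets mirrors verify copies
        "size_bytes": len(pdf_bytes),
    }


def find_copies(doi, mirrors):
    """Return the names of mirrors that hold a record for the DOI.

    `mirrors` maps a mirror name to its {doi: record} index, standing in
    for a GET request against each server's (hypothetical) REST API.
    """
    return sorted(name for name, index in mirrors.items() if doi in index)


pdf = b"%PDF-1.7 placeholder bytes"
record = make_record("10.1234/example", "An Example Paper", pdf)
mirrors = {
    "mirror-a": {record["doi"]: record},
    "mirror-b": {},
    "mirror-c": {record["doi"]: record},
}
print(json.dumps(record, indent=2))
print(find_copies("10.1234/example", mirrors))
```

Carrying a checksum in the metadata is one way a mirror could confirm it received an intact copy before advertising it.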

  • Library of Congress, or local National/State libraries with legal/mandatory deposit requirements of publishers? No? The requirement is clearly there for publishers in Australia, at least https://ned.gov.au/resources/l... [ned.gov.au]

    • by ceoyoyo ( 59147 )

      Most countries do, including the US. The problem seems to be that the DOIs break. Which is a good reason why journals (at least any I've ever submitted to) don't let you reference by DOI.

    • by HiThere ( 15173 )

      Is it enforced? There are lots of laws on the books that would be just dandy, except that they're only selectively enforced.

  • by rsilvergun ( 571051 ) on Saturday March 09, 2024 @11:26AM (#64302419)
    The monthly ones. Funny thing about those: they didn't go away because they weren't selling.

    See, over time every single monthly anthology book had come to depend on a single company for distribution. This would have been the late 60s or early 70s.

    Turns out that distributor was sitting on a ton of extremely valuable land that they had just accumulated over time. A private equity firm (they didn't call them that back then) noticed and bought up the company and liquidated it for the assets.

    Suddenly none of these monthly sci-fi anthologies had a distributor. And it wasn't easy back then to get one so they all went tits up. Margins are tight in an industry like that and they can't go three or four months without any revenue.

    We have a very tightly and very vertically integrated economy with extremely little competition. The whole thing is basically Jenga. Now would be a good time to start enforcing antitrust law and breaking up monopolies.
  • by test321 ( 8891681 ) on Saturday March 09, 2024 @11:42AM (#64302467)

    * sci-hub hosts 88 million files and 95% of all DOIs. The full database is backed up by volunteers. In 2021 it was 77 TB.
    * National libraries are archiving a lot of these things; just the online library of the ministry of research of France archives 27.9 million documents from 9617 journals https://www.istex.fr/ [istex.fr]

    • by Anonymous Coward

      sci-hub hosts 88 million files and 95% of all DOIs. The full database is backed up by volunteers. In 2021 it was 77 TB.

      In my country (Belgium, and probably in several other Western countries too), sci-hub and lib-gen are blocked as "illegal websites".

      They should be sponsoring it as public service instead of blocking it!

      • by HiThere ( 15173 )

        You're mixing up two things.
        1) When a work goes out of copyright or is no longer available, who gets access?
        vs
        2) We want free access to everything

        This is about 1 being screwed up.

      • They can’t keep blocking access as fast as new extensions and mirrors pop up. I’ve researched several medical conditions I have and access to medical journal articles is the only way I’ve actually been able to save my own ass because here in the US if doctors can’t work it out in a single 15 minute time slot you are on your own. In that case if you can’t educate yourself on the current research you don’t get to live.

        And before the tired comments come in of just email
    • And 60% of what is stored is fraudulent, paper pleasing garbage that is not accessible to people with a college education. Their own peers struggle to identify what is actually said enough to refute it.

  • by v1 ( 525388 ) on Saturday March 09, 2024 @11:42AM (#64302469) Homepage Journal

    History is full of scaremongering (often termed FUD) where a group with a major financial interest in maintaining some status-quo tries to make an argument about how bad some looming decision will be "for everyone", and how some industry will be destroyed and vanish if regulated. But upon closer inspection, the "industry" they are referring to is currently operating in coordinated, abusive behaviors that are making it hugely profitable, and the proposed regulation or oversight is threatening to rein that in to a more reasonable and fair practice. This isn't going to "destroy" the industry, but will end the heyday they're currently enjoying, and restore balance to the market they're in. And that's actually what they're fighting to prevent, or at least slow down.

    Scientific publishing is the industry in question here, and it ticks all the boxes. It's hugely profitable, is standing on the shoulders of hard-working people, and is siphoning huge money from both the supply and demand sides of the industry to fatten the publishers. None of this is serving the public good. Their "happy times" are coming to an end, and they'll be fighting us tooth-and-nail to make it last as long as they can before the balance is restored and their cash cow wanders off into the sunset. This isn't about protecting an industry, it's about squeezing as much bonus profit as possible out of a broken system before someone manages to repair it.

  • All those billions made from charging tuition fees larger than mortgages should be required to pay for it. If they want the money, they should be forced to use it for something other than another yacht.
    • University Endowments should pay for storage

      Money is not the problem, universities already have publication repositories. The problem is authors should upload the files themselves, and nobody really cares. A few universities force professors to upload their documents (by mandating it for the yearly evaluation).

      • Yearly wouldn't be often enough. You'd need to make it so regular that it becomes a habit. This is essentially the same problem as getting office workers to do their filing. Do you think someone somewhere might have come up with a solution to that problem?
  • Someone already mentioned national libraries [slashdot.org]. If those funding research required that any publications be permanently archived in a national library or other likely-to-last-generations publicly-accessible location, data loss would become a non-issue.

    That's a separate issue from DOIs going stale though. But as long as the citation leads to a permanently-accessible copy, online or on paper/microfilm, the knowledge won't be lost.

    • by HiThere ( 15173 )

      If you think that would make data loss a non-issue, talk to a data librarian. It's a big issue, or it sure was the last time I checked. Some programs that I used to run have just disappeared, and no longer show up on any search that I've done. One particular one that I'm thinking of shipped with an AT&T Unix system, so it was probably public domain. My company lost all copies of the 1960 census over a few decades. (The original media turned out to be unreadable after that time.) I'm not sure any

      • by davidwr ( 791652 )

        I meant data loss specific to the "published journal articles disappear" issue.

        I should have been more clear when I said "permanently archived" - I meant that the institution archiving the data would do everything feasible to keep it readable for generations or centuries, such as having multiple copies in multiple locations that aren't dependent on unavailable technology. Even with that, there are limits to "doing everything feasible." If WW3 breaks out and humanity is reduced to the stone age or worse,

  • IMO, they are not "publishers" unless they distribute copies that I can put on my bookshelf.

    I don't know about academia, but sometimes works in today's commercial (streaming) market disappear because their publishers explicitly want them gone. And I don't doubt that they would come after third-party archiving services in the process.

  • WHAT problem? (Score:2, Interesting)

    by rickb928 ( 945187 )

    Many/most/all of these papers are published by academicians. Their employers, these institutions, can't be bothered to preserve the work they benefit from and profit from?

    This is the problem. The scientists and their institutions really do not value their own work. But in America these institutions are conquered by the Left, which only values itself.

    • What a godawful ignorant comment you've made. Scientists and their orgs do value their work. The whole system is screwy. We pay scientists jack shit, make them scrounge for research money, then write technical papers that they get paid squat for, charge them for things like public access and color graphs, and make a bunch of other scientists review those papers for free and create more unpaid labor for the writing scientists. Finally it gets published, which you pay thousands of dollars to access either as
  • by Gravis Zero ( 934156 ) on Saturday March 09, 2024 @12:40PM (#64302613)

    Scientific works are much different than other published works as they contain knowledge. As such, that knowledge should be released into the public domain far sooner than other copyrighted works. The fact that the only plan for the preservation of scientific knowledge is Sci-hub, an "illegal" repository of scientific papers, shows just how poorly copyright law serves the public good.

    • Lib-gen is also a very useful repository. Sadly, neither is really comprehensive, and both lack quite a bit compared to what we could have if an open policy were adopted.
      • Open policies were tried in the past; for example, India and China were pushing ahead with the Million Books Project. Eventually these kinds of open projects were all defunded. Libgen (and its previous and future incarnations) can only survive through extralegal means, it seems.
  • We can't allow anyone/thing to lock up knowledge... Screw IP "investments"

  • ..showing that publishers need to be paid more
    How about a free, open access alternative
    Traditional publishers are dinosaurs

  • Print or microfiche at least 2 copies of everything* and store them long-term in different cities. That's essentially what science did for decades in the 20th century, at least up until the '80s: The author or author's institution kept at least one copy, the publisher kept a copy, and (for works published in the USA), the Library of Congress may have received a copy.

    That's for the final "journal-published paper" of course. All the work leading up to the final paper wasn't typically saved "forever."

    In tod

    • I forgot to put the target of the

      everything*

      in front of the 2nd paragraph. It should read

      * That's for the final "journal-published paper" of course. All the work leading up to the final paper wasn't typically saved "forever."

  • If any of these closed-access publishers are filing Chapter 11, give me a call and I'll go buy a couple of 16TB drives off Amazon and go there to make a copy then arrange to get copies to several archivists.

    It should be like $2000 per journal which is less than an open-access publishing fee for one paper, then we can move on to distributed science.

  • All of this was predicted decades ago when folks started to realize how transient digital media actually is. Not only can documents and images vanish with the crunch of a hard drive read head hitting the platter, but the latest version of the operating system changes the data format and compression algorithms and does a one-time upgrade in the installer... or just declares the old stuff to be corrupt and unreadable. Meanwhile we revel in the insights provided by carbonized scrolls from Herculaneum or a dagu

  • Everyone who downloaded a copy of a paper and keeps it on their computer could be considered part of a distributed backup.

    And if not one single person has a copy of a paper, then maybe it’s not that important.
  • There are about 5 million scientific articles per year. Assuming 50 years available in electronic form, that’s 250 million articles. Assuming 4 MB per article, that’s 1000 TB.

    So we start a database. Every article has a hash like git hashes; we can ignore duplicates. We start by feeding all properly documented articles. And then anybody with any scientific articles runs a program that collects the checksums of scientific articles, checks which are missing and uploads them. Some AI might
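The size estimate and the hash-keyed store from the comment above can be sketched as follows. The assumptions are loud ones: SHA-256 is swapped in for the "git-like" hash, `ArticleStore` and its methods are hypothetical names, and hashing raw bytes only catches byte-identical files, so two different PDF renderings of the same article would still count as separate entries.

```python
import hashlib

ARTICLES_PER_YEAR = 5_000_000
YEARS = 50
MB_PER_ARTICLE = 4

total_articles = ARTICLES_PER_YEAR * YEARS               # 250 million
total_tb = total_articles * MB_PER_ARTICLE / 1_000_000   # decimal MB -> TB


def content_id(data):
    """Content address for a file: the hex SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()


class ArticleStore:
    """Toy content-addressed store; identical bytes are kept only once."""

    def __init__(self):
        self._blobs = {}

    def missing(self, ids):
        """IDs the store lacks, i.e. what a contributor should upload."""
        return [i for i in ids if i not in self._blobs]

    def add(self, data):
        """Store data under its hash; return True only if it was new."""
        key = content_id(data)
        if key in self._blobs:
            return False
        self._blobs[key] = data
        return True


store = ArticleStore()
store.add(b"paper one")
local = [b"paper one", b"paper two"]   # files found on a contributor's disk
wanted = store.missing([content_id(p) for p in local])
print(total_articles, total_tb, len(wanted))
```

A contributor's client would hash its local files, ask which IDs are missing, and upload only those, which keeps duplicate uploads off the wire.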
  • Important news, but many people ignore it. It is worth making open access to files. I recently used essay writing help for a similar article, found https://ca.edubirdie.com/essay-writing-help [edubirdie.com] for this. Science is all we have, without it we are just primates. There are unofficial places, although this breaks the rules.
