Study Finds That We Could Lose Science If Publishers Go Bankrupt (arstechnica.com)
A recent survey found that academic organizations are failing to preserve digital material -- "including science paid for with taxpayer money," reports Ars Technica, highlighting the need for improved archiving standards and responsibilities in the digital age. From the report: The work was done by Martin Eve, a developer at Crossref. That's the organization that runs the DOI system, which provides a permanent pointer to digital documents, including almost every scientific publication. If updates are done properly, a DOI will always resolve to a document, even if that document gets shifted to a new URL. But the system also has a way of handling documents that disappear from their expected location, as might happen if a publisher went bankrupt. There is a set of what are called "dark archives" that the public doesn't have access to, but which should contain copies of anything that's had a DOI assigned. If anything goes wrong with a DOI, the dark archives should be triggered to open access, and the DOI updated to point to the copy in the dark archive. For that to work, however, copies of everything published have to be in the archives. So Eve decided to check whether that's the case.
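That resolution step is easy to see in action: the public doi.org proxy exposes a REST endpoint that reports where a DOI currently points. A minimal sketch in Python (assuming the third-party requests package, and using the DOI Handbook's own DOI purely as a known-good test value):

```python
import requests

def resolve_doi(doi):
    """Ask the doi.org handle proxy where a DOI currently points."""
    resp = requests.get(f"https://doi.org/api/handles/{doi}", timeout=10)
    if resp.status_code != 200:
        return None  # unknown or malformed DOI
    # The handle record is a list of typed values; the "URL" entry is the target.
    for value in resp.json().get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None

print(resolve_doi("10.1000/182"))  # the DOI Handbook's own DOI
```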
Using the Crossref database, Eve got a list of over 7 million DOIs and then checked whether the documents could be found in archives. He included well-known ones, like the Internet Archive at archive.org, as well as some dedicated to academic works, like LOCKSS (Lots of Copies Keeps Stuff Safe) and CLOCKSS (Controlled Lots of Copies Keeps Stuff Safe). The results were... not great. When Eve broke down the results by publisher, less than 1 percent of the 204 publishers had put the majority of their content into multiple archives. (The cutoff was 75 percent of their content in three or more archives.) Fewer than 10 percent had put more than half their content in at least two archives. And a full third seemed to be doing no organized archiving at all. At the individual publication level, under 60 percent were present in at least one archive, and over a quarter didn't appear to be in any of the archives at all. (Another 14 percent were published too recently to have been archived or had incomplete records.)
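The Internet Archive portion of a check like that can be approximated with the Archive's public availability API. A rough sketch of the idea, not Eve's actual pipeline (Python with requests; it covers only the Wayback Machine, since the dark archives offer no comparable public lookup):

```python
import requests

def in_wayback(doi):
    """Check whether the Wayback Machine holds a snapshot of a DOI's URL."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": f"https://doi.org/{doi}"}, timeout=10)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))

print(in_wayback("10.1000/182"))
```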
The good news is that large academic publishers appear to be reasonably good about getting things into archives; most of the unarchived issues stem from smaller publishers. Eve acknowledges that the study has limits, primarily in that there may be additional archives he hasn't checked. There are some prominent dark archives that he didn't have access to, as well as things like Sci-hub, which violates copyright in order to make material from for-profit publishers available to the public. Finally, individual publishers may have their own archiving system in place that could keep publications from disappearing. The risk here is that, ultimately, we may lose access to some academic research.
We need a Foundation (Score:5, Interesting)
The idea's always appealed to me. From time to time we re-codify laws; we need to do the same for all human knowledge.
Set a baseline year - presumably whenever you start the project - and try to distill everything into a single encyclopedia of all human knowledge, a reference for getting someone from the stone age to the information age in as few steps as possible. We really don't need every paper ever published, and trying to keep it all means we're not in control of what is missing when something inevitably gets lost. It's a massive project, and would take a long time and inevitably be imperfect, but I think it's worth the attempt.
When you have the best encyclopedia you can churn out, you etch it into quartz tablets that can be human-read with a decent magnifying glass, and you put those tablets into a cave in some very geologically stable place. And then you take your digital copies and give them to whoever wants them.
Re:We need a Foundation (Score:4, Interesting)
This is basically a publicly funded version of wikipedia that is maintained by actual scholars, right? And they would release a new official version every decade or something.
I like it. I wish the leaders of human societies had a set of values that would lead them to fund something like this.
Re: (Score:2)
>This is basically a publicly funded version of wikipedia that is maintained by actual scholars, right?
More or less, though I would think any 10-year updates should be purely electronic; filling a new cave with etched quartz tablets every decade would be expensive and time consuming. I'd save that for every century.
Re: (Score:2)
>I like it. I wish the leaders of human societies had a set of values that would lead them to fund something like this.
The only thing that matters is profit. Fuck you. I got mine and I will ensure you never get yours. I will monetize your efforts at getting what you think you deserve. You are my slave. Forever.
This is such a wonderfully constructed society that we live in. I can see why the poors support it so vigorously. :)
Re: (Score:2)
I would say it would be enough to exempt the foundation from IP law and allow them to copy whatever they need to fulfill their mission. The whole point would be to dump the unnecessary stuff and distill our accumulated knowledge down to the straightest possible path from 'small nomadic family groups banging rocks together' to the present day.
Re: (Score:2)
We'd probably always want to make it so anyone could make a clone, via torrent or rsync. This way we wouldn't be vulnerable to a single point of failure.
Previously I've considered the idea of academic and research institutions hosting their own PDFs in a form that facilitates publishing and replication.
Re: (Score:3)
Call it "the Encyclopedia Foundation".
Re: (Score:2)
We could put it somewhere on the edge of civilization, someplace unpleasant and barely habitable so it will be beyond political interference. ...maybe Australia.
Re: (Score:2)
>The idea's always appealed to me. From time to time we re-codify laws; we need to do the same for all human knowledge.
>Set a baseline year - presumably whenever you start the project - and try to distill everything into a single encyclopedia of all human knowledge, a reference for getting someone from the stone age to the information age in as few steps as possible. We really don't need every paper ever published, and trying to keep it all means we're not in control of what is missing when something inevitably gets lost. It's a massive project, and would take a long time and inevitably be imperfect, but I think it's worth the attempt.
>When you have the best encyclopedia you can churn out, you etch it into quartz tablets that can be human-read with a decent magnifying glass, and you put those tablets into a cave in some very geologically stable place. And then you take your digital copies and give them to whoever wants them.
Isn't that why we have the Library of Congress?
Re: (Score:2)
We need a laser printer and a Xerox machine.
Re: (Score:2)
>Set a baseline year - presumably whenever you start the project - and try to distill everything into a single encyclopedia of all human knowledge, a reference for getting someone from the stone age to the information age in as few steps as possible.
LOL, fuck no. How would we monetize the learning process then? It is better to go back to the stone age than lose a penny of profit. Fuck the human race, my experience of existence is purely about ME.
Re: Highly Unlikely (Score:2)
Physics seems to be the only scientific discipline that has its shit together. Maybe it's the expensive experiments, but for whatever reason there seems to be a much better scientific culture.
Re: (Score:2)
It's not the only one. Personally, I think computer science is ahead because, as you might guess, computer scientists have always been pretty skeptical of paying journals to host PDFs. Thus things like this:
https://openaccess.engineering... [oregonstate.edu]
Some of the worst fields are also the ones where citations matter the most though. Fields where the citation isn't about something that's independently verifiable, but rather about what somebody has written.
Not happening in computer science either (Score:2)
>At least in my field (theoretical physics) this is highly unlikely
It's not happening in computer science. At my Uni your thesis gets printed out and bound as a book and goes into the library. The thesis describes any discoveries or algorithms; source code is an implementation detail anyone can replicate from the thesis content, so its loss is no great impediment to science. The loss of the PDF version of the thesis is an inconvenience. The Uni should have a digital library with the PDF, just as it has a physical library section with the printed and bound copy.
If a paper is fun
Re: (Score:2)
NASA (and partners) does a good job of managing their archive. The technical reports server https://ntrs.nasa.gov/ [nasa.gov] has given me many hours of edutainment.
GitHub (Score:1)
Download everything and put it in a GitHub repo immediately.
It's not perfect but it's way more likely to be there in 100 years than Internet Archive.
Re: (Score:2)
s/Hub//
Git is good; GitHub is currently owned by an extremely vile corporation that 1) actively destroys stuff it disagrees with, and 2) may shut it down the moment it believes it's not profitable or a project manager loses a company-politics battle.
Of course, having a GitHub repo mirror the real stuff would be good for exposure.
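For what it's worth, keeping such a mirror in sync is only a few stock git commands. A sketch driven from Python (the repository URLs are placeholders, not real archives):

```python
import subprocess

# One-time: take a full mirror (all refs) of the canonical archive repo.
subprocess.run(["git", "clone", "--mirror",
                "https://example.org/science-archive.git", "archive.git"],
               check=True)

# Periodically: refresh the mirror and push everything to the GitHub copy.
subprocess.run(["git", "remote", "update"], cwd="archive.git", check=True)
subprocess.run(["git", "push", "--mirror",
                "https://github.com/example/science-archive-mirror.git"],
               cwd="archive.git", check=True)
```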
Re: GitHub (Score:2)
Ok, but the problem is that companies go bankrupt and their servers disappear. GitHub is run by Microsoft. And MS is arguably the company in the world most likely to exist in 100 years.
Re: (Score:2)
I'd see it more as a collection of web servers that host the PDF and an associated metadata file, and that mirror each other's data. I suppose something like the way DNS operates? Each one would also have a basic REST API (standardised via RFC) for querying the data.
You'd then have front ends that leverage that data for discovery.
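A toy version of one such node, sketched in Python with Flask; the endpoint paths, metadata fields, and file locations are all invented for illustration (no such RFC exists yet):

```python
from flask import Flask, jsonify, send_file, abort

app = Flask(__name__)

# Hypothetical local store: DOI -> (metadata, path to the PDF on disk).
STORE = {
    "10.1000/182": ({"title": "Example Paper", "year": 2024},
                    "pdfs/example-paper.pdf"),
}

@app.route("/api/v1/papers/<path:doi>")
def metadata(doi):
    """Serve the metadata for one DOI (the slash in a DOI needs <path:>)."""
    if doi not in STORE:
        abort(404)
    return jsonify(STORE[doi][0])

@app.route("/api/v1/papers/<path:doi>/pdf")
def pdf(doi):
    """Serve the PDF itself."""
    if doi not in STORE:
        abort(404)
    return send_file(STORE[doi][1], mimetype="application/pdf")

if __name__ == "__main__":
    app.run(port=8080)
```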
National libraries (Score:2)
Library of Congress, or local national/state libraries with legal/mandatory deposit requirements for publishers? No? The requirement is clearly there for publishers in Australia, at least: https://ned.gov.au/resources/l... [ned.gov.au]
Re: (Score:2)
Most countries do, including the US. The problem seems to be that the DOIs break, which is a good reason why journals (at least any I've ever submitted to) don't let you reference by DOI.
Re: (Score:2)
Is it enforced? There are lots of laws on the books that would be just dandy, except that they're only selectively enforced.
Anyone old enough to remember pulp sci-fi books (Score:3)
See, over time every single monthly anthology had come to depend on a single company for distribution. This would have been the late '60s or early '70s.
Turns out that distributor was sitting on a ton of extremely valuable land that they had just accumulated over time. A private equity firm (they didn't call them that back then) noticed and bought up the company and liquidated it for the assets.
Suddenly none of these monthly sci-fi anthologies had a distributor, and it wasn't easy back then to get one, so they all went tits up. Margins are tight in an industry like that, and they can't go three or four months without any revenue.
We have a very tightly and very vertically integrated economy with extremely little competition. The whole thing is basically Jenga. Now would be a good time to start enforcing antitrust law and breaking up monopolies.
Incomplete study, situation probably not that bad (Score:4, Informative)
* Sci-hub hosts 88 million files and 95% of all DOIs. The full database is backed up by volunteers; in 2021 it was 77 TB.
* National libraries are archiving a lot of these things; the online library of the French Ministry of Research alone archives 27.9 million documents from 9,617 journals: https://www.istex.fr/ [istex.fr]
Re: (Score:1)
>Sci-hub hosts 88 million files and 95% of all DOIs. The full database is backed up by volunteers; in 2021 it was 77 TB.
In my country (Belgium, and probably in several other Western countries too), sci-hub and lib-gen are blocked as "illegal websites".
They should be sponsoring it as public service instead of blocking it!
Re: (Score:2)
You're mixing up two things.
1) When a work goes out of copyright or is no longer available, who gets access?
vs
2) We want free access to everything
This is about 1 being screwed up.
Re: (Score:2)
And before the tired comments come in of just email
Re: (Score:2)
And 60% of what is stored is fraudulent, paper-pleasing garbage that is incomprehensible even to people with a college education. Their own peers struggle to identify what is actually being said well enough to refute it.
this looks like a familiar game (Score:5, Interesting)
History is full of scaremongering (often termed FUD) where a group with a major financial interest in maintaining some status-quo tries to make an argument about how bad some looming decision will be "for everyone", and how some industry will be destroyed and vanish if regulated. But upon closer inspection, the "industry" they are referring to is currently operating in coordinated, abusive behaviors that are making it hugely profitable, and the proposed regulation or oversight is threatening to rein that in to a more reasonable and fair practice. This isn't going to "destroy" the industry, but will end the heyday they're currently enjoying, and restore balance to the market they're in. And that's actually what they're fighting to prevent, or at least slow down.
Scientific publishing is the industry in question here, and it ticks all the boxes. It's hugely profitable, is standing on the shoulders of hard-working people, and is siphoning huge money from both the supply and demand sides of the industry to fatten the publishers. None of this is serving the public good. Their "happy times" are coming to an end, and they'll be fighting us tooth-and-nail to make it last as long as they can before the balance is restored and their cash cow wanders off into the sunset. This isn't about protecting an industry, it's about squeezing as much bonus profit as possible out of a broken system before someone manages to repair it.
University Endowments should pay for storage. (Score:2)
Re: (Score:2)
>University Endowments should pay for storage
Money is not the problem, universities already have publication repositories. The problem is authors should upload the files themselves, and nobody really cares. A few universities force professors to upload their documents (by mandating it for the yearly evaluation).
Make it a funding requirement (Score:1)
Someone already mentioned national libraries [slashdot.org]. If those funding research required that any publications be permanently archived in a national library or other likely-to-last-generations publicly-accessible location, data loss would become a non-issue.
That's a separate issue from DOIs going stale though. But as long as the citation leads to a permanently-accessible copy, online or on paper/microfilm, the knowledge won't be lost.
Re: (Score:3)
If you think that would make data loss a non-issue, talk to a data librarian. It's a big issue, or it sure was the last time I checked. Some programs that I used to run have just disappeared, and no longer show up in any search that I've done. One particular one that I'm thinking of shipped with an AT&T Unix system, so it was probably public domain. My company lost all copies of the 1960 census over a few decades. (The original media turned out to be unreadable after that time.) I'm not sure any
Re: (Score:1)
I meant data loss specific to the "published journal articles disappear" issue.
I should have been more clear when I said "permanently archived" - I meant that the institution archiving the data would do everything feasible to keep it readable for generations or centuries, such as having multiple copies in multiple locations that aren't dependent on unavailable technology. Even with that, there are limits to "doing everything feasible." If WW3 breaks out and humanity is reduced to the stone age or worse,
Publishers? (Score:2)
IMO, they are not "publishers" unless they distribute copies that I can put on my bookshelf.
I don't know about academia, but sometimes works in today's commercial (streaming) market disappear because their publisher explicitly wants them gone. And I don't doubt that they would come after third-party archiving services in the process.
WHAT problem? (Score:2, Interesting)
Many/most/all of these papers are published by academicians. Their employers, these institutions, can't be bothered to preserve the work they benefit from and profit from?
This is the problem. The scientists and their institutions really do not value their own work. But in America these institutions are conquered by the Left, which only values itself.
Re: WHAT problem? (Score:2)
So it's just money.
Solution: make them public domain much sooner (Score:5, Interesting)
Scientific works are quite different from other published works in that they contain knowledge. As such, that knowledge should be released into the public domain far sooner than other copyrighted works. The fact that the only plan for the preservation of scientific knowledge is Sci-hub, an "illegal" repository of scientific papers, shows just how poorly copyright law serves the public good.
What this means (Score:2)
We can't allow anyone/thing to lock up knowledge... Screw IP "investments"
Publishers fund a study... (Score:2)
...showing that publishers need to be paid more.
How about a free, open access alternative
Traditional publishers are dinosaurs
Microfiche (Score:1)
Print or microfiche at least 2 copies of everything* and store them long-term in different cities. That's essentially what science did for decades in the 20th century, at least up until the '80s: The author or author's institution kept at least one copy, the publisher kept a copy, and (for works published in the USA), the Library of Congress may have received a copy.
That's for the final "journal-published paper" of course. All the work leading up to the final paper wasn't typically saved "forever."
In tod
the asterisk (Score:1)
I forgot to put the target of the "everything*" in front of the 2nd paragraph. It should read:
* That's for the final "journal-published paper" of course. All the work leading up to the final paper wasn't typically saved "forever."
Time to tax beakers (Score:2)
I'll do it (Score:2)
If any of these closed-access publishers are filing Chapter 11, give me a call and I'll buy a couple of 16TB drives off Amazon, go there to make a copy, then arrange to get copies to several archivists.
It should be like $2,000 per journal, which is less than the open-access publishing fee for one paper; then we can move on to distributed science.
And the surprise is? (Score:2)
All of this was predicted decades ago when folks started to realize how transient digital media actually is. Not only can documents and images vanish with the crunch of a hard-drive read head hitting the platter, but the latest version of the operating system changes the data format and compression algorithms and does a one-time upgrade in the installer... or just declares the old stuff to be corrupt and unreadable. Meanwhile we revel in the insights provided by carbonized scrolls from Herculaneum or a daguerreotype.
Distributed backup (Score:2)
And if not one single person has a copy of a paper, then maybe it's not that important.
And create a foundation... (Score:2)
So we start a database. Every article has a hash, like git hashes, and we can ignore duplicates. We start by feeding in all properly documented articles. Then anybody with scientific articles runs a program that collects their checksums, checks which are missing, and uploads them. Some AI might
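The checksum step that post describes is the easy part. A minimal sketch in Python, with SHA-256 standing in for the git-style hash and `have` as the set of checksums the database already holds:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content hash of one article file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def missing_articles(local_dir: Path, have: set) -> list:
    """Files whose hashes the archive doesn't already hold."""
    return [p for p in local_dir.glob("*.pdf")
            if sha256_of(p) not in have]
```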