Web Pages Are Weak Links in the Chain of Knowledge
PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""
Don't do that. (Score:5, Insightful)
Throwing out the baby with the bathwater (Score:5, Insightful)
What would be interesting would be a website that archives those snapshots for posterity. Well, what do you know, there are several such sites already! Looks like we're in good shape. The sky is not falling.
Rigidity stifles creativity (Score:5, Insightful)
Hardcopy (Score:5, Insightful)
Having a hardcopy (1) documents the information and its (purported) source, and (2) allows offline access for comparison and validation.
Re:Well, (Score:5, Insightful)
Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing.
Re:Books have an ISBN..(but web pages are googled) (Score:5, Insightful)
The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or not pages like this are worth preserving is another issue entirely).
Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?
With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).
Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get.
William
What's the problem here? (Score:5, Insightful)
Does anyone archive CB radio traffic ??
It's not a permanent storage medium, never could be; too many points of failure between your screen and the server holding the data.
Backup Your Important Data (Score:4, Insightful)
I constantly backup all my digital photos because they are important to me. I also print the best ones for placing in photo albums, distributing to friends, etc.
The website they are published to is just a delivery medium, and not even the primary one. It can disappear and I wouldn't care. People who know me can always get access to them. Scientists should view their work the same way.
revisionism (Score:1, Insightful)
long-term storage needs... (Score:2, Insightful)
Printed media, while having a low data/pound ratio, has managed to survive and span generations for centuries. I think the need for paper libraries cannot be forgotten. The challenge is distilling out what is worth keeping, and this challenge is better met now rather than later, because we still have a more or less good idea of what is significant information and what is crap.
Permalinking and archiving (Score:5, Insightful)
Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contain valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.
But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.
So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final-year project at uni) is to include a piece of additional metadata (by whatever mechanism you prefer) when publishing pages: data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User agents (browsers), when bookmarking, could make a note of that metadata and offer the option to bookmark the information rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location and discovering the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.
Of course, this requires changes at both the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, pointing to where the information will be archived or where a pointer to that information will be stored. This has the advantage of requiring only changes to the client side.
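Something like this minimal Python sketch illustrates the simpler variant; the meta name "archive" and the URLs are assumed conventions for the sake of example, not an existing standard:

```python
# Sketch of the "archive metadata" idea described above.
# The meta name "archive" is a hypothetical convention, not an existing standard.
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


class ArchiveMetaParser(HTMLParser):
    """Collects <meta name="archive" content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.archive_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "archive":
            self.archive_urls.append(attrs.get("content"))


def bookmark(url):
    """At bookmark time, note both the current location and its archive pointer."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = ArchiveMetaParser()
    parser.feed(html)
    archive = parser.archive_urls[0] if parser.archive_urls else None
    return {"location": url, "archive": archive}


def revisit(entry):
    """Later, fall back to the archived copy if the original is gone."""
    for url in (entry["location"], entry["archive"]):
        if not url:
            continue
        try:
            return urlopen(url).read()
        except (HTTPError, URLError):
            continue
    return None
```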
Suggestions of better solutions are always welcome.
Reason for this? (Score:5, Insightful)
If you want to do serious research.... (Score:4, Insightful)
For example, a short time ago, I did a white paper on power scavenging sources. About 1/2 the articles I read were HTML or PDF sources. Rather than just citing the URL, I downloaded/saved every online article I referenced. If someone wants the source and cannot find it, I'll just provide it to them. If your paper is going to be read by a number of people, it makes good sense to have those sources on-hand; it never hurts to cover your arse.
Hard drive/network/optical space is virtually unlimited, so storage isn't a problem. Paper journals are archived by most libraries anyway, so until libraries start archiving online technical sources the same way, I'm going to have to do my OWN archiving.
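A rough sketch of that kind of personal archiving, using only the Python standard library; the source URLs, file layout, and manifest format here are invented for illustration:

```python
# Sketch: save a local copy of every online source cited in a paper,
# along with the URL and retrieval date. URLs and layout are examples only.
import csv
import hashlib
from datetime import date
from pathlib import Path
from urllib.request import urlopen

SOURCES = [
    "http://example.org/power-scavenging-survey.pdf",   # hypothetical citations
    "http://example.org/thermoelectric-notes.html",
]

archive_dir = Path("cited_sources")
archive_dir.mkdir(exist_ok=True)

with open(archive_dir / "manifest.csv", "w", newline="") as manifest:
    writer = csv.writer(manifest)
    writer.writerow(["url", "retrieved", "local_file"])
    for url in SOURCES:
        data = urlopen(url).read()
        # Name the local copy after a hash of the URL to avoid collisions.
        name = hashlib.sha1(url.encode()).hexdigest()[:12] + Path(url).suffix
        (archive_dir / name).write_bytes(data)
        writer.writerow([url, date.today().isoformat(), name])
```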
Re:Well, (Score:5, Insightful)
Do you really think goatse will be "disturbing" 100 years from now?
The day goatse.cx is no longer disturbing is sure to be the first day of Armageddon
URL + date (Score:3, Insightful)
the problem is bigger (Score:5, Insightful)
because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.
i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.
yes. websites are disappearing. but there's a greater problem lurking in the background: the cost of preserving this stuff digitally, indefinitely. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)
Give and take - it's cultural change, dummy. (Score:5, Insightful)
Publications that cite [web pages] lose their authorities? Who the hell told you to cite a webpage? Might as well cite a poster you saw downtown. If the webpage comes from a reputable source in the first place, they'll keep it around permanently. Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.
This is no way to run a culture. Last time I checked, nobody ran our culture... It kinda runs itself. The proliferation of accessible, ephemeral webpages over permanent, privileged paper publications (wah, too many p's!) is a sign that our information culture has moved on into a new era. Liked the old one? Tough! Now information has to maintain its own relevance in order to be permanent... and I for one welcome that change.
Re:Well, (Score:5, Insightful)
Re:Hardcopy (Score:2, Insightful)
No way to run a culture? (Score:4, Insightful)
To the contrary, I think this is highly typical of the culture we have today, where everything is a transient fad in the media, technology and politics.
And it is also self-feeding, I think, since market forces need to clear out the old to make room for the new in order to meet sales forecasts and shareholder expectations. This is very true for pop, news and technology, which explains the lack of staying power of pop icons these days, and becomes interesting when you ask yourself if you really need that new 3GHz machine just to surf the web.
And it is highly convenient in politics where a politician doesn't have to be accountable for what he said 100 days ago.
And so the short lifespan of content on the web is simply symbolic of all the rest, really, even if it is highly questionable.
genguid and google (Score:3, Insightful)
Generate a GUID and place that number at the bottom of your page; then cite the page with a link to Google's "I'm feeling lucky" search for that GUID.
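A small Python sketch of that scheme; the uuid module is standard, and "btnI" is the query parameter Google has historically used for "I'm Feeling Lucky", but treat that parameter as an assumption rather than a stable API:

```python
# Sketch: generate a GUID to embed at the bottom of a page, and build an
# "I'm Feeling Lucky" citation URL that searches for that GUID rather than
# pointing at the page's current location.
import uuid
from urllib.parse import urlencode

guid = str(uuid.uuid4())  # embed this string in the page's footer

# "btnI" has historically triggered Google's "I'm Feeling Lucky" redirect;
# an assumption here, not a guaranteed interface.
citation_url = "http://www.google.com/search?" + urlencode({"q": f'"{guid}"', "btnI": "1"})

print("Embed in page:", guid)
print("Cite as:      ", citation_url)
```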
Berners-Lee considered harmful (Score:5, Insightful)
You can link to an article which is then changed by the original publisher (or someone else). With scientific papers, you can't do that -- and such behavior is probably not desirable.
On the up side, if you're currently using cited references, you should be able to build such a system without too much problem -- follow links to PDFs or automatically crawl HTML documents (and check images) and serve all papers that you refer to with your paper. It'd be big, but it provides better reliability than do current paper schemes.
Another feature that might be useful is signing of the content (assuming RSA doesn't get broken in the future).
Basically, if you put up a SHA-1 (as Gnutella uses), MD4 (as eDonkey uses), or similar reference, you can host the referred-to documents yourself, in addition to the original host.
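For instance, a minimal sketch of publishing a content hash alongside a citation, so a copy served by any mirror can be checked against the reference; SHA-1 is used because the comment mentions it, and the file names are made up:

```python
# Sketch: compute and verify a SHA-1 content hash for a cited document,
# so a copy served by any mirror can be checked against the citation.
import hashlib
from pathlib import Path


def content_hash(path):
    """Return the SHA-1 hex digest of a file's bytes."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


# When publishing: record the digest next to the reference.
published_digest = content_hash("frotz_rumble_1991.pdf")  # hypothetical file

# When verifying a mirrored copy later:
mirror_digest = content_hash("mirrored_copy.pdf")         # hypothetical file
if mirror_digest != published_digest:
    print("Warning: the mirrored document does not match the cited version.")
```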
Freenet would be a good choice for this if it didn't have the specific drawback that you cannot guarantee a document remains hosted for as long as you are willing to host it.
One possibility is that, with a bit of manual work, one can frequently find an academic work by Googling for its title. At least for now, as long as you host the original papers as well, Google should pick up on this fact. Of course, it does nothing to prevent modification of that paper by another party...
A good system for handling this would be a known archive that is willing to store content in perpetuity, probably hosted by the US government or some other reasonably stable, trustworthy source (yes, yes, cracks at the US government aside). This system would act like a Tier 1 NTP server -- it would only grant access to a number of other trusted servers that mirror it -- perhaps university systems -- which would keep load sane. These servers (or perhaps Tier 3 servers) would then provide public access. Questions of whether there would be a hard policy of never removing content, and of what would be allowed (especially WRT politically controversial content), would have to be answered.
There could be multiple Tier 1 servers that would sync up with each other, and could act as checks in case one server is broken into. I'm partial to the idea of including a signature on each file, but I suppose it isn't really necessary.
Specific formats could be required to ensure that these papers are readable for all time. Project Gutenberg went with straight ASCII. This would probably have to be slightly more elaborate. Microsoft Word and PDF might not be good choices, and international support would be necessary.
Re:Give and take - it's cultural change, dummy. (Score:4, Insightful)
Not always true. The U.S. Government was a good source for research information until the political purge of research articles that disagreed with the administration on key policy issues. The basic response? The NIH, Department of Education, FDA and EPA's responsibility is to promote policy, not provide information to the public. (Although this problem is not limited to the Internet, libraries that were public archives for government documents were ordered to pull "sensitive" material after 9-11.) In addition there is the problem of upgrading infrastructure. The URL may work today, but what happens when the site moves to a more scalable system?
Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.
I don't know about the journals you read, but 90% of the ones I read are already on the web or archived through a distribution service. (Although another loss to research for politics may be ERIC, which in education has been a source for many interesting "minor" papers and conference proceedings.)
The real value of journals has never been print publication, but in the peer-review process. The reason why citations in professional journals carry more weight is because the reader knows that the article had to have run the gauntlet of critical reviews from expert peers.
Now, granted, web page citations should probably be treated on the level of personal correspondence rather than as authoritative sources. But to say that web-based resources move or vanish because they lose their relevance is missing a major flaw in how the web works. One professional organization I'm a member of tottered on the edge of bankruptcy for about a year. If it had gone under, web access to some of the key works in the field would have vanished overnight, and the works themselves dumped into a copyright limbo.
Can't they print them? (Score:2, Insightful)
Why not just print it out?
Not only are web pages transient, but the facts they have are subject to change. This gets back to your "pseudo-science and mis-information" comment.
If you're going to use it in your work, print a copy or save an image of it or something.
Which brings us to "fair use" and copyrights and all kinds of other crap.
Re:Well, (Score:5, Insightful)
If you like that, you might like the books by the historian Fernand Braudel. Rather than the "kings and battles" of most histories, he focuses on how very simple things (the foods people ate, the weather, and so on), long-term trends, and the emergent properties of their interactions over decades or centuries are responsible for shaping the course of history.
History Palimpsest (Score:2, Insightful)
Re:Worst Record Keeping (Score:2, Insightful)
Alexandria had this problem licked... (Score:1, Insightful)
problem can easily be improved with some thought (Score:5, Insightful)
Non-transparent CGI, PHP and ASP scripts are even worse; they tend to change all the time. Instead they should be using the "path info", or live in the server (mod_perl, etc.)
Example: "http://science.slashdot.org/article/03/11/24/127
The idea that the basic job of a webserver is to pull files off your disk is incomplete: its job ought to be to take your URL through *any* kind of query lookup, which might map to the filesystem and might not. The HTTP RFCs imply this as well.
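A minimal sketch of that idea as a WSGI application in Python; the route and article store are invented examples, the point being that the URL names the information and the server decides how to look it up, whether or not a file by that name exists on disk:

```python
# Sketch: treat the URL path as a query into the site's content rather than
# a file name, so links survive changes in the underlying implementation.
# The route and article store below are invented examples.
from wsgiref.simple_server import make_server

ARTICLES = {
    "/articles/2003/11/24/something-happened-today": "Something happened today",
}


def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    body = ARTICLES.get(path)
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"No such article"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]


if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()
```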
reed
Re:Well, (Score:3, Insightful)
I wouldn't rule it out. There are people who are working very hard now to drag us all back into a new era of ignorance and superstition. Can they succeed? Maybe not, but things were pretty wide-open in the 20s, and then look what happened!
The web can hold insight, in the right field - Backup (Score:1, Insightful)
You might want to make certain you have it RAID'ed. I had TWO IBM Deskstars die in the same time period. What a pain to recover what I could. And I believe that Google could fall under the same provisions as a Library.
Re:The main difference being... (Score:3, Insightful)
In 5000 years archeologists will learn so much about us from blogs & archives of
Re:I do think that. (Score:4, Insightful)
The real item of importance is that others have access to what you are citing. They may need or want this for several reasons, such as verifying your claims or gaining more background information. By citing an online resource that is not backed by hard publication (e.g. IEEE offers full-text online articles in addition to print; Slashdot has no periodical that I know of), you may cite something that is gone tomorrow, possibly making your work look suspect. Furthermore, anyone can post pretty much anything they want to the web -- think The Onion [theonion.com].
1984 (Score:1, Insightful)
This is already happening. Read a cnn news story (something controversial or important) and save the text. Come back a couple of hours later-- you will often find changes in the text.
What is truth when there is no proof?
It's whatever they want to tell you.
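As a rough sketch of how one might catch such silent edits (the URL and file name are only examples), save the text of the story once and diff it against a later fetch:

```python
# Sketch: save the text of a story once, then diff it against a later fetch
# to spot silent edits. The URL and snapshot file are examples only.
import difflib
from pathlib import Path
from urllib.request import urlopen

URL = "http://www.cnn.com/2003/example.story.html"  # hypothetical story URL
SNAPSHOT = Path("story_snapshot.txt")


def fetch_text(url):
    return urlopen(url).read().decode("utf-8", errors="replace")


if not SNAPSHOT.exists():
    SNAPSHOT.write_text(fetch_text(URL))       # first visit: save the text
else:
    old = SNAPSHOT.read_text().splitlines()
    new = fetch_text(URL).splitlines()         # later visit: compare
    for line in difflib.unified_diff(old, new, "saved", "current", lineterm=""):
        print(line)
```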
Digital Dark ages (Score:2, Insightful)
Its implications go way beyond web pages, which are just one of the first manifestations of our electronic culture creating records that never touch paper or other more established and permanent media.
Businesses typically only have to archive material for around 7 years legally, although some industries like pharmaceuticals have to preserve data considerably longer. This is fine when records are primarily paper based, with some nice computers to speed our current business along. When records are totally electronic from start to end ("born digital"), we start to have problems, legally and culturally. Some researchers are talking about a digital dark age, in which many of our records today will simply vanish from history, totally inaccessible and unpreserved.
This is about storage, migration and emulation. It's about persistent identifiers. It's about technology obsolescence leading to cultural obsolescence.
Matt Palmer, Digital Preservation Department, UK National Archives.
Even "hard copy" today isn't the same (Score:5, Insightful)
Why?
Current historians learn a lot about each writer's creative process, and how writers evolved their ideas, from drafts and corrections. Music scholars pore over every scratched-out note, every furious scribbled comment, in Beethoven's draft scores. Writing music was laborious and hugely frustrating for Beethoven, unlike Mozart, who hardly stopped to think and made few if any corrections.
Future scholars won't know any of this stuff, looking back at our work. We use software to edit our work... so when we fix our errors they are gone forever. We change our minds and the original idea disappears in a puff of electrons. An electronic score of a Beethoven symphony only differs from a Mozart concerto in the musical style -- all of the other data is gone.
It's a sobering thought. Where else are we going to get this data? Not letters, because we write emails now, and regularly delete them (intentionally or not). Diaries? Some people still keep them on paper... but many store them on computer, or publish them in blogs (which as discussed will mostly be gone).
Sobering thought, isn't it? It's not necessarily hubris to say we ought to be saving more of this stuff; people a few hundred years from now should be able to learn from our failures, as well as our successes.
RTFA... it's about references in scientific papers (Score:5, Insightful)
You don't just say "Frotz and Rumble observed that the freeble-tropic factor was 6.32," you say "Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32." Then, at the end, traditionally, you would put "Frotz, Q. X and Rumble, M (1991): Dilatory freeble-tropism in the edible polka-dotted starfish, Asterias gigantiferus (L) (Echinodermata, Asteroidea), when treated with radioactive magnesium pemoline. J. f. Krankschaft und Gierschift, 221(6):340-347."
Then if someone else wondered about that statement, they'd go to the library and pull down volume 221 of the journal, and see that Frotz and Rumble had only measured that factor on six specimens, using the questionable Rumkohrf assay. If they had more questions, they'd write to Frotz at the address given in the article, asking them whether they remembered to control for the presence of foithbernder residue.
This sort of thing is absolutely essential to the scientific process and makes science self-correcting.
The article says that these days, the papers are published online, the references are URLs, and that an awful lot of them are stale. If so, this cuts to the very heart of the process of scientific scholarship.
But even "reputable" web pages get (re)moved... (Score:3, Insightful)
The problem *is* the short lifespan of web pages. Even "reputable" publications move their pages around, or remove them entirely, breaking all links. I'm talking about major newspapers, scientific journals, etc. It's these people, the supposedly reputable ones, who need to do a better job. The way they're doing things now is indeed, "no way to run a culture."
Re:RTFA... it's about references in scientific pap (Score:3, Insightful)
Recently a colleague of mine published a paper in an online peer-reviewed journal which contained a trivial error (a transposition typo) that would nonetheless change, in fact reverse, the interpretation of the results. They were permitted to fix this, months after the article had first been posted. Does this aid Progress, or is it Revisionist?
Re:Well, (Score:2, Insightful)
I have to disagree. An object which produces such trauma should not be preserved simply because the traumatic experience is shared. I think I have some form of post-traumatic stress disorder lingering from the day I saw the goatse thing - complete with horrifying flashbacks. That thing needs to go.
Why should any aspect of "culture" be preserved simply because it constitutes "culture"? If we preserve everything that we have in common, we will be compulsive hoarders and the people of the earth will soon be living under a heap of obsolete car tires, betamax tapes and floppy disks. When we are done with something, we should let it go.
Re:Even "hard copy" today isn't the same (Score:2, Insightful)
The degree to which drafts of manuscripts and musical scores from earlier periods have survived is already arbitrary, and it will be just as arbitrary in the future: backups will contain the drafts, and some of them will surely survive.
Here's what Tim B-L has to say: (Score:3, Insightful)
A bit over-idealistic, but worth aiming towards even if you don't achieve 100% non-URI-breakage in practice.
I feel that search engines should slightly penalize sites that have a history of breaking links or making them redirect to a completely irrelevant page: partly because there is just less chance that the link you follow from the search engine will have the content you want, and partly because even if you do get to a correct page, its usefulness as a bookmark or a link from your own documents is reduced.
Re:Not everything, but... (Score:3, Insightful)
And one of the design goals of the Xanadu server project was to provide exactly this sort of permanent storage and location-redundant backup. (We even referred to it as the "Library of Alexandria Problem" and named one of the machines after the Alexandrian librarian. B-) )
Unfortunately the project didn't succeed and the web filled the niche.
So now we have a distributed Library of Alexandria, holding the single copy of every "book", constant brushfires taking out important works, and a few "scribes" frantically trying to make copies of the whole thing (which copies, IF they exist, have to be accessed a different way than the original).
(Also coarse-grained (text snippet->page or image) rather than fine-grained (text snippet, image region, or database entry->text snippet, image region, or database entry), one-way links rather than back-followable links, and I could go on...)
The survival of backups (Score:4, Insightful)
There's a weird kind of paradox involved in what will survive, though.
Digital media has that wonderful property that it can be reproduced *perfectly* -- such that the copy is indistinguishable from the original -- but it must be copied or it will die.
You can burn your vacation videos to CD so your grandkids will be able to see them -- but that CD won't be readable anymore in a decade, never mind a century. If you faithfully make sure they're recopied every once in a while, though (and possibly converted to whatever new video formats are invented), your descendants 500 years hence will be able to see you waving from behind that sandcastle in California, as if it were filmed yesterday. No more flipping through yellowed photographs or crumbling newspaper clippings.... Imagine it! A scientist may use your video to prove his point about how the sunsets on the west coast have improved since California sank into the ocean.
He has to use family videos, though, because two decades of scientifically recorded data on weather patterns was all wiped out when a massive electromagnetic bomb was set off by terrorists in 2012.
Yeah, far-fetched example. I don't want to force the point, and definitely lots of stuff will survive... but our progeny won't be making the same kinds of attic discoveries that we can today.
"Hey, viddy all these ancient discs that Old Grampy Limp Devil had cached away up here! Can you run them? Nothing, huh? Oh, well."