Web Pages Are Weak Links in the Chain of Knowledge

Posted by Hemos on Monday November 24, 2003 @10:27AM from the destroying-our-young dept.

PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""

This discussion has been archived. No new comments can be posted.

Well, (Score:5, Interesting)

by jeffkjo1 ( 663413 ) writes: on Monday November 24, 2003 @10:32AM (#7547383) Homepage

Really, is there a reason to archive everything in the world? Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?

100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)

- Re:Well, (Score:5, Insightful)
  
  by fredrikj ( 629833 ) writes: on Monday November 24, 2003 @10:38AM (#7547454) Homepage
  
  100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)
  
  Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing :P
  
  - Re:Well, (Score:5, Insightful)
    
    by operagost ( 62405 ) writes: on Monday November 24, 2003 @10:48AM (#7547535) Homepage Journal
    
    Do you really think goatse will be "disturbing" 100 years from now?
    The day goatse.cx is no longer disturbing, is sure to be the first day of Armageddon ...
    
    - - Re:Well, (Score:4, Funny)
        
        by Blue Stone ( 582566 ) writes: on Monday November 24, 2003 @01:33PM (#7548931) Homepage Journal
        
        Bill Gates DID say that. Here's a link [billsaidthis.com] that proves it.
        
  - Scary thought. (Score:4, Funny)
    
    by Channard ( 693317 ) writes: on Monday November 24, 2003 @11:13AM (#7547683) Journal
    
    Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing :P
    Well, I guess we know what Paul McCartney will be doing on the cover of his next album..
    
  - Re:Well, (Score:3, Insightful)
    
    by drooling-dog ( 189103 ) writes:
    
    Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing
    I wouldn't rule it out. There are people who are working very hard now to drag us all back into a new era of ignorance and superstition. Can they succeed? Maybe not, but things were pretty wide-open in the 20s, and then look what happened!
- Re:Well, (Score:2)
  
  by Xzzy ( 111297 ) writes:
  
  > Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?
  
  SSSSHHH.
  
  Don't you see where this is going? The next obvious step is government installed "web vaults" where people can submit their oh-so-valuable chicken scratchings and they will be stored, under the same URL, for eternity.
  
  No more geoshitties man, we're talking lifetime free webspace for every citizen in the US!
- Re:Well, (Score:5, Insightful)
  
  by GeorgeH ( 5469 ) writes: on Monday November 24, 2003 @10:56AM (#7547581) Homepage Journal
  
  100 years from now, should anyone be forced to accidentally stumble over goatse?
  The fact that you and I can refer to goatse and people know what we're talking about means that it's an important part of our shared culture. I think that anything that archives the good and bad of a culture is worth keeping around.
  
  - Re:Well, (Score:5, Funny)
    
    by NanoGator ( 522640 ) writes: on Monday November 24, 2003 @01:05PM (#7548700) Homepage Journal
    
    "The fact that you and I can refer to goatse and people know what we're talking about means that it's an important part of our shared culture."
    
    Amazing that the most remembered asshole of the dawn of the 21st century isn't Michael Eisner or Jack Valenti.
    
- Re:Well, (Score:5, Interesting)
  
  by mlush ( 620447 ) writes: on Monday November 24, 2003 @11:03AM (#7547625)
  
  Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?
  I would be fascinated to see my Great Grandad's first drawings, his school web page, his postings to USENET. I only knew him as an old man ....
  
  To a historian, often the most interesting stuff is the ephemera: the diary of an ordinary person gives a view of everyday life you will never get from 'formal' archives (i.e. newspapers, film libraries, etc.), which only cover 'important' stuff.
  
  - Re:Well, (Score:5, Insightful)
    
    by sql*kitten ( 1359 ) * writes: on Monday November 24, 2003 @11:30AM (#7547810)
    
    To a historian, often the most interesting stuff is the ephemera: the diary of an ordinary person gives a view of everyday life you will never get from 'formal' archives (i.e. newspapers, film libraries, etc.), which only cover 'important' stuff.
    
    If you like that, you might like the books by the historian Fernand Braudel. Rather than the "kings and battles" of most histories, he focusses on how very simple things like the foods people ate, the weather, etc, and the relationships between long-term trends and the emergent properties of those interactions (i.e. over decades or centuries) are responsible for shaping the course of history.
    
  - Even "hard copy" today isn't the same (Score:5, Insightful)
    
    by jtheory ( 626492 ) writes: on Monday November 24, 2003 @12:57PM (#7548625) Homepage Journal
    
    I read an interesting article a few years ago about how even our hard copy (books, magazines, musical scores, etc.) won't be nearly as useful to future historians.
    
    Why?
    
    Current historians learn a lot about each writer's creative process, and how writers evolved their ideas, from drafts and corrections. Music scholars pore over every scratched-out note, every furious scribbled comment, in Beethoven's draft scores. Writing music was laborious and hugely frustrating for Beethoven, unlike Mozart, who hardly stopped to think and made few if any corrections.
    
    Future scholars won't know any of this stuff, looking back at our work. We use software to edit our work... so when we fix our errors they are gone forever. We change our minds and the original idea disappears in a puff of electrons. An electronic score of a Beethoven symphony only differs from a Mozart concerto in the musical style -- all of the other data is gone.
    
    It's a sobering thought. Where else are we going to get this data? Not letters, because we write emails now, and regularly delete them (intentionally or not). Diaries? Some people still keep them on paper... but many store them on computer, or publish them in blogs (which as discussed will mostly be gone).
    
    Sobering thought, isn't it? It's not necessarily hubris to say we ought to be saving more of this stuff; people a few hundred years from now should be able to learn from our failures, as well as our successes.
    
    - - The survival of backups (Score:4, Insightful)
        
        by jtheory ( 626492 ) writes: on Tuesday November 25, 2003 @12:28AM (#7554903) Homepage Journal
        
        Backups will contain the drafts in the future. Some of them will surely survive.
        
        There's a weird kind of paradox involved in what will survive, though.
        
        Digital media has that wonderful property that it can be reproduced *perfectly* -- such that the copy is indistinguishable from the original -- but it must be copied or it will die.
        
        You can burn your vacation videos to CD so your grandkids will be able to see them -- but that CD won't be readable anymore in a decade, never mind a century. If you faithfully make sure they're recopied every once in a while, though (and possibly converted to whatever new video formats are invented), your descendants 500 years hence will be able to see you waving from behind that sandcastle in California, as if it were filmed yesterday. No more flipping through yellowed photographs or crumbling newspaper clippings.... Imagine it! A scientist may use your video to prove his point about how the sunsets on the west coast have improved since California sank into the ocean.
        
        He has to use family videos, though, because two decades of scientifically-recorded data on weather patterns was all wiped out when a massive electromagnetic bomb was set off by terrorists in 2012.
        
        Yeah, far-fetched example. I don't want to force the point, and definitely lots of stuff will survive... but our progeny won't be making the same kinds of attic discoveries that we can today.
        
        "Hey, viddy all these ancient discs that Old Grampy Limp Devil had cached away up here! Can you run them? Nothing, huh? Oh, well."
        
- Re:Well, (Score:5, Interesting)
  
  by 4of12 ( 97621 ) writes: on Monday November 24, 2003 @11:22AM (#7547754) Homepage Journal
  
  Really, is there a reason to archive everything in the world?
  
  No, only the good stuff needs to be saved. So what's good and who should save it?
  
  IMHO, anything that gets officially referenced by another work should be saved.
  
  That burden should not fall upon the original creator of the referenced work; it should fall upon the creator of the referring work.
  
  Despite all the hue and cry about lost revenue opportunities from controlled distribution of copyrighted information, knowledge preservation and the overall benefit to society would improve if works were able to save a local cache of referenced works.
  
  This would also help with the problem of morphing or revisionist works. Some works can be improved by editing (something around here comes to mind), but it would be inappropriate to change old web pages that show an earlier mistake in thinking, to show that somehow someone was particularly prescient, or to erase knowledge for a political agenda (a la Stalin).
  
  Just a couple of days ago I was able to retrieve an old recipe from the Google cache that had been summarily removed from a web site due to some time retention policy. An attempt to encourage repeat visits to the website because stuff disappears was circumvented. I would have been particularly annoyed with that website were it not for the delayed action of the Google cache. Google may have enabled circumvention of the site's policy, but the site would have garnered a lot more ill will from me if its policy had been effective.
  
  Guess what? References in scientific papers I write are not just available in libraries capable of paying $1K/year subscription rates, but as photocopies in my file cabinet. That is, I have a local cache of referenced works already.
  
  If a colleague's library did not have the specified volume and journal article, I would let him have a copy for the asking. It's a copyright violation, I know, but I'm not convinced that strict adherence to copyright laws in this case provides the best overall benefit to society.
  
- Not everything, but... (Score:5, Interesting)
  
  by FunkyRat ( 36011 ) writes: <funkyrat&gmail,com> on Monday November 24, 2003 @11:27AM (#7547787) Journal
  
  This is a real problem. When Vannevar Bush [theatlantic.com] conceived the Memex system, his goal was to facilitate the exchange of scientific research. Later, Doug Engelbart [virginia.edu] built on Bush's ideas, as did Ted Nelson [virginia.edu] (the guy who coined the term "hypertext") and Tim Berners-Lee [w3.org]. While the web today has become a vast sinkhole of pop-up ads, crappy web stores and inane blogs, it is important not to forget that its inception was in aiding scientific research.
  
  Yet that is not possible without some kind of permanence. Probably what is needed is some way to integrate the web into university library collections. If there were some way of indexing web pages the way libraries currently use the Library of Congress scheme to index their physical collections, then web pages could be uniquely numbered, with this number incorporated into the URL. If universities and the Library of Congress itself were then to mirror these pages permanently, and the original URL became unavailable, one could try just about any major university or the LOC and retrieve the page. Of course, with the current political climate here in the US I don't foresee this ever happening.
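
A minimal sketch of the mirror-fallback idea in this comment, in Python. Everything specific here is an assumption made for illustration: the mirror host names, the /catalog/<number> URL layout, the sample catalog number, and the `fetch` callable (a stand-in for a real HTTP client that returns None when a page is unavailable). It only shows the lookup order: original URL first, then each mirror.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical mirror hosts standing in for universities and the Library of Congress.
MIRROR_HOSTS = ["library.university.example", "mirror.loc.example"]


def mirror_urls(catalog_number: str, hosts: Sequence[str]) -> List[str]:
    """Build one candidate URL per mirror from the page's catalog number."""
    # The /catalog/<number> path layout is an assumption of this sketch.
    return [f"https://{host}/catalog/{catalog_number}" for host in hosts]


def fetch_with_fallback(
    original_url: str,
    catalog_number: str,
    fetch: Callable[[str], Optional[bytes]],
    hosts: Sequence[str] = MIRROR_HOSTS,
) -> Optional[Tuple[str, bytes]]:
    """Try the original URL first, then each mirror; return the first hit."""
    for url in [original_url] + mirror_urls(catalog_number, hosts):
        body = fetch(url)  # None means the page is unavailable at this URL
        if body is not None:
            return url, body
    return None  # neither the original nor any mirror has the page


# In-memory stand-in for the network: the original page is gone, one mirror has it.
web = {"https://mirror.loc.example/catalog/QA76.9-0001": b"archived page"}
result = fetch_with_fallback("https://gone.example/paper.html", "QA76.9-0001", web.get)
print(result)
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real client would also need timeouts and some check that the mirrored copy is the one that was cited.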
  
  - Re:Not everything, but... (Score:3, Insightful)
    
    by Ungrounded Lightning ( 62228 ) writes:
    
    This is a real problem. When Vannevar Bush conceived the Memex system, his goal was to facilitate the exchange of scientific research. Later, Doug Engelbart built on Bush's ideas, as did Ted Nelson (the guy who coined the term "hypertext") and Tim Berners-Lee.
    
    And one of the design goals of the Xanadu server project was to provide exactly this sort of permanent storage and location-redundant backup. (We even referred to it as the "Library of Alexandria Problem" and named one of the machines after the Alexan
- RTFA... it's about references in scientific papers (Score:5, Insightful)
  
  by dpbsmith ( 263124 ) writes: on Monday November 24, 2003 @01:31PM (#7548919) Homepage
  
  The article is not about archiving "everything in the world." It's specifically about references in scholarly papers, which, for the past three or four centuries, have been part of the essential fabric of scientific research. In a research paper, everything you say is either supposed to be the result of your own direct observation, or backed by a traceable, verifiable, and critiquable authority.
  
  You don't just say "Frotz and Rumble observed that the freeble-tropic factor was 6.32," you say "Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32." Then, at the end, traditionally, you would put "Frotz, Q. X and Rumble, M (1991): Dilatory freeble-tropism in the edible polka-dotted starfish, Asterias gigantiferus (L) (Echinodermata, Asteroidea), when treated with radioactive magnesium pemoline. J. f. Krankschaft und Gierschift, 221(6):340-347."
  
  Then if someone else wondered about that statement, they'd go to the library and pull down volume 221 of the journal, and see that Frotz and Rumble had only measured that factor on six specimens, using the questionable Rumkohrf assay. If they had more questions, they'd write to Frotz at the address given in the article, asking them whether they remembered to control for the presence of foithbernder residue.
  
  This sort of thing is absolutely essential to the scientific process and makes science self-correcting.
  
  The article says that these days, the papers are published online, the references are URLs, and that an awful lot of them are stale. If so, this cuts to the very heart of the process of scientific scholarship.
  
  - Re:RTFA... it's about references in scientific pap (Score:3, Insightful)
    
    by squidfood ( 149212 ) writes:
    
    This sort of thing is absolutely essential to the scientific process and makes science self-correcting.
    Recently a colleague of mine published a paper in an online peer-reviewed journal which contained a trivial error (transposition typo) that however would change, in fact reverse, the interpretation results. They were permitted to fix this, months after the article had first been posted. Does this aid Progress, or is it Revisionist?
"This is no way to run a culture." (Score:2, Flamebait)

by Cokelee ( 585232 ) writes:

This is no way to run a culture.

Tell the RIAA that.

Music is a part of our culture.
Books have an ISBN... (Score:5, Interesting)

by Advocadus Diaboli ( 323784 ) writes: on Monday November 24, 2003 @10:32AM (#7547386)

...which means that with that ISBN I can refer to the book and find it at libraries or bookstores. Why don't we set up a sort of unique web page number if articles of interest or knowledge are published there? Then it would be easy to track an article if it's moved to another site or whatever, just by looking up a sort of catalog for these numbers.

- Re:Books have an ISBN... (Score:5, Informative)
  
  by kalidasa ( 577403 ) * writes: on Monday November 24, 2003 @10:36AM (#7547434) Journal
  
  There already is such an identifier. It's called a Universal Resource Identifier, or URI. See Berners-Lee's essay Cool URIs Don't Change [w3.org].
  
  - Berners-Lee considered harmful (Score:5, Insightful)
    
    by 0x0d0a ( 568518 ) writes: on Monday November 24, 2003 @11:17AM (#7547714) Journal
    
    URIs don't provide content-based addressing (like a hash of the document). They rely upon trustworthy name registrars, which is an assumption that might have been valid when Berners-Lee was doing his early work, but is not now. They rely on someone willing to continue hosting the original document -- not necessarily the case.
    
    You can link to an article which is then changed by the original publisher (or someone else). With scientific papers, you can't do that -- and such behavior is probably not desirable.
    
    On the up side, if you're currently using cited references, you should be able to build such a system without too much problem -- follow links to PDFs or automatically crawl HTML documents (and check images) and serve all papers that you refer to with your paper. It'd be big, but it provides better reliability than do current paper schemes.
    
    Another feature that might be useful is signing of the content (assuming RSA doesn't get broken in the future).
    
    Basically, if you put up a SHA-1 (Gnutella), MD4 (eDonkey), or similar reference, you can host the original referred-to documents as well as the original host.
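
A minimal sketch of the content-based addressing described above, in Python. It assumes the citation carries a SHA-1 digest of the cited document, so that any host's copy can be checked against it. The host names and the `hosts` dictionary are hypothetical stand-ins for real servers; only `hashlib.sha1` is a real library call.

```python
import hashlib
from typing import Dict, Optional


def content_address(document: bytes) -> str:
    """Name a document by the hex SHA-1 digest of its bytes."""
    return hashlib.sha1(document).hexdigest()


def fetch_verified(address: str, hosts: Dict[str, bytes]) -> Optional[bytes]:
    """Return the first hosted copy whose digest matches the cited address.

    `hosts` maps a (hypothetical) host name to the bytes it currently serves.
    """
    for body in hosts.values():
        if content_address(body) == address:
            return body
    return None  # every available copy has been changed, or none exists


paper = b"Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32."
address = content_address(paper)  # this digest is what the citing paper would record

hosts = {
    "original-host.example": paper.replace(b"6.32", b"9.99"),  # silently edited later
    "citing-author.example": paper,                            # faithful copy
}
print(fetch_verified(address, hosts) == paper)  # prints True
```

Because the address is derived from the content, it does not matter who serves the copy: an altered document simply fails the check. The digest says nothing about authorship, which is where the signing idea in the comment would come in.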
    
    If Freenet didn't have as a specific drawback the inability of someone to guarantee that a document remains hosted as long as they are willing to host it, Freenet would be a good choice for this.
    
    One possibility is that, with a bit of manual work, one can frequently find an academic work by Googling for its title. At least for now, as long as you host the original papers as well, Google should pick up on this fact. Of course, it does nothing to prevent modification of that paper by another party...
    
    A good system for handling this would be to have a known system that is willing to archive, in perpetuity (probably hosted by the US government or other reasonably stable, trustworthy source [yes, yes, cracks at the US government aside]). This system would act like a Tier 1 NTP server -- it would only grant access to a number of other trusted servers (universities, etc) that mirror it -- perhaps university systems -- which would keep load sane. These servers (or perhaps Tier 3 servers) then provide public access. Questions of whether there would be a hard policy of never removing content or what would be allowed (especially WRT politically controversial content) would have to be answered.
    
    There could be multiple Tier 1 servers that would sync up with each other, and could act as checks in case one server is broken into. I'm partial to the idea of including a signature on each file, but I suppose it isn't really necessary.
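
One way the cross-check between Tier 1 servers might look, sketched in Python under stated assumptions: each server's copy of a document is available as bytes, copies are compared by SHA-256 digest (swapped in here for the SHA-1 mentioned earlier), and a copy is trusted only if a strict majority of servers agree on it. The server names and the majority rule are illustrative choices, not part of the original proposal.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def find_outliers(copies: Dict[str, bytes]) -> List[str]:
    """Return the servers whose copy differs from the strict-majority copy.

    `copies` maps a (hypothetical) server name to the bytes it holds.
    Raises ValueError when there is no strict majority to compare against.
    """
    if not copies:
        raise ValueError("no copies to compare")
    digests = {
        server: hashlib.sha256(body).hexdigest() for server, body in copies.items()
    }
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        raise ValueError("no strict majority; cannot tell which copies to trust")
    return sorted(s for s, d in digests.items() if d != majority_digest)


copies = {
    "tier1-a.example": b"original paper text",
    "tier1-b.example": b"original paper text",
    "tier1-c.example": b"tampered paper text",  # e.g. a server that was broken into
}
print(find_outliers(copies))
```

A majority vote only helps while most servers are intact; signing each file, as the comment suggests, would cover the case where an attacker reaches most of them.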
    
    Specific formats could be required to ensure that these papers are readable for all time. Project Gutenberg went with straight ASCII. This would probably have to be slightly more elaborate. Microsoft Word and PDF might not be good choices, and international support would be necessary.
    
- Re:Books have an ISBN... (Score:4, Interesting)
  
  by daddywonka ( 539983 ) writes: on Monday November 24, 2003 @10:38AM (#7547459) Homepage
  
  Why don't we set up a sort of unique web page number if articles of interest or knowledge are published there?
  
  The article mentions this: "One such system, known as DOI (for digital object identifier), assigns a virtual but permanent bar code of sorts to participating Web pages. Even if the page moves to a new URL address, it can always be found via its unique DOI."
  
  But it seems that these current systems must use "registration agencies" to act as the gatekeeper of the unique ID.
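
A toy sketch in Python of why such a scheme needs a gatekeeper. This is not the real DOI system or its API; the `Registry` class, its method names, and the sample identifier are all hypothetical. It only illustrates the idea from the quoted passage: the identifier stays fixed while the registry's record of the current URL is updated when the page moves.

```python
from typing import Dict, Optional


class Registry:
    """Hypothetical registration agency: maps permanent IDs to current URLs."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        """Assign a new identifier; refuse duplicates so IDs stay unique."""
        if identifier in self._locations:
            raise ValueError(f"identifier already registered: {identifier}")
        self._locations[identifier] = url

    def move(self, identifier: str, new_url: str) -> None:
        """Record that an already-registered page now lives at a new URL."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier: str) -> Optional[str]:
        """Return the current URL for an identifier, or None if unknown."""
        return self._locations.get(identifier)


registry = Registry()
registry.register("10.9999/example.1", "https://old-host.example/paper.html")
registry.move("10.9999/example.1", "https://new-host.example/archive/paper.html")
print(registry.resolve("10.9999/example.1"))
```

The uniqueness check in `register` and the update in `move` are exactly the steps that someone has to be trusted to perform, which is the gatekeeper role noted above.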
  
  - Re:Books have an ISBN... (Score:2)
    
    by Waffle Iron ( 339739 ) writes:
    
    But it seems that these current systems must use "registration agencies" to act as the gatekeeper of the unique ID.
    Why not just embed an off-the-shelf GUID in the header of the document? That doesn't require any central authority.
    The <A> tag could be enhanced with a "guid" attribute. If a browser gets a "page not found" error on a link, it could automatically submit the GUID in the link to Google or some other search service to look for the current location.
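
A minimal Python sketch of that fallback, with loud assumptions: `fetch` is a hypothetical stand-in for the browser's page request (returning None on "page not found"), and `guid_index` is a plain dictionary standing in for Google or another search service that has indexed the GUIDs embedded in documents. Only `uuid.uuid4` is a real library call.

```python
import uuid
from typing import Callable, Dict, Optional


def new_guid() -> str:
    """Generate a random (version 4) GUID; no central authority is involved."""
    return str(uuid.uuid4())


def follow_link(
    href: str,
    guid: str,
    fetch: Callable[[str], Optional[bytes]],
    guid_index: Dict[str, str],
) -> Optional[str]:
    """Return the URL that currently serves the document, or None."""
    if fetch(href) is not None:
        return href  # the link still works; nothing to do
    # "Page not found": ask the search service where this GUID lives now.
    return guid_index.get(guid)


guid = new_guid()
live_pages = {"https://new-host.example/paper.html": b"<html>...</html>"}
guid_index = {guid: "https://new-host.example/paper.html"}  # built by a crawler

current = follow_link("https://old-host.example/paper.html", guid, live_pages.get, guid_index)
print(current)  # prints https://new-host.example/paper.html
```

Nothing here stops a different page from claiming the same GUID, so a real deployment would still need some way to decide which of several matches is the genuine document.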
- Re:Books have an ISBN..(but web pages are googled) (Score:5, Insightful)
  
  by WillAdams ( 45638 ) writes: on Monday November 24, 2003 @10:40AM (#7547476) Homepage
  
  That was why Tim Berners-Lee wanted URL to stand for ``Universal'' (not Uniform) Resource Locator.
  
  The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or not pages like this are worth preserving is another issue entirely).
  
  Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?
  
  With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).
  
  Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get. ;)
  
  William
  
- - Re:Books have an ISBN... (Score:3, Funny)
    
    by NickFitz ( 5849 ) writes:
    
    We could call it an Intellectual Property Address, or IP Address for short.
then don't look for culture in web pages... (Score:5, Interesting)

by TechnoVooDooDaddy ( 470187 ) writes: on Monday November 24, 2003 @10:33AM (#7547392) Homepage

honestly, the transient nature of webpages makes it an unsuitable medium for the long term establishment of "culture". our categorization-happy, buzzword-ridden nature will have to find a new term for what the web is. boo-freaking-hoo.. meanwhile i'll keep doing my thing, posting pics for my family to see, putting calendar events up on the web so my homebrew-club will know when we're meeting, and not worrying about any "culture" i might be potentially creating then destroying when i take stuff back down.

man i need coffee, insomnia is a bitch...

- Re:then don't look for culture in web pages... (Score:2)
  
  by Urkki ( 668283 ) writes:
  
  The problem isn't some random personal web page that perhaps 2 people ever looked at disappearing. The problem is that web pages actually referenced by others are disappearing, thus breaking the big web of knowledge that has been forming for as long as we've had the printing press.
  
  There really should be a permanent way of storing web pages, and storing them at the state they were at one given moment of time. So the archiving would naturally be the responsibility of the referrer.
  
  We just need a web service for that. I
  - Re:then don't look for culture in web pages... (Score:2, Interesting)
    
    by Araneas ( 175181 ) writes:
    
    "There really should be a permanent way of storing web pages, and storing them at the state they were at one given moment of time."
    Teach browsers to speak CVS.
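
A small Python sketch of what "the state at one given moment" could mean in practice. It does not use CVS; it swaps in a plain dated snapshot store, and the `SnapshotStore` class, its method names, and the sample URL are all hypothetical. The point it illustrates is the lookup: given a URL and a citation date, return the version that was current on that date.

```python
import bisect
from datetime import date
from typing import Dict, List, Optional, Tuple


class SnapshotStore:
    """Hypothetical archive keeping dated copies of each URL."""

    def __init__(self) -> None:
        self._dates: Dict[str, List[date]] = {}           # url -> sorted snapshot dates
        self._bodies: Dict[Tuple[str, date], bytes] = {}  # (url, date) -> page content

    def save(self, url: str, when: date, body: bytes) -> None:
        """Store a snapshot; saving the same URL and date again overwrites it."""
        if (url, when) not in self._bodies:
            bisect.insort(self._dates.setdefault(url, []), when)
        self._bodies[(url, when)] = body

    def as_of(self, url: str, when: date) -> Optional[bytes]:
        """Return the latest snapshot taken on or before `when`, if any."""
        dates = self._dates.get(url, [])
        i = bisect.bisect_right(dates, when)
        if i == 0:
            return None  # nothing was archived that early
        return self._bodies[(url, dates[i - 1])]


store = SnapshotStore()
url = "https://journal.example/article.html"
store.save(url, date(2003, 1, 1), b"version 1")
store.save(url, date(2003, 6, 1), b"version 2 (typo corrected)")

print(store.as_of(url, date(2003, 3, 1)))    # the version a March citation saw
print(store.as_of(url, date(2003, 11, 24)))  # the version current in November
```

A citation would then need to carry both the URL and the date it was consulted, which many citation styles already ask for.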
- Re:then don't look for culture in web pages... (Score:3, Informative)
  
  by YU Nicks NE Way ( 129084 ) writes:
  
  Even if your statement accurately reflected the concerns in the article, it would still be misguided.
  
  Historians are concerned about all the ephemera of a civilization, not just the "official" ones. The random archives of everyday junk can, and often do, tell a very different story about the civilization than the story that the society would like to hear about itself, so historians treasure those postings of pics for your family to see.
  
  For example, if you read the official press, you'd see a lot of articl
Don't do that. (Score:5, Insightful)

by Valar ( 167606 ) writes: on Monday November 24, 2003 @10:33AM (#7547395)

You probably shouldn't be quoting any kind of "Bob's World of Great Scientific Insight" type pages anyway. I mean, the majority of sites that go under in less than 100 days are the one person operations that one should identify as bad sources anyway. So it might seem obvious that quoting someone's blog in a research paper is just a plain stupid idea, but it happens way more often than you might think.

- The web can hold insight, in the right field (Score:4, Interesting)
  
  by mactari ( 220786 ) writes: <rufwork@g m a il.com> on Monday November 24, 2003 @10:47AM (#7547526) Homepage
  
  That's a fairly reductionist view if taken too far. Not all researchers are tech whizzes (no pun intended), and I've seen a number of, in my case, professors of English Literature who run the same sort of, "Throw up ten pages with Under Construction signs, test publish a few papers, and let the site sit for years, one day to mysteriously disappear," web site lifespan that "Bob's World" might as well.
  
  Perhaps even more interestingly, it doesn't always really matter if you've done great, repeatable research in the "soft science" fields or outright humanities. You don't have to be a literature expert to have a good insight on "Bartleby the Scrivener". A grad student's blog, as an example, might contain excellent contributions to the conversation.
  
  Now that said, in the context of the article -- dealing with "a dermatologist with the Veterans Affairs Medical Center in Denver" -- I would tend to agree with you heartily. Hard science needs to pull, in my layman's view, from research that the article's author researched well enough to see that it wasn't a few 0's and 1's that might be pulled later, in general.
  
  And heck, what's the harm in saving the pages on your drive and contacting the original author if they disappear? Hard drive space is cheap. If you take yourself seriously, you might want to grab a snap, even if it is technically illegal (not that I know that it is; Google seems to do it right often).
  
Throwing out the baby with the bathwater (Score:5, Insightful)

by Liselle ( 684663 ) * writes: <slashdot@@@liselle...net> on Monday November 24, 2003 @10:34AM (#7547404) Journal

People are worried about losing the information on the web: but all that is really happening is that the URLs are no good after a while, you lose the snapshot. The information is not necessarily going anywhere. If there is a need or a want, someone will throw it up, or another will host it. That's the beauty of the web, you get the good with the bad, but time has a way of getting rid of the chaff.

What would be interesting would be a website that archives those snapshots for posterity. Well, what do you know, there are several such sites already! Looks like we're in good shape. The sky is not falling. ;)

Reliability (Score:5, Interesting)

by lukewarmfusion ( 726141 ) writes: on Monday November 24, 2003 @10:34AM (#7547405) Homepage Journal

It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.

I'm a recent grad of a University... my freshman year, profs wanted us to start using the Internet more so we were asked to submit at least x number of references from Internet sources. By my senior year, they were trying to get us to stop using the Internet. Using a URL as a reference was sometimes forbidden by the professor.

- But even "reputable" web pages get (re)moved... (Score:3, Insightful)
  
  by aquarian ( 134728 ) writes:
  
  It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.
  
  The problem *is* the short lifespan of web pages. Even "reputable" publications move their pages around, or remove them entirely, breaking all links. I'm talking about major newspapers, scientific journals, et
The final irony? (Score:2, Interesting)

by the real darkskye ( 723822 ) writes:

That matters in part because some documents exist only as Web pages -- for example, the British government's dossier on Iraqi weapons.
"It only appeared on the Web," Worlock said. "There is no definitive reference where future historians might find it." Much like the WMDs themselves then ...
Rigidity stifles creativity (Score:5, Insightful)

by apsmith ( 17989 ) * writes: on Monday November 24, 2003 @10:36AM (#7547416) Homepage

Any extra effort required to make web pages and their URLs preserved for eternity makes it more difficult for people to create them in the first place, which will mean less knowledge available, not more. Something unobtrusive that goes around preserving pages for posterity, like the Internet Archive [archive.org], is the best solution.

Hardcopy (Score:5, Insightful)

by Overzeetop ( 214511 ) writes: on Monday November 24, 2003 @10:36AM (#7547424) Journal

This is why every time I use a web reference I make a hardcopy of it and include it in my research folder. It did not take long for me to figure out that web pages are no more useful than manufacturer catalogs - once the year is up, you might never get that tidbit of information back. If it's too large to want to print, I'll hardcopy the couple of pages I need, and PDF the whole thing for digital storage.

Having a hardcopy (1) documents the information and its (purported) source, and (2) allows offline access for comparison and validation.

- Re:Hardcopy (Score:2, Insightful)
  
  by lukewarmfusion ( 726141 ) writes:
  
  One problem with using a hard copy is that you're the only one holding that copy. If the site disappears from the Internet, then your readers must rely on your printout (or cache) as a reliable source. You may not have a way to prove that your printout wasn't modified between download and printout. With more traditional methods, there are so many printed copies that such a claim could be disputed easily. I think your solution is the best one under the current situation, though.
Don't forget the damage done by censorship! (Score:2)

by Jerry ( 6400 ) writes:

I was recently looking for pages about the peer review work of the global warming paper underlying the KYOTO Doctrine. Pages less than a month old were removed. Articles on ABC, Time, CNN and newspaper sites by the hundreds have 'old' pages missing.

There is no substitute for the printed page... yet.
Let me get this straight... (Score:3, Informative)

by ericspinder ( 146776 ) writes: on Monday November 24, 2003 @10:36AM (#7547433) Journal

You mean to tell me that those researchers found a dead link on the Internet? The horror. Where can I get one of those jobs?

Another study, published in January, found that 40 percent to 50 percent of the URLs referenced in articles in two computing journals were inaccessible within four years

That's because they were ads for companies that went out of business.
Besides, if you want to see old pages, just go to the Wayback Machine [waybackmachine.org]. Between that and backup tapes, everything you ever wrote still lives (in many cases I wish it didn't!).

Yes, big issue! (Score:5, Interesting)

by Erwos ( 553607 ) writes: on Monday November 24, 2003 @10:37AM (#7547440)

I've personally been working (internally so far) on a website of modern-day Orthodox-Jewish responsa to various issues of Jewish law, so this is an issue I've given some thought to.

To say this is some kind of problem specific to the web is misleading. There are old, well-quoted sources of Jewish thought whose texts are simply lost to us in this current day and age. Example: a famous and extremely popular commentary on the Talmud and Torah, Rashi, is missing for at least a few chapters of Talmud. That would be the equivalent of IEEE misplacing some standards papers and then NO ONE having copies, just lost to the sands of time. Yet it did happen, proving this at least _was_ a serious issue.

However, these days, with such things as the Way-Back Machine and Google caching, actually LOSING entire web pages doesn't happen very often, and, I'd bet, it happens far less frequently than the loss of books.

-Erwos

Interesting... (Score:5, Funny)

by Rinikusu ( 28164 ) writes: on Monday November 24, 2003 @10:39AM (#7547468)

I found that out years ago.. :P

From a researcher's perspective, I used the web primarily as a quick "google" to get some ideas on where I might do further research. For instance, while a particular paper may have been taken offline regarding my search, many times the search will proffer an author's name. Take that name to the library's database (or google it, too), and you might get a list of more publications that the author has penned. Even better: sometimes, you can get a valid email address from other links and you can write and ask the original researcher himself about various publications; many times they have copies on hand and can send them to you. My research involves the web, but does not end with the web, which is where many people find themselves hung.

Hey, guys. See that big building with those obsolete books? Lots of chicks hang out there. :)

And? (Score:2)

by woodhouse ( 625329 ) writes:

I don't see how this is news. Most people who write science papers are well aware of the problems with citing web pages, and we'll try to cite books and published papers wherever possible. Generally, people with something important to say will publish it properly, so this is not usually a problem.

The only people who exclusively cite web pages are likely to be the same people who write bad papers anyway, so I can't see the issue here.
A problem recognized already some time ago.... (Score:5, Interesting)

by tsvk ( 624784 ) writes: on Monday November 24, 2003 @10:40AM (#7547475)

Usability expert Jakob Nielsen addressed the issue of linkrot in a column back in 1998: Fighting Linkrot [useit.com].

What's the problem here ? (Score:5, Insightful)

by JackJudge ( 679488 ) writes: on Monday November 24, 2003 @10:42AM (#7547489) Journal

Why would we want to archive 99.9% of today's web content ?
Does anyone archive CB radio traffic ??

It's not a permanent storage medium, never could be, too many points of failure between your screen
and the server holding the data.

- Re:What's the problem here ? (Score:5, Interesting)
  
  by southpolesammy ( 150094 ) writes: on Monday November 24, 2003 @10:58AM (#7547597) Journal
  
  Yes, good point. The Internet is much more akin to CB radio since it is uncontrolled, unverified, entirely volunteer-based, entirely virtual, and highly volatile. By contrast, books, TV, and other media are highly controlled, subject to external verification, have a high cost of entry, are either themselves physical media or require a physical presence in order to communicate, and are largely static in content.
  
  The problem with the Washington Post's article is that their premise is flawed. They assume that the Internet is a mostly static source of information, when it is definitely a mostly dynamic information source. Webpages are meant to be updated, and with updates come change. It's inevitable. To assume that we keep every update to the webpages in separate locations is a false assumption. It's cool to see sites like the Wayback machine do this, but it's not required.
  
- Re:What's the problem here ? (Score:2)
  
  by sporty ( 27564 ) writes:
  
  There's a difference. CB traffic is usually casual conversation. People create websites to give out "important" information to a large spectrum.
  
  One has an active listener with active feedback. One is completely passive.
  
  Only in the case of an interview would CB traffic be completely informational. I don't remember the last time I put up a web page to say something to someone. People do put up "web applications" such as forums and chatrooms of sorts.
  
  While they are of the same fruit, they are still app
Backup Your Important Data (Score:4, Insightful)

by Slider451 ( 514881 ) writes: <slider451&hotmail,com> on Monday November 24, 2003 @10:42AM (#7547490)

Anything worth publishing digitally should be recorded in a more permanent medium.

I constantly backup all my digital photos because they are important to me. I also print the best ones for placing in photo albums, distributing to friends, etc.

The website they are published to is just a delivery medium, and not even the primary one. It can disappear and I wouldn't care. People who know me can always get access to them. Scientists should view their work the same way.

long-term storage needs... (Score:2, Insightful)

by mwilliamson ( 672411 ) writes:

This is not just a problem with Web pages, it is a problem with all popular media formats today. How can we make sure future generations will be able to make use of any of our media? (makes me think of a buddy's magneto-optical drive...who the hell else has one) One solution is to actively copy from format to format as technologies change, but this requires constant upkeep throughout the ages. Relying on future generations to maintain our most precious information is not a responsible behavior for a cul
Permalinking and archiving (Score:5, Insightful)

by seldolivaw ( 179178 ) * writes: <me@seldUMLAUTo.com minus punct> on Monday November 24, 2003 @10:44AM (#7547508) Homepage

The ephemeral nature of the web is a very real problem, but it's important not to overstate it. The reason so much more information is lost these days is partly a reflection of the fact that we produce so much more of it. The Library of Alexandria was the distilled knowledge of an entire civilisation; it was unique, irreplaceable and massively important information. The web is full of information that is of low quality, often massively redundant (thousands of pages explain the same thing in different ways) and certainly replaceable (the web is not the final repository of the information: it's a temporary place where that information is published). In the same way, for centuries, newspapers have produced thousands of redundant issues with a lifetime of just a few days. The reason no one decries the loss of our newspapers is because the publishers themselves still archive the information, even if this is somewhat hard to get to. The same is true of web pages, only the number of publishers is vastly larger.

Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contain valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.

But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.

So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final year project at uni) is to include an additional piece of metadata (by whatever mechanism you prefer) when publishing pages; data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User-agents (browsers) when bookmarking this information could make a note of that meta-data, and provide the option to bookmark the information, rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location to discover the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.

Of course, this requires changes at the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, to where the information will be archived or a pointer to that information will be stored. This has the advantage of requiring only changes to the client-side.
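
For what it's worth, the simpler client-side variant can be sketched in a few lines of Python. This is only an illustration under assumptions: the meta name "archive" and the archive URL below are made up for the example and are not part of any standard. A user agent would read the tag when bookmarking and fall back to it once the original location has changed.

```python
from html.parser import HTMLParser


class ArchiveMetaParser(HTMLParser):
    """Collects the content of the first <meta name="archive" ...> tag."""

    def __init__(self):
        super().__init__()
        self.archive_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # "archive" is a hypothetical meta name chosen for this sketch.
        if attr_map.get("name") == "archive" and self.archive_url is None:
            self.archive_url = attr_map.get("content")


def find_archive_url(html_text):
    """Return the archive URL a page declares, or None if it declares none."""
    parser = ArchiveMetaParser()
    parser.feed(html_text)
    return parser.archive_url


# Hypothetical front page that announces where today's story will be archived.
page = """<html><head>
<meta name="archive" content="http://mysite.com/archive/2003/11/23/something-happened">
<title>mysite news</title>
</head><body>Something happened today.</body></html>"""

print(find_archive_url(page))
```

Only the bookmarking client has to understand the tag; the server just has to keep its promise about where the archived copy ends up.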

Suggestions of better solutions are always welcome :-)

Make sure you have a paper reference. (Score:2)

by Kjella ( 173770 ) writes:

Personally, I find web links *can* be much more efficient than having to dig out an issue of some science journal (which the local library will *not* have, and your request will be forwarded by carrier snails), if they're there.

But, always the paper reference. If it doesn't have one, it'd sure better be a reference to a known professor somewhere, so whoever is interested can dig up a homepage somewhere. If it doesn't even have that, don't use it.

Personally, I haven't found it that difficult to cite articl
Reason for this? (Score:5, Insightful)

by bobthemuse ( 574400 ) writes: on Monday November 24, 2003 @10:45AM (#7547512)

The article states that the average life for a website is 100 days, but wouldn't journals and formal publications (the most often cited documents in research) last longer than the average? Also, is the average skewed because websites are more likely to contain 'current information'? "Average lifetime" is misleading, does this mean the average time the page stays the same, or the average time before the information in the page is unavailable?

If you want to do serious research.... (Score:4, Insightful)

by RobertAG ( 176761 ) writes: on Monday November 24, 2003 @10:46AM (#7547516)

Then DOWNLOAD the pages from your web citations.

For example, a short time ago, I did a white paper on power scavenging sources. About 1/2 the articles I read were HTML or PDF sources. Rather than just citing the URL, I downloaded/saved every online article I referenced. If someone wants the source and cannot find it, I'll just provide it to them. If your paper is going to be read by a number of people, it makes good sense to have those sources on-hand; it never hurts to cover your arse.

Hard drive/Network/Optical space is virtually unlimited, so storage isn't a problem. Paper journals are archived by most libraries, anyway, so until they start archiving technical sources, I'm going to have to do my OWN archiving.

Cool URIs don't change (Score:5, Interesting)

by KjetilK ( 186133 ) writes: <kjetil.kjernsmo@net> on Monday November 24, 2003 @10:49AM (#7547537) Homepage Journal

May I remind everyone to read and understand TimBL's Cool URI's don't change [w3.org]. It's not that hard to design systems where you do not have to change the URI every 100 days, folks.

- Re:Cool URIs don't change (Score:3)
  
  by Reziac ( 43301 ) writes:
  
  He touches on one of my pet peeves: just because you rearrange the site doesn't mean the OLD content simply MUST go away. Web pages don't eat much, and it's not the end of the world if someone finds "outdated" information, so long as the site structure makes it fairly evident where to find current information (such as consistent links to a sitemap or default root page -- after all, how often do you change the name of http://www.mysite.com/ ??) So unless there's some pressing reason not to (like ordering pag
URL + date (Score:3, Insightful)

by More Trouble ( 211162 ) writes: on Monday November 24, 2003 @10:49AM (#7547539)

Proper URL citations include the date. I'm not worried so much about the page being taken down (since it is presumably archived), as much as changing. If you don't record which version you were referring to, the content can change dramatically.
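
A minimal sketch of what such a citation record might look like, in Python. The access date is the part argued for above; the SHA-256 checksum is an extra assumption added here so a reader can tell whether the page still matches what was cited. The URL and page contents are placeholders.

```python
import hashlib
from datetime import date


def make_citation(url, content, accessed):
    """Build a citation record that pins down which version was read."""
    return {
        "url": url,
        "accessed": accessed.isoformat(),
        # Checksum of the exact bytes that were read (an added assumption).
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def still_matches(citation, content):
    """True if the given bytes are the same version that was cited."""
    return hashlib.sha256(content).hexdigest() == citation["sha256"]


original = b"<html><body>Statement as first published.</body></html>"
edited = b"<html><body>Statement, quietly revised.</body></html>"

cite = make_citation("http://www.example.com/statement.html", original, date(2003, 11, 24))
print(cite["accessed"])               # 2003-11-24
print(still_matches(cite, original))  # True
print(still_matches(cite, edited))    # False
```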

:w

- Re:URL + date (Score:3, Interesting)
  
  by StormyMonday ( 163372 ) writes:
  
  Bingo!
  
  I watch a number of political sites; it's amazing how, when Congressman Sludgepump says something stupid, it tends to disappear from his Website with no indication that it has ever changed. Occasionally, it even changes to show that he said the opposite of what was originally there.
  
  Checksums/digital signatures are potentially a solution, but the problem of doing it right can be quite difficult when you include real-world constraints. PDFs are a pain in the arse, but at least you can do a decent ch
can't erase my usenet postings (Score:5, Interesting)

by peter303 ( 12292 ) writes: on Monday November 24, 2003 @10:50AM (#7547545)

I started posting to usenet in the late 1980s. These g*dd*mn things are still on the net. I was less guarded at that time. Everyone *knew* then that, because disk space was so scarce, usenet postings would disappear in 7-14 days.

Signal:Noise (Score:2)

by goldspider ( 445116 ) writes:

With such a low signal:noise ratio on the Web, would you really want to capture everything?
Good record-keeping doesn't necessarily mean keeping everything, just stuff worth keeping.
Blogging Fragments, Like the Ancients (Score:2)

by handy_vandal ( 606174 ) writes:

I collect miscellaneous links on my web site. Over time, I've started adding excerpts along with links. The excerpts help remind me what the link was about, but they also serve another purpose: when the link goes bad, I can use keywords in the excerpt to search for related pages on the web.

Our knowledge of ancient history has proceeded in a similar manner. Much of what we know about, say, pre-socratic philosophers, we know because of references in Aristotle and other later scholars. The original sourc
the problem is bigger (Score:5, Insightful)

by professorhojo ( 686761 ) writes: on Monday November 24, 2003 @10:52AM (#7547565)

it's not simply webpages that are the problem. it's digital storage in toto.

because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.

i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.

yes. websites are disappearing. but there's a greater problem lurking in the background. the cost of preserving this stuff digitally, indefinitely. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)

- Re:the problem is bigger (Score:2)
  
  by tjansen ( 2845 ) * writes:
  
  What happens when the library containing all the valuable information on paper burns down (which happened more than once in history)?
  
  Digital data is somewhat easier to destroy (not much, a match can be enough to destroy a paper library), but it is much easier to copy and backup. It's trivial to make a backup of your library each day and ship it to some other place. Try that with a house full of paper.
  
  The main problem is that too many people are still trying to delete stuff on their disks, even though that
Give and take - it's cultural change, dummy. (Score:5, Insightful)

by 3Suns ( 250606 ) writes: on Monday November 24, 2003 @10:53AM (#7547567) Homepage

Easy come, easy go... here's another cliche: Give and Take. What's great about the web is that it has effectively demolished the barriers to entry in publishing. Everybody and their grandmother has a blog now - you can't compare webpages to magazine articles or newspapers. There's just so much more information being published now that its average lifespan is bound to go down. So what?

Publications that cite [web pages] lose their authorities? Who the hell told you to cite a webpage? Might as well cite a poster you saw downtown. If the webpage is a reputable source in the first place, it'll keep it around permanently. Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.

This is no way to run a culture. Last time I checked, nobody ran our culture... It kinda runs itself. The proliferation of accessible, ephemeral webpages over permanent, privileged paper publications (wah, too many p's!) is a sign that our information culture has moved on into a new era. Liked the old one? Tough! Now information has to maintain its own relevance in order to be permanent... and I for one welcome that change.

- Re:Give and take - it's cultural change, dummy. (Score:4, Insightful)
  
  by kirkjobsluder ( 520465 ) writes: <kirk@@@jobsluder...net> on Monday November 24, 2003 @11:23AM (#7547762) Homepage
  
  Who the hell told you to cite a webpage? Might as well cite a poster you saw downtown. If the webpage is a reputable source in the first place, it'll keep it around permanently.
  
  Not always true. The U.S. Government was a good source for research information until the political purge of research articles that disagreed with the administration on key policy issues. The basic response? The NIH, Department of Education, FDA and EPA's responsibility is to promote policy, not provide information to the public. (Although this problem is not limited to the Internet, libraries that were public archives for government documents were ordered to pull "sensitive" material after 9-11.) In addition there is the problem of upgrading infrastructure. The URL may work today, but what happens when the site moves to a more scalable system?
  
  Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.
  
  I don't know about the journals you read, but 90% of the ones I read are already on the web or archived through a distribution service. (Although another loss to research for politics may be ERIC which in education has been a source for many interesting "minor" papers and conference proceedings.)
  
  The real value of journals has never been print publication, but in the peer-review process. The reason why citations in professional journals carry more weight is because the reader knows that the article had to have run the gauntlet of critical reviews from expert peers.
  
  Now, granted, web page citations should probably be treated on the level of personal correspondence rather than as authoritative source. But to say that web-based resources move or vanish because they lose their relevance is missing a major flaw in how the web works. One professional organization I'm a member of tottered on the edge of bankruptcy for about a year. If it had gone under, web access to some of the key works in the field would have vanished overnight, and the works themselves dumped into a copyright limbo.
  
And the Fizz in My Mntn Dew Is Gone Even Sooner (Score:2)

by RobotRunAmok ( 595286 ) * writes:

But wait, I think I still have a Mosaic presskit from the '91 Comdex. Does that count?

It's not Web Pages, it's the Web itself that will be the cultural artifact. With the bar for publishing on the Internet placed so low, it falls to Father Time to become the Web's ultimate Editor-in-Chief.

On a related note, I'm moving, and came across reams of stuff I wrote while a college student, and boy does it suck! Tonight I light a candle to Neil Gaiman's I-Net God in thanks that my potentially career-wrecking puk
No way to run a culture? (Score:4, Insightful)

by theolein ( 316044 ) writes: on Monday November 24, 2003 @11:05AM (#7547635) Journal

As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."

To the contrary, I think this is highly typical of the culture we have today, where everything is a transient fad in the media, technology and politics.

And it is also self feeding, I think, since market forces need to clear out the old to make room for the new in order to meet sales forecasts and shareholder expectations. And this is very true for pop, news and technology, which explains the lack of staying power of pop icons these days and becomes interesting when you want to ask yourself if you really need that new 3GHz machine just to surf the web.

And it is highly convenient in politics where a politician doesn't have to be accountable for what he said 100 days ago.

And so, the lack of long time life on the web is simply symbolic of all the rest here really, even if it is highly questionable.

genguid and google (Score:3, Insightful)

by hey ( 83763 ) writes: on Monday November 24, 2003 @11:12AM (#7547680) Journal

Use genguid (or other tool) to make a globally unique number
and place that number at the bottom of your
page as a link with Google's "I'm Feeling Lucky"
searching for the GUID.
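
A rough sketch of the idea in Python, using the standard uuid module in place of genguid. The btnI query parameter is assumed to trigger Google's "I'm Feeling Lucky" behaviour and may not be honoured; the footer markup is only illustrative.

```python
import html
import uuid
from urllib.parse import urlencode


def make_page_guid():
    """Generate a random, globally unique identifier for one page."""
    return str(uuid.uuid4())


def lucky_search_url(guid):
    """Build a search URL for the quoted GUID.

    Assumption: btnI=1 asks Google for the "I'm Feeling Lucky" redirect.
    """
    query = urlencode({"q": '"%s"' % guid, "btnI": "1"})
    return "https://www.google.com/search?" + query


def footer_link(guid):
    """HTML snippet to paste at the bottom of the page carrying the GUID."""
    href = html.escape(lucky_search_url(guid), quote=True)
    return '<a href="%s">permanent link: %s</a>' % (href, guid)


guid = make_page_guid()
print(footer_link(guid))
```

The scheme only works once a search engine has indexed the page, and only while the GUID appears on exactly one page.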

How long does the average conversation take? (Score:5, Interesting)

by freality ( 324306 ) writes: on Monday November 24, 2003 @11:13AM (#7547688) Homepage Journal

Webpages aren't replacements for books. Or rather, you shouldn't use them that way.

If they're lasting on average 100 days, that puts them somewhere between transient culture, like spoken conversation, and printed culture, like newspapers. Big deal.

We want to preserve culture for future generations, no doubt. But we don't want to preserve all culture for future generations. Anything that is lasting for 100 days and isn't being persisted... well, relatively that's not worth much to future culture.

I don't remember the exact saying, but there is a Native American saying to the effect of "We don't write things down. If we don't remember it, it's not worth remembering." Now, they're not the last word (no pun intended) in wisdom traditions, but there is a certain amount of enforced vitality necessitated by forgetting the details.

We'd better get used to the idea. We're only going to be forgetting more and more of the details as we generate more and more useless information.

An example of broken down copyright laws (Score:2)

by Jerf ( 17166 ) writes:

This is Yet Another Example of how copyright laws are breaking down. If you're going to cite something academically, should you perhaps have the right of mirroring the content you are citing for the sole purpose of providing a backup if the original goes down, or even just changes?

Copyright law says no, that's copyright infringement.

But copyright law is based on the assumption that a published thing, like a book, is concrete and can't be changed, and can be referred to, forever and ever amen, by the same
- Re:An example of broken down copyright laws (Score:3, Interesting)
  
  by WNight ( 23683 ) writes:
  
  I agree. The point of copyright is mainly to encourage the production of commercial works, to enrich the public domain. It was never intended to force a work to remain out of print.
  
  We need to change copyright law so that it doesn't prevent saving of lost works, and so that it can't be used to force a work to moulder away because it's in someone's best interest that it not be for sale. (For instance, old movies that studios don't want cutting into new movie revenue.)
  
  I'd like to see a short total-rights-res
Blame bad writing (Score:2)

by TyrranzzX ( 617713 ) writes:

All the mass media is owned by 6 major corporations as we already know. In our stimulation-happy culture where sitting on top of a mountain taking in the view isn't appreciated, so too is long complicated writing. Thanks to this media, people are raised to be consumers, and we're stimulated to the point that things like books are so boring that we fall asleep reading them. Why read a book when you can watch a movie? A person who plays FPS games for months on end is so thoroughly stimulated that sitting
Reviewed Content (Score:3, Interesting)

by neglige ( 641101 ) writes: on Monday November 24, 2003 @11:20AM (#7547742)

Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications.

For true scientific work, this should never happen. Because you should only cite reviewed sources. Such as books, articles or conference papers. This is no guarantee for quality, but at least the review process sorts out the most obvious nonsense. And, if the reviewer is good, it may even increase the quality of the work. Plus, those sources are permanent.

As always, there are sources that are more respected (IEEE, ACM etc.) than others. And using respectable sources is a good thing, because normally you want to prove a point and you base your argument on those publications. So if your basis for your argument is faulty... well ;)

Furthermore, there is hardly any information that can be found on the web but not in a reviewed form. Note that there are (accepted) scientific reviewed journals using the web for publishing. Without a printed edition. And you can quote them. And, as many before me have said, the articles and links do not vanish (the URL is usually not quoted anyway - these articles are listed just like printed articles).

This is just my personal opinion on scientific work. Let's see if my head is still on my shoulders tomorrow :)

Others will cite this & the post as proof... (Score:2)

by adzoox ( 615327 ) * writes:

... that nothing can be trusted if it is reported on the web.
That's sort of ridiculous, seeing as the source is sometimes biased itself (Washington Post)
To me, knowledge is only truth if both sides to an extreme are presented. Meaning: one cannot understand Abortion rights unless one hears both sides; ProLife and ProAbortion.
The sentiment that web sources, just because they aren't written without journalistic/legal lingo and because that news isn't from "the usual outlets" (CNN,ABC,NBC,CBS, Time, etc)
Longevity (Score:4, Interesting)

by unfortunateson ( 527551 ) writes: on Monday November 24, 2003 @11:21AM (#7547745) Journal

Maintaining a links page for my wife's business' site has always been a low priority, and finally, I put up a MySQL/PHP page to do the majority of the work.

So I've been going through all the old links, and every link request we've gotten in the business' 7-year history. Of the 120 messages in the timeframe of 1997-1999, only about 15 sites still existed. Of those, two-thirds had forwarded URLs -- often from AOL or Homestead to their own brand. A couple still existed, but had totally different content.

Many just plain didn't exist at all. A fair chunk found the server, but no such page. A few had blank pages or nearly no content. The true annoyance though, is the number of domains that are owned by spamdexers/linkfarms that have no content of their own and beg you to set your homepage to them.

I've still got to cover the rest of 2000-2003 link requests, but I expect that anything pre-2001 will be very sparse.

Archiving web sites (Score:2)

by UncleRoger ( 9456 ) writes:

This problem is the impetus behind ComputerHistory.net [computerhistory.net], a sort of internet archive for computer history web sites.
Site Linking Schemes (Score:3, Interesting)

by Oculus Habent ( 562837 ) * writes: <oculus.habentNO@SPAMgmail.com> on Monday November 24, 2003 @11:26AM (#7547781) Journal

An easy system would be for a server to provide each document it houses with a unique meta-data identifier. Then, when a document, story or paper moves from the "main page" into an archive section, you can still refer to the FileID. This ID should be searchable, so that an article could be linked via something like:

http://www.cnn.com/?2001EXCJA2

The IDs could be system generated and handled by a file system that supports meta-data or they could be designed to mean something and handled by a content management system.
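
As a rough illustration only, the lookup could work something like this in Python. The index contents and the archive paths are hypothetical, and a real server would keep the table in its file system metadata or content management system rather than in a dict.

```python
from urllib.parse import urlsplit

# Hypothetical index: FileID -> wherever the document currently lives.
DOCUMENT_INDEX = {
    "2001EXCJA2": "/2001/front-page/example-story.html",
}


def resolve_file_id(url):
    """Map a URL like http://host/?FILEID to the document's current path."""
    file_id = urlsplit(url).query
    return DOCUMENT_INDEX.get(file_id)


def move_document(file_id, new_path):
    """Record a move; links that use the FileID keep working."""
    DOCUMENT_INDEX[file_id] = new_path


link = "http://www.cnn.com/?2001EXCJA2"
print(resolve_file_id(link))  # /2001/front-page/example-story.html

# The story leaves the main page and goes into the archive section.
move_document("2001EXCJA2", "/archive/2001/example-story.html")
print(resolve_file_id(link))  # /archive/2001/example-story.html
```

The link itself never changes; only the index entry does when the document is moved.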

Implementation is the difficult part. Getting everyone - or at least news sites, magazines, and colleges/universities - to set up FileID searching and then document the linking process on their site is no small task.

Citing URLs is not quite appropriate (yet) (Score:5, Informative)

by c13v3rm0nk3y ( 189767 ) writes: on Monday November 24, 2003 @11:51AM (#7547965) Homepage

Hmmm. I'm not sure most scholarly works are allowed to just cite arbitrary URLs for inline references or footnotes.
The idea is that you generally have to cite peer-reviewed, published and presented articles; criteria which the majority of web published material simply does not satisfy. Web reading would fall under the "course reading", and would have to be backed up by a "real" reference.
According to my GF (currently working on a Masters in Anthropology) there is a lot of confusion on how to use the web for scholarly references. Many people cite URLs in citations that are really just online archives of previously-published work. In this case, noting the URL is like saying which library you checked the article out of, and what shelf it was on. If you are an undergrad and cite a URL, it is almost a sure thing that the prof or the TA's will take marks off for improper citations.
There are a few peer-reviewed journals that are (partly or completely) published online, in which case the URL might be a valid citation. This is likely to change, and it seems the original article was suggesting that we need to handle this case now, before we lose more good work.
In a much smaller way, this is the kind of thing that those involved in the whole blog phenomenon are trying to resolve [xmlrpc.com]; making sure that their blog-rolls, trackbacks and search-engine cached pages stay historically maintainable.

The main difference being... (Score:3, Funny)

by artemis67 ( 93453 ) writes: on Monday November 24, 2003 @11:56AM (#7548019)

The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library.

The main difference being that most of what was in ancient Alexandria's library was considered to be of importance to at least a sizeable group of people, if not the majority, whereas most of the web pages that disappear every day are simply dross.

- Re:The main difference being... (Score:3, Insightful)
  
  by geeklawyer ( 85727 ) writes:
  
  That's not an entirely unreasonable view, however archeologists frequently gain important insights into an ancient culture by looking at dross. Near the burial sites of pharaohs were found carved complaints by workmen about poor conditions. In Greece (I think) notes were found in a ceremonial spot with curses aimed at neighbours and slutty wives. Gossip and tittle-tattle for sure, but quite informative and used to get a feel for the society.
  
  In 5000 years archeologists will learn so much about us from blogs & a
100 days (Score:3, Funny)

by feed_those_kitties ( 606289 ) writes: on Monday November 24, 2003 @12:05PM (#7548121)

Unless it gets /.ed - then its lifespan might be measured in minutes!

problem can easily be improved with some thought (Score:5, Insightful)

by agentk ( 74906 ) writes: on Monday November 24, 2003 @12:09PM (#7548168)

This has been a real problem for a long time. But the web is distributed. The only real solution is for people to realize that moving stuff around all the time breaks links, and avoid it. One thing that would help is a translation layer in the web server, that separates the URL from the server's filesystem. This is basic software engineering common sense.

Non-transparent CGI, PHP and ASP scripts are even worse: they tend to change all the time. Instead they should be using the "path info", or be in the server (mod_perl, etc.)

Example: "http://science.slashdot.org/article/03/11/24/127250" is a much better permanent URL for this story than exposing the details of some perl script called "article.pl" that takes a parameter named "sid", and it will be easier to adapt to all future versions of Slash or other software, or to simply archive as a static file someday. Using the PATH_INFO CGI variable you can make a CGI like "article.pl" use URLs like that above.

The idea that the basic job of a webserver is to pull files off your disk is incomplete: its job ought to be to take your URL through *any* kind of query lookup, which might map to the filesystem and might not. The HTTP RFCs imply this as well.

reed
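
To make the "path info" point concrete, here is a minimal sketch of a handler that reads the story identifier from PATH_INFO instead of from a script name plus an "sid" parameter. It is written as a Python WSGI callable rather than the Perl CGI mentioned above, purely as an illustration. The `STORY_PATH` pattern, the function names and the plain-text response are assumptions, and the yy/mm/dd/number layout is only guessed from the example URL.

```python
import re

# Assumed layout of the path after the mount point, e.g. /03/11/24/127250.
# The pattern and all names here are hypothetical, for illustration only.
STORY_PATH = re.compile(r"^/(\d{2})/(\d{2})/(\d{2})/(\d+)$")


def story_id_from_path_info(path_info):
    """Return a story id such as '03/11/24/127250', or None if no match."""
    match = STORY_PATH.match(path_info)
    if match is None:
        return None
    return "/".join(match.groups())


def application(environ, start_response):
    """WSGI entry point: the URL names the story, not the script behind it."""
    story_id = story_id_from_path_info(environ.get("PATH_INFO", ""))
    if story_id is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"No such story\n"]
    # A real handler would look the story up in a database or an archive of
    # static files here; the public URL stays the same either way.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("Story " + story_id + "\n").encode("utf-8")]
```

Because the lookup happens behind the URL, the same address could later be served from a static file or a different piece of software without any visible change.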

Misleading statistics (Score:5, Interesting)

by Alomex ( 148003 ) writes: on Monday November 24, 2003 @12:24PM (#7548354) Homepage

The article claims that "the average life span of a web page is 100 days". This is a very misleading statistic. What it really means is that the average web page is updated every 100 days, not that the page dies and goes away after 100 days.

Moreover, as you can imagine, authoritative sources (the type that people are likely to quote) are updated much less frequently.

Legal citations and authority of internet sources (Score:5, Informative)

by mtpruitt ( 561752 ) writes: on Monday November 24, 2003 @12:40PM (#7548477) Homepage

Law journals have tried to cope with the proper weight of authority to grant web pages by trying to follow the Blue Book [legalbluebook.com], a citation manual.

The general rule has been that whenever you can find something in print, cite to that, but add an internet cite when either it is available and would make it easier to find, or if it is only available online.

Things that are only available online are surprisingly common in citation. The leading court reporter services (WestLaw and Lexis Nexis) both have cases that aren't "officially" printed, but are available online.

Also, many journal articles will cite to web pages such as a company's official description or press releases.

In general, these citations are treated for their functional purpose and not their form of media -- online cases are grouped (last) with other cases, and information from most web sites is considered a pamphlet or other unofficial publication.

This system seems to deal with the fact that they are ephemera pretty well. The citations really are only used to make a point that is merely illustrative or is easily accessible to legal practitioners.

Here's what Tim B-L has to say: (Score:3, Insightful)

by Ed Avis ( 5917 ) writes: <ed@membled.com> on Monday November 24, 2003 @03:19PM (#7549856) Homepage

Cool URIs don't change [w3.org]

A bit over-idealistic, but worth aiming towards even if you don't achieve 100% non-URI-breakage in practice.

I feel that search engines should slightly penalize sites that have a history of breaking links or making them redirect to a completely irrelevant page: partly because there is just less chance that the link you follow from the search engine will have the content you want, and partly because even if you do get to a correct page, its usefulness as a bookmark or a link from your own documents is reduced.

- Re:Worst Record Keeping (Score:2, Funny)
  
  by klokwise ( 610755 ) writes:
  
  i really hope you have some evidence to back that up.
  - Re:Worst Record Keeping (Score:5, Funny)
    
    by richy freeway ( 623503 ) writes: on Monday November 24, 2003 @10:39AM (#7547467)
    
    I had some evidence to back it up but all the links are long dead ;P
    
- Re:Worst Record Keeping (Score:3, Interesting)
  
  by Urkki ( 668283 ) writes:
  
  Nah. There was a time when only very very few could even read, let alone write, let alone keep any kind of records...
  
  But I get your point. Too bad there are some restrictions on copying the web pages you are referencing...
  
  There should be some service, a bit like google's cache, you could use to store the referenced pages. I submit the page to the service, then provide two links in my own document, one to the original page (which will likely expire eventually) and one to the cached version. I wonder if t
  - - Re:Worst Record Keeping (Score:2)
      
      by NickFitz ( 5849 ) writes:
      
      But unless you were revising your own work to reflect those changes, surely you should continue to reference the old version?
      
      This would be akin to (off the top of my head) citing a reference in the first edition of A Vision [yeatsvision.com] by the poet W B Yeats. As the second edition was a complete rewrite [yeatsvision.com] bearing virtually no similarity in either argument or conclusions to the first, updating one's references to the second edition would not only be undesirable, it would probably be impossible.
- Re:Worst Record Keeping (Score:5, Interesting)
  
  by robslimo ( 587196 ) writes: on Monday November 24, 2003 @10:38AM (#7547449) Homepage Journal
  
  Ummm, maybe only as applies to this topic, which is to say that web pages are a poor place to keep records.
  
  I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications, because the persistence of that resource relies on too many factors (the whim of the webmaster, backups or lack thereof, fiber seeking and grid seeking backhoes, etc).
  
  I think that will all sort itself out and real scientists will continue or return to citing more traditional resources.
  
  What I think is much more disturbing and disruptive is the pseudo-science and mis-information that is overly abundant on the web. Too many web sites, personal and commercial, spout 'facts' in such great detail that they have the appearance of authority. Too often, novice/amateur scientists can be seriously misled by some of the crap that can be found on the web masquerading as 'science'.
  
  - Can't they print them? (Score:2, Insightful)
    
    by khasim ( 1285 ) writes:
    
    I'm amazed that anyone doing a professional article would even think of citing a web page as a web page.
    
    Why not just print it out?
    
    Not only are web pages transient, but the facts they have are subject to change. This gets back to your "pseudo-science and mis-information" comment.
    
    If you're going to use it in your work, print a copy or save an image of it or something.
    
    Which brings us to "fair use" and copyrights and all kinds of other crap.
    - - Re:I do think that. (Score:4, Insightful)
        
        by slimak ( 593319 ) writes: on Monday November 24, 2003 @12:24PM (#7548350)
        
        Yes. Because now you have a copy of the source that you're citing.
        
        The real item of importance is that others have access to what you are citing. They may need/desire this for several reasons, such as verifying your claims and gaining more background information. By citing an online resource that is not backed by hard-publication (e.g. IEEE offers full-text online articles in addition to print, slashdot has no periodical that i know of) you may cite something that is gone tomorrow, possibly making your work look suspect. Furthermore, anyone can post pretty much anything they want to the web -- think the onion [theonion.com].
        
  - Re:Worst Record Keeping (Score:3, Interesting)
    
    by drooling-dog ( 189103 ) writes:
    
    I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications
    I don't necessarily see a problem here, as long as serious academic research is maintained online by trusted, stable parties. That's not demanding any more than we have up to now with a print-based distribution system, since that depends on the continuity of a large network of brick-and-mortar libraries (and associated infrastructure) to function effectively. Imagine ho
- Re:I got your solution right here, people... (Score:2)
  
  by mausmalone ( 594185 ) writes:
  
  helps, but not for those pages which are wholly removed. For example, a few faculty members here have some research posted on their personal sites, but they died. Now their sites will be taken down, and anyone referencing that research is gonna have a hard time getting a copy of it.
- - (whoops, not anonymous-me) (Score:2)
    
    by Ayanami Rei ( 621112 ) writes:
    
    So anyway, yeah, not everyone uses PHP... in fact it's a whole bunch easier to cover up URL mapping issues when you are using CGI than when you have a bunch of static documents.
    
    And if you remove a document and want to keep it that way, there's always:
    Redirect gone /blah/blah/expired.html
    
    I think part of the problem is that lots of people use FTP to maintain sites still, with a unified view of how people will navigate their content and have little appreciation for .htaccess, etc. unless they are trying to i
- archive.org and copyright? (Score:5, Interesting)
  
  by McDutchie ( 151611 ) writes: on Monday November 24, 2003 @10:43AM (#7547500) Homepage
  
  I've started to keep archived copies of webpages instead of links, the next time you want it it's gone. Unfortunately you can't share them like links.
  
  If you can't share them, then how come archive.org can? How come archive.org seems to be above copyright law?
  
  - Re:archive.org and copyright? (Score:5, Interesting)
    
    by Jerf ( 17166 ) writes: on Monday November 24, 2003 @11:44AM (#7547911) Journal
    
    How come archive.org seems to be above copyright law?
    
    Archive.org invokes the DMCA safe harbor provisions [archive.org] (see bottom of that page for the DMCA boilerplate), which are described in Title II of the DMCA [eff.org].
    
    However, you'll find a careful reading of the DMCA reveals that none of the exclusions really quite applies to them; a good lawyer might be able to get them protected but I would bet against them.
    
    Mostly they get by because they will remove content if requested, and nobody who cares cares quite enough to sue them on behalf of "the world" when they are satisfied to have their own content removed. In other words, they are basically OK because nobody cares to sue them. Strictly speaking, archive.org probably is the world's largest copyright violation.
    
    This goes to show that sometimes if you break the law in a big enough way, you can get away with it. ;-)
    
    (Not responsible for the results of any actions based on taking that sentence to heart. For entertainment purposes only. etc.)
    
- Re:Thats why.. (Score:2)
  
  by Urkki ( 668283 ) writes:
  
  I think you're in violation of copyright law! Please stand still and wait for a strike team from local lawyer station to arrive and arrest you, while their research team finds out whose copyright you're infringing upon, ie who should get 10% of the profit of suing you.
- Re:web pages as knowledge (Score:4, Funny)
  
  by theMerovingian ( 722983 ) writes: on Monday November 24, 2003 @10:51AM (#7547557) Journal
  
  I definitely wouldn't trust someone named "Horny Smurf" enough to click the link.
  
- Re:DSPACE (Score:4, Interesting)
  
  by tomknight ( 190939 ) writes: on Monday November 24, 2003 @10:57AM (#7547595) Journal
  
  Bugger, forgot to log in.
  Look at DSpace [mit.edu], the mission of which is "To create and establish an electronic system that captures, preserves and communicates the intellectual output of MIT's faculty and researchers."
  Each data set (collection) has a handle [handle.net], supposedly longer lasting than URNs. We're talking about long term data storage here.
  There's an implementation [cam.ac.uk] of it at Cambridge University, and my organisation will be evaluating it as soon as the SuSE Linux Enterprise Server software lands on my desk and I've installed my server.
  Tom.
  

