Web Pages Are Weak Links in the Chain of Knowledge
Posted by
Hemos
on Mon Nov 24, 2003 09:27 AM
from the destroying-our-young dept.
PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""
Related Stories
Technology: We're In Danger of Losing Our Memories 398 comments
Hugh Pickens writes "The chief executive of the British Library, Lynne Brindley, says that our cultural heritage is at risk as the Internet evolves and technologies become obsolete, and that historians and citizens face a 'black hole' in the knowledge base of the 21st century unless urgent action is taken to preserve websites and other digital records. For example, when Barack Obama was inaugurated as US president last week, all traces of George W. Bush disappeared from the White House website. There were more than 150 websites relating to the 2000 Olympics in Sydney that vanished instantly at the end of the games and are now stored only by the National Library of Australia. 'If websites continue to disappear in the same way as those on President Bush and the Sydney Olympics... the memory of the nation disappears too,' says Brindley. The library plans to create a comprehensive archive of material from the 8M .uk domain websites, and also is organizing a collecting and archiving project for the London 2012 Olympics. 'The task of capturing our online intellectual heritage and preserving it for the long term falls, quite rightly, to the same libraries and archives that have over centuries systematically collected books, periodicals, newspapers, and recordings...'" Over the years we've discussed various aspects of this archiving problem.
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Well, (Score:5, Interesting)
100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)
Re:Well, (Score:5, Insightful)
Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing
Parent
Re:Well, (Score:5, Insightful)
Do you really think goatse will be "disturbing" 100 years from now?
The day goatse.cx is no longer disturbing is sure to be the first day of Armageddon.
Parent
Scary thought. (Score:4, Funny)
Well, I guess we know what Paul McCartney will be doing on the cover of his next album..
Parent
Re:Well, (Score:5, Insightful)
Parent
Re:Well, (Score:5, Funny)
Amazing that the most remembered asshole of the dawn of the 21st century isn't Michael Eisner or Jack Valenti.
Parent
Re:Well, (Score:5, Interesting)
I would be fascinated to see my Great Grandad's first drawings, his school web page, his postings to USENET. I only knew him as an old man ....
To a historian, often the most interesting stuff is the ephemera: the diary of an ordinary person gives a view of everyday life you will never get from 'formal' archives (i.e. newspapers, film libraries, etc.), which only cover 'important' stuff.
Parent
Re:Well, (Score:5, Insightful)
If you like that, you might like the books by the historian Fernand Braudel. Rather than the "kings and battles" of most histories, he focusses on how very simple things like the foods people ate, the weather, etc, and the relationships between long-term trends and the emergent properties of those interactions (i.e. over decades or centuries) are responsible for shaping the course of history.
Parent
Even "hard copy" today isn't the same (Score:5, Insightful)
Why?
Current historians learn a lot about each writer's creative process, and how writers evolved their ideas, from drafts and corrections. Music scholars pore over every scratched-out note, every furious scribbled comment, in Beethoven's draft scores. Writing music was laborious and hugely frustrating for Beethoven, unlike Mozart, who hardly stopped to think and made few if any corrections.
Future scholars won't know any of this stuff, looking back at our work. We use software to edit our work... so when we fix our errors they are gone forever. We change our minds and the original idea disappears in a puff of electrons. An electronic score of a Beethoven symphony only differs from a Mozart concerto in the musical style -- all of the other data is gone.
It's a sobering thought. Where else are we going to get this data? Not letters, because we write emails now, and regularly delete them (intentionally or not). Diaries? Some people still keep them on paper... but many store them on computer, or publish them in blogs (which as discussed will mostly be gone).
Sobering thought, isn't it? It's not necessarily hubris to say we ought to be saving more of this stuff; people a few hundred years from now should be able to learn from our failures, as well as our successes.
Parent
Re:Well, (Score:5, Interesting)
Really, is there a reason to archive everything in the world?
No, only the good stuff needs to be saved. So what's good and who should save it?
IMHO, anything that gets officially referenced by another work should be saved.
That burden should not fall upon the original creator of the referenced work; it should fall upon the creator of the referring work.
Despite all the hue and cry about lost revenue opportunities from controlled distribution of copyrighted information, knowledge preservation and the overall benefit to society would improve if works were able to save a local cache of referenced works.
This would also help with the problem of morphing or revisionist works. Some works can be improved by editing (something around here comes to mind), but it would be inappropriate to change old web pages that show an earlier mistake in thinking, to show that somehow someone was particularly prescient, or to erase knowledge for a political agenda (a la Stalin).
Just a couple of days ago I was able to retrieve an old recipe from the Google cache that had been summarily removed from a web site due to some time retention policy. An attempt to encourage repeat visits to the website by making stuff disappear was circumvented. I would have been particularly annoyed with that website were it not for the delayed action of the Google cache. Google may have enabled circumvention of their policy, but they would have garnered a lot more ill will from me if their policy were effective.
Guess what? References in scientific papers I write are not just available in libraries capable of paying $1K/year subscription rates, but as photocopies in my file cabinet. That is, I have a local cache of referenced works already.
If a colleague's library did not have the specified volume and journal article, I would let him have a copy for the asking. It's a copyright violation, I know, but I'm not convinced that strict adherence to copyright laws in this case provides the best overall benefit to society.
Parent
Not everything, but... (Score:5, Interesting)
This is a real problem. When Vannevar Bush [theatlantic.com] conceived the Memex system, his goal was to facilitate the exchange of scientific research. Later, Doug Engelbart [virginia.edu] built on Bush's ideas as did Ted Nelson [virginia.edu] (the guy who coined the term "hypertext") and Tim Berners-Lee [w3.org]. While the web today has become a vast sinkhole of pop-up ads, crappy web stores and inane blogs, it is important not to forget that its inception was in aiding scientific research.
Yet, that is not possible without some kind of permanence. Probably what is needed is some way to integrate the web into university library collections. If there was some way of indexing web pages the way libraries currently use the Library of Congress scheme to index their physical collections, then web pages could be uniquely numbered with this number incorporated into the URL. If universities and the Library of Congress itself were then to mirror (permanently) these pages, and the original URL were to become unavailable, one could try just about any major university or the LOC and retrieve the page. Of course, with the current political climate here in the US I don't foresee this ever happening.
Parent
RTFA... it's about references in scientific papers (Score:5, Insightful)
You don't just say "Frotz and Rumble observed that the freeble-tropic factor was 6.32," you say "Frotz and Rumble (1991) observed that the freeble-tropic factor was 6.32." Then, at the end, traditionally, you would put "Frotz, Q. X and Rumble, M (1991): Dilatory freeble-tropism in the edible polka-dotted starfish, Asterias gigantiferus (L) (Echinodermata, Asteroidea), when treated with radioactive magnesium pemoline. J. f. Krankschaft und Gierschift, 221(6):340-347."
Then if someone else wondered about that statement, they'd go to the library and pull down volume 221 of the journal, and see that Frotz and Rumble had only measured that factor on six specimens, using the questionable Rumkohrf assay. If they had more questions, they'd write to Frotz at the address given in the article, asking them whether they remembered to control for the presence of foithbernder residue.
This sort of thing is absolutely essential to the scientific process and makes science self-correcting.
The article says that these days, the papers are published online, the references are URLs, and that an awful lot of them are stale. If so, this cuts to the very heart of the process of scientific scholarship.
Parent
Books have an ISBN... (Score:5, Interesting)
Re:Books have an ISBN... (Score:5, Informative)
Parent
Berners-Lee considered harmful (Score:5, Insightful)
You can link to an article which is then changed by the original publisher (or someone else). With scientific papers, you can't do that -- and such behavior is probably not desirable.
On the up side, if you're currently using cited references, you should be able to build such a system without too much problem -- follow links to PDFs or automatically crawl HTML documents (and check images) and serve all papers that you refer to with your paper. It'd be big, but it provides better reliability than do current paper schemes.
Another feature that might be useful is signing of the content (assuming RSA doesn't get broken in the future).
Basically, if you put up a SHA-1 (Gnutella), MD4 (eDonkey), or similar reference, you can host the original referred-to documents as well as the original host.
If Freenet didn't have as a specific drawback the inability of someone to guarantee that a document remains hosted as long as they are willing to host it, Freenet would be a good choice for this.
One possibility is that, with a bit of manual work, one can frequently find an academic work by Googling for its title. At least for now, as long as you host the original papers as well, Google should pick up on this fact. Of course, it does nothing to prevent modification of that paper by another party...
A good system for handling this would be to have a known system that is willing to archive, in perpetuity (probably hosted by the US government or other reasonably stable, trustworthy source [yes, yes, cracks at the US government aside]). This system would act like a Tier 1 NTP server -- it would only grant access to a number of other trusted servers (universities, etc) that mirror it -- perhaps university systems -- which would keep load sane. These servers (or perhaps Tier 3 servers) then provide public access. Questions of whether there would be a hard policy of never removing content or what would be allowed (especially WRT politically controversial content) would have to be answered.
There could be multiple Tier 1 servers that would sync up with each other, and could act as checks in case one server is broken into. I'm partial to the idea of including a signature on each file, but I suppose it isn't really necessary.
Specific formats could be required to ensure that these papers are readable for all time. Project Gutenberg went with straight ASCII. This would probably have to be slightly more elaborate. Microsoft Word and PDF might not be good choices, and international support would be necessary.
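The content-hash idea in this comment (cite a SHA-1 alongside the paper so that anyone may host a copy) can be sketched roughly as follows. This is only an illustration, not how Gnutella or eDonkey actually implement it; the function names `content_id` and `verify_copy` and the sample text are made up for the example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a hex SHA-1 digest that identifies the bytes, not their location."""
    return hashlib.sha1(data).hexdigest()


def verify_copy(data: bytes, expected_id: str) -> bool:
    """True if a copy fetched from any mirror matches the digest in the citation."""
    return content_id(data) == expected_id


# Hypothetical cited document; in practice these would be the bytes of the paper.
original = b"Frotz and Rumble (1991): the freeble-tropic factor was 6.32"
cited_digest = content_id(original)

# A mirror serving the same bytes verifies; a silently edited copy does not.
print(verify_copy(original, cited_digest))
print(verify_copy(original.replace(b"6.32", b"7.00"), cited_digest))
```

A digest only shows that a copy is unmodified. It says nothing about who wrote the document, which is where the signing suggestion above would come in.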
Parent
Re:Books have an ISBN... (Score:4, Interesting)
Why don't we set up a sort of unique web page number if articles of interest or knowledge are published there?
The article mentions this: "One such system, known as DOI (for digital object identifier), assigns a virtual but permanent bar code of sorts to participating Web pages. Even if the page moves to a new URL address, it can always be found via its unique DOI."
But it seems that these current systems must use "registration agencies" to act as the gatekeeper of the unique ID.
Parent
Re:Books have an ISBN..(but web pages are googled) (Score:5, Insightful)
The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or not pages like this are worth preserving is another issue entirely).
Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?
With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).
Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get.
William
Parent
then don't look for culture in web pages... (Score:5, Interesting)
man i need coffee, insomnia is a bitch...
Don't do that. (Score:5, Insightful)
The web can hold insight, in the right field (Score:4, Interesting)
Perhaps even more interestingly, it doesn't always really matter if you've done great, repeatable research in the "soft science" fields or outright humanities. You don't have to be a literature expert to have a good insight on "Bartleby the Scrivener". A grad student's blog, as an example, might contain excellent contributions to the conversation.
Now that said, in the context of the article -- dealing with "a dermatologist with the Veterans Affairs Medical Center in Denver" -- I would tend to agree with you heartily. Hard science needs to pull, in my layman's view, from research that the article's author researched well enough to see that it wasn't a few 0's and 1's that might be pulled later, in general.
And heck, what's the harm in saving the pages on your drive and contacting the original author if they disappear? Hard drive space is cheap. If you take yourself seriously, you might want to grab a snap, even if it is technically illegal (not that I know that it is; Google seems to do it right often).
Parent
Throwing out the baby with the bathwater (Score:5, Insightful)
What would be interesting would be a website that archives those snapshots for posterity. Well, what do you know, there are several such sites already! Looks like we're in good shape. The sky is not falling.
Reliability (Score:5, Interesting)
I'm a recent grad of a University... my freshman year, profs wanted us to start using the Internet more so we were asked to submit at least x number of references from Internet sources. By my senior year, they were trying to get us to stop using the Internet. Using a URL as a reference was sometimes forbidden by the professor.
Rigidity stifles creativity (Score:5, Insightful)
Hardcopy (Score:5, Insightful)
Having a hardcopy (1) documents the information and its (purported) source, and (2) allows offline access for comparison and validation.
Yes, big issue! (Score:5, Interesting)
To say this is some kind of problem specific to the web is misleading. There are old, well-quoted sources of Jewish thought whose texts are simply lost to us in this current day and age. Example: a famous and extremely popular commentary on the Talmud and Torah, Rashi, is missing for at least a few chapters of Talmud. That would be the equivalent of IEEE misplacing some standards papers and then NO ONE having copies, just lost to the sands of time. Yet it did happen, proving this at least _was_ a serious issue.
However, these days, with such things as the Way-Back Machine and Google caching, actually LOSING entire web pages doesn't happen very often, and, I'd bet, it happens far less frequently than the loss of books.
-Erwos
Interesting... (Score:5, Funny)
From a researcher's perspective, I used the web primarily as a quick "google" to get some ideas on where I might do further research. For instance, while a particular paper relevant to my search may have been taken offline, many times the search will proffer an author's name. Take that name to the library's database (or google it, too), and you may be able to get a list of more publications that the author has penned. Even better: sometimes you can get a valid email address from other links and write to ask the original researcher himself about various publications; many times they have copies on hand and can send them to you. My research involves the web, but does not end with the web, which is where many people find themselves hung.
Hey, guys. See that big building with those obsolete books? Lots of chicks hang out there.
A problem recognized already some time ago.... (Score:5, Interesting)
Usability expert Jakob Nielsen addressed the issue of linkrot in a column already in 1998: Fighting Linkrot [useit.com].
What's the problem here ? (Score:5, Insightful)
Does anyone archive CB radio traffic ??
It's not a permanent storage medium, never could be, too many points of failure between your screen and the server holding the data.
Re:What's the problem here ? (Score:5, Interesting)
The problem with the Washington Post's article is that their premise is flawed. They assume that the Internet is a mostly static source of information, when it is definitely a mostly dynamic information source. Webpages are meant to be updated, and with updates come change. It's inevitable. To assume that we keep every update to the webpages in separate locations is a false assumption. It's cool to see sites like the Wayback machine do this, but it's not required.
Parent
Backup Your Important Data (Score:4, Insightful)
I constantly backup all my digital photos because they are important to me. I also print the best ones for placing in photo albums, distributing to friends, etc.
The website they are published to is just a delivery medium, and not even the primary one. It can disappear and I wouldn't care. People who know me can always get access to them. Scientists should view their work the same way.
Permalinking and archiving (Score:5, Insightful)
Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contain valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.
But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc.. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.
So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final year project at uni) is to include an additional piece of metadata (by whatever mechanism you prefer) when publishing pages: data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User-agents (browsers) when bookmarking this information could make a note of that meta-data, and provide the option to bookmark the information, rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location to discover the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.
Of course, this requires changes at the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, to where the information will be archived or a pointer to that information will be stored. This has the advantage of requiring only changes to the client-side.
Suggestions of better solutions are always welcome
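For the simpler client-side variant described above, here is a minimal sketch of how a user agent might pick up an "archive" pointer from a page. The `<meta name="archive">` convention, the class and function names, and the example URL are all assumptions made for illustration; no such standard tag is implied by the comment.

```python
from html.parser import HTMLParser
from typing import Optional


class ArchiveLinkParser(HTMLParser):
    """Collects the first <meta name="archive" content="..."> value (hypothetical tag)."""

    def __init__(self) -> None:
        super().__init__()
        self.archive_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and keep the first archive pointer found.
        if tag != "meta" or self.archive_url is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "archive":
            self.archive_url = attr.get("content")


def find_archive_url(html: str) -> Optional[str]:
    """Return the archive URL advertised by the page, or None if there is none."""
    parser = ArchiveLinkParser()
    parser.feed(html)
    return parser.archive_url


page = (
    "<html><head>"
    '<meta name="archive" content="http://mysite.com/archive/2003/11/23/story.html">'
    "</head><body>Something happened today</body></html>"
)
print(find_archive_url(page))
```

A browser could store this value next to the bookmark and fall back to it when the front page no longer carries the story.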
Reason for this? (Score:5, Insightful)
If you want to do serious research.... (Score:4, Insightful)
For example, a short time ago, I did a white paper on power scavenging sources. About 1/2 the articles I read were HTML or PDF sources. Rather than just citing the URL, I downloaded/saved every online article I referenced. If someone wants the source and cannot find it, I'll just provide it to them. If your paper is going to be read by a number of people, it makes good sense to have those sources on-hand; it never hurts to cover your arse.
Hard drive/Network/Optical space is virtually unlimited, so storage isn't a problem. Paper journals are archived by most libraries, anyway, so until they start archiving technical sources, I'm going to have to do my OWN archiving.
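The "do my OWN archiving" habit could look something like the sketch below: every source you cite gets saved locally with a small manifest recording where it came from. The function name `cache_source`, the directory layout and the manifest format are assumptions for the example, and the download step is left out (the bytes are passed in).

```python
import hashlib
import json
import time
from pathlib import Path


def cache_source(cache_dir: Path, url: str, body: bytes) -> Path:
    """Store a copy of a cited source and record its URL, digest and save date."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Name the copy after its SHA-1 so identical documents are stored once.
    digest = hashlib.sha1(body).hexdigest()
    copy_path = cache_dir / digest
    copy_path.write_bytes(body)

    # Keep a URL -> metadata manifest next to the copies.
    manifest_path = cache_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[url] = {"sha1": digest, "saved": time.strftime("%Y-%m-%d")}
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return copy_path
```

If someone later asks for a source whose URL has gone stale, the manifest tells you which local file to send them.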
Cool URIs don't change (Score:5, Interesting)
cant erase my usenet postings (Score:5, Interesting)
the problem is bigger (Score:5, Insightful)
because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.
i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.
yes. websites are disappearing. but there's a greater problem lurking in the background. the cost of preserving this stuff digitally, indefinitely. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)
Give and take - it's cultural change, dummy. (Score:5, Insightful)
Publications that cite [web pages] lose their authorities? Who the hell told you to cite a webpage? Might as well cite a poster you saw downtown. If the webpage is a reputable source in the first place, it'll keep it around permanently. Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.
This is no way to run a culture. Last time I checked, nobody ran our culture... It kinda runs itself. The proliferation of accessible, ephemeral webpages over permanent, privileged paper publications (wah, too many p's!) is a sign that our information culture has moved on into a new era. Liked the old one? Tough! Now information has to maintain its own relevance in order to be permanent... and I for one welcome that change.
No way to run a culture? (Score:4, Insightful)
To the contrary, I think this is highly typical of the culture we have today, where everything is a transient fad in the media, technology and politics.
And it is also self feeding, I think, since market forces need to clear out the old to make room for the new in order to meet sales forecasts and shareholder expectations. And this is very true for pop, news and technology, which explains the lack of staying power of pop icons these days and becomes interesting when you want to ask yourself if you really need that new 3GHz machine just to surf the web.
And it is highly convenient in politics where a politician doesn't have to be accountable for what he said 100 days ago.
And so, the lack of long time life on the web is simply symbolic of all the rest here really, even if it is highly questionable.
How long does the average conversation take? (Score:5, Interesting)
If they're lasting on average 100 days, that puts them somewhere between transient culture, like spoken conversation, and printed culture, like newspapers. Big deal.
We want to preserve culture for future generations, no doubt. But we don't want to preserve all culture for future generations. Anything that is lasting for 100 days and isn't being persisted... well, relatively that's not worth much to future culture.
I don't remember the exact saying, but there is a Native American saying to the effect of "We don't write things down. If we don't remember it, it's not worth remembering." Now, they're not the last word (no pun intended) in wisdom traditions, but there is a certain amount of enforced vitality necessitated by forgetting the details.
We'd better get used to the idea. We're only going to be forgetting more and more of the details as we generate more and more useless information.
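To put a rough number on what a 100-day average means, here is a back-of-the-envelope sketch. It assumes page lifetimes follow an exponential distribution with a mean of 100 days; the article only gives the average, so the distribution (and the function name) is purely an assumption for illustration.

```python
import math


def surviving_fraction(days: float, mean_lifespan_days: float = 100.0) -> float:
    """Fraction of pages still alive after `days`, assuming exponential lifetimes."""
    return math.exp(-days / mean_lifespan_days)


for days in (30, 100, 365):
    print(f"after {days:3d} days: {surviving_fraction(days):.1%} of pages remain")
```

Under that assumption roughly 37% of pages outlive the 100-day average and under 3% last a year. Real lifetimes are probably much more lopsided than this model suggests.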
Longevity (Score:4, Interesting)
So I've been going through all the old links, and every link request we've gotten in the business' 7-year history. Of the 120 messages in the timeframe of 1997-1999, only about 15 sites still existed. Of those, two-thirds had forwarded URLs -- often from AOL or Homestead to their own brand. A couple still existed, but had totally different content.
Many just plain didn't exist at all. A fair chunk found the server, but no such page. A few had blank pages or nearly no content. The true annoyance though, is the number of domains that are owned by spamdexers/linkfarms that have no content of their own and beg you to set your homepage to them.
I've still got to cover the rest of 2000-2003 link requests, but I expect that anything pre-2001 will be very sparse.
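For anyone who wants those figures as rates, the arithmetic is below. The inputs are the approximate counts quoted above ("about 15" of 120, "two-thirds" forwarded); the variable names are just for the example.

```python
link_requests = 120      # messages from 1997-1999
still_exist = 15         # "only about 15 sites still existed"

survival_rate = still_exist / link_requests
forwarded = round(still_exist * 2 / 3)   # "two-thirds had forwarded URLs"
unmoved = still_exist - forwarded

print(f"survived: {survival_rate:.1%}")                  # 12.5%
print(f"forwarded: {forwarded}, same URL: {unmoved}")    # 10 and 5
```

So only about one link in eight still led anywhere, and only about one in twenty-four still worked at its original address.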
Citing URLs is not quite appropriate (yet) (Score:5, Informative)
Hmmm. I'm not sure most scholarly works are allowed to just cite arbitrary URLs for inline references or footnotes.
The idea is that you generally have to cite peer-reviewed, published and presented articles; criteria which the majority of web published material simply does not satisfy. Web reading would fall under the "course reading", and would have to be backed up by a "real" reference.
According to my GF (currently working on a Masters in Anthropology) there is a lot of confusion on how to use the web for scholarly references. Many people cite URLs in citations that are really just online archives of previously-published work. In this case, noting the URL is like saying which library you checked the article out of, and what shelf it was on. If you are an undergrad and cite a URL, it is almost a sure thing that the prof or the TAs will take marks off for improper citations.
There are a few peer-reviewed journals that are (partly or completely) published online, in which case the URL might be a valid citation. This is likely to change, and it seems the original article was suggesting that we need to handle this case now, before we lose more good work.
In a much smaller way, this is the kind of thing that those involved in the whole blog phenomenon are trying to resolve [xmlrpc.com]; making sure that their blog-rolls, trackbacks and search-engine cached pages stay historically maintainable.
problem can easily be improved with some thought (Score:5, Insightful)
Non-transparent CGI, PHP and ASP scripts are even worse, they tend to change all the time. Instead they should be using the "path info", or be in the server (mod_perl, etc.)
Example: "http://science.slashdot.org/article/03/11/24/127
The idea that the basic job of a webserver is to pull files off your disk is incomplete: its job ought to be to take your URL through *any* kind of query lookup, which might map to the filesystem and might not. The HTTP RFCs imply this as well.
reed
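A toy version of "take your URL through any kind of query lookup" might look like this. It is only a sketch: the `Resolver` class, the `/article` prefix and the handler are invented for the example and are not how Slashdot or any particular server does it.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class Resolver:
    """Maps stable URL prefixes to handlers; nothing here touches the filesystem."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def register(self, prefix: str, handler: Handler) -> None:
        self._routes[prefix.rstrip("/")] = handler

    def resolve(self, path: str) -> Optional[str]:
        # The longest matching prefix wins; the remainder is the "path info".
        for prefix in sorted(self._routes, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                path_info = path[len(prefix):].lstrip("/")
                return self._routes[prefix](path_info)
        return None


def article_handler(path_info: str) -> str:
    # Could be a database query, a script, or a file read; the URL stays the same.
    return f"article looked up by key {path_info!r}"


resolver = Resolver()
resolver.register("/article", article_handler)
print(resolver.resolve("/article/03/11/24"))
print(resolver.resolve("/missing"))
```

Because the public URL names the information rather than the script or file behind it, the backend can be swapped out without breaking anyone's links.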
Misleading statistics (Score:5, Interesting)
Moreover, as you can imagine, authoritative sources (the type that people are likely to quote) are updated much less frequently.
Legal citations and authority of internet sources (Score:5, Informative)
Law journals have tried to cope with the proper weight of authority to grant web pages by trying to follow the Blue Book [legalbluebook.com], a citation manual.
The general rule has been that whenever you can find something in print, cite to that, but add an internet cite when either it is available and would make it easier to find, or if it is only available online.
Things that are only available online are surprisingly common in citation. The leading court reporter services (WestLaw and Lexis Nexis) both have cases that aren't "officially" printed, but are available online.
Also, many journal articles will cite to web pages such as a company's official description or press releases.
In general, these citations are treated for their functional purpose and not their form of media -- online cases are grouped (last) with other cases, and information from most web sites is considered a pamphlet or other unofficial publication.
This system seems to deal with the fact that they are ephemera pretty well. The citations really are only used to make a point that is merely illustrative or is easily accessible to legal practitioners.
Re:Worst Record Keeping (Score:5, Interesting)
I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications, because the persistence of that resource relies on too many factors (the whim of the webmaster, backups or lack thereof, fiber seeking and grid seeking backhoes, etc).
I think that will all sort itself out and real scientists will continue or return to citing more traditional resources.
What I think is much more disturbing and disruptive is the pseudo-science and misinformation that is overly abundant on the web. Too many web sites, personal and commercial, spout 'facts' in such great detail that they have the appearance of authority. Too often, novice/amateur scientists can be seriously misled by some of the crap that can be found on the web masquerading as 'science'.
Parent
Re:Worst Record Keeping (Score:5, Funny)
Parent
archive.org and copyright? (Score:5, Interesting)
Parent
Re:archive.org and copyright? (Score:5, Interesting)
Archive.org invokes the DMCA safe harbor provisions [archive.org] (see bottom of that page for the DMCA boilerplate), which is described in Title II of the DMCA [eff.org].
However, you'll find a careful reading of the DMCA reveals that none of the exclusions really quite applies to them; a good lawyer might be able to get them protected but I would bet against them.
Mostly they get by because they will remove content if requested, and nobody who cares cares quite enough to sue them on behalf of "the world" when they are satisfied to have their own content removed. In other words, they are basically OK because nobody cares to sue them. Strictly speaking, archive.org probably is the world's largest copyright violation.
This goes to show that sometimes if you break the law in a big enough way, you can get away with it.
(Not responsible for the results of any actions based on taking that sentence to heart. For entertainment purposes only. etc.)
Parent
Re:web pages as knowledge (Score:4, Funny)
Parent
Re:DSPACE (Score:4, Interesting)
Look at DSpace [mit.edu], the mission of which is "To create and establish an electronic system that captures, preserves and communicates the intellectual output of MIT's faculty and researchers."
Each data set (collection) has a handle [handle.net], supposedly longer lasting than URNs. We're talking about long term data storage here.
There's an implementation [cam.ac.uk] of it at Cambridge University, and my organisation will be evaluating it as soon as the SuSE Linux Enterprise Server software lands on my desk and I've installed my server.
Tom.
Parent