
Computing PageRank on your PC?
An anonymous reader writes "A group of CS researchers of the University of Milan has found a way to compress web graphs at 3 bits per link, and to access them in compressed form. They provide data sets representing real snapshots of portions of the web with one hundred million nodes and 1 billion links. You just need some bandwidth to download a few hundred megabytes of data, and you can compute PageRank with your PC. All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!"
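The summary does not say how the compression works. As a rough illustration of why a few bits per link is plausible at all, one widely used idea is to sort each page's list of link targets and store only the gaps between consecutive node IDs with a variable-length code. The sketch below uses Elias gamma codes for the gaps; the function names are hypothetical and this is not claimed to be the researchers' actual format. For scale, 1 billion links at 3 bits each is about 375 MB, which fits the "few hundred megabytes" in the summary.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code for a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]                      # binary digits without the '0b' prefix
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the digits


def encode_successors(successors: list[int]) -> str:
    """Encode a sorted list of distinct non-negative node IDs as gamma-coded gaps."""
    bits = []
    previous = -1
    for node in successors:
        gap = node - previous                # always >= 1 for sorted, distinct IDs
        bits.append(gamma_encode(gap))
        previous = node
    return "".join(bits)


def decode_successors(bits: str) -> list[int]:
    """Invert encode_successors (assumes a well-formed bit string)."""
    successors = []
    previous = -1
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)  # read zeros + 1 binary digits
        i += zeros + 1
        previous += gap
        successors.append(previous)
    return successors


# Hypothetical link list: a page that mostly links to pages with nearby IDs.
links = [1000, 1001, 1002, 1004, 1010]
encoded = encode_successors(links)
print(len(encoded) / len(links), "bits per link in this toy example")
```

Small gaps cost very few bits, so link lists that point at nearby IDs compress well. Reaching roughly 3 bits per link on a real web graph presumably takes more than this one trick.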
The major thing missing from Mozilla (Score:5, Interesting)
Re:The major thing missing from Mozilla (Score:1)
</sillyness>
This was posted by mozilla, don't worry about modding me down for teasing the browser.
Re:The major thing missing from Mozilla (Score:1, Insightful)
The Google toolbar for IE has to ask google.com for the PageRank of each page you view, via XML-RPC. One of the fields in the XML-RPC request is a checksum. Without that checksum, google.com rejects the request. So it's just a matter of finding out how the toolbar calculates the checksum based on your URL. Then you could write a standalone (or Mozilla-based) tool for fetching PageRa
Re:The major thing missing from Mozilla (Score:2, Interesting)
Re:The major thing missing from Mozilla (Score:3, Informative)
Re:The major thing missing from Mozilla (Score:4, Informative)
"We currently have no plans to implement pagerank"
Still - a cool addition to mozilla.
You mean google's spyware (Score:2)
Re:You mean google's spyware (Score:2)
Ridiculous. You "anti"-spyware freaks are more dangerous than most spyware because you stir up shit and cause hysteria by accusing everything and everyone of being spyware. You remove the focus from the actual spyware and sleazeware out there.
You ar
This sounds cool.. (Score:5, Funny)
Some webmasters/SEO's are obsessive (Score:5, Interesting)
Re:This sounds cool.. (Score:5, Funny)
And you call yourself a geek. *Sigh*.
It doesn't matter why you need it. It's technical, GPLed, and has to do with Google. That's all the reason you need.
Re:This sounds cool.. (Score:5, Funny)
It's a geek hattrick!
Hattrick (Score:2)
Re:This sounds cool.. (Score:1)
Could I someday use it for my PC - Re:This soun (Score:2, Interesting)
I wonder if I can use the PageRank algorithm for the smaller universe of my hard drive itself?
I have over 6,000 files on my PC, many of which link to each other, and I am adding more links between them as time goes by. The collection is now so big that I can't even revisit my own files and reason out the implications of the links between pages, because of the huge time it would take to spend even a minute on each saved file.
I wonder if something like Pagerank will let the important files that are linked by man
Dumb Question: (Score:5, Interesting)
Xesdeeni
PageRank is part of Google's algo (Score:5, Informative)
Re:PageRank is part of Google's algo (Score:5, Informative)
Re:PageRank is part of Google's algo (Score:1)
Re:PageRank is part of Google's algo (Score:2)
Re:PageRank is part of Google's algo (Score:2, Informative)
Re:PageRank is part of Google's algo (Score:2, Informative)
PageRank is a one-dimensional recursive weighting for a web page. Initially you assume all pages are created equal. Then, for each page, you compute an updated PageRank based on its indegree (the number of pages linking to it). You usually also introduce a weighting factor designed to simulate some random chance that you "jump" to the next page by just typing a URL rather than following a link. After that, you typically normalize the scores (sum of squares must equal one is the preferred norm).
Now you have to i
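The iteration described in this comment can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the function name, the toy graph, the damping value of 0.85 and the fixed iteration count are all assumptions, and the scores here are kept normalized so that they sum to one (treating them as probabilities) rather than by sum of squares.

```python
def pagerank(out_links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration sketch. Every link target must also be a key of out_links."""
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}       # all pages start out equal

    for _ in range(iterations):
        # Baseline share from the random "jump" to an arbitrary page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in out_links.items():
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by three other pages.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(web).items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Note that "a" ends up close behind "c" even though only one page links to it, because that one page is the highest-ranked one. That is the recursive part of the weighting.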
Re:Dumb Question: (Score:5, Informative)
Re:Dumb Question: (Score:1)
~Berj
Re:Dumb Question: (Score:5, Funny)
Re:Dumb Question: (Score:2)
Or posting on Slashdot, for that matter.
Re:Dumb Question: (Score:1)
Re:Dumb Question: (Score:3, Funny)
Mod parent and grandparent and great-grandparent up.
Also, mod parent's children up.
Also, mod great-great-grandparent's great-great-granddaughters up.
Also, say up unto them verily, that the mod of the parent will be cast down the generations to be a mod on the children, and on the children's children, and on the children's children's chilluns.
And also, mod up the nephews of the parents of the siblings of the grandparent for though they be trolls or flamebait, they ar
Re:Dumb Question: (Score:2)
So that's how Google ranks its pages. I didn't realize they tracked the number of links. I didn't really think about it long, but I figured they just used how many times the query string appeared, maybe the age of the page, or whatever.
I wonder if this data would be hugely different from the number of visits a page receives, considering easily typed-in page addresses (fewer links needed), or the possibility that a si
Besides - not all browsers show page rank (Score:2, Interesting)
On my Mac for example, I can't see it at all. On my Wintel I can, thanks to the Google toolbar.
Re:Dumb Question: (Score:2, Informative)
To a second order of approximation, it weights the votes of the referencing links by their popularity.
To a third order of approximation, it is a Markov chain that measures the long-term likelihood of you arriving at a page if you were to traverse the net randomly: following random links out of each page and occasionally (1/20?) taking a random jump to an arbitrary URL.
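The random-traversal picture can be simulated directly. The sketch below is only an illustration under assumptions: the function name and the toy graph are made up, the jump probability of 0.05 is the comment's own guess of 1/20, and a page with no outgoing links is treated as forcing a random jump.

```python
import random
from collections import Counter


def random_surfer(out_links: dict[str, list[str]], steps: int = 200_000,
                  jump_probability: float = 0.05, seed: int = 0) -> dict[str, float]:
    """Estimate how often a random walker visits each page."""
    rng = random.Random(seed)                # fixed seed so runs are repeatable
    pages = list(out_links)
    visits = Counter()
    current = rng.choice(pages)

    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if not targets or rng.random() < jump_probability:
            current = rng.choice(pages)      # random jump to an arbitrary page
        else:
            current = rng.choice(targets)    # follow a random outgoing link
    return {page: visits[page] / steps for page in pages}


# Hypothetical four-page web: nothing links to "d", so it is only reached by jumps.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, share in random_surfer(web).items():
    print(page, round(share, 3))
```

The visit frequencies approximate the long-term probabilities of the Markov chain, so pages that are reachable only through random jumps, like "d" here, end up with a very small share.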
Re:Dumb Question: (Score:2)
Think of it as Google-bestowed karma on your website. :-D
Tee Hee (Score:5, Funny)
"Finally, proof!!"
Re:Tee Hee (Score:1, Redundant)
does this mean... (Score:1)
Re:does this mean... (Score:5, Funny)
... to find the fruits of your labor?
What a grape idea! Orange you glad you thought of it?
.
.
.
Ok. Groan fest is over.
Re:does this mean... (Score:1)
IMarv
Re:does this mean... (Score:2)
Was the word you were looking for "onomatopoeia" [reference.com]? You weren't even close, but I can hardly blame you
Which sites are the Root(s)? (Score:5, Interesting)
I mean they've got to start with some site(s) and then go through each link from there.
Explains it all... (Score:2, Informative)
Seriously, I don't know. Here's a page on how Google works though.
http://www.google.com/technology/index.html
Re:Which sites are the Root(s)? (Score:1)
Re:Which sites are the Root(s)? (Score:1)
Re:Which sites are the Root(s)? (Score:5, Informative)
Re:Which sites are the Root(s)? (Score:2, Interesting)
One Billion Linnks? (Score:1, Funny)
beyond PageRank... (Score:3, Interesting)
You're an optimist (Score:1)
Maybe take a new snapshot every day or week, see the web evolve.
How much time do you think is needed to take a snapshot of the Web? Almost certainly much longer than a day or even a week. My bet would be several months at the very least.
Re:beyond PageRank... (Score:2, Interesting)
Anyone got a lot of processing power and some spare time?
P
I can see it now... (Score:5, Funny)
Somewhere in 1980, milk shoots out of Bill Gates' nose for no apparent reason.
Re:I can see it now... (Score:1)
exactly (Score:1)
web graphs at 3 bits per link, that's a paddling...
compute PageRank with your PC, that's a paddling....
grokking PageRank, you better believe that's a paddling...
Google with feedback (Score:3, Interesting)
Anyway, forgive the opportunism, but this is reasonably on-topic. Last weekend I set myself the ambitious task of improving on Google. I came up with a Google front-end which allows you to give feedback on the quality of search results, and thus refine your search. I could really use people's help to test it out - you can find it here [68.65.221.46]. Feedback would really be appreciated.
By the way... (Score:3, Informative)
Re:By the way... (Score:2)
Re:By the way... (Score:2)
then again, a whole lot of "MS SUCKS" posts might qualify.
Re:Google with feedback (Score:4, Insightful)
The problem is that the GPL does not allow distribution of patent-encumbered technology. The authors of the code in question have every right to release their code with whatever license they want (I believe this is a free-speech issue, especially since the purpose of releasing the code is for doing research). People who receive their code may not use the code in a way that violates the patent, and in addition may not redistribute the code at all (since it would violate the GPL).
The other issue is that PageRank is really a mathematical formula, and as such is unpatentable. What they actually patented is an algorithm for computing PageRank. If someone finds another way of computing the same formula, I think the patent holders would have a very hard time showing infringement.
Re:Google with feedback (Score:1)
Feedback is local (Score:2)
This is good, but... (Score:5, Funny)
"I'll be there in a minute! I'm downloading the Internet!"
Re:This is good, but... (Score:2)
011
Proper decompression is left as an exercise for the reader.
Now, let us discuss payment schemes...
Im sorry, I have to say it tho...... (Score:1, Troll)
Technically speaking, you'd be downloading the WEB, not the Internet.
Trouble is, with Google's web cache there are no pics; just think of all the beautiful images of pet dogs and holidays at Weston-super-Mare you'd be missing out on!
Re:This is good, but... (Score:2)
Three-bit compression for web pages. (Score:2)
000: page is spam. Ignore it.
001: page is porn. Porn is all the same, show porn page from disk.
010: page is pop-up ad. Block it.
011: page is a 404.
100: page has javascript. Show random javascript error.
101: page is Slashdot.
110: page is Slashdot.
111: page belongs to the
Live XML Version (Score:2, Insightful)
If these are snapshots then you'll need to keep downloading them for your Page Rank system to be up to date. The web is constantly changing and therefore so is Page Rank. I can't see having a data set on your computer being all that useful, as it'll soon expire.
It would be far better to be able to link to a data set via XML and query it. That way you would have live, up-to-the-minute Page Ranks. I know that Google already does a live Page Rank
Re:Live XML Version (Score:2)
You are absolutely right. This is what Google does on your behalf. They have the computing power, storage-wise as well as processing-wise, to do the needed updating. Not that Google can do it 24/7, but they do better than I can with my 4 computers. :\
Google patents? (Score:5, Interesting)
GPL'd? Hmm, I thought that Google did patent the PageRank algorithm (correct me if I am wrong), so re-implementing THEIR algorithm even more efficiently would be incompatible with GPL. OTOH, if it is not THEIR algorithm, it can not be called 'PageRank'
Oh, the evils of software patents...
Paul B.
Re:Google patents? (Score:4, Interesting)
Google hasn't exactly patented the algorithm for all uses, and no court has determined that the code infringes the patent, and software patents aren't valid in most countries, so it's not clear whether or not there is any compatibility.
It would seem that anyone who uses the code to build a search engine would be infringing, but even that is something that lawyers can argue about.
Re:Google patents? (Score:2)
Unless the term is trademarked (is it?), you can call whatever the hell you want "PageRank" and nobody can do a thing about it.
Doesn't actually calculate PageRank? (Score:5, Informative)
Moreover, IANAL, but isn't the PageRank algorithm patented by Google? Wouldn't this prevent anyone from releasing GPL code that computes PageRank?
Re:Doesn't actually calculate PageRank? (Score:2)
Isn't that illegal? (Score:2)
Re:Isn't that illegal? (Score:2, Informative)
Not Page Rank (?) (Score:2)
has to be said (Score:4, Funny)
Proof of concept only (Score:5, Informative)
Also, if the downloadable dataset only covers a small portion of the web, how can this system's utility really compare to Google's?
That said, I think computer science proof-of-concept projects are very useful and serve a valuable purpose in getting the ideas out there for others to improve upon.
Wow! The site is not /.'d! (Score:1)
What a mess (Score:3, Funny)
Re:What a mess (Score:5, Informative)
Re:What a mess (Score:3, Informative)
Re:What a mess (Score:1)
OED Online:
Re:What a mess (Score:2)
I feel so much more enlightened now, and now I have less reason to learn Emacs. Viva vi!
bile@netscape.com (Score:1)
Google's algorithms have changed quite a bit (Score:3, Insightful)
They also use terms that appear in links as a major key in ranking searches.
(Among other things.)
Not that it is not interesting to see these rankings, and note the most widely linked to sites on the net.
Which, by the way, after the obvious winners like Yahoo, include Adobe and Real networks, which have gotten immense numbers of sites to link to them with "Get acrobat reader" style links.
I've often wondered if the makeashorterlink and tinyurl folks are doing it just for the googlejuice.
In reverse, many sites now use JavaScript links in order to preserve their googlejuice.
Very much a Heisenberg phenomenon here.
I wonder... (Score:5, Interesting)
You say Power-Law graph of a billion pages... (Score:1)
Ask and ye shall receive... (Score:5, Informative)
Here (for free) [lumeta.com]
Here too (for free) [ucl.ac.uk]
This one too (for free) [nd.edu]
This one also (free) [mundi.net]
And don't forget this classic ($30 poster) [thinkgeek.com]
-T
can anyone explain what a web graph is? (Score:1)
Maybe it's my ADD. Maybe it's my inherent dumbassedness... but I can't grok that.
So what is a web graph? How is that related to PageRank? If I download all this d
Re:can anyone explain what a web graph is? (Score:2)
Re:can anyone explain what a web graph is? (Score:5, Informative)
In a web graph, vertices are webpages and edges are hyperlinks.
PageRank determines how many incoming edges a vertex has. Given the nature of the web, this is a nontrivial problem because a vertex only knows its outgoing edges.
The assumption for PageRank is that the more incoming edges a vertex has, the more popular it is. So you would use this to figure out how popular a particular vertex is.
Given this you could do like Google and combine it with a search engine to prioritize the results.
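The point about a vertex only knowing its outgoing edges can be made concrete: to count incoming edges you have to scan every page's outgoing links and invert them. The sketch below does that for a toy graph. The function name and the data are hypothetical, and it covers only the simplified incoming-edge view described in this comment.

```python
def incoming_links(out_links: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a web graph: map each page to the list of pages that link to it."""
    incoming = {page: [] for page in out_links}   # pages nobody links to still appear
    for source, targets in out_links.items():
        for target in targets:
            # setdefault also covers targets that have no entry of their own
            incoming.setdefault(target, []).append(source)
    return incoming


# Hypothetical web graph stored the way a crawler sees it: page -> outgoing links.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
incoming = incoming_links(web)
for page in sorted(incoming, key=lambda p: -len(incoming[p])):
    print(page, "has", len(incoming[page]), "incoming link(s) from", incoming[page])
```

On a graph with a billion links the inversion is the expensive part, since no single page's data tells you who links to it.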
Can google sue for reverse engineering pagerank (Score:1)
WebGraph has a PR of NULL (Score:1)
The WebGraph tool may be interesting for college students, but webmasters who are interested in SEO techniques are going to find little use for it.
pagerank for the masses (Score:1)
Just use Google (Score:1)
Googling your harddisk (Score:2, Interesting)
Anyone know of anything like that?
Re:Googling your harddisk (Score:2)
I thought I remembered Google having a product like this, but I can't find it [google.com] now.
MS Win2k and WinXP have an indexing service that's supposed to do just what you want. It's not enabled by default in 2k; not sure about XP. I've been afraid to try it for various paranoia and stability reasons.
HTdig [htdig.org] was my next thought. It's designed for web pages, but I bet you could restrict it to your hard disk. However, the site says they don'
Re:why is rank/rating necessary? (Score:5, Funny)
For a second there I thought you were just talking like Elmer Fudd! "wating and wanking incwease the welevance of pagewanking..."
Re:why is rank/rating necessary? (Score:2, Insightful)
It could have receded back into the depths and maintained quality but it put page-ranking first, attempting to attract and contain a particular audience.
I disagree. In case you haven't noticed, the title of the /. front page is "News for Nerds, Stuff that Matters." So, of course /. is attracting a particular audience. That's a Good Thing.
Target audience is one of the most important decisions when designing a web site. "Good info" is a subjective concept. What's good to you is not necessarily good
Re:why is rank/rating necessary? (Score:2)
That reminds me of Asimov's psychohistory and the Second Foundation: the First Foundation had to be unaware of the influence of the Second Foundation for it to work. Maybe that makes Searchking the Mule?
Re:why is rank/rating necessary? (Score:2)
We have TV without Nielsen ratings. We call it "PBS."
Is PBS better? Sometimes. Perhaps even often in recent years. Certainly no one has ever referred to PBS' content as mindless drivel, the way we talk about things like Survivor and American Idol.
But let me ask you this: If you could have only one TV station and you had to choose between ABC, CBS, FOX, NBC and PBS would you choose PBS? Didn't think so.
Re:why is rank/rating necessary? (Score:5, Interesting)
Google has been, so far at least, a rare exception in the world of privatized communications utilities, consistently showing an amazing lack of intention to lock people into their service, whether through exclusivity agreements of some sort or the simple expedient of proprietary technology (i.e., "increase your PageRank by 10% if you support new encrypted GoogleML tags on your site!"). Nothing is permanent, though, and as we all know, single points of failure are a no-no.
So, to bring all this back somewhere in the general neighborhood of the main story: further distributing the capability to build "mini-Googles", or specialized, community-maintained (but still fairly large-scale in terms of number of pages and links indexed) search tools is very interesting, and a useful body of technology to perpetuate.
Or, even more generally, the technology needed to do large-scale storage, analysis, and manipulation of directed graph structures is a very useful tool. Software analysis often relies heavily on large graphs showing dependencies, caller-callee relationships, variable accesses, etc., as do any number of AI subdomains like knowledge representation and planning systems.