Science

Computing PageRank on your PC? 186

An anonymous reader writes "A group of CS researchers of the University of Milan has found a way to compress web graphs at 3 bits per link, and to access them in compressed form. They provide data sets representing real snapshots of portions of the web with one hundred million nodes and 1 billion links. You just need some bandwidth to download a few hundred megabytes of data, and you can compute PageRank with your PC. All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!"
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Thursday June 12, 2003 @02:22PM (#6184019)
    Is a way to look at Google's pagerank. That's the only real thing the IE Google toolbar has over the Mozilla alternative.
    • Heck, at only 12 megs for a download of Mozilla, maybe we should incorporate this "feature" into Mozilla. What's a couple hundred more megs, right? :-)

      </sillyness>

      This was posted by mozilla, don't worry about modding me down for teasing the browser.
    • by Anonymous Coward
      Is a way to look at Google's pagerank. That's the only real thing the IE Google toolbar has over the Mozilla alternative.

      The Google toolbar for IE has to ask google.com for the PageRank of each page you view, via XML-RPC. One of the fields in the XML-RPC request is a checksum. Without that checksum, google.com rejects the request. So it's just a matter of finding out how the toolbar calculates the checksum based on your URL. Then you could write a standalone (or Mozilla-based) tool for fetching PageRa

      • The problem is that the Google toolbar's checksum changes constantly. So if you were to find out exactly how the Google toolbar works regarding PageRank, all it takes is for Google's official toolbar to change it and it won't work anymore. The catch, however, is that if you send a wrong checksum to Google, they don't send back an error message of any sort; instead they send back a fake PageRank. So you wouldn't really know if it was still working or not.
    • by Anonymous Coward
      http://googlebar.mozdev.org
    • Don't think for a moment that google is not tracking and saving this.
      • Spyware? How can it be "spying" on the user when the installer practically assaults the user with warnings that it will send the URL to Google so it can return the pagerank. How exactly is it supposed to send the pagerank if it can't send the URL to Google anyway?

        Ridiculous. You "anti"-spyware freaks are more dangerous than most spyware because you stir up shit and cause hysteria by accusing everything and everyone for spyware. You remove the focus from the actual spyware and sleazeware out there.

        You ar

  • by xchino ( 591175 ) on Thursday June 12, 2003 @02:22PM (#6184022)
    Now if I can just think of a reason why I would need this..
    • by Anonymous Coward on Thursday June 12, 2003 @02:25PM (#6184047)
      If Google tweaks one thing, causing result 97 to shift to result 98, they notice. They'd be doing this daily to check on their pages.
    • by Daniel_Staal ( 609844 ) <DStaal@usa.net> on Thursday June 12, 2003 @02:42PM (#6184223)
      Now if I can just think of a reason why I would need this..

      And you call yourself a geek. *Sigh*.

      It doesn't matter why you need it. It's technical, GPLed, and has to do with Google. That's all the reason you need.

    • This is really cool, not just for page rank. Finding pre-compiled data sources like this can be a great catalyst in scientific research. Just my two cents.

    • I wonder if I can use pagerank algorithm for the smaller universe of my harddrive itself?

      I have over 6,000 files on my PC, many of which link to each other, and I am adding more links between them as time goes by. The collection is now so big that I can't even revisit my own files and reason out the implications of the links between pages, because of the huge time it would take to spend even a minute on each saved file.

      I wonder if something like Pagerank will let the important files that are linked by man

  • Dumb Question: (Score:5, Interesting)

    by Xesdeeni ( 308293 ) on Thursday June 12, 2003 @02:23PM (#6184031)
    What's Page Rank? Does this indicate how often my page is visited?

    Xesdeeni
    • by Anonymous Coward on Thursday June 12, 2003 @02:27PM (#6184067)
      It's basically how well linked to your page is, and how well linked to the pages linking to you are, and so on. It's an advanced form of link popularity. The idea is that the more people that link to something, the more influential/important it is. Some sites have high PageRanks of 10 (like Google), while Slashdot is something like an 8. Many pages are in the 4-6 range. Every link you create is like a "vote" for another web page.
    • Re:Dumb Question: (Score:5, Informative)

      by Chris_Stankowitz ( 612232 ) on Thursday June 12, 2003 @02:31PM (#6184103)
      Do you mods ever stop to wonder if this guy could have been asking a legit question? It's possible he doesn't know. It's also possible that others don't. I know... I know..., this is /., how could he not know, right? It is still very possible. I'm not saying he should have been modded up, but by modding him down, someone may miss the chance to read his post and reply to it with an intelligent answer. All of that being said, I would answer his question, but now that I think about it, I'm not sure what it is. I 'think' I know. But I think he and I are in the same boat. I also thought about posting this as an AC, but I won't. Then surely someone will just think that it was the original poster posting as an AC. He may be trolling. He may not be. It won't hurt to answer the question.
      • Well said =)

        ~Berj
      • by Anonymous Coward on Thursday June 12, 2003 @02:34PM (#6184149)
        Jesus, you created a second account just to defend yourself!
      • I didn't know what it was either. Mod parent and grandparent up
        • I didn't know what it is either.

          Mod parent and grandparent and great-grandparent up.

          Also, mod parents children up.

          Also, mod great-great-grandparents great-great-granddaughters up.

          Also, say up unto them verily, that the mod of the parent will be cast down the generations to be a mod on the children, and on the children's children, and on the children's children's chilluns.

          And also, mod up the nephews of the parents of the siblings of the grandparent for though they be trolls or flamebait, they ar

      • Thanks for the defense, but I was kinda enjoying my first post labelled as a "Troll" :).

        So that's how Google ranks its pages. I didn't realize they tracked the number of links. I didn't really think about it long, but I figured they just used how many times the query string appeared, maybe the age of the page, or whatever.

        I wonder if this data would be hugely different from the number of visits a page receives, considering easily typed-in page addresses (fewer links needed), or the possibility that a si
      • And let's not forget... not all of us even get exposed to page rank regularly.

        On my Mac for example, I can't see it at all. On my Wintel I can, thanks to the Google toolbar.
    • Re:Dumb Question: (Score:2, Informative)

      by biomass ( 13779 )
      Page rank, to a first order of approximation, ranks your page by "popularity". Using a voting system, it counts the number of links to your page.

      To a second order of approximation, it weights the votes of the referencing links by their popularity.

      To a third order of approximation, it is a Markov chain that measures the long-term likelihood of you arriving at a page if you were to randomly traverse the net: taking random links out of a page and occasionally (1/20?) taking random jumps to arbitrary URLs.
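The random-surfer description above can be tried on a toy graph. What follows is only a minimal sketch of the textbook power iteration, not Google's production code and not anything shipped with WebGraph. The function name `pagerank`, the four-page `toy_web`, and the jump probability of 0.15 (the value commonly quoted from the original paper, rather than the guessed 1/20) are all assumptions made for illustration.

```python
def pagerank(out_links, jump=0.15, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key of out_links.
    `jump` is the probability of a random jump to an arbitrary page.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page first receives its share of the random jumps.
        new_rank = {p: jump / n for p in pages}
        for page, targets in out_links.items():
            if targets:
                # A page splits the rest of its rank among the pages it links to.
                share = (1.0 - jump) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = (1.0 - jump) * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph "c" comes out on top because everyone votes for it, and "a" lands close behind because it receives the single vote of the highly ranked "c". That is the "second order" weighting in action.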
    • Think of it as Google-bestowed karma on your website. :-D

  • Tee Hee (Score:5, Funny)

    by teamhasnoi ( 554944 ) <teamhasnoi AT yahoo DOT com> on Thursday June 12, 2003 @02:23PM (#6184035) Journal
    I bet the Searchking is steaming right about now...

    "Finally, proof!!"

  • I wonder, if this goes as it's planned to, is it the end of search engines and the beginning of peer-to-peer search?
  • by amembleton ( 411990 ) <aembleton@bigf[ ].com ['oot' in gap]> on Thursday June 12, 2003 @02:26PM (#6184056) Homepage
    When these Web Graph or Page Rank things are drawn up which sites do they use as the roots?

    I mean they've got to start with some site(s) and then go through each link from there.
  • by Anonymous Coward
    There are more links than that just on Microsoft's Support page. Although I don't know if you can call them links if they only send you around in a circle.
  • beyond PageRank... (Score:3, Interesting)

    by rfischer ( 95276 ) on Thursday June 12, 2003 @02:26PM (#6184063)
    ... I would be interested in how the links change over time. Maybe take a new snapshot every day or week, see the web evolve.
    • Maybe take a new snapshot every day or week, see the web evolve.

      How much time do you think is needed to take a snapshot of the Web? Most certainly much longer than a day or even a week. My bet would be several months at the very least.

    • That would be amazingly cool. The only problem (and it's not really a problem) would be that generally people never, or rarely, remove links. If you limited this to links only (say) a month old or younger, you could see the paths of memes round the web . . . for example right now, you'd probably see a lot of BitTorrent hotspots, whereas a couple of years ago there'd be lots focussed on "all your base" . . .

      Anyone got a lot of processing power and some spare time? ;)

      P
  • by AyeRoxor! ( 471669 ) on Thursday June 12, 2003 @02:28PM (#6184074) Journal
    "[...] even on a PC with as little as 256 Mbytes of RAM."

    Somewhere in 1980, milk shoots out of Bill Gates' nose for no apparent reason.
  • by Anonymous Coward
    This is what we need to talk about at our little IRC chat session tonight, commander.

    web graphs at 3 bits per link, that's a paddling...

    compute PageRank with your PC, that's a paddling....

    groking PageRank, you better believe that's a paddling...
  • Google with feedback (Score:3, Interesting)

    by Sanity ( 1431 ) * on Thursday June 12, 2003 @02:29PM (#6184089) Homepage Journal
    Doesn't Google have a patent on PageRank?

    Anyway, forgive the opportunism, but this is reasonably on-topic. Last weekend I set myself the ambitious task of improving on Google. I came up with a Google front-end which allows you to give feedback on the quality of search results, and thus refine your search. I could really use people's help to test it out - you can find it here [68.65.221.46]. Feedback would really be appreciated.

    • By the way... (Score:3, Informative)

      by Sanity ( 1431 ) *
      ...it isn't on a fat pipe, so please understand if it's slow.
    • by YoJ ( 20860 ) on Thursday June 12, 2003 @02:48PM (#6184274) Journal
      The whole point of patents is to encourage inventors to publish their inventions in a safe way. In some respects, PageRank is a good example of how the system is supposed to work. They publish the algorithm, people examine it and experiment further with it, but the inventors still have protection against people ripping off their work.

      The problem is that the GPL does not allow distribution of patent-encumbered technology. The authors of the code in question have every right to release their code with whatever license they want (I believe this is a free-speech issue, especially since the purpose of releasing the code is for doing research). People who receive their code may not use the code in a way that violates the patent, and in addition may not redistribute the code at all (since it would violate the GPL).

      The other issue is that PageRank is really a mathematical formula, and as such is unpatentable. What they actually patented is an algorithm for computing PageRank. If someone finds another way of computing the same formula, I think the patent holders would have a very hard time showing infringement.

    • The only problem with getting feedback from anyone is that it would be very easy to reduce a search engine's hits on specific things. For example, say I work for Pepsi: I search for cola and say "BAD" for every Coca-Cola result. I script it to submit hundreds of "bads" an hour. Now, I search for cola and only get Pepsi results.
  • by Prince_Ali ( 614163 ) on Thursday June 12, 2003 @02:29PM (#6184092) Journal
    This is good, but I'd rather have the google cache compressed to 3 bits per page.

    "I'll be there in a minute! I'm downloading the Internet!"

  • Live XML Version (Score:2, Insightful)

    by amembleton ( 411990 )
    data sets representing real snapshots of portions of the web

    If these are snapshots then you'll need to keep downloading them for your Page Rank system to be up to date. The web is constantly changing and therefore so is Page Rank. I can't see having a data set on your computer being all that useful as it'll soon expire.

    It would be far better to be able to link to a data set via XML and query it. That way you would have live, up-to-the-minute Page Ranks. I know that Google already does a live Page Rank

    • If these are snapshots then you'll need to keep downloading them for your Page Rank system to be up to date. The web is constantly changing and therefore so is Page Rank. I can't see having a data set on your computer being all that useful as it'll soon expire.

      You are absolutely right. This is what Google does on your behalf. They have the computing power, storage-wise as well as processing-wise, to do the needed updating. Not that Google can do it 24/7, but they do better than I can with my 4 computers. :\

  • Google patents? (Score:5, Interesting)

    by PaulBu ( 473180 ) on Thursday June 12, 2003 @02:31PM (#6184111) Homepage
    All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!

    GPL'd? Hmm, I thought that Google did patent the PageRank algorithm (correct me if I am wrong), so re-implementing THEIR algorithm even more efficiently would be incompatible with GPL. OTOH, if it is not THEIR algorithm, it can not be called 'PageRank'
    Oh, the evils of software patents...
    Paul B.
    • Re:Google patents? (Score:4, Interesting)

      by JoeBuck ( 7947 ) on Thursday June 12, 2003 @02:37PM (#6184179) Homepage

      Google hasn't exactly patented the algorithm for all uses, and no court has determined that the code infringes the patent, and software patents aren't valid in most countries, so it's not clear whether or not there is any compatibility.

      It would seem that anyone who uses the code to build a search engine would be infringing, but even that is something that lawyers can argue about.

    • OTOH, if it is not THEIR algorithm, it can not be called 'PageRank'.

      Unless the term is trademarked (is it?), you can call whatever the hell you want "PageRank" and nobody can do a thing about it.
  • by Vultan ( 468899 ) on Thursday June 12, 2003 @02:31PM (#6184116)
    As best as I can tell from the website, the API is only for storing and interacting with a large graph. Nothing there is actually involved with PageRank. You could use this API presumably to write your own PageRank code, but to say "everybody can grok PageRank now!" is misleading at best.

    Moreover, IANAL, but isn't the PageRank algorithm patented by Google? Wouldn't this prevent anyone from releasing GPL code that computes PageRank?
    • Moreover, IANAL, but isn't the PageRank algorithm patented by Google? Wouldn't this prevent anyone from releasing GPL code that computes PageRank?
      It would prevent anyone in the US from releasing that code. Software patents don't apply everywhere (yet)

  • PageRank is patented, isn't it?
  • I don't think this is pagerank, reading the link, this looks more like another rating system that is similar to pagerank. It's great for study, but I don't think reading through the source and finding ways to 'trick' this algorithm will necessarily work on Google. Correct me if I'm wrong someone.
  • by madHomer ( 2207 ) on Thursday June 12, 2003 @02:34PM (#6184142)
    It's just not the same without the pigeons [google.com]...
  • by Saganaga ( 167162 ) on Thursday June 12, 2003 @02:38PM (#6184184) Homepage
    I think this project is really just a proof of concept. As another post pointed out, to make this really useful you'd need to regularly update your local data set, which isn't very practical for most people.

    Also, if the downloadable dataset only covers a small portion of the web, how can this system's utility really compare to Google's?

    That said, I think computer science proof-of-concept projects are very useful and serve a valuable purpose in getting the ideas out there for others to improve upon.
  • Though after a quick read, I can see why ...
  • What a mess (Score:3, Funny)

    by Ignorant Aardvark ( 632408 ) * <cydeweys.gmail@com> on Thursday June 12, 2003 @02:51PM (#6184300) Homepage Journal
    Sure, now everybody can grok PageRank, but I, for the life of me, cannot grok grok.
    • Re:What a mess (Score:5, Informative)

      by dpbsmith ( 263124 ) on Thursday June 12, 2003 @02:57PM (#6184362) Homepage
      Just in case this wasn't an implied rhetorical question... the term, as far as I know, was invented by Robert Heinlein in his novel _Stranger in a Strange Land,_ where it is an expression used by Martians. It literally means "to drink," but the Martians use it to mean an understanding that is both very deep and very complete.
      • Re:What a mess (Score:3, Informative)

        by mbourgon ( 186257 )
        Yes. Basically, "to share water with", which on Mars meant you were more than brothers. Considering how little water was/is on Mars, it was a great honor.
      • Indeed.

        OED Online:

        grok, v. U.S. slang. Also grock.

        [Arbitrary formation by Heinlein (see quot. 1961).]

        a. trans. (also with obj. clause) To understand intuitively or by empathy; to establish rapport with. b. intr. To empathize or communicate sympathetically (with); also, to experience enjoyment.

        1961 R. HEINLEIN Stranger in Strange Land iii. 18 Smith had been aware of the doctors but had grokked that their intentions were benign. Ibid. xxiv. 250 Now that he knew himself to be self he was

      • I never knew the source but figured it had to be Emacs. grep, grok, natch?

        I feel so much more enlightened now, and now I have less reason to learn Emacs. Viva vi!
  • It doesn't mean a lot to me when my brother says he is going to double his efforts to find a job. This is especially true if you know my brother.
  • by HiKarma ( 531392 ) * on Thursday June 12, 2003 @02:55PM (#6184334)
    Since their original papers, according to all posted reports. So I don't think you're really going to get the exact google number from a basic algorithm and this data set.

    They also use terms that appear in links as a major key in ranking searches.

    (Among other things.)

    Not that it is not interesting to see these rankings, and note the most widely linked to sites on the net.

    Which, by the way, after the obvious winners like Yahoo, include Adobe and Real networks, which have gotten immense numbers of sites to link to them with "Get acrobat reader" style links.

    I've often wondered if the makeashorterlink and tinyurl folks are doing it just for the googlejuice.

    In reverse, many sites now use javascript links in order to preserve their googlejuice.

    Very much a heisenberg phenomenon here.
  • I wonder... (Score:5, Interesting)

    by crashnbur ( 127738 ) on Thursday June 12, 2003 @02:58PM (#6184364)
    ...how this can be used to discover the percentage of broken links on the web at any given moment in time.
  • and I say "Dammit, where are all the pretty pictures."
  • I get this from the article: "A set of flat codes, called ζ codes, which are particularly suitable for storing web graphs (or, in general, integers with power-law distribution in a certain exponent range). The fact that these codes work well can be easily tested empirically, but we also try to provide a detailed mathematical analysis."

    Maybe it's my ADD. Maybe it's my inherent dumbassedness... but I can't grok that.

    So what is a web graph? How is that related to PageRank? If I download all this d
    • Yeah, I think they should have explained that. Presumably a Web graph is what you get when you treat each URL as a node and each link as an edge in a graph. PageRank is an algorithm used by Google that takes a Web graph as input.
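To make the "codes for integers" part a little more concrete, here is a rough sketch of why a list of links can take only a few bits each. Store each page's links as sorted node numbers, write down the gaps between neighbours instead of the numbers themselves, and give small gaps short codewords. The sketch deliberately swaps in the classic Elias gamma code rather than the article's ζ codes, and the names `gamma_code`, `encode_successors` and `decode_successors` are made up for this example; none of this is the WebGraph API.

```python
def gamma_code(x):
    """Elias gamma code of an integer x >= 1, as a string of '0'/'1'."""
    binary = bin(x)[2:]                      # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary  # length prefix in unary, then the value


def encode_successors(successors):
    """Gap-encode a sorted list of distinct node ids (ids start at 0)."""
    bits = []
    previous = -1
    for node in successors:
        bits.append(gamma_code(node - previous))  # the gap is always >= 1
        previous = node
    return "".join(bits)


def decode_successors(bits):
    """Inverse of encode_successors."""
    successors = []
    previous = -1
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)  # read the value itself
        i += zeros + 1
        previous += gap
        successors.append(previous)
    return successors


# Hypothetical page whose links point at nearby node numbers.
links = [1000, 1001, 1002, 1005, 1010]
encoded = encode_successors(links)
print(len(encoded), "bits for", len(links), "links")
assert decode_successors(encoded) == links
```

Only the first link is expensive here. Once the node numbers cluster, as links within one site tend to, each further link costs a handful of bits instead of a full 32-bit integer.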
    • by lordbrain ( 172792 ) on Thursday June 12, 2003 @04:11PM (#6184961)
      A graph is made up of two things, edges and vertices.

      In a web graph, vertices are webpages and edges are hyperlinks.

      PageRank determines how many incoming edges a vertex has. Given the nature of the web, this is a nontrivial problem because a vertex only knows its outgoing edges.

      The assumption for PageRank is that the more incoming edges a vertex has, the more popular it is. So you would use this to figure out how popular a particular vertex is.

      Given this you could do like Google and combine it with a search engine to prioritize the results.
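The incoming-edge counting described in this comment can be sketched in a few lines, assuming the web graph is held as a dict that maps each page to the pages it links to. That matches the point that a vertex only knows its outgoing edges: the whole graph has to be scanned once to learn who points at whom. The function name `in_degrees` and the tiny `toy_web` are hypothetical, and plain in-degree is only the simplest form of link popularity.

```python
from collections import Counter


def in_degrees(out_links):
    """Count incoming edges per vertex, given only outgoing edges.

    out_links maps each page to the list of pages it links to.
    """
    counts = Counter({page: 0 for page in out_links})  # pages nobody links to keep 0
    for page, targets in out_links.items():
        for target in set(targets):  # repeated links from one page count once
            counts[target] += 1
    return dict(counts)


# Hypothetical four-page web.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
popularity = sorted(in_degrees(toy_web).items(), key=lambda kv: -kv[1])
print(popularity)
```

A search engine could then use such a score to order pages that all match the same query, which is the combination suggested above.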
  • even if it was improved upon. Can the idea of ranking based on link popularity be patented? Did Google patent it? If not, how much longer before some asshole lawyer in Menlo Park or Amazon/MS/AOL does and tries to shut down Google?
  • If WebGraph can inflate their google PR to 10 then I'm a believer. Until then this looks like one of the many tools available to analyse your PR.

    The WebGraph tool may be interesting for college students, but webmasters who are interested in SEO techniques are going to find little use for this tool.
  • I don't know what the big deal is. I've always been able to do pagerank on my computer [google.com]...
  • Why reinvent the wheel?
  • by |>>? ( 157144 )
    While calculating PageRank seems like a nice idea, I'm much more interested in having a google search available over my harddisk. I recall that AltaVista in the mid-90's had a programme that created an index over your whole disk - it dealt with many filetypes including .doc, .pdf, .mbox and basically gave you an AltaVista search over all your harddisk content.

    Anyone know of anything like that?

    • I'm much more interested in having a google search available over my harddisk.

      I thought I remember Google having a product like this, but I can't find it [google.com] now.

      MS Win2k and WinXP have an indexing service that's supposed to do just what you want. It's not enabled by default in 2k; not sure about XP. I've been afraid to try it for various paranoia and stability reasons.

      HTdig [htdig.org] was my next thought. It's designed for web pages, but I bet you could restrict it to your hard disk. However, the site says they don'
