The Math Behind PageRank

Journal written by anaesthetica (596507) and posted by samzenpus on Wed Dec 06, 2006 06:45 PM
from the learn-to-be-number-one dept.
anaesthetica writes "The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing and updating, the relevance of pages needs to be recalculated on a continuous basis."

  • 10,000 words (Score:5, Funny)

    by ambivalentduck (1004092) on Wednesday December 06 2006, @06:48PM (#17139072)
    But 9,000 of those words are slang for parts of the human anatomy.  Go figure.
          • Re: (Score:2)

            On the other hand: explain Gallagher and Carrot Top. "Apparently" they are funny, because they have "careers". Yet everyone with an actual sense of humor knows they are just waiting to unhinge their jaws and swallow you whole.
  • by dada21 (163177) * <adam.dada@gmail.com> on Wednesday December 06 2006, @06:50PM (#17139112) Homepage Journal
    I have sites with a PR of 6, and I can tell you that they got that way because of inbound links from other sites. In fact, when other sites dropped those links, my PR dropped (to 5, and even to 4). Getting more inbound links brought the PR back.

    Think about those links, too. How often do you use common words in an HREF? I don't think there's a lot of weeding out of common words since the link to a site is usually either its name, or a description containing some important keywords.

    I love seeing these technoscientists think they understand PageRank, but just like TimeCube, they're way, way off.
  • Bad summary (Score:5, Interesting)

    by Knights who say 'INT (708612) on Wednesday December 06 2006, @07:06PM (#17139318) Journal
    The article specifically says the PageRank eigenvector is only recalculated once a month, approximately. Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.
    • Re: (Score:2, Funny)

      Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.

      Please. I can do that on paper in, like, five minutes.
    • Re: (Score:2)

      Several hours for 25b x 25b? Jeez, it took Slashdot the better part of a day to update the comment id field type in their database... 16.7m by 1. OSTG, we demand that the servers running Slashdot be upgraded to something that could actually withstand a S
      • Re:Bad summary (Score:5, Insightful)

        by martin-boundary (547041) on Wednesday December 06 2006, @09:19PM (#17140646)
        It's nowhere near like that. A web matrix is very sparse, so if you did a true 25Bx25B matrix power iteration, you'd be multiplying zero by zero a gazillion times. Optimization is about not doing things you don't need to do, and optimizing PageRank is about figuring out clever ways to not do the full multiplication. Moreover, PageRank is calculated in parallel over a computer farm. Overall, you can expect a single iteration to take on the order of an hour, and you can expect around 50-80 iterations before Google gives up and says it's converged. You can also try and reuse the previous "converged" PageRank vector to cut down on the 50-80 iterations after you've crawled new pages.

        If google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
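
A toy sketch of the idea in the comment above: the power iteration only ever walks the links that exist, so the zero entries of the matrix are never touched, and the previous rank vector can seed the next run. Everything here is an assumption for illustration (the `pagerank` function, the 0.85 damping factor, the star-shaped toy web, the tolerance); none of it is Google's actual code.

```python
def pagerank(links, damping=0.85, tol=1e-10, start=None, max_iter=500):
    """Power iteration over an adjacency list; returns (ranks, iterations used).

    links maps each page to the list of pages it links to. Only existing
    links are visited, so the work per iteration is proportional to the
    number of links, not to (number of pages) squared.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if start:
        # Warm start: reuse old scores, give unseen pages a uniform share,
        # then renormalise so the ranks sum to 1 again.
        rank = {p: start.get(p, 1.0 / n) for p in pages}
        total = sum(rank.values())
        rank = {p: r / total for p, r in rank.items()}
    else:
        rank = {p: 1.0 / n for p in pages}

    for iteration in range(1, max_iter + 1):
        # Pages without outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            return rank, iteration
    return rank, max_iter


def star_web(feeders):
    """Hypothetical toy web: many small pages link to a hub, which links to one favourite."""
    links = {f"page{i}": ["hub"] for i in range(feeders)}
    links["hub"] = ["favourite"]
    links["favourite"] = ["hub"]
    return links


old_rank, _ = pagerank(star_web(49))          # last month's converged vector
new_web = star_web(50)                        # the crawler found one more page
cold_rank, cold_iters = pagerank(new_web)     # start from scratch
warm_rank, warm_iters = pagerank(new_web, start=old_rank)
print("cold start iterations:", cold_iters)
print("warm start iterations:", warm_iters)
```

On this toy web the warm start needs fewer iterations than the cold start, and both land on the same ranks to within the tolerance. A real crawl changes far more than one page, so the saving there would differ.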

  • Nouns maybe? (Score:4, Insightful)

    by Bryansix (761547) on Wednesday December 06 2006, @07:07PM (#17139344) Homepage
    It seems like it would be the nouns, pronouns, etc. that Google should be paying attention to. Who cares about all the verbs, adjectives, etc. that just muddy the indexing waters?
    • Re: (Score:1)

      I believe that a race is on at the moment for semantic searching. Not only nouns, verbs etc, but whether the phrases are subjective or objective. I know a blog search company that is working on this. They wanted to borrow some of my code.
    • Re: (Score:2, Insightful)

      Searching for pill and the pill should yield very different results. Yes nouns are more important, but articles and other words cannot be disregarded.
      • Re: (Score:2)

        I actually thought about that after I posted. I know all the words are important for indexing. I'm just saying that looking at keywords and placing more importance on those is a part of the mix too. Those keywords are almost always nouns.
  • I read about this some time ago ... I think the paper was entitled "The 10 billion dollar Eigenvector: The math behind google" or something to that effect. Sorry, but I've got a new laptop and cannot find the exact title. It was an excellent introduction
    • Re: (Score:2)

      Here's the bibtex reference.

      @article{bryan:569,
      author = {Kurt Bryan and Tanya Leise},
      collaboration = {},
      title = {The $25,000,000,000 Eigenvector: The Linear Algebra behind Google},
      publisher = {SIAM},
      year = {2006},
      journal = {SIAM Review},
      volume = {48},
      numbe
  • by CrazyJim1 (809850) on Wednesday December 06 2006, @07:10PM (#17139396) Journal
    I skimmed the article and didn't find what I wanted to find. If you make a webpage that you want ranked high, what do you do? Do you make 100 geocities accounts and provide links to your main website, or what? I'm just wondering this out of curiosity, not out of need.
    • At a very basic level a site's page rank is a reflection of how much other sites think it's relevant, and is based on how important the sites are that link to it. Get a link from the BBC, CNN, or somewhere like that and it's worth thousands or millions of
    • Re: (Score:1)

      That's kinda what I thought at first as well, but looking over the lower two-thirds of the article, I started to get a different impression. They talked about a 'strong web' idea, where if your webpage is disconnected from the 'main' web and set up in a s
    • by Anonymous Brave Guy (457657) on Wednesday December 06 2006, @07:52PM (#17139894)

      The underlying idea behind page rank is pretty well-exposed at this point, and is described in TFA. Essentially, it's a big set of simultaneous equations: each incoming link to your page gets a score that is roughly the rank of the source page divided by the number of outgoing links on that page, and then the rank of your page is roughly the sum of the scores of all incoming links.
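
Written out, the simultaneous equations described above look roughly like this. The symbols are labels assumed here, not necessarily TFA's notation: $r(P_i)$ is the rank of page $P_i$, $B_i$ is the set of pages linking to $P_i$, and $\ell_j$ is the number of outgoing links on page $P_j$.

```latex
r(P_i) = \sum_{P_j \in B_i} \frac{r(P_j)}{\ell_j}, \qquad i = 1, \dots, N
```

There is one equation per page, and every rank appears on both sides, which is why it has to be solved as a system rather than page by page.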

      Various fudge factors are introduced along the way. For example, if you break Google's rules about displaying the same content to bots as to humans, you can get slapped right down. More subtly, newly registered domains take a modest hit for a while. More nobody-knows-ly, Google's handling of redirects is unclear: information about exactly what adjustments are made is pretty scarce, and there's a lot of conjecture around. One thing that's pretty certain is that they penalise for duplicate content, which is why some webmasters do apparently unnecessary things like redirecting http://www.theircompany.com/ [theircompany.com] to http://theircompany.com/ [theircompany.com] or vice versa.

      So, if you want to get a page with a high rank yourself, then ideally you would need to get many established, highly-ranked pages to link to your page and no others. In your example, all those Geocities sites wouldn't help a lot, because (a) they'd have negligible rank themselves, and (b) they'd be penalised for being new and lose some of that negligible rank before they even started. Many times negligible is still negligible, and so would be your target page's rank. OTOH, get a few links from university sites, big news organisations and the like, and your rank will suddenly be way up there. Alternatively, get a grass-roots movement going where a gazillion individuals with small personal sites link to you, and the cumulative effect will kick in.
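
A back-of-the-envelope version of that comparison, using the rank-divided-by-outgoing-links rule from the first paragraph. The rank values and link counts are made-up numbers for illustration, and the sketch ignores the fudge factors (new-domain penalties and so on) entirely.

```python
def link_score(source_rank, outgoing_links):
    """Score one link passes on: the source's rank split evenly over its outgoing links."""
    return source_rank / outgoing_links


def rank_from_incoming(incoming):
    """Sum the scores of all incoming links, given (source_rank, outgoing_links) pairs."""
    return sum(link_score(rank, out) for rank, out in incoming)


# Hypothetical inputs, chosen only to show the shape of the argument.
geocities_farm = [(0.001, 1)] * 100   # 100 throwaway pages, each linking only to you
news_link = [(50.0, 20)]              # one link from a big site with 20 outgoing links

farm_total = rank_from_incoming(geocities_farm)
news_total = rank_from_incoming(news_link)
print("100 throwaway pages:", farm_total)
print("one big-site link:  ", news_total)
```

With these assumed numbers the hundred throwaway pages add up to roughly 0.1, against 2.5 from the single news link.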

      • Re: (Score:3, Interesting)

        "if you break Google's rules about displaying the same content to bots as to humans"

        I notice many sites that do that and don't get slapped down - esp subscription sites. And it seems Google doesn't cache those, so it's probably collusion.

        You see the keywords a
        • by oni (41625) on Wednesday December 06 2006, @08:39PM (#17140342) Homepage
          I notice many sites that do that and don't get slapped down - esp subscription sites.

          I wonder, if I changed my useragent to be whatever the googlebot reports itself to be - would I get by the registration screen on websites like the NYTimes??
          • Re: (Score:2)

            No, because they check the IP you're coming from as well now - they grew wise to user agent spoofing years ago.

            Google for the "bugmenot" Firefox extension.
            • Re: (Score:2)

              Googlebot doesn't use the same IP address all the time (several servers running Googlebot I'd imagine), so filtering based on IP addresses would be infeasible (at least according to Google).
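
One way around the changing-IP problem is to check the crawler's hostname rather than a fixed address list: reverse-resolve the IP, check the domain, then forward-resolve the name and confirm it maps back to the same IP. The sketch below assumes that approach. The function name, the domain list, and the 192.0.2.x test addresses are illustrative choices, not anything from the thread, and the lookups are injectable so the demo runs without touching the network.

```python
import socket

# Assumed list of domains that genuine Google crawler hostnames end in.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a crawler hostname that resolves back to ip."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # Forward-confirm: a spoofed PTR record will not map back to the caller's IP.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False


# Offline demo with made-up DNS records (192.0.2.0/24 is reserved for documentation).
FAKE_PTR = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "192.0.2.20": "scraper.example.net",
    "192.0.2.30": "crawl-fake.googlebot.com",
}
FAKE_A = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "crawl-fake.googlebot.com": ["192.0.2.99"],
}


def fake_reverse(addr):
    if addr not in FAKE_PTR:
        raise OSError("no PTR record")
    return FAKE_PTR[addr]


def fake_forward(host):
    return FAKE_A.get(host, [])


for ip in ("192.0.2.10", "192.0.2.20", "192.0.2.30", "192.0.2.40"):
    print(ip, is_verified_googlebot(ip, fake_reverse, fake_forward))
```

Only the first address passes: the second has the wrong domain, the third has a PTR record that does not map back to it, and the fourth has no PTR record at all. A spoofed User-Agent header on its own would not get past a check like this.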
          • Re: (Score:3, Informative)

            As pointed out, the Times site isn't fooled, but there are a good many out there that are. Sometimes when you do a Google search, one of the results will contain a keyword or two. However, when you click on the link, you'll find yourself redirect
        • Re: (Score:3, Interesting)

          Here is an email with associated response I received from Google on roughly this topic.

          This is a very general question. I'm creating a website. It is going to be a blogging platform. Obviously, the content of the site(s) is the most important thing. I've
      • I now have a nice basic understanding of Google's page ranking system. That's all I was asking for.
      • Re: (Score:2, Insightful)

        Thanks for the informative post. I have one question though. How does it help find the relevant information unless that information just happens to be on a popular page too? What I mean to say is that the idea behind grading/filtering systems like PageRank
    • Re: (Score:2)

      If those 100 geocities pages each have a PageRank of 0 (which they would if they aren't linked to from other high-ranking pages), their total contribution to your main page PageRank will be 0.
  • Does PageRank count? (Score:2, Interesting)

    As a self-proclaimed SEO expert - I honestly don't believe PageRank counts nearly as much as it did a few years ago! You'll find lots of PR5 sites ahead of PR9 sites in the SERPs!
    • Re:Does PageRank count? (Score:4, Insightful)

      by Trieuvan (789695) on Wednesday December 06 2006, @07:35PM (#17139726) Homepage
      The PageRank reported by the toolbar is really old. Google never wants to let you know the real number, or it would be easy to spam ...
      • Re: (Score:2, Funny)

        by Anonymous Coward
        Concentrate on SERPs, not PR, ASAP for SEO on the WWW

        I searched on Google but I cannot find what "on", "not", "for" and "the" mean...
  • by colourmyeyes (1028804) on Wednesday December 06 2006, @10:01PM (#17141000) Homepage
    I think we can get four or five tomorrow.
  • Great article.

    The character of online content is changing now rapidly. We used to be in an Internet where mostly only the site provider determined the content on the pages they served (/. being a notable, early exception). Now, with the rise of "2.0" sys
    • Re: (Score:2)

      I could not disagree more. Most of the sort of information people search for is not user generated: when did you last do a Google search for which a Slashdot comment was the appropriate answer?

      The only exception that I can think of (from my searches) are fo
      • Re: (Score:2)

        Sometimes you want to search through your old posts. Not all sites let you do that (slashdot does if you pay up, I think), and often forums are even norobots space.
      • Re: (Score:2)

        The meme that Google helps us find all the information is a huge marketing spin.

        Compared to "exactly the information you want, when and how you want it" - Google sucks. It is better than anything else now, but it still is not anywhere close to really solv
  • For a different, somewhat more technical, but more succinct discussion, Cleve Moler [of Matlab fame] wrote another view [mathworks.com] of this topic, about 5 years ago.

    The math is the same, of course, but two points of view may provide a greater sense of perspective. S

    • Re: (Score:2)

      Actually, I'm not so sure it's the largest matrix computation. Weather and nuclear bomb simulations are done with matrix algebra, and it wouldn't surprise me to discover that they do some months-long calculations with even larger matrices.
  • I've seen links on google searches that don't exist anymore but were ranked highly when they DID exist and still exist in the top 10 of the query. What happens to those? Do they stay at their ranking till they get overtaken by other more popular pages on
    • Pagerank (Score:5, Funny)

      by Skythe (921438) on Wednesday December 06 2006, @07:37PM (#17139752)
      Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods.

      They use a set of nested if-else statements
      *ducks*
    • Re:Pagerank is cool (Score:5, Interesting)

      by silentounce (1004459) on Wednesday December 06 2006, @08:02PM (#17139982) Homepage
      Interestingly enough, google thinks so, too. [google.com]

      Of course, yahoo has its own opinion. [yahoo.com]
       
      Although, altavista seems to almost agree. [altavista.com] Check the second non-advertised result.
       
      I do find this [google.com] amusing though. Third place, how humble.
       
      I didn't expect such interesting results. The site with the search term in its url was tops for av and yahoo, but not google. Yahoo ranked the wiki entry above google, but av reversed that decision, google of course thought itself was more important than the wiki. Google's own reference site was number one in its own search and near the top in the other two, but pagerank.net wasn't even in the top 10 for google's search. I'm not sure what conclusions can be drawn from all that, but it is definitely food for thought.
      • Re: (Score:2)

        I do find this amusing though. Third place, how humble.

        What I found interesting about that link was the description listed for google's entry:

        Google - 11:54pm
        Enables users to search the Web, Usenet, and images. Features include PageRank, caching and transl
    • Re: (Score:2)

      Why does that make PageRank broken? That's not the problem it tries to solve. Google might be broken for slavishly adhering to PageRank, but that's a different matter entirely...