PageRank-Type Algorithm From the 1940s Discovered
KentuckyFC writes "The PageRank algorithm (pdf) behind Google's success was developed by Sergey Brin and Larry Page in 1998. It famously judges a page to be important if it is linked to by other important pages. This circular definition is the basis of an iterative mechanism for ranking pages. Now a paper tracing the history of iterative ranking algorithms describes a number of earlier examples. It discusses the famous HITS algorithm for ranking web pages as hubs and authorities developed by Jon Kleinberg a few years before PageRank. It also discusses various approaches from the 1960s and 70s for ranking individuals and journals based on the importance of those that endorse them. But the real surprise is the discovery of a PageRank-type algorithm for ranking sectors of an economy based on the importance of the sectors that supply them, a technique that was developed by the Harvard economist Wassily Leontief in 1941."
Good advice for all developers (Score:5, Insightful)
Well, this is actually pretty good advice for any developer: don't reinvent the wheel. Look around, search for what's been done before, and adapt it to suit your needs. Of course, as a last resort, you can design something new once you've done your homework and made sure nothing that already exists can be reused.
Over the years, I have seen an amazing amount of work done in vain, either because it yielded poor results or because something that did the same thing better already existed anyway.
Don't get me wrong here: once you have made sure that nothing already existing suits your needs or can be reused, it is fine to innovate and create genuinely new stuff. Just don't get caught trying to reinvent the wheel unless you reinvent it better ;-)
Also, an exception to that principle can be made for trivial tasks that are really quick to implement, where searching for an existing solution might cost more than implementing it yourself. But be really careful applying that exception; it is an open door that sometimes leads right back to reinventing the wheel ;-))
Re:Patent? (Score:1, Insightful)
CHECKLIST:
Is Google a megacorp? Check.
Did Google threaten to move to India if Obama raises corporate taxes? Check.
Does Google spy on users and collect data? Check.
Did Google receive monetary assistance from the Taxpayer's Public Treasury? Not sure.
Okay. I'll let them slide and not try to invalidate their pagerank patent, but that likely won't stop Microsoft from making the attempt.
indeed it is not new (Score:1, Insightful)
In many different fields, people count the number of times their research papers have been referenced and quoted, with different "points" depending on where the work was cited. E.g., for someone working in medicine, a reference in The Lancet counts for more than a reference in local-hillbilly-news.
I believe there are sources that collect this information. Sorry for being so vague, but I can't remember the specifics. (But hey, isn't that just what we do in Slashdot comments?)
Maybe this is just the elitist snob in me, but I don't feel that the latest American Idol singer is really a thousand times better than Billie Holiday just because a thousand times more people listen to him than to her.
You are measuring the wrong thing. Google isn't measuring who is 'better'; it is trying to measure which page is more interesting to a web surfer, and pages tend to be more popular because they are interesting to more people. Thus if you do a search for Brittany, you are more likely to find Britney Spears' fan club than an academic study of why her beautiful innocence was so popular (and oh yes, was it beautiful!). People who are looking for more specific academic things learn to add extra search terms to their query, like "Brittany analysis" or "Why is Brittany popular?"
The way to solve this problem better is for Google to get to know you and your preferences: if Google knows that you are mainly interested in academic sorts of things, then it can automatically return that sort of thing when you do a search for Brittany. This is convenient, but drives certain people crazy because of privacy issues.
Re:Good advice for all developers (Score:3, Insightful)
Reinventing the wheel is a phrase that means to duplicate a basic method that has long since been accepted and even taken for granted. [wikipedia.org]
OK, so how is Brin and Page's development of PageRank reinventing the wheel, when the prior art was an obscure economics paper published at Harvard in 1941 and only rediscovered in 2010?
Would Page and Brin have gotten there faster if they'd rooted through Harvard's economics library in the hopes that the best-yet algorithm for search results ranking would be there, somewhere? Was the Harvard paper "long accepted and taken for granted" or was it "forgotten and ignored"? Is PageRank a "duplication of a basic method"?
Personally, I think Google got there quicker by reinventing the wheel, if that's what this was. In my opinion, if something is only recognized as a reinvention of the wheel after the fact, it is ipso facto not a reinvention of the wheel in the sense described in the Wikipedia article you cited.
Re:Good advice for all developers (Score:3, Insightful)
I can see that you must be young enough never to have used the search engines you list, if you think you would have been able to find anything useful with them.
Well, the results were there, somewhere among all the adverts.
Re:Patent? (Score:1, Insightful)
Patent?
I think the way Google maintains its search superiority has more to do with massive banks of machines and keeping its algorithms secret than with sending lawyers after anyone. The PageRank algorithm is little more than a useful application of Markov chains... hardly seems patentable. (Of course, RSA doesn't seem like it should have been patentable either...)
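The Markov-chain view is easy to demonstrate, for what it's worth. Here's a minimal Python sketch, assuming a made-up three-page link graph (the page names, step count, and the 0.85 damping factor are illustrative choices, not anything from Google): simulate the "random surfer" chain and read rankings off the visit frequencies.

```python
import random

# PageRank as a Markov chain: a "random surfer" follows a random outgoing
# link with probability d, and teleports to a uniformly random page
# otherwise.  Long-run visit frequencies approximate the chain's
# stationary distribution, i.e. the PageRank scores.
# The three-page graph below is a toy example, not real data.
random.seed(0)
d = 0.85
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

visits = {p: 0 for p in pages}
page = "A"
steps = 200_000
for _ in range(steps):
    visits[page] += 1
    if random.random() < d and links[page]:
        page = random.choice(links[page])   # follow a random link
    else:
        page = random.choice(pages)         # teleport step

freq = {p: visits[p] / steps for p in pages}
print({p: round(f, 3) for p, f in freq.items()})
```

With enough steps the frequencies settle near the exact stationary distribution; the point is only that nothing beyond a plain Markov chain is involved.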
Re:linearity (Score:5, Insightful)
You understand neither how the parent post is using the word "linear" nor the PageRank algorithm itself. You can rewrite the eigenproblem at the heart of PageRank as the solution to a linear system, but very few people do. Moreover, this is not the correct intuition to employ to understand what's going on: there are no "massive matrix inversions" here, just a simple iterative algorithm for extracting the dominant eigenvector of a matrix.
Furthermore, you've got it exactly backwards regarding the "connection" between PageRank and light transfer. Since the Markov process used to model a web surfer in the PageRank paper is operating on a discrete domain with an easily-calculable transition function, the stationary distribution (or ranking) can be determined exactly. In rendering, you have a continuous problem for which Markov chain Monte Carlo techniques provide one of the most efficient ways to approximate the solution...but you have to actually simulate a Markov chain to get it (see, for instance, Veach's seminal paper on Metropolis Light Transport). Computing PageRank is an "easy" problem, by comparison.
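To illustrate the "simple iterative algorithm" point: extracting the dominant eigenvector is just power iteration. A minimal Python sketch, on a made-up three-page graph (the page names, iteration count, and 0.85 damping factor are illustrative assumptions):

```python
# Power iteration for a PageRank-style ranking: each step redistributes
# every page's score along its outgoing links, with damping factor d
# modelling a surfer who sometimes jumps to a random page.  Repeated
# application converges to the dominant eigenvector of the transition
# matrix -- no matrix inversion anywhere.
d = 0.85
links = {            # hypothetical toy web: page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = sorted(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}       # start from the uniform distribution

for _ in range(100):                     # iterate until (approximately) stationary
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        share = d * rank[p] / len(outs)  # each outlink gets an equal share
        for q in outs:
            new[q] += share
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

Because the domain is discrete and the transition function is known exactly, this converges geometrically; that is precisely why computing PageRank is "easy" compared with simulating a continuous Markov chain in rendering.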
Re:Good advice for all developers (Score:2, Insightful)
Remember the internet was a LOT smaller in 1997, so while my preferred engine at the time (AltaVista) may not have been as good as Google is today, there were also a lot fewer websites to search through.
>>>I can see that you must be young enough never to have used the search engines you list
I'm familiar with the slashdot practice of not RTFA (reading the frakking article), but when did people stop reading user names? You think I'm young? My first computer was a Commodore and my second an Amiga. I was making music, producing primitive videos, and "surfing" the internet before the web even existed.
Re:Good advice for all developers (Score:3, Insightful)
Wow, +5 Pedantic.
Reinventing the wheel implies that they set out to find a measure of importance. So they would have had to decide to make a search engine, have it produce relative results, decide that relative importance is the key, and then go searching for ways to find relative importance. There's a major gap there, a leap in thinking that simply wasn't present at the time. How important a page is indicates what order it should be in results. That makes sense, but it wasn't obvious at the time. Previous results were based on things like the number of times the page mentions your word, or what order the pages were added. People tried to figure out: what is that quality which makes a page more relevant than others? And they failed.
The key was settling on Importance, or we could call it Charisma. When you mention the name "Brad", do more people think of the guy you work with first or Brad Pitt? Brad Pitt is more relevant (technically relevant to more people), but why? More people know of him, and more people speak of him. More importantly, more important people speak of him.
Scientific papers have been measured for their impact factor this way (though that might be a spurious correlation: http://www.physorg.com/news165950992.html [physorg.com] "... papers published early in a field receive citations essentially regardless of content because they are the only game in town.") If they had looked to science to see what makes something influential, they would have seen the same concept. But they didn't know they needed to find "influential", just "relevant".
The breakthrough Google made was deciding on a quality which made things more relevant, which is roughly equivalent to notoriety: not just the number of references to a page, but the weight of those references in relation to who makes them. That's where link farming sprouted, and they had to figure out a way to cancel that effect out.
How many people know this page, as opposed to that page? And then they had to figure out a way to find the pages, process the data, calculate notoriety, serve it up quickly, and create a revenue stream from all of that. It's not about whether you're going to arrive more quickly by reinventing the wheel. Often you can, especially if it's in a language where you don't know all of the built-in functions. Under certain circumstances, you can write a sorted linked list faster than you can find out how the language already implements one.
It's about whether you will arrive better, at a better solution in other words. The point was to look elsewhere for implementation, but the part you missed was you have to have inspiration to know where to look.
Google wanted to get something to lots of people. They might have had stuff in baskets. They could reinvent the wheel to make the baskets easier to transport, or they could build up a farm, attracting farm hands and their families, and gradually build up a town, making the people come to Google instead. Once you know you need a transportation solution, it's easier to copy an existing idea. It just so happens that once you decide the solution to your problem is relative importance, there's a description of how to do it in a book from the 1940s.