Interesting Concepts in Search Engines 231
TheMatt writes "A new type of search algorithm is described at NSU. In a way, it is the next generation over Google. It works off the principle that most web pages link to pages that concern the same topic, forming communities of pages. Thus, for academics, this would be great as the engine could find the community of pages related to a certain subject. The article also points out this would be good as an actually useful content filter, compared to today's text-based ones."
google contest... (Score:1)
Re:google contest... (Score:1)
But.... (Score:3, Interesting)
Where would Slashdot fit in to this? There's links to everywhere!
Re:But.... (Score:5, Funny)
Slashdot must be the Kevin Bacon of the online world...
Explanation of the joke (Score:3, Informative)
On Super Bowl Sunday a commercial aired, featuring none other than Kevin Bacon at a retail store, trying to use a check to pay for his goods. The man behind the counter asked to see ID, but Bacon didn't have any on him. What now? Bacon runs around town gathering people (an extra he played in a movie with, a doctor, a priest, an attractive girl, and maybe one other guy?), who all had some ties to one another through the other 6 in the group. The attractive girl once dated the sales clerk in the store, so Kevin explains that they are "practically brothers," hence putting to good use the principle of 7 degrees of separation.
Therefore, the humor lies within.
Re:Explanation of the joke (Score:2)
Methinks you just didn't know about the inspiration for the commercial, which does surprise me.
Oracle of Bacon (Score:3, Interesting)
There is also a generic search [virginia.edu] that lets you combine any actor with any other actor. Unfortunately, I have forgotten who the best-connected actor was (the one whose average distance to all other actors is smallest). Anyone?
Re:Oracle of Bacon (Score:3, Informative)
*BZZT* please try again... the real origin: (Score:5, Interesting)
It's a strange and beautiful concept. It is fascinating to think that we are all in some way interrelated by only six people or that we have some connection to people even in the remotest part of the world.
The "small world" theory was first proposed by the eminent psychologist, Stanley Milgram. In 1967 he conducted a study where he gave 150 random people from Omaha, Nebraska and Wichita, Kansas a folder which contained a name and some personal data of a target person across the country. They were instructed to pass the document folder on to one friend that they felt would most likely know the target person.
To his surprise, the number of intermediary steps ranged from 2 to 10, with 5 being the most common number (where 6 came from is anyone's guess). What the study proved was how closely we are connected to seemingly disparate parts of the world. It also provided an explanation for why gossip, jokes, forwards, and even diseases could rapidly spread through a population.
Of course, the six people that connect you and the President aren't just any six people. The study showed that some people are more connected than others and act as "short cuts," or hubs which connect you to other people.
Take, for example, your connection with a doctor in Africa. Chances are the six childhood friends you grew up with aren't going to connect you to someone across the country, much less across the ocean. But let's say you meet someone in college who travels often, or is involved in the military or the Peace Corps. That one person who has traveled and has had contact with a myriad of other people will be your "short cut" to that doctor in Africa.
Likewise, say that you want to figure out your connection to a favorite Hollywood socialite. If you have a friend who is well connected in the Industry, that person will act as a bridge between your sphere of existence and the Hollywood circuit.
The Proof
Mathematicians have created models that lend support to the "small world" theory.
First, there is the Regular Network model where people are linked to only their closest neighbors. Imagine growing up in a cave and the only people you have contact with for the rest of your life are in that cave with you.
Then there is the Random Network model where people are randomly connected to other people regardless of distance, space, etc..
In the real world, human interconnectedness is a synthesis of these two models. We are intimately connected to the people in our immediate vicinity (Regular Network), but we are also connected to people from distant random places (Random Network) through such means as travel, college, and work. It is by our intermingling with different people that our connections increase.
You may meet someone in class that is from a different country, or whose father works in Hollywood, or whose mother owns a magazine. By this mingling and constant interaction your potential contact with the rest of the world increases exponentially.
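If you want to see the effect for yourself, here is a toy sketch (my own, using the networkx library's Watts-Strogatz small-world generator; the sizes are arbitrary) that starts from a Regular Network, adds a few random "short cuts," and watches the average path length collapse while local clustering stays high:

```python
# Toy illustration of the Regular / Random / small-world contrast described above.
# Not from the article; just the standard Watts-Strogatz construction.
import networkx as nx

n, k = 1000, 6  # 1000 people, each tied to their 6 nearest neighbours

regular = nx.connected_watts_strogatz_graph(n, k, p=0.0)      # pure Regular Network
small_world = nx.connected_watts_strogatz_graph(n, k, p=0.05) # a few random "short cuts"
random_net = nx.connected_watts_strogatz_graph(n, k, p=1.0)   # essentially a Random Network

for name, g in [("regular", regular), ("small world", small_world), ("random", random_net)]:
    print(f"{name:12s} avg. path length {nx.average_shortest_path_length(g):6.2f}  "
          f"clustering {nx.average_clustering(g):.2f}")
```

Even a handful of rewired links is enough to bring the average number of hops between two random people down dramatically.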
The Internet
The Small World theory is interesting in light of recent advances in communication technology--namely, the internet.
You can now instantly make contact with someone across the world through a chat room, email, or ICQ. In all of human history, it has never been easier to get in touch with someone across the globe.
The great irony, of course, is that although we are making contact with such a vast number of people, the quality of the contact is becoming terribly depersonalized. Our email, chat, and ICQ friends may number in the hundreds, but for the most part we'll only know them as a line of text skittering across the screen and a computer beep.
That's not to say that there is never a cross over from the virtual world of the internet to the "real" world. But a majority of the time, the closest you'll get to actually meeting your fellow e-buddies in the flesh are the pictures they email you (notice how everyone oddly looks like Pam Lee or Tom Cruise), or a series of smilies (meet my friend Sandra
Never in the history of mankind has there been so much technology to keep us connected, yet so little true connection. Everything from cellular phones, pagers, voice mail, and email was designed so that we would never be alone again. Human contact would only be a few convenient buttons away. But what seems to be happening is that the convenient buttons are superseding real people. Despite all this technology, we're still pretty much where we started, except now we have a motley crew of digital displays, flashing lights, and cutesy computer alerts to keep us company.
Don't get me wrong. The Internet Revolution is great and is making our lives easier. But as with ice cream, money, and sex -- too much of a good thing can be bad (money and sex are sometimes exceptions). What good are all the conveniences and promises of instant material gratification if you don't really live? The virtual world is good, but we shouldn't forsake the real world for it. The macabre image in The Matrix, where we are all plugged into computers unbeknownst to us, is a parable of what could be our future: a future where people never leave their homes and where we're all so dependent on computers that we couldn't walk outside without a pang of separation anxiety.
As we enter the new millennium, there is no doubt that we will be living increasingly wired existences. Perhaps Milgram's study will be revisited, and perhaps we will find that we're only separated by three degrees of email. But what good is that if the only "handshakes" going on are between our computers?
Russ [junebug.com]
Re:But.... (Score:2, Interesting)
Re:But.... (Score:2)
Re:But.... (Score:3, Interesting)
Re:But.... what about ad servers? (Score:2, Insightful)
I think it's a great concept that will make lesser-known content accessible to the average user. Instead of spending almost all our online time on a few huge sites (AOL, MSN, CNN, and a few other media giants), we can jump to a page on the same topic but with no advertising budget. But how do you rank and order the list of members? Traditional text search? Even if a community has only a few hundred members, few users will go to page five of the list to find a site. Admittedly, it's only a matter of time before you can pay to be listed at the top of the community membership, instead of a random listing.
And like all good ideas, this system wouldn't be free of abusers. People could always spam their page with links to major sites using single-pixel clear GIFs, thus making their page part of any community they wanted. So it becomes a process of "give me sites with links like this page, but not links like the following blackhole-listed pages." Useful for filtering content (for good or bad reasons).
Re:But.... what about ad servers? (Score:2, Insightful)
The issue of people creating mass pages of links could be resolved by "teaching" the engine to ignore sites that link to too many different threads, thus cutting out search engine directories, blogs, and other "topic-non specific" pages, or lumping them together as another category.
Sort of "If a page has x number of links to y number of topics then it can be considered for category z but if y is higher than the allowed number..."
Or something... Oh God. I need my caffeine.
-Sara
Re:But.... (Score:2, Insightful)
NSA, anyone? (Score:1)
And for intelligence services, a great way to more quickly compile open source intelligence [slashdot.org].
Content filter (Score:1)
Re:Content filter (Score:2)
Of course, it could also be used to keep you from seeing things they don't want you to see. Then again, most technologies carry that risk, I think.
Just like people surf (Score:3, Insightful)
Jason
Re:Just like people surf (Score:1)
Re:Just like people surf (Score:1)
Re:Just like people surf (Score:1)
I rarely do anything remotely resembling that. My usual routine is: go to a link, find nothing interesting, go back to Google, and repeat until I either find what I want or give up and reformulate my search. It is rare that a page that does not have what I want has links that seem likely to.
Problem. (Score:3, Interesting)
Re:Problem. (Score:1)
Re:Problem. (Score:2)
While I can't say for certain without looking at the exact algorithm, it sounds like good outgoing links would add zero weight to your site. I don't think it would be fooled at all by "links, buried...behind an image"
All it takes is one faked physics/education/sports/religion site to be linked into for all the spammy sites to be brought into the web.
Again, based on my understanding this would also fail. I believe it is measuring multiple inbound routes. While someone might be able to get a "faked" page ranked highly, it would act as a choke point and only add one "point" spread across the "spammy" sites, and wouldn't bring them in.
In order to fool this thing you would have to create several highly ranked pages and point them into the "spammy" cluster. It would be quite a bit of work to make enough sites good enough to draw in the needed valuable links. It would also be adding valued content to the web - a public service.
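A toy illustration of the choke-point idea (my own, using networkx; not the article's actual algorithm): count the node-disjoint inbound routes from a known-good page to a candidate page. A spam cluster reached only through one faked page has exactly one such route, no matter how many links the faked page hands out.

```python
import networkx as nx

web = nx.DiGraph()
# a small legitimate physics community with two independent routes physA -> physC
web.add_edges_from([("physA", "physB"), ("physB", "physC"),
                    ("physA", "physD"), ("physD", "physC")])
# one faked page linked from the community, funnelling into spam pages
web.add_edges_from([("physB", "fake"), ("fake", "spam1"), ("fake", "spam2")])

# node-disjoint directed routes between two pages
print(nx.node_connectivity(web, "physA", "physC"))  # 2 -- genuinely inside the community
print(nx.node_connectivity(web, "physA", "spam1"))  # 1 -- everything funnels through "fake"
```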
If someone does enough work and contributes enough public service to pull it off, you could say he earned whatever he gets out of it.
-
Re:Problem. (Score:1)
Re:Problem. (Score:2)
It's not hard to find popular sites using this methodology, and in this case "popular" is probably as close as you'll ever come to defining a metric for what makes a website a good website. It all depends on what physics (or whatever) sites people link to, which hopefully will be related to how good those sites are. Note that all of the link counts need to take into account some sense of "community" -- e.g. the magazines Popular Science and Science serve very different communities. So link counts need to be taken relative to other sites "around" them, or some such.
And in the end, this solves a lot of things. For instance, the algorithms will be independent of written human language. They'll also be more robust when classifying pages that use graphics in place of scientific typesetting (LaTeX) constructs that aren't available in HTML (yet). This is important.
-Paul Komarek
I feel sorry for... (Score:1, Funny)
Re:I feel sorry for... (Score:2)
Give me a sec...Ohhh...Ahhh...Mmmmm...
Excuse me, I need a cigarette and a tissue.
Interesting (Score:1)
Bad Idea - What Happens to Science? (Score:3, Interesting)
Currently, I can search Google and find things on the destruction of balsam fir in Newfoundland by Alces alces (moose). With this type of search engine, the journals wouldn't be listed, because they themselves don't link to anywhere (most of them are straight magazine-to-HTML conversions or PDF).
It'd be difficult as hell to find pertinent information above the level of "3y3 4m Johnny, And Dis 1s Mai W3bsite, 4nd H3r3 Ar3 Mai LinkZorz!"
Re:Bad Idea - What Happens to Science? (Score:3, Interesting)
Plus the fact that groups mainly link to others doing the same work. So, I can start at one page and soon get an idea of the cluster science community, for example.
Re:Bad Idea - What Happens to Science? (Score:2)
As I understand it, the exact opposite would be true.
If you type "Balsam" into google all the top links are related to balsam products, Balsam Lake and Balsam Beach. Instead imagine you could click on the science community, and then perhaps on the biologist sub-community. Now when you type in "Balsam" you get a list of sites that are the most referenced by biologists.
This can be particularly useful when different groups use a word in very different ways. Inflation means something very different to physicists than it does to the general public.
-
DO NOT MESS WITH GOOGLE! :-) (Score:2, Interesting)
http://www.google.com/search?as_lq=www.slashdot
Meaningful category information (Score:1)
This could mean that browsing by category will become more and more useful in the future.
Browser integration (Score:2, Interesting)
Sparse on details and a working demo (Score:2)
Anyway, this would be a much more interesting submission if there had been links to how the algorithm deals with the computational complexity, or a site we could Slashdot.
Winton
Re:Sparse on details and a working demo (Score:3, Informative)
A postscript document [nec.com] detailing his research.
Also, if you're a member of IEEE Computing, you can see his publication.
Online bookmarks. (Score:2, Insightful)
My fav site on the internet.
A list of unrelated pages all linked from one spot.
I wonder if there are any of those left, and how the search engine would cope with them.
And another point. The article states that new categories can be found. How is the "crawler" going to define the names of the new categories? I feel that the article was too short on details. I mean, as a concept it's great. But more information would be cool.
The demise of another search engine? (Score:3, Interesting)
As much as I (and all of you) love Google, I wonder whether their moral high ground [google.com] approach to search results would survive if they did not already have the world's traffic searching through their site.
Search engines come and go. When Google has to struggle for its existence against the Next Big Thing, how many of you really believe they won't sell out in order to keep themselves running, in effect putting the last nail in their own coffin?
We shall see.
Re:The demise of another search engine? (Score:2)
I'm not saying they wouldn't "sell out" (insofar as a business selling a product can sell out), but it seems that their text-based ads work well there, and that they also get a good amount of revenue selling their search tech.
Re:The demise of another search engine? (Score:2)
Google's competitive advantage is their reputation. At this stage, any attempt at sellout would backfire badly: anyone willing to pay them money for a better listing will want to stop paying when no one visits Google any more.
it's been said before (Score:2)
Google.com is popular because of its high moral ground, which it has had since the beginning.
I personally switched to Google because:
* it gave me more accurate results
* it has a fast loading page
* it had an honest results policy
* it's not a parasite site, running on the coat tails of others (eg. metacrawler)
The reasons I continue to use Google are:
* as above
* it has inoffensive (to me) advertising
* it has a toolbar that saves me time on searching
* it's as good as a spellchecker
* it can display pdf files in html
* it can search pdf files
* google cache
Isn't this just a subset of Google (Score:2, Insightful)
It looks, of course, for the words from your search, but also at the words close to them (so if your search string is 3 words and it finds them next to each other, the page gets a higher score than if the words are scattered through the text). It also looks at the links: pages about the same topic that link to your page give a "vote" for your page. This looks a lot like the "new" search algorithm. Or is the new one the inverse? Instead of giving a vote, a page receives votes if it links to pages about the same topic.
The one thing I'm thinking is that they miss a lot of pages just because they do not contain links.
Anyways, there isn't a lot I haven't found on Google yet (thanks to all its search engines: regular, open directory, images, news...)
Look out! (Score:2)
Joke
I wish! (Score:5, Interesting)
Good websites link to similar sites -- academic websites link to similar sites and sources. This type of search engine would be killer on Internet 2. But on our wonderful, chaotic, porn- and paid-link-filled Internet 1, it's useless. Spider MSN and you'll get a circular web leading to homestore, ms.com, Freedom to Innovate, ZDNet and Slate. Spider Sun and never find a single page in common with their close competitors like IBM.
What happens when sites get associated with their ads? Search on Microsoft Windows and grab a lot of casino and porn links...because a "security" site covered in porn banners was spidered and came up with top relevancy.
Now, combined with a click-to-rate-usefulness engine like Google, this could be an interesting novelty. But it'll never be the simple hands off site hunter the big Goo has become.
Re:I wish! (Score:2, Interesting)
Well, when you are not dealing with commercial sites (or even when you are, sometimes), a lot of people do link to related sites.
Google (and most other search engines) link to tons of other search engines, Art Galleries link to other Art Galleries, and gaming sites link to other gaming sites.
A lot of other areas are inter-linking too. Sometimes when I am trying to find something I get caught in a loop and have to start over again from a different starting point.
For instance when finding out information on LED lights.
I exhausted most of the results that came in for my original search term, so I am going to change it to {fibre,fiber} optics LED
Tada, a whole new batch of sites to read through.
This Could Actually Help Enhance Accuracy (Score:5, Interesting)
Problem with this: "most" websites do not link to sites with similar content. Most websites link to "partner" sites that have nothing in common with them -- after all, who links to a competitor?
Good websites link to similar sites -- academic websites link to similar sites and sources.
Combine the algorithm described in this article with Google's approach (or some other contextual approach to determining relevance) and you not only have a way of identifying "communities," you have a way of easily identifying "marketdroid mazes of worthless links" as well.
Since the content of most marketdroid sites is usually next to worthless, the hits for a given search could be ordered accordingly: sites, and groups of sites, that clearly form communities related to the topic you're interested in at the top; single websites not yet linked to anywhere in the middle; and marketdroid "partner" sites at the very bottom.
This would actually produce better, more useful results than either approach alone.
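As a rough sketch of that ordering (my own toy code, with made-up tiers and scores, not anything the article proposes):

```python
# Toy sketch: community members first, unaffiliated pages in the middle,
# "partner maze" pages last; ties broken by a plain text-relevance score.
results = [
    {"url": "phys-dept.example.edu",          "text_score": 0.71, "tier": "community"},
    {"url": "lone-homepage.example.net",      "text_score": 0.80, "tier": "unlinked"},
    {"url": "buy-physics-stuff.example.com",  "text_score": 0.90, "tier": "partner-maze"},
]

tier_rank = {"community": 0, "unlinked": 1, "partner-maze": 2}
ordered = sorted(results, key=lambda r: (tier_rank[r["tier"]], -r["text_score"]))
for r in ordered:
    print(r["url"])
```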
Re:This Could Actually Help Enhance Accuracy (Score:3, Interesting)
Re:I wish! (Score:2)
In addition, when I search on Google, more often than not I am not looking for huge commercial sites. I am looking for smaller pages, sometimes written and hosted by individuals, that contain information on the subject I am searching for.
These types of pages fall outside your argument entirely. They are not big enough to warrant ideas like "competition" and "sponsors." It is just some Joe Public, writing a web page about something he is interested in and housing it in the 5 megs of web space his ISP gives him.
Re:I wish! (Score:2, Interesting)
I guess this [sun.com] page doesn't list a bunch of Sun competitors, like IBM, BEA, and CA, then. Even competitors thrive off of partnering with each other.
-rp
Re:I wish! (Score:2)
group A) Authoritative sites... sites other sites link to, but which don't necessarily link to a lot of other sites
group B) Link sites... sites that cobble together links to useful authoritative sites based on subject matter.
In your algorithm you keep track of a particular site's ranking in both groups A and B iteratively.
Yer ranking in group A gets weighted by the quality of the sites in group B that point to you.
Yer ranking in group B gets weighted by how many quality sites you link to in group A.
Iterate the process... and you know what you have... you have an eigenvector problem... and what you get in the end is an eigenvector of highly ranked group B sites which span subsets of group A based on subject matter.
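For the curious, a bare-bones version of that iteration (my own toy numpy code in the spirit of the well-known hubs-and-authorities scheme, not anyone's production algorithm):

```python
# Minimal hub/authority iteration: repeatedly push scores across the link matrix
# and renormalize; the vectors converge to the principal eigenvectors.
import numpy as np

# adjacency: A[i, j] = 1 if page i links to page j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

hubs = np.ones(A.shape[0])
auth = np.ones(A.shape[0])
for _ in range(50):
    auth = A.T @ hubs                 # a good authority is pointed to by good hubs
    hubs = A @ auth                   # a good hub points to good authorities
    auth /= np.linalg.norm(auth)
    hubs /= np.linalg.norm(hubs)

print("authority scores:", np.round(auth, 3))
print("hub scores:      ", np.round(hubs, 3))
```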
The canonical example is the word jaguar.
Run this algorithm on a search engine and you will get at least 3 very distinct collections...
the animal, the car, and the game system... primarily.
The problem is you have to ITERATE for it to be particularly useful... and that costs CPU time... I don't know if a search engine is gonna want to really invest that time.
Frankly I'm surprised any of this is patentable, since I saw this at an academic talk like 6 years ago.
-jef
Re:I wish! (Score:2)
At Internet World 2000 there were close to thirty companies offering new search engines, everything from voice controlled searching to variations on a miningco theme.
They're all dead. Not necessarily because their ideas weren't good... the best of these were eaten by Google and AltaVista and Lycos and are still around. They're dead because they offered nothing new to the searching public -- no better results, no improved searching. Nothing but good ideas, listing no pages, with a buggy interface. The searching public has no tolerance for buggy code or crummy results. This tool WILL be a nest of crummy links until they figure out a clever way to omit them... and by that time, we'll have already given up.
Useful as second try (Score:1)
If a search engine comes out of this, I think I'll first google for whatever I want, and if I can't find it/come out with too little info, I'd expand my search into this "communi-search"
Web Rings (Score:1)
I do believe it is a good idea but the person that thinks that all relevant pages link to more relevant pages has been taking more than harmless smoke breaks.
Some issues on linking. (Score:5, Informative)
I won't pretend to know all the inner workings of Google's search engine technology. But I believe that Google DOES care about other links from site A. This falls into the hub and authority model, which is defined recursively. A hub is a site that links to a lot of authority sites. An authority site is a site that is linked to by a lot of hubs. Basically, authorities provide the content, and hubs provide links to the content. In this example, B is an authority site, and A is a hub.
The way the ranking works, is that if B is linked to by a large number of quality hub sites, then it has a respectively large quality rating. Likewise, if a hub links to a large quantity of high quality authority sites, then its quality will also be ranked highly as a result.
This also allows Google to provide links to sites even if the search terms don't match the content of that site. A hub that links to a lot of sites about cars will relate cars to ALL the links regardless if the word "car" is included on the site that is provided.
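Something like this toy sketch (my own guess at the mechanism, not Google's actual implementation) is what I mean: terms found on a hub page get credited to the pages it links to, so a page can match "car" even if the word never appears on it.

```python
# Toy term-propagation index: terms on a hub are indexed for its link targets too.
from collections import defaultdict

hub_pages = {
    "car-links.example.com": {
        "terms": {"car", "auto", "review"},
        "links": ["fast-sedans.example.com", "engine-swaps.example.net"],
    },
}

index = defaultdict(set)  # term -> set of pages that should match it
for hub, data in hub_pages.items():
    for term in data["terms"]:
        index[term].add(hub)
        for target in data["links"]:
            index[term].add(target)   # propagate the hub's terms to its targets

print(sorted(index["car"]))  # includes pages that never mention "car" themselves
```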
Of course, I'm not THAT familiar with Google. It's possible I'm full of bunk. But I'm pretty sure it works this way to some extent and that Google does pay attention to the hub-based links.
-Restil
More Info on Extracting Macroscopic Information (Score:2, Informative)
http://www.cindoc.csic.es/cybermetrics/articles
http://www.scit.wlv.ac.uk/~cm1993/papers/2001_E
Efficient Identification of Web Communities (Score:2, Informative)
I feel bad for Disney... (Score:5, Funny)
Not that I would know...
Re:I feel bad for Disney... (Score:3, Funny)
Bothered by the yahoo link in position 1? (Score:2)
DON'T feel bad for Disney!!! (Score:2)
you can feel either bad or good... (Score:2)
Clustering (Score:5, Informative)
In a recent interview in c't magazine [heise.de], a Google employee (Urs Hölzle) said, when asked about clustering, that they had tried that a long time ago, but they never got it to work successfully. He mentioned two problems:
- the algorithms they came up with delivered about 20 percent junk links for almost all topics
- it's hard to find the right categories and give them correct names, esp. for very generic queries
Of course, just because Google didn't get it to work properly doesn't mean nobody else can. But it's harder than it looks, and it's been known for quite a while.
Difficulty of Classification (Score:3, Interesting)
This sounds like a useful idea and could give us better directory systems, but its utility would be limited. I'm sure there will be people poking holes in this algorithm [corante.com] in no time. Slashdot [slashdot.org] has an odd mix of subjects loosely tied together; News for Nerds [slashdot.org] is not a very strict group. Classification and grouping is a hard problem. There is no clear black and white; there are many shades of gray.
Interesting, but not as interesting as Google Bombing [corante.com]
--
Guilty [microsoft.com]
Routers (Score:3, Interesting)
We will really know what is out there on the net when Cisco includes a search function in their routers. Distributed searching. Access to over 90% of the world's data. Anonymous usage statistics. Person X searched for data (a) and spent the most time at www.example.xyz. Cross-reference it all and include hooks into TCP/IP v.x for cataloging search, usage and content statistics.
A website might contain information about leftover wiffle-waffles. That website sends that same data 1000 times an hour to end users. I want the router to pipe up "1000 unique page views for leftover wiffle-waffles"; somewhere else a router says "500 unique page views for leftover wiffle-waffles". So when I do my search, I get 2 hits, most popular and second most popular.
Why incorporate it into TCP/IP? What good is moving all that data if it is just a morass of chaos? Let that which transports it also serve to catalog it. Currently, user data's content is transparent to TCP/IP. But if I wanted my data to be found, I could enclose tags that would allow the router to sniff my data, ensuring my data was included in the next real-time search.
Exploiting search engines that rank popularity (Score:5, Interesting)
The basic gist is that Google flags pages as more important (or higher relevance) if they have more links pointing to them... so the CoS makes thousands of spam pages that point at its main pages. Google sees the thousands of links, assigns the main CoS pages a high relevance, and thus they're the first to come up in any Scientology-related search.
The moral being, for any new cool search technique devised to help fetch more relevant content, there'll be someone out there looking for a way to defeat it.
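Here's a toy illustration of why the tactic works (a simplified PageRank-style power iteration of my own, not Google's real ranking): a farm of pages all pointing at one target inflates its score relative to an otherwise identical "honest" page.

```python
# Simplified PageRank power iteration on a tiny graph with a spam farm.
import numpy as np

n_farm = 50
pages = ["target", "honest"] + [f"spam{i}" for i in range(n_farm)]
idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

links = [("honest", "target"), ("target", "honest")]
links += [(f"spam{i}", "target") for i in range(n_farm)]   # the farm

# column-stochastic link matrix (columns = source pages)
M = np.zeros((n, n))
out = np.zeros(n)
for src, dst in links:
    M[idx[dst], idx[src]] += 1.0
    out[idx[src]] += 1.0
for j in range(n):
    M[:, j] = M[:, j] / out[j] if out[j] else 1.0 / n   # dangling pages jump uniformly

d = 0.85                                 # damping factor
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * (M @ rank)

print("target:", round(rank[idx["target"]], 4), " honest:", round(rank[idx["honest"]], 4))
```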
Re:Exploiting search engines that rank popularity (Score:5, Informative)
Since this was first brought up a few days ago, the Scientology volunteer editor at the Open Directory Project, an upstream content provider for Google, was fired.
Re:Exploiting search engines that rank popularity (Score:2)
Yes it was [slashdot.org], coward. If you're going to whore with this method [slashdot.org], at least stay signed in to defend your tactics.
Re:Exploiting search engines that rank popularity (Score:2)
Hrmm. What to say?
1) I have no need to whore, being at 50 karma for, oh, I dunno, the past half year or so.
2) The AC post wasn't mine, which is more or less impossible to prove, but I thought I'd state it for the record.
3) I hadn't read the comment you referenced, but had been referred to it from a newsgroup I visit elsewhere.
Any other unbiased assumptions you made that I should address?
Many web-marketting businesses based on this (Score:3, Interesting)
There are a number of those "Get More Hits For Your Website Cheap!" sites which try to do so by getting member sites to download an html file which contains links to most of their members, and then have you link this from your own site.
Much like a pyramid scheme, as new members join they get the same file with links to your site, thereby increasing the number of sites with links to you and possibly raising your position in search engines.
Re:Exploiting search engines that rank popularity (Score:2)
Only a few thousands links more needed...
Re:Exploiting search engines that rank popularity (Score:2)
Not always true. (Score:2)
So, this only holds true with focused sites. Using links, but then checking the links based on text, would be useful, but not just links alone.
Degrees of separation? (Score:2, Interesting)
Hmmm, so it's a "Web"? (Score:3, Interesting)
So it's actually working on the basis of webs of related sites - not a novel concept, but useful.
I suspect that some of the commercial knowledge management tools have been doing something much like this for some time, and TheBrain.com [thebrain.com] has had a product to manually build this kind of network of clusters for some time. The key thing about this is that with web indexing/cataloging the information needed to do the automatic linking is available.
TheBrain.com seems to have a working demo of using it for the Web at WebBrain.com [webbrain.com], based on the Open Directory Project [dmoz.org]. It's not a great example, because of display limitations that don't really let you see more than one cluster of information at a time, but it's one example of the general concept. Once you dig down in an area you can see how it shows links between related categories as well.
(Note: the demo above says it requires Java 1.1 and IE 4.01 or Netscape 4.07+; to bypass that test try here [webbrain.com]. It seems to work fine in Netscape 6.2, and will probably be OK in Mozilla if the JRE is available.)
why not a 3d search engine? (Score:5, Interesting)
E.g. search for linux apache router
linux is one axis, apache is another, and router is a third. If a page is relevant in that context then it will show up close to the origin, while "linux apache donuts" will resolve close to the origin on the X and Y axes but be way out on the Z axis.
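A toy version of that ranking (my own sketch, with made-up "irrelevance" coordinates, 0 meaning perfectly on topic for that axis):

```python
# Treat each query term as an axis and rank pages by distance from the origin.
import math

query = ("linux", "apache", "router")

# page -> irrelevance along each query-term axis (lower is better); values invented
pages = {
    "howto.example.org/linux-apache-router": (0.1, 0.2, 0.1),
    "recipes.example.com/donuts":            (0.1, 0.3, 0.9),  # near zero on X/Y, far out on Z
    "random.example.net":                    (0.8, 0.9, 0.9),
}

def distance_from_origin(coords):
    return math.sqrt(sum(c * c for c in coords))

for url, coords in sorted(pages.items(), key=lambda kv: distance_from_origin(kv[1])):
    print(f"{distance_from_origin(coords):.2f}  {url}")
```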
Um, because your monitor isn't 3D? (Score:2)
Re:why not a 3d search engine? (Score:2)
One of the nice things about Google and other current search engines is that you can easily look at the context in which the search term occurs and determine if the link is relevant. I think this would be harder to do in 3D. It would be nice if you were able to weight your search terms (scale of 1-10?) on Google. That might accomplish the same goal as what you want without the 3d niftyness.
Re:why not a 3d search engine? (Score:2)
What I'd really like to have is the ability to specify that terms need to be near one another within the document. Sometimes using quotes to delimit a phrase works, but often the words can appear in various orders and various ways, but if they're all close together there's a very high chance that the text is discussing the stuff I'm interested in.
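Something like this toy proximity scorer (my own sketch) is what I have in mind: find the smallest window of words containing all the terms, in any order, and let a smaller window mean a better match.

```python
# Smallest window of words containing every query term, regardless of order.
def smallest_window(words, terms):
    terms = set(terms)
    best = None
    positions = [i for i, w in enumerate(words) if w in terms]
    for i, start in enumerate(positions):
        seen = set()
        for end in positions[i:]:
            seen.add(words[end])
            if seen == terms:
                span = end - start + 1
                best = span if best is None else min(best, span)
                break
    return best  # None if some term is missing entirely

doc = "the apache web server on linux can act as a reverse proxy router".split()
print(smallest_window(doc, {"linux", "apache", "router"}))  # smaller = terms closer together
```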
Re:why not a 3d search engine? (Score:2)
Summary of book on Web Usability [webreference.com]
Why 3-D navigation is bad (People aren't frogs) [useit.com]
Not to mention, how do you display such things to a reader who is blind or has any other type of disability... the list goes on. Beautiful idea, but not very good in practice.
Re:why not a 3d search engine? (Score:2)
Color could make a 4th, 5th and 6th dimension: R, G, and B components can also be added, giving you a 6D search engine... not only are the ones closest to the center what you are after, but also the ones that are white. (Invert it for the politically correct people.)
This is not new work (Score:5, Interesting)
Intuitively this seems reasonable, and in practice this is often the case when there is no conflict of interest for a document to link to another document (as in the case of researchers linking to other works in their field). Yet often this is not the case when there *are* conflicts of interest (a pro-life site will probably not link to a pro-choice site; BMW will probably not link to Honda or any of its other competitors). Therefore, since the truth of the hypothesis that "similar documents link to each other" is not clear, I worked to test this very idea.
To do this I used the Fish Search, Shark Search, and other more advanced "targeted crawling algorithms" that take the connectivity of documents into account (as is discussed in the Nature article). These algorithms often go further than just using the link relationship: they also take the text of the link itself, as well as the text surrounding the link, into account when choosing which links are the "best ones" to follow in order to discover a community of related documents in a reasonable amount of time (you'd have to crawl through a lot of documents if documents have as few as, say, 6 links per page on average! Choosing good links to follow is crucial for timely discovery of communities).
The conclusion of my thesis was that it is (unfortunately) still not clear whether the hypothesis holds. I only did this work on a small subset of web documents (about 1/4 million pages), so perhaps a better conclusion would be reached by using a larger set of documents (adding more documents can potentially add more links between documents in a collection).
What I did discover, however, was that if document communities do exist, you have a statistically good chance of discovering a large subset of the documents in the community by starting from any document within the community and crawling to a depth of no more than 6 links from the starting point. (This turns out to be useful to know, so that your crawler knows a bound on the depth it has to crawl from any starting point.) Moreover, if you have a mechanism for obtaining backlinks (i.e. the documents that link to the current document), you can discover even more of the community...
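For the curious, here is a much-simplified best-first frontier in the spirit of those targeted crawlers (my own sketch, not the thesis code; fetch_links is a placeholder the caller would have to supply):

```python
# Best-first crawl: links whose anchor/context text overlaps the topic terms are
# followed first, and crawling stops a fixed number of links from the seed.
import heapq

def context_score(anchor_text, topic_terms):
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / (len(topic_terms) or 1)

def crawl(seed, fetch_links, topic_terms, max_depth=6, max_pages=100):
    """fetch_links(url) -> list of (target_url, anchor_text); supplied by the caller."""
    frontier = [(-1.0, 0, seed)]          # (negated score, depth, url)
    seen, community = {seed}, []
    while frontier and len(community) < max_pages:
        neg_score, depth, url = heapq.heappop(frontier)
        community.append(url)
        if depth >= max_depth:
            continue
        for target, anchor in fetch_links(url):
            if target not in seen:
                seen.add(target)
                score = context_score(anchor, topic_terms)
                heapq.heappush(frontier, (-score, depth + 1, target))
    return community
```

The depth bound of 6 mirrors the empirical finding above that most of a community sits within six links of any member.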
No, this is not the shiny new thing... (Score:2, Informative)
The idea predates Google, it probably predates you. They did it in print, way back when.
A goal. (Score:2)
I would imagine that using modern search engine techniques, one would be able to determine what commercial pages "generally" look like, and what informational pages "generally" look like, and categorize appropriately. If you used a learning neural network, you could even accept user ratings on specific search results and use that to fine-tune the algorithm.
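As a rough sketch of that idea (mine, with toy training data; a real system would use far richer features and far more examples):

```python
# Learn what "commercial" pages generally look like from labelled examples,
# then score new pages. Training texts and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_pages = [
    "buy now free shipping add to cart special offer limited time",
    "order today best prices on speakers checkout secure payment",
    "the moose alces alces has severely browsed balsam fir stands in newfoundland",
    "lecture notes on eigenvectors and the power iteration method",
]
labels = ["commercial", "commercial", "informational", "informational"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_pages)
clf = LogisticRegression().fit(X, labels)

new_page = ["huge discounts on balsam fir seedlings order online"]
print(clf.predict(vec.transform(new_page))[0])
```

User ratings on search results could feed back in as additional labelled examples to fine-tune the model over time.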
Damn! (Score:2)
Just what we need (Score:1)
I guess when we get bored though we can search for search engines and watch the system grind to a halt.
Privacy Violation (Score:1)
and the opposite twist.. (Score:2)
An interesting technique I worked on many years ago (as a hobby, online databases seem to be a habit of mine) was to take a large collection of pages and effectively 're-link' them based on their content, adding a section at the top of the page giving links to perhaps the top 20 other pages that 'seemed' to be similar by sharing a large number of the more unusual words in the page.
This works surprisingly well after some tuning, and can actually be generated reasonably efficiently; however, I never tried it on more than about 100k pages of not totally dissimilar material.
This technique, which specifically ignores the links in the page, can often help when you find a page not quite on the subject you are looking for. It is very interesting to see what "key words" can link pages; often you would never think of using them as search words, but they are very obvious after the fact.
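A toy reconstruction of the technique (my own, not the original hobby code), using TF-IDF weighting so the "more unusual" shared words count most:

```python
# For each page, rank the other pages that share its more unusual words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = {
    "led-review.html":   "white led torch review lumens runtime batteries",
    "led-museum.html":   "online led museum rare gallium nitride led history",
    "router-howto.html": "configuring a linux router with iptables and dhcp",
}
names = list(pages)
tfidf = TfidfVectorizer().fit_transform([pages[n] for n in names])
sim = cosine_similarity(tfidf)

top_n = 2  # the original used "top 20"; 2 is enough for three pages
for i, name in enumerate(names):
    related = sorted(((sim[i, j], names[j]) for j in range(len(names)) if j != i), reverse=True)
    print(name, "->", [r for _, r in related[:top_n]])
```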
When using links to cluster pages, you also need some form of supporting analysis of the content, to at least try to tell if the pages are at all related; otherwise the clusters grow too large to be useful.
"next generation over Google" my foot (Score:4, Insightful)
It reminds me of all the graphics chip makers, computer chip makers, heck, even zeosync with their incredible breakthroughs. 90% of the time, when anyone takes a hard look at it it turns out to be a waste of time and money.
So, before proclaiming this the "next generation over Google", why not check to make sure Google hasn't already thought of it and discarded the idea. Or that it won't lead to stupid circular clusters; 90% of the time I'm not interested in partner sites, but competitor sites. Is Slashdot in the Microsoft cluster?
And above all, stop the judgement calls like "this is the next generation" unless you've got some special insight and qualification to make that call.
This is not a new idea (Score:3, Informative)
Clever does Google one better by separating the results of searches into "hubs" and content. Hubs are sites with lots of links on a particular subject. Content sites are the highly rated sites linked to by the hubs.
I thought it was a very interesting concept and I am surprised that it was not commercialized. Of course, IBM is in the business of buying banner ads rather than selling them. They could always do like /. and OSDN and mostly run ads for their own stuff, though....
Re:This is not a new idea (Score:3, Informative)
I know because I have read about both technologies. I discussed the merits of Clever v. Google a few years ago with classmates that were taking the class at Stanford that spawned Google. That is how I know.
End of Rant
There is an excellent article [sciam.com] on Clever that appeared in Scientific American a few years ago. It was linked to from the page I originally posted. You should check it out. Clever returns results divided into the categories of "hubs" and "authorities". I have never noticed Google doing that.
Here is an excellent summary from the article on the differences between Clever and Google:
Google and Clever have two main differences. First, the former assigns initial rankings and retains them independently of any queries, whereas the latter assembles a different root set for each search term and then prioritizes those pages in the context of that particular query. Consequently, Google's approach enables faster response. Second, Google's basic philosophy is to look only in the forward direction, from link to link. In contrast, Clever also looks backward from an authoritative page to see what locations are pointing there. In this sense, Clever takes advantage of the sociological phenomenon that humans are innately motivated to create hublike content expressing their expertise on specific topics.
Of course Google has tweaked their method since this article was written, however it has not become Clever.
Re:This is not a new idea (Score:2)
I simply assumed that Google used a similar algorithm, based on their description of it. Thank you for the link, it was informative.
Unfavorable to E-commerce (Score:2, Interesting)
Amazon.com is an example of this: I bought a pair of speakers from them several months ago, and yet every time I go there, they helpfully inform me that they have these great new speakers on sale! Buy now! I suppose it works to recommend similar books and CDs, but when someone buys speakers, they usually stop being in the market for speakers after that.
Anyway, I don't know why no-one has thought of making an e-commerce-only search engine. I think there's a clear distinction between those two types of searching that warrants a separate engine. Sometimes you want to buy stuff, and sometimes you just want information. When you are doing one, the other just gets in the way, and disguising advertising as content like AOL/MSN/AltaVista do just discourages you from using their services. Obviously, web-based businesses have a long way to go before they actually realize, "Oh, Internet users don't like to be tricked! Maybe if we were straight-forward with consumers they'd be more trusting of us."
What about for none western cultures? (Score:3, Interesting)
The culture that exists therein is definitely quite different.
Japanese sites are even MORE self-referencing than American sites. This trend has taken off on American sites too, though, in the form of Cliques, which themselves tend to lie outside of the sites that many of us
Seriously though, in Japan it seems that sites actually have others ask permission to link to them! (As an aside, whenever that topic is brought up here on
This obviously creates a VERY different social structure that heavily alters the dissemination of information, not to mention the way that sites are linked together.
Here in the states (or any other culture that has pretty much a free linking policy) it is common to say "oh yah, and for more info go to this site over here and also this site here has some good information and and and . . . . "
Anybody who reads www.dansdata.com knows how he (uh, Dan obviously.
(such as a LED light review having links to the Online LED museum)
In a culture where linking is not so free, I would think that there would be more of a trend toward keeping a lot of the information in-house, so to speak, and thus at the very least the biases that the search engine uses to judge the relevance of links would have to be altered a bit.
Links would have to be given a higher individual weight, since there would be a larger chance of them being on topic.
Credit where credit is due (Score:3, Interesting)
Disclaimer: I happen to know him, but this is not biased.
Question: (Score:2)
Question: How many dimensions for the web?? (Score:2)
Later, in college, I tried to model a website as a physical system with the links acting like springs. You basically make up some formula causing different web pages to repulse each other and then make links give an attractive "force" that grows with the "distance" between web sites.
This gives you a system of equations that you can solve for an equilibrium point, giving an "information distance" between web sites. This will tend to group websites on similar subjects together, because they tend to link to each other... but then again, who cares how close related sites are to each other; it should still be possible to get a cool picture with this data.
But the stopping point I came up against was how to represent these information distances as a space. I couldn't figure out how to calculate the dimensionality of the space. Was it 2-D, 3-D, or 400-D?
Here's an example of why this is a problem: take four points that are all 1 meter from each other. In 3-D these points form the corners of a tetrahedron, but you cannot draw these points in 2 dimensions. If I have N equidistant points, I need a space with at least N-1 dimensions.
So how many web pages can be at the same "information distance" from each other? How many dimensions are needed to map the web this way?
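One way to get a handle on it (a toy sketch of my own, not anything I actually ran back then): classical multidimensional scaling turns a matrix of pairwise "information distances" into coordinates, and the number of significantly positive eigenvalues of the double-centred matrix tells you how many dimensions the embedding really needs. The four-equidistant-points example comes out as 3-dimensional, as expected.

```python
# Classical MDS: count positive eigenvalues to estimate the embedding dimension.
import numpy as np

# four mutually equidistant "pages" (the tetrahedron example)
D = np.ones((4, 4)) - np.eye(4)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
B = -0.5 * J @ (D ** 2) @ J               # double-centred squared distances
eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]

print(np.round(eigvals, 3))
print("dimensions needed:", int(np.sum(eigvals > 1e-9)))
```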
Maybe this question only interests me, but I find it fascinating.
I am currently... (Score:2)
At the point where I am reading, Johnson is discussing how emergence is closely tied to feedback, and without feedback, emergence doesn't occur. Thus, cities and businesses tend to be emergent (is that a word?) entities, while the web typically isn't. Because links on the web tend to be "one way", and information isn't communicated back, he argues that emergence can't take place.
Someone else has made a post here discussing how on Japanese web sites, it is expected that before you link to a site, you ask the operator of the site permission. The poster then says that for American sites, it is more of a "sprinkle willy-nilly" (my words) type reference, without regard for the operators of the sites. However, at one time, netiquette was indeed to ask the operators to "swap links" - I remember doing this quite often. But I think what happened is that when businesses and the "ignorant masses" came online, less link-swapping occurred because many times you would email the admin of the machine, and never get a response. The feedback link was broken.
Johnson uses this argument to further his statement that because of this, the Web won't be emergent. But will the Japanese web spawn emergence?
Johnson then goes on to talk about weblogs (though he doesn't use that term), referencing
These kinds of search engine technologies might help the web turn around and allow it to become emergent. I don't know if such a thing would bode well for humanity, but it would be very interesting to see such a thing in practice (I highly recommend the book I referenced above if you are into this kind of thing - it makes an excellent sequel of sorts to the book "Out of Control")...
Re:Isn't that (Score:1)
The only thing that this "revolutionary" engine does is show the references.
Note that this may not even be useful. I can link to a good site, but that doesn't automatically make my site good.