Programming Science IT Technology

"Evolved" Caches Could Speed the Net 195

SpaceDilbert writes "According to New Scientist, evolutionary algorithms could make many network caches twice as efficient. This article describes a study carried out by a US researcher and two German academics, who "evolved" algorithms to determine what data should be held at a cache and for how long."
  • Caching (Score:5, Funny)

    by Zorilla ( 791636 ) on Tuesday July 06, 2004 @10:49AM (#9621489)
    According to New Scientist, evolutionary algorithms could make many network caches twice as efficient.

    That's easy, just cache 4 boobs at a time instead of two
    • Re:Caching (Score:2, Funny)

      by Anonymous Coward
      Wouldn't it be more efficient to cache 1 boob and display it x times? Sort of a boob pooling.
  • Algorithms (Score:5, Interesting)

    by th1ckasabr1ck ( 752151 ) on Tuesday July 06, 2004 @10:51AM (#9621523)
    Pablo Funes of US company Icosystem and Jürgen Branke and Frederik Theil of the University of Karlsruhe in Germany used "genetic algorithms", which mimic Darwinian evolution, to develop strategies for internet servers to use when caching data.

    It would be interesting to see exactly which algorithms they are talking about here. I wouldn't be surprised if they drew some ideas from garbage collection algorithms also.

    • by spektr ( 466069 ) on Tuesday July 06, 2004 @11:27AM (#9621943)
      I wouldn't be surprised if they drew some ideas from garbage collection algorithms also.

      I suppose you're the clever garbage man from the Dilbert cartoons. Because as an engineer I don't really get the connection between garbage collection and caching algorithms. Now that we have covered the first two frames of the cartoon - what's in the third frame where the garbage man makes me feel ashamed by explaining some complicated concept of engineering?
      • Re:Algorithms (Score:5, Insightful)

        by ahem ( 174666 ) on Tuesday July 06, 2004 @11:44AM (#9622142) Homepage Journal
        ...as an engineer I don't really get the connection between garbage collection and caching algorithms. ...

        It seems to make sense to me, if only because the two are complementary problems. Caching is figuring out what stuff is valuable so you keep it around. GC is figuring out what stuff is valuable so you can throw away the rest. Kind of like in probability, where it's sometimes easier to figure out the likelihood of something not happening and then subtract from 1 to get how likely it is to happen.

        • Re:Algorithms (Score:5, Insightful)

          by ezzzD55J ( 697465 ) <slashdot5@scum.org> on Tuesday July 06, 2004 @12:15PM (#9622518) Homepage
          Humbug, the big difference is: in GC there's perfect knowledge about what you want to throw away (unreferenced or only circularly referenced objects), and the problem is how to find those objects efficiently. In caching, the whole problem is figuring out which objects to keep; beyond that, efficiency is not the issue.
        • not even close (Score:5, Insightful)

          by Anonymous Coward on Tuesday July 06, 2004 @12:25PM (#9622620)
          Garbage collection is required to be correct, but is allowed to keep extra stuff around as memory permits. Everything that must be kept can be calculated deterministically without future knowledge, and the same holds true for everything that can be discarded. Approximation merely allows the what-can-be-discarded calculation to progress faster. In garbage collection it is not correct to throw out something that will not be used in the future unless it is also unreachable.

          For caching there is no requirement of correctness. Performance is improved when things that will be used in the future are kept. There is no correctness requirement for keeping things around, and indeed best performance often requires discarding things that will be used in the future. In addition, there is no way of determining what will be used in the future without knowledge of the future. The correctness requirements of a cache are related to security (don't cache sensitive information) and staleness (don't cache stuff that is too old), but even those may be relaxed rather than strict requirements.

          Fundamentally the two problems are very different, and the algorithms that are successful in the two cases will likely be very different as well.
          • In garbage collection it is not correct to throw out something that will not be used in the future unless it is also unreachable.

            It is correct to do so, but in practice, you cannot be certain that the data will not be used, unless it's unreachable.

        • "GC is figuring out what stuff is valuable so you can throw away the rest."

          I thought Garbage Collection was shuffling stuff around with the goal of consolidating all the stuff you don't want into large blocks (collecting the garbage) so the space can be reallocated efficiently.

          The smart garbage man is on vacation.

        • You seem to have the terms confused. GC refers to algorithms where you know for a fact that some data is not needed ever again. Caching refers to algorithms where you *might* need the data again, but you can also recreate it if it is thrown away. Many programs will do this with their own memory and data structures, and certainly the algorithms used there are relevant to this, but it is still called "caching", not GC.
        • Re:Algorithms (Score:2, Informative)

          by aiyo ( 653781 )
          GC and caching are done very differently. In a reference-counted scheme, each piece of memory has a counter associated with it: when you have two pointers to that memory, the value 2 is stored somewhere with it, and when you drop a reference the counter is decremented. Every once in a while the GC will come along and free memory with a counter of 0. If you drop every pointer to that memory, it is effectively lost to the program, although it is not actually freed until later.
    • It would be interesting to see exactly which algorithms they are talking about here. I wouldn't be surprised if they drew some ideas from garbage collection algorithms also.

      I'm not sure how relevant that would be, as garbage collection in general is way more radical than what a cache needs -- a cache operates best when full.

      The best algorithm for improving a cache is, of course, to increase the size of the cache. :-)

      Regards,
      --
      *Art

    • Re:Algorithms (Score:4, Insightful)

      by laudney ( 749337 ) <br260@@@cam...ac...uk> on Tuesday July 06, 2004 @12:58PM (#9622976) Homepage
      Garbage collection problems are far easier than caching problems in networks. There are many reasons. The most important one is that in garbage collection, 'references' are basically 'pointers'. An object that's not pointed to by any other objects is considered garbage. And don't forget, in garbage collection, you can even 'stop the world'!

      By contrast, in caching, 'references' are 'client requests', which are unpredictable and highly variable. Furthermore, on busy networks, there is seldom a moment for you to 'stop the world' and 'evolve' for a while!

      In the story, people use network simulators and randomly generated requests. Since network simulators are notoriously simplistic and inaccurate, and caching is heavily influenced by real-life workloads, I'm interested to know whether their algorithm is applicable in reality.
  • by YankeeInExile ( 577704 ) * on Tuesday July 06, 2004 @10:52AM (#9621526) Homepage Journal

    Well, I have found one flaw in the methodology:

    The starting population of algorithms was tested on the simulator using randomly generated requests.
    I would think this would breed any affinity for locality of reference out of the caching system.

    One of the things I did each morning when I was running a cybercafé was "prime" the Squid cache by running a little script that did a wget -p on all of the URLs in the portal page, and a few sites that were not. And it definitely did improve performance for most users.

    One of my unrealized dreams would be some sort of speculative-fetch algorithm for Squid that would basically do a breadth-first fetch on a page while the user was busy reading it.

    Of course, in my not-so-humble opinion, the biggest problem with any caching system is the population of websites that, through either malice or incompetence, develop content that is cache-hostile and call it "experience enhanced".
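
    As a rough illustration of those two ideas (priming and speculative breadth-first fetching), a toy Python version might look like the following. Everything here is an assumption for the example - the portal URL, the localhost:3128 Squid address, and the helper names - and it is only a sketch, not the poster's actual wget -p script:

        import urllib.request
        from html.parser import HTMLParser
        from urllib.parse import urljoin

        PROXY = {"http": "http://localhost:3128"}   # assumed local Squid proxy
        PORTAL = "http://portal.example/"            # hypothetical portal page

        class LinkCollector(HTMLParser):
            """Collect href targets from anchor tags."""
            def __init__(self):
                super().__init__()
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def fetch(url):
            """Pull a page through the proxy so Squid caches it as a side effect."""
            opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
            return opener.open(url, timeout=10).read()

        def prime(start_url, depth=1):
            """Fetch the page, then every link on it, down to the given depth."""
            page = fetch(start_url).decode("utf-8", errors="replace")
            if depth <= 0:
                return
            collector = LinkCollector()
            collector.feed(page)
            for link in collector.links:
                try:
                    prime(urljoin(start_url, link), depth - 1)
                except (OSError, ValueError):
                    pass   # dead or non-http links shouldn't stop the warm-up run

        if __name__ == "__main__":
            prime(PORTAL, depth=1)

    Run each morning (or from a browser hook while the user reads the current page), the same loop doubles as the speculative fetcher described above.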

    • by Short Circuit ( 52384 ) <mikemol@gmail.com> on Tuesday July 06, 2004 @10:55AM (#9621575) Homepage Journal
      What you get out of the breeder shouldn't be what you put into production.

      Assuming you can still read the code, you're still going to want to put in common-sense improvements that the GA's didn't discover.
      • So, do you posit that the purpose of the GA is just to give human programmers insight into avenues they might not have otherwise considered?

      • What you get out of the breeder shouldn't be what you put into production. Assuming you can still read the code, you're still going to want to put in common-sense improvements that the GA's didn't discover.

        Could this be the long awaited answer to harmonizing Creation and Evolution :-)

    • "Priming the cache" and "doing a breadth-first fetch on a page" are both things that create *more* network traffic on the off-chance that it might save some number of microseconds for the user. It's basically a Tragedy of the Commons situation. Everyone would be better off if no one pre-fetched links, but any given person is better off if they don't cooperate with that global model. So everyone just grabs what they can get and everyone is worse off.
      • by droleary ( 47999 ) on Tuesday July 06, 2004 @12:08PM (#9622420) Homepage

        "Priming the cache" and "doing a breadth-first fetch on a page" are both things that create *more* network traffic on the off-chance that it might save some number of microseconds for the user.

        More traffic isn't necessarily bad. While what you say is true, you fail to note that user-initiated traffic is done in bursts. Just like your CPU is idle >95% of the time, so is your network connection. So all users benefit, both in real and perceived performance, when there is a steady 100% utilization.

        So everyone just grabs what they can get and everyone is worse off.

        Again true, but naive. What would be better is if there were a mechanism to prioritize the pre-fetch cache. Every page has one link that is pretty much the most likely next page. Then a secondary link, and so on. Ideally, a site owner should be able to put that priority list somewhere in the page such that a user agent can start getting it after the current page has loaded and is being read. Otherwise, maybe the user agent can favor new content (i.e., compare this load of Slashdot with the one done five minutes ago and grab the links in the diff). That's a far cry from a mad grab, and would probably benefit all parties involved.
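
        The "grab the links in the diff" idea is simple enough to sketch in Python. This is only an illustration of the comment above, not any real user agent: the crude regex link extraction and the function names are invented for the example.

            import re
            import urllib.request

            HREF = re.compile(r'href="([^"]+)"')   # crude link extraction, fine for a sketch

            def links_on(url):
                page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
                return set(HREF.findall(page))

            def prefetch_new_links(url, previous_links):
                """Fetch only the links that appeared since the last visit."""
                current = links_on(url)
                for link in current - previous_links:   # the diff the poster describes
                    try:
                        urllib.request.urlopen(link, timeout=10).read()   # warms the cache
                    except (OSError, ValueError):
                        pass   # relative or dead links: skipped in this toy version
                return current

            # usage: remember the link set between visits
            seen = links_on("http://slashdot.org/")
            # ... five minutes later ...
            seen = prefetch_new_links("http://slashdot.org/", seen)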

        • There's the link tag for html headers, which specifies the next page, previous page, and things like that. It's not prioritized, but if it's used, it can give a good idea of what the user will do.
          • There's the link tag for html headers, which specifies the next page, previous page, and things like that.

            Ah, I knew it was too good an idea to be all my own! For those not aware of how to use the link element for that, I found this page [w3.org] as well as this one [w3.org]. I don't know how widely it is used by site designers, and I can't say I know of any web browsers that use it, but I would say it is definitely something a proxy server should be taking advantage of. Its use is even stated directly in the second

  • "An important consideration is what incentives there are for caching information for other users."

    So the ISPs will have to upgrade their software/hardware to make this work?
    Will that be worth it?

    In my opinion the article was a bit light on details :(

    • Wouldn't this lessen the bandwidth requirements for an ISP? By caching files locally, your local ISP could host more users with less bandwidth.

      Sounds like it could save an ISP a ~lot~ of money.

  • Algorithms (Score:3, Funny)

    by bchernicoff ( 788760 ) on Tuesday July 06, 2004 @10:52AM (#9621530)
    I wonder if they evolved logic to counter the slashdot effect. 1. Scan slashdot.org for new stories every five minutes. 2. Scan new story for links. 3. Cash those pages.
    • by bunyip ( 17018 ) on Tuesday July 06, 2004 @10:56AM (#9621580)
      I wonder if they evolved logic to counter the slashdot effect. 1. Scan slashdot.org for new stories every five minutes. 2. Scan new story for links. 3. Cache those pages.

      That would be the "suicide algorithm". As the server goes up in flames from the Slashdot-effect it brought upon itself, it would become the first cyber recipient of a Darwin Award.

      Alan.
    • 1. Scan slashdot.org for new stories every five minutes.
      2. Scan new story for links.
      3. Cash [sic] those pages.

      4. PROFIT!
  • by garcia ( 6573 ) * on Tuesday July 06, 2004 @10:53AM (#9621539)
    From what I can tell, a good majority of the traffic that my machine receives is worm traffic. Would these genetic routines be set up to disregard those as cache data? If that's the case, would they be set up to just block that data?

    That alone would save me quite a bit of traffic as people on my subnet hit me constantly with their infected machines.

    66.41.161.120 hit my machine 57 different times (that isn't individual requests, that's total times).
  • LRU Rules (Score:5, Informative)

    by Mirk ( 184717 ) <slashdotNO@SPAMmiketaylor.org.uk> on Tuesday July 06, 2004 @10:53AM (#9621540) Homepage
    There's a good reason why LRU caching (least recently used) is so widespread, and that is that it's very, very hard to come up with a sophisticated algorithm that outperforms this very naive one.

    For the uninitiated: elements are added to an LRU cache until it fills up; thereafter, whenever a new element is added, space is made for it by throwing away the least recently used one. Note: least recently used, not least recently added (i.e. the oldest), since an element that was cached long ago may be used all the time, and so be well worth its place in the cache. For example, consider the company-logo image that your browser caches when you visit a new site and that is embedded in every page on that site. However old it gets, it's likely to continue to be used while you're on the site. As soon as you move to another site, it gradually shuffles its way down the deck until it falls off the bottom - which is precisely what you want.
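
    For the code-inclined, a minimal LRU cache is only a few lines of Python. The class name and capacity parameter below are just made up for illustration; production caches such as Squid's are far more involved:

        from collections import OrderedDict

        class LRUCache:
            """Evict the least recently *used* entry once the cache is full."""
            def __init__(self, capacity):
                self.capacity = capacity
                self.entries = OrderedDict()   # least recently used entry sits at the front

            def get(self, key):
                if key not in self.entries:
                    return None                    # cache miss
                self.entries.move_to_end(key)      # mark as most recently used
                return self.entries[key]

            def put(self, key, value):
                if key in self.entries:
                    self.entries.move_to_end(key)
                self.entries[key] = value
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)   # drop the least recently used

    The company-logo example above falls out naturally: every page view calls get() on the logo, so it keeps moving back to the "recently used" end and is never evicted until you stop visiting the site.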

    • Re:LRU Rules (Score:4, Informative)

      by rossjudson ( 97786 ) on Tuesday July 06, 2004 @11:12AM (#9621772) Homepage
      ARC caches fairly handily outperform LRU. When LRU is appropriate they adapt towards it. When frequency is important they adapt towards that.
    • Re:LRU Rules (Score:3, Interesting)

      by arth1 ( 260657 )

      There's a good reason why LRU caching (least recently used) is so widespread, and that is that it's very very hard to come up with a sophisticated algorithm that outperforms this very naive one.

      For small cache sizes with one or a few fairly similar users, this is true. For larger cache sizes with a large number of users, it's almost always better to expire based on both LRU (Least Recently Used) and LFR (Least Frequently Requested). You may want to cache all those sucky Christmas links until next year
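
      A sketch of what such an LRU/LFR blend might look like, in Python. The scoring formula, the class name, and the frequency_weight parameter are invented for illustration and aren't taken from any real proxy:

          import time

          class HybridCache:
              """Evict by a blended score: old *and* rarely requested entries go first."""
              def __init__(self, capacity, frequency_weight=60.0):
                  self.capacity = capacity
                  self.frequency_weight = frequency_weight   # seconds of "credit" per hit
                  self.entries = {}   # key -> [value, last_used, hit_count]

              def get(self, key):
                  entry = self.entries.get(key)
                  if entry is None:
                      return None
                  entry[1] = time.time()
                  entry[2] += 1
                  return entry[0]

              def put(self, key, value):
                  if key not in self.entries and len(self.entries) >= self.capacity:
                      self._evict()
                  self.entries[key] = [value, time.time(), 1]

              def _evict(self):
                  now = time.time()
                  def badness(key):
                      _, last_used, hits = self.entries[key]
                      # idle time, discounted by how often the entry has been requested
                      return (now - last_used) - self.frequency_weight * hits
                  del self.entries[max(self.entries, key=badness)]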

  • Not DNS (Score:4, Interesting)

    by artlu ( 265391 ) <artlu@3.14artlu.net minus pi> on Tuesday July 06, 2004 @10:54AM (#9621561) Homepage Journal
    It already takes forever for DNS changes to propagate through every network, which can be extremely frustrating when you have a high-bandwidth domain. There definitely needs to be some optimization on the DNS front.

    GroupShares Inc. [groupshares.com]
    • Re:Not DNS (Score:3, Informative)

      If I'm not mistaken, the DNS RFC takes into account the fact that domain records will be cached. That's why the records have expiration and refresh times on them.
        Except when your zones reach some moron ISP that forces them to a higher TTL than you specified, which unfortunately is being done here by the big player.
    • It already takes forever for DNS changes to propagate through every network, which can be extremely frustrating when you have a high-bandwidth domain. There definitely needs to be some optimization on the DNS front.

      If your zones are taking too long to propagate, maybe you need to Read The Fine Manual [isc.org] ... especially the parts about Refresh and Time To Live.

  • Great (Score:2, Funny)

    by CptChipJew ( 301983 ) *
    So now Slashdot is going to get cached for a long time and I'll never get first post again :(
  • bittorrent tie in? (Score:4, Interesting)

    by Chuck Bucket ( 142633 ) on Tuesday July 06, 2004 @10:56AM (#9621578) Homepage Journal
    I've always wondered if something along the lines of caching complemented with a BitTorrent type of scheme couldn't help speed up the internet. That way bits would be mirrored all over, and a server could pull them in faster since more servers did less work each.

    Just something I'm thinking about today, well, that and the Kerry/Edwards pairing.

    PCBRwer342$#
    • I've been wondering this as well. I think it would be cool for all of the internet/browsers to act like a big p2p setup, where the original webserver is the canonical source for info, and it keeps a history of its visitors with ways to figure out the best "route" or network neighbor.

      However, in practice, I'm not sure how useful something like this would be. Most websites today have so much dynamic info that it simply would not work. For example, /. would only be able to cache its graphics.

      This would be c
      • Yes, so there'd be a daemon on servers that could be a beacon to others, and only have it catalog certain directories that contained 'static' content that didn't change very often. Then the main server would only be required to generate the dynamic content, while the static would be provided by others.

        I'm thinking of almost a distcc way of handling this, where the server can dole out tasks to any/many other servers that it finds that are 'willing' to help out.

        PCB%$s
    • I understand the parent suggestion as a dynamic list of current mirrors at every site (or page).

      IMO peer-to-peer caching/mirroring would work for big chunks of data where the overhead of finding a cache (guesstimate a few to a few tens of round-trips among peers and the original site) would be insignificant compared to the actual download time. For short pages it would be better just to return the page as opposed to the current list of mirrors.

      On the other hand if the page contained (in the HTTP metadat

  • Next Gen Networking? (Score:5, Interesting)

    by arieswind ( 789699 ) * on Tuesday July 06, 2004 @10:56AM (#9621584) Homepage
    From the article:
    He [Pablo Funes] suggests networks might in the future be designed to work out who deserves the most help for themselves. "Sophisticated network behaviours might implement rules for reciprocity and trust," he says. "And conversely, for not cooperating with others who try to abuse our resources."

    The future of network security? Imagine the next computer virus outbreak: Every network in the world could recognize the virus type activity and allocate them lesser or zero resources, maybe sending them a "Virus detected, please run antivirus software or contact your IT Department" notice, and detecting outside attacks from viruses and automatically flagging them as unsafe, and not give much (or any) attention to traffic from or to that site.
    • Every network in the world could recognize the virus type activity and allocate them lesser or zero resources, maybe sending them a "Virus detected, please run antivirus software or contact your IT Department" notice

      Could also work if it communicated "$IP being DDOSed, please trace back and isolate those who are doing this"?

      Basically, I think we really should be thinking about how to build an immune system for a network and what minimal behaviour from the nodes would achieve it.

  • Stuff like this is what keeps computing interesting, I think. This technique can be used by almost every business in a situation in which optimization might be necessary. I assume that covers most tasks someone would want to accomplish. By utilizing these algorithms to explore search spaces previously thought to be too large I predict we'll experience some pretty explosive advances in the near future in areas from farming to software development. But that's kind of obvious.

  • by Dezer ( 545178 ) on Tuesday July 06, 2004 @11:02AM (#9621660)
    Actually, I do genetic algorithm / genetic programming research at the University of Michigan. It's unlikely that these guys are using genetic algorithms to develop a new algorithm; rather, they are using an existing algorithm and *tuning* the associated parameters with a GA. Given a list of parameters, GAs work by finding the best combination of them. As a result, the settings could be constantly tweaked (say on a daily/hourly basis) and different servers could still have different regional settings. My only problem with the concept is that it still depends on the tuning of pre-existing algorithms... but still - the results they share (2x improvement) are encouraging.
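
    To make the parent's point concrete, here is a toy Python sketch of tuning two eviction-policy weights with a GA against a synthetic request trace. Note that this illustrates parameter tuning as described in the comment above, not the actual method in the paper (see the reply below), and the trace, fitness function, and all the numbers are invented:

        import random

        def hit_rate(weights, trace, capacity=50):
            """Replay a request trace through a tiny cache whose eviction score
            is a weighted mix of recency and frequency; return the hit rate."""
            w_recency, w_frequency = weights
            cache = {}   # key -> (last_used_step, hit_count)
            hits = 0
            for step, key in enumerate(trace):
                if key in cache:
                    hits += 1
                    cache[key] = (step, cache[key][1] + 1)
                    continue
                if len(cache) >= capacity:   # evict the lowest-scoring entry
                    victim = min(cache, key=lambda k: w_recency * cache[k][0]
                                                      + w_frequency * cache[k][1])
                    del cache[victim]
                cache[key] = (step, 1)
            return hits / len(trace)

        def evolve(trace, generations=15, population_size=20):
            """Toy GA: keep the fittest half, refill with mutated recombinations."""
            population = [(random.random(), random.random()) for _ in range(population_size)]
            for _ in range(generations):
                ranked = sorted(population, key=lambda w: hit_rate(w, trace), reverse=True)
                parents = ranked[:population_size // 2]
                children = []
                while len(parents) + len(children) < population_size:
                    a, b = random.sample(parents, 2)
                    children.append((a[0] + random.gauss(0, 0.1),    # crossover of a and b,
                                     b[1] + random.gauss(0, 0.1)))   # plus Gaussian mutation
                population = parents + children
            return max(population, key=lambda w: hit_rate(w, trace))

        # Zipf-ish synthetic trace: a few popular objects plus a long tail of one-offs
        trace = [random.randint(0, 20) if random.random() < 0.7 else random.randint(0, 5000)
                 for _ in range(2000)]
        print("best (recency, frequency) weights:", evolve(trace))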
    • You seem to have missed out on GECCO in Seattle this year, where the guys in the article presented their paper. They nearly got first prize in the GP track, but were beaten by some researchers doing quantum programming using GP. What it boils down to is that they evolved the priority function used in caching, i.e., the thing that makes the caching tick. So it was not a parameter-optimization method they used, but a true induction of a priority function.
  • by hugesmile ( 587771 ) on Tuesday July 06, 2004 @11:05AM (#9621683)
    evolutionary algorithms could make many network caches twice as efficient

    Once again, Al Gore has his hand in the shaping of the internet.

    I'm sure everyone in the Slashdot community will miss him - even if you didn't enjoy his work, there's no denying his contributions to popular culture. Truly an American icon.

  • by Animats ( 122034 ) on Tuesday July 06, 2004 @11:08AM (#9621718) Homepage
    Genetic algorithms are methods for optimization in bumpy spaces. The basic goal is "find x such that f(x) is maximized". As optimization algorithms, they should always be tested against the two simple optimization algorithms - basic hill climbing, and random search.

    If the search space is not dominated by local maxima, basic hill climbing (go for the best neighboring value) will work. And it will be fast. If the function is differentiable, it can be orders of magnitude faster than other methods, because you can use a variant of Newton's Method.

    If the search space is small, random search (just guessing) will work by exhaustively searching the space. This is obvious, but tends to be ignored in academic papers all too often.

    This discussion also applies to neural nets and simulated annealing.

    Now, this article at least describes a problem for which a GA might actually be useful. Many such articles don't. But they haven't demonstrated that you need an algorithm suited to a bumpy search space.

    This is why, despite all the hype, GAs, neural nets, and such aren't used all that much. The search space has to have the right properties. Not too small, not too big, bumpy, but not too bumpy.
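
    A concrete (and entirely made-up) illustration of that baseline comparison - a bumpy one-dimensional objective, greedy hill climbing, and plain random search; the function and all constants are invented for the example:

        import math
        import random

        def objective(x):
            """A bumpy 1-D toy function with many local maxima."""
            return math.sin(5 * x) + 0.5 * math.sin(17 * x) - 0.1 * (x - 3) ** 2

        def hill_climb(start, step=0.01, iterations=2000):
            """Greedy: move to the better neighbour; stalls at the nearest local maximum."""
            x = start
            for _ in range(iterations):
                x = max((x - step, x, x + step), key=objective)
            return x

        def random_search(low, high, samples=2000):
            """Just guess: fine when the space is small or cheap to sample heavily."""
            return max((random.uniform(low, high) for _ in range(samples)), key=objective)

        for name, x in (("hill climbing", hill_climb(start=0.0)),
                        ("random search", random_search(-5.0, 10.0))):
            print(f"{name}: x = {x:.3f}, f(x) = {objective(x):.3f}")

    On a function like this, hill climbing from a single start gets stuck on whichever bump it began near, while random search (or a GA) has a chance of landing near the global peak - which is exactly the trade-off the parent describes.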

    • Misses the point. GAs do not look at search spaces and ask 'is this function too bumpy'. GAs (as well as many other evolutionary algorithms, see GECCO [uiuc.edu]) are more similar to stochastic greedy strategies. I'd suggest looking at Walsh coefficients instead of derivatives as a rule of thumb.

      A large part of the reason GAs aren't catching on more is that there aren't enough really good textbooks on them. All of the really useful information is still in the form of research papers; these things take time. Also,
  • by the frizz ( 242326 ) on Tuesday July 06, 2004 @11:16AM (#9621807)
    The article link is light on details of the evolution algorithm, but Pablo Funes's home page [brandeis.edu] has the text of his thesis on Evolution of Complexity in Real-World Domains [brandeis.edu]. It talks about his use of evolved algorithms for things like designing the strongest Lego brick structure and playing Tron. Very cool, but it doesn't cover their application to caching.

    There is even a link to an online Tron game [brandeis.edu] where us humans can play against his evolving algorithms. The win/loss stats [brandeis.edu] for his algorithm are approaching even. Given that humans can also evolve in Tron game play, I imagine that the algorithm will have a head start over the new influx of Slashdot visitors and start to win more often than not over the next week. I never got to play, though. The SQL db that manages the stats was already down when I tried.

  • <sarcasm>
    What happens if someone caches Metallica, or the new super hot J-Lo movie? Then how is the guy who brings the director a warm towel ever going to make his money?

    You guys should really think before you go out and start making technology that blatantly abuses copyrights.

    </sarcasm>

  • ... has been evolved
    with the Critticall tool. [wikipedia.org]
    Original site. [critticall.com]
  • by TMacPhail ( 519256 ) on Tuesday July 06, 2004 @12:28PM (#9622650)
    50% of all bits sent over the internet are 0s. Just cache that and we have a 50% cache hit rate. :)
  • The problem with caches is whether you can trust them. Even assuming you can keep the caches properly up-to-date, how do you prevent cache poisoning from taking place?
  • by brunokummel ( 664267 ) on Tuesday July 06, 2004 @12:57PM (#9622958) Journal
    Caching information can also be used as a backup of data in case the server crashes or has its data compromised by a virus, hacker, whatever you feel like.
    So what happens if you, for instance, have a security hole in one of your "smart" servers? Or even a breach in the protocol structure (DoS)?

    You could get the server to "breed" algorithms that would stop all servers, by either corrupting the data in all the servers that are providing the service or just DoS-ing them.

    I guess I'm not yet convinced that this solution is any good for a real-world network whose whole structure is based on insecure protocols.
    Maybe after IPv6, IPv9, who knows?
  • Freenet (Score:3, Interesting)

    by zoeblade ( 600058 ) on Tuesday July 06, 2004 @03:14PM (#9624530) Homepage

    Doesn't Freenet [freenetproject.org] already do this within its own network?

  • I'm afraid this doesn't seem like a particularly appropriate use of genetic algorithms. In fact, the entire piece seems to suggest a poor research project (someone who did genetic algorithms needed a thesis) which got picked up on a slow news day.

    First of all, I would point out that genetic algorithms are most appropriate for pure optimization problems with minimal mandatory constraints. While loading speed is important for internet caching, it must be optimized inside certain constraints. A good cache s