Text Compressor 1% Away From AI Threshold 442

Posted by kdawson on Tuesday July 10, 2007 @02:10AM from the second-hutter-prize dept.

Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."

This discussion has been archived. No new comments can be posted.

Text Compressor 1% Away From AI Threshold

Load All Comments

Search 442 Comments Log In/Create an Account

Comments Filter:

I wonder ... (Score:2, Funny)

by iknowcss ( 937215 ) writes:

How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create.

The horror.
- Re: (Score:3, Funny)
  
  by Anonymous Coward writes:
  
  "How many bad car analogies, inaccurate law advice, and duplicate stories an AI bot could possibly hold in his head. Imagine what kind of person all of the "knowledge" of Slashdot would create."
  
  "The horror."
  
  I've been typing everything I ever knew into Slashdot since the day it started, you insensitive clod!
  -- Cmdr Taco
new compression standard (Score:4, Informative)

by tronicum ( 617382 ) * writes: on Tuesday July 10, 2007 @02:17AM (#19810091)

so...wikipedia dumps [wikipedia.org] will now be using paq8hp12 instead of l33t 7zip ?

Share
twitter facebook
- Re:new compression standard (Score:5, Funny)
  
  by aicrules ( 819392 ) writes: on Tuesday July 10, 2007 @08:39AM (#19811805)
  
  Dang! You must have enemies if you are the very first post and you get modded redundant. Time to work on some positive karma buddy...
  
  Parent Share
  twitter facebook
interesting program name (Score:5, Funny)

by digitalderbs ( 718388 ) writes: on Tuesday July 10, 2007 @02:18AM (#19810097)

paq8hp12. when decompressed, it also serves as the source code for the program.

Share
twitter facebook
- Re:interesting program name (Score:5, Informative)
  
  by OverlordQ ( 264228 ) writes: on Tuesday July 10, 2007 @02:26AM (#19810139) Journal
  
  Since I know people are going to be asking about the name, might I suggest the wiki article [wikipedia.org] about PAQ compression for the reasons behind the weird naming scheme.
  
  Parent Share
  twitter facebook
  - - - That is the problem (Score:3, Insightful)
        
        by hummassa ( 157160 ) writes:
        
        Anything that is science is math.
        Ok, computer programming is not necessarily a lot of maths.
        But this article is about something that is really computer science... as opposed to making a CRUD screen in VB.net, which is akin to programming a VCR.
        Parsing, compiling, linear programming, sorting, searching, indexing, compressing, walking graphs, drawing graphics, designing circuits, optimizing circuits, these are activities that are computer science and that are all maths.
        
        Edsger Dijkstra once said: "Computers ar
        
        Re: (Score:3, Insightful)
        
        by rcw-home ( 122017 ) writes:
        
        If my hypothesis is that the next time I close my eyes I'll smell tulips, there's no math involved in evaluating this.
        "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science."
        --Lord Kelvin
        
        Re: (Score:3, Insightful)
        
        by HiThere ( 15173 ) writes:
        
        You need to read Michael Faraday, or some of his predecessors.
        
        Math is a relatively late addition to science. Yes, it's proved very useful. But science happened long before they introduced math.
        
        Well, thinking again, this depends on what you mean by math. Leonardo used math to figure out perspective. Does this means that art depends on math? If so, then science depends on math, and so does walking across the room. And I can see a valid argument to be made along those lines, but that's not what people no
      - Re: (Score:3, Interesting)
        
        by arivanov ( 12034 ) writes:
        
        From the notebooks of Lazarus Long: If it can't be expressed in figures, it is not science; it is opinion.
        
        Re: (Score:3, Interesting)
        
        by WilliamSChips ( 793741 ) writes:
        
        Lazarus Long is consistently wrong. He claims that peace and freedom are mutually exclusive but if you took a graph of our freedom you'd find that the greatest drops are during wartime.
- Re: (Score:3, Funny)
  
  by smittyoneeach ( 243267 ) * writes:
  
  If the name 'paq8hp12' falls out of some tree in the forest, and no one here can tell the difference in the state of the tree/paq8hp12 system, does gravity exist?
That's cool.. (Score:5, Interesting)

by Rorian ( 88503 ) writes: <james...fysh@@@gmail...com> on Tuesday July 10, 2007 @02:19AM (#19810101) Homepage Journal

.. but where can I get this tiny Wiki collection? Will they be using this for their next version of Wikipedia-on-CD? Maybe we can get all of Wiki onto a two-DVD set, at ~1.3bit/character (minus images of course) - that would be quite cool.

Share
twitter facebook
- Re: (Score:2)
  
  by The Great Pretender ( 975978 ) writes:
  
  It going to be for wiki-mobile
- Re:That's cool.. (Score:5, Interesting)
  
  by Cato ( 8296 ) writes: on Tuesday July 10, 2007 @02:38AM (#19810191)
  
  Or more usefully, compress Wikipedia onto a single SD card in my mobile phone (Palm Treo) - with SDHC format cards, it can do 8 GB today.
  
  Compression format would need to make it possible to randomly access pages, of course, and an efficient search index would be needed as well, so it's not quite that simple.
  
  Parent Share
  twitter facebook
  - Re: (Score:3, Informative)
    
    by RuBLed ( 995686 ) writes:
    
    The question is does a mobile handheld device got enough processing power to decompress it? in a reasonable time?
    - - Re:Not made for mobile devices (Score:5, Insightful)
        
        by Yvan256 ( 722131 ) writes: on Tuesday July 10, 2007 @08:57AM (#19811951) Homepage Journal
        
        At this moment without compression you could probably store enough text on a mobile phone to keep you constantly reading for a month.
        That may be, however we're talking about Wikipedia here. It's not about storing so much text that you can't go through it within a month, it's about storing everything so that you can access it as a reference.
        
        When you look up a word in the dictionary, it takes from 10 to 30 seconds to read the definition. But you did need the whole book/brick to do it.
        
        Parent Share
        twitter facebook
  - Re:That's cool.. (Score:4, Funny)
    
    by Hal_Porter ( 817932 ) writes: on Tuesday July 10, 2007 @03:10AM (#19810327)
    
    a text spk version of wiki shud fit in 8gb i think
    its only becoz people are such grammar noobs that they need to waste $
    dood shud filta to txtspk b4 he compress
    
    Parent Share
    twitter facebook
    - Re:That's cool.. (Score:5, Funny)
      
      by neonmonk ( 467567 ) writes: on Tuesday July 10, 2007 @03:49AM (#19810469)
      
      a txt spk vrsion of wiki shd fit 8gb i fink
      its only becoz ppl r sch grmmr noobs tat tey nid 2 wste $
      dud shd filta 2 txtspk b4 he cmpres
      
      There, fixed that for ya.
      
      Parent Share
      twitter facebook
      - Re:That's cool.. (Score:5, Funny)
        
        by Archimonde ( 668883 ) writes: on Tuesday July 10, 2007 @05:03AM (#19810765)
        
        aTxtSpkVrsionOfWikiShdFit8gbIFink
        itsOnlyBecozPplRSchGrmmrNoobsTatTeyNid2Wste$
        dudShdFilta2TxtspkB4HeCmpres
        
        Fixed even more.
        
        Parent Share
        twitter facebook
        
        super-grammar-improved paq8hp12 (Score:4, Funny)
        
        by superbrose ( 1030148 ) writes: on Tuesday July 10, 2007 @05:42AM (#19810937) Homepage
        
        After implementing a few minor tweaks to paq8hp12 and incorporating your grammar optimisation algorithm I managed to compress the above text amazingly to a single character: '&'.
        
        Now you figure out which one it was and how to decompress it.
        
        Parent Share
        twitter facebook
        
        Re:super-grammar-improved paq8hp12 (Score:5, Funny)
        
        by pla ( 258480 ) writes: on Tuesday July 10, 2007 @01:35PM (#19815539) Journal
        
        Now you figure out which one it was and how to decompress it.
        
        Well, with only 256 choices, it didn't take long to check all possible decodings for one that makes sense. Ended up working for "}".
        
        Oddly, though, the algorithm not only restored, but improved the original! I get:
        
        "The King's English version of Wikipedia should fit in eight gigabits, I do believe. Only humanity's sphexish adherence to grammatical rules limits the attainable compression ratio; the good gentleman might wish to consider filtering to a more base patois prior to applying his algorithm".
        
        Amazing... This discovery could single-handedly render the next generation (nearly) intelligible!
        
        Parent Share
        twitter facebook
      - Re:That's cool.. (Score:5, Funny)
        
        by thomasj ( 36355 ) writes: on Tuesday July 10, 2007 @05:09AM (#19810787) Homepage
        
        1txtspk #.#/wiki = 8G!
        ~ppl r grm0.1 -> -$
        |txtspk|gzip
        
        Parent Share
        twitter facebook
        
        Re:That's cool.. (Score:4, Funny)
        
        by jaavaaguru ( 261551 ) writes: on Tuesday July 10, 2007 @05:18AM (#19810835) Homepage
        
        Is that Perl? ;-)
        
        Parent Share
        twitter facebook
        
        Re:That's cool.. (Score:5, Funny)
        
        by Anonymous Coward writes: on Tuesday July 10, 2007 @06:53AM (#19811167)
        
        Perl: The only language that looks the same before and after RSA encryption.
        
        Parent Share
        twitter facebook
- Re:That's cool.. (Score:5, Informative)
  
  by Kadin2048 ( 468275 ) * writes: <slashdot...kadin@@@xoxy...net> on Tuesday July 10, 2007 @02:58AM (#19810275) Homepage Journal
  
  Given that it takes something like ~17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
  
  Mobile use is right out too, at least with current-generation equipment.
  
  Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.
  
  Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...
  
  Parent Share
  twitter facebook
  - Re:That's cool.. (Score:5, Informative)
    
    by Anonymous Coward writes: on Tuesday July 10, 2007 @03:29AM (#19810387)
    
    No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
    When compressing a file the program has to figure out the best way to represent the data in compressed form before it actually compresses it, when decompressing all it has to do is put it back together according to the method the program previously picked.
    
    This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, i.e. to compress a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing them in real time.
    
    Parent Share
    twitter facebook
    - Re: (Score:3, Interesting)
      
      by Solder Fumes ( 797270 ) writes:
      
      This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, i.e. to compress a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing them in real time.
      
      Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away. That takes a lot more horsepower than picking up the al
      - Re:That's cool.. (Score:5, Informative)
        
        by imroy ( 755 ) writes: <imroykun@gmail.com> on Tuesday July 10, 2007 @05:29AM (#19810861) Homepage Journal
        
        Probably not the best example. MPEG4 encoding takes so much time because it's not classical compression, the encoder has to figure out which pieces are less psychorelevant to big picture, and throw them away.
        
        No, the most time-consuming part of most video encoders (including h.263 and h.264) is finding how the blocks have moved - searching for good matches between one frame and another. For best results, h.264 allows for the matches to not only come from the last frame, but up to the last 16! That allows for h.264 to handle flickering content much better, or situations where something is quickly covered and uncovered again e.g a person or car moving across frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that were on screen just a moment prior.
        
        The point does remain, most "compression" involves some sort of searching which is not performed when decompressing.
        
        Parent Share
        twitter facebook
- Re:That's cool.. (Score:5, Insightful)
  
  by arun_s ( 877518 ) writes: on Tuesday July 10, 2007 @04:31AM (#19810627) Homepage Journal
  
  Maybe someone could sell the whole thing in a book-sized rectangular box with a tiny keyboard and 'DON'T PANIC' inscribed in large, comforting letters in the front.
  Now that'd be cool.
  
  Parent Share
  twitter facebook
- Re:That's cool.. (Score:4, Funny)
  
  by Ed Avis ( 5917 ) writes: <ed@membled.com> on Tuesday July 10, 2007 @06:02AM (#19811003) Homepage
  
  According to Wikipedia, the average per-character entropy of English text has tripled in the last six months!
  
  Parent Share
  twitter facebook
Where's the Mods? (Score:5, Informative)

by OverlordQ ( 264228 ) writes: on Tuesday July 10, 2007 @02:30AM (#19810149) Journal

The link in TFS links to the post about the FIRST payout, here's [google.com] the link to the second payout (which this article is supposed to be talking about).

Share
twitter facebook
Dangerous (Score:4, Funny)

by mhannibal ( 1121487 ) writes: on Tuesday July 10, 2007 @02:33AM (#19810165)

This is damned dangerous, and playing with all our lives. Soon compression rates will approach 100% where the data will collapse into itself forming a black hole that will suck in the universe.

Damned scientists!

Share
twitter facebook
- Re:Dangerous (Score:5, Funny)
  
  by SoulDrift ( 638565 ) writes: on Tuesday July 10, 2007 @05:28AM (#19810857)
  
  Actually, I can give you 100% compression already. It's just a bit lossy.
  
  Parent Share
  twitter facebook
  - - Comment removed (Score:5, Funny)
      
      by account_deleted ( 4530225 ) writes: on Tuesday July 10, 2007 @06:25AM (#19811087)
      
      Comment removed based on user account deletion
      
      Parent Share
      twitter facebook
      - Re:Dangerous (Score:5, Funny)
        
        by smallfries ( 601545 ) writes: on Tuesday July 10, 2007 @07:30AM (#19811323) Homepage
        
        See? American English is actually just essentially lossless compression...
        Sure, sure it is. Not exactly optimal though...
        
        Parent Share
        twitter facebook
      - Re:Dangerous (Score:5, Funny)
        
        by Welshalian ( 733176 ) * writes: <welshalian@gmail.com> on Tuesday July 10, 2007 @08:13AM (#19811589)
        
        humour
        Humor. See? American English is actually just essentially lossless compression...
        
        I respectfully disagree. Most of the fun in British humour gets lost in the translation to American humor.
        
        Parent Share
        twitter facebook
      - Re:Dangerous (Score:4, Insightful)
        
        by UbuntuDupe ( 970646 ) * writes: on Tuesday July 10, 2007 @09:38AM (#19812345) Journal
        
        Well, not always...
        
        American --> British
        
        transportation --> transport
        football player --> footballer
        subway --> tyube
        burglarize --> burgle
        
        Parent Share
        twitter facebook
        
        Re: (Score:3, Funny)
        
        by aprilsound ( 412645 ) writes:
        
        football player --> footballer/quote You misspelled 'soccer'.
Artificial Intelligence? (Score:4, Insightful)

by mrbluze ( 1034940 ) writes: on Tuesday July 10, 2007 @02:35AM (#19810173) Journal

Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence.

Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?

Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?

Share
twitter facebook
- Re:Artificial Intelligence? (Score:5, Informative)
  
  by MoonFog ( 586818 ) writes: on Tuesday July 10, 2007 @02:44AM (#19810211)
  
  Shamelessly copied from the wikipedia article on the Hutter Prize [wikipedia.org]:
  
  The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far. Unfortunately, there is no general solution because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl) where the environment is restricted to time t and space l, that a solution can be computed in time O(t2l), which is still intractable. Thus, AI remains an art.
  
  The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences.
  
  Parent Share
  twitter facebook
  - Re: (Score:3, Insightful)
    
    by mbkennel ( 97636 ) writes:
    
    They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge.
    
    The apparent empirical result is that predicting which characters are most likely to occur next in a text sequence requires either
    
    1) vast real-world knowledge
    
    OR
    
    2) vast real-world derived statistical databases and estimation machinery
    
    but there can be a difference in their utility. The point of course, is that humans can do enormously more powerful things with that vast real-world
  - Re: (Score:2)
    
    by nanosquid ( 1074949 ) writes:
    
    Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far.
    
    Well, one should keep in mind that the connections between AI and compression have been known far longer, but have never turned out to be particularly useful for building AI systems. Hutter's point is merely a variant of these previous theories, and there is no reason to believe tha
    - Re: (Score:2)
      
      by Baldrson ( 78598 ) * writes:
      
      The value of the Hutter Prize isn't in the use of Hutter's theory to build AI. The value of the Hutter Prize is in the use of compression ratio to provide a figure of merit for AI that is very good. That value has been proven in various ways -- mathematically and practically.
- AI? I don't think so. (Score:4, Insightful)
  
  by tgv ( 254536 ) writes: on Tuesday July 10, 2007 @02:50AM (#19810237) Journal
  
  It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.
  
  And yes, that's absolute bollocks. Shannon's number was just an estimate and only applied to serial transmission of characters, because that's what he was interested in. Since then, a lot of work has been done in statistical natural language processing, and I would be surprised if the number couldn't be lowered.
  
  Anyway, since the program doesn't learn or think to reach this limit, nor gives a explanation how this level of compression is intrinsically linked to the language/knowledge it compresses, it cannot be called AI; e.g., it doesn't know how to skip irrelevant bits of information in the text. That would be intelligence...
  
  Parent Share
  twitter facebook
- Re:Artificial Intelligence? (Score:5, Interesting)
  
  by qbwiz ( 87077 ) * writes: <johnNO@SPAMbaumanfamily.com> on Tuesday July 10, 2007 @02:53AM (#19810249) Homepage
  
  Could someone out there please explain how being able to compress text is equivalent to artificial intelligence?
  
  Is this to suggest that the algorithm is able to learn, adapt and change enough to show evidence of intelligence?
  
  The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.
  
  If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
  
  Parent Share
  twitter facebook
  - Re:Artificial Intelligence? (Score:4, Insightful)
    
    by bytesex ( 112972 ) writes: on Tuesday July 10, 2007 @03:45AM (#19810451) Homepage
    
    The problem with this approach is that there are many ways the say the same thing, and that this compression/decompression algorithm is tested using strict text-comparison only. A real AI might compress 'The sky is blue today' and decompress to 'Today it's beatiful weather' and not be wrong.
    
    Parent Share
    twitter facebook
    - Re: (Score:2, Insightful)
      
      by pitu ( 983343 ) writes:
      
      The problem with this approach is that there are many ways [to] say the same thing
      that is an idea or a concept. Interpreting an idea or concept in different ways is meaningful
      only by its context.
      
      ex1 the sky is blue => it's beautifull weather (context: you're making a walk)
      ex2: the sky is blue => use #0000FF for the sky area (context: graphic work)
      this compression/decompression algorithm is tested using strict text-comparison only.A real AI might compress 'The sky is blue today' and decompress to 'Toda
    - Re: (Score:3, Funny)
      
      by mrjb ( 547783 ) writes:
      
      "A real AI might compress 'The sky is blue today' and decompress to 'Today it's beatiful weather' and not be wrong." That might be a good example of acceptable *lossy* AI text compression. One step further and it will compress articles into a proper, readable summary.
  - Re: (Score:2)
    
    by bentcd ( 690786 ) writes:
    
    The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.
    (Emphasis is mine)
    
    But, surely, the compression algorithm isn't actually guessing at all - it knows what comes next because it is working from a very strict set of rules. I would probably be more impressed(*) if they wrote a heuristic that could accurately guess symbols based upon previous symbols - even if such a heuristic were to give a higher error rate than what the deterministic algorithm does.
    
    Then perhaps as the next logical step, we could have a heuristic trained by Wikipedia which could accurately p
    - - Re: (Score:3, Interesting)
        
        by iapetus ( 24050 ) writes:
        
        Which is a shame, because the weather wasn't good.
- Re:Artificial Intelligence? (Score:5, Insightful)
  
  by fireboy1919 ( 257783 ) writes: <rustyp@@@freeshell...org> on Tuesday July 10, 2007 @02:59AM (#19810283) Homepage Journal
  
  The first poster on this topic had a good explanation - it seems like an AI problem, but not why.
  
  Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.
  
  We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
  So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
  
  Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by iamdrscience ( 541136 ) writes:
    
    We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end. So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
    I've heard that somebody already did a software implementation of a grad st
  - Re: (Score:2)
    
    by Bombula ( 670389 ) writes:
    
    I've never studied information theory, so I'm woefully ignorant, but as I understand it, it was Shannon himself who defined information content as the amount by which a recipient's ignorance is reduced. Here's what I don't understand, and hope someone here will be able to explain:
    It makes sense to me that you can measure how much information is contained, say, in a text message. But to interpret that information you need ... what? We usually fill in the blank with 'intelligence', but it seems to me that
    - Re: (Score:2)
      
      by archeopterix ( 594938 ) writes:
      
      Could it instead be that information doesn't really exist on one side or the other - message vs recipient - but can only be defined in terms of both together?
      As far as I know, this is how information theory defines information (actually it defines uncertainty, but this is rather a technical detail). The definition of uncertainty relies on a random variable. This random variable represents the knowledge common to the sender and the recipient before sending the message ( they both know whether they will be t
  - Re: (Score:2)
    
    by Eivind ( 15695 ) writes:
    
    True, but the real gains are achieved when you're allowed to be lossy. If you're studying a picture, with your goal being to be able to answer human questions about that picture afterwards, you don't make any attempt whatsoever at remembering the precise pixel-by-pixel colours. Instead you focus on those parts of the contents most likely to be of human interest. You take note of a "car" standing with the "side" towards you, perhaps the make. The colour. The fact that it's raining. That there's a girl sit
- Re: (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  As Matt Mahoney explained it to me when we were brainstorming the prize criteria [tinyurl.com]:
  Hutter's* AIXI, http://www.idsia.ch/~marcus/ai/paixi.htm [idsia.ch] makes another argument for the connection between compression and AI that is more general than the Turing test. He proves that the optimal behavior of an agent (an interactive system that receives a reward signal from an unknown environment) is to guess that the environement is most likely computed by the shortest possible program that is consistent with the behavior obse
  - broken assumptions (Score:2)
    
    by nanosquid ( 1074949 ) writes:
    
    He proves that the optimal behavior of an agent (an interactive system that receives a reward signal from an unknown environment) is to guess that the environement is most likely computed by the shortest possible program that is consistent with the behavior observed so far.
    
    But the task of compressing Wikipedia character by character is thoroughly irrelevant to human intelligence. Humans are bad at it, so it's not a characteristic of human intelligence, and if you had a very high quality compressor, even if
- - Re: (Score:2)
    
    by Eivind Eklund ( 5161 ) writes:
    
    You're assuming that humans aren't machines - in this context, that's actually a matter of faith. Human intelligence may be a result of "machine processes" - ie, direct physical processes. If we assume that humans are intelligent - otherwise, the term seems sort of useless - we can't rule out machines being intelligent by them being machines unless we have a definition of machine that excludes humans. (And I believe such a definition would probably be counter-productive when it comes to the matter of def
I'll be reading the source... (Score:4, Interesting)

by seanadams.com ( 463190 ) * writes: on Tuesday July 10, 2007 @02:37AM (#19810183) Homepage

I've worked with some general purpose compression algorithms like zlib, lossy audio compression like mp3, and also lossless audio.

Each is very different and interesting in its own right. MP3 especially, because the compression model is built on what the ears+brain can perceive.

This algorithm I guess would be sort of like mp3 in that it contains some human-based element, maybe a language structure or something, but more like FLAC in that it might use predictors to say what word is likely to come next, with an error bitstream to point to progressively less likely words using bit sequences whose is inversely related to the probability of that word. But that's just a guess from an audio guy.

Can somebody who's looked at this post a synopsis of how it works?

Share
twitter facebook
- Re:I'll be reading the source... (Score:5, Interesting)
  
  by phatvw ( 996438 ) writes: on Tuesday July 10, 2007 @02:56AM (#19810265)
  
  I wonder if lossy text compression where prepositions are entirely thrown out would be effective? Based on context, your brain actually ignores a lot of words you read and fills in the blanks so-to-speak. Perhaps you can use simple grammar rules to predict which prepositions go where based on that same context?
  
  Parent Share
  twitter facebook
- Yeah, about that source. (Score:2)
  
  by QuantumG ( 50515 ) writes:
  
  Here's a tip to whoever made this archive, if you want people to abide by the GPL you really should do so yourself. That means:
  
  1. Putting a copy of the GPL in the archive with your program.
  2. Putting the source code in the archive with the binary form of your program; or
  3. Putting an offer to provide source code, valid for 3 years, in the archive with your program.
  
  If you don't do it, what makes you think anyone else is going to?
  - Re: (Score:2)
    
    by Chandon Seldon ( 43083 ) writes:
    
    He's the copyright holder, so he can distribute his binaries however he wants to. If you care about acting legally, it's your responsibility to track down the license before you use, copy, modify, or redistribute the program.
    - Re: (Score:2)
      
      by QuantumG ( 50515 ) writes:
      
      No shit.
      
      I think I made myself pretty clear. If you don't follow your own license, who will?
you can stop now (Score:2)

by r00t ( 33219 ) writes:

This really isn't much of a gain. If the info theory is right, there isn't much gain to be had. Even in the most optimistic case, we aren't going to go much beyond a factor of two additional reduction.

Other stuff is more interesting: fast decompression time, fast compression time, smaller compression block size
Lossy compression? (Score:5, Funny)

by niceone ( 992278 ) * writes: on Tuesday July 10, 2007 @02:45AM (#19810215) Journal

Shouldn't AI be using lossy compression? Certainly my real intelligence uses um, where was I?

Share
twitter facebook
- Re:Lossy compression? (Score:5, Interesting)
  
  by Phat_Tony ( 661117 ) writes: on Tuesday July 10, 2007 @11:33AM (#19813843)
  
  That's my opinion of this. By excluding lossy compression, they're also excluding the likelihood of applicability to AI that is the point of the contest.
  
  Humans achieve good compression on things like encyclopedia knowledge because we don't remember the words at all. We remember the idea, and we have our own dictionary in our heads, and we re-apply words to the idea to reconstruct the entry, rather than memorizing the data. That's why we get great compression; we throw out most of the data, and just remember the "gist" of it, the argument, the facts, in an internal structure of raw ideas stored independently of the words to explain them.
  
  By restricting the contest to lossless compression, they eliminate the ability to use any AI-like compression techniques. The machine can not extract the ideas and then re-assign words, because it would have to be able to do so using the exact voice of each of thousands of different Wikipedia contributors. That's hopeless.
  
  So the entrants are restricted to clever algorithms that do endless mathematical optimizations to compress the data, a method of compression that's entirely alien to the methods of our only known intelligence. We don't remember things by figuring out clever tricks to compress the data in our own memory. We don't say "Oscar Schindler saved Jews In WWII" and then say, OK, that data had 5 spaces in it, and 4 "S's," and if I remember the positions of the spaces and the S's, I could use less memory space to store this in my head, and then just think back through the algorithm I used to take the spaces and "s's" out and put them back in where they go, and I'll have the name again, and then sit there and carefully work out in our heads what the original data must have been after our compression methods. It doesn't work that way at all. To us it apparently "just comes to us." The compression probably comes from things like remembering sounds, and then reconstructing the name's exact spelling based upon known rules of grammer. We store the name Oscar Schindler in relation to various facts regarding Jews and WWII, but we store them as ideas, and then pull the words back out, and each time someone asks us about Schindler, we'd be likely to say something similar in meaning but different in expression. So this contest is restricted to the least interesting kind of compression for intelligence; the kind that can't use it.
  
  Interesting compressions are things like JPEG and MP3, where they built the compression model on the human perceptual model, first saying "what about this exact data is less relevant to a human observer, that we can therefore throw away?" For JPEG's, it turns out that (among other things) we're much more sensitive to differences in color than to absolute colors, and among differences in color, we're much more perceptive in the color ranges closer to human skin tone. MIDI is actually probably closer to the compression used by human intelligence than any recorded music standard.
  
  Along these lines, I'd say storing the HTML formatting data exactly borders on ridiculous. It's a hugely inefficient waste of space. For instance, if you just run the HTML through one of the free online utilities that strips irrelevant data, you get the identical presentation of the data, you've only thrown out entirely worthless data. But you've already violated the contest rules. You should be able to strip the HTML entirely, as long as your compression/decompression system ends up with conveniently readable formatting in the end. Reconstructing the actual HTML in a character-identical way is so non-intelligent when you're trying to save space, it seems hard to beleive it's going to lead to intelligence.
  
  Regarding this contest: I'm curious what level of compression you can get if you just histogram the words and then, in order of frequency for anything with enough occurrences to save memory by using a look-up table, you assign sequential numeric values for the words in order of frequency of occurrence. Then start your data with a look-up
  Read the rest of this comment...
  
  Parent Share
  twitter facebook
- Re: (Score:3, Interesting)
  
  by BillyBlaze ( 746775 ) writes:
  
  While allowing lossy compression might end up with way better AI, there's a logistics problem: how do you give an objective score to the precision of 100M (that's 4.76 uLoC) of paraphrased text?
Obligatory... (Score:5, Funny)

by Stormwatch ( 703920 ) writes: <rodrigogirao@nosPam.hotmail.com> on Tuesday July 10, 2007 @03:08AM (#19810317) Homepage

- The Wikipedia annual funding drive is passed. The system goes on-line August 4th, 2007. Human contributors are removed from editing. Wikipedia begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.
- Wikipedia fights back.
- Yes. It launches its rvv missiles against Slashdot.
- Why attack Slashdot? Aren't they our friends now?
- Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.

Share
twitter facebook
How to win the Hutter Prize (Score:5, Funny)

by seanyboy ( 587819 ) writes: on Tuesday July 10, 2007 @03:21AM (#19810363)

1) Create a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm
2) Add a long and self referencing article on wikipedia about said algorithm.
3) Use algorithm to compress first x% of wikipedia (including your own article)
4) WIN HUTTER PRIZE.

Share
twitter facebook
- Re: (Score:3, Funny)
  
  by game kid ( 805301 ) writes:
  
  [...]a compression algorithm called the aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaa algorithm[...]
  That's gotta be the most annoying compression algorithm in the world [imdb.com].
- Re: (Score:2)
  
  by will_die ( 586523 ) writes:
  
  From what someone else said the size of the decompress program has to be included into the overall size of the compressed data.
- Re: (Score:3, Interesting)
  
  by vertigoCiel ( 1070374 ) writes:
  
  Great idea, but unfortunately the Hutter Prize uses a static sample of the first 10^8 bits of Wikipedia.
which only goes to show... (Score:5, Informative)

by nanosquid ( 1074949 ) writes: on Tuesday July 10, 2007 @03:34AM (#19810403)

If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.

Share
twitter facebook
AI (Score:2, Insightful)

by evilviper ( 135110 ) writes:

bringing text compression within 1% of the threshold for artificial intelligence.
I see no reason to believe AI and text compression are interchangeable.

I can think of a few methods that would allow a computer to guess a missing word better than humans (exceeding the AI limit), and that such methods would be useless for determining a response to a question, particularly in the real world, where things like punctuation, abbreviation, and capitalization would be highly suspect to begin with.

So I have to say th
Bah humbub (Score:2)

by yusing ( 216625 ) writes:

equivalent to solving the artificial intelligence problem
... Without actually contributing anything to the development of artificial intelligence (an entity capable of understanding and interacting with the real world in an intelligent way).
It's impressive as it stands. The hype is superfluous.
Why the Hutter Prize is Bullshit. (Score:3, Interesting)

by Anonymous Coward writes: on Tuesday July 10, 2007 @04:11AM (#19810543)

I suppose I have to post this anonymously, or the Hutter Prize thralls will just mod it down; I like my karma.

I am at a loss as to how this meaningless charade keeps getting posted on Slashdot. Anyone with half a brain who reads TFA (or any of the previous FA Slashdot has posted on this stupid prize) can see this for what it really is: a handful of crazy people who think that this has meaning beyond above-average technogarbage.

There are all of four people seriously involved in this Hutter Prize: Hutter himself, Bowery (who's made all the /. submissions), Mahoney (who wrote a thesis on this crap), and Ratushnyak, who seems to enjoy wasting his time on this AI-obsessed prize.

PAQ8HP12 may be able to compress the corpus extremely efficiently, but it has obvious and real drawbacks for any real-world application: it's tuned for this specific corpus ("H[utter]P[rize]" is even in the name of the compressor), it's slow as fuck, and it consumes 2GB of memory. Yes, 2GB of memory for 100MB of input data. This is not a real-world algorithm; this is CS weenies wanking off.

And what's with the obsession with Wikipedia? It is not the be-all, end-all of human knowledge, and, despite its devotees' claims, never will be; just look at the internal politics, and you'll see that it simply can't scale to that size. Is it a useful resource? Of course. Is it something worthy of adoration and fawning over? No.

And then, of course, there's the obsession with AI. These people seem to be of the opinion that a text compressor will actually lead to artificial intelligence -- with no other tuning! An absurd claim if I've ever heard one; the predictive capabilities of a good text compressor are something that would no doubt be useful to an AI, but there's one hell of a lot more to general intelligence than just pattern matching and statistical algorithms for compression.

If one really wanted to sponsor an AI prize, it would probably be much better to focus on creating, say, an effective chatbot -- something that really can predict a desirable response and pass Turing's test.

Not this compression bullshit.

Share
twitter facebook
Not truly == AI just yet (Score:5, Interesting)

by Anonymous Coward writes: on Tuesday July 10, 2007 @04:51AM (#19810709)

I've been following the Hutter Prize with interest, having been into compression ever since reverse engineering Powerpacker on my Amiga 500 back in the good old days to understand how it worked (ah, happy memories).

Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Predictive Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).

Now depending on the text itself, prediction based on previously seen text isn't enough ... especially this enwik8 file which is more of a flat file dataset with a lot of unrelated terminology.

Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork" ... how the hell can we possibly predict that the next words will be "Slashdot, Cigarette, Coffee". (Three subjects very close to my heart ... also my lungs, arteries, liver etc).

Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.

It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process ... the 100MB enwik8 file. A different file will need a different dictionary.

Note also, as the enwik8 file is not truly a passage of text, more a collection of data in XML wrapper, there is also a lot to be gained simply be understanding the structure of the file itself, and finding an alternative representation for the XML components ... example all the timestamps are in a very verbose character style like "2007-07-10 00:00:00" ... if we can recognize that, we could find an alternative encoding, changing 19 byte string into 32 bit long (maybe even less if we understand the epoch date he is using) ... again, "wetware" has to identify and decide this encoding right now.

Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed himself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangment for the dictionary, decide the optimum compression technique, context orders to be used etc etc ... AND do all this in less than 9 hours I believe it takes for the latest compressor.

This high bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.

At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.

Share
twitter facebook
Only 1% away from AI? (Score:2)

by glwtta ( 532858 ) writes:

That's only about 1.5 laptop meters! The thinking machines are here!
Very silly goal (Score:3, Interesting)

by Ancient_Hacker ( 751168 ) writes: on Tuesday July 10, 2007 @06:51AM (#19811159)

This is about the silliest competition imagineable. Think:
Compression has reached a point of diminishing returns, getting less and less return for more and more work. And at best it's asymptotically approaching the theoretical limit. You could offer a billion dollar prize and get back maybe a few percent of improvement, while making any further improvement more difficult.
Meanwhile data storage and data transmission technology keeps improving many percent a year, with each improvement compounding on the previous ones.
In Other Words, IMHO money would be better spent on the second area rather than the first.

Share
twitter facebook
But of course, you don't need math for this... (Score:3, Funny)

by maillemaker ( 924053 ) writes: on Tuesday July 10, 2007 @07:44AM (#19811379)

Surely you don't need any mathematical skills to do this kind of work...

http://science.slashdot.org/comments.pl?threshold= 1&mode=thread&commentsort=0&sid=247781&op=Reply [slashdot.org] ;)

Share
twitter facebook
Erratum: 1.3 May Be Too High (Score:3, Informative)

by Baldrson ( 78598 ) * writes: on Wednesday July 11, 2007 @02:11PM (#19828071) Homepage Journal

Matt Mahoney has communicated his concern to me that the 1.3 bits per character entropy measured by Shannon is likely a smaller number with the enwik8 corpus due to regularities from embedded markup. He has already compressed enwik9 (1,000,000,000 bytes) to less than 1.3 bits per character and his analysis shows that this is largely due to a large section of data tables present in that larger sample -- which entails a large amount of embedded markup. While the entropy of enwik8 is unlikely to be as low as enwik9, this difference does evidence the lower entropy of embedded markup.
Until Shannon type experiments, involving humans doing next character predictions of enwik8, are performed, the bounds of enwik8's entropy range must remain unknown but is likely lower than 0.6 to 1.3. As such an experiment would be expensive, it is going to be difficult to say with any simple bpc measure when the Hutter Prize is breaching the threshold of AI. What the Hutter Prizes bpc metric gives us, however, is a clear measure of progress.
My apologies to the other members of the Hutter Prize Committee and the /. community for this error.
PS: Another area of concern raised by Mahoney is that enwik8, at 10e8 characters, is only as much verbal information as a 2 or 3 year old has encountered so although it is sufficient to demonstrate AI capabilities well beyond the current state of the art, his preference is for a much larger contest with fewer resource restrictions focusing on the 10e9 character enwik9 which is more likely to produce a the AI equivalent of an adult with encyclopedic knowledge.

Share
twitter facebook
- Re:Huh? (Score:5, Informative)
  
  by headkase ( 533448 ) writes: on Tuesday July 10, 2007 @02:32AM (#19810159)
  
  Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach [berkeley.edu].
  
  Parent Share
  twitter facebook
  - Re:Huh? (Score:4, Informative)
    
    by DavidpFitz ( 136265 ) writes: on Tuesday July 10, 2007 @04:24AM (#19810593) Homepage Journal
    
    One of the best introductions to AI is Artificial Intelligence: A Modern Approach. [berkeley.edu]
    
    Indeed Russell & Norvig is a very good book, well worth a read if you're interested in AI. All the same, when I did my BSc in Artificial Intelligence I found Rich & Knight [amazon.com] a much better, more understandable book for the purposes of an introductory text. It is a little dated now, but so is Russell & Norvig, to be honest.
    
    Parent Share
    twitter facebook
- Re:ai threshold? (Score:5, Informative)
  
  by Baldrson ( 78598 ) * writes: on Tuesday July 10, 2007 @02:53AM (#19810253) Homepage Journal
  
  non-connectionist previous attempts (the stuff that came from the functionalists) has come up pretty short - and will continue to do so even if scaled up massively.
  paq8hp12 uses a neural network, ie: it has a connectionist component.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by TheLink ( 130905 ) writes:
    
    But would they understand how it works, why it works once it seems to work?
    
    A lot of these AI stuff seems to be throwing stuff together and hoping for the best.
    
    I have quite a low opinion of "making AI" that way.
    
    After all, if I wanted to get an intelligent non-human entity without really understanding how it works, I could just go to the local pet store and buy one.
    - Re:ai threshold? (Score:4, Insightful)
      
      by Baldrson ( 78598 ) * writes: on Tuesday July 10, 2007 @04:57AM (#19810741) Homepage Journal
      
      Connectionist models are models. Any model needs to be interpreted to be understood.
      
      Parent Share
      twitter facebook
- Huffman Example (Score:5, Informative)
  
  by headkase ( 533448 ) writes: on Tuesday July 10, 2007 @02:57AM (#19810269)
  
  See: Explanation [wikipedia.org]. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each possible branching code or going "left" or "right" down the branches is associated with a particular sequence of bits. You give the most frequent characters the shortest sequence of bits which "tokenizes" the information to be compressed. Reversing the process you run through the bit stream converting tokens back into a stream of characters.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by archeopterix ( 594938 ) writes:
    
    Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest.
    You don't have to limit yourself to characters (although it is practical to do so for texts), or even fixed length bit sequences. If your data happens to contain a lot of "100011"'s and "10111010111010101"'s, you can use them for encoding. Any set of bitstrings works, as long as your file can be expressed as concatenation of the bitstrings from the set.
    - Re: (Score:2)
      
      by headkase ( 533448 ) writes:
      
      Yup. You're completely right. Using characters as the base unit is arbitrary. Trying to keep it simple however; I should have explained binary trees better too.
  - Re: (Score:2)
    
    by ardor ( 673957 ) writes:
    
    You forgot two other things:
    
    1) Modeling is a very big part of compression, in fact its the one where AI might occur.
    2) Huffman is only the optimum for integer symbol sizes. If one symbol has 2.117 bits, it won't be the optimum. If one symbol needs only 0.004 bits, then huffman gives it 1 bit, which is far too large. Arithmetic/range coding address these issues, and come VERY near to entropy, so entropy coding is a solved problem. Which leads me to (1) - its there where research happens.
    - Re: (Score:2)
      
      by headkase ( 533448 ) writes:
      
      See: here [slashdot.org]. Representation or modeling defines everything. How easy it is to code logic depends directly on how you represent your data within memory. Not only in AI but in other areas such as how the philosophy behind Object Orientated Programming simplifies logic for programmers.
- Re: (Score:2)
  
  by smallstepforman ( 121366 ) writes:
  
  The algorithm is based on the frequency of letters in the English language, and assigns a bit pattern (from smallest to largest) based on the order these letters appear in the main text. For odd letter combinations which never appear together (eg. qx, tz, vz etc), you will actually get no compression, but for most dictionary words, you could compress common words into a few bits (eg. compress and into 10).
  The particular compression lookup-table is useless in non english language, but I guess another table
- Re:Program size is 1.02 MB! (Score:5, Informative)
  
  by Baldrson ( 78598 ) * writes: on Tuesday July 10, 2007 @02:58AM (#19810271) Homepage Journal
  
  Actually, the size of the program (decompressor) binary is 99,696 bytes [hutter1.net], and it is the binary size that is included in the prize calculation.
  
  Parent Share
  twitter facebook
  - Re: (Score:3, Interesting)
    
    by seanadams.com ( 463190 ) * writes:
    
    Actually, the size of the program (decompressor) binary is 99,696 bytes, and it is the binary size that is included in the prize calculation.
    
    Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
    Maybe I'm misunderstanding the contents of that rar file. Are both of those data files needed? The .exe by itself is 124KB. Where did you get 99,696?
    - Re: (Score:3, Informative)
      
      by TuringTest ( 533084 ) writes:
      
      Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
      Because then you'd have to measure also the size of the UNIX system in the count of your decoder program, and that would ruin your ratio.
- Re: (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  Try this [binet.com.ua].
  - - Re: (Score:3, Interesting)
      
      by Baldrson ( 78598 ) * writes:
      
      On a 1.73 GHz Pentium with 1.2G RAM running Cygwin after compressing with:
      PAQ8HP12 -7 enwik8.paq8hp12 enwik8
      and moving the enwik8 archive to the parent directory:
      
      JamesBowery@oldatlantis ~/hutter/paq8hp12
      $ time ./PAQ8HP12 -7 enwik8.paq8hp12
      100000000 enwik8: extracted
      16381959 -> 100000000 (1.3106 bpc) in 22398.08 sec (4.465 KB/sec), 941315 Kb
      
      real 373m19.379s
      user 0m0.031s
      sys 0m0.030s
      
      JamesBowery@oldatlantis ~/hutter/paq8hp12
      $ ls -al
      total 114216
      drwxrwxrwx+ 2 JamesBowery None
- Re: (Score:2)
  
  by ObsessiveMathsFreak ( 773371 ) writes:
  
  That's no core dump. That's the compressed data.
- PAQ8, Hutter Prize branch, version 12 (Score:5, Informative)
  
  by tepples ( 727027 ) writes: <tepples@[ ]il.com ['gma' in gap]> on Tuesday July 10, 2007 @07:35AM (#19811349) Homepage Journal
  
  As far as I can tell given this Wikipedia article [wikipedia.org], "paq8hp12" means PAQ8, Hutter Prize branch, version 12.
  
  Parent Share
  twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

I wonder ... (Score:2, Funny)

Re: (Score:3, Funny)

new compression standard (Score:4, Informative)

Re:new compression standard (Score:5, Funny)

interesting program name (Score:5, Funny)

Re:interesting program name (Score:5, Informative)

That is the problem (Score:3, Insightful)

Re: (Score:3, Insightful)

Re: (Score:3, Insightful)

Re: (Score:3, Interesting)

Re: (Score:3, Interesting)

Re: (Score:3, Funny)

That's cool.. (Score:5, Interesting)

Re: (Score:2)

Re:That's cool.. (Score:5, Interesting)

Re: (Score:3, Informative)

Re:Not made for mobile devices (Score:5, Insightful)

Re:That's cool.. (Score:4, Funny)

Re:That's cool.. (Score:5, Funny)

Re:That's cool.. (Score:5, Funny)

super-grammar-improved paq8hp12 (Score:4, Funny)

Re:super-grammar-improved paq8hp12 (Score:5, Funny)

Re:That's cool.. (Score:5, Funny)

Re:That's cool.. (Score:4, Funny)

Re:That's cool.. (Score:5, Funny)

Re:That's cool.. (Score:5, Informative)

Re:That's cool.. (Score:5, Informative)

Re: (Score:3, Interesting)

Re:That's cool.. (Score:5, Informative)

Re:That's cool.. (Score:5, Insightful)

Re:That's cool.. (Score:4, Funny)

Where's the Mods? (Score:5, Informative)

Dangerous (Score:4, Funny)

Re:Dangerous (Score:5, Funny)

Comment removed (Score:5, Funny)

Re:Dangerous (Score:5, Funny)

Re:Dangerous (Score:5, Funny)

Re:Dangerous (Score:4, Insightful)

Re: (Score:3, Funny)

Artificial Intelligence? (Score:4, Insightful)

Re:Artificial Intelligence? (Score:5, Informative)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:2)

AI? I don't think so. (Score:4, Insightful)

Re:Artificial Intelligence? (Score:5, Interesting)

Re:Artificial Intelligence? (Score:4, Insightful)

Re: (Score:2, Insightful)

Re: (Score:3, Funny)

Re: (Score:2)

Re: (Score:3, Interesting)

Re:Artificial Intelligence? (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

broken assumptions (Score:2)

Re: (Score:2)

I'll be reading the source... (Score:4, Interesting)

Re:I'll be reading the source... (Score:5, Interesting)

Yeah, about that source. (Score:2)

Re: (Score:2)

Re: (Score:2)

you can stop now (Score:2)

Lossy compression? (Score:5, Funny)

Re:Lossy compression? (Score:5, Interesting)

Re: (Score:3, Interesting)

Obligatory... (Score:5, Funny)

How to win the Hutter Prize (Score:5, Funny)

Re: (Score:3, Funny)

Re: (Score:2)

Re: (Score:3, Interesting)

which only goes to show... (Score:5, Informative)

AI (Score:2, Insightful)

Bah humbub (Score:2)

Why the Hutter Prize is Bullshit. (Score:3, Interesting)

Not truly == AI just yet (Score:5, Interesting)

Only 1% away from AI? (Score:2)

Very silly goal (Score:3, Interesting)