
Text Compressor 1% Away From AI Threshold 442

Posted by kdawson
from the second-hutter-prize dept.
Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including the decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. At 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text-prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
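The bits-per-character figure in the summary can be checked directly from the two sizes quoted above; a quick sketch:

```python
# Verify the 1.319 bits-per-character figure from the summary.
compressed_bytes = 16_481_655   # winning entry, including the decompressor
corpus_bytes = 100_000_000      # first 10^8 bytes of Wikipedia (enwik8)

bpc = compressed_bytes * 8 / corpus_bytes
print(f"{bpc:.3f} bits per character")  # → 1.319
```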
This discussion has been archived. No new comments can be posted.

Text Compressor 1% Away From AI Threshold

Comments Filter:
  • by tronicum (617382) * on Tuesday July 10, 2007 @02:17AM (#19810091)
    so...wikipedia dumps [wikipedia.org] will now be using paq8hp12 instead of l33t 7zip ?
  • by OverlordQ (264228) on Tuesday July 10, 2007 @02:26AM (#19810139) Journal
    Since I know people are going to be asking about the name, might I suggest the wiki article [wikipedia.org] about PAQ compression for the reasons behind the weird naming scheme.
  • Where's the Mods? (Score:5, Informative)

    by OverlordQ (264228) on Tuesday July 10, 2007 @02:30AM (#19810149) Journal
    The link in TFS links to the post about the FIRST payout, here's [google.com] the link to the second payout (which this article is supposed to be talking about).
  • Re:Huh? (Score:5, Informative)

    by headkase (533448) on Tuesday July 10, 2007 @02:32AM (#19810159)
    Compression is searching for a minimal representation of information. Along with representation of knowledge you add other things such as learning strategies, inference systems, and planning systems to round-out your AI. One of the best introductions to AI is Artificial Intelligence: A Modern Approach [berkeley.edu].
  • by MoonFog (586818) on Tuesday July 10, 2007 @02:44AM (#19810211)
    Shamelessly copied from the wikipedia article on the Hutter Prize [wikipedia.org]:

    The goal of the Hutter Prize is to encourage research in artificial intelligence (AI). The organizers believe that text compression and AI are equivalent problems. Hutter proved that the optimal behavior of a goal seeking agent in an unknown but computable environment is to guess at each step that the environment is controlled by the shortest program consistent with all interaction so far. Unfortunately, there is no general solution because Kolmogorov complexity is not computable. Hutter proved that in the restricted case (called AIXItl) where the environment is restricted to time t and space l, that a solution can be computed in time O(t2l), which is still intractable. Thus, AI remains an art.

    The organizers further believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test. Thus, progress toward one goal represents progress toward the other. They argue that predicting which characters are most likely to occur next in a text sequence requires vast real-world knowledge. A text compressor must solve the same problem in order to assign the shortest codes to the most likely text sequences.
  • Re:ai threshold? (Score:5, Informative)

    by Baldrson (78598) * on Tuesday July 10, 2007 @02:53AM (#19810253) Homepage Journal
    Non-connectionist previous attempts (the stuff that came from the functionalists) have come up pretty short - and will continue to do so even if scaled up massively.

    paq8hp12 uses a neural network, i.e., it has a connectionist component.

  • Huffman Example (Score:5, Informative)

    by headkase (533448) on Tuesday July 10, 2007 @02:57AM (#19810269)
    See: Explanation [wikipedia.org]. Basically the smallest unit of information in a computer is a bit. Eight bits make a byte and with text it takes one byte to represent one character. Generally, with Huffman coding you count the frequency of characters in a file and sort the frequency from largest to smallest. Then instead of using the full eight bits to represent a character you build a binary tree from the frequency table. Each possible branching code or going "left" or "right" down the branches is associated with a particular sequence of bits. You give the most frequent characters the shortest sequence of bits which "tokenizes" the information to be compressed. Reversing the process you run through the bit stream converting tokens back into a stream of characters.
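    The frequency-count-then-tree process described above can be sketched in a few lines of Python (a minimal illustration of Huffman coding, not the paq8hp12 code):

    ```python
    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build a Huffman code table: most frequent characters get shortest bit strings."""
        # Heap entries are (frequency, tiebreaker, node); leaf nodes are characters.
        heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            # Merge the two least frequent nodes into one internal node.
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            n += 1
            heapq.heappush(heap, (f1 + f2, n, (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, str):       # leaf: a single character
                codes[node] = prefix or "0"
            else:                           # internal node: 0 = left, 1 = right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2], "")
        return codes

    codes = huffman_codes("abracadabra")
    # 'a' occurs most often, so it must get the shortest bit sequence
    assert all(len(codes["a"]) <= len(c) for c in codes.values())
    print(codes)
    ```

    Decompression walks the same tree in reverse: follow the bit stream left/right until a leaf is hit, emit that character, and start over from the root.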
  • by Baldrson (78598) * on Tuesday July 10, 2007 @02:58AM (#19810271) Homepage Journal
    Actually, the size of the program (decompressor) binary is 99,696 bytes [hutter1.net], and it is the binary size that is included in the prize calculation.
  • Re:That's cool.. (Score:5, Informative)

    by Kadin2048 (468275) * <slashdot...kadin@@@xoxy...net> on Tuesday July 10, 2007 @02:58AM (#19810275) Homepage Journal
    Given that it takes something like ~17 hours (based on my rough calculations using the figures on WP) to compress 100MB of data using this algorithm on a reasonably fast computer ... I don't think you'd really want to use it for browsing from CD. No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)

    Mobile use is right out too, at least with current-generation equipment.

    Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.

    Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...
  • Re:That's cool.. (Score:3, Informative)

    by RuBLed (995686) on Tuesday July 10, 2007 @03:10AM (#19810325)
    The question is: does a mobile handheld device have enough processing power to decompress it in a reasonable time?
  • Re:That's cool.. (Score:5, Informative)

    by Anonymous Coward on Tuesday July 10, 2007 @03:29AM (#19810387)

    No decompression figure is given but I don't see any reason why it would be asymmetric. (Although if there's some reason why it would be dramatically asymmetric, it'd be great if someone would fill me in.)
    When compressing a file, the program has to figure out the best way to represent the data in compressed form before it actually compresses it; when decompressing, all it has to do is put the data back together according to the method the program previously picked.

    This isn't true of all compression techniques, but it's true for many of them, especially advanced ones: e.g., compressing a short video into MPEG-4 can take hours, yet most computers have no trouble decompressing it in real time.
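    The asymmetry is easy to observe even with an ordinary compressor such as zlib (a rough illustration only; whether and how far it applies to paq8hp12 is a separate question):

    ```python
    import time
    import zlib

    data = b"the quick brown fox jumps over the lazy dog " * 200_000  # ~8.8 MB

    t0 = time.perf_counter()
    packed = zlib.compress(data, level=9)   # level 9: spend effort searching for matches
    t1 = time.perf_counter()
    unpacked = zlib.decompress(packed)      # decompression just replays the choices
    t2 = time.perf_counter()

    assert unpacked == data
    print(f"compress:   {t1 - t0:.4f}s")
    print(f"decompress: {t2 - t1:.4f}s")    # typically several times faster
    ```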
  • by nanosquid (1074949) on Tuesday July 10, 2007 @03:34AM (#19810403)
    If you look at the description of PAQ, you'll see that it doesn't attempt to understand the text; it's just a grab-bag of other compression techniques mixed together. While that is nice for compression, it doesn't really advance the state of the art in AI.
  • Re:Huh? (Score:4, Informative)

    by DavidpFitz (136265) on Tuesday July 10, 2007 @04:24AM (#19810593) Homepage Journal

    One of the best introductions to AI is Artificial Intelligence: A Modern Approach. [berkeley.edu]

    Indeed Russell & Norvig is a very good book, well worth a read if you're interested in AI. All the same, when I did my BSc in Artificial Intelligence I found Rich & Knight [amazon.com] a much better, more understandable book for the purposes of an introductory text. It is a little dated now, but so is Russell & Norvig, to be honest.
  • by TuringTest (533084) on Tuesday July 10, 2007 @04:30AM (#19810625) Journal

    Wha wha wha? So why couldn't I just include a 100MB data file with my decompressor and claim an infinite compression ratio with just the following shell script: "cat datafile"
    Because then you'd also have to count the size of the UNIX system as part of your decoder program, and that would ruin your ratio.

  • Re:That's cool.. (Score:5, Informative)

    by imroy (755) <imroykun@gmail.com> on Tuesday July 10, 2007 @05:29AM (#19810861) Homepage Journal

    Probably not the best example. MPEG-4 encoding takes so much time because it's not classical compression; the encoder has to figure out which pieces are less psychovisually relevant to the big picture, and throw them away.

    No, the most time-consuming part of most video encoders (including H.263 and H.264) is finding how the blocks have moved: searching for good matches between one frame and another. For best results, H.264 allows the matches to come not only from the last frame, but from up to the last 16! That lets H.264 handle flickering content much better, as well as situations where something is quickly covered and uncovered again, e.g. a person or car moving across the frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that were on screen just a moment prior.

    The point does remain: most "compression" involves some sort of searching that is not performed when decompressing.
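    The block-matching search described above can be sketched as a toy full search over a small window, scoring candidates by sum of absolute differences (SAD); real encoders use far cleverer search patterns, but the shape of the work is the same:

    ```python
    import numpy as np

    def best_match(prev, block, top, left, radius=4):
        """Find the motion vector minimizing SAD between `block` and
        same-sized regions of the previous frame near (top, left)."""
        h, w = block.shape
        best, best_sad = (0, 0), float("inf")
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + h > prev.shape[0] or x + w > prev.shape[1]:
                    continue  # candidate region falls outside the frame
                sad = np.abs(prev[y:y+h, x:x+w].astype(int) - block.astype(int)).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best, best_sad

    # A frame whose content has shifted right by 2 pixels between frames.
    rng = np.random.default_rng(0)
    prev = rng.integers(0, 256, (32, 32), dtype=np.uint8)
    curr = np.roll(prev, 2, axis=1)            # whole frame shifted right by 2
    (dy, dx), sad = best_match(prev, curr[8:16, 8:16], 8, 8)
    print(dy, dx, sad)                          # expect (0, -2) with SAD 0
    ```

    Note the asymmetry: the encoder tries every offset in the window, while the decoder simply copies the one block the motion vector names.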

  • by Anonymous Coward on Tuesday July 10, 2007 @07:24AM (#19811295)
    Actually, here's a paper on just that.

    Text Compression as a Test for Artificial Intelligence
    http://citeseer.ist.psu.edu/171781.html [psu.edu]
  • by tepples (727027) <tepples AT gmail DOT com> on Tuesday July 10, 2007 @07:35AM (#19811349) Homepage Journal

    As far as I can tell given this Wikipedia article [wikipedia.org], "paq8hp12" means PAQ8, Hutter Prize branch, version 12.

  • Re:Huffman Example (Score:1, Informative)

    by Anonymous Coward on Tuesday July 10, 2007 @08:30AM (#19811719)
    This behaviour is configurable in your Slashdot preferences, under Comments > Display Link Domains.
  • Re:How much memory? (Score:1, Informative)

    by Anonymous Coward on Tuesday July 10, 2007 @09:14AM (#19812101)

    I think the Wikipedia dumps are built to run on machines that don't have nearly 1 GiB of RAM [binet.com.ua].
    For compression, that is. Chances are it won't need anywhere near that to decompress (though I can't find any references). So as long as someone else compressed the dump you'll likely be able to use it on your weaker hardware.
  • by dkoulomzin (320266) on Tuesday July 10, 2007 @02:43PM (#19816453)

    It is not equivalent, so I'm not surprised you didn't get it. As far as I know, the reasoning goes as follows: Shannon estimated that each character contains 1.something bits of information. Shannon was an intelligent human being. So if a program reaches this limit, it is as smart as Shannon.
    I don't know where you got that idea... but trust me, no computer scientist or mathematician would ever present such an illogical argument. It's kind of like saying "Dan thinks elephants have trunks. Dan is smart. So if an elephant has a trunk, it's as smart as Dan."

    And yes, that's absolute bollocks.
    Worse, it's absolutely spurious: there's a good chance you made it up to sound smart.

    A reasonable explanation is in order. No one said that compression to 1.x bits per character is sufficient to call something intelligent. They just claim that this is the best that humans can do, so logically we can't call anything as intelligent as a human unless it's at least this good at compression.

    People are (incorrectly) presenting the link between compression and AI as: "If it can compress as well or better than Shannon's estimate, then it is intelligent." In fact, only the converse is true: "If it is intelligent, it can compress as well or better than Shannon's estimate." This is interesting, because the contrapositive of the latter ("If it cannot compress as well or better than Shannon's estimate, it is not intelligent") is a negative test of intelligence.

    Recall the Turing Test of Intelligence: a human "decider" sits at a terminal and chats with whatever is on the other end. On the other end is the subject; randomly either a computer or a human. After some fixed amount of time, the decider is asked whether he was chatting with a human being. This test is run lots of times, for lots of deciders and lots of subjects. If the decider answers yes for the computer at least as often as he answers yes for the human, then the computer can be regarded as intelligent. A compression algorithm that beats the human threshold would imply that one gap between humans and computers has been closed, since the decider can no longer simply test to see if the subject can compress stuff. But there are a billion other tests a decider could employ that AIs still can't pass... so AI is still 10 years off, just like it's been for 50 years now.
  • by Baldrson (78598) * on Wednesday July 11, 2007 @02:11PM (#19828071) Homepage Journal
    Matt Mahoney has communicated his concern to me that the 1.3 bits per character entropy measured by Shannon is likely a smaller number with the enwik8 corpus due to regularities from embedded markup. He has already compressed enwik9 (1,000,000,000 bytes) to less than 1.3 bits per character and his analysis shows that this is largely due to a large section of data tables present in that larger sample -- which entails a large amount of embedded markup. While the entropy of enwik8 is unlikely to be as low as enwik9, this difference does evidence the lower entropy of embedded markup.

    Until Shannon-type experiments, involving humans doing next-character prediction on enwik8, are performed, the bounds of enwik8's entropy range must remain unknown, though they are likely lower than 0.6 to 1.3. As such an experiment would be expensive, it is going to be difficult to say from any simple bpc measure when the Hutter Prize is breaching the threshold of AI. What the Hutter Prize's bpc metric gives us, however, is a clear measure of progress.

    My apologies to the other members of the Hutter Prize Committee and the /. community for this error.

    PS: Another area of concern raised by Mahoney is that enwik8, at 10^8 characters, contains only as much verbal information as a 2- or 3-year-old has encountered. So although it is sufficient to demonstrate AI capabilities well beyond the current state of the art, his preference is for a much larger contest, with fewer resource restrictions, focusing on the 10^9-character enwik9, which is more likely to produce the AI equivalent of an adult with encyclopedic knowledge.
