Text Compressor 1% Away From AI Threshold
Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including the decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. At 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon, and confirmed by Cover and King in 1978 using text-prediction gambling. When the Hutter Prize started less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
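For the curious, the 1.319 bpc figure is just the quoted sizes run through the contest's metric: total output bits (archive plus decompressor) divided by input characters. A quick check:

```python
# Hutter Prize metric: bits of output per character of input.
archive_bytes = 16_481_655    # compressed enwik8 + decompression program
input_chars = 100_000_000     # first 10^8 bytes of Wikipedia (enwik8)

bpc = archive_bytes * 8 / input_chars
print(f"{bpc:.3f} bits per character")   # -> 1.319
```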
new compression standard (Score:4, Informative)
Re:interesting program name (Score:5, Informative)
Where's the Mods? (Score:5, Informative)
Re:Huh? (Score:5, Informative)
Re:Artificial Intelligence? (Score:5, Informative)
Re:ai threshold? (Score:5, Informative)
paq8hp12 uses a neural network, i.e., it has a connectionist component.
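Specifically, the "neural network" in the PAQ family is a single-layer logistic mixer: many context models each emit a probability for the next bit, and the mixer combines those predictions in the logistic ("stretch") domain, updating its weights online against the bit that actually arrives. A minimal sketch of that idea — not Ratushnyak's actual code; the real program has hundreds of models and per-context weight sets:

```python
import math

def squash(x):
    """Logistic function: stretch-domain value -> probability."""
    return 1.0 / (1.0 + math.exp(-x))

def stretch(p):
    """Inverse of squash: probability in (0, 1) -> stretch domain."""
    return math.log(p / (1.0 - p))

class Mixer:
    """PAQ-style single-layer mixer: combine per-model bit
    probabilities and learn the weights online."""
    def __init__(self, n_models, lr=0.002):
        self.w = [0.0] * n_models
        self.lr = lr
        self.inputs = [0.0] * n_models

    def predict(self, probs):
        """Mix the models' probabilities for the next bit."""
        self.inputs = [stretch(p) for p in probs]
        return squash(sum(w * x for w, x in zip(self.w, self.inputs)))

    def update(self, p, bit):
        """Online gradient step on log loss: err = actual - predicted."""
        err = bit - p
        for i, x in enumerate(self.inputs):
            self.w[i] += self.lr * err * x
```

Coupled with an arithmetic coder, predict() prices each bit and update() learns from it; that online learning is the connectionist component.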
Huffman Example (Score:5, Informative)
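For reference, a minimal sketch of the textbook Huffman construction — frequent symbols get short codes, rare symbols long ones (built with Python's heapq):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table from symbol frequencies."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, tree), where tree is a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least-frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # degenerate one-symbol input
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
encoded = "".join(codes[c] for c in "abracadabra")
print(codes, len(encoded), "bits")   # 'a' gets the shortest code
```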
Re:Program size is 1.02 MB! (Score:5, Informative)
Re:That's cool.. (Score:5, Informative)
Mobile use is right out too, at least with current-generation equipment.
Looking at the numbers, this looks about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource-intensive. It's almost as if there's an asymptotic curve as you approach the absolute minimum theoretical compression ratio, where resource use just climbs ridiculously.
Maybe the next big challenge should be to achieve compression in a very resource-efficient way: a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...
Re:That's cool.. (Score:3, Informative)
Re:That's cool.. (Score:5, Informative)
This isn't true of all compression techniques, but it is true for many of them, especially advanced ones: compressing a short video to MPEG-4, for example, can take hours, yet most computers have no trouble decompressing it in real time.
which only goes to show... (Score:5, Informative)
Re:Huh? (Score:4, Informative)
Indeed Russell & Norvig is a very good book, well worth a read if you're interested in AI. All the same, when I did my BSc in Artificial Intelligence I found Rich & Knight [amazon.com] a much better, more understandable book for the purposes of an introductory text. It is a little dated now, but so is Russell & Norvig, to be honest.
Re:Program size is 1.02 MB! (Score:3, Informative)
Re:That's cool.. (Score:5, Informative)
No, the most time-consuming part of most video encoders (including H.263 and H.264) is finding how the blocks have moved: searching for good matches between one frame and another. For best results, H.264 allows the matches to come not only from the last frame, but from up to the last 16! That lets H.264 handle flickering content much better, as well as situations where something is quickly covered and uncovered again, e.g. a person or car moving across the frame, briefly covering parts of the background. Previous codecs did not handle those situations well and had to waste bandwidth redrawing blocks that had been on screen just a moment prior.
The point remains: most "compression" involves some sort of searching, which is not performed when decompressing; the sketch below shows the kind of search involved.
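At heart, the motion search is block matching: for each block of the current frame, scan a window in a reference frame for the offset with the lowest sum of absolute differences (SAD). A toy version of that inner loop — real encoders use fast heuristics, not this exhaustive scan, and the parameter names here are invented for illustration:

```python
import numpy as np

def best_motion_vector(cur, ref, bx, by, bs=16, search=8):
    """Exhaustive block matching: find the (dy, dx) offset in `ref`
    whose block best matches the block at (by, bx) in `cur`,
    scored by sum of absolute differences (SAD)."""
    block = cur[by:by + bs, bx:bx + bs].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
                continue  # candidate block falls outside the frame
            cand = ref[y:y + bs, x:x + bs].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

# usage: mv, sad = best_motion_vector(current_frame, previous_frame, 32, 32)
```

The decoder never runs this loop; it just copies the block from the transmitted offset, which is exactly the encode/decode asymmetry described above.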
Re:Artificial Intelligence? (Score:1, Informative)
Text Compression as a Test for Artificial Intelligence
http://citeseer.ist.psu.edu/171781.html [psu.edu]
PAQ8, Hutter Prize branch, version 12 (Score:5, Informative)
As far as I can tell given this Wikipedia article [wikipedia.org], "paq8hp12" means PAQ8, Hutter Prize branch, version 12.
Re:Huffman Example (Score:1, Informative)
Re:How much memory? (Score:1, Informative)
Re:AI? I don't think so. (Score:2, Informative)
Some clarification is in order. No one said that compressing to 1.x bits per character is sufficient to call something intelligent. The claim is just that this is the best humans can do, so logically we can't call anything as intelligent as a human unless it is at least this good at compression.
People are (incorrectly) presenting the link between compression and AI as: "If it can compress as well or better than Shannon's estimate, then it is intelligent." In fact, only the converse is true: "If it is intelligent, it can compress as well or better than Shannon's estimate." This is interesting, because the contrapositive of the latter ("If it cannot compress as well or better than Shannon's estimate, it is not intelligent") is a negative test of intelligence.
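In symbols, writing $I$ for "is intelligent" and $C$ for "compresses at or below Shannon's estimate":

```latex
% Misreading:                     C \Rightarrow I   (not claimed)
% Actual claim:                   I \Rightarrow C
% Contrapositive (negative test): \lnot C \Rightarrow \lnot I
\[
  (I \Rightarrow C) \;\equiv\; (\lnot C \Rightarrow \lnot I)
\]
```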
Recall the Turing Test of Intelligence: a human "decider" sits at a terminal and chats with whatever is on the other end. On the other end is the subject; randomly either a computer or a human. After some fixed amount of time, the decider is asked whether he was chatting with a human being. This test is run lots of times, for lots of deciders and lots of subjects. If the decider answers yes for the computer at least as often as he answers yes for the human, then the computer can be regarded as intelligent. A compression algorithm that beats the human threshold would imply that one gap between humans and computers has been closed, since the decider can no longer simply test to see if the subject can compress stuff. But there are a billion other tests a decider could employ that AIs still can't do... so AI is still 10 years off, just like it's been for 50 years now.
Erratum: 1.3 May Be Too High (Score:3, Informative)
Until Shannon-type experiments, involving humans doing next-character prediction on enwik8, are performed, the bounds of enwik8's entropy must remain unknown, though they are likely lower than 0.6 to 1.3. As such an experiment would be expensive, it is going to be difficult to say from any simple bpc measure when the Hutter Prize is breaching the threshold of AI. What the Hutter Prize's bpc metric does give us, however, is a clear measure of progress.
My apologies to the other members of the Hutter Prize Committee and the /. community for this error.
PS: Another area of concern raised by Mahoney is that enwik8, at 10^8 characters, contains only about as much verbal information as a 2- or 3-year-old has encountered. So although it is sufficient to demonstrate AI capabilities well beyond the current state of the art, his preference is for a much larger contest, with fewer resource restrictions, focusing on the 10^9-character enwik9, which is more likely to produce the AI equivalent of an adult with encyclopedic knowledge.