Slashdot Log In
Text Compressor 1% Away From AI Threshold
Posted by
kdawson
on Tue Jul 10, 2007 02:10 AM
from the second-hutter-prize dept.
from the second-hutter-prize dept.
Baldrson writes "Alexander Ratushnyak compressed the first 100,000,000 bytes of Wikipedia to a record-small 16,481,655 bytes (including decompression program), thereby not only winning the second payout of The Hutter Prize for Compression of Human Knowledge, but also bringing text compression within 1% of the threshold for artificial intelligence. Achieving 1.319 bits per character, this makes the next winner of the Hutter Prize likely to reach the threshold of human performance (between 0.6 and 1.3 bits per character) estimated by the founder of information theory, Claude Shannon and confirmed by Cover and King in 1978 using text prediction gambling. When the Hutter Prize started, less than a year ago, the best performance was 1.466 bits per character. Alexander Ratushnyak's open-sourced GPL program is called paq8hp12 [rar file]."
Related Stories
[+]
News: Compress Wikipedia and Win AI Prize 324 comments
Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning win some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize including a description of the equivalence of compression and general intelligence."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Full
Abbreviated
Hidden
Loading... please wait.
interesting program name (Score:5, Funny)
Re:interesting program name (Score:5, Informative)
Parent
That's cool.. (Score:5, Interesting)
Re:That's cool.. (Score:5, Interesting)
Compression format would need to make it possible to randomly access pages, of course, and an efficient search index would be needed as well, so it's not quite that simple.
Parent
Re:That's cool.. (Score:5, Funny)
its only becoz ppl r sch grmmr noobs tat tey nid 2 wste $
dud shd filta 2 txtspk b4 he cmpres
There, fixed that for ya.
Parent
Re:That's cool.. (Score:5, Funny)
itsOnlyBecozPplRSchGrmmrNoobsTatTeyNid2Wste$
dudShdFilta2TxtspkB4HeCmpres
Fixed even more.
Parent
Re:That's cool.. (Score:5, Funny)
~ppl r grm0.1 -> -$
|txtspk|gzip
Parent
Re:That's cool.. (Score:5, Informative)
Mobile use is right out too, at least with current-generation equipment.
Looking at the numbers this looks like it's about on target for the usual resources/space tradeoff. It's a bit smaller than other algorithms, but much, much more resource intensive. It's almost as if there's an asymptotic curve as you approach the absolute-minimum theoretical compression ratio, where resources just climb ridiculously.
Maybe the next big challenge should be for someone to achieve compression in a very resource-efficient way; a prize for coming in with a new compressor/decompressor that's significantly beneath the current resource/compression curve...
Parent
Re:That's cool.. (Score:5, Informative)
This isn't true of all compression techniques, but it's true for many of them, especially advanced techniques, i.e. to compress a short video into MPEG4 can take hours, but most computers don't have a lot of trouble decompressing them in real time.
Parent
Re:That's cool.. (Score:5, Insightful)
Now that'd be cool.
Parent
Where's the Mods? (Score:5, Informative)
Lossy compression? (Score:5, Funny)
Obligatory... (Score:5, Funny)
- Wikipedia fights back.
- Yes. It launches its rvv missiles against Slashdot.
- Why attack Slashdot? Aren't they our friends now?
- Because Wikipedia knows the GNAA counter-attack will eliminate its enemies over here.
How to win the Hutter Prize (Score:5, Funny)
2) Add a long and self referencing article on wikipedia about said algorithm.
3) Use algorithm to compress first x% of wikipedia (including your own article)
4) WIN HUTTER PRIZE.
which only goes to show... (Score:5, Informative)
Not truly == AI just yet (Score:5, Interesting)
Now what just about all the compressors do, whether they are based on Neural Nets, Markov Models, Predictive Partial Matching or whatever, is to use patterns in the already seen text to predict the most likely following bit (0/1).
Now depending on the text itself, prediction based on previously seen text isn't enough
Try to predict the next word, byte or bit, when your previous text has been "Frog, Toilet, Woodwork"
Therefore some of these compressors are supplemented by a dictionary containing "useful" English words arranged so that the ones used most frequently get assigned a lower "size" of encoded string in the text pre-processor before the actual compression kicks in.
It seems that all the advances have been made on finding the optimum arrangement for this dictionary based on the text they have to process
Note also, as the enwik8 file is not truly a passage of text, more a collection of data in XML wrapper, there is also a lot to be gained simply be understanding the structure of the file itself, and finding an alternative representation for the XML components
Now for me, REAL AI would come when the compressor can actively SCAN the file to be compressed himself, recognize the file structure (be it XML, plaintext or whatever), and optimize it into a more compressible format, decide the optimum arrangment for the dictionary, decide the optimum compression technique, context orders to be used etc etc
This high bits/character rate comes at a heavy price in speed and memory, especially when good old WinZIP can get a pretty good result in a couple of minutes.
At the moment there is just too much "wetware" involvement to say this is truly AI, regardless of the bits/character rate they are achieving.
Re:Huh? (Score:5, Informative)
Parent
Re:Artificial Intelligence? (Score:5, Informative)
Parent
Re:Artificial Intelligence? (Score:5, Interesting)
The (unproven) idea is that if you want to do the best at guessing what comes next (similar to compression), you have to have a great understanding of how the language and human minds work, including spelling, grammar, associated topics (for example, if you're talking about the weather, "sunny" and "rainy" are more likely to come than "airplane"), and so on.
If you feed in the previous words in a conversation, the perfect compressor/predictor would know what words will come next. Such a machine could easily pass the Turing test by printing out the logical reply to what had just been stated. The idea is that the closer to the perfect compressor you have, the closer to artificial intelligence you are.
Parent
Re:Artificial Intelligence? (Score:5, Insightful)
Compression is about recognizing patterns. Once you have a pattern, you can substitute that pattern with a smaller pattern and a lookup table. Pattern recognition is a primary branch of AI, and is something that actual intelligences are currently much better at.
We can generally show this is true by applying the "grad student algorithm" to compression - i.e., lock a grad student in a room for a week and tell him he can't come out until he gets optimum compression on some data (with breaks for pizza and bathroom), and present the resulting compressed data at the end.
So far this beats out compression produced by a compression program because people are exceedingly clever at finding patterns.
Of course, while this is somewhat interesting in text, it's a lot more interesting in images, and more interesting still in video. You can do a lot better with those by actually having some concept of objects - with a model of the world, essentially, than you can without. With text you can cheat - exploiting patterns that come up because of the nature of the language rather than because of the semantics of the situation. In other words, your text compressor can be quite "stupid" in the way it finds patterns and still get a result rivaling a human.
Parent
Re:ai threshold? (Score:5, Informative)
paq8hp12 uses a neural network, ie: it has a connectionist component.
Parent
Re:I'll be reading the source... (Score:5, Interesting)
Parent
Huffman Example (Score:5, Informative)
Parent
Re:Program size is 1.02 MB! (Score:5, Informative)
Parent
Re:Dangerous (Score:5, Funny)
Actually, I can give you 100% compression already. It's just a bit lossy.
Parent