Slashdot Log In
First Hutter Prize Awarded
Posted by
kdawson
on Sun Oct 29, 2006 09:07 PM
from the compress-this dept.
from the compress-this dept.
stefanb writes, "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia, has been awarded for the first time. Alexander Ratushnyak managed to improve the compression factor to 5.86 and will receive a 3,416-Euro award. Being able to compress knowledge well is believed to be related to acting intelligently." The Usenet announcement notes that at Ratushnyak's request, part of the prize will go to Przemyslaw Skibinski of the University of Wroclaw Institute of Computer Science, for his early contributions to the PAQ compression algorithm.
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Seems like a strange contest (Score:5, Interesting)
Re: (Score:3, Informative)
And yes, offering cash in this way is a great incentive for programers.
Also, if its a cpu friendly (read: reasonable cpu useage or even able to be offloaded on a processing card) method it could potentially add 20% onto your bandwidth
Re:Seems like a strange contest (Score:5, Insightful)
Parent
Re: (Score:2)
Whatever is "reasonable" is just personal preference.
Re:Seems like a strange contest (Score:4, Informative)
Parent
Re: (Score:2)
Re: (Score:3, Interesting)
Re:Seems like a strange contest (Score:4, Insightful)
Parent
Re: (Score:2)
Re: (Score:2)
Wow, that makes me really wish I had entered. There are some great multipass lossless text compression systems that would work well for Wikipedia...
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
No clue whether it requires more code to compress than to decompress, but I would guess your overhead would be less than 30k. On an order of maybe 20MB of compressed data, that's not much.
GZIP is a bad example because it also works with binary data; it's too much. If you only have to worry about encountering 128 possible values, compression can get real interesting, and there are a lot mor
Re: (Score:2)
I really don't know jack about how compression is done, but I can toss out an idea. Scan article text for all of the used characters... expect to find your A-Z a-z 0-9 and some char
Re: (Score:2)
Re: (Score:2)
Most compressors in this space use a prediction algorithm to generate a signal, and then only compress the differences between the generated signal and the actual data. To get the best possible compression, yo
Re: (Score:2, Informative)
Lossless from Lossy? (Score:2, Funny)
Wikipedia? Knowledge? Isn't that already a lossy compression mechanism?
Hmm... (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
But there is for the PAQ algorithm (see link in summary) with mention of the awarding of the Hutter prize.
Re: (Score:3, Informative)
For comparison .... (Score:4, Informative)
WIkipedia-specific compression algorithms (Score:2)
makes one wonder (Score:2)
Re: (Score:2)
Re: (Score:3, Informative)
In all likelihood, that trick won't work: your program would need to know the start index of the Wikipedia text in the digit sequence of pi, and that index would be so astronomically huge that writing it down would be about as long as writing down Wikipedia itself. As a little example of the effect, think about
Re:makes one wonder (Score:5, Funny)
Then again, maybe we're lucky, and this universe is God's way of storing the universal encyclopedia in the digits of pi. Wikipedia might be up near the front somewhere.
Parent
Re: (Score:2)
Feynman Point [wikipedia.org]. From the Wikipedia article: "The Feynman point comprises the 762nd through 767th decimal places of pi
So, I dunno about "0000," but for some interesting sequences we may get lucky
Re: (Score:2)
If someone's program takes close to 10 hours to compress 100MB it better get considerably more than just an additional 1% out of it!
Re: (Score:2)
Indeed. Who in their right mind would spend 10 hours compressing instead of simply moving the entire dataset in a fraction of the time?
Oh, and for a good compression scheme:
uuencode filename filename | mail external@stora.ge
Then all the decompresser needs to do is fetch the mail and uudecode it.
Regards,
--
*Art
Re: (Score:2, Insightful)
If you'd RTFA you'd find the running times ranged from 30 minutes to 5 hours. They have a whole table and everything.
The whole point of the challenge was to create a self-executing compression program that made a perfect copy of their 100MB file. Final file sizes were in the 16MB range. Geeze, seriously RTFA.
compress knowledge = intelligence (Score:3, Insightful)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
If we want to make something less intelligent more intelligent, it seems likely that at some point, when its intelligence is in some sense the same as the average human's, that its behavior would be more similar to a human's than it is now. Of course, that's hardly a cer
Error in Wikipedia (Score:2)
The range for Shannon's experiments are actually between .6 and 1.3 bit per character.
This error even caught Dr. Dobbs compression expert, Mark Nelson [marknelson.us] so I guess it isn't too
Done before.... (Score:2)
what matters most... (Score:2)
This reminds me of the cryptographer paid to crypto-compress datastreams into musical notation. The work laid the foundation to real time packet sniffing of the Internet
Cryptographer? Got atta boys
Done on 9/25/06? (Score:3, Interesting)
The site also gives some of the requirements..
Create a compressed version (self-extracting archive) of the 100MB file enwik8 of less than 17MB. More precisely:
* Create a Linux or Windows executable of size S L
* If run, it produces (without input from other sources) a 108 byte file that is identical to enwik8.
* If we can verify your claim, you are eligible for a prize of 50'000×(1-S/L). Minimum claim is 500.
* Restrictions: Must run in 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
Re: (Score:2)
There is waiting period for public comment/verification etc...
Interesting related webpage (Score:2)
Info about contenders and results of common compression programs on the testset. (All the "just use gzip/rar/winrk/..." fools can stop jabbering now...)
Re:what the hell? (Score:4, Funny)
Parent
Re: (Score:2)
Re: (Score:2)
A system that understands the text it is compressing will compress better than one that doesn't.
The winner improved upon previous compression programs by adding semantic modelling. Improving the modeling further would improve the results. You're heading towards language understanding.
Re:Compression related to acting intelligently? (Score:5, Interesting)
Nope, I had heard about the contest before seeing it today on Slashdot. The summary is fine. That is in fact the thesis that the contest is designed to investigate.
As for why this is the case, I spent some time studying compression two or three years ago, and one of the things that you quickly realize is that there is no such thing as a truly general-purpose compression algorithm. Compression algorithms work on the principle that there is some underlying pattern or structure to your data. Once the structure is understood, the data can be transformed into a smaller format. Think of compressed files as a set of parameters to a magic function which generates a larger file. The real art is in finding the function.
To give a concrete example, one of the simplest forms of compression is Huffman coding. You analyze your stream of symbols, and you realize that some occur more often than others. You then assign shorter bit strings to more frequently-occurring symbols and longer ones to less frequent symbols. This gives you a net gain. You were able to do this because you had the insight that in human language, some letters (like "e") occur more frequently than others (like "q").
There are, of course, other patterns that can be exploited. You can take the frequency distribution trick above up to another level by noting that the frequency of a symbol is not really independent of the symbol before it. For example, the most likely symbol after a "q" is "u". Sure, you could have other things. But "u" is the most likely. You can exploit that to get better compression.
But of course, you might be compressing something other than English text. There are different techniques for whatever kind of data you're trying to compress. A row of pixels in a faxed (1-bit) image tends to be very close to the same as the previous row. Each pixel is a bit. If you represent a row of pixels not as on or off directly but instead represent it as its bit value XORed with the pixel above it, you end up with 0 bits for every pixel that hasn't changed and 1 bits for every value that hasn't. Presto, you have just managed to skew the probability distribution radically towards 0 bits, and you can use further tricks to gain compression from that.
The point of all this is that to come up with these tricks, you have to understand something about the data. One of the better definitions I've ever heard for data compression was simply "applied modeling".
Given that, you have to ask a question: have we reached the point today where compression algorithms are as good as they're going to get at compressing English text based on its surface-level characteristics such as character frequencies and repeated strings? Have we exhausted the low-level modeling that is possible? There has been a lot of work on this, so it's possible that we may have. And if so, then any further gains in the compression ratio could very well be the result of some sort of higher-level modeling. Maybe even modeling the knowledge rather than just the language. And that is what this contest is about, as I understand it.
It's not for sure that a winning contest entry necessarily requires the submitter to have developed some kind of machine intelligence. But it's an intriguing enough idea that it might be worthwhile to run the contest just to see if something interesting does happen.
By the way, for some interesting reading on this subject, look up Kolmogorov Complexity. The wikipedia seems to have a pretty decent article [wikipedia.org] on it (although I haven't read the whole thing).
Parent
Re: (Score:2)
The contest produces a hard, verifiable result with a hard restriction on resources that you can use to attain it.
If you look at the state of AI research, then you will understand that introducing some cold hard numbers wont hurt.