Rough design for a data compression algorithm

„Developing algorithms for data compression is like developing algorithms for chess - specialized.“

General thoughts (software)

Compression software is good if it has high practical value. In practice, the user probably wants to compress a file in one of the known file formats or, in the case of text files, one of the known use cases such as log files, plain text, or source code. The best possible compression software should know most of the common file formats exactly, store all of their information losslessly in a compressed format suited to the original file format, and also take into account files that are similar to one another (e.g. photos taken in similar light conditions). This requires a huge piece of software and a lot of work (like they do at Prezi), so creating a compressor that is highly competitive in practice is not a task for a solitary hobby programmer. It is probably not a task for large firms either, as the ROI (return on investment) might be low, since a system administrator can usually choose to convert the data and store it in the file format best suited for storage. There is one exception to this, AI (artificial intelligence), which might be able to learn about file formats and use this knowledge for compression. However, this seems much harder than manually writing compression code for the different file formats, and our goal here is only compression, and maybe only something able to win the Hutter Prize.

Small algorithms can be famous

Still, it should be possible to write algorithms that are better than other algorithms in the general case, or in the case of the Hutter Prize challenge. The history of compression methods may be likened to the history of sorting algorithms, many of which are famous. This, however, is not a conventional software development task, as it might or might not succeed. It is more of a research and creativity task, which can only be finished by intelligent people.

It is worth distributing free of charge

If we can find a small, general compression algorithm similar to ZIP (or LZW), but much more efficient, it will probably be worth distributing it free of charge (e.g. as an MIT-licensed library, in different programming languages, with software for different operating systems). There are already a lot of compression algorithms (and software) on the market, so it will probably need to be free in order to spread, even if it can win the Hutter Prize. The possible fame of overtaking the ZIP file format in popularity is probably worth more than the possible money earned from a proprietary format, as there are already other proprietary compressed file formats on the market.

This is not for the poor either

It may turn out that the security level of most operating systems is generally poor if we have professional enemies on the Internet or elsewhere. This means that it may not be worth developing a long project on our computer without publishing it, because crackers might steal it from our machine and publish it before we do. Only the rich have the resources necessary to protect their machines from such attacks, e.g. by keeping an offline-only computer in a safe just for development, watched over by security guards. It is not the responsibility of the poor to make many sacrifices for the protection of the environment, but it is the responsibility of the rich. The poor need sources of joy like mathematics, love or boredom.

By the way, this would have been the idea for the compression algorithm

The problem a compression program solves can also be described as a mathematical problem: there is a sequence of numbers (which can stand for words or letters too), in which different numbers occur with different frequencies, and the task is to encode this sequence as a sequence of bits that is as short as possible. Present-day compression programs solve this by taking into account the frequencies of individual numbers in the whole sequence, and the frequencies with which numbers occur after other numbers in the whole sequence. It seems, however, that they usually do not sort the members of the sequence into syntactic categories, and they do not take advantage of the case when a symbol (or number, in the mathematical model) occurs frequently only in some areas. For the latter case, Arpad Fekete's idea was simple: just list all the words (or numbers, symbols) in the order of their first occurrence (maybe with some useful background information), and afterwards let the code bits say how far ahead the same word (or number, symbol) will occur again.
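
The idea above can be sketched in a few lines of Python. This is only one possible reading of it, not the TomorIT code: the function names encode and decode are made up for illustration, the gaps are kept as plain integers (turning them into actual bits, e.g. with a variable-length integer code, is left out), and a gap of 0 is assumed to mean "this symbol does not occur again".

```python
def encode(tokens):
    """Return (symbols in order of first occurrence, gap to next occurrence per position).

    Hypothetical sketch: a gap of 0 means the symbol never occurs again.
    """
    gaps = [0] * len(tokens)
    next_pos = {}  # symbol -> index of its nearest later occurrence
    # Walk backwards so the nearest later occurrence of each symbol is already known.
    for i in range(len(tokens) - 1, -1, -1):
        tok = tokens[i]
        if tok in next_pos:
            gaps[i] = next_pos[tok] - i
        next_pos[tok] = i
    # dict.fromkeys keeps insertion order, giving the order of first occurrence.
    first_seen = list(dict.fromkeys(tokens))
    return first_seen, gaps


def decode(first_seen, gaps):
    """Rebuild the original sequence from the output of encode()."""
    scheduled = {}  # position -> symbol announced by an earlier gap
    new_symbols = iter(first_seen)
    out = []
    for i, gap in enumerate(gaps):
        if i in scheduled:
            tok = scheduled.pop(i)
        else:
            # Nothing was announced for this position, so it must be a new symbol.
            tok = next(new_symbols)
        out.append(tok)
        if gap:
            scheduled[i + gap] = tok
    return out


words = "the cat saw the dog and the cat ran".split()
symbols, gaps = encode(words)
restored = decode(symbols, gaps)
```

In this sketch, a symbol that recurs densely in one area produces many small gaps there, and small numbers are cheap to write with a variable-length code; that is the property the idea tries to exploit.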

And yet!

Arpad Fekete wanted to improve his job security, which is why he preferred programming to other projects, so he started to implement the compression algorithm after all. It was named TomorIT, and Google Code hosted it, as it was released under the MIT license, written in C++, and maintained with the Git version control system. However, the project was abandoned in favor of more interesting projects, so its code has been placed in the public domain and shared here so that anyone can continue: tomorit.zip

Project

3: