Is bzip2 Turing complete? - compression

Or any other compression algorithm, for that matter.
(Then again, if there were a Turing-complete compression algorithm, would it still be considered a compression algorithm, rather than a programming language?)

The question might almost make sense if you had asked about the de-compressor, as opposed to the compressor. The job of a compressor is effectively to write a program to be executed by the decompressor that will recreate the original file being compressed. The program is written in the language that is the compressed data format.
The answer to that question is no, the bzip2 decompressor is not Turing complete, since it has no way to loop or recurse. Nor does the decompressor for any other standard compression format that I am aware of.
Update:
It appears to have since been deprecated due to security concerns, but apparently WinRAR had a post-processing language built into the decompressor called RarVM, which was a Turing-complete machine for implementing arbitrarily complex pre-compression filters for data.

Related

Have the MonetDB developers tested any other compression algorithm on it?

Have the MonetDB developers tested any other compression algorithms on it before?
Perhaps they have tested other compression algorithms, but those had a really negative performance impact.
If so, why haven't they improved this database's compression performance?
I am a student from China. MonetDB really interests me and I want to try to improve its compression performance.
So I want to make sure whether anybody has done this before.
I would be grateful if you could answer my question, because I really need this.
Thank you so much.
MonetDB only compresses String (Varchar and char) types using dictionary compression and only if the number of unique strings in a column is small.
Integrating any other kind of compression (even simple ones like Prefix-Coding, Run-length Encoding, Delta-compression, ...) needs a complete rewrite of the system, because the operators have to be made compression-aware (which pretty much means creating a new operator).
The only thing that may be feasible without a complete rewrite is having dedicated compression operators that compress/decompress data instead of spilling to disk. However, this would be very close to the memory compression Apple implemented in OS X Mavericks.
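To illustrate the dictionary compression mentioned at the start of this answer, here is a toy C sketch (not MonetDB's actual code; the 256-entry cap is an arbitrary assumption):

    #include <string.h>

    /* Toy sketch of column dictionary encoding: each distinct string is
     * stored once and the column itself keeps only small integer codes.
     * A real column store would use a hash table instead of this linear
     * scan and fall back to plain storage when too many distinct values
     * appear. */
    #define MAX_DICT 256   /* assumption: few unique strings, codes fit in a byte */

    struct dict {
        const char *values[MAX_DICT];
        int count;
    };

    /* Returns the code for s, adding it to the dictionary if unseen;
     * -1 if the dictionary is full (i.e. the column is not worth
     * dictionary-encoding). */
    int dict_encode(struct dict *d, const char *s)
    {
        for (int i = 0; i < d->count; i++)
            if (strcmp(d->values[i], s) == 0)
                return i;
        if (d->count == MAX_DICT)
            return -1;
        d->values[d->count] = s;
        return d->count++;
    }

    /* Decoding a column value is just an array lookup: d->values[code]. */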
MonetDB compresses columns using PFor compression. See http://paperhub.s3.amazonaws.com/7558905a56f370848a04fa349dd8bb9d.pdf for details. This also answers your question about checking other compression methods.
The choice for PFOR is because of the way modern CPUs work, but really any algorithm that works not with branches but with (only) arithmetic will do just fine. I've hit similar speeds with arithmetic coding in the past.
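As a rough illustration of the "arithmetic instead of branches" point, here is a toy sketch of the frame-of-reference idea behind PFOR (not MonetDB's implementation):

    #include <stdint.h>
    #include <stddef.h>

    /* Each block stores a base value plus small per-value offsets; the
     * decode loop is pure arithmetic with no data-dependent branches,
     * which is what keeps modern pipelined/SIMD CPUs busy.  Real PFOR
     * additionally bit-packs the offsets and patches "exception" values
     * that do not fit the chosen bit width; both are omitted here. */
    void for_decode(uint32_t base, const uint8_t *offsets, size_t n,
                    uint32_t *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = base + offsets[i];   /* branch-free, trivially vectorizable */
    }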

Packet oriented lossless compression library

Does anyone know of a free (non-GPL), decently performing compression library that supports packet oriented compression in C/C++?
By packet oriented, I mean the kind of feature QuickLZ (GPL) has, where multiple packets of a stream can be compressed and decompressed individually while a history is maintained across packets to achieve sensible compression.
I'd favor compression ratio over CPU usage as long as the CPU usage isn't ridiculous, but I've had a hard time finding this feature at all, so anything is of interest.
zlib's main deflate() function takes a flush parameter, which allows various flushing modes. If you pass Z_SYNC_FLUSH at the end of each packet, that should produce the desired effect.
The details are explained in the zlib manual.
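A minimal sketch of what that looks like in C (the function and buffer handling here are my own; error handling is abbreviated):

    #include <zlib.h>

    /* Compress one packet on an already-initialized z_stream (one per
     * connection, set up once with deflateInit()).  Z_SYNC_FLUSH emits
     * all pending output and ends on a byte boundary so the receiver can
     * decode the packet immediately, but it does NOT reset the history,
     * so later packets still benefit from earlier ones. */
    int compress_packet(z_stream *strm,
                        const unsigned char *packet, unsigned packet_len,
                        unsigned char *out, unsigned out_cap, unsigned *out_len)
    {
        strm->next_in   = (Bytef *)packet;
        strm->avail_in  = packet_len;
        strm->next_out  = out;
        strm->avail_out = out_cap;

        if (deflate(strm, Z_SYNC_FLUSH) == Z_STREAM_ERROR)
            return -1;                       /* stream state is inconsistent */

        *out_len = out_cap - strm->avail_out;
        return 0;
    }

On the receiving side, a single z_stream initialized with inflateInit() is fed each packet the same way, so the shared history survives across packets.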
bzip2 has flushing functionality as well, which might let you do this kind of thing. See http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzCompress
Google's Snappy may be a good option, if you need speed more than compression and are just looking to save a moderate amount of space.
Alternatively, Ilia Muraviev put a small piece of compression code called BALZ in public domain some time ago. It is quite decent for many kinds of data.
Both of these support stream flushing and independent state variables to do multiple, concurrent streams across packets.
Google's new SPDY protocol uses zlib to compress individual messages, and maintains the zlib state for the life of the connection to achieve better compression. I don't think there's a standalone library that handles this behavior exactly, but there are several open-source implementations of SPDY that could show you how it's done.
The public domain Crush algorithm by Ilia Muraviev has similar performance and compression ratio as QuickLZ has, Crush being a bit more powerful. The algorithms are conceptually similar too, Crush containing a bit more tricks. The BALZ algorithm that was already mentioned earlier is also by Ilia Muraviev. See http://compressme.net/
Maybe you could use the LZMA compression SDK; it's written and placed in the public domain by Igor Pavlov.
And since it can compress stream files and has memory-to-memory compression, I think it's possible to compress a packet stream (maybe with some changes), but I'm not sure.

File Compressor In Assembly

In an effort to get better at programming assembly, and as an academic exercise, I would like to write a non-trivial program in x86 assembly. Since file compression has always been kind of an interest to me, I would like to write something like the zip utility in assembly.
I'm not exactly out of my element here, having written a simple web server using assembly and coded for embedded devices, and I've read some of the material for zlib (and others) and played with its C implementation.
My problem is finding a routine that is simple enough to port to assembly. Many of the utilities I've inspected thus far are full of #define's and other included code. Since this is really just for me to play with, I'm not really interested in super-awesome compression ratios or anything like that. I'm basically just looking for the RC4 of compression algorithms.
Is Huffman coding the path I should be looking down, or does anyone have another suggestion?
A more sophisticated algorithm that should not be too hard to implement is LZ77; there are writeups of it with assembly examples, as well as sites covering many other compression algorithms.
One option would be to write a decompressor for DEFLATE (the algorithm behind zip and gzip). zlib's implementation is going to be heavily optimized, but the RFC gives pseudocode for a decoder. After you have learned the compressed format, you can move on to writing a compressor based on it.
I remember a project from second year computing science that was something similar to this (in C).
Basically, compressing involves replacing a run like xxxxx (5 x's) with #\005x (an escape byte, a byte with the value 5, followed by the repeated byte). This algorithm is very simple. It doesn't work that well for English text, but works surprisingly well for bitmap images.
Edit: what I am describing is run length encoding.
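A minimal C sketch of the scheme described above (the escape byte '#' and the run-length threshold are arbitrary choices; a real encoder also has to escape literal '#' bytes):

    #include <stddef.h>

    #define ESC '#'   /* escape byte marking an encoded run */

    /* Encode runs of identical bytes as ESC, count, byte.  Runs shorter
     * than 4 bytes are copied through as literals because encoding them
     * would expand the data.  'out' must be large enough for the result
     * (literal ESC bytes in the input are not handled in this sketch). */
    size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            size_t run = 1;
            while (i + run < n && in[i + run] == in[i] && run < 255)
                run++;
            if (run >= 4) {
                out[o++] = ESC;
                out[o++] = (unsigned char)run;
                out[o++] = in[i];
            } else {
                for (size_t k = 0; k < run; k++)
                    out[o++] = in[i];
            }
            i += run;
        }
        return o;   /* number of bytes written */
    }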
Take a look at UPX executable packer. It contains some low-level decompressing code as part of unpacking procedures...

What is the current state of text-only compression algorithms?

In honor of the Hutter Prize,
what are the top algorithms (and a quick description of each) for text compression?
Note: The intent of this question is to get a description of compression algorithms, not of compression programs.
The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:
The Burrows-Wheeler Transform - shuffles characters (or other bit blocks) with a predictable algorithm to increase repeated blocks, which makes the source easier to compress (a naive sketch follows this list). Decompression occurs as normal and the result is un-shuffled with the reverse transform. Note: BWT alone doesn't actually compress anything. It just makes the source easier to compress.
Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is created by crunching statistics about the source rather than using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding.
Context Mixing - arithmetic coding uses a static context for prediction, PPM dynamically chooses a single context, and Context Mixing uses many contexts and weights their results. PAQ uses context mixing. Here's a high-level overview.
Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
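Since the BWT is the least intuitive of these, here is a naive C sketch of the forward transform (real implementations use suffix arrays rather than materializing and sorting every rotation, and store the primary index so the transform can be inverted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Naive Burrows-Wheeler Transform: sort all rotations of the input
     * and emit the last column. */
    static const char *g_text;
    static size_t g_len;

    static int cmp_rot(const void *a, const void *b)
    {
        size_t i = *(const size_t *)a, j = *(const size_t *)b;
        for (size_t k = 0; k < g_len; k++) {
            unsigned char ci = g_text[(i + k) % g_len];
            unsigned char cj = g_text[(j + k) % g_len];
            if (ci != cj) return ci - cj;
        }
        return 0;
    }

    /* out must hold len bytes; returns the index of the original rotation. */
    size_t bwt(const char *text, size_t len, char *out)
    {
        size_t *rot = malloc(len * sizeof *rot);
        size_t primary = 0;
        g_text = text; g_len = len;
        for (size_t i = 0; i < len; i++) rot[i] = i;
        qsort(rot, len, sizeof *rot, cmp_rot);
        for (size_t i = 0; i < len; i++) {
            out[i] = text[(rot[i] + len - 1) % len]; /* last column */
            if (rot[i] == 0) primary = i;            /* needed to invert */
        }
        free(rot);
        return primary;
    }

    int main(void)
    {
        const char *s = "banana";
        char out[7] = {0};
        size_t p = bwt(s, strlen(s), out);
        printf("BWT(%s) = %s, primary index = %zu\n", s, out, p); /* nnbaaa, 3 */
        return 0;
    }

Note how the repeated characters cluster together in the output ("nnbaaa"), which is exactly what a following run-length or entropy stage exploits.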
In addition, the Hutter prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper and lower case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
There's always lzip.
All kidding aside:
Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between enjoying a relatively broad install base and offering a rather good compression ratio, but it requires a separate archiver.
7-Zip (LZMA algorithm) compresses very well and is available under the LGPL. Few operating systems ship with built-in support, however.
rzip is a variant of bzip2 that in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
If you want to use PAQ as a program, you can install the zpaq package on debian-based systems. Usage is (see also man zpaq)
zpaq c archivename.zpaq file1 file2 file3
Compression was to about 1/10th of a zip file's size (1.9M vs 15M).

best compression algorithm with the following features

What is the best compression algorithm with the following features:
should take less time to decompress (it can take reasonably more time to compress)
should be able to compress sorted data (an approximate list of 3,000,000 strings/integers ...)
Please suggest, along with metrics: compression ratio and algorithmic complexity for compression and decompression (if possible).
There is an entire site devoted to compression benchmarking here.
Well, if you just want speed, then standard ZIP compression is just fine and it's most likely integrated into your language/framework already (e.g. .NET has it, Java has it). Sometimes the most universal solution is the best: ZIP is a very mature format, and any ZIP library and application will work with any other.
But if you want better compression, I'd suggest 7-Zip, as the author is very smart, easy to get hold of, and encourages people to use the format.
Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.
You don't have to worry about decompression time. The extra time spent at a higher compression level goes mostly into finding the longest matching pattern.
Decompression either:
1) writes the literal byte, or
2) for a (backward position, length) = (m, n) pair, goes back m bytes in the output buffer, reads n bytes, and writes those n bytes at the end of the buffer.
So the decompression time is independent of the compression level. And, from my experience with the Universal Decompression Virtual Machine (RFC 3320), I guess the same is true for any decompression algorithm.
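For illustration, here is what that copy loop looks like in C for an LZ77-style format (the token layout is invented for the example; real formats such as DEFLATE pack these fields with Huffman codes):

    #include <stddef.h>

    /* A token is either a literal byte or a (distance, length) pair that
     * copies bytes already written to the output buffer. */
    struct token {
        int is_literal;
        unsigned char literal;      /* valid if is_literal */
        size_t distance, length;    /* otherwise: go back 'distance' bytes */
    };

    size_t lz_decode(const struct token *tok, size_t ntok, unsigned char *out)
    {
        size_t o = 0;
        for (size_t t = 0; t < ntok; t++) {
            if (tok[t].is_literal) {
                out[o++] = tok[t].literal;            /* case 1: literal */
            } else {
                size_t src = o - tok[t].distance;     /* case 2: back-reference */
                for (size_t k = 0; k < tok[t].length; k++)
                    out[o++] = out[src + k];          /* byte-wise copy allows overlap */
            }
        }
        return o;   /* same work per output byte, whatever the compression level */
    }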
This is an interesting question.
On such sorted data of strings and integers, I would expect that difference-coding compression approaches would outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio. General-purpose encoders do not exploit the special properties of the data.
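For example, a minimal sketch of difference (delta) coding for a sorted integer list (the function names are mine; a real pipeline would follow this with a varint or bit-packing stage on the small gaps):

    #include <stdint.h>
    #include <stddef.h>

    /* Store the first value and then only the gaps between consecutive
     * values; on sorted data the gaps are small and highly compressible. */
    void delta_encode(const uint64_t *in, size_t n, uint64_t *out)
    {
        uint64_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            out[i] = in[i] - prev;   /* gap to the previous value */
            prev = in[i];
        }
    }

    void delta_decode(const uint64_t *in, size_t n, uint64_t *out)
    {
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            acc += in[i];            /* running sum restores the original */
            out[i] = acc;
        }
    }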