Tradeoffs of different compression algorithms

What are the tradeoffs of the different compression algorithms?
The purpose is backup, transfer & restore. I don't care about popularity, as long as a mature enough tool exists for unix. I care about
time
cpu
memory
compression level
the algorithms I am considering are
zip
bzip
gzip
tar
others?

Tar is not a compression algorithm per se; it is an archiver that is typically combined with one of the compressors below.
You may use zip/gzip when compression/decompression time is the most important issue.
You may use bzip2 when you need a better compression ratio.
You may use LZMA when you need an even higher compression ratio, at the cost of more CPU time.
Have a look here.

The best way is to look at compression benchmark sites:
Maximumcompression
Compressionratings

It usually depends on your input data but I've never found anything that gives me better general compression than 7zip (http://www.7-zip.org).

It would be very simple to create a simple testbed for those cases.
Write a script that uses each in turn on a set of files that is representative of those you wish to compress, and measure the time/CPU/memory usage/compression ratio achieved.
Rerun them a statistically significant number of times, and you'll have your answer.
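A minimal sketch in Python of what such a testbed could look like (assuming gzip, bzip2 and xz are installed; test_data.tar is a hypothetical representative archive; peak memory would still need something like GNU /usr/bin/time -v, which is left out here):

```python
import os
import subprocess
import time

# Hypothetical input: an uncompressed tar of files representative of your backups.
SAMPLE = "test_data.tar"

# Each entry: (label, compress command, name of the output file it produces).
CANDIDATES = [
    ("gzip -9",  ["gzip",  "-9", "-k", "-f", SAMPLE], SAMPLE + ".gz"),
    ("bzip2 -9", ["bzip2", "-9", "-k", "-f", SAMPLE], SAMPLE + ".bz2"),
    ("xz -6",    ["xz",    "-6", "-k", "-f", SAMPLE], SAMPLE + ".xz"),
]

original_size = os.path.getsize(SAMPLE)

for label, cmd, out_file in CANDIDATES:
    start = time.perf_counter()
    subprocess.run(cmd, check=True)          # run the external compressor
    elapsed = time.perf_counter() - start
    ratio = os.path.getsize(out_file) / original_size
    print(f"{label:10s}  {elapsed:7.2f}s  ratio={ratio:.3f}")
```

Wrap the loop in a few repetitions and average the timings to get stable numbers.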

Related

Which one compresses better in terms of size, lz4 or zlib?

I need to use a compression technique, but I can't decide between lz4 and zlib. I searched the internet a little and lz4 is widely recommended, but I didn't find any data about the output size. Can anyone tell me which one is better in terms of final output size?
Your google-fu is very weak. A quick search turns up many benchmarks. One of which is right there on the lz4 page. In general zlib will compress better, and take more time doing it, but your mileage may vary. Just try both on your data. Also look at zstd.
Taken from this presentation: https://indico.cern.ch/event/631498/contributions/2553033/attachments/1443750/2223643/zlibvslz4presentation.pdf
Also check out Difference: LZ77 vs. LZ4 vs. LZ4HC (compression algorithms)?
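As a rough illustration of "just try both on your data", a minimal Python sketch (assuming the third-party lz4 package is installed; data.bin is a hypothetical sample of your own data):

```python
import zlib
import lz4.frame  # third-party: pip install lz4

with open("data.bin", "rb") as f:   # hypothetical sample of your own data
    data = f.read()

zlib_out = zlib.compress(data, 6)        # default-ish zlib level
lz4_out = lz4.frame.compress(data)       # default lz4 frame compression

print(f"original: {len(data)} bytes")
print(f"zlib -6 : {len(zlib_out)} bytes ({len(zlib_out) / len(data):.2%})")
print(f"lz4     : {len(lz4_out)} bytes ({len(lz4_out) / len(data):.2%})")
```

Run it on your actual payloads; the ratio gap between the two depends heavily on the data.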

Have the MonetDB developers tested any other compression algorithms on it?

Have the MonetDB developers tested any other compression algorithms on it before?
Perhaps they have tested other compression algorithms, but those had a negative performance impact.
If so, why haven't they improved this database's compression performance?
I am a student from China. MonetDB really interests me and I want to try to improve its compression performance.
So I want to make sure whether anybody has done this before.
I would be grateful if you could answer my question, because I really need this.
Thank you so much.
MonetDB only compresses string (VARCHAR and CHAR) types, using dictionary compression, and only if the number of unique strings in a column is small.
Integrating any other kind of compression (even simple ones like prefix coding, run-length encoding, delta compression, ...) needs a complete rewrite of the system, because the operators have to be made compression-aware (which pretty much means creating a new operator).
The only thing that may be feasible without a complete rewrite is having dedicated compression operators that compress/decompress data instead of spilling to disk. However, this would be very close to the memory compression Apple implemented in OS X Mavericks.
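To illustrate the kind of dictionary compression described above (this is not MonetDB's actual code, just a minimal sketch of the idea): each distinct string is stored once, and the column itself only stores small integer indexes into that dictionary.

```python
def dictionary_encode(column):
    """Encode a list of strings as (dictionary, list of indexes)."""
    dictionary = []          # unique values, in first-seen order
    index_of = {}            # value -> position in the dictionary
    codes = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

# A column with few distinct values compresses well this way.
col = ["red", "blue", "red", "red", "green", "blue"]
d, codes = dictionary_encode(col)
assert dictionary_decode(d, codes) == col
print(d, codes)   # ['red', 'blue', 'green'] [0, 1, 0, 0, 2, 1]
```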
MonetDB compresses columns using PFOR compression. See http://paperhub.s3.amazonaws.com/7558905a56f370848a04fa349dd8bb9d.pdf for details. This also answers your question about checking other compression methods.
The choice for PFOR is because of the way modern CPUs work, but really any algorithm that relies only on arithmetic rather than branches will do just fine. I've hit similar speeds with arithmetic coding in the past.

Packet oriented lossless compression library

Does anyone know of a free (non-GPL), decently performing compression library that supports packet oriented compression in C/C++?
By packet oriented, I mean the kind of feature QuickLZ (GPL) has, where multiple packets of a stream can be compressed and decompressed individually while a history is maintained across packets to achieve sensible compression.
I'd favor compression ratio over CPU usage as long as the CPU usage isn't ridiculous, but I've had a hard time finding this feature at all, so anything is of interest.
zlib's main deflate() function takes a flush parameter, which allows various different flushing modes. If you pass Z_SYNC_FLUSH at the end of each packet, that should produce the desired effect.
The details are explained in the zlib manual.
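A minimal Python sketch of that approach (the same idea carries over to the C API): one compressor/decompressor pair is kept for the whole connection, each packet is flushed with Z_SYNC_FLUSH, and later packets still benefit from the shared history.

```python
import zlib

compressor = zlib.compressobj(6)      # one shared state per connection
decompressor = zlib.decompressobj()

def compress_packet(payload: bytes) -> bytes:
    # Z_SYNC_FLUSH emits a complete, independently decodable chunk for this
    # packet while keeping the dictionary/history for the next one.
    return compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)

def decompress_packet(packet: bytes) -> bytes:
    return decompressor.decompress(packet)

# Packets share history, so repeated content in later packets compresses well.
packets = [b"GET /index.html HTTP/1.1\r\n", b"GET /style.css HTTP/1.1\r\n"]
for p in packets:
    wire = compress_packet(p)
    assert decompress_packet(wire) == p
    print(len(p), "->", len(wire))
```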
bzip2 has flushing functionality as well, which might let you do this kind of thing. See http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzCompress
Google's Snappy may be a good option, if you need speed more than compression and are just looking to save a moderate amount of space.
Alternatively, Ilia Muraviev put a small piece of compression code called BALZ in public domain some time ago. It is quite decent for many kinds of data.
Both of these support stream flushing and independent state variables to do multiple, concurrent streams across packets.
Google's new SPDY protocol uses zlib to compress individual messages, and maintains the zlib state for the life of the connection to achieve better compression. I don't think there's a standalone library that handles this behavior exactly, but there are several open-source implementations of SPDY that could show you how it's done.
The public domain Crush algorithm by Ilia Muraviev has similar performance and compression ratio as QuickLZ has, Crush being a bit more powerful. The algorithms are conceptually similar too, Crush containing a bit more tricks. The BALZ algorithm that was already mentioned earlier is also by Ilia Muraviev. See http://compressme.net/
Maybe you could use the LZMA compression SDK; it's written and placed in the public domain by Igor Pavlov.
Since it can compress stream files and has memory-to-memory compression, I think it's possible to compress a packet stream (maybe with some changes), but I'm not sure.

What is the current state of text-only compression algorithms?

In honor of the Hutter Prize,
what are the top algorithms (and a quick description of each) for text compression?
Note: The intent of this question is to get a description of compression algorithms, not of compression programs.
The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:
The Burrows-Wheeler Transform - shuffles characters (or other bit blocks) with a predictable algorithm to increase repeated blocks, which makes the source easier to compress. Decompression occurs as normal and the result is un-shuffled with the reverse transform. Note: BWT alone doesn't actually compress anything; it just makes the source easier to compress (see the sketch after this list).
Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is created by crunching statistics about the source rather than using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as with arithmetic coding.
Context Mixing - Arithmetic coding uses a static context for prediction, PPM dynamically chooses a single context, Context Mixing uses many contexts and weighs their results. PAQ uses context mixing. Here's a high-level overview.
Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
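As a concrete illustration of the BWT item above, here is a deliberately naive (O(n² log n)) Python sketch using the common sentinel-character approach; real implementations use suffix arrays instead:

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive BWT: sort all rotations and take the last column."""
    s = text + sentinel                      # sentinel marks the original end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last_column: str, sentinel: str = "\0") -> str:
    """Rebuild the sorted rotation table column by column,
    then pick the row that ends with the sentinel."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(sentinel))
    return original.rstrip(sentinel)

text = "banana bandana"
transformed = bwt(text)
print(repr(transformed))          # similar characters end up next to each other
assert inverse_bwt(transformed) == text
```

Note how the transform is perfectly reversible even though it compresses nothing by itself; a run-length or move-to-front stage after it is what actually shrinks the data.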
In addition, the Hutter prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper and lower case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
There's always lzip.
All kidding aside:
Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between enjoying a relatively broad install base and a rather good compression ratio, but it requires a separate archiver.
7-Zip (LZMA algorithm) compresses very well and is available under the LGPL. Few operating systems ship with built-in support, however.
rzip is a variant of bzip2 (it removes long-distance redundancy before handing the data to bzip2) that in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
If you want to use PAQ as a program, you can install the zpaq package on Debian-based systems. Usage is (see also man zpaq):
zpaq c archivename.zpaq file1 file2 file3
Compression was to about 1/10th of a zip file's size. (1.9M vs 15M)

best compression algorithm with the following features

What is the best compression algorithm with the following features:
should take less time to decompress (can take reasonably more time compress)
should be able to compress sorted data (approx list of 3,000,000 strings/integers ...)
Please suggest along with metrics: compression ratio, algorithmic complexity for compression and decompression (if possible)?
Entire site devoted to compression benchmarking here
Well if you just want speed, then standard ZIP compression is just fine and it's most likely integrated into your language/framework already (ex: .NET has it, Java has it). Sometimes the most universal solution is the best, ZIP is a very mature format, any ZIP library and application will work with any other.
But if you want better compression, I'd suggest 7-Zip as the author is very smart, easy to get ahold of and encourages people to use the format.
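To illustrate the "it's already in your language" point, Python's standard library has it too; a minimal sketch (notes.txt and data.csv are hypothetical input files):

```python
import zipfile

# Create an archive with standard DEFLATE compression...
with zipfile.ZipFile("backup.zip", "w",
                     compression=zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
    zf.write("notes.txt")        # hypothetical input file
    zf.write("data.csv")

# ...and read it back; any other ZIP tool can open the same archive.
with zipfile.ZipFile("backup.zip") as zf:
    print(zf.namelist())
```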
Providing you with compression times is impossible, as it's directly related to your hardware. If you want a benchmark, you have to do it yourself.
You don't have to worry about decompression time. The extra time spent at a higher compression level goes mostly into finding longer matching patterns.
Decompression either
1) writes the literal, or
2) for a (backward position, length) = (m, n) pair, goes back m bytes in the output buffer, reads n bytes, and writes those n bytes at the end of the buffer.
So the decompression time is independent of the compression level. And, from my experience with the Universal Decompression Virtual Machine (RFC 3320), I guess the same is true for any decompression algorithm.
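A minimal Python sketch of that decoding loop, on a simplified token stream rather than any real bitstream format:

```python
def lz77_decode(tokens):
    """Decode a simplified LZ77 token stream.

    Each token is either ('lit', byte) or ('copy', distance, length):
    go back `distance` bytes in the output and copy `length` bytes forward.
    """
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, distance, length = token
            start = len(out) - distance
            for i in range(length):          # byte-by-byte so overlapping copies work
                out.append(out[start + i])
    return bytes(out)

# "abcabcabcd": literals for "abc", then one overlapping back-reference.
tokens = [("lit", ord("a")), ("lit", ord("b")), ("lit", ord("c")),
          ("copy", 3, 6),
          ("lit", ord("d"))]
print(lz77_decode(tokens))   # b'abcabcabcd'
```

Whether the encoder spent milliseconds or minutes choosing those tokens makes no difference to this loop, which is why decompression time stays flat across compression levels.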
This is an interesting question.
On such sorted data of strings and integers, I would expect difference-coding compression approaches to outperform any out-of-the-box text compression approach such as LZ77 or LZ78 in terms of compression ratio, because general-purpose encoders do not exploit the special properties of the data.
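As a rough illustration of why difference (delta) coding helps on sorted data, here is a minimal Python sketch comparing zlib on the raw values versus on their deltas (synthetic data and zlib as a stand-in back-end, not a tuned difference coder):

```python
import random
import struct
import zlib

# Synthetic sorted data: 100,000 increasing integers with small gaps.
random.seed(0)
values = []
current = 0
for _ in range(100_000):
    current += random.randint(1, 20)
    values.append(current)

# Delta coding: keep the first value, then store successive differences.
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]

raw_bytes = struct.pack(f"<{len(values)}q", *values)     # 8 bytes per value
delta_bytes = struct.pack(f"<{len(deltas)}q", *deltas)   # same size before compression

print("raw   :", len(zlib.compress(raw_bytes, 6)))
print("deltas:", len(zlib.compress(delta_bytes, 6)))     # usually much smaller
```

The deltas are small, highly repetitive numbers, so almost every byte of their fixed-width encoding is zero, which any back-end compressor handles far better than the ever-growing raw values.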