Is there any way to project what kind of compression result you'd get using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:
"fffffff"
might compress well compared to something like:
"abcdefg"
Where do I start?
Thanks
gzip uses the deflate algorithm, which, crudely described, compresses files by replacing repeated strings with short references back to an earlier occurrence of the string. Thus, highly repetitive data compresses exceptionally well, while purely random data will compress very little, if at all.
By means of demonstration:
[chris@polaris ~]$ dd if=/dev/urandom of=random bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.296325 s, 3.5 MB/s
[chris@polaris ~]$ ll random
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 random
[chris@polaris ~]$ gzip random
[chris@polaris ~]$ ll random.gz
-rw-rw-r-- 1 chris chris 1048761 2010-08-30 16:12 random.gz
[chris@polaris ~]$ dd if=/dev/zero of=ordered bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00476905 s, 220 MB/s
[chris@polaris ~]$ ll ordered
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 ordered
[chris@polaris ~]$ gzip ordered
[chris@polaris ~]$ ll ordered.gz
-rw-rw-r-- 1 chris chris 1059 2010-08-30 16:12 ordered.gz
My purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
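You can see the same effect at string scale with the two examples from the question, using Python's zlib module (which implements the same DEFLATE algorithm gzip uses). This is just an illustrative sketch; the exact byte counts will vary with the zlib version and settings:

import os
import zlib

repetitive = b"f" * 10000      # one repeated character, like "fffffff" scaled up
varied = os.urandom(10000)     # pseudo-random bytes, nothing for DEFLATE to reuse

print(len(zlib.compress(repetitive, 9)))   # tiny compared to the input
print(len(zlib.compress(varied, 9)))       # about the same size as the input, no gain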
The algorithm used by gzip is called DEFLATE.
It combines two popular compression techniques: duplicate-string elimination (LZ77) and bit reduction (Huffman coding).
Basically, as a rule of thumb, compression works best when some characters occur much more often than others and/or when characters are often repeated consecutively. Compression works worst when the characters are uniformly distributed in the input and change every time.
There are also ways to measure this, such as the entropy of the data.
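To make the entropy idea concrete, here is a minimal sketch (the function name shannon_entropy is mine, not from any library) that computes the Shannon entropy of a byte string in bits per byte:

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # Bits of information per byte: 0.0 for one repeated symbol, up to 8.0 for uniform random bytes.
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

print(shannon_entropy(b"fffffff"))   # 0.0 -> highly compressible
print(shannon_entropy(b"abcdefg"))   # about 2.81, i.e. log2 of 7 distinct symbols

Keep in mind this only measures symbol frequencies; DEFLATE also exploits repeated substrings, so entropy is a rough guide rather than an exact predictor of the compressed size.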
The degree of compression achieved by any compression algorithm is obviously dependent on the data provided. However, there is also clearly some overhead added purely by virtue of having compressed data.
I'm working on a process that compresses data of various types. I know much of the data will be very small, though it will also frequently be large enough to benefit from some level of compression. While I can probably just determine experimentally a minimum size below which compression is skipped, and that would work well enough, I am curious whether there's a clear point where this is definitely not worth it.
Running some tests using zip, I compressed a series of files with 10, 100, and 1000 bytes respectively of random data and the alphabet repeated. For example here is the content of the 100 byte alphabet file:
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqr
I was fairly surprised to find that the zipped version of the file was 219 bytes, despite the level of redundancy. For comparison, the 100-byte file of random data became 272 bytes.
However, the 1000 byte alphabet file compressed all the way down to 227 bytes, while the random file increased to 1174.
Is there a clear minimum file size where even the most redundant files will not benefit from this type of compression?
Something between 250 and 500 bytes would be a decent threshold depending on the level of redundancy and assuming the time spent compressing the data is negligible.
I got to this by realizing that fully redundant data (every byte the same) would likely result in the greatest level of compression.
Re-running the same tests with data read from /dev/zero, I found that the compressed file length was not really that variable:
Uncompressed | Compressed | Percent Size
-------------+------------+-------------
100 bytes | 178 bytes | 178%
200 bytes | 178 bytes | 89%
300 bytes | 179 bytes | 60%
400 bytes | 180 bytes | 45%
500 bytes | 180 bytes | 36%
...
1000 bytes | 185 bytes | 19%
This makes a decent case for the answer being technically 178 bytes (I tested this case and got 178 bytes).
However, I think the alphabet test is probably a bit closer to a practical best case of redundancy (without knowing much about how DEFLATE looks for redundancy).
Using various files in the same format as in the question, I found the following:
Uncompressed | Compressed | Percent Size
-------------+------------+-------------
100 bytes | 212 bytes | 212%
200 bytes | 212 bytes | 106%
300 bytes | 214 bytes | 71%
400 bytes | 214 bytes | 54%
500 bytes | 214 bytes | 43%
...
1000 bytes | 221 bytes | 22%
And unsurprisingly 212 seems to be a fixed point for this type of file.
Lastly, I decided to try a more direct approach with lorem ipsum text and eventually found that 414 bytes was the fixed point there.
Based on all of this, I would say something between 250 and 500 bytes is a reasonable minimum size below which compression can be skipped for general text that may or may not have some level of redundancy on average. One may even want to go higher if benchmarking reveals the time the compression takes isn't worth the minor benefit in space.
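If you would rather script this kind of threshold experiment than build test files by hand, here is a rough sketch using Python's zlib. Note that raw zlib output carries less framing overhead than zip (per-file headers plus a central directory) or gzip (header, trailer, and usually the filename), so the absolute numbers will come out smaller than the tables above; measure with the container you will actually use before picking a threshold.

import zlib

alphabet = b"abcdefghijklmnopqrstuvwxyz\n"

for n in (10, 100, 250, 500, 1000):
    zeros = b"\x00" * n
    text = (alphabet * (n // len(alphabet) + 1))[:n]   # the alphabet repeated, trimmed to exactly n bytes
    print(n, len(zlib.compress(zeros, 9)), len(zlib.compress(text, 9)))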
I have a large tarball that was split into several files. The tarball is 100GB split into 12GB files.
tar czf - -T myDirList.txt | split --bytes=12GB - my.tar.gz.
Trying cat my.tar.gz.* | gzip -l returns
compressed uncompressed ratio uncompressed_name
-1 -1 0.0% stdout
Trying gzip -l my.tar.gz.aa returns
compressed uncompressed ratio uncompressed_name
12000000000 3488460670 -244.0% my.tar
Concatenating the files with cat my.tar.gz.* > my.tar.gz returns an even worse answer of
compressed uncompressed ratio uncompressed_name
103614559077 2375907328 -4261.1% my.tar
What is going on here? How can I get the real compression ratio for these split tarballs?
The gzip format stores the uncompressed size as the last four bytes of the stream. gzip -l uses those four bytes and the length of the gzip file to compute a compression ratio. In doing so, gzip seeks to the end of the input to get the last four bytes. Note that four bytes can only represent up to 4 GB - 1.
In your first case, you can't seek on piped input, so gzip gives up and reports -1.
In your second case, gzip is picking up four bytes of compressed data (effectively four random bytes) as the uncompressed size. That value is necessarily less than the 12,000,000,000-byte compressed length, so a negative compression ratio (expansion) is reported.
In your third case, gzip is getting the actual uncompressed length, but only that length modulo 2^32, which is necessarily much less than 103 GB, so an even more significant negative compression ratio is reported.
The second case is hopeless, but the compression ratio for the first and third cases can be determined using pigz, a parallel implementation of gzip that uses multiple cores for compression. pigz -lt decompresses the input without storing it, in order to determine the number of uncompressed bytes directly. (pigz -l is just like gzip -l, and would not work either. You need the t to test, i.e. decompress without saving.)
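For reference, what gzip -l reads is the 4-byte, little-endian ISIZE field at the very end of the file, which holds the uncompressed length modulo 2^32. A minimal Python sketch of that lookup (the function name is mine; it assumes a complete gzip stream with nothing after it, which is exactly the assumption the split pieces violate):

import os
import struct

def stored_uncompressed_size(path: str) -> int:
    # The last 4 bytes of a gzip file hold ISIZE: the uncompressed length modulo 2**32.
    # On a split piece such as my.tar.gz.aa these bytes are just compressed data,
    # so the value is meaningless.
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack("<I", f.read(4))[0]

That is why only a full decode, as pigz -lt does, can give a trustworthy number here.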
I was recently compressing some files, and I noticed that base64-encoded data seems to compress really bad. Here is one example:
Original file: 429.7 MiB
compress via xz -9:
13.2 MiB / 429.7 MiB = 0.031 4.9 MiB/s 1:28
base64 it and compress via xz -9:
26.7 MiB / 580.4 MiB = 0.046 2.6 MiB/s 3:47
base64 the original compressed xz file:
17.8 MiB in almost no time = the expected 1.33x increase in size
So what can be observed is that:
xz compresses really well ☺
base64-encoded data doesn't compress well: the result is about twice the size of the compressed unencoded file
base64-then-compress is significantly worse and slower than compress-then-base64
How can this be? Base64 is a lossless, reversible algorithm, why does it affect compression so much? (I tried with gzip as well, with similar results).
I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files, and I would have thought that since the actual information density (or whatever it is called) of a base64-encoded file is nearly identical to that of the non-encoded version, it would be similarly compressible.
Most generic compression algorithms work with a one-byte granularity.
Let's consider the following string:
"XXXXYYYYXXXXYYYY"
A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:
"WFhYWFlZWVlYWFhYWVlZWQ=="
All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
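You can watch this happen with a few lines of Python, using zlib (DEFLATE, as in gzip) and the example pattern scaled up; the exact sizes depend on the zlib version, but the Base64 version should consistently come out larger:

import base64
import zlib

raw = b"XXXXYYYYXXXXYYYY" * 1000
encoded = base64.b64encode(raw)            # the same information, re-encoded as text

print(len(zlib.compress(raw, 9)))          # the byte-aligned repetition is easy to exploit
print(len(zlib.compress(encoded, 9)))      # 1.33x as much input, with the pattern smeared across characters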
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):
Input bytes : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):
0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
6-bit repacking : 00010110 00000101 00100001 00011000
As decimal : 22 5 33 24
Base64 characters: 'W' 'F' 'h' 'Y'
Output bytes : 0x57 0x46 0x68 0x59
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
Whatever compression method is used, Base64 encoding usually has a severe impact on its performance. That's why you should always compress first and encode second.
This is even more true for encryption: compress first, encrypt second.
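Here is a hedged sketch of that rule using Python's lzma module (the same LZMA family xz uses) on made-up, moderately compressible data; the alphabet and sizes are arbitrary stand-ins, and the point is the ordering of the results, not the exact numbers:

import base64
import lzma
import random

# Synthetic stand-in for a compressible input: half a megabyte drawn from a
# four-letter alphabet, i.e. roughly 2 bits of information per byte.
random.seed(0)
data = bytes(random.choices(b"ACGT", k=500000))

direct = lzma.compress(data)
compress_then_encode = base64.b64encode(direct)
encode_then_compress = lzma.compress(base64.b64encode(data))

print(len(data), len(direct))
print(len(compress_then_encode))   # just the ~4/3 Base64 expansion of the direct result
print(len(encode_then_compress))   # typically the largest of the three, as in the xz numbers above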
EDIT - A note about LZMA
As MSalters noticed, LZMA -- which xz uses -- works on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes : 0x57 0x46 0x68 0x59
As binary : 01010111 01000110 01101000 01011001
Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" and "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).
Now look at what Base64 encoding does: it replaces each 3-byte (24-bit) word with a 4-byte word, using only 64 out of 256 possible values. This gives you the 1.33x growth.
Now it is fairly clear that this growth must cause some substrings to grow past the maximum substring size of the encoder. This causes them to no longer be compressed as a single substring, but as two separate substrings.
As you have a lot of compression (97%), you apparently have a situation where very long input substrings are compressed as a whole. This means that you will also have many substrings being expanded by Base64 past the maximum length the encoder can deal with.
It's not Base64; it's the memory requirements of the libraries: "The presets 7-9 are like the preset 6 but use bigger dictionaries and have higher compressor and decompressor memory requirements." (https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/LZMA2Options.html)
Using the Linux command-line tool gzip, I can get the uncompressed size of a compressed file with gzip -l.
I couldn't find any function like that in the zlib manual's "gzip File Access Functions" section.
I found a solution at this link, http://www.abeel.be/content/determine-uncompressed-size-gzip-file, that involves reading the last 4 bytes of the file, but I am avoiding it right now because I'd prefer to use the library's functions.
There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.
First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)
Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.
pigz has an option to in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.
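If you do need the real uncompressed size from code, the zlib-level equivalent of pigz -lt is to decode the whole file and count the output without keeping it. A rough Python sketch (the function name is mine; it handles concatenated gzip members, but trailing junk after the last member would still raise an error):

import zlib

def uncompressed_size(path: str, chunk_size: int = 1 << 20) -> int:
    # Decode the whole file, counting output bytes without keeping them.
    total = 0
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)   # MAX_WBITS | 16 selects gzip framing
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            while chunk:
                total += len(decomp.decompress(chunk))
                chunk = decomp.unused_data             # bytes left over after a member ended
                if decomp.eof:
                    # A gzip member finished; any leftover bytes belong to the next member.
                    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    return total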
By the pigeonhole principle, every lossless compression algorithm can be "defeated", i.e. for some inputs it produces output that is longer than the input. Is it possible to explicitly construct a file which, when fed to e.g. gzip or another lossless compression program, will lead to (much) larger output? (Or, better still, a file which inflates ad infinitum upon subsequent compressions?)
Well, I'd assume eventually it'll max out since the bit patterns will repeat, but I just did:
touch file
gzip file -c > file.1
...
gzip file.9 -c > file.10
And got:
0 bytes: file
25 bytes: file.1
45 bytes: file.2
73 bytes: file.3
103 bytes: file.4
122 bytes: file.5
152 bytes: file.6
175 bytes: file.7
205 bytes: file.8
232 bytes: file.9
262 bytes: file.10
Here are 24,380 files graphically (this is really surprising to me, actually):
(graph of compressed file size per iteration: http://research.engineering.wustl.edu/~schultzm/images/filesize.png)
I was not expecting that kind of growth, I would just expect linear growth since it should just be encapsulating the existing data in a header with a dictionary of patterns. I intended to run through 1,000,000 files, but my system ran out of disk space way before that.
If you want to reproduce, here is the bash script to generate the files:
#!/bin/bash
touch file.0
for ((i=0; i < 20000; i++)); do
    gzip file.$i -c > file.$(($i+1))
done
wc -c file.* | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
The resulting filesizes.txt is a tab-delimited, sorted file for your favorite graphing utility. (You'll have to manually remove the "total" field, or script it away.)
Random data, or data encrypted with a good cypher would probably be best.
But any good packer should only add constant overhead once it decides that it can't compress the data (@Frank). For a fixed overhead, an empty file or a single character will give the greatest percentage overhead.
For packers that include the filename (e.g. rar, zip, tar), you could of course just make the filename really long :-)
Try to gzip the file that results from the following command:
echo a > file.txt
Compressing this 2-byte file resulted in a 31-byte gzipped file!
A text file with 1 byte in it (for example one character like 'A') is stored in 1 byte on the disk, but WinRAR compresses it to a 94-byte rar and a 141-byte zip.
I know it's a sort of cheat answer but it works. I think it's going to be the biggest % difference between original size and 'compressed' size you are going to see.
Take a look at the formats used for zipping; they are reasonably simple, and the most basic way to make a 'compressed' file larger than the original is to avoid any repeating data.
All these compression algorithms are looking for redundant data. If your file has little or no redundancy in it (like a sequence of abac…az, bcbd…bz, cdce…cz, etc.), it is very likely that the “deflated” output is rather an inflation.
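A quick way to see both extremes from Python (gzip.compress mirrors what the gzip tool does, minus the stored filename, so the numbers differ slightly from the shell examples above):

import gzip
import os

incompressible = os.urandom(1000)   # no redundancy for DEFLATE to find
redundant = b"a" * 1000

print(len(gzip.compress(incompressible)))   # a little larger than 1000: stored essentially verbatim, plus headers
print(len(gzip.compress(redundant)))        # a small fraction of 1000
print(len(gzip.compress(b"a\n")))           # the "echo a" case: tiny input, but the fixed overhead dominates

The command-line tool also stores the original filename in the header by default, which accounts for part of the 31 bytes in the file.txt example above.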