How to defeat gzip (or other lossless compression)

By the pigeonhole principle, every lossless compression algorithm can be "defeated", i.e. for some inputs it produces output that is longer than the input. Is it possible to explicitly construct a file which, when fed to e.g. gzip or another lossless compression program, will lead to (much) larger output? (Or, better still, a file which inflates ad infinitum upon subsequent compressions?)

Well, I'd assume it'll eventually max out since the bit patterns will repeat, but I just tried this:
touch file
gzip file -c > file.1
...
gzip file.9 -c > file.10
And got:
0 bytes: file
25 bytes: file.1
45 bytes: file.2
73 bytes: file.3
103 bytes: file.4
122 bytes: file.5
152 bytes: file.6
175 bytes: file.7
205 bytes: file.8
232 bytes: file.9
262 bytes: file.10
Here are 24,380 files graphically (this is really surprising to me, actually):
(graph of the file sizes: http://research.engineering.wustl.edu/~schultzm/images/filesize.png)
I was not expecting that kind of growth; I would have expected linear growth, since it should just be encapsulating the existing data in a header with a dictionary of patterns. I intended to run through 1,000,000 files, but my system ran out of disk space well before that.
If you want to reproduce, here is the bash script to generate the files:
#!/bin/bash
touch file.0
for ((i=0; i < 20000; i++)); do
    gzip file.$i -c > file.$(($i+1))
done
wc -c file.* | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
The resulting filesizes.txt is a tab-delimited, sorted file for your favorite graphing utility. (You'll have to manually remove the "total" field, or script it away.)

Random data, or data encrypted with a good cipher, would probably be best.
But any good packer should only add constant overhead once it decides that it can't compress the data (@Frank). For a fixed overhead, an empty file or a single character will give the greatest percentage overhead.
For packers that include the filename (e.g. rar, zip, tar), you could of course just make the filename really long :-)
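To see the fixed overhead in practice, here is a minimal sketch (assuming GNU coreutils and gzip; rand.bin is just a placeholder name):
head -c 100000 /dev/urandom > rand.bin    # incompressible input
gzip -c rand.bin | wc -c                  # roughly 100000 plus a few dozen bytes of gzip overhead
gzip -c rand.bin | gzip -c | wc -c        # a second pass adds another small, roughly constant amount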

Try to gzip the file that results from the following command:
echo a > file.txt
Compressing this 2-byte file results in a 31-byte gzipped file!

A text file with 1 byte in it (for example, one character like 'A') is stored in 1 byte on the disk, but WinRAR rars it to 94 bytes and zips it to 141 bytes.
I know it's a sort of cheat answer, but it works. I think it's going to be the biggest percentage difference between original size and 'compressed' size you are going to see.
Take a look at the zip format; it is reasonably simple, and the most basic way to make a 'compressed' file larger than the original is to avoid any repeating data.

All these compression algorithms look for redundant data. If your file has little or no redundancy in it (like a sequence of abac…az, bcbd…bz, cdce…cz, etc.), it is very likely that the "deflated" output will actually be an inflation.
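A quick way to see this from the shell (a minimal sketch, assuming GNU gzip and coreutils; the exact byte counts will vary):
yes a | head -c 100000 | gzip -c | wc -c          # 100 kB of a repeated pattern: shrinks to a few hundred bytes
head -c 100000 /dev/urandom | gzip -c | wc -c     # 100 kB with no redundancy: comes out slightly larger than the input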

Related

Compression ratio for split tarballs

I have a large tarball that was split into several files. The tarball is 100GB split into 12GB files.
tar czf - -T myDirList.txt | split --bytes=12GB - my.tar.gz.
Trying cat my.tar.gz.* | gzip -l returns
compressed uncompressed ratio uncompressed_name
-1 -1 0.0% stdout
Trying gzip -l my.tar.gz.aa returns
compressed uncompressed ratio uncompressed_name
12000000000 3488460670 -244.0% my.tar
Concatenating the files with cat my.tar.gz.* > my.tar.gz returns an even worse answer:
compressed uncompressed ratio uncompressed_name
103614559077 2375907328 -4261.1% my.tar
What is going on here? How can I get the real compression ratio for these split tarballs?
The gzip format stores the uncompressed size as the last four bytes of the stream. gzip -l uses those four bytes and the length of the gzip file to compute a compression ratio. In doing so, gzip seeks to the end of the input to get the last four bytes. Note that four bytes can only represent up to 4 GB - 1.
In your first case, you can't seek on piped input, so gzip gives up and reports -1.
In your second case, gzip is picking up four bytes of compressed data, effectively four random bytes, as the uncompressed size, which is necessarily less than 12,000,000,000, and so a negative compression ratio (expansion) is reported.
In your third case, gzip is getting the actual uncompressed length, but that length modulo 2^32, which is necessarily much less than 103 GB, reporting an even more significant negative compression ratio.
The second case is hopeless, but the compression ratio for the first and third cases can be determined using pigz, a parallel implementation of gzip that uses multiple cores for compression. pigz -lt decompresses the input without storing it, in order to determine the number of uncompressed bytes directly. (pigz -l is just like gzip -l, and would not work either. You need the t to test, i.e. decompress without saving.)
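For the split tarball in the question, that would look something like this (a sketch; it assumes pigz is installed and that the pieces concatenate back into one valid gzip stream):
cat my.tar.gz.* | pigz -lt    # decompresses on the fly and reports the true uncompressed size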

How to read output of hexdump of a file?

I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump, but I don't know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm, you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement a decompressor for your format and check that the result of compressing and then decompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
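For example, reproducing the output in the question (a sketch; out.bin is a placeholder name, and this assumes the usual GNU/util-linux hexdump and od):
printf '\xf8' > out.bin
hexdump out.bin     # default two-byte units: the lone byte shows up padded as 00f8
od -t x1 out.bin    # one byte per column: 0000000 f8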
As to how to recreate the original, you need to run it through a decompression process that matches the compression used.
Given that the compressed file is that short, you either started with a very small file, your compression process is broken, or it's incredibly efficient.

Uncompressed file size using zlib's gzip file access function

Using the Linux command line tool gzip, I can tell the uncompressed size of a compressed file using gzip -l.
I couldn't find any function like that in the zlib manual's "gzip File Access Functions" section.
I found a solution at http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last 4 bytes of the file, but I am avoiding it for now because I prefer to use the library's functions.
There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.
First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)
Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.
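For what it's worth, those trailing four bytes (the ISIZE field) can be inspected directly from the shell, subject to all the caveats above: a single-member stream, no trailing junk, and an uncompressed length under 4 GB. A sketch, assuming a little-endian machine so that od's native byte order matches gzip's, with file.gz as a placeholder name:
tail -c 4 file.gz | od -An -t u4    # prints the stored uncompressed length modulo 2^32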
pigz has an option to in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.

Remove duplicates from two large text files using unordered_map

I am new to a lot of these C++ libraries, so please forgive me if my question comes across as naive.
I have two large text files, about 160 MB each (about 700000 lines each). I need to remove from file2 all of the duplicate lines that appear in file1. To achieve this, I decided to use unordered_map with a 32 character string as my key. The 32 character string is the first 32 chars of each line (this is enough to uniquely identify the line).
Anyway, so I basically just go through the first file and push the 32-char substring of each line into the unordered_map. Then I go through the second file and check whether the line in file2 exists in my unordered_map. If it doesn't exist, then I write the full line to a new text file.
This works fine for smaller files (40 MB each), but for these 160 MB files it takes a very long time to insert into the hash table (before I even start looking at file2). At around 260,000 inserts, it seems to have halted or is going very slowly. Is it possible that I have reached my memory limit? If so, can anybody explain how to calculate this? If not, is there something else I could do to make it faster? Maybe choosing a custom hash function, or specifying some parameters that would help optimize it?
My key/value pair in the hash table is (string, int), where the string is always 32 chars long and the int is a count I use to handle duplicates.
I am running a 64 bit Windows 7 OS w/ 12 GB RAM.
Any help would be greatly appreciated.. thanks guys!!
You don't need a map because you don't have any associative data. An unordered set will do the job. Also, I'd go with some memory efficient hash set implementation like Google's sparse_hash_set. It is very memory efficient and is able to store contents on disk.
Aside from that, you can work on smaller chunks of data. For example, split your files into 10 blocks, remove duplicates from each, then combine them until you reach a single block with no duplicates. You get the idea.
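As a quick sanity check outside C++, the same idea (remember the first 32 characters of every file1 line, then emit only the file2 lines whose prefix was not seen) can be sketched with awk's associative arrays; file2.filtered is a placeholder output name:
awk 'NR==FNR { seen[substr($0, 1, 32)]; next }
     !(substr($0, 1, 32) in seen)' file1 file2 > file2.filtered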
I would not write a C++ program to do this, but use some existing utilities.
In Linux, Unix and Cygwin, perform the following:
cat the two files into 1 large file:
# cat file1 file2 > file3
Use sort -u to extract the unique lines:
# sort -u file3 > file4
Prefer to use operating system utilities rather than (re)writing your own.

gzip compression expectations?

Is there any way to project what kind of compression result you'd get using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:
"fffffff"
might compress well compared to something like:
"abcdefg"
Where do I start?
Thanks
gzip uses the deflate algorithm, which, crudely described, compresses files by replacing repeated strings with pointers to the first instance of the string. Thus, highly repetitive data compresses exceptionally well, while purely random data will compress very little, if at all.
By means of demonstration:
[chris@polaris ~]$ dd if=/dev/urandom of=random bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.296325 s, 3.5 MB/s
[chris@polaris ~]$ ll random
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 random
[chris@polaris ~]$ gzip random
[chris@polaris ~]$ ll random.gz
-rw-rw-r-- 1 chris chris 1048761 2010-08-30 16:12 random.gz
[chris@polaris ~]$ dd if=/dev/zero of=ordered bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00476905 s, 220 MB/s
[chris@polaris ~]$ ll ordered
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 ordered
[chris@polaris ~]$ gzip ordered
[chris@polaris ~]$ ll ordered.gz
-rw-rw-r-- 1 chris chris 1059 2010-08-30 16:12 ordered.gz
My purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
The algorithm used by gzip is called DEFLATE.
It combines two popular compression techniques: Duplicate string elimination and bit reduction. Both are explained in the article.
Basically, as a rule of thumb, you could say that compression works best when some characters are used much more often than others and/or when characters are often repeated consecutively. Compression is worst when the characters are uniformly distributed in the input and change every time.
There are also measures for this, like the entropy of the data.
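To see the bit-reduction side on its own: even data whose symbols appear in essentially random order compresses if they come from a small alphabet, because each symbol can be coded in fewer than 8 bits. A sketch, assuming GNU tr, head and gzip, with dna.txt as a placeholder name:
tr -dc 'acgt' < /dev/urandom | head -c 100000 > dna.txt    # 100 kB of random a/c/g/t characters
gzip -c dna.txt | wc -c                                    # roughly a quarter of the original size (about 2 bits per symbol)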