Using the Linux command-line tool gzip, I can tell the uncompressed size of a compressed file using gzip -l.
I couldn't find any function like that in the zlib manual's section "gzip File Access Functions".
I found a solution at http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last four bytes of the file, but I am avoiding it for now because I would prefer to use the library's functions.
There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.
First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
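For reference, this is all the last-four-bytes approach from the linked page can give you. A minimal sketch in C (the file name is a placeholder, and per the reasons in this answer the result is only trustworthy for a single-member gzip file, under 4 GB uncompressed, with no trailing junk):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("file.gz", "rb");    /* placeholder name */
    if (f == NULL)
        return 1;
    /* ISIZE: the last four bytes of the file, little-endian (RFC 1952) */
    unsigned char b[4];
    if (fseek(f, -4, SEEK_END) != 0 || fread(b, 1, 4, f) != 4) {
        fclose(f);
        return 1;
    }
    unsigned long isize = (unsigned long)b[0]
                        | ((unsigned long)b[1] << 8)
                        | ((unsigned long)b[2] << 16)
                        | ((unsigned long)b[3] << 24);
    printf("uncompressed length mod 2^32: %lu\n", isize);
    fclose(f);
    return 0;
}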
Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)
Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.
pigz has an option that does in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.
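If you do want the true length from your own code using zlib's library functions, the only reliable route is the same: decompress everything and count. A minimal sketch along those lines (the file name is a placeholder; recent zlib versions read through concatenated gzip members in gzread, though very old ones may stop at the first):

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    gzFile f = gzopen("file.gz", "rb");  /* placeholder name */
    if (f == NULL)
        return 1;
    unsigned long long total = 0;
    unsigned char buf[65536];
    int n;
    while ((n = gzread(f, buf, sizeof(buf))) > 0)
        total += n;                      /* count the bytes, discard them */
    gzclose(f);
    if (n < 0)
        return 1;                        /* read error or corrupt stream */
    printf("uncompressed length: %llu bytes\n", total);
    return 0;
}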
Related
I have a large tarball that was split into several files. The tarball is 100 GB, split into 12 GB pieces:
tar czf - -T myDirList.txt | split --bytes=12GB - my.tar.gz.
Trying cat my.tar.gz.* | gzip -l returns
compressed uncompressed ratio uncompressed_name
-1 -1 0.0% stdout
Trying gzip -l my.tar.gz.aa returns
compressed uncompressed ratio uncompressed_name
12000000000 3488460670 -244.0% my.tar
Concatenating the files with cat my.tar.gz.* > my.tar.gz returns an even worse answer of
compressed uncompressed ratio uncompressed_name
103614559077 2375907328 -4261.1% my.tar
What is going on here? How can I get the real compression ratio for these split tarballs?
The gzip format stores the uncompressed size as the last four bytes of the stream. gzip -l uses those four bytes and the length of the gzip file to compute a compression ratio. In doing so, gzip seeks to the end of the input to get the last four bytes. Note that four bytes can only represent up to 4 GB - 1.
In your first case, you can't seek on piped input, so gzip gives up and reports -1.
In your second case, gzip is picking up four bytes of compressed data, effectively four random bytes, as the uncompressed size. That value is necessarily less than 12,000,000,000, so a negative compression ratio (expansion) is reported: (3488460670 - 12000000000) / 3488460670 ≈ -2.44, i.e. -244.0%.
In your third case, gzip is getting the actual uncompressed length, but only that length modulo 2^32, which is necessarily much less than 103 GB (the true length is 2375907328 plus some unknown multiple of 2^32), so an even more significant negative compression ratio is reported.
The second case is hopeless, but the compression ratio for the first and third cases can be determined using pigz, a parallel implementation of gzip that uses multiple cores for compression. pigz -lt decompresses the input without storing it, in order to determine the number of uncompressed bytes directly, e.g. cat my.tar.gz.* | pigz -lt. (pigz -l is just like gzip -l, and would not work either. You need the t to test, i.e. decompress without saving.)
Is it possible for gzip to not compress a file? What happens in that case? Does the archive still contain a DEFLATE stream? I need to handle this special case in my program.
Yes. If the file is not compressible, for example if it's already compressed, gzip will make stored blocks, which contain the source data unchanged with some header and trailer bytes appended. The output is still a valid DEFLATE stream, wrapped in the usual gzip header and trailer.
You can make your own non-compressed stream if that's needed. RFC 1951 Sections 3.2.3 and 3.2.4 describe how it's done.
A Deflate stored block is basically a single byte whose value is 0x00 or 0x01 (BTYPE=00, BFINAL=0 or 1), followed by two bytes of LEN and two bytes of NLEN, followed by your actual data. LEN is the little-endian count of data bytes in the block, at most 65535 (just under 2^16 = 64 KB), and NLEN is its one's complement. If you have more than 64 KB, you have to emit multiple blocks. The last block must have the BFINAL bit set to 1.
Finally, you will have to prepend all of this with a gzip header per RFC 1952 (assuming it is a GZIP stream; otherwise check RFC 1950 for ZLIB), and append the trailer: the CRC-32 of the data and its length modulo 2^32. The header can contain a filename, timestamp, etc. It's a couple hours of work on your part -- most of the time will be spent understanding the spec.
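Here is a minimal sketch in C of the whole scheme, under the assumption that the data fits in memory; the function name is made up, and it borrows zlib's crc32() rather than implementing the checksum by hand:

#include <stdio.h>
#include <zlib.h>

/* Wrap raw bytes in a gzip stream using only Deflate stored blocks
   (RFC 1951 sections 3.2.3-3.2.4 for the blocks, RFC 1952 for the
   header and trailer). */
static void gzip_store(FILE *out, const unsigned char *data, size_t len)
{
    /* RFC 1952 header: magic bytes, CM=8 (deflate), no flags,
       zero mtime, XFL=0, OS=255 (unknown) */
    const unsigned char hdr[10] = {0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 255};
    fwrite(hdr, 1, sizeof(hdr), out);

    size_t pos = 0;
    do {
        size_t n = len - pos;
        if (n > 65535)
            n = 65535;                            /* LEN is only 16 bits */
        fputc(pos + n == len ? 0x01 : 0x00, out); /* BFINAL bit + BTYPE=00 */
        fputc(n & 0xff, out);                     /* LEN, little-endian */
        fputc((n >> 8) & 0xff, out);
        fputc(~n & 0xff, out);                    /* NLEN, one's complement */
        fputc((~n >> 8) & 0xff, out);
        fwrite(data + pos, 1, n, out);            /* the data, verbatim */
        pos += n;
    } while (pos < len);

    /* RFC 1952 trailer: CRC-32, then length modulo 2^32, little-endian */
    unsigned long crc = crc32(0L, data, (uInt)len);
    for (int i = 0; i < 4; i++)
        fputc((crc >> (8 * i)) & 0xff, out);
    for (int i = 0; i < 4; i++)
        fputc(((unsigned long)len >> (8 * i)) & 0xff, out);
}

Feeding the result to gzip -d should give back the original bytes exactly.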
I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump but I don't know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm, you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement a decompressor for your format and check that the result of compressing and then decompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
As to how to convert that back, you need to run it through a decompression process that matches the compression used.
Given that the compressed file is that short, you either started with a very small file, your compression process is broken, or it's incredibly efficient.
Every compressed block in the bzip2 format has a header, which begins with ".compressed_magic:48 = 0x314159265359 (BCD (pi))". So it can be rather easy to detect the middle of a big bzip2 file in binary form.
Does the gzip format have the same magic constants in the middle of the big file?
Or, to put the question another way: does gzip have something like a gziprecover, the way bzip2 has bzip2recover?
See http://www.gzip.org/zlib/rfc-gzip.html. I didn't reread it, but as far as I remember, there are no block headers in that format.
By the pigeonhole principle, every lossless compression algorithm can be "defeated", i.e. for some inputs it produces outputs which are longer than the input. Is it possible to explicitly construct a file which, when fed to e.g. gzip or another lossless compression program, will lead to (much) larger output? (Or, better still, a file which inflates ad infinitum upon subsequent compressions?)
Well, I'd assume eventually it'll max out since the bit patterns will repeat, but I just did:
touch file
gzip file -c > file.1
...
gzip file.9 -c > file.10
And got:
0 bytes: file
25 bytes: file.1
45 bytes: file.2
73 bytes: file.3
103 bytes: file.4
122 bytes: file.5
152 bytes: file.6
175 bytes: file.7
205 bytes: file.8
232 bytes: file.9
262 bytes: file.10
Here are 24,380 files graphically (this is really surprising to me, actually):
[Graph: file size vs. iteration for 24,380 files, http://research.engineering.wustl.edu/~schultzm/images/filesize.png]
I was not expecting that kind of growth; I would have expected linear growth, since each step should just encapsulate the existing data in a header with a dictionary of patterns. I intended to run through 1,000,000 files, but my system ran out of disk space way before that.
If you want to reproduce, here is the bash script to generate the files:
#!/bin/bash
touch file.0
for ((i=0; i < 20000; i++)); do
    gzip file.$i -c > file.$(($i+1))
done
wc -c file.* | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
The resulting filesizes.txt is a tab-delimited, sorted file for your favorite graphing utility. (You'll have to manually remove the "total" field, or script it away.)
Random data, or data encrypted with a good cipher, would probably be best.
But any good packer should only add constant overhead once it decides that it can't compress the data (@Frank). For a fixed overhead, an empty file or a single character will give the greatest percentage overhead.
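To see that constant overhead concretely, here is a small sketch using zlib's compress() on pseudo-random bytes (buffer sizes and the seed are arbitrary; deflate falls back to stored blocks on incompressible input, so the output comes out slightly larger than the input):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    static unsigned char in[65536];
    static unsigned char out[70000];  /* comfortably above compressBound(65536) */
    uLongf outlen = sizeof(out);
    srand(42);                        /* arbitrary seed */
    for (size_t i = 0; i < sizeof(in); i++)
        in[i] = rand() & 0xff;        /* incompressible-looking input */
    if (compress(out, &outlen, in, sizeof(in)) != Z_OK)
        return 1;
    printf("in: %zu bytes, out: %lu bytes\n",
           sizeof(in), (unsigned long)outlen);
    /* Deflate stores incompressible data with about 5 bytes of overhead
       per 65535-byte block, plus the zlib wrapper, so expect the output
       to be slightly larger than the input. */
    return 0;
}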
For packers that include the filename (e.g. rar, zip, tar), you could of course just make the filename really long :-)
Try to gzip the file that results from the following command:
echo a > file.txt
Compressing this 2-byte file resulted in a 31-byte gzipped file!
A text file with 1 byte in it (for example one character like 'A') is stored in 1 byte on the disk, but WinRAR rars it to 94 bytes and zips it to 141 bytes.
I know it's a sort of cheat answer but it works. I think it's going to be the biggest % difference between original size and 'compressed' size you are going to see.
Take a look at the formats used for zipping; they are reasonably simple, and to make a 'compressed' file larger than the original, the most basic way is to avoid any repeating data.
All these compression algorithms look for redundant data. If your file has little or no redundancy (like a sequence of abac…az, bcbd…bz, cdce…cz, etc.), it is very likely that the "deflated" output is rather an inflation.