How is gzip content stored in a file? [duplicate] - c++

I am able to compress content in "gzip" format and store it in a file. This content can be decompressed programmatically within my code.
But I can't "gunzip" the file using the command line or other tools. E.g. "Hello World" (11 bytes) is compressed as 'test.txt.gz', but double-clicking the file or running gunzip from the command line produces an error.
On the other hand, if I store the same content in the .txt file and gzip that .txt, then it's decompressed properly.
What is the correct way to store gzip content into a file?
Here is the C++14 source code used for compression/decompression (Referred from: Compress string with GZip using qCompress?).
Hex dump (xxd -p /home/milind/Desktopp/test.txt.gz):
1f8b080000000000000b789cf348cdc9c95728cf2fca49010018ab043d52
9ed68b0b000000

The problem is that you wrote a gzip header, and then followed that with a zlib-format stream! You need to follow the gzip header with raw deflate data, with no zlib-format wrapper.
Don't use compress2(), which can only make a zlib stream. Use the deflate functions, documented in zlib.h. You can request that deflate produce a gzip stream, so you don't need to write the header yourself, and you don't need to do your own CRC-32 calculation. That is all provided by zlib.

Related

Can gzip only store a file, like zip, and not compress it?

Is it possible for gzip to not compress a file? What happens in that case? Does the archive still contain a DEFLATE stream? I need to handle this special case in my program.
Yes. If the file is not compressible, for example because it's already compressed, gzip will make a stored block, which contains the source data with some header and trailer bytes added.
You can make your own non-compressed stream if that's needed. RFC 1951 Sections 3.2.3 and 3.2.4 describe how it's done.
A Deflate stored block starts with a single byte whose value is 0x00 or 0x01 (BTYPE=00 and BFINAL=0 or 1), followed by 4 bytes holding LEN and NLEN (2 bytes each, least significant byte first), followed by your actual data. LEN is the number of data bytes (at most 2^16 - 1 = 65535) and NLEN is its one's complement. If you have more data than fits in one block, you have to do this multiple times. The last block should have the BFINAL bit set to 1.
Finally, you will have to prepend all of this with a gzip header as described in RFC 1952 (assuming it is a gzip stream; otherwise check RFC 1950 for zlib) and append the gzip trailer, which holds the CRC-32 and the length of the uncompressed data. The header can contain the filename, a timestamp, etc. It's a couple of hours of work on your part; most of that time will be spent understanding the spec.
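The layout described above can be sketched without zlib. The names crc32Of, appendLE and gzipStore are made up for this example, and the header uses the simplest legal choices (no optional fields, MTIME of 0, OS of 255 for "unknown"); treat it as an illustration of the format, not a replacement for a tested library.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 as used by gzip (reflected polynomial 0xEDB88320).
std::uint32_t crc32Of(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Append the low `bytes` bytes of `value`, least significant byte first.
void appendLE(std::string& out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.push_back(static_cast<char>((value >> (8 * i)) & 0xFFu));
    }
}

// Wrap `data` in a gzip stream made only of stored (uncompressed) blocks.
std::string gzipStore(const std::string& data) {
    std::string out;

    // 10-byte header: ID1, ID2, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255.
    const unsigned char header[10] = {0x1f, 0x8b, 0x08, 0x00, 0, 0, 0, 0, 0x00, 0xff};
    out.append(reinterpret_cast<const char*>(header), sizeof header);

    const std::size_t kMaxBlock = 65535;  // LEN is a 16-bit field
    std::size_t pos = 0;
    do {
        const std::size_t len = std::min(kMaxBlock, data.size() - pos);
        const bool last = (pos + len == data.size());
        const std::uint32_t len16 = static_cast<std::uint32_t>(len);

        out.push_back(last ? 0x01 : 0x00);   // BFINAL bit, BTYPE=00, padding
        appendLE(out, len16, 2);             // LEN
        appendLE(out, ~len16 & 0xFFFFu, 2);  // NLEN, one's complement of LEN
        out.append(data, pos, len);
        pos += len;
    } while (pos < data.size());

    // Trailer: CRC-32 of the uncompressed data, then its length modulo 2^32.
    appendLE(out, crc32Of(data), 4);
    appendLE(out, static_cast<std::uint32_t>(data.size() & 0xFFFFFFFFu), 4);
    return out;
}
```

For an 11-byte input this produces 34 bytes: 10 of header, 5 of block header, the data itself, and 8 of trailer.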

How to read large files in C++ with mixed text and binary

I need to read a large file of either text, binary, or combination, such as a JPEG file, encrypt it, and write it to a file. At some later time I will need to read the encrypted data, and decrypt it.
The end goal is to verify that the decrypted data matches the original data.
My problem is that with large files, larger than 1 MB, I don't want to read and write character by character. I am targeting this code for a phone, and the I/O will cause too long a delay for the user.
With a pure text file, using fread() and fwrite() seems to convert the data to binary, and the result is different from the original. With a JPEG image, it appears that there is some textual content mixed in with the binary data.
Is there a way to efficiently read in an arbitrary type of file and write it back in the original format?
Or is character by character the only option?
Or am I still out of luck?
After debugging it turned out that the decrypt function had the plain text and cipher text buffers assigned backwards. After swapping the buffer assignments, the decrypted results matched the original data. I originally thought that maybe reading the text as binary and then rewriting as binary would not appear as text, but I was wrong.
Reading the entire file as binary works just fine.
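As an illustration of reading and writing raw bytes in large chunks, here is a minimal sketch using C++ streams opened in binary mode. The question itself used fread() and fwrite(); the function names and the 64 KB chunk size are arbitrary choices for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read a whole file as raw bytes, one chunk at a time. Binary mode means no
// newline translation, so text and binary content are treated identically.
// Returns an empty vector if the file cannot be opened.
std::vector<char> readFileBinary(const std::string& path, std::size_t chunkSize = 64 * 1024) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> data;
    std::vector<char> chunk(chunkSize);
    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        // gcount() is the number of bytes the last read actually delivered.
        data.insert(data.end(), chunk.begin(), chunk.begin() + in.gcount());
    }
    return data;
}

// Write raw bytes back out unchanged. Returns false if the write failed.
bool writeFileBinary(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}
```

With this approach, a byte-for-byte comparison of the original and the round-tripped data is the natural check for the encrypt/decrypt pipeline.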

Uncompressed file size using zlib's gzip file access function

Using the Linux command-line tool gzip, I can tell the uncompressed size of a compressed file using gzip -l.
I couldn't find any function like that in the zlib manual section "gzip File Access Functions".
At this link, I found a solution http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last 4 bytes of the file, but I am avoiding it right now because I prefer to use the library's functions.
There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.
First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)
Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.
pigz has an option to in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.
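For reference, the last-four-bytes approach from the question looks roughly like this. It is only a sketch, the name readGzipIsize is made up, and every caveat above applies: the value is the length modulo 2^32, it describes only the last gzip stream in the file, and it is meaningless if there is trailing junk.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Read the last four bytes of a file as a little-endian 32-bit value.
// Returns false if the file cannot be read or is too short to be gzip.
bool readGzipIsize(const std::string& path, std::uint32_t& isize) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 18) return false;  // gzip header plus trailer alone take 18 bytes

    in.seekg(-4, std::ios::end);
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) return false;

    isize = static_cast<std::uint32_t>(b[0]) |
            (static_cast<std::uint32_t>(b[1]) << 8) |
            (static_cast<std::uint32_t>(b[2]) << 16) |
            (static_cast<std::uint32_t>(b[3]) << 24);
    return true;
}
```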

Finding whether the body contains gzipped data

I have a program that searches the reply from a curl request for specific strings. I sometimes get gzipped data. Is there a way to find out whether the reply is text or in gzipped format?
The headers sometimes contain a gzip or deflate entry, but it's not consistent. Is there a way to search the string and find out if it's gzipped?
You could try taking a look at the first two bytes of data. For gzipped data, they should be 0x1f, 0x8b.
Member header and trailer
ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213),
to identify the file as being in gzip format.
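A minimal sketch of that check, assuming the response body has already been collected into a std::string. The function name looksLikeGzip is made up for this example. A match only suggests gzip data; it does not prove the rest of the body is a valid stream.

```cpp
#include <string>

// True if the buffer starts with the gzip magic bytes ID1 = 0x1f, ID2 = 0x8b.
bool looksLikeGzip(const std::string& body) {
    return body.size() >= 2 &&
           static_cast<unsigned char>(body[0]) == 0x1f &&
           static_cast<unsigned char>(body[1]) == 0x8b;
}
```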
You could look at the first bytes of the file. Perhaps they contain a magic number.
The gzip file format starts with some "magic bytes". You can check whether the body starts with these, and if it does, push back the bytes into the stream and start unzipping it.
You could pipe it through zcat, and if it fails, use the string as is. Sloppy I know, but it ought to be reliable; a plain text file would never contain valid gzipped data.
Standards-compliant HTTP responses will contain a Content-Encoding: or Transfer-Encoding: header specifying "gzip" for compressed responses, eliminating the need to guess by looking at the magic number. Unfortunately, lots of sites get these headers wrong.

gzip file - does it have a block header like bzip2 does?

Every compressed block in the bzip2 format has a header, which begins with ".compressed_magic:48 = 0x314159265359 (BCD (pi))". So it is fairly easy to detect block boundaries in the middle of a big bzip2 file in binary form.
Does the gzip format have similar magic constants in the middle of a big file?
Or, to put the question another way: does gzip have a gziprecover like bzip2 has bzip2recover?
http://www.gzip.org/zlib/rfc-gzip.html I didn't reread it, but as far as I remember, there are no block headers in that format.