Every compressed block in the bzip2 format has a header, which begins with ".compressed_magic:48 = 0x314159265359 (BCD (pi))". So it can be rather easy to detect the middle of a big bzip2 file in binary form.
Does the gzip format have similar magic constants in the middle of a big file?
Or, to put the question another way: does gzip have a gziprecover like bzip2 has bzip2recover?
http://www.gzip.org/zlib/rfc-gzip.html I didn't reread it, but as far as I remember, there are no block headers in that format.
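To make the bzip2 side of the comparison concrete, here is a minimal Python sketch that searches a bzip2 stream for the block magic quoted above. The helper name find_aligned_block_magic is made up for illustration. It only finds occurrences that happen to fall on a byte boundary; bzip2 blocks are bit-aligned, so a real recovery tool has to scan bit by bit. The first block does sit right after the 4-byte stream header, which is what the sketch relies on.

```python
import bz2

# 48-bit block-header magic from the question (BCD digits of pi).
BLOCK_MAGIC = bytes.fromhex("314159265359")

def find_aligned_block_magic(data):
    """Return byte offsets where the block magic appears byte-aligned.

    Hypothetical helper: later blocks may start mid-byte and are missed here.
    """
    offsets = []
    pos = data.find(BLOCK_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BLOCK_MAGIC, pos + 1)
    return offsets

stream = bz2.compress(b"Hello World" * 100)
offsets = find_aligned_block_magic(stream)
# The stream header is "BZh" plus a level digit, so the first block starts at 4.
first_block_offset = offsets[0]
```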
I am able to compress certain content with the "gzip" format and store it in a file. This content can be decompressed programmatically within my code.
But I can't "gunzip" it using the command line or other tools. For example, "Hello World" (11 bytes) is compressed properly as 'test.txt.gz', but when I double-click it or decompress it from the command line, it fails with an error.
On the other hand, if I store the same content in the .txt file and gzip that .txt, then it's decompressed properly.
What is the correct way to store gzip content into a file?
Here is the C++14 source code used for compression/decompression (Referred from: Compress string with GZip using qCompress?).
Hex dump (xxd -p /home/milind/Desktopp/test.txt.gz):
1f8b080000000000000b789cf348cdc9c95728cf2fca49010018ab043d52
9ed68b0b000000
The problem is that you wrote a gzip header, and then followed that with a zlib-format stream! You need to follow the gzip header with raw deflate data, with no zlib-format wrapper.
Don't use compress2(), which can only make a zlib stream. Use the deflate functions (deflateInit2(), deflate(), deflateEnd()), documented in zlib.h. You can request that deflate produce a gzip stream by adding 16 to the windowBits argument of deflateInit2(), so you don't need to write the header yourself, and you don't need to do your own CRC-32 calculation. That is all provided by zlib.
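As a rough illustration of that advice, here is a minimal sketch using Python's zlib module, which wraps the same library, rather than the C++ from the question. Passing 16 + MAX_WBITS as the window-bits value asks deflate for a gzip wrapper, so the header, CRC-32 and length trailer are all written by zlib. The function name gzip_bytes is an assumption made for this example.

```python
import gzip
import zlib

def gzip_bytes(data, level=6):
    """Compress data into a complete gzip stream using zlib's deflate."""
    # 16 + MAX_WBITS selects the gzip wrapper instead of the zlib wrapper.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()

blob = gzip_bytes(b"Hello World")
# Writing blob to a .gz file as-is should give something gunzip accepts.
```

The resulting bytes start with 1f 8b and are followed directly by raw deflate data, with no 78 9c zlib header in between as in the hex dump above.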
Is it possible for gzip to not compress a file? What happens in that case? Does the archive still contain a DEFLATE stream? I need to handle this special case in my program.
Yes. If the file is not compressible, for example because it's already compressed, gzip will make stored blocks, which contain the source data with a small header, wrapped in the usual gzip header and trailer.
You can make your own non-compressed stream if that's needed. RFC 1951 Sections 3.2.3 and 3.2.4 describe how it's done.
A Deflate stored block is basically a single byte whose value is 0x00 or 0x01 (BTYPE=00 and BFINAL=0 or 1), followed by 4 bytes of LEN and NLEN (two bytes each, little-endian), followed by your actual data. LEN is the number of data bytes (at most 2^16 - 1 = 65,535) and NLEN is its one's complement. If you have more than 65,535 bytes, you have to do this multiple times. The last block should have the BFINAL bit set to 1.
Finally, you will have to prepend all of this with a gzip header and append the gzip trailer (CRC-32 and uncompressed length) as described in RFC 1952 (assuming it is a GZIP stream; otherwise check RFC 1950 for ZLIB). The header can contain the filename, timestamp, etc. It's a couple of hours of work on your part, and most of the time will be spent understanding the spec.
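The layout described above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names are invented, the header uses no optional fields (FLG=0, MTIME=0) and an OS byte of 255 ("unknown"), and everything is built in memory.

```python
import gzip
import struct
import zlib

MAX_STORED = 0xFFFF  # LEN is a 16-bit field, so at most 65,535 bytes per block

def stored_deflate(data):
    """Wrap data in deflate stored blocks (BTYPE=00), without compressing it."""
    chunks = [data[i:i + MAX_STORED] for i in range(0, len(data), MAX_STORED)]
    if not chunks:
        chunks = [b""]  # an empty input still needs one final block
    out = bytearray()
    for index, chunk in enumerate(chunks):
        bfinal = 1 if index == len(chunks) - 1 else 0
        out.append(bfinal)  # bit 0 = BFINAL, bits 1-2 = BTYPE 00, rest padding
        out += struct.pack("<HH", len(chunk), len(chunk) ^ 0xFFFF)  # LEN, NLEN
        out += chunk
    return bytes(out)

def gzip_stored(data):
    """Build a gzip member whose deflate payload is stored blocks only."""
    # ID1, ID2, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0, 0, 0, 255)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF,
                          len(data) & 0xFFFFFFFF)
    return header + stored_deflate(data) + trailer

member = gzip_stored(b"Hello World")
```

For an 11-byte input this adds 10 bytes of header, 5 bytes of block header and 8 bytes of trailer around the untouched data.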
Using the Linux command-line tool gzip, I can get the uncompressed size of a compressed file with gzip -l.
I couldn't find any function like that in the zlib manual section "gzip File Access Functions".
At this link, I found a solution http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last 4 bytes of the file, but I am avoiding it right now because I prefer to use the library's functions.
There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.
First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
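For reference, reading that four-byte field looks roughly like the following Python sketch. The function name isize_from_trailer is hypothetical, and the result is only the length modulo 2^32 of the last member, subject to every caveat in this answer.

```python
import gzip
import struct

def isize_from_trailer(blob):
    """Return the little-endian 32-bit length stored in the last four bytes."""
    return struct.unpack("<I", blob[-4:])[0]

# Rough bound from the text: below this compressed size, the true length
# of a single stream cannot have wrapped around 2^32.
SAFE_COMPRESSED_LIMIT = 2**32 // 1032 + 18

reported = isize_from_trailer(gzip.compress(b"Hello World"))
```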
Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)
Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.
pigz has an option to in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.
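The decode-everything approach can be approximated with Python's zlib bindings as follows. This is a sketch under simplifying assumptions: the whole file is held in memory, each member is inflated in one call, and anything after the last member that does not start with the gzip magic is treated as trailing junk. The name decoded_size is invented for the example; it is not how pigz is implemented.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def decoded_size(blob):
    """Total uncompressed length of all concatenated gzip members in blob."""
    total = 0
    rest = blob
    while rest.startswith(GZIP_MAGIC):
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip wrapper
        total += len(member.decompress(rest))
        if not member.eof:
            raise ValueError("truncated gzip member")
        rest = member.unused_data  # bytes following this member, if any
    return total

# Two concatenated members followed by zero padding.
two_members = gzip.compress(b"a" * 1000) + gzip.compress(b"b" * 10)
padded = two_members + b"\x00" * 8
trailer_says = int.from_bytes(two_members[-4:], "little")
```

Here the trailer of the concatenated file only reports the last member's length, and with the padding appended the last four bytes are just zeros, while decoding gives the real total.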
I have a program that searches the reply from a curl request for specific strings. I sometimes get gzipped data. Is there a way to find out whether the reply is text or in gzipped format?
The headers sometimes contain a gzip or deflate entry, but it's not consistent. Is there a way to inspect the response body itself and find out if it's gzipped?
You could try taking a look at the first two bytes of data. For gzipped data, they should be 0x1f, 0x8b.
Member header and trailer
ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213),
to identify the file as being in gzip format.
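A minimal Python sketch of that check follows. The function names are made up for illustration, and a match on two bytes is a strong hint rather than proof, so a caller may still want to handle a decompression error.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # ID1 = 0x1f, ID2 = 0x8b

def looks_gzipped(body):
    """True if the body starts with the two gzip identification bytes."""
    return body[:2] == GZIP_MAGIC

def decoded_body(body):
    """Return the body as plain bytes, gunzipping it only when needed."""
    return gzip.decompress(body) if looks_gzipped(body) else body

plain = b"<html>hello</html>"
packed = gzip.compress(plain)
```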
You could look at the first bytes of the file. Perhaps they contain a magic number.
The gzip file format starts with some "magic bytes". You can check whether the body starts with these, and if it does, push back the bytes into the stream and start unzipping it.
You could pipe it through zcat, and if it fails, use the string as is. Sloppy I know, but it ought to be reliable; a plain text file would never contain valid gzipped data.
Standards-compliant HTTP responses will contain a Content-Encoding: or Transfer-Encoding: header specifying "gzip" for compressed responses, eliminating the need to guess by looking at the magic number. Unfortunately, lots of sites get these headers wrong.
Is there any fast way to determine if some arbitrary image file is a png file or a jpeg file or none of them?
I'm pretty sure there is some way and these files probably have some sort of their own signatures and there should be some way to distinguish them.
If possible, could you also provide the names of the corresponding routines in libpng / libjpeg / boost::gil::io.
Look at the magic number at the beginning of the file. From the Wikipedia page:
JPEG image files begin with FF D8 and end with FF D9. JPEG/JFIF files contain the ASCII code for "JFIF" (4A 46 49 46) as a null terminated string. JPEG/Exif files contain the ASCII code for "Exif" (45 78 69 66) also as a null terminated string, followed by more metadata about the file.
PNG image files begin with an 8-byte signature which identifies the file as a PNG file and allows detection of common file transfer problems: \211 P N G \r \n \032 \n
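Checking those signatures by hand takes only a few lines. The sketch below is in Python and uses an invented function name, sniff_image_type; it looks only at the leading bytes quoted above and does not validate the rest of the file.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # \211 P N G \r \n \032 \n
JPEG_SOI = b"\xff\xd8"                # JPEG files begin with FF D8

def sniff_image_type(header):
    """Classify a file by its first bytes as 'png', 'jpeg' or 'unknown'."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header.startswith(JPEG_SOI):
        return "jpeg"
    return "unknown"

# Reading the first 8 bytes of a file is enough for both checks.
kind = sniff_image_type(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
```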
Apart from Tim Yates' suggestion of reading the magic number "by hand", the Boost GIL documentation says:
png_read_image throws std::ios_base::failure if the file is not a valid PNG file.
jpeg_read_image throws std::ios_base::failure if the file is not a valid JPEG file.
Similarly for other Boost GIL routines. If you only want the type, you might want to try reading only the dimensions, rather than loading the entire file.
The question is essentially answered by the above replies, but I thought I'd add the following: If you ever need to determine file types beyond just "JPEG, PNG, other", there's always libmagic. This is what powers the Unix utility file, which is pretty magical indeed, on many of the modern operating systems.
Image file types like PNG and JPG have well-defined file formats that include signatures identifying them. All you have to do is read enough of the file to read that signature.
The signatures you need are well documented:
http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
http://en.wikipedia.org/wiki/JPEG#Syntax_and_structure