finding whether the body contains gzipped data

finding whether the body contains gzipped data - c++

i have a program wherein it searches the reply from a curl request for specific strings. i sometimes get gzipped data. is there a way to find whether the reply is text or gzipped format?
header sometimes contain gziipped,deflate header, but its not consistent. is there a way to search the string and find if its gzipped?

You could try taking a look at the first two bytes of data. For gzipped data, they should be 0x1f, 0x8b.
Member header and trailer
ID1 (IDentification 1)
ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213),
to identify the file as being in gzip format.

You could look at the first bytes of the file. Perhaps they containt a magic number.

The gzip file format starts with some "magic bytes". You can check whether the body starts with these, and if it does, push back the bytes into the stream and start unzipping it.

You could pipe it through zcat, and if it fails, use the string as is. Sloppy I know, but it ought to be reliable; a plain text file would never contain valid gzipped data.

Standards-compliant HTTP responses will contain a Content-Encoding: or Transfer-Encoding: header specifying "gzip" for compressed responses, eliminating the need to guess by looking at magic number. Unfortunately, lots of sites get these headers wrong, though.

Related

can gzip only store a file like zip and not compress it?

Is it possible for gzip to not compress a file? What happens in that case? Does the archive still contain a DEFLATE stream? I need to handle this special case in my program.

Yes, if the file is not compressible for example if it's already compressed, gzip will make a stored block which contains the source data with some header and trailers appended.
You can make your own non-compressed stream if that's needed. RFC 1951 Sections 3.2.3 and 3.2.4 describe how it's done.
A Deflate stored block is basically a single byte whose value is 0x00 or 0x01 (BTYPE=00 and BFINAL=0,1), followed by 4 bytes of LEN and NLEN followed by your actual data. LEN is the number of data bytes (2^16=64KB) and NLEN is one's complement. If you have more than 64KB, you have to do this multiple times. The last block should have the BFINAL bit set to 1.
Finally, you will have to prepend all of this with a gzip header RFC 1952 (assuming it is a GZIP stream, otherwise check RFC 1950 for ZLIB). The header contains filename, timestamp etc. It's couple hours of work on your part --most time will be spent understanding the spec.

decipher a file format called .EWB

I have a file which I know that contains a bunch of compressed files inside with some kind of a header.
Can anyone tell me how to unpack it?
file format is .EWB, which stands for EasyWorshipBible.
I know its possible as I've seen it being done. But they didn't tell me how.
I tried using hex editors and winRAR. But non of them seem to get the files correct.

In an example I found, each entry begins with hex 51 4b 03 04, followed by six more bytes of information, followed by a zlib stream. When the zlib streams are decompressed, they have the format "1:1 text line ...", blank line, "1:2 text line ...", etc. However the text does not seem to match the extraction I found along with the example, so I suspect that the text is encoded or encrypted somehow.
That should be enough to get you started.

Get files (other than text) from .zip with libzip

I am learning C++ and decided to train me by making a little program that extract files from zip, like text files, images, or even other zip files (but I don't want to extract them directly, one thing a time) with the libzip library.
So I made my program, but now I have a problem.
It extracts well text files, but not files like images or zip. It detects them, gives me exact names and sizes, but once extracted, they are just a few bytes. (but they are located where they should).
Here is my code: http://pastie.org/6221955
So if someone could help me to extract files that aren't texts from zip, it would be great! Thank you!

You're reading and writing binary data as a textual string. The problem is that strings use the presence of a NULL character (0-byte) to indicate end-of-string. Binary data can (and definitely does) contain zeros all over the place, not just at the end.
You need to use ofstream's .write (buffer, <size in bytes>) to write to the disk; by manually specifying the size in bytes, you force it read that many bytes instead of stopping at the first instance of a NULL character.

The issue is with the << operator. You output a character array / string. Strings in C are null terminated. Thus the first binary 0 will terminate your output.

HTTP Response Status-Line maximum size

Quick question - is there a maximum size for the Status-Line of a HTTP Response?
In the RFC I could not find this information, just something like this:
Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
According to this i could assume:
HTTP-Version is usually 8 Bytes ( e.g. HTTP/1.1 )
Status-Code is 3 Bytes
2 Spaces + CRLF is 4 Bytes
Reason-Phrase -> The longest according to the RFC is Requested range not satisfiable so 31 Bytes
This would be a sum of 46 Bytes.
Is this assumption correct or did I miss anything?
UPDATE:
Due to the answer below, I just want to specify my problem a bit:
I am parsing some kind of Log file with TCP messages from a server. Now there is some random Data I don't care for and some HTTP Messages which I want to read. Now all data I get I parse for a \r\n to find the Status Line. Since I need to make assumption that my header is split into several TCP packages I just buffer all data and parse it.
If there is no maximum size for the header status-line, I need to buffer all data until the next \r\n occurs. In the worst case this means I save like kilobytes over kilobytes of random data, since it could ( but will most likely will not ) be part of the Header Status Line.
Or would it , in this case, be rather appropriate to parse for the HTTP Version String instead of the CRLF ?

RFC 2616, 6.1.1:
The reason phrases listed here are only recommendations -- they MAY be
replaced by local equivalents without affecting the protocol.
Aside from this, the HTTP protocol is "allowed" to add more status codes (in a new RFC) without changing the HTTP version to 1.2, provided that the new codes don't introduce additional requirements on HTTP clients. Clients are supposed to treat an unknown status code as if it were x00 (where x is the first digit of the code they get, indicating the category of response), except that they shouldn't cache the response.
So the only limit is the max length of an HTTP header line or of the response headers in total. As far as I can see, the RFC doesn't define any limit, although specific servers impose their own.
What you can be sure of is that the user-agent may ignore the Reason Phrase entirely. So if it's big, you can read it in small pieces and throw them away one at a time until you reach CRLF. If you want to display a human-readable message, mostly you can use the recommended Reason Phrase for the status code that the server provides, regardless of what Reason Phrase the server sends.

I don't think there is any limit on the length of the ReasonPHrase. The W3C doc states it is a "short message" but that is not canonical.
I would not assume Version is 8 characters. Perhaps a version in the future could have 3 digits, ie: HTTP/10.1. The syntax specifies Version is delimited by a SPACE, so I would parse it by stopping at the first SPACE.
https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html
The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason- Phrase.

What is a good way to test a file to see if its a zip file?

I am looking as a new file format specification and the specification says the file can be either xml based or a zip file containing an xml file and other files.
The file extension is the same in both cases. What ways could I test the file to decide if it needs decompressing or just reading?

The zip file format is defined by PKWARE. You can find their file specification here.
Near the top you will find the header specification:
A. Local file header:
local file header signature 4 bytes (0x04034b50)
version needed to extract 2 bytes
general purpose bit flag 2 bytes
compression method 2 bytes
last mod file time 2 bytes
last mod file date 2 bytes
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
file name length 2 bytes
extra field length 2 bytes
file name (variable size)
extra field (variable size)
From this you can see that the first 4 bytes of the header should be the file signature which should be the hex value 0x04034b50. Byte order in the file is the other way round - PKWARE specify that "All values are stored in little-endian byte order unless otherwise specified.", so if you use a hex editor to view the file you will see 50 4b 03 04 as the first 4 bytes.
You can use this to check if your file is a zip file. If you open the file in notepad, you will notice that the first two bytes (50 and 4b) are the ASCII characters PK.

You could look at the magic number of the file. The ones for ZIP archives are listed on the ZIP format wikipedia page: PK\003\004 or PK\005\006.

Check the first few bytes of the file for the magic number. Zip files begin with PK (50 4B). As XML files cannot start with these characters and still be valid, you can be fairly sure as to the file type.

You can use file to see if it's a text file(xml) or an executable(zip).
Scroll down to see an example.

Not a good solution though, but just thinking out load... how about:
try
{
LoadXmlFile(theFile);//Exception if not an xml file
}
catch(Exception ex)
{
LoadZipFile(theFile)
}

You could check the file to see if it contains a valid XML header. If it doesn't, try decompressing it.
See Click here for XML specification.

File magic numbers
To clarify, it starts with 50 4b 03 04.
See http://www.pkware.com/documents/casestudies/APPNOTE.TXT (From Simon P Stevens)

You could try unzipping it - an XML file is exceedingly unlikely to be a valid zip file, or could check the magic numbers, as others have said.

it depends on what you are using but the zip library might have a function that test wether a file or not is a zip file
something like is_zip, test_file_zip or whatever ...
or create you're own function by using the magic number given above.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

finding whether the body contains gzipped data - c++

You could try taking a look at the first two bytes of data. For gzipped data, they should be 0x1f, 0x8b. Member header and trailer ID1 (IDentification 1) ID2 (IDentification 2) These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213), to identify the file as being in gzip format.

You could look at the first bytes of the file. Perhaps they containt a magic number.

The gzip file format starts with some "magic bytes". You can check whether the body starts with these, and if it does, push back the bytes into the stream and start unzipping it.

You could pipe it through zcat, and if it fails, use the string as is. Sloppy I know, but it ought to be reliable; a plain text file would never contain valid gzipped data.

Standards-compliant HTTP responses will contain a Content-Encoding: or Transfer-Encoding: header specifying "gzip" for compressed responses, eliminating the need to guess by looking at magic number. Unfortunately, lots of sites get these headers wrong, though.

Related

can gzip only store a file like zip and not compress it?

decipher a file format called .EWB

Get files (other than text) from .zip with libzip

HTTP Response Status-Line maximum size

What is a good way to test a file to see if its a zip file?

Categories

Resources