Inflate (decompress) PNG file using the zlib library in C++

I'm trying to use zlib to inflate (decompress) .FLA files and extract all of their contents. Since FLA files use the ZIP format, I am able to read the local file headers (https://en.wikipedia.org/wiki/Zip_(file_format)) and use the information in them to decompress the files.
It seems to work fine for regular text-based files, but for binary files (I've only tried PNG and DAT files) it fails to decompress them and returns Z_DATA_ERROR.
I'm unable to use the minizip library that ships with zlib, since the central directory file header inside FLA files differs slightly from that of normal ZIP files (which is why I'm reading the local file headers manually).
Here's the code I use to decompress a chunk of data:
void DecompressBuffer(char* compressedBuffer, unsigned int compressedSize, std::string& out_decompressedBuffer)
{
    // init the decompression stream
    z_stream stream;
    stream.zalloc = Z_NULL;
    stream.zfree = Z_NULL;
    stream.opaque = Z_NULL;
    stream.avail_in = 0;
    stream.next_in = Z_NULL;
    int err = inflateInit2(&stream, -MAX_WBITS);
    if (err != Z_OK)
    {
        printf("Error: inflateInit %d\n", err);
        return;
    }
    // Set the starting point and total data size to be read
    stream.avail_in = compressedSize;
    stream.next_in = (Bytef*)&compressedBuffer[0];
    std::stringstream strStream;
    // Start decompressing
    while (stream.avail_in != 0)
    {
        unsigned char* readBuffer = (unsigned char*)malloc(MAX_READ_BUFFER_SIZE + 1);
        readBuffer[MAX_READ_BUFFER_SIZE] = '\0';
        stream.next_out = readBuffer;
        stream.avail_out = MAX_READ_BUFFER_SIZE;
        int ret = inflate(&stream, Z_NO_FLUSH);
        if (ret == Z_STREAM_END)
        {
            // only store the data we have left in the stream
            size_t length = MAX_READ_BUFFER_SIZE - stream.avail_out;
            std::string str((char*)readBuffer);
            str = str.substr(0, length);
            strStream << str;
            break;
        }
        else
        {
            if (ret != Z_OK)
            {
                printf("Error: inflate %d\n", ret); // This is what it reaches when trying to inflate a PNG or DAT file
                break;
            }
            // store the readbuffer in the stream
            strStream << readBuffer;
        }
        free(readBuffer);
    }
    out_decompressedBuffer = strStream.str();
    inflateEnd(&stream);
}
I have tried zipping a single PNG file and extracting that. This doesn't return any errors from inflate(), but it doesn't correctly inflate the PNG either; only the first few bytes match the original.
The original file (left) and the file produced by my code (right): [image: hex editor views of both PNGs]

You do things that rely on the data being text and strings, not binary data.
For example
std::string str((char*)readBuffer);
If the contents of readBuffer is raw binary data then it might contain one or more zero bytes in the middle of it. When you use it as a C-style string then the first zero will act as the string terminator character.
I suggest you generalize the code and remove the dependency on strings. Instead, use e.g. std::vector<int8_t>.
Meanwhile, during your transition to a more generalized way, you can do e.g.
std::string str((char*)readBuffer, length);
This will create a string of the specified length, and the contents will not be checked for terminators.
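To make the difference concrete, here is a small sketch that uses only the standard library. The buffer holds the first 16 bytes of a typical PNG file (the 8-byte signature followed by the start of the IHDR chunk); the helper names viaCString and viaLength are made up for this illustration. Streaming the buffer with strStream << readBuffer has the same problem as the one-argument constructor, so strStream.write() with an explicit length would likely be needed there as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// First 16 bytes of a typical PNG: signature, then the IHDR chunk length and type.
static const unsigned char kPngStart[16] = {
    0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A,
    0x00, 0x00, 0x00, 0x0D, 'I', 'H', 'D', 'R'
};

// Treats the buffer as a C string: copying stops at the first zero byte.
std::string viaCString(const unsigned char* data)
{
    return std::string(reinterpret_cast<const char*>(data));
}

// Copies exactly `length` bytes, zero bytes included.
std::string viaLength(const unsigned char* data, std::size_t length)
{
    return std::string(reinterpret_cast<const char*>(data), length);
}
```

With this buffer, viaCString(kPngStart) keeps only the 8 signature bytes, while viaLength(kPngStart, sizeof kPngStart) keeps all 16, which matches the symptom of only the first few values being correct.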

Related

zlib error -3 while decompressing archive: Incorrect data check

I am writing a C++ library that also decompresses zlib files. For all of the files, the last call to gzread() (or at least one of the last calls) gives error -3 (Z_DATA_ERROR) with message "incorrect data check". As I have not created the files myself I am not entirely sure what is wrong.
I found this answer and if I do
gzip -dc < myfile.gz > myfile.decomp
gzip: invalid compressed data--crc error
on the command line the contents of myfile.decomp seems to be correct. There is still the crc error printed in this case, however, which may or may not be the same problem. My code, pasted below, should be straightforward, but I am not sure how to get the same behavior in code as on the command line above.
How can I achieve the same behavior in code as on the command line?
std::vector<char> decompress(const std::string &path)
{
    gzFile inFileZ = gzopen(path.c_str(), "rb");
    if (inFileZ == NULL)
    {
        printf("Error: gzopen() failed for file %s.\n", path.c_str());
        return {};
    }
    constexpr size_t bufSize = 8192;
    char unzipBuffer[bufSize];
    int unzippedBytes = bufSize;
    std::vector<char> unzippedData;
    unzippedData.reserve(1048576); // 1 MiB is enough in most cases.
    while (unzippedBytes == bufSize)
    {
        unzippedBytes = gzread(inFileZ, unzipBuffer, bufSize);
        if (unzippedBytes == -1)
        {
            // Here the error is -3 / "incorrect data check" for (one of) the last block(s)
            // in the file. The bytes can be correctly decompressed, as demonstrated on the
            // command line, but how can this be achieved in code?
            int errnum;
            const char *err = gzerror(inFileZ, &errnum);
            printf("%s\n", err);
            break;
        }
        if (unzippedBytes > 0)
        {
            unzippedData.insert(unzippedData.end(), unzipBuffer, unzipBuffer + unzippedBytes);
        }
    }
    gzclose(inFileZ);
    return unzippedData;
}
First off, the whole point of the CRC is to detect corrupted data. If the CRC is bad, you should go back to where the file came from and get an uncorrupted copy. If the CRC is bad, discard the input and report an error.
You are not clear on the "behavior" you are trying to reproduce, but if you're trying to recover as much data as possible from a corrupted gzip file, then you will need to use zlib's inflate functions to decompress the file. int ret = inflateInit2(&strm, 31); will initialize the zlib stream to process a gzip file.

zlib Z_BUF_ERROR with specific file and specific buffer sizes

I'm developing some code that needs to be capable of unzipping large gzip'd files (up to 5GB uncompressed) and reading them into memory. I would prefer to be clean about this and not simply unzip them to disk temporarily so I've been working with zlib to try to accomplish this. I've got it running, most of the way. Meaning it runs for 4 of the 5 files I've used as input. The other file gives a Z_BUF_ERROR right in the middle of processing and I'd prefer not to ignore it.
This initially happened in different code but eventually I brought it all the way back to the example code that I got from zpipe.c on the zlib web page, and no matter what code I used, it resulted in the same Z_BUF_ERROR and only with this file. I played with the code for quite a while after reading several posts about Z_BUF_ERROR and after reading the manual on this as well. Eventually I was able to find a way to make it work by changing the size of the buffer used to hold the inflated output. Normally at this point I'd call it a day until it reported an error with another file, but ideally this will be production level code at some point and I'd like to understand what the error is so I can prevent it rather than just fix it for now. Especially since gzip is able to compress and decompress the file just fine.
I've tried this with the following variations:
different platforms: CentOS, OSX
different versions of zlib: 1.2.3, 1.2.8 (same results)
values of CHUNK and the number of bytes output (complete is 783049330):
2000000: 783049330
1048576: 783049330
1000000: 783049330
100000: 783049330
30000: 248421347
25000: 31095404
20000: 783049330
19000: 155821787
18000: 412613687
17000: 55799133
16384: 37541674
16000: 783049330
any CHUNK size greater than 4100000 gives an error
tried declaring out with a value greater than CHUNK (same results)
tried using malloc to declare out (same results)
tried using gzip to uncompress and then compress the file again thinking something may have been off in the gzip metadata (same results)
tried compressing a separate uncompressed version of the file using gzip for the same purpose, but I believe the original .gz file was created from this one (same results)
I may have tried a few things outside of this list as I've been trying to get to the bottom of it for a while, but only changing the CHUNK size will make this work. My only concern is that I don't know why a different size will work and I'm worried that another CHUNK size will put other files at risk for this issue, because again, this is only an issue for one file.
CODE:
FILE* fp = fopen( argv[1], "rb" );
int ret = inf( fp, stdout );
fclose( fp );
int inf(FILE *source, FILE *dest)
{
    size_t CHUNK = 100000;
    int count = 0;
    int ret;
    unsigned have;
    z_stream strm;
    unsigned char in[CHUNK];
    unsigned char out[CHUNK];
    char out_str[CHUNK];
    /* allocate inflate state */
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_in = 0;
    strm.next_in = Z_NULL;
    ret = inflateInit2(&strm, 16+MAX_WBITS);
    if (ret != Z_OK)
        return ret;
    /* decompress until deflate stream ends or end of file */
    do {
        strm.avail_in = fread(in, 1, CHUNK, source);
        if (ferror(source)) {
            (void)inflateEnd(&strm);
            return Z_ERRNO;
        }
        if (strm.avail_in == 0)
            break;
        strm.next_in = in;
        /* run inflate() on input until output buffer not full */
        do {
            strm.avail_out = CHUNK;
            strm.next_out = out;
            ret = inflate(&strm, Z_NO_FLUSH);
            switch (ret) {
            case Z_NEED_DICT:
                ret = Z_DATA_ERROR; /* and fall through */
            case Z_DATA_ERROR:
            case Z_MEM_ERROR:
                (void)inflateEnd(&strm);
                return ret;
            }
            have = CHUNK - strm.avail_out;
            char out_str[have+1];
            strncpy( out_str, (char*)out, have );
            out_str[have] = '\0';
            // testing the ability to store the result in a string object and viewing the output
            std::cout << "out_str: " << std::string(out_str) << " ::" << std::endl;
            if( ret == Z_BUF_ERROR ){
                std::cout << "Z_BUF_ERROR!" << std::endl;
                exit(1);
            }
        } while (strm.avail_out == 0);
        /* done when inflate() says it's done */
    } while (ret != Z_STREAM_END);
    /* clean up and return */
    (void)inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}
You should read the commentary where you got that code from. Z_BUF_ERROR is just an indication that there was nothing for inflate() to do on that call. Simply continue and provide more input data and more output space for the next inflate() call.

C++ Inflate gzip char array

I'm attempting to use zlib to uncompress (inflate) some IP packet payload data that is compressed via gzip. However, I'm having some difficulty understanding some of the documentation provided by zlib that covers inflation. I have a char array that my program fills but I can't seem to inflate it with the following code:
const u_char *payload; /* contains gzip data,
captured prior to this point in the program */
/*read compressed contents*/
int ret; //return val
z_stream stream;
unsigned char out[MEM_CHUNK]; //output array, MEM_CHUNK defined as 65535
/* allocate inflate state */
stream.zalloc = Z_NULL;
stream.zfree = Z_NULL;
stream.opaque = Z_NULL;
stream.avail_in = size_payload; // size of input
stream.next_in = (Bytef *)payload; // input char array
stream.avail_out = (uInt)sizeof(out); // size of output
stream.next_out = (Bytef *)out; // output char array
ret = inflateInit(&stream);
inflate(&stream, Z_NO_FLUSH);
inflateEnd(&stream);
printf("Inflate: %s\n\n", out);
In the zlib documentation, they have inflate continually called via a do/while loop, checking for the Z_STREAM_END flag. I'm a bit confused here, because it seems they're working from a file while I'm not. Do I need this loop as well, or am I able to provide a char array without looping over inflate?
Any guidance here would really be appreciated. I'm pretty new to both working with compression and C++.
Thanks.
Assuming you are giving inflate an appropriate and complete "compressed stream", and there is enough space to output the data, you would only need to call inflate once.
Edit: It is not written out as clearly as that in the zlib documentation, but it does say:
inflate decompresses as much data as possible, and stops when the
input buffer becomes empty or the output buffer becomes full. It may
introduce some output latency (reading input without producing any
output) except when forced to flush.
Of course, for any stream that isn't already "in memory and complete", you want to run it block by block, since that's going to have less total runtime (you can decompress while the data is being received [from network or filesystem pre-fetch caching] for the next block).
Here's the whole function from your example code. I've removed the text components from the page to concentrate on the code, and marked sections with letters (// A, // B, etc.), then tried to explain the sections below.
int inf(FILE *source, FILE *dest)
{
    int ret;
    unsigned have;
    z_stream strm;
    unsigned char in[CHUNK]; // A
    unsigned char out[CHUNK];
    /* allocate inflate state */
    strm.zalloc = Z_NULL; // B
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_in = 0;
    strm.next_in = Z_NULL;
    ret = inflateInit(&strm); // C
    if (ret != Z_OK)
        return ret;
    /* decompress until deflate stream ends or end of file */
    do {
        strm.avail_in = fread(in, 1, CHUNK, source); // D
        if (ferror(source)) {
            (void)inflateEnd(&strm); // E
            return Z_ERRNO;
        }
        if (strm.avail_in == 0) // F
            break;
        strm.next_in = in; // G
        /* run inflate() on input until output buffer not full */
        do {
            strm.avail_out = CHUNK; // H
            strm.next_out = out;
            ret = inflate(&strm, Z_NO_FLUSH); // I
            assert(ret != Z_STREAM_ERROR); /* state not clobbered */
            switch (ret) {
            case Z_NEED_DICT:
                ret = Z_DATA_ERROR; /* and fall through */
            case Z_DATA_ERROR:
            case Z_MEM_ERROR:
                (void)inflateEnd(&strm);
                return ret;
            }
            have = CHUNK - strm.avail_out; // J
            if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
                (void)inflateEnd(&strm);
                return Z_ERRNO;
            }
        } while (strm.avail_out == 0); // K
        /* done when inflate() says it's done */
    } while (ret != Z_STREAM_END); // L
    /* clean up and return */
    (void)inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}
A: in is the input buffer (we read from a file into this buffer, then pass it to inflate a while later). out is the output buffer, which is used by inflate to store the output data.
B: Set up a z_stream object called strm. This holds various fields, most of which are not important here (thus set to Z_NULL). The important ones are the avail_in and next_in as well as avail_out and next_out (which are set later).
C: Start inflation process. This sets up some internal data structures and just makes the inflate function itself "ready to run".
D: Read a "CHUNK" amount of data from file. Store the number of bytes read in strm.avail_in, and the actual data goes into in.
E: If we errored out, finish the inflate by calling inflateEnd. Job done.
F: No data available, we're finished.
G: Set where our data is coming from (next_in is set to the input buffer, in).
H: We're now in the loop to inflate things. Here we set the output buffer up: next_out and avail_out indicate where the output goes and how much space there is, respectively.
I: Call inflate itself. This will uncompress a portion of the input buffer, stopping when the input is used up or the output is full.
J: Calculate how much data is available in this step (have is the number of bytes).
K: Repeat until inflate finishes with space left in the output buffer. That indicates all output for the data in the in buffer has been produced, rather than inflate having run out of space in the out buffer. So it is time to read some more data from the input file.
L: As long as inflate has not reported Z_STREAM_END, go round again.
Now, obviously, if you are reading from a network, and uncompressing into memory, you need to replace the fread and fwrite with some suitable read from network and memcpy type calls instead. I can't tell you EXACTLY what those are, since you haven't provided anything to explain where your data comes from - are you calling recv or read or WSARecv, or something else? - and where is it going to?

Sending Binary files over a socket. Text files work, nothing else does?

I can send very large text files with no problem. When I try to send a JPG or similar, it won't work, although the received file is the correct size. I don't know what I'm missing. I checked my read and write functions by writing the data to a temp.foo file before sending; that handles anything.
I send it like this
for(vector< .... >::iterator it = v.begin(); it!=v.end(); ++it ){
    pair<...> p=*it;
    send(s,p.first,p.second,0);
}
Then the other program reads it
for(i = 0; i < size; i+=max){
    b= 0;
    while (b== 0) {
        if ((b = recv(s, buf, max, 0)) == -1) {
            perror("recv");
            exit(1);
        }
    }
    stringstream ss;
    ss << buf;
    char * out = (char*)malloc(b);
    memcpy(out,buff,numbytes);// Perhaps my error is here?
}
// write function call here
Some general points about handling binary data:
Make sure you open the input and output files in binary mode, such as with the ios::binary flag or the "rb" or "wb" format. The default is text mode, which will mangle end-of-line characters in a binary file.
Binary files can have NUL bytes (\0), which means you can't use string-handling functions that work on NUL-terminated strings. C strings are not NUL safe. For instance, this code won't fill ss properly since it interprets buf as a NUL-terminated string:
stringstream ss;
ss << buf;
Also, on the line you point out, is buff with two fs a typo? On the other lines you reference buf with one f.
memcpy(out,buff,numbytes);// Perhaps my error is here?
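A minimal sketch of a receive loop that never treats the data as a string, assuming a POSIX socket API and that the receiver knows the total size in advance, as the loop in the question appears to. recvAll is a hypothetical helper name. It also handles recv() returning 0, which means the peer closed the connection; the while (b == 0) loop in the question would spin forever in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

// Reads exactly `size` bytes from socket `s` into a byte vector.
// Returns false if the peer closes the connection or recv() fails first.
bool recvAll(int s, std::size_t size, std::vector<char>& out)
{
    out.resize(size);
    std::size_t received = 0;
    while (received < size) {
        // recv() may deliver fewer bytes than requested, so keep asking
        // for the remainder and append at the current offset.
        ssize_t n = recv(s, out.data() + received, size - received, 0);
        if (n == 0)   // orderly shutdown by the peer: no more data will come
            return false;
        if (n < 0)    // error; errno holds the reason
            return false;
        received += static_cast<std::size_t>(n);
    }
    return true;
}
```

The byte count returned by recv() is what drives the copy here, so zero bytes inside a JPG are carried through like any other value. Retrying on EINTR is left out to keep the sketch short.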

Read/write operation works neither good nor bad

I am programming a face detection algorithm. In my code I'm parsing an XML file (recursively, which is very inefficient; it takes about 4 minutes to parse the whole file). I'd like to save the parsed XML content to a file in binary form using iostream. I'm using a struct in C++ to hold the raw data.
My goal is to parse the XML only if the raw data file does not exist.
The method works like this:
If the raw data file does not exist, parse the XML file and save the data to a file.
If the raw data file exists, read the raw data from the file.
My problem is: whenever I open the raw data file and read from it, I only get a small number of bytes from the file. I don't know how many, but from a certain point on I receive only 0x00 data in my buffer.
My guess: I believe this has to do with the OS buffer, which has a limited size for read and write operations. I might be wrong about this. I'm also not sure which of the operations doesn't work well, the write or the read.
I was thinking of writing/reading the raw data char by char or line by line. On the other hand, the file doesn't contain text, which means that I can't read it line by line or char by char.
The raw data size is
size_t datasize = DataSize(); // == 196876 bytes
which is retrieved in this function:
/* Get the upper bound for predefined cascade size */
size_t CCacadeInterpreter::DataSize()
{
    // this is an upper boundary for the whole hidden cascade size
    size_t datasize = sizeof(HaarClassifierCascade) * TOTAL_CASCADE+
        sizeof(HaarStageClassifier)*TOTAL_STAGES +
        sizeof(HaarClassifier) * TOTAL_CLASSIFIERS +
        sizeof(void*)*(TOTAL_CASCADE+TOTAL_STAGES+TOTAL_CLASSIFIERS);
    return datasize;
}
The method works like this
BYTE * CCacadeInterpreter::Interpreter()
{
    printf("|Phase - Load cascade from memory | CCacadeInterpreter::Interpreter | \n");
    size_t datasize = DataSize();
    // Create a memory structure
    nextFreeSpace = pStartMemoryLocation = new BYTE [datasize];
    memset(nextFreeSpace,0x00,datasize);
    // Try to open a predefined cascade file on the current folder (instead of parsing the file again)
    fstream stream;
    stream.open(cascadeSavePath); // ...try existing file
    if (stream.is_open())
    {
        stream.seekg(0,ios::beg);
        stream.read((char*)pStartMemoryLocation , datasize); // read from file
        stream.close();
        printf("|Load cascade from saved memory location | CCacadeInterpreter::Interpreter | \n");
        printf("Completed\n\n");
        return pStartMemoryLocation;
    }
    // Open the cascade file and parse the cascade xml file
    std::fstream cascadeFile;
    cascadeFile.open(cascadeDestanationPath, std::fstream::in); // open the file with read only attributes
    if (!cascadeFile.is_open())
    {
        printf("Error: couldn't open cascade XML file\n");
        delete[] pStartMemoryLocation; // allocated with new[], so release with delete[]
        return NULL;
    }
    // Read the file XML file , line by line
    string buffer, str;
    getline(cascadeFile,str);
    while(cascadeFile)
    {
        buffer+=str;
        getline(cascadeFile,str);
    }
    cascadeFile.close();
    split(buffer, '<',m_tokens);
    // Parsing begins
    pHaarClassifierCascade = (HaarClassifierCascade*)nextFreeSpace;
    nextFreeSpace += sizeof(HaarClassifierCascade);
    pHaarClassifierCascade->count=0;
    pHaarClassifierCascade->orig_window_size_height=20;
    pHaarClassifierCascade->orig_window_size_width=20;
    m_deptInTree=0;
    m_numOfStage = 0;
    m_numOfTotalClassifiers=0;
    while (m_tokens.size())
    {
        Parsing();
    }
    // Save the current cascade into a file
    SaveBlockToMemory(pStartMemoryLocation,datasize);
    printf("\nCompleted\n\n");
    return pStartMemoryLocation;
}
bool CCacadeInterpreter::SaveBlockToMemory(BYTE * pStartMemoryLocation,size_t dataSize)
{
    fstream stream;
    if (stream.is_open() )
        stream.close();
    stream.open(cascadeSavePath); // ...try existing file
    if (!stream.is_open()) // ...else, create new file...
        stream.open(cascadeSavePath, ios_base::in | ios_base::out | ios_base::trunc);
    stream.seekg(0,ios::beg);
    stream.write((char*)pStartMemoryLocation,dataSize);
    stream.close();
    return true;
}
Try using the Boost Iostreams library.
It has easy-to-use wrappers for file handling.
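Independently of which library is used, one thing that may be worth ruling out is text-mode translation: the streams in the question are opened without ios::binary, and on some platforms (Windows in particular) text mode rewrites line-ending bytes and can treat a 0x1A byte as end of file. The sketch below is an assumption about a possible fix, not a confirmed diagnosis; saveBlock and loadBlock are hypothetical helper names, and only the standard library is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

// Writes the block to `path` in binary mode, replacing any existing file.
bool saveBlock(const char* path, const std::vector<unsigned char>& block)
{
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file)
        return false;
    file.write(reinterpret_cast<const char*>(block.data()),
               static_cast<std::streamsize>(block.size()));
    file.flush(); // surface write errors before reporting success
    return file.good();
}

// Reads exactly `size` bytes from `path` in binary mode.
// Returns false if the file is missing or shorter than expected.
bool loadBlock(const char* path, std::size_t size, std::vector<unsigned char>& block)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
        return false;
    block.resize(size);
    file.read(reinterpret_cast<char*>(block.data()),
              static_cast<std::streamsize>(size));
    // gcount() reports how many bytes the last read actually delivered.
    return static_cast<std::size_t>(file.gcount()) == size;
}
```

Comparing gcount() with the expected size also makes a short read visible right away, instead of leaving the rest of a zero-filled buffer untouched.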