How to use ZLIB deflate method? - compression

I am trying to use zlib to compress a text file. It seems to kinda work, except I'm pretty sure my calculation of the number of bytes to write to the output is wrong. My code (guided by http://zlib.net/zlib_how.html) is below:
int
deflateFile(
    char *infile,
    char *outfile)
{
#define CHUNKSIZE 1000
    int n, nr, nw, towrite;
    z_stream strm;
    FILE *fin, *fout;
    BYTE *inbuf, *outbuf;
    int ntot = 0;

    printf( "Start doDeflateFile:\n" );

    // ALLOC BUFFERS
    inbuf = malloc( CHUNKSIZE+1 );
    outbuf = malloc( CHUNKSIZE+1 );

    // OPEN FILES
    fin = fopen( infile, "rb" );
    fout = fopen( outfile, "wb" );

    // SETUP Z STREAM
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_in = CHUNKSIZE;                  // size of input
    strm.next_in = inbuf;                       // input buffer
    strm.avail_out = CHUNKSIZE;                 // size of output
    strm.next_out = outbuf;                     // output buffer

    deflateInit( &strm, Z_BEST_COMPRESSION );   // init stream level

    while( TRUE ) {                             // loop til EOF on input file
        // READ NEXT INPUT CHUNK
        nr = fread( inbuf, 1, CHUNKSIZE, fin );
        if( nr <= 0 ) {
            printf( "End of input\n" );
            break;
        }
        printf( "\nread chunk of %6d bytes\n", nr );
        printf( "calling deflate...\n" );
        n = deflate(&strm, Z_FINISH);           // call ZLIB deflate
        towrite = CHUNKSIZE - strm.avail_out;   // calc # bytes to write (FIXME???)
        printf( "#bytes to write %6d bytes\n", towrite );
        nw = fwrite( outbuf, 1, towrite, fout );
        if( nw != towrite ) break;
        printf( "wrote chunk of %6d bytes\n", nw );
        ntot += nw;
    }

    deflateEnd(&strm);                          // end deflate

    printf( "wrote total of %d bytes\n", ntot );
    printf( "End deflateFile.\n" );
    return( 0 );
}
The output for a 1010-byte input file with a CHUNKSIZE of 1000 is:
Start deflateFile:
read chunk of 1000 bytes
calling deflate...
#bytes to write 200 bytes
wrote chunk of 200 bytes
read chunk of 10 bytes
calling deflate...
#bytes to write 200 bytes
wrote chunk of 200 bytes
End of input
wrote total of 400 bytes
End deflateFile.
SO #4538586 sort of addressed this, but not quite, and it's very old.
Can anybody point out my problem?

You should read that page again. Much more carefully this time.
You are not setting avail_in properly at the start, and you are not resetting next_in, avail_in, next_out, and avail_out in the loop. The only thing you are doing correctly is the thing you think is wrong, which is the calculation of how many bytes to write out. What you have will not even "kinda work".
First off, avail_in must always be set to the amount of available input at next_in. Hence the name avail_in. You are setting it to CHUNKSIZE and calling deflateInit(), even though there is no available input in that buffer yet.
Then after you read data into the input buffer, you ignore nr! You need to set avail_in to nr, to indicate how much data is actually in the buffer. It might be less than CHUNKSIZE.
You should read data into the input buffer only if you have processed all of the data that was there from the last read, indicated by avail_in being zero.
When a call of deflate() completes inside the loop, it has updated next_in, avail_in, next_out, and avail_out. To use the inbuf and outbuf buffers again, you need to reset next_in, next_out, and avail_out to the values you set initially. avail_in will be set at the top of the loop from nr.
You are calling deflate() with Z_FINISH every time! The way this works is that you call deflate() with Z_NO_FLUSH until the last of the input is provided, and then you use Z_FINISH, to let it know to finish. (That's why it's called that.)
Your loop will exit prematurely, since you need to finish compressing and writing the output, not just finish reading the input.
You are not checking the return code of deflate(). Always check return codes. That's why they're there.
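Putting those points together, a minimal sketch of what the loop should look like, modeled directly on the zpipe.c example (untested here, with error handling kept to a bare minimum for brevity):

/* sketch only -- assumes <stdio.h> and <zlib.h>, and that the files open successfully */
int deflateFile( const char *infile, const char *outfile )
{
    enum { CHUNKSIZE = 16384 };
    unsigned char inbuf[CHUNKSIZE], outbuf[CHUNKSIZE];
    FILE *fin  = fopen( infile, "rb" );
    FILE *fout = fopen( outfile, "wb" );

    z_stream strm;
    strm.zalloc = Z_NULL;
    strm.zfree  = Z_NULL;
    strm.opaque = Z_NULL;
    if ( deflateInit( &strm, Z_BEST_COMPRESSION ) != Z_OK )
        return -1;

    int flush;
    do {
        /* avail_in is set from what fread actually returned, not from CHUNKSIZE */
        strm.avail_in = (uInt)fread( inbuf, 1, CHUNKSIZE, fin );
        strm.next_in  = inbuf;
        /* only the last chunk is deflated with Z_FINISH */
        flush = feof( fin ) ? Z_FINISH : Z_NO_FLUSH;

        /* keep calling deflate() until it has consumed this chunk of input */
        do {
            strm.avail_out = CHUNKSIZE;     /* reset the output window on every call */
            strm.next_out  = outbuf;
            int ret = deflate( &strm, flush );
            if ( ret == Z_STREAM_ERROR )    /* always check return codes */
                goto done;
            size_t towrite = CHUNKSIZE - strm.avail_out;
            if ( fwrite( outbuf, 1, towrite, fout ) != towrite )
                goto done;
        } while ( strm.avail_out == 0 );
    } while ( flush != Z_FINISH );          /* don't stop until the output is finished */

done:
    deflateEnd( &strm );
    fclose( fin );
    fclose( fout );
    return 0;
}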
Good luck.

Related

ifstream does not completely read whole data

I'm trying to read a file block by block. The block size is 64 bytes, but some bytes are left over.
Example: I have a 360-byte file and read the data in 64-byte blocks, so I need six 64-byte blocks to get all the data.
typedef unsigned char uint1;

ifstream is(path.c_str(), ifstream::in | ifstream::binary);
uint1 data[64];
int i = 0;
while (is.read((char*)data, 64)) {
    i++;
}
cout << i << endl;
But I only get five completely filled 64-byte blocks. How do I get the remaining bytes?
I suppose the problem is that your file size is not divisible by the buffer size, so the last chunk's size is less than 64 (360 - 64 * 5 = 40 bytes). For this case, the documentation for istream::read says:
If the input sequence runs out of characters to extract (i.e., the end-of-file is reached) before n characters have been successfully read, the array pointed to by s contains all the characters read until that point, and both the eofbit and failbit flags are set for the stream.
So the last is.read's return value evaluates to false, and that final block is not counted by your counter.
360 is not divisible by 64, which means that the last block will not be read in its entirety. Consulting suitable documentation shows that reading such an incomplete block sets both eofbit and failbit on the stream from which you're reading, which means the condition in your while loop will evaluate to false for the last block. But the read did actually happen and the data is stored correctly.
You might want to check the value of gcount() after the last read:
while (is.read((char*)data, 64)) {
    i++;
}
if (is.gcount() > 0) {
    i++;
}
If your aim is actually to read the file, and not simply to count the number of blocks as your sample does, then you probably want something like this:
std::ifstream is( path.c_str(), std::ios_base::binary );   // in mode is always set for ifstream
if( !is )
    throw std::runtime_error( "unable to open file '" + path + "'" );
while( !is.eof() )
{
    std::array< char, 64 > buf;
    is.peek();   // needs this because of the buffering.
    const auto n = is.readsome( buf.data(), buf.size() );
    if( is )
        handle_block( buf, n );   // std::cout.write( buf.data(), n )
    else
        throw std::runtime_error( "error reading file '" + path + "'" );
}
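If you just need every block, including the short one at the end, a plain read()/gcount() loop is another option. A small self-contained sketch ("data.bin" is only a placeholder file name):

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream is("data.bin", std::ios_base::binary);   // placeholder path
    char buf[64];
    int blocks = 0;
    while (is.read(buf, sizeof buf) || is.gcount() > 0) {  // gcount() covers the short final block
        std::streamsize n = is.gcount();                   // bytes actually read this time (1..64)
        std::cout.write(buf, n);                           // process the block; here we just echo it
        ++blocks;
    }
    std::cout << "\nread " << blocks << " blocks\n";
}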

zlib Z_BUF_ERROR with specific file and specific buffer sizes

I'm developing some code that needs to be capable of unzipping large gzip'd files (up to 5GB uncompressed) and reading them into memory. I would prefer to be clean about this and not simply unzip them to disk temporarily so I've been working with zlib to try to accomplish this. I've got it running, most of the way. Meaning it runs for 4 of the 5 files I've used as input. The other file gives a Z_BUF_ERROR right in the middle of processing and I'd prefer not to ignore it.
This initially happened in different code but eventually I brought it all the way back to the example code that I got from zpipe.c on the zlib web page, and no matter what code I used, it resulted in the same Z_BUF_ERROR and only with this file. I played with the code for quite a while after reading several posts about Z_BUF_ERROR and after reading the manual on this as well. Eventually I was able to find a way to make it work by changing the size of the buffer used to hold the inflated output. Normally at this point I'd call it a day until it reported an error with another file, but ideally this will be production level code at some point and I'd like to understand what the error is so I can prevent it rather than just fix it for now. Especially since gzip is able to compress and decompress the file just fine.
I've tried this with the following variations:
different platforms: CentOS, OSX
different versions of zlib: 1.2.3, 1.2.8 (same results)
values of CHUNK and the number of bytes output (complete is 783049330):
2000000: 783049330
1048576: 783049330
1000000: 783049330
100000: 783049330
30000: 248421347
25000: 31095404
20000: 783049330
19000: 155821787
18000: 412613687
17000: 55799133
16384: 37541674
16000: 783049330
any CHUNK size greater than 4100000 gives an error
tried declaring out with a value greater than CHUNK (same results)
tried using malloc to declare out (same results)
tried using gzip to uncompress and then compress the file again thinking something may have been off in the gzip metadata (same results)
tried compressing a separate uncompressed version of the file using gzip for the same purpose, but I believe the original .gz file was created from this one (same results)
I may have tried a few things outside of this list as I've been trying to get to the bottom of it for a while, but only changing the CHUNK size will make this work. My only concern is that I don't know why a different size will work and I'm worried that another CHUNK size will put other files at risk for this issue, because again, this is only an issue for one file.
CODE:
FILE* fp = fopen( argv[1], "rb" );
int ret = inf( fp, stdout );
fclose( fp );

int inf(FILE *source, FILE *dest)
{
    size_t CHUNK = 100000;
    int count = 0;
    int ret;
    unsigned have;
    z_stream strm;
    unsigned char in[CHUNK];
    unsigned char out[CHUNK];
    char out_str[CHUNK];

    /* allocate inflate state */
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_in = 0;
    strm.next_in = Z_NULL;
    ret = inflateInit2(&strm, 16+MAX_WBITS);
    if (ret != Z_OK)
        return ret;

    /* decompress until deflate stream ends or end of file */
    do {
        strm.avail_in = fread(in, 1, CHUNK, source);
        if (ferror(source)) {
            (void)inflateEnd(&strm);
            return Z_ERRNO;
        }
        if (strm.avail_in == 0)
            break;
        strm.next_in = in;

        /* run inflate() on input until output buffer not full */
        do {
            strm.avail_out = CHUNK;
            strm.next_out = out;
            ret = inflate(&strm, Z_NO_FLUSH);
            switch (ret) {
            case Z_NEED_DICT:
                ret = Z_DATA_ERROR;     /* and fall through */
            case Z_DATA_ERROR:
            case Z_MEM_ERROR:
                (void)inflateEnd(&strm);
                return ret;
            }
            have = CHUNK - strm.avail_out;
            char out_str[have+1];
            strncpy( out_str, (char*)out, have );
            out_str[have] = '\0';
            // testing the ability to store the result in a string object and viewing the output
            std::cout << "out_str: " << std::string(out_str) << " ::" << std::endl;
            if( ret == Z_BUF_ERROR ){
                std::cout << "Z_BUF_ERROR!" << std::endl;
                exit(1);
            }
        } while (strm.avail_out == 0);
        /* done when inflate() says it's done */
    } while (ret != Z_STREAM_END);

    /* clean up and return */
    (void)inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}
You should read the commentary where you got that code from. Z_BUF_ERROR is just an indication that there was nothing for inflate() to do on that call. Simply continue and provide more input data and more output space for the next inflate() call.
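In your loop, that could be as simple as treating Z_BUF_ERROR as "keep going" instead of exiting. A sketch of just the return-code handling:

ret = inflate(&strm, Z_NO_FLUSH);
switch (ret) {
case Z_NEED_DICT:
    ret = Z_DATA_ERROR;         /* and fall through */
case Z_DATA_ERROR:
case Z_MEM_ERROR:
    (void)inflateEnd(&strm);
    return ret;
case Z_BUF_ERROR:
    /* not an error: inflate() had nothing to do on this call; just let the
       loops run again so it gets more input and/or fresh output space */
    break;
}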

Raspberry Pi C++ Read NMEA Sentences from Adafruit's Ultimate GPS Module

I'm trying to read the GPS NMEA sentences from Adafruit's Ultimate GPS module. I'm using C++ on the Raspberry Pi to read the serial port connection to the module.
Here is my read function:
int Linuxutils::readFromSerialPort(int fd, int bufferSize) {
    /*
    Reading data from a port is a little trickier. When you operate the port in raw data mode,
    each read(2) system call will return however many characters are actually available in the
    serial input buffers. If no characters are available, the call will block (wait) until
    characters come in, an interval timer expires, or an error occurs. The read function can be
    made to return immediately by doing the following:
        fcntl(fd, F_SETFL, FNDELAY);
    The NDELAY option causes the read function to return 0 if no characters are available on the port.
    */

    // Check the file descriptor
    if ( !checkFileDecriptorIsValid(fd) ) {
        fprintf(stderr, "Could not read from serial port - it is not a valid file descriptor!\n");
        return -1;
    }

    // Now, let's wait for an input from the serial port.
    fcntl(fd, F_SETFL, 0); // block until data comes in

    // Now read the data
    int absoluteMax = bufferSize*2;
    char *buffer = (char*) malloc(sizeof(char) * bufferSize); // allocate buffer.
    int rcount = 0;
    int length = 0;

    // Read in each newline
    FILE* fdF = fdopen(fd, "r");
    int ch = getc(fdF);
    while ( (ch != '\n') ) { // Check for end of file or newline
        // Reached end of file
        if ( ch == EOF ) {
            printf("ERROR: EOF!");
            continue;
        }
        // Expand by reallocating if necessary
        if( rcount == absoluteMax ) { // time to expand ?
            absoluteMax *= 2; // expand to double the current size of anything similar.
            rcount = 0; // Re-init count
            buffer = (char*)realloc(buffer, absoluteMax); // Re-allocate memory.
        }
        // Read from stream
        ch = getc(fdF);
        // Stuff in buffer
        buffer[length] = ch;
        // Increment counters
        length++;
        rcount++;
    }

    // Don't care if we return 0 chars read
    if ( rcount == 0 ) {
        return 0;
    }

    // Stick
    buffer[rcount] = '\0';

    // Print results
    printf("Received ( %d bytes ): %s\n", rcount, buffer);

    // Return bytes read
    return rcount;
}
So I kind of get the sentences, as you can see below; the problem is that I get these "repeated" portions of a complete sentence, like this:
Received ( 15 bytes ): M,-31.4,M,,*61
Here is the complete thing:
Received ( 72 bytes ): GPGGA,182452.000,4456.2019,N,09337.0243,W,1,8,1.19,292.6,M,-31.4,M,,*61
Received ( 56 bytes ): GPGSA,A,3,17,07,28,26,08,11,01,09,,,,,1.49,1.19,0.91*00
Received ( 15 bytes ): M,-31.4,M,,*61
Received ( 72 bytes ): GPGGA,182453.000,4456.2019,N,09337.0242,W,1,8,1.19,292.6,M,-31.4,M,,*61
Received ( 56 bytes ): GPGSA,A,3,17,07,28,26,08,11,01,09,,,,,1.49,1.19,0.91*00
Received ( 15 bytes ): M,-31.4,M,,*61
Received ( 72 bytes ): GPGGA,182456.000,4456.2022,N,09337.0241,W,1,8,1.21,292.6,M,-31.4,M,,*64
Received ( 56 bytes ): GPGSA,A,3,17,07,28,26,08,11,01,09,,,,,2.45,1.21,2.13*0C
Received ( 70 bytes ): GPRMC,182456.000,A,4456.2022,N,09337.0241,W,0.40,183.74,110813,,,A*7F
Received ( 37 bytes ): GPVTG,183.74,T,,M,0.40,N,0.73,K,A*34
Received ( 70 bytes ): GPRMC,182453.000,A,4456.2019,N,09337.0242,W,0.29,183.74,110813,,,A*7E
Received ( 37 bytes ): GPVTG,183.74,T,,M,0.29,N,0.55,K,A*3F
Received ( 32 bytes ): 242,W,0.29,183.74,110813,,,A*7E
Received ( 70 bytes ): GPRMC,182452.000,A,4456.2019,N,09337.0243,W,0.33,183.74,110813,,,A*75
Why am I getting the repeated sentences and how can I fix it? I tried flushing the serial port buffers but then things became really ugly! Thanks.
I'm not sure I understand your exact problem. There are a few problems with the function though which might explain a variety of errors.
The lines
int absoluteMax = bufferSize*2;
char *buffer = (char*) malloc(sizeof(char) * bufferSize); // allocate buffer.
seem wrong. You'll decide when to grow the buffer by comparing the number of characters read to absoluteMax, so this needs to match the size of the buffer allocated. You're currently writing beyond the end of allocated memory before you reallocate, which results in undefined behaviour. If you're lucky, your app will crash; if you're unlucky, things will appear to work but you'll lose the second half of the data you've read, since only the data written to memory you own will be moved by realloc (if it relocates your heap cell).
Also note that you can rely on sizeof(char) being 1, so the multiplication is redundant. (In C you also wouldn't cast the return from malloc or realloc; in C++, which this appears to be, the cast is required.)
You lose the first character read (the one that is read just before the while loop). Is this deliberate?
When you reallocate buffer, you shouldn't reset rcount. This causes the same bug as above where you'll write beyond the end of buffer before reallocating again. Again, the effects of doing this are undefined but could include losing portions of output.
Not related to the bug you're currently concerned with but also worth noting is the fact that you leak buffer and fdF. You should free and fclose them respectively before exiting the function.
The following (untested) version ought to fix these issues
int Linuxutils::readFromSerialPort(int fd, int bufferSize)
{
    if ( !checkFileDecriptorIsValid(fd) ) {
        fprintf(stderr, "Could not read from serial port - it is not a valid file descriptor!\n");
        return -1;
    }

    fcntl(fd, F_SETFL, 0); // block until data comes in

    int absoluteMax = bufferSize;
    char *buffer = (char*)malloc(bufferSize);
    int length = 0;

    // Read up to the next newline
    FILE* fdF = fdopen(fd, "r");
    for (;;) {
        int ch = getc(fdF);
        if (ch == '\n') {
            break;
        }
        if (ch == EOF) { // Reached end of file
            printf("ERROR: EOF!\n");
            break;
        }
        if (length+1 >= absoluteMax) { // grow, leaving room for the terminating '\0'
            absoluteMax *= 2;
            char* tmp = (char*)realloc(buffer, absoluteMax);
            if (tmp == NULL) {
                printf("ERROR: OOM\n");
                goto cleanup;
            }
            buffer = tmp;
        }
        buffer[length++] = ch;
    }

    if (length == 0) { // nothing read; clean up and return 0
        goto cleanup;
    }

    buffer[length] = '\0';

    // Print results
    printf("Received ( %d bytes ): %s\n", length, buffer);

cleanup:
    free(buffer);
    fclose(fdF);
    return length;
}
Maybe you could try to flush the serial port buffers before reading from it, as shown in this link?
I would also consider not reopening the serial port every time you call Linuxutils::readFromSerialPort - you could keep the file descriptor open for further reading (anyway the call is blocking so from the caller's point of view nothing changes).
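For the flushing part, termios provides tcflush() on POSIX systems. A sketch with a hypothetical helper that discards whatever is sitting in the input queue once, right after the port is opened:

#include <termios.h>   // tcflush, TCIFLUSH

// Discard any bytes already received but not yet read, so the first
// sentence parsed afterwards starts at a clean boundary.
void discardPendingInput(int fd)
{
    tcflush(fd, TCIFLUSH);
}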

C++ Inflate gzip char array

I'm attempting to use zlib to uncompress (inflate) some IP packet payload data that is compressed via gzip. However, I'm having some difficulty understanding some of the documentation provided by zlib that covers inflation. I have a char array that my program fills, but I can't seem to inflate it with the following code:
const u_char *payload;  /* contains gzip data,
                           captured prior to this point in the program */

/* read compressed contents */
int ret;                        // return val
z_stream stream;
unsigned char out[MEM_CHUNK];   // output array, MEM_CHUNK defined as 65535

/* allocate inflate state */
stream.zalloc = Z_NULL;
stream.zfree = Z_NULL;
stream.opaque = Z_NULL;
stream.avail_in = size_payload;         // size of input
stream.next_in = (Bytef *)payload;      // input char array
stream.avail_out = (uInt)sizeof(out);   // size of output
stream.next_out = (Bytef *)out;         // output char array

ret = inflateInit(&stream);
inflate(&stream, Z_NO_FLUSH);
inflateEnd(&stream);

printf("Inflate: %s\n\n", out);
In the zlib documentation, they have inflate continually called via a do/while loop, checking for the Z_STREAM_END flag. I'm a bit confused here, because it seems they're working from a file while I'm not. Do I need this loop as well, or am I able to provide a char array without looping over inflate?
Any guidance here would really be appreciated. I'm pretty new to both working with compression and C++.
Thanks.
Assuming you are giving inflate an appropriate and complete "compressed stream", and there is enough space to output the data, you would only need to call inflate once.
Edit: It is not written out as clearly as that in the zlib documentation, but it does say:
inflate decompresses as much data as possible, and stops when the input buffer becomes empty or the output buffer becomes full. It may introduce some output latency (reading input without producing any output) except when forced to flush.
Of course, for any stream that isn't already "in memory and complete", you want to run it block by block, since that's going to have less total runtime (you can decompress while the data is being received [from network or filesystem pre-fetch caching] for the next block).
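For the simple case in the question, though - the whole payload already captured in memory - a single-shot version could look roughly like this (a sketch only, reusing the names from the snippet above; note that gzip-wrapped data needs inflateInit2() with 16+MAX_WBITS, since plain inflateInit() expects a zlib wrapper):

z_stream stream;
stream.zalloc = Z_NULL;
stream.zfree  = Z_NULL;
stream.opaque = Z_NULL;
stream.next_in   = (Bytef *)payload;      // the captured gzip data
stream.avail_in  = size_payload;          // its length in bytes
stream.next_out  = out;                   // output buffer (MEM_CHUNK bytes)
stream.avail_out = (uInt)sizeof(out);

int ret = inflateInit2(&stream, 16 + MAX_WBITS);   // 16+ selects the gzip wrapper
if (ret == Z_OK)
    ret = inflate(&stream, Z_FINISH);     // expect Z_STREAM_END if everything fit in 'out'

size_t produced = sizeof(out) - stream.avail_out;  // bytes of uncompressed data now in 'out'
inflateEnd(&stream);
printf("Inflate (%zu bytes, ret=%d): %.*s\n", produced, ret, (int)produced, (const char *)out);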
Here's the whole function from your example code. I've removed the text components from the page to concentrate on the code, marked sections with letters (// A, // B, etc.), and tried to explain the sections below.
int inf(FILE *source, FILE *dest)
{
    int ret;
    unsigned have;
    z_stream strm;
    unsigned char in[CHUNK];                      // A
    unsigned char out[CHUNK];

    /* allocate inflate state */
    strm.zalloc = Z_NULL;                         // B
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_in = 0;
    strm.next_in = Z_NULL;
    ret = inflateInit(&strm);                     // C
    if (ret != Z_OK)
        return ret;

    /* decompress until deflate stream ends or end of file */
    do {
        strm.avail_in = fread(in, 1, CHUNK, source);   // D
        if (ferror(source)) {
            (void)inflateEnd(&strm);              // E
            return Z_ERRNO;
        }
        if (strm.avail_in == 0)                   // F
            break;
        strm.next_in = in;                        // G

        /* run inflate() on input until output buffer not full */
        do {
            strm.avail_out = CHUNK;               // H
            strm.next_out = out;
            ret = inflate(&strm, Z_NO_FLUSH);     // I
            assert(ret != Z_STREAM_ERROR);        /* state not clobbered */
            switch (ret) {
            case Z_NEED_DICT:
                ret = Z_DATA_ERROR;               /* and fall through */
            case Z_DATA_ERROR:
            case Z_MEM_ERROR:
                (void)inflateEnd(&strm);
                return ret;
            }
            have = CHUNK - strm.avail_out;        // J
            if (fwrite(out, 1, have, dest) != have || ferror(dest)) {
                (void)inflateEnd(&strm);
                return Z_ERRNO;
            }
        } while (strm.avail_out == 0);            // K

        /* done when inflate() says it's done */
    } while (ret != Z_STREAM_END);                // L

    /* clean up and return */
    (void)inflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_DATA_ERROR;
}
A: in is the input buffer (we read from a file into this buffer, then pass it to inflate a little later). out is the output buffer, which inflate uses to store the output data.
B: Set up a z_stream object called strm. This holds various fields, most of which are not important here (thus set to Z_NULL). The important ones are the avail_in and next_in as well as avail_out and next_out (which are set later).
C: Start inflation process. This sets up some internal data structures and just makes the inflate function itself "ready to run".
D: Read a "CHUNK" amount of data from file. Store the number of bytes read in strm.avail_in, and the actual data goes into in.
E: If we errored out, finish the inflate by calling inflateEnd. Job done.
F: No data available, we're finished.
G: Set where our data is coming from (next_in is set to the input buffer, in).
H: We're now in the loop to inflate things. Here we set the output buffer up: next_out and avail_out indicate where the output goes and how much space there is, respectively.
I: Call inflate itself. This will uncompress a portion of the input buffer, stopping when the output buffer fills up or the input runs out.
J: Calculate how much data is available in this step (have is the number of bytes).
K: Loop until inflate finishes with space left in the output buffer - that indicates it has consumed everything in the in buffer, rather than run out of room in out, so it's time to read some more data from the input file.
L: If inflate hasn't reported Z_STREAM_END yet (the end of the compressed stream), go round again.
Now, obviously, if you are reading from a network, and uncompressing into memory, you need to replace the fread and fwrite with some suitable read from network and memcpy type calls instead. I can't tell you EXACTLY what those are, since you haven't provided anything to explain where your data comes from - are you calling recv or read or WSARecv, or something else? - and where is it going to?
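For the output side, for instance, the fwrite() call can become an append to an in-memory buffer. A sketch assuming you collect the result in a std::vector<unsigned char> declared before the loops:

std::vector<unsigned char> uncompressed;    // collects the whole inflated payload

/* ... inside the inner loop, where the example calls fwrite(): */
have = CHUNK - strm.avail_out;
uncompressed.insert(uncompressed.end(), out, out + have);   // append this chunk to the result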

zlib stops on buffer expansion

When I attempt to decompress data of a size greater than 2048, the zlib uncompress call stops early but still returns Z_OK. So to clarify: if I decompress data of size 2980, it will decompress up to 2048 (two loops) and then return Z_OK.
What am I missing?
Bytes is a vector< unsigned char >;
Bytes uncompressIt( const Bytes& data )
{
    size_t buffer_length = 1024;
    Byte* buffer = nullptr;
    int status = 0;
    do
    {
        buffer = ( Byte* ) calloc( buffer_length + 1, sizeof( Byte ) );
        int status = uncompress( buffer, &buffer_length, &data[ 0 ], data.size( ) );
        if ( status == Z_OK )
        {
            break;
        }
        else if ( status == Z_MEM_ERROR )
        {
            throw runtime_error( "GZip decompress ran out of memory." );
        }
        else if ( status == Z_DATA_ERROR )
        {
            throw runtime_error( "GZip decompress input data was corrupted or incomplete." );
        }
        else //if ( status == Z_BUF_ERROR )
        {
            free( buffer );
            buffer_length *= 2;
        }
    } while ( status == Z_BUF_ERROR ); //then the output buffer wasn't large enough

    Bytes result;
    for( size_t index = 0; index != buffer_length; index++ )
    {
        result.push_back( buffer[ index ] );
    }
    return result;
}
EDIT:
Thanks @Michael for catching the realloc issue. I've been mucking around with the implementation and missed it; still, that's no excuse for not catching it before posting.
I got it: int status is defined both inside and outside of the loop. The lesson here is never drink & develop.
From the zlib manual: "In the case where there is not enough room, uncompress() will fill the output buffer with the uncompressed data up to that point."
I.e., up to 1024 bytes have already been uncompressed; then you get Z_BUF_ERROR and double the buffer size, giving you room for 2048 bytes, and once you've uncompressed the second time you've got a total of up to 3072 bytes of uncompressed data.
Also, it looks like you're unnecessarily doing a calloc right after realloc when you get Z_BUF_ERROR.
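Putting those two points together, a corrected version might look something like this (a sketch, not tested against your data; it assumes the same Bytes typedef, zlib headers, and using namespace std as your original):

Bytes uncompressIt( const Bytes& data )
{
    size_t buffer_size = 1024;
    Bytes buffer;
    int status = Z_BUF_ERROR;
    while ( status == Z_BUF_ERROR )     // one status variable, visible to the loop condition
    {
        buffer.resize( buffer_size );
        uLongf dest_len = buffer_size;  // in/out: capacity going in, bytes written coming out
        status = uncompress( buffer.data(), &dest_len, data.data(), data.size() );
        if ( status == Z_OK )
        {
            buffer.resize( dest_len );  // keep only the bytes actually produced
            return buffer;
        }
        if ( status == Z_MEM_ERROR )
            throw runtime_error( "zlib decompress ran out of memory." );
        if ( status == Z_DATA_ERROR )
            throw runtime_error( "zlib decompress input data was corrupted or incomplete." );
        buffer_size *= 2;               // Z_BUF_ERROR: the output buffer wasn't large enough
    }
    return buffer;                      // not reached
}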
I find nothing apparent that is wrong with your code. You may be mis-predicting the length of your uncompressed data. uncompress() will only return Z_OK if it has decompressed a complete zlib stream and the check value of the uncompressed data matched the check value at the end of the stream.