ZLib GZIP Returning Z_BUF_ERROR(-5) - c++

I am using the zlib library (compiled from src) to deflate/inflate gzip/zlib/raw bytes. I have created a wrapper class for decompressing and compressing (Compressor/Decompresser). I have also created several test cases (GZIP, ZLib, Raw, Auto-Detect). The tests pass for Zlib/Raw/Auto-Detect(Zlib), but not for GZip (window bits of 15u | 16u).
Here is my compress function.
std::vector<char> out(zlib->avail_in + 8);
deflateInit2(zlib.get(), Z_DEFAULT_COMPRESSION, Z_DEFLATED, static_cast<int32_t>(mode), 8, Z_DEFAULT_STRATEGY);
zlib->avail_out = out.size();
zlib->next_out = reinterpret_cast<Bytef*>(out.data());
deflate(zlib.get(), Z_FINISH);
out.resize(zlib->total_out + 3);
deflateEnd(zlib.get());
return std::move(out);
And here is decompress
uIntf multiplier = 2;
uIntf currentSize = zlib->avail_in * (multiplier++) * 1000 /* Just to make sure enough output space(will implement loop) */;
std::vector<char> out(currentSize);
inflateInit2(zlib.get(), static_cast<int>(mode));
zlib->avail_out = out.size();
zlib->next_out = reinterpret_cast<Bytef*>(out.data());
inflate(zlib.get(), Z_FINISH);
out.resize(zlib->total_out);
inflateEnd(zlib.get());
return std::move(out);
Input is set in a different function (that is being called), that looks like this. (char* is not being deleted when compress/decompress is called)
zlib->next_in = reinterpret_cast<Bytef*>(bytes);
zlib->avail_in = static_cast<uIntf>(length);
I also have a mode enum
enum class Mode : int32_t {
    AUTO = 15u | 32u, // Never used on compress
    GZIP = 15u | 16u,
    ZLIB = 15,
    RAW = -15
};
Note: Test cases with the mode being AUTO (paired with zlib), ZLib, and RAW work. GZip fails the test case. (The test case is just a simple alphanum character array).
Also I debugged the output of the gzip decompress (after it failed) and the output is missing the last 3 characters (y, z, termination character)
Another note:
The constructor of the wrapper classes look like this
zlib->zalloc = Z_NULL;
zlib->zfree = Z_NULL;
zlib->opaque = Z_NULL;
zlib->avail_in = 0;
zlib->next_in = Z_NULL;

First off, a bunch of scattered code fragments with no context makes it impossible to see what's happening. See How to create a Minimal, Reproducible Example for how to provide a decent example.
Second, you are not saying what is returning Z_BUF_ERROR. There aren't even any places in your code fragments where you retain the return values of deflate() or inflate(), so it's not even possible for you to see a Z_BUF_ERROR! You need to at least do something like int ret = deflate(zlib.get(), Z_FINISH); and then check the value of ret.
Third, I cannot tell in your code fragments where or even if you set the input pointer and length. Is the length set to zero before the inits? Or is it set to the data? Or is the data pointer and length set after the inits? See the MRE link above.
Fourth, we don't have the example data that you're using. So we cannot reproduce the error. Again, see the MRE link.
Ok, so making a stab in the dark here, I will guess that deflate() is returning the error. Then the problem is likely that you have not provided enough output space, and you have asked for Z_FINISH, which is telling deflate() you have provided enough output space. In that case, deflate() returning Z_BUF_ERROR means that you didn't. Compression can expand the data if it is not compressible, and gzip adds more header and trailer information than zlib. Your + 8 is inadequate to account for those two things. A zlib header and trailer is six bytes, whereas a gzip header and trailer is at least 18 bytes. The expansion is a multiplier on the input, adding some part of a percent, where you have no multiplier on the length at all.
zlib provides a function for just this purpose, deflateBound(). You would call that after deflateInit() with the size of your input, and it will return the maximum size of the compressed output.
However it is better to call deflate() multiple times in a loop. For most practical applications, it is necessary to call inflate() multiple times in a loop. This is seen in your comment, as well as in your attempt (also inadequate) to account for the possible size of the inflated data by multiplying by a thousand.
You can find a heavily commented example of how to use the zlib functions properly, with loops, at zlib Usage Example.

Related

How to copy every N-th byte(s) of a C array

I am writing a bit of code in C++ where I want to play a .wav file and perform an FFT (with fftw) on it as it comes (and eventually display that FFT on screen with ncurses). This is mainly just a "for giggles/to see if I can" project, so I have no restrictions on what I can or can't use aside from wanting to keep the result fairly lightweight and cross-platform (I'm doing this on Linux for the moment). I'm also trying to do this "right" and not just hack it together.
I'm using SDL2_audio to achieve the playback, which is working fine. The callback is called at some interval requesting N bytes (seems to be desiredSamples*nChannels). My idea is that at the same time I'm copying the memory from my input buffer to SDL I might as well also copy it in to fftw3's input array to run an FFT on it. Then I can just set ncurses to refresh at whatever rate I'd like separate from the audio callback frequency and it'll just pull the most recent data from the output array.
The catch is that the input file is formatted where the channels are packed together. I.E "(LR) (LR) (LR) ...". So while SDL expects this, I need a way to just get one channel to send to FFTW.
The audio callback format from SDL looks like so:
void myAudioCallback(void* userdata, Uint8* stream, int len) {
    SDL_memset(stream, 0, len); // sizeof(stream) would only be the size of the pointer
    SDL_memcpy(stream, audio_pos, len);
    audio_pos += len;
}
where userdata is (currently) unused, stream is the array that SDL wants filled, and len is the length of stream (I.E the number of bytes SDL is looking for).
As far as I know there's no way to get memcpy to just copy every other sample (read: Copy N bytes, skip M, copy N, etc). My current best idea is a brute-force for loop a la...
// pseudocode
for (int i=0; i<len/2; i++) {
    fftw_in[i] = audio_pos + 2*i*sizeof(sample)
}
or even more brute force by just reading the file a second time and only taking every other byte or something.
Is there another way to go about accomplishing this, or is one of these my best option? It feels kind of kludgey to go from a nice one line memcpy to send to the data to SDL to some sort of weird loop to send it to fftw.
The OP's solution can be simplified (for copying bytes):
// pseudocode
const char* s = audio_pos;
for (int d = 0; s < audio_pos + len; d++, s += 2*sizeof(sample)) {
    fftw_in[d] = *s;
}
If I knew what fftw_in is, I would memcpy blocks of sizeof(*fftw_in).
Please check the assembly generated by @S.M.'s solution.
If the code is not vectorized, I would use intrinsics (depending on your hardware support) such as _mm_mask_blend_epi8.
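For reference, a strided copy of whole samples (not single bytes) might look like the sketch below. It assumes signed 16-bit samples stored in the host's byte order and a destination of doubles, which is what a real-input FFTW plan usually consumes; the question states neither, and the name extractChannel is made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies one channel out of an interleaved byte buffer, e.g. (LR)(LR)(LR)...
// Assumption: samples are signed 16-bit values in host byte order.
std::vector<double> extractChannel(const std::uint8_t* bytes, std::size_t lengthInBytes,
                                   std::size_t channelCount, std::size_t channel) {
    const std::size_t sampleSize = sizeof(std::int16_t);
    const std::size_t frameSize = channelCount * sampleSize;  // one sample per channel
    const std::size_t frameCount = lengthInBytes / frameSize;

    std::vector<double> out(frameCount);
    for (std::size_t frame = 0; frame < frameCount; ++frame) {
        std::int16_t sample;
        // memcpy avoids alignment and aliasing problems with the raw byte buffer.
        std::memcpy(&sample, bytes + frame * frameSize + channel * sampleSize, sampleSize);
        out[frame] = static_cast<double>(sample);
    }
    return out;
}
```

Inside the SDL callback this would be called as extractChannel(audio_pos, len, 2, 0) for the left channel. A loop like this is simple enough that compilers often vectorize it, so it is worth measuring before reaching for intrinsics.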

Seek in libarchive, how to reset header?

Is it possible to read a decompressed file once again?
Let's imagine I called archive_read_next_header(a, &entry),
and then read an unknown number of bytes using archive_read_data(a, ptr_to_buffer, buffer_size). Now I want to reset it and start reading again from the beginning. I am trying to override seekoff(std::streamoff off, std::ios_base::seekdir way, std::ios_base::openmode which). I understand that it might be impossible to simply seek inside decompressed data because of the inner workings of compression algorithms, and because the data is not stored anywhere except for a limited number of bytes in libarchive's internal buffer.
The idea is to reset everything and then read std::streamoff off bytes; that way I could implement a backward seek. A forward seek would be easy: just read std::streamoff off bytes. It's really inefficient, but hopefully seek won't be used much.
The archive structure was initialized this way:
archive_read_set_read_callback(a, read_callback);
archive_read_set_callback_data(a, container);
archive_read_set_seek_callback(a, seek_callback);
archive_read_set_skip_callback(a, skip_callback);
int r = (archive_read_open1(a));
where container holds, most importantly, a std::istream, and the callbacks are functions that manipulate that stream.
A template of what I would like to achieve:
```
std::streampos seek_beg(std::streamoff off) {
    if(off >= 0) {
        // read/skip 'off' bytes
    } else {
        // reset (a)
        // read/skip 'off' bytes
    }
    // return position
}
```
Also, my underflow() method is implemented this way:
```
int underflow() {
    int r = archive_read_data(ar, ptr, BUFFER_SIZE);
    if (r < 0) {
        throw std::runtime_error("ERROR");
    } else if (r == 0) {
        return std::streambuf::traits_type::eof();
    } else {
        setg(ptr, ptr, ptr + r);
    }
    return std::streambuf::traits_type::to_int_type(*ptr);
}
```
Libarchive documentation, more precisely, wishlist in libarchive wiki on GitHub says:
A few people have asked for the ability to efficiently "re-read"
particular archive entries. This is a tricky subject. For many
formats, the performance gains from this would be very modest. For
example, with a little performance work, the seeking Zip reader could
support very fast re-reading from the beginning since it only involves
re-parsing the central directory. The cases where there would be real
gains (e.g., tar.gz) are going to be very difficult to handle. The
most likely implementation would be some form of checkpointing so that
clients can explicitly ask for a checkpoint object and then restore
back to that checkpoint. The checkpoint object could be complex if you
have a series of stacked read filters plus state in the format handler
itself.
As far as I can see, seeking in archives with libarchive is not currently possible, so my solution was to remember all the data I read whenever I suspect I might want to re-read it, or alternatively to push it back to the stream.

Why is a different zlib window bits value required for extraction, compared with compression?

I am trying to debug a problem with some code that uses zlib 1.2.8. The problem is that this larger project can make archives, but runs into Z_DATA_ERROR header problems when trying to extract that archive.
To do this, I wrote a small test program in C++ that compresses ("deflates") a specified regular file, writes the compressed data to a second regular file, and extracts ("inflates") to a third regular file, one line at a time. I then diff the first and third files to make sure I get the same bytes.
For reference, this test project is located at: https://github.com/alexpreynolds/zlib-test and compiles under Clang (and should also compile under GNU GCC).
My larger question is how to deal with header data correctly in my larger project.
In my first test scenario, I can set up compression machinery with the following code:
z_error = deflateInit(this->z_stream_ptr, ZLIB_TEST_COMPRESSION_LEVEL);
Here, ZLIB_TEST_COMPRESSION_LEVEL is 1, to provide best speed. I then run deflate() on the z_stream pointer until there is nothing left that comes out of compression.
To extract these bytes, I can use inflateInit():
int ret = inflateInit(this->z_stream_ptr);
So what is the header format, in this case?
In my second test scenario, I set up the deflate machinery like so:
z_error = deflateInit2(this->z_stream_ptr,
                       ZLIB_TEST_COMPRESSION_LEVEL,
                       ZLIB_TEST_COMPRESSION_METHOD,
                       ZLIB_TEST_COMPRESSION_WINDOW_BITS,
                       ZLIB_TEST_COMPRESSION_MEM_LEVEL,
                       ZLIB_TEST_COMPRESSION_STRATEGY);
These deflate constants are, respectively, 1 for level, Z_DEFLATED for method, 15+16 or 31 for window bits, 8 for memory level, and Z_DEFAULT_STRATEGY for strategy.
The former inflateInit() call does not work; instead, I must use inflateInit2() and specify a modified window bits value:
int ret = inflateInit2(this->z_stream_ptr, ZLIB_TEST_COMPRESSION_WINDOW_BITS + 16);
In this case, the window bits value is not 31 as in the deflateInit2() call, but 15+32 or 47.
If I use 31 (or any other value than 47), then I get a Z_DATA_ERROR on subsequent inflate() calls. That is, if I use the same window bits for the inflateInit2() call:
int ret = inflateInit2(this->z_stream_ptr, ZLIB_TEST_COMPRESSION_WINDOW_BITS);
Then I get the following error on attempting to inflate():
Error: inflate to stream failed [-3]
Here, -3 is the same as Z_DATA_ERROR.
According to the documentation, using 31 with deflateInit2() should write a gzip header and trailer. Thus, 31 on the following inflateInit2() call should be expected to be able to extract the header information.
Why is the modified value 47 working, but not 31?
My test project is mostly similar to the example code on the zlib site, with the exception of the extraction/inflation code, which inflates one z_stream chunk at a time and parses the output for newline characters.
Is there something special about running inflate() only when a new buffer of extracted data is asked for — like header information going missing between inflate() calls — as opposed to running the whole extraction in one pass, as in the zlib example code?
My larger debugging problem is looking for a robust way to extract a chunk of zlib-compressed data only on request, so that I can extract data one line at a time, as opposed to getting the whole extracted file. Something about the way I am handling the zlib format parameter seems to be messing me up, but I can't figure out why or how to fix this.
deflateInit() and inflateInit(), as well as deflateInit2() and inflateInit2() with windowBits in 0..15 all process zlib-wrapped deflate data. (See RFC 1950 and RFC 1951.)
deflateInit2() and inflateInit2() with negative windowBits in -1..-15 process raw deflate data with no header or trailer. deflateInit2() and inflateInit2() with windowBits in 16..31, i.e. 16 added to 0..15, process gzip-wrapped deflate data (RFC 1952). inflateInit2() with windowBits in 32..47 (32 added to 0..15) will automatically detect either a gzip or zlib header (but not raw deflate data), and decompress accordingly.
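The ranges above can be condensed into a small lookup, sketched here. The Wrapper enum and classifyWindowBits are names invented for this illustration and are not part of zlib. The function only mirrors the ranges as stated in this answer; zlib itself rejects some base values inside them, and the auto-detect range applies to inflateInit2() only.

```cpp
// Hypothetical classification of a windowBits argument by the wrapper format
// it selects, following the ranges described in the answer.
enum class Wrapper { Raw, Zlib, Gzip, AutoDetect, Invalid };

constexpr Wrapper classifyWindowBits(int windowBits) {
    return (windowBits >= -15 && windowBits <= -1) ? Wrapper::Raw          // no header or trailer
         : (windowBits >= 0 && windowBits <= 15)   ? Wrapper::Zlib         // RFC 1950 wrapper
         : (windowBits >= 16 && windowBits <= 31)  ? Wrapper::Gzip         // RFC 1952 wrapper
         : (windowBits >= 32 && windowBits <= 47)  ? Wrapper::AutoDetect   // inflate only
                                                   : Wrapper::Invalid;
}

static_assert(classifyWindowBits(15 + 16) == Wrapper::Gzip, "31 selects gzip");
static_assert(classifyWindowBits(15 + 32) == Wrapper::AutoDetect, "47 selects auto-detect");
```

Read this way, data written with 31 is gzip-wrapped, and both 31 and 47 should be able to inflate it.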
Why is the modified value 47 working, but not 31?
31 does work. I did not try to look at your code to debug it.
Is there something special about running inflate() only when a new
buffer of extracted data is asked for — like header information going
missing between inflate() calls — as opposed to running the whole
extraction in one pass, as in the zlib example code?
I can't figure out what you're asking here. Perhaps a more explicit example would help. The whole point of inflate() is to decompress a chunk at a time.

Why sync-safe integer?

I have recently been working with ID3v2.4.0.
Reading the 2.4.0 document, I found one part that I can't understand: the sync-safe integer.
Why does ID3v2 use this method?
Of course, I know why ID3v2 uses the unsynchronisation scheme: it keeps an MPEG decoder from mistaking the ID3 tag for MPEG sync data.
What I can't understand is why the sync-safe integer is used instead of the unsynchronisation scheme (inserting $00).
Is there any reason why they adopted the sync-safe integer for expressing the tag size instead of inserting $00?
The two methods have exactly the same effect.
ID3v2 document says that the size of unsynchronized data is not known in advance.
But that statement does not make sense.
If tag data is stored in buffer, one can know the size of unsynchronized data after simply replacing the problematic character with $FF 00.
Is there anyone who can help me?
I would presume for simplicity, and the unsynch/synch scheme only makes sense when used on an mpeg file.
It is trivial to read in the four bytes and convert them to a regular integer:
// pseudo code
uint32_t size;
file.read( &size, sizeof(uint32_t) );
// The four bytes are stored big-endian; convert size to host byte order before unpacking.
size = (size & 0x0000007F) |
       ( (size & 0x00007F00) >> 1 ) |
       ( (size & 0x007F0000) >> 2 ) |
       ( (size & 0x7F000000) >> 3 );
If they used the same unsynch scheme as frame data you would need to read each byte separately, look for the FF00 pattern, and reconstruct the integer byte by byte. Also, if the ‘size’ field in the header could be a variable number of bytes, due to unsynch bytes being inserted, the entire header would be a variable number of bytes. Simpler for them to say 'the header is always 10 bytes in size and it looks like this...'.
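To make the fixed-width point concrete, here is a small byte-order-independent sketch of both directions. decodeSyncSafe and encodeSyncSafe are names chosen for this example; they take and return the four bytes in the order they appear in the file, most significant first, with seven payload bits per byte.

```cpp
#include <array>
#include <cstdint>

// Decodes a 4-byte sync-safe integer: 7 payload bits per byte, top bit always 0.
std::uint32_t decodeSyncSafe(const std::array<std::uint8_t, 4>& bytes) {
    return (static_cast<std::uint32_t>(bytes[0] & 0x7F) << 21) |
           (static_cast<std::uint32_t>(bytes[1] & 0x7F) << 14) |
           (static_cast<std::uint32_t>(bytes[2] & 0x7F) << 7) |
           static_cast<std::uint32_t>(bytes[3] & 0x7F);
}

// Encodes a value of at most 28 bits as a 4-byte sync-safe integer.
// Bits above the 28th are silently dropped in this sketch.
std::array<std::uint8_t, 4> encodeSyncSafe(std::uint32_t value) {
    return {{
        static_cast<std::uint8_t>((value >> 21) & 0x7F),
        static_cast<std::uint8_t>((value >> 14) & 0x7F),
        static_cast<std::uint8_t>((value >> 7) & 0x7F),
        static_cast<std::uint8_t>(value & 0x7F),
    }};
}
```

Because no encoded byte can have its top bit set, the size field can never contain $FF, so it always stays four bytes long and never needs unsynchronisation.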
ID3v2 document says that the size of unsynchronized data is not known in advance. But that statement does not make sense. If tag data is stored in buffer, one can know the size of unsynchronized data after simply replacing the problematic character with $FF 00.
You are correct, it doesn't make sense. The size written in the id3v2 header and frame headers is the size after unsynchronisation, if any, was applied. However, it is permissible to write frame data without unsynching as id3v2 may be used for tagging files other than mp3, where the concept of unsynch/synch makes no sense. I think what section 6.2 was trying to say is 'regardless of whether this is an mp3 file, or a frame is written unsynched/synched, the frame size is always written in a mpeg synch-safe manner'.
ID3v2.4 frames can have the ‘Data Length Indicator’ flag set in the frame header, in which case you can find out how big a buffer is after synchronisation. Refer to section 4.1.2 of the spec.
Is there anyone who can help me?
Some helpful advice from someone who has written a conforming id3v2 tag reader: Don't try make sense of the spec. It surely was written by madmen and sadists. Just looking at it again is giving me nightmares.

String issue with assert on erase

I am developing a program in C++ that uses std::string to store network data received from a socket (this part works fine). I receive the data in frames of at most 1452 bytes. The protocol uses a fixed 20-byte header that contains the length of the data portion of the packet. My problem is that the string triggers a debug assertion with no message explaining what went wrong. Because a single frame can contain more than one packet, I append all received data to the string, reinterpret_cast its start to my header struct, calculate the total length of the packet, and copy the data portion into another string for regex processing. At that point I call erase on the string, as in mybuff.Erase(totalPackLen); and this is the call that triggers the assert, even though totalPackLen is less than the string's size.
Is there some convention I am missing here? Or is it that the std::string really is an inappropriate choice here? Ty.
Fixed it on my own. Rolled my own VERY simple buffer with a few C calls :)
int ret = recv(socket,m_buff,0);
if(ret > 0)
{
    BigBuff.append(m_buff,ret);
    while(BigBuff.size() > 16){
        Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
        if(ntohs(hdr->PackLen) <= BigBuff.size() - 20){
            hdr->PackLen = ntohs(hdr->PackLen);
            string lData;
            lData.append(BigBuff.begin() + 20,BigBuff.begin() + 20 + hdr->PackLen);
            Parse(lData); //regex parsing helper function
            BigBuff.erase(hdr->PackLen + 20); //assert here when len is packlen is 235 and string len is 1458;
        }
    }
}
From the code snippet you provided it appears that your packet comprises a fixed-length binary header followed by a variable length ASCII string as a payload. Your first mistake is here:
BigBuff.append(m_buff,ret);
There are at least two problems here:
1. Why the append? You presumably have dispatched with any previous messages. You should be starting with a clean slate.
2. Mixing binary and string data can work, but more often than not it doesn't. It is usually better to keep the binary and ASCII data separate. Don't use std::string for non-string data.
Append adds data to the end of the string. The very next statement after the append is a test for a length of 16, which says to me that you should have started fresh. In the same vein you do that reinterpret cast from BigBuff[0]:
Header *hdr = reinterpret_cast<Header*>(&BigBuff[0]);
Because of your use of append, you are perpetually dealing with the header from the first packet received rather than the current packet. Finally, there's that erase:
BigBuff.erase(hdr->PackLen + 20);
Many problems here:
- If the packet length and the return value from recv are consistent the very first call will do nothing (the erase is at but not past the end of the string).
- There is something very wrong if the packet length and the return value from recv are not consistent. It might mean, for example, that multiple physical frames are needed to form a single logical frame, and that in turn means you need to go back to square one.
- Suppose the physical and logical frames are one and the same, you're still going about this all wrong. As noted, the first time around you are erasing exactly nothing. That append at the start of the loop is exactly what you don't want to do.
Serialization oftentimes is a low-level concept and is best treated as such.
Your comment doesn't make sense:
BigBuff.erase(hdr->PackLen + 20); //assert here when len is packlen is 235 and string len is 1458;
BigBuff.erase(hdr->PackLen + 20) will erase from hdr->PackLen + 20 onwards till the end of the string. From the description of the code - seems to me that you're erasing beyond the end of the content data. Here's the reference for std::string::erase() for you.
Needless to say that std::string is entirely inappropriate here, it should be std::vector.
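The difference between the two erase overloads can be shown with a short sketch. The helper names consumeFront and truncateAt are invented for this illustration. Note that std::string::erase(pos) throws std::out_of_range when pos is greater than size(), and that some debug standard libraries assert on bad arguments before that point.

```cpp
#include <cstddef>
#include <string>

// Removes the first `count` characters, e.g. one fully processed packet,
// and keeps whatever follows it in the buffer.
void consumeFront(std::string& buffer, std::size_t count) {
    buffer.erase(0, count);
}

// Single-argument form: keeps the first `position` characters and removes
// everything from `position` to the end of the string.
void truncateAt(std::string& buffer, std::size_t position) {
    buffer.erase(position);
}
```

If the intent is to drop a processed packet and keep the remaining bytes, the two-argument form is the one that does it, with hdr->PackLen + 20 as the count.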