zlib compress() produces an awful compression ratio - C++

I want to compress my data using zlib's compress() function, so my code looks like the following:
ifs.read(srcBuf, srcLen);                    // std::ifstream, srcLen = 256 KB
compress(dstBuf, &dstLen, srcBuf, srcLen);   // casts are omitted
ofs.write(dstBuf, dstLen);                   // std::ofstream
dstLen = dstBufSize;
The resulting file is only ~4% smaller than the original (380 MB vs 360 MB), which is actually awful.
Meanwhile, WinRAR compresses this file to a 70 MB file. I've tried bzip2 and zlib, and both give similar results. I guess the problem is that the 256 KB buffer is too small, but I'd like to understand how it works and how I can use zlib to achieve better compression.
Overall, I want to build a low-level facility that compresses several files into one big file for internal use, and compress() looked well suited for it, but...
Deep explanations are very welcome. Thanks in advance.

I believe your problem is that by using the compress() function (rather than deflateInit()/deflate()/deflateEnd()), you are underutilizing zlib's compression abilities.
The key insight here is that zlib's deflate compression works by building up a dictionary of previously seen input: short "tokens" in the output succinctly represent longer sequences of input bytes (the matches are then Huffman-coded). That way, whenever those longer sequences are repeated later in the input stream, they can be replaced by their equivalent tokens in the output stream, greatly reducing the total size of the compressed data.
However, the efficiency of that process depends a lot on the persistence of that built-up dictionary state, which in turn depends on your program keeping the deflate algorithm's state alive for the entire duration of the compression process. Your code calls compress(), which is meant to be a single-shot convenience function for small amounts of data, and as such compress() does not provide any way for your program to retain state across multiple calls to it. With each call to compress(), brand-new state is built up, used for the data passed to that call, and then forgotten -- it is inaccessible to any subsequent compress() calls. That is likely the source of the poor efficiency you are seeing.
The fix is not to use compress() when you need to compress the data in more than one step. Instead, call deflateInit() (to allocate the state for the algorithm), then call deflate() multiple times (to compress data using, and updating, that state), and finally call deflateEnd() to clean up.
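For illustration, here is a minimal sketch of that deflateInit()/deflate()/deflateEnd() pattern, along the lines of zlib's zpipe.c example (error handling abbreviated; the function and buffer names are placeholders):

#include <zlib.h>
#include <fstream>
#include <vector>

bool deflate_stream(std::ifstream& ifs, std::ofstream& ofs)
{
    z_stream strm{};
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return false;

    const size_t chunk = 256 * 1024;                 // same 256 KB chunk size as above
    std::vector<char> in(chunk), out(chunk);
    int flush;
    do {
        ifs.read(in.data(), chunk);
        strm.next_in  = reinterpret_cast<Bytef*>(in.data());
        strm.avail_in = static_cast<uInt>(ifs.gcount());
        flush = ifs.eof() ? Z_FINISH : Z_NO_FLUSH;   // finish the stream on the last chunk

        do {                                         // one input chunk may yield several output chunks
            strm.next_out  = reinterpret_cast<Bytef*>(out.data());
            strm.avail_out = static_cast<uInt>(chunk);
            deflate(&strm, flush);                   // the same z_stream persists across calls
            ofs.write(out.data(), chunk - strm.avail_out);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);

    deflateEnd(&strm);
    return true;
}

The important difference from your loop is that the single z_stream (and therefore the dictionary built from earlier chunks) stays alive across all the deflate() calls.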

Use deflateInit(), deflate(), and deflateEnd() instead of compress(). I don't know whether or not that will improve the compression, since you provided no information about the data and only the slightest clue as to what your program does (are those lines inside a loop?). However, if you are compressing something large that you are not loading into memory all at once, then you should not be using compress().

Related

How to parse very large buffer with flex and bison

I've written a JSON library that uses flex and bison to parse serialized JSON (i.e. strings) and deserialize it into JSON objects. It works great for small strings.
However, it fails with very large strings (I tried strings of almost 3 GB) with this error:
‘fatal flex scanner internal error--end of buffer missed’
I want to know what is the maximum size of buffer which I can pass to this function:
//js: serialized json stored in std::string
yy_scan_bytes(js.data(), js.size());
and how can I make flex/bison work with large buffers?
It looks to me like you are using an old version of the flex skeleton (and hence of flex), in which string lengths were assumed to fit into ints. The error message you are observing is probably the result of an int overflowing to a negative value.
I believe that if you switch to version 2.5.37 or more recent, you'll find that most of those ints have become size_t and you should have no problem calling yy_scan_bytes with an input buffer whose size exceeds 2 gigabytes. (The prototype for that function now takes a size_t rather than an int, for example.)
I have a hard time believing that doing so is a good idea, however. For a start, yy_scan_bytes copies the entire string, because the lexical scanner wants a string it is allowed to modify and because it wants to assure itself that the string has two NUL bytes at the end. Making that copy is going to needlessly use up a lot of memory, and if you're going to copy the buffer anyway, you might as well copy it in manageable pieces (say, 64 KiB or even 1 MiB). That will only prove problematic if you have single tokens which are significantly larger than the chunk size, because flex is definitely not optimized for large single tokens. But for all normal use cases, it will probably work out a lot better.
Flex doesn't provide an interface for splitting a huge input buffer into chunks, but you can do it very easily by redefining the YY_INPUT macro. (If you do that, you'll probably end up using yyin as a pointer to your own buffer structure, which is theoretically non-portable. However, it will work on any Posix architecture, where all object pointers have the same representation.)
Of course, you would normally not want to wait while 3GB of data is accumulated in memory to start parsing it. You could parse incrementally as you read the data. (You might still need to redefine YY_INPUT, depending on how you are reading the data.)
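To make that concrete, here is a rough sketch (names like g_input are placeholders, not part of flex) of redefining YY_INPUT in the definitions section of the .l file so that the scanner pulls the big serialized-JSON string in fixed-size chunks instead of going through yy_scan_bytes:

#include <cstring>
#include <algorithm>

struct InputState {
    const char* data;   // the big serialized-JSON buffer
    size_t      size;   // total number of bytes
    size_t      pos;    // bytes consumed so far
};

static InputState g_input;   // set data/size/pos before calling yyparse()

// Overrides flex's default input mechanism; must be defined before the rules.
#define YY_INPUT(buf, result, max_size)                          \
    do {                                                         \
        size_t remaining = g_input.size - g_input.pos;           \
        size_t n = std::min<size_t>((max_size), remaining);      \
        if (n == 0) {                                            \
            (result) = YY_NULL;            /* end of input */    \
        } else {                                                 \
            std::memcpy((buf), g_input.data + g_input.pos, n);   \
            g_input.pos += n;                                    \
            (result) = n;                                        \
        }                                                        \
    } while (0)

The same structure works if you read directly from a file or socket instead of an in-memory buffer, which is how you would parse incrementally as the data arrives.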

Decompressing a byte array with zlib to a byte array

Context: I'm using a .mbtiles file, a geomapping file format, which is a sqlite database file containing vector tiles.
Those vector tiles are packed using protocol buffer and then gzipped.
I'm using C++, and currently reading the zlib usage decompression example, but I am not sure about how to handle chunks and the end of stream event.
SQLite gives me a void* pointer and a length.
I quote the page:
For applications where zlib streams are embedded in other data, this
routine would need to be modified to return the unused data, or at
least indicate how much of the input data was not used, so the
application would know where to pick up after the zlib stream.
The protocol buffer class methods either take void* or std::string. I guess I should go with void*.
I'm not sure how those events work, and the example doesn't seem to provide a case for byte arrays. How should I change the code to avoid errors?
It sounds like SQLite is giving you a zlib stream without anything after it. If so, then that comment doesn't apply.
In any case, you are looking at the right page. (You didn't say what "the page" is, but I recognize the quote, since I wrote it.) That shows in general how to use the zlib functions. You should be able to figure out how to apply it to a byte array instead of file input.
If the data is really "gzipped", then you will need to use inflateInit2() instead of inflateInit(). Read the zlib documentation in zlib.h.
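As a concrete illustration, here is a minimal sketch (not the zlib manual's example, and with error handling abbreviated) of inflating a gzip-wrapped byte array held in memory into a std::string; the 16 + MAX_WBITS argument to inflateInit2() is what tells zlib to expect a gzip header:

#include <zlib.h>
#include <string>
#include <stdexcept>

std::string gunzip(const void* src, size_t srcLen)
{
    z_stream strm{};
    if (inflateInit2(&strm, 16 + MAX_WBITS) != Z_OK)   // 16 + MAX_WBITS = gzip decoding
        throw std::runtime_error("inflateInit2 failed");

    strm.next_in  = static_cast<Bytef*>(const_cast<void*>(src));
    strm.avail_in = static_cast<uInt>(srcLen);

    std::string out;
    char chunk[16384];
    int ret;
    do {
        strm.next_out  = reinterpret_cast<Bytef*>(chunk);
        strm.avail_out = sizeof(chunk);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {      // Z_DATA_ERROR, Z_MEM_ERROR, ...
            inflateEnd(&strm);
            throw std::runtime_error("inflate failed");
        }
        out.append(chunk, sizeof(chunk) - strm.avail_out);
    } while (ret != Z_STREAM_END);                     // end-of-stream "event"

    inflateEnd(&strm);
    return out;
}

The decompressed bytes in out can then be handed to the protocol-buffer method that takes a std::string, or out.data()/out.size() to the one that takes a void* and a length.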

Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data comes in at a rate of 10 MS/s; it is saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app | ios::binary);
if (outFile.is_open())
    cout << "Yes" << endl;

while (1)
{
    rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
    switch (metaData.error_code)
    {
        //Irrelevant error checking...
        //Write data to a file
        std::copy(begin(rxBuffer), end(rxBuffer),
                  std::ostream_iterator<complex<float>>(outFile));
    }
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
The main problem here is that you try to write in the same thread that you receive in, which means that recv() can only be called again after the copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. A plain write() to a file descriptor might be better, but that's a performance measurement that's up to you (see the sketch after this answer).
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even after decoupling receiving from writing into separate threads, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead.
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
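To make the first two observations concrete, here is a rough sketch (not the UHD rx_samples_to_file example; the queue and names are placeholders) of pushing each received buffer to a separate writer thread that stores the raw bytes with a plain binary write:

#include <fstream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
#include <vector>
#include <complex>

std::queue<std::vector<std::complex<float>>> pendingBlocks;
std::mutex              queueMutex;
std::condition_variable queueCv;
bool                    finished = false;   // set to true (then notify) at shutdown

void writerThread(const char* path)
{
    std::ofstream out(path, std::ios::binary);
    std::unique_lock<std::mutex> lock(queueMutex);
    while (!finished || !pendingBlocks.empty()) {
        queueCv.wait(lock, [] { return finished || !pendingBlocks.empty(); });
        while (!pendingBlocks.empty()) {
            std::vector<std::complex<float>> block = std::move(pendingBlocks.front());
            pendingBlocks.pop();
            lock.unlock();                            // don't hold the lock during disk I/O
            out.write(reinterpret_cast<const char*>(block.data()),
                      block.size() * sizeof(block[0]));
            lock.lock();
        }
    }
}

// In the receive loop, instead of the std::copy(...) call:
//     {
//         std::lock_guard<std::mutex> g(queueMutex);
//         pendingBlocks.emplace(rxBuffer.begin(), rxBuffer.begin() + rxSamples);
//     }
//     queueCv.notify_one();

This way recv() is never stalled waiting on the disk, and the samples are written as raw binary rather than formatted text.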
Let's do a bit of math.
Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-150 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
I'd also note that, as it stands right now, your code has an even more serious problem: since it doesn't specify a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means that if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output would add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
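As a quick sanity check, reading that binary layout back is symmetric. A small sketch, assuming the count-then-samples layout written above:

#include <fstream>
#include <vector>
#include <complex>
#include <cstdint>

std::vector<std::complex<float>> readBlock(std::ifstream& in)
{
    uint32_t nItems = 0;
    in.read(reinterpret_cast<char*>(&nItems), sizeof(nItems));

    std::vector<std::complex<float>> items(nItems);
    in.read(reinterpret_cast<char*>(items.data()),
            nItems * sizeof(std::complex<float>));
    return items;                                     // check in.good() before trusting the result
}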
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some time to copy the data to the cache and then, somewhat later, copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so the cache no longer holds other data that's a lot more likely to benefit from caching.
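For reference, a hypothetical Windows-only sketch of the CreateFile/FILE_FLAG_NO_BUFFERING approach; with that flag, buffer addresses, transfer sizes, and file offsets must all be multiples of the volume's sector size (assumed to be 4096 here; query the real value in production code):

#include <windows.h>
#include <malloc.h>
#include <cstring>

bool writeUnbuffered(const char* path, const void* samples, DWORD byteCount)
{
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    const DWORD sector = 4096;                        // assumption; see GetDiskFreeSpace
    DWORD padded = (byteCount + sector - 1) / sector * sector;

    void* aligned = _aligned_malloc(padded, sector);  // sector-aligned staging buffer
    std::memset(aligned, 0, padded);
    std::memcpy(aligned, samples, byteCount);

    DWORD written = 0;
    BOOL ok = WriteFile(h, aligned, padded, &written, nullptr);

    _aligned_free(aligned);
    CloseHandle(h);
    return ok && written == padded;
}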

Reading large txt efficiently in c++

I have to read a large text file (> 10 GB) in C++. It is a CSV file with variable-length lines. When I try to read it line by line using ifstream it works, but it takes a long time; I guess this is because each time I read a line it goes to disk and reads, which makes it very slow.
Is there a way to read in buffers, for example reading 250 MB in one shot (using the read method of ifstream) and then getting lines from this buffer? I see a lot of issues with that solution, like the buffer possibly containing incomplete lines, etc.
Is there a solution in C++ that handles all these cases? Are there any open-source libraries that can do this, for example Boost?
Note: I would want to avoid C-style FILE* pointers.
Try using the Windows memory-mapped file functions. The calls are buffered and you get to treat the file as if it's just memory.
memory mapped files
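A bare-bones sketch of that approach (error handling omitted; note that on a 32-bit build a file this large cannot be mapped in a single view, so you would map and process it in smaller views instead):

#include <windows.h>

void processMappedFile(const char* path)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);

    // The whole file becomes one big read-only char array.
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    // ... scan data[0 .. size.QuadPart) for '\n' to split it into lines ...

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}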
IOstreams already use buffers much as you describe (though usually only a few kilobytes, not hundreds of megabytes). You can use pubsetbuf to get it to use a larger buffer, but I wouldn't expect any huge gains. Most of the overhead in IOstreams stems from other areas (like using virtual functions), not from lack of buffering.
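For what it's worth, a pubsetbuf call looks roughly like this (file name is a placeholder); whether the larger buffer is actually honored is implementation-defined, and with libstdc++ it has to be set before the file is opened:

#include <fstream>
#include <string>

int main()
{
    static char bigBuffer[1 << 20];                   // 1 MiB stream buffer
    std::ifstream in;
    in.rdbuf()->pubsetbuf(bigBuffer, sizeof bigBuffer);
    in.open("large.csv", std::ios::binary);

    std::string line;
    while (std::getline(in, line)) {
        // ... parse the CSV line ...
    }
}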
If you're running this on Windows, you might be able to gain a little by writing your own stream buffer, and having it call CreateFile directly, passing (for example) FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING. Under the circumstances, either of these may help your performance substantially.
If you want real speed, then you're going to have to stop reading lines into std::string and start using char*s into the buffer. Whether you read that buffer using ifstream::read() or memory-mapped files is less important, though read() has the disadvantage you note: the buffer can hold N complete lines plus one incomplete line, and you need to recognise that (this is easy to do by scanning the rest of the buffer for '\n' -- perhaps by putting a NUL after the buffer and using strchr). You'll also need to copy the partial line to the start of the buffer, read the next chunk from the file so it continues from that point, and cap the maximum number of characters read so that it doesn't overflow the buffer. If you're nervous about FILE*, I hope you're comfortable with const char*....
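A rough sketch of that chunked-read scheme (process_line and the buffer size are placeholders; it assumes no single line exceeds the buffer size):

#include <fstream>
#include <vector>
#include <cstring>

void process_line(const char* begin, const char* end);    // your CSV handling

void readInChunks(const char* path, size_t bufSize = 250 * 1024 * 1024)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(bufSize + 1);                    // +1 for a terminating NUL
    size_t carry = 0;                                      // bytes of partial line kept

    while (in) {
        in.read(buf.data() + carry, bufSize - carry);
        size_t have = carry + static_cast<size_t>(in.gcount());
        if (have == 0)
            break;
        buf[have] = '\0';                                  // lets strchr stop at the end

        char* lineStart = buf.data();
        while (char* nl = std::strchr(lineStart, '\n')) {
            process_line(lineStart, nl);                   // one complete line (without '\n')
            lineStart = nl + 1;
        }

        carry = static_cast<size_t>(buf.data() + have - lineStart);
        std::memmove(buf.data(), lineStart, carry);        // keep the partial line for next round
    }
    if (carry > 0)
        process_line(buf.data(), buf.data() + carry);      // final line with no trailing '\n'
}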
As you're proposing this for performance reasons, I do hope you've profiled to make sure that it's not your CSV field extraction etc. that's the real bottleneck.
I hope this helps -
http://www.cppprog.com/boost_doc/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
BTW, regarding your concern that the "buffer can have incomplete lines": in this situation, how about reading 250 MB and then reading char by char until you reach the delimiter to complete the line?

Which is faster in memory, ints or chars? And file-mapping or chunk reading?

Okay, so I've written a (rather unoptimized) program before to encode images to JPEGs, however, now I am working with MPEG-2 transport streams and the H.264 encoded video within them. Before I dive into programming all of this, I am curious what the fastest way to deal with the actual file is.
Currently I am file-mapping the .mts file into memory to work on it, although I am not sure if it would be faster to (for example) read 100 MB of the file into memory in chunks and deal with it that way.
These files require a lot of bit-shifting and such to read flags, so I am wondering whether, when I reference some of the memory, it is faster to read 4 bytes at once as an integer or 1 byte at a time as a character. I thought I read somewhere that x86 processors are optimized for 4-byte granularity, but I'm not sure if this is true...
Thanks!
Memory mapped files are usually the fastest operations available if you require your file to be available synchronously. (There are some asynchronous APIs that allow the O/S to reorder things for a slight speed increase sometimes, but that sounds like it's not helpful in your application)
The main advantage you're getting with the mapped files is that you can work in memory on the file while it is still being read from disk by the O/S, and you don't have to manage your own locking/threaded file reading code.
Memory-reference wise, on x86 memory is going to be read an entire cache line at a time no matter what you're actually working with. The extra time associated with wider-than-byte operations comes from the fact that integer accesses need not be naturally aligned. For example, performing an ADD will take more time if things aren't aligned on a 4-byte boundary, but for something like a memory copy there will be little difference. If you are working with inherently character data, then it's going to be faster to keep it that way than to read everything as integers and bit-shift things around.
If you're doing H.264 or MPEG-2 encoding, the bottleneck is probably going to be CPU time rather than disk I/O in any case.
If you have to access the whole file, it is always faster to read it to memory and do the processing there. Of course, it's also wasting memory, and you have to lock the file somehow so you won't get concurrent access by some other application, but optimization is about compromises anyway. Memory mapping is faster if you're skipping (large) parts of the file, because you don't have to read them at all then.
Yes, accessing memory at 4-byte (or even 8-byte) granularity is faster than accessing it byte-wise. Again it's a compromise - depending on what you have to do with the data afterwards, and how skilled you are at fiddling with the bits in an int, it might not be faster overall.
As for everything regarding optimization:
measure
optimize
measure
These are sequential bit-streams - you basically consume them one bit at a time without random-access.
You don't need to put a lot of effort into explicitly buffering reads and such in this scenario: the operating system will be buffering them for you anyway. I've written H.264 parsers before, and the time is completely dominated by the decoding and manipulation, not the IO.
My recommendation is to use a standard library for parsing these bit-streams.
Flavor is such a parser generator, and the website even includes examples of MPEG-2 (PS) and various H.264 parts like the M-Coder. Flavor builds native parsing code from a C++-like language; here's a quote from the MPEG-2 PS spec:
class TargetBackgroundGridDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 7
{
unsigned int(14) horizontal_size;
unsigned int(14) vertical_size;
unsigned int(4) aspect_ratio_information;
}
class VideoWindowDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 8
{
unsigned int(14) horizontal_offset;
unsigned int(14) vertical_offset;
unsigned int(4) window_priority;
}
Regarding the best size to read from memory, I'm sure you will enjoy reading this post about memory access performance and cache effects.
One thing to consider about memory-mapping files is that a file whose size exceeds the available address range can only have a portion mapped at a time. To access the remainder of the file, the first part must be unmapped and the next part mapped in its place.
Since you're decoding MPEG streams, you may want to use a double-buffered approach with asynchronous file reading. It works like this:
blocksize = 65536 bytes (or whatever)
currentblock = new byte[blocksize]
nextblock = new byte[blocksize]
read currentblock
while processing
    asynchronously read nextblock
    parse currentblock
    wait for asynchronous read to complete
    swap nextblock and currentblock
endwhile
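For what it's worth, here is one possible C++ rendering of that pseudocode, using std::async to overlap the read of the next block with the parsing of the current one (parse_block and the names are placeholders):

#include <fstream>
#include <future>
#include <vector>
#include <utility>

void parse_block(const std::vector<char>& block, size_t validBytes);   // your parsing

void doubleBufferedRead(const char* path, size_t blocksize = 65536)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> current(blocksize), next(blocksize);

    in.read(current.data(), blocksize);                    // read currentblock
    size_t currentBytes = static_cast<size_t>(in.gcount());

    while (currentBytes > 0) {
        // asynchronously read nextblock
        auto pending = std::async(std::launch::async, [&]() -> size_t {
            in.read(next.data(), blocksize);
            return static_cast<size_t>(in.gcount());
        });

        parse_block(current, currentBytes);                // parse currentblock

        size_t nextBytes = pending.get();                  // wait for the read to complete
        std::swap(current, next);                          // swap nextblock and currentblock
        currentBytes = nextBytes;
    }
}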