C++: what in-memory compression library?

I already googled for in-memory compression and found quite a few libraries that offer this functionality. zlib seems to be widely used, but it also seems to be quite old. I'm asking here whether there are newer, better alternatives.
The data I want to compress in memory consists of memory pools a few megabytes in size (2-16 MB), and each of those blocks contains data of two different structs as well as some arrays of pointers. Inside the blocks there is no particular order for the structs and the arrays; they are just allocated one after another as the application needs to create such elements.
What compression library would you suggest for this? Compression and decompression performance (both) are more important than compression ratio.
Also - for compression reasons - would it be better to have separate pools for the two different structs as well as the arrays, so that each data block to be compressed only contains one kind of data?
This is the first time I intend to use in-memory compression, and I know my question is maybe too general to give a good answer - but every hint is welcome!
thx!

zlib is good. Proven, performant, and understood by many. It's what I'd use by default in a new system like what you describe. Its age should be seen as one of its greatest assets.

For something more modern than zlib, libbzip2 might be worth a look. It provides a similar interface to zlib, for compatibility. In a lot of cases, it offers better compression, but at a performance cost.
For something faster than zlib (but which doesn't compress as well) there's LZO.

It makes no sense to do this on modern operating systems with a virtual memory manager. You'll create a blob of bytes that are not useful for anything, taking space in your virtual memory address space for no good reason. The memory manager won't leave it in RAM for very long; it will notice that the pages occupied by the blob are not being accessed and swap them out to the paging file.
In addition, you'll have to translate the data if it contains pointers. The odds that you'll be able to decompress the data at the exact same virtual memory address, so that the pointers are still valid, are very close to zero. After all, you did this to free up virtual memory space, the hole previously used by the data will be occupied by something else. This translation will probably not be trivial and it will take lots of additional memory.
If you are doing this to avoid OOM, look at operating system support for memory mapped files and consider switching to 64-bit code.

If compression/decompression speed is important for you, you should take a look at LZO:
http://www.oberhumer.com/opensource/lzo/
Compared to zlib, the code is smaller and easier to use as well.
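If you go the LZO route, usage looks roughly like the sketch below. This is written from memory, so double-check it against the LZO documentation; the 4 MB buffer of 'A' bytes is just a stand-in for one of your memory pools.
#include <lzo/lzo1x.h>
#include <cstdio>
#include <vector>

int main() {
    if (lzo_init() != LZO_E_OK) return 1;    // must be called before any other LZO function

    std::vector<unsigned char> input(4 * 1024 * 1024, 'A');                          // stand-in memory pool
    std::vector<unsigned char> compressed(input.size() + input.size() / 16 + 64 + 3); // documented worst case
    std::vector<unsigned char> wrkmem(LZO1X_1_MEM_COMPRESS);                          // compressor scratch memory

    lzo_uint comp_len = 0;
    if (lzo1x_1_compress(input.data(), input.size(),
                         compressed.data(), &comp_len, wrkmem.data()) != LZO_E_OK)
        return 1;

    std::vector<unsigned char> restored(input.size());
    lzo_uint restored_len = restored.size();             // the _safe variant checks the output buffer size
    if (lzo1x_decompress_safe(compressed.data(), comp_len,
                              restored.data(), &restored_len, nullptr) != LZO_E_OK)
        return 1;

    std::printf("%lu -> %lu -> %lu bytes\n", (unsigned long)input.size(),
                (unsigned long)comp_len, (unsigned long)restored_len);
}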

I'm not aware of anything newer/better than zlib... zlib works fine, despite its age. zlib's deflateInit() has an argument that lets you trade off compression speed against compressed size, so you can experiment with that to find the setting that works best for your application.
There are probably C++ wrapper APIs that call the zlib C API for you, if you want something "prettier"... or if there aren't, it's easy enough to write your own.
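For reference, here's a minimal sketch of in-memory compression with zlib's one-shot utility functions (compress2/uncompress) rather than the streaming deflate interface mentioned above; the level argument plays the same speed-versus-size role, and the 8 MB pool of 'A' bytes is just a stand-in for one of the question's blocks.
#include <zlib.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<unsigned char> pool(8 * 1024 * 1024, 'A');   // stand-in for one memory pool

    uLongf comp_len = compressBound(pool.size());            // worst-case compressed size
    std::vector<unsigned char> comp(comp_len);

    // Z_BEST_SPEED (level 1) favours speed over ratio, matching the question's priorities.
    if (compress2(comp.data(), &comp_len, pool.data(), pool.size(), Z_BEST_SPEED) != Z_OK)
        return 1;
    comp.resize(comp_len);

    std::vector<unsigned char> restored(pool.size());
    uLongf restored_len = restored.size();
    if (uncompress(restored.data(), &restored_len, comp.data(), comp.size()) != Z_OK)
        return 1;

    std::printf("%lu -> %lu -> %lu bytes\n", (unsigned long)pool.size(),
                (unsigned long)comp_len, (unsigned long)restored_len);
}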

For compression, the data matters a lot. Compressing arbitrary binary data in memory is a complete waste of time, will slow your performance immensely, and probably will end up making your memory usage higher.
If you really need much more memory you should look at using VirtualAlloc or sbrk to control the memory yourself. This way you can address ALL physical memory, not just 2-4 GB.

Related

Not enough memory in C++: write to file instead, read data in when needed?

I'm developing a tool for wavelet image analysis and machine learning on Linux machines in C++.
It is limited by the size of the images, the number of scales and their corresponding filters (up to 2048x2048 doubles) for each of N orientations, as well as additional memory and processing overhead from a machine learning algorithm.
Unfortunately my skills in Linux system programming are shallow at best, so I'm currently using no swap, but I figure it should be possible somehow?
I'm required to keep the imaginary and real parts of the filtered images of each scale and orientation, as well as the corresponding wavelets, for reconstruction purposes. I keep them in memory for additional speed for small images.
Regarding memory use, I already:
store everything no more than once,
store only what is needed,
cut out any double entries or redundancy,
pass by reference only,
use pointers over temporary objects,
free memory as soon as it is not required any more, and
limit the number of calculations to the absolute minimum.
As with most data processing tools, speed is of the essence. As long as there is enough memory, the tool is about 3x as fast as the same implementation in Matlab code.
But as soon as I run out of memory, nothing works any more. Unfortunately most of the images I'm training the algorithm on are huge (raw data of 4096x4096 double entries, even larger after symmetric padding), so I hit the ceiling quite often.
Would it be bad practice to temporarily write data that is not needed for the current calculation / processing step from memory to the disk?
What approach / data format would be most suitable to do that?
I was thinking of using rapidXML to read and write an XML to a binary file and then read out only the required data. Would this work?
Is a memory-mapped file what I need? https://en.wikipedia.org/wiki/Memory-mapped_file
I'm aware that this will result in performance loss, but it is more important that the software runs smoothly and does not freeze.
I know that there are libraries out there that can do wavelet image analysis, so please spare me the "Why reinvent the wheel, just use XYZ instead". I'm using very specific wavelets, I'm required to do it myself and I'm not supposed to use external libraries.
Yes, writing data to the disk to save memory is bad practice.
There is usually no need to manually write your data to the disk to save memory, unless you are reaching the limits of what you can address (4 GB on 32-bit machines, much more on 64-bit machines).
The reason for this is that the OS is already doing exactly the same thing. It is very possible that your own solution would be slower than what the OS is doing. Read this Wikipedia article if you are not familiar with the concept of paging and virtual memory.
Did you look into using mmap and munmap to bring the images (and temporary results) into your address space and discard them when you no longer need them? mmap allows you to map the contents of a file directly into memory: no more fread/fwrite, just direct memory access. Writes to the memory region are written back to the file too, and bringing back that intermediate state later on is no harder than redoing the mmap. (A minimal sketch follows the list below.)
The big advantages are:
no encoding in a bloated format like XML
perfectly suitable for transient results such as matrices that are represented in contiguous memory regions.
Dead simple to implement.
Completely delegate to the OS the decision of when to swap in and out.
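For what it's worth, here is a minimal sketch of that idea; the scratch file name and the 2048x2048 size are placeholders.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t rows = 2048, cols = 2048;
    const std::size_t bytes = rows * cols * sizeof(double);

    int fd = open("scratch_matrix.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { std::perror("open"); return 1; }
    if (ftruncate(fd, bytes) != 0) { std::perror("ftruncate"); return 1; }   // size the backing file

    // Map the file; the OS pages data in and out as needed, no fread/fwrite.
    double* m = static_cast<double*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (m == MAP_FAILED) { std::perror("mmap"); return 1; }
    close(fd);                               // the mapping keeps the file usable

    m[0] = 3.14;                             // use it like ordinary memory
    munmap(m, bytes);                        // discard when the intermediate result isn't needed
    return 0;
}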
This doesn't solve your fundamental problem, but: Are you sure you need to be doing everything in double precision? You may not be able to use integer coefficient wavelets, but storing the image data itself in doubles is usually pretty wasteful. Also, 4k images aren't very big ... I'm assuming you are actually using frames of some sort so have redundant entries, otherwise your numbers don't seem to add up (and are you storing them sparsely?) ... or maybe you are just using a large number at once.
As for "should I write to disk"? This can help, particularly if you are getting a 4x increase (or more) by taking image data to double precision. You can answer it for yourself though, just measure the time to load and compare to your compute time to see if this is worth pursuing. The wavelet itself should be very cheap, so I'm guess you're mostly dominated by your learning algorithm. In that case, go ahead and throw out original data or whatever until you need it again.

C++ Memory Counting in OpenCV

I have an application written in OpenCV. It consists of two threads that each perform an OpenCV function. How can I determine how much memory each thread is generating?
I'm using libdispatch, the Grand Central Dispatch design pattern. It is at a stage where I can have multiple tasks running at once. How can I manage memory in such a situation? With some OpenCV processes and enough concurrent tasks, I can easily hit my RAM ceiling. How to manage this?
What strategies can be employed in C++?
If each thread had a memory limit, how could this be handled?
Regards,
Daniel
I'm not familiar with the dispatching library/pattern you're using, but I've had a quick glance over what it aims to do. I've done a fair amount of work in the image processing/video processing domain, so hopefully my answer isn't a completely useless wall-of-text ;)
My suspicion is that you're firing off whole image buffers to different threads to run the same processing on them. If this is the case, then you're quickly going to hit RAM limits. If a task (thread) uses N image buffers in its internal functions, and your RAM is M, then you may start running out of legs at M / N tasks (threads). If this is the case, then you may need to resort to firing off chunks of images to the threads instead (see the hints further down on using dependency graphs for processing).
You should also consider the possibility that performance in your particular algorithm is memory bound and not CPU bound. So it may be pointless firing off more threads even though you have extra cores, and perhaps in this case you're better off focusing on CPU SIMD things like SSE/MMX.
Profile first, Ask (Memory Allocator) Questions Later
Using hand-rolled memory allocators that cater for concurrent environments and your specific memory requirements can make a big difference to performance. However, they're unlikely to reduce the amount of memory you use unless you're working with many small objects, where you may be able to do a better job with memory layout when allocating and reclaiming them than the default malloc/free implementations. As you're working with image processing algorithms, the latter is unlikely. You've typically got huge image buffers allocated on the heap as opposed to many small-ish structs.
I'll add a few tips on where to begin reading on rolling your own allocators at the end of my answer, but in general my advice would be to first profile and figure out where the memory is being used. Having written the code you may have a good hunch about where it's going already, but if not tools like valgrind's massif (complicated beast) can be a big help.
After having profiled the code, figure out how you can reduce the memory use. There are many, many things you can do here, depending on what's using the memory. For example:
Free up any memory you don't need as soon as you're done with it. RAII can come in handy here.
Don't copy memory unless you need to.
Share memory between threads and processes where appropriate. It will make it more difficult than working with immutable/copied data, because you'll have to synchronise read/write access, but depending on your problem case it may make a big difference.
If you're using memory caches, and you don't want to cache the data to disk due to performance reasons, then consider using in-memory compression (e.g. zipping some of the cache) when it's falling to the bottom of your least-recently-used cache, for example.
Instead of loading a whole dataset and having each method operate on the whole of it, see if you can chunk it up and only operate on a subset of it. This is particularly relevant when dealing with large data sets.
See if you can get away with using less resolution or accuracy, e.g. quarter-size instead of full size images, or 32 bit floats instead of 64 bit floats (or even custom libraries for 16 bit floats), or perhaps using only one channel of image data at a time (just red, or just blue, or just green, or greyscale instead of RGB).
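To make the resolution/accuracy point above concrete, here is a rough sketch; the helper function and the dummy 4096x4096 input are made up, but cvtColor/resize/convertTo are standard OpenCV calls.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Hypothetical helper: one grey channel, half the resolution, 32-bit floats.
cv::Mat shrink_for_processing(const cv::Mat& bgr) {
    cv::Mat grey, small, smallF;
    cv::cvtColor(bgr, grey, cv::COLOR_BGR2GRAY);                      // 3 channels -> 1
    cv::resize(grey, small, cv::Size(), 0.5, 0.5, cv::INTER_AREA);    // quarter the pixel count
    small.convertTo(smallF, CV_32F, 1.0 / 255.0);                     // 32-bit float instead of 64-bit
    return smallF;
}

int main() {
    cv::Mat bgr(4096, 4096, CV_8UC3, cv::Scalar(0, 0, 0));            // dummy input frame
    cv::Mat reduced = shrink_for_processing(bgr);
    return reduced.empty() ? 1 : 0;
}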
As you're working with OpenCV, I'm guessing you're either working on image processing or video processing. These can easily gobble up masses of memory. In my experience, initial R&D implementations typically process a whole image buffer in one method before passing it over to the next. This often results in multiple full image buffers being used, which is hugely expensive in terms of memory consumption. Reducing the use of any temporary buffers can be a big win here.
Another approach to alleviate this is to see if you can figure out the data dependencies (e.g. by looking at the ROIs required for low-pass filters, for example), and then processing smaller chunks of the images and joining them up again later, and to avoid temporary duplicate buffers as much as possible. Reducing the memory footprint in this way can be a big win, as you're also typically reducing the chances of a cache miss. Such approaches often hugely complicate the implementation, and unless you have a graph-based framework in place that already supports it, it's probably not something you should attempt before exhausting other options. Intel have a number of great resources pertaining to optimisation of threaded image processing applications.
Tips on Memory Allocators
If you still think playing with memory allocators is going to be useful, here are some tips.
For example, on Linux, you could use:
malloc hooks, or
simply override malloc/free in your main compilation unit (main.cpp), in a library that you statically link, or in a shared library that you LD_PRELOAD.
There are several excellent malloc/free replacements available that you could study for ideas, e.g.
dlmalloc
tcmalloc
If you're dealing with specific C++ objects, then you can override their new and delete operators. See this link, for example.
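For illustration, a minimal class-specific override might look like the sketch below; ImageTile and the plain malloc/free routing are placeholders, and a real allocator would do something smarter (pooling, alignment, statistics).
#include <cstdlib>
#include <new>

struct ImageTile {
    unsigned char pixels[64 * 64];

    // Class-specific allocation; every new/delete of an ImageTile goes through here.
    static void* operator new(std::size_t size) {
        void* p = std::malloc(size);          // a real implementation might draw from a pool
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept {
        std::free(p);
    }
};

int main() {
    ImageTile* t = new ImageTile();           // routes through ImageTile::operator new
    delete t;                                 // routes through ImageTile::operator delete
}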
Lastly, if I did manage to guess wrong regarding where memory is being used, and you do, in fact, have loads of small objects, then search the web for 'small memory allocators'. Andrei Alexandrescu wrote a couple of great articles on this, e.g. here and here.

Fastest way of reading a file in Linux?

On Linux, what would be the fastest way of reading a file into an array of bytes/to process the bytes? This can include memory mapping, syscalls, etc. I am not familiar with the many Linux-specific functions.
In the past I have used boost memory mapping, but I need faster Linux-specific performance rather than portability.
mmap should be the fastest way to access the contents of a file if the file is large enough. There's an initial cost for setting up the memory mappings, but that's offset by not needing to copy the data from the page cache into userland. And if you want all the contents of the file, the cost to allocate the memory to your program should be more or less the same as the cost of mmap.
Your best bet, as always, is to test and benchmark.
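A rough sketch of the mmap approach described above, with minimal error handling and a hypothetical file name:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("input.bin", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    unsigned char* data = static_cast<unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }
    close(fd);                                   // the mapping stays valid

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) sum += data[i];   // touch every byte
    std::printf("checksum: %lu\n", sum);

    munmap(data, st.st_size);
    return 0;
}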
Don't let yourself get fooled by lazy stuff like memory mapping. Rather focus on what you really need. Do you really need to read the whole file into memory? Then the straight-forward way of opening, reading chunks in a loop, and closing the file will be as fast as it can be done.
But often you don't really want that. Instead you might want to read specific parts, a block here, a block there, jump through the file, read a block at a specific position, etc.
Even then, fseeking to those positions and freading the blocks won't have overhead worth mentioning. But it can be more convenient to use memory mapping and let the operating system or a library deal with things like memory allocation. It won't get the job done faster, though.
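For comparison, the straightforward read-in-chunks loop mentioned above might look like this; the file name and chunk size are arbitrary.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    int fd = open("input.bin", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    std::vector<unsigned char> bytes;
    unsigned char buf[1 << 16];                  // 64 KiB chunks
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)  // read until EOF (n == 0) or error (n < 0)
        bytes.insert(bytes.end(), buf, buf + n);
    close(fd);

    std::printf("read %zu bytes\n", bytes.size());
    return 0;
}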

Reserving Shared Memory with No File Backing (Linux/Windows) (boost::interprocess)

How can I reserve and allocate shared memory without the backing of a file? I'm trying to reserve a large (many tens of GiBs) chunk of shared memory and use it in multiple processes as a form of IPC. However, most of this chunk won't be touched at all (the access will be really sparse; maybe a few hundred megabytes throughout the life of the processes) and I don't care about the data when the applications end.
So preferably, the method to do this should have the following properties:
Doesn't commit the whole range. I will choose which parts to commit (actually use.) (But the pattern is quite unpredictable.)
Doesn't need a memory-mapped file or anything like that. I don't need to preserve the data.
Lets me access the memory area from multiple processes (I'll handle the locking explicitly.)
Works in both Linux and Windows (obviously a 64-bit OS is needed.)
Actually uses shared memory. I need the performance.
(NEW) The OS or the library doesn't try to initialize the reserved region (to zero or whatever.) This is obviously impractical and unnecessary.
I've been experimenting with boost::interprocess::shared_memory_object, but that causes a large file to be created on the filesystem (with the same size as my mapped memory region.) It does remove the file afterwards, but that hardly helps.
Any help/advice/pointers/reference is appreciated.
P.S. I do know how to do this on Windows using the native API. And POSIX seems to have the same functionality (only with a cleaner interface!) I'm looking for a cross-platform way here.
UPDATE: I did a bit of digging, and it turns out that the support that I thought existed in Windows was only a special case of memory-mapped files, using the system page file as a backing. (I'd never noticed it before because I had used at most a few megabytes of shared memory in the past projects.)
Also, I have a new requirement now (the number 6 above.)
On Windows, all memory has to be backed by the disk one way or another.
The closest I think you can accomplish on Windows would be to memory-map a sparse file. I think this will work in your case for two reasons:
On Windows, only the shared memory that you actually touch will become resident. This meets the first requirement.
Because the file is sparse, only the parts that have been modified will actually be stored on disk. If you looked at the properties of the file, it would say something like, "Size: 500 MB, Size on disk: 32 KB".
I realize that this technically doesn't meet your 2nd requirement, but, unfortunately, I don't think that is possible on Windows. At least with this approach, only the regions you actually use will take up space. (And only when Windows decides to commit the memory to disk, which it may or may not do at its discretion.)
As for turning this into a cross-platform solution, one option would be to modify boost::interprocess so that it creates sparse files on Windows. I believe boost::interprocess already meets your requirements on Linux, where POSIX shared memory is available.
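For reference, a sketch of the boost::interprocess usage on Linux; the segment name and the 16 GiB size are placeholders. On Linux the object lives in tmpfs (/dev/shm), so untouched pages cost essentially nothing.
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstdio>

int main() {
    using namespace boost::interprocess;

    shared_memory_object::remove("big_region");               // clean up any stale segment
    shared_memory_object shm(create_only, "big_region", read_write);
    shm.truncate(1ull << 34);                                  // ask for 16 GiB; pages are committed lazily

    mapped_region region(shm, read_write);
    char* base = static_cast<char*>(region.get_address());
    base[0] = 42;                                              // only this page becomes resident

    std::printf("mapped %zu bytes\n", region.get_size());
    shared_memory_object::remove("big_region");                // the data doesn't need to persist
    return 0;
}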

Understanding the efficiency of an std::string

I'm trying to learn a little bit more about C++ strings.
consider
const char* cstring = "hello";
std::string string(cstring);
and
std::string string("hello");
Am I correct in assuming that both store "hello" in the .data section of an application and the bytes are then copied to another area on the heap where the pointer managed by the std::string can access them?
How could I efficiently store a really really long string? I'm kind of thinking about an application that reads in data from a socket stream. I fear concatenating many times. I could imagine using a linked list and traverse this list.
Strings have intimidated me for far too long!
Any links, tips, explanations, further details, would be extremely helpful.
I have stored strings in the tens or hundreds of MB range without issue. Naturally, it will be primarily limited by your available (contiguous) memory / address space.
If you are going to be appending / concatenating, there are a few things that may help efficiency-wise: if possible, try to use the reserve() member function to pre-allocate space; even if you only have a rough idea of how big the final size might be, it saves unnecessary re-allocations as the string grows.
Additionally, many string implementations use "exponential growth", meaning that they grow by some percentage rather than a fixed byte size. For example, the capacity might simply double any time additional space is needed. Growing exponentially makes lots of concatenations much cheaper on average. (The exact details will depend on your STL implementation.)
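A tiny example of the reserve() point; the chunk contents and count are obviously artificial, standing in for data arriving from a socket.
#include <cstring>
#include <iostream>
#include <string>

int main() {
    const char* chunk = "hello ";                       // stand-in for one packet's payload
    const std::size_t chunks = 100000;

    std::string data;
    data.reserve(chunks * std::strlen(chunk));          // one up-front allocation instead of many

    for (std::size_t i = 0; i < chunks; ++i)
        data.append(chunk);                             // appends stay within the reserved capacity

    std::cout << data.size() << " bytes assembled\n";
    return 0;
}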
Finally, another option (if your library supports it) is to use the rope<> template: ropes are similar to strings, except that they are much more efficient when performing operations on very large strings. In particular, "ropes are allocated in small chunks, significantly reducing memory fragmentation problems introduced by large blocks". There are some additional details in SGI's STL guide.
Since you're reading the string from a socket, you can reuse the same packet buffers and chain them together to represent the huge string. This will avoid any needless copying and is probably the most efficient solution possible. I seem to remember that the ACE library provides such a mechanism. I'll try to find it.
EDIT: ACE has ACE_Message_Block that allows you to store large messages in a linked-list fashion. You almost need to read the C++ Network Programming books to make sense of this colossal library. The free tutorials on the ACE website really suck.
I bet Boost.Asio must be capable of doing the same thing as ACE's message blocks. Boost.Asio now seems to have a larger mindshare than ACE, so I suggest looking for a solution within Boost.Asio first. If anyone can enlighten us about a Boost.Asio solution, that would be great!
It's about time I try writing a simple client-server app using Boost.Asio to see what all the fuss is about.
I don't think efficiency should be the issue. Both will perform well enough.
The deciding factor here is encapsulation. std::string is a far better abstraction than char * could ever be. Encapsulating pointer arithmetic is a good thing.
A lot of people thought long and hard to come up with std::string. I think failing to use it for unfounded efficiency reasons is foolish. Stick to the better abstraction and encapsulation.
As you probably know, an std::string is really just another name for basic_string<char>.
That said, it is a sequence container, and its memory is allocated contiguously. It's possible to get an exception from an std::string if you try to make one bigger than the largest contiguous block you can allocate. This threshold is typically considerably less than the total available memory due to memory fragmentation.
I've seen problems allocating contiguous memory when trying to allocate, for instance, large contiguous 3D buffers for images. But in my experience those issues don't start happening until somewhere on the order of 100 MB or so, on Windows XP Pro (for instance).
Are your strings this big?