Not enough memory in C++: write to file instead, read data in when needed?

I'm developing a tool for wavelet image analysis and machine learning on Linux machines in C++.
It is limited by the size of the images and the number of scales with their corresponding filters (up to 2048x2048 doubles for each of N orientations), as well as by the additional memory and processing overhead of a machine learning algorithm.
Unfortunately my Linux system programming skills are shallow at best,
so I'm currently not using any swap, but I figure it should be possible somehow?
I'm required to keep the real and imaginary parts of the
filtered images at each scale and orientation, as well as the corresponding wavelets, for reconstruction purposes. For small images I keep everything in memory for extra speed.
Regarding the memory use: I already
store everything no more than once,
only what is needed,
cut out any double entries or redundancy,
pass by reference only,
use pointers over temporary objects,
free memory as soon as it is not required any more and
limit the number of calculations to the absolute minimum.
As with most data processing tools, speed is of the essence. As long as there
is enough memory, the tool is about 3x faster than the same implementation in Matlab.
But as soon as I run out of memory, nothing works any more. Unfortunately most of the images I'm training the algorithm on are huge (raw data of 4096x4096 double entries, even larger after symmetric padding), so I hit the ceiling quite often.
Would it be bad practice to temporarily write data that is not needed for the current calculation / processing step from memory to disk?
What approach / data format would be most suitable to do that?
I was thinking of using RapidXML to write the data out to an XML or binary file and then read back only the required data. Would this work?
Is a memory-mapped file what I need? https://en.wikipedia.org/wiki/Memory-mapped_file
I'm aware that this will result in performance loss, but it is more important that the software runs smoothly and does not freeze.
I know that there are libraries out there that can do wavelet image analysis, so please spare the "Why reinvent the wheel, just use XYZ instead". I'm using very specific wavelets, I'm required to do it myself and I'm not supposed to use external libraries.

Yes, writing data to the disk to save memory is bad practice.
There is usually no need to manually write your data to disk to save memory, unless you are reaching the limits of what you can address (4 GB on 32-bit machines, much more on 64-bit machines).
The reason for this is that the OS is already doing exactly the same thing. It is very possible that your own solution would be slower than what the OS is doing. Read this Wikipedia article if you are not familiar with the concept of paging and virtual memory.

Did you look into using mmap and munmap to bring the images (and intermediate results) into your address space and discard them when you no longer need them? mmap allows you to map the contents of a file directly into memory: no more fread/fwrite, just direct memory access. Writes to the memory region are written back to the file as well, and bringing that intermediate state back later is no harder than redoing the mmap (a short sketch follows the list of advantages below).
The big advantages are:
No encoding in a bloated format like XML.
Perfectly suitable for transient results such as matrices that are represented in contiguous memory regions.
Dead simple to implement.
Completely delegates to the OS the decision of when to swap in and out.
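For illustration, a minimal sketch of that approach on Linux (POSIX calls only; the file name and the 2048x2048 size are made-up placeholders, not something from your code):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        const std::size_t rows = 2048, cols = 2048;
        const std::size_t bytes = rows * cols * sizeof(double);

        int fd = open("filtered_scale0_real.bin", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { std::perror("open"); return 1; }
        if (ftruncate(fd, bytes) != 0) { std::perror("ftruncate"); return 1; } // size the backing file

        // MAP_SHARED: writes to the region are carried back to the file by the kernel.
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
        close(fd); // the mapping stays valid after the descriptor is closed

        double* m = static_cast<double*>(p);
        m[0] = 3.14;               // plain memory access, no fread/fwrite
        m[rows * cols - 1] = 2.71;

        munmap(p, bytes);          // drop the mapping once this scale is done
        return 0;
    }

Re-mapping the same file later brings the intermediate state straight back, and the kernel decides which pages stay resident.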

This doesn't solve your fundamental problem, but: are you sure you need to do everything in double precision? You may not be able to use integer-coefficient wavelets, but storing the image data itself as doubles is usually pretty wasteful. Also, 4k images aren't very big; I'm assuming you are actually using frames of some sort and so have redundant entries, otherwise your numbers don't seem to add up (and are you storing them sparsely?), or maybe you are just using a large number of them at once.
As for "should I write to disk"? This can help, particularly if you are getting a 4x increase (or more) by taking image data to double precision. You can answer it for yourself though, just measure the time to load and compare to your compute time to see if this is worth pursuing. The wavelet itself should be very cheap, so I'm guess you're mostly dominated by your learning algorithm. In that case, go ahead and throw out original data or whatever until you need it again.

Related

C++: Is it more efficient to store data or continually read it

Ok so I'm working on a game project. Just finished rebuilding a game engine I designed some time ago. I'm looking at making a proprietary file type to store data rather than using a database like sqlite.
Looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time.
My question is: Is it more efficient overall to load the data from the file and store it in a data manager class to be reused? Or is it more efficient overall to continually pull from the file?
Assuming the file follows some form of consistent structure for its data. And we're looking at the largest "table" being something like 30 columns with roughly 1000 rows of data.
Here's a handy chart of "Latency Numbers Every Computer Programmer Should Know"
The far right hand side of the chart (red) has the time it takes to read 1 MB from disk. The green column has the same value read from RAM.
What this shows us is that you should do almost anything to avoid having to directly interact with the disk. Keeping data in RAM is good. Keeping data on disk is bad. (Memory mapped files might provide a way to handle this.)
This aside, reinventing the wheel is almost always the wrong solution. Sqlite works and works well. If it's not ideally suited for your needs, there are other file types out there.
If you're "looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time", you'll find that's easiest to do if you reuse preexisting solutions to common problems.
Continually reading from a file is generally not a good idea; modern operating systems do keep large I/O caches (so if you keep reading the same data it won't really hit the disk), but system calls are of course far more costly than straight memory accesses. Whether this is actually a performance problem in your specific case is impossible to judge from the information you provided. On the other hand, if you have a lot of data to access, keeping it all in memory can be wasteful, slow to load and, under memory pressure, lead to paging.
The easy way out of this conundrum is to map the file in memory; the data is automatically fetched from disk when required and, unless the system is under memory pressure, frequently accessed pages remain cached in RAM, guaranteeing you fast access.
Of course this is feasible only if the data you need to map is smaller than the address space, but given the example you provided (30 columns/1000 rows, which is really small) it shouldn't be a problem at all.
If you can hold the data in RAM, then it is more efficient. This is because it is quicker for your computer to access values in RAM, a cache or the CPU's registers than it is to get them from the hard drive. Reading from the hard drive requires a lot of time from the operating system's drivers; therefore holding the data in RAM is more efficient.
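To make the "load once, reuse" option concrete, here is a minimal sketch of a data manager that reads the whole table into RAM at startup and serves lookups from memory (the fixed-width Record layout and the file format are assumptions, not something from your engine):

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <stdexcept>
    #include <vector>

    struct Record {
        std::int32_t fields[30];   // hypothetical fixed-width row of 30 columns
    };

    class DataManager {
    public:
        explicit DataManager(const char* path) {
            std::ifstream in(path, std::ios::binary);
            if (!in) throw std::runtime_error("cannot open data file");
            Record r{};
            while (in.read(reinterpret_cast<char*>(&r), sizeof r))
                rows_.push_back(r);               // one pass over the disk, then RAM only
        }
        const Record& row(std::size_t i) const { return rows_.at(i); }
        std::size_t size() const { return rows_.size(); }
    private:
        std::vector<Record> rows_;
    };

At 30 columns by 1000 rows this is on the order of a hundred kilobytes, so keeping it resident costs essentially nothing.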

Optimising data-structures so that they take advantage of virtual memory

I would like to know how to optimise data structures in OpenCV (the Mat type specifically) so that I am able to leverage the operating system's built-in memory/virtual-memory management.
For full context please read the Q and A here - but otherwise the situation could be summed up as: I have a large collection of mats* that I'll need to access arbitrarily and rapidly. The main complication is that the full amount of data is well above the amount of RAM available.
(*Conceptually the data is a recursively defined 3D array of 3D arrays, but let's not muddy the water with that confusion!)
Rather than build my own LRU cache and RAM-hungry and inefficient 'page' addressing strategies to access it, I'd rather let the OS do this for me.
I think I get the concepts, but when it comes to the actual implementation I'm twiddling my thumbs:
Is this a generic C++ consideration, or something I need to address at the openCV level?
Is it as simple as making the granularity of the data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)
Is this a generic C++ consideration, or something I need to address at the openCV level?
You just allocate and use boatloads of memory. The whole point of paging / virtual memory is that it's completely transparent. Everything gets extremely slow, but keeps working. You don't get ENOMEM until you're out of swap space + RAM.
On a normal Linux system, your swap partition is usually quite small (under 1 GB), so you'll probably need to dd a swap file and run mkswap / swapon on it. Make sure the swap file has read/write permission for root only. Obviously every major OS will have its own procedures.
Is it as simple as making the granularity of the data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
If you have pointers to other data, make sure you keep them together. You want all the small "hot" data to be in only a few pages that a decent OS LRU algorithm won't page out.
If you have hot data mixed with cold data, it will easily get paged out and lead to an extra page-file round trip before the cache miss for the final data can even happen.
Like Yakk says, sequential access patterns will do much better, because disk I/O does better with multi-block reads. (Even SSDs have better throughput with larger blocks). This also allows prefetching, which allows one I/O request to start before the previous one's data arrives. Maxing out I/O throughput requires pipelining requests.
Try to design your algorithms to do sequential accesses when possible. This is advantageous at all levels of memory, from paging all the way up to L1 cache. Sequential access even enables auto-vectorization with vector-registers.
Cache blocking (aka loop tiling) techniques are also applicable to page misses. Google for details, but the main idea is to do all the steps of your algorithm over a subset of the data, instead of touching all the data at each step. Then each piece of data only has to be loaded into cache once total, instead of once for each step of your algorithm.
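A minimal loop-tiling sketch of that idea (two placeholder "steps" over an n x n matrix, fused per tile so each tile only has to be paged in once rather than once per step):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void processTiled(std::vector<double>& m, std::size_t n, std::size_t tile) {
        for (std::size_t i0 = 0; i0 < n; i0 += tile) {
            for (std::size_t j0 = 0; j0 < n; j0 += tile) {
                // Step 1 on this tile only.
                for (std::size_t i = i0; i < std::min(i0 + tile, n); ++i)
                    for (std::size_t j = j0; j < std::min(j0 + tile, n); ++j)
                        m[i * n + j] *= 2.0;      // placeholder for the real step 1
                // Step 2 on the same tile, while it is still resident.
                for (std::size_t i = i0; i < std::min(i0 + tile, n); ++i)
                    for (std::size_t j = j0; j < std::min(j0 + tile, n); ++j)
                        m[i * n + j] += 1.0;      // placeholder for the real step 2
            }
        }
    }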
Think of DRAM as a cache for your giant virtual address space.
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)
Swap space / the pagefile is the backing store for your process's address space. So yes, it's very similar to what you'd get if you allocated memory by mmaping a big file instead of making an anonymous allocation.

C++: how to optimize IO?

I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which allocates from a block attached to the end of the struct. It's not simple, but it will be fast: one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.
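A minimal sketch of that convention for a std::map<int,int> (POSIX read/write, a size header plus two flat arrays; short reads/writes and error handling are ignored for brevity):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdint>
    #include <map>
    #include <vector>

    void dumpMap(const std::map<int, int>& m, const char* path) {
        std::vector<std::int32_t> keys, values;
        keys.reserve(m.size()); values.reserve(m.size());
        for (const auto& kv : m) { keys.push_back(kv.first); values.push_back(kv.second); }

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        std::uint64_t n = keys.size();
        write(fd, &n, sizeof n);                               // size header
        write(fd, keys.data(),   n * sizeof(std::int32_t));    // all keys in one call
        write(fd, values.data(), n * sizeof(std::int32_t));    // all values in one call
        close(fd);
    }

    std::map<int, int> loadMap(const char* path) {
        int fd = open(path, O_RDONLY);
        std::uint64_t n = 0;
        read(fd, &n, sizeof n);
        std::vector<std::int32_t> keys(n), values(n);
        read(fd, keys.data(),   n * sizeof(std::int32_t));
        read(fd, values.data(), n * sizeof(std::int32_t));
        close(fd);

        std::map<int, int> m;
        for (std::uint64_t i = 0; i < n; ++i) m[keys[i]] = values[i];
        return m;
    }

The whole file is read back with three read() calls; rebuilding the map afterwards is pure in-memory work.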
Some guidelines:
multiple calls to read() are more expensive than a single call
binary files are faster than text files
single file is faster than multiple files for large values of "multiple"
use memory-mapped files if you can
use a 64-bit OS to let the OS manage the memory for you
Ideally, I'd try to put all long doubles into memory-mapped file, and all maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks in such a way that all the chunks are never used together, and the entire chunk is needed when it is needed. This way you can load a chunk when it is needed and discard it when it is not.
These suggestions about loading the whole data set into RAM are good when two conditions are met:
The sum of all the I/O time during a run is much greater than the cost of loading all the data into RAM.
A relatively large portion of all the data is accessed during the application run.
(These are usually met when an application runs for a long time processing different data.)
However for other cases other options might be considered.
E.g. it is essential to understand whether the access pattern is truly random. If not, look into reordering the data so that items that are accessed together are close to each other. This ensures that OS caching performs at its best, and also reduces HDD seek times (not a concern for SSDs, of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data loading cost, I would look into the architecture, e.g. by extracting this data manager into a separate module that keeps the data preloaded.
For Windows that might be a system service; for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping the caches between memory and disk the way it is often done when virtual memory pages need to be swapped to disk. It is essentially the same problem.
One common method for deciding which page to swap out is the Least Recently Used (LRU) algorithm.
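A minimal LRU cache sketch along those lines (a std::list holds usage order, an unordered_map gives O(1) lookup; loadChunk is a hypothetical stand-in for whatever actually reads a chunk from disk):

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    class LruCache {
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        const std::vector<char>& get(int chunkId) {
            auto it = index_.find(chunkId);
            if (it != index_.end()) {                         // hit: move to front
                order_.splice(order_.begin(), order_, it->second);
                return it->second->second;
            }
            if (order_.size() == capacity_) {                 // full: evict least recently used
                index_.erase(order_.back().first);
                order_.pop_back();
            }
            order_.emplace_front(chunkId, loadChunk(chunkId));
            index_[chunkId] = order_.begin();
            return order_.front().second;
        }

    private:
        static std::vector<char> loadChunk(int /*chunkId*/) {
            return std::vector<char>(1 << 20);                // placeholder: read from disk here
        }

        std::size_t capacity_;
        std::list<std::pair<int, std::vector<char>>> order_;  // front = most recently used
        std::unordered_map<int, std::list<std::pair<int, std::vector<char>>>::iterator> index_;
    };

The capacity here is counted in chunks; a byte budget works the same way with a running total.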
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory-mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to use only POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS in better caching those values (ie, keeping them in memory for you, rather than always going to the disk to read them).
Another option is to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then, you can only access the set of files required.
Finally, for complex data structures (where memory mapped files fail) or for sparse reading (when you only ever extract only a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea is to use serialization and compression. You write several files, including an index, and compress all of them (zip). Then, at launch time, you start by loading the index and keep it in memory.
Whenever you need a value, you first try your cache; if it is not there, you access the file that contains it, decompress it in memory and dump its contents into your cache. Note: if the cache is too small, you have to be picky about what you keep in it... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round trips, and because the files are zipped there will be less IO.
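A minimal sketch of the index part (an in-memory map from the first key covered by each chunk to its file; loadFile and the chunk file names are hypothetical, and decompression is left out):

    #include <fstream>
    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>

    std::vector<char> loadFile(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
    }

    int main() {
        // Index loaded at launch: first key covered by each chunk -> chunk file.
        std::map<int, std::string> index = {
            {0,      "chunk_000.bin"},
            {100000, "chunk_001.bin"},
            {200000, "chunk_002.bin"},
        };

        int wanted = 123456;                    // assumed to be >= the first key
        auto it = index.upper_bound(wanted);    // first chunk starting after 'wanted'...
        --it;                                   // ...so the previous one contains it
        std::vector<char> chunk = loadFile(it->second);  // decompress and cache here
        (void)chunk;
        return 0;
    }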
Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous it won't have to seek around the disk to gather all of them.
Reading and writing in batches, instead of record by record will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::maps are also pre-calculated and there are no insert/remove operations, only searches. How about replacing the maps with something like std::unordered_map or sparsehash? In theory that can give a performance gain.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as berkeley db: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot and keeping other parts on disk.

Fast Reading and Writing of Data to/from a file

I have an application which, during initialization, creates a graph; I perform an all-pairs shortest-path computation on that graph and use the results later.
As the graph is quite big, this takes quite a lot of time, around 10-12 minutes, and the graph I create is the same every time, so I can calculate the matrix once, dump it and reuse it later on.
However, this only makes sense if the time taken to read the array back into memory is less, and the array can have as many as 35M elements (1 byte each, 35 MB).
Is there some fast way of dumping/reading data so that this is achievable.
Thanks
The number of options available depends on the operating system. In virtual memory systems, there is usually a way to map a portion of memory space to a file and have it automatically transfer pages back and forth as required.
In most operating systems with file systems, increasing the file buffer can dramatically improve file reading and writing performance. By default, the C++ and C runtime libraries use a buffer of around 512 or 1024 bytes. Increase the buffer to somewhere in the neighborhood of 1 to 40 MB for your application.
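A minimal sketch of enlarging the stdio buffer with setvbuf (the 8 MB size and the file name are arbitrary choices, and setvbuf must be called before any I/O on the stream):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::FILE* f = std::fopen("distances.bin", "rb");
        if (!f) return 1;

        std::vector<char> buf(8 * 1024 * 1024);           // much larger than the default BUFSIZ
        std::setvbuf(f, buf.data(), _IOFBF, buf.size());  // fully buffered with our own buffer

        unsigned char byte;
        std::size_t total = 0;
        while (std::fread(&byte, 1, 1, f) == 1) ++total;  // small reads now come from the big buffer
        std::printf("read %zu bytes\n", total);

        std::fclose(f);                                   // buffer must stay alive until here
        return 0;
    }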
Another means of improving performance is to rethink the data structure. Maybe it can be made smaller and/or have better locality of reference. Items closer to each other are more likely to already be buffered or cached.
Is it actually necessary to write a file at all?
At some point you'll run into the upper speed limit of your hard drive.
The simplest optimization you can make is to improve the hardware you are reading from. One option is to buy a solid-state drive. Alternatively, you can create a RAM disk from which to read your data. Either of these should improve speed significantly without too much effort, independent of programming language.
Yes, memory-map the file. You can use boost::mapped_file for portability.
If you know the computer you're running this on won't change, or that you don't need it to be portable, you could try doing a depth-first traversal and writing each node to a binary file.
fwrite( currNode, sizeof(Node), 1, out);
Reading would be the opposite:
Node theNode; fread(&theNode, sizeof(Node), 1, in);
You could look into using boost serialization for a more automated solution. I've never used it, just mention it in passing
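For the 35M single-byte entries specifically, the simplest version is one big binary dump and one big read back; a minimal sketch (assuming the matrix lives in a contiguous std::vector<std::uint8_t>):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    bool dumpMatrix(const std::vector<std::uint8_t>& m, const char* path) {
        std::FILE* f = std::fopen(path, "wb");
        if (!f) return false;
        bool ok = std::fwrite(m.data(), 1, m.size(), f) == m.size();  // one large write
        std::fclose(f);
        return ok;
    }

    bool loadMatrix(std::vector<std::uint8_t>& m, const char* path, std::size_t count) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;
        m.resize(count);
        bool ok = std::fread(m.data(), 1, count, f) == count;         // one large read
        std::fclose(f);
        return ok;
    }

A single 35 MB sequential read should take a fraction of a second on most hardware, far less than recomputing the matrix.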
Since the graph is always the same, you could hard-code it into your program.
The most ambitious solution is to rewrite your graph using template metaprogramming techniques. This allows you to compute the map at compile time. It will put a huge burden on your compiler, but it avoids having to build the graph in memory at runtime.

Memory mapped files performance - memory management when working with large data sets

I have a situation where I need to work with a number (15-30) of large (several hundred MB) data structures. They won't fit into memory all at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other, etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data when it's loaded into memory, and use memory-mapped files to access the data. I use mmap 'views' of, for example, 50 MB (50 MB of the files are loaded into memory at a time), so when I have 15 data sets my process uses 750 MB of memory for the data. That was OK initially (for testing); when I have more data I adjust the 50 MB down at the cost of some speed.
However this heuristic is hard-coded for now (I know the size of the data sets I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 MB' and then divide 500 by the number of data structures to arrive at a mmap view size. I have found that when I set this 'target memory usage' too high, virtual-memory disk thrashing will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2 GB that I have (assuming 32-bit Windows XP and up, without /3GB for now), or try to keep my process size smaller so that my software won't hog the machine? When I have two Visual Studios, Outlook and a Firefox open on my machine, those easily use 1/2 GB of virtual memory by themselves - if I let my software use 2 GB of virtual memory, the swapping will severely slow down the machine. But then how do I determine the 'best' process size?
What can I do to keep performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data real quick, causing the whole memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, again and again (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
Thanks.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If this is below your "too darn slow" threshold, bump the size up a little bit for the next file. Do this iteratively. When you go above the threshold, slowly back the size off, again iteratively.
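A minimal sketch of that feedback loop (processNextChunk, the threshold and the growth factors are all made-up placeholders):

    #include <chrono>
    #include <cstddef>

    void processNextChunk(std::size_t /*viewBytes*/) { /* map a view of this size, crunch it, unmap */ }

    int main() {
        using namespace std::chrono;
        const milliseconds tooSlow(2000);                 // arbitrary "too darn slow" threshold
        std::size_t view = 16u * 1024 * 1024;             // start conservatively at 16 MB

        for (int i = 0; i < 100; ++i) {                   // however many chunks there are
            auto t0 = steady_clock::now();
            processNextChunk(view);
            auto elapsed = duration_cast<milliseconds>(steady_clock::now() - t0);

            if (elapsed < tooSlow)
                view += view / 4;                         // below threshold: grow the view
            else if (view > 4u * 1024 * 1024)
                view -= view / 4;                         // above threshold: back off slowly
        }
        return 0;
    }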
I think it's a good place to try Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It will allow you to use more than 4 GB of memory by providing a sliding window. The drawback is that not all versions of Windows have it.
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (ie, do one large read rather than a lot of small reads).
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
It will probably be best to fix the size of the memory-mapped file to be some percentage of the total system memory, probably with a set minimum.
Remember that the operating system will effectively load a whole memory page when you access a single byte; this may well happen in the background, but it will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies to access your data speculatively before it is actually required. These are the same considerations you will need when optimizing for memory-cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data since this will give you better fine-grain control of what data is written to memory when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
Perhaps you can use /LARGEADDRESSAWARE with the Visual Studio linker, and use bcdedit, to let your process use more than 2 GB of memory.