Alternative implementation to avoid shifting between RAM and swap memory - C++

I have a program that uses dynamic programming to calculate some information. The problem is that, in theory, the memory it uses grows exponentially. The filters I apply limit this growth, but for a big input they still can't prevent the program from running out of RAM.
The program runs on 4 threads. When I run it with a really big input, I noticed that at some point it starts to use swap memory because my RAM is not big enough. The consequence is that my CPU usage drops from about 380% to 15% or lower.
There is only one variable that uses this memory, the following data structure:
Edit (added the type), which uses the CLN library:
class My_Map {
    typedef std::pair<double, short> key;
    typedef cln::cl_I value;
public:
    tbb::concurrent_hash_map<key, value>* map;
    My_Map() { map = new tbb::concurrent_hash_map<key, value>(); }
    ~My_Map() { delete map; }
    // some functions for operations on the map
};
In my main program I use this data structure as a global variable:
My_Map* container = new My_Map();
Question:
Is there a way to avoid this shifting of memory between swap and RAM? I thought pushing all the memory onto the heap would help, but it seems not to. So I don't know whether it is possible to fully use the swap memory, or something else. This shifting of memory just costs a lot of time, and the CPU usage decreases dramatically.

If you have 1 GB of RAM and a program that uses 2 GB, then you're going to have to find somewhere else to store the excess data, obviously. The default OS way is to swap, but the alternative is to manage your own 'swapping' by using a memory-mapped file.
You open a file and allocate a virtual memory block in it, then bring pages of the file into RAM to work on. The OS manages this for you for the most part, but you should think about your memory usage and try to keep accesses to the same blocks together while they're in memory, if you can.
On Windows you use CreateFileMapping(), on Linux you use mmap(), on Mac you use mmap().
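For the Linux/Mac side, a minimal sketch of backing a big array with a file via mmap() might look like this (the path and element count are placeholders, not part of the question):
// Back a large array with a file instead of anonymous memory, so the OS pages
// to this file rather than to swap.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

double* map_backing_file(const char* path, std::size_t count) {
    const std::size_t bytes = count * sizeof(double);
    int fd = ::open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    if (::ftruncate(fd, static_cast<off_t>(bytes)) != 0) { ::close(fd); return nullptr; }
    void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    ::close(fd);                               // the mapping keeps the file referenced
    return p == MAP_FAILED ? nullptr : static_cast<double*>(p);
}
// Use the returned pointer like a normal array; release it with
// ::munmap(ptr, count * sizeof(double)) when finished.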

The OS is working properly - it doesn't distinguish between stack and heap when swapping - it pages out whatever you seem not to be using and loads whatever you ask for.
There are a few things you could try:
consider whether myType can be made smaller - e.g. using int8_t or even width-appropriate bitfields instead of int, using pointers to pooled strings instead of worst-case-length character arrays, using offsets into arrays where they're smaller than pointers, etc. If you show us the type maybe we can suggest things.
think about your paging - if you have many objects on one memory page (likely 4k) they will need to stay in memory if any one of them is being used, so try to get objects that will be used around the same time onto the same memory page - this may involve hashing to small arrays of related myType objects, or even moving all your data into a packed array if possible (binary searching can be pretty quick anyway). Naively used hash tables tend to flay memory because similar objects are put in completely unrelated buckets.
serialisation/deserialisation with compression is a possibility: instead of letting the OS swap out full myType memory, you may be able to proactively serialise them into a more compact form, then deserialise them only when needed
consider whether you need to process all the data simultaneously... if you can batch up the work in such a way that you get all "group A" out of the way using less memory then you can move on to "group B"
UPDATE: now you've posted your actual data types...
Sadly, using short might not help much because sizeof key needs to be 16 anyway for alignment of the double; if you don't need the precision, you could consider float? Another option would be to create an array of separate maps...
tbb::concurrent_hash_map<double,value> map[65536];
You can then index with map[my_short] and look up my_double in that map (via an accessor). It could be better or worse, but it is easy to try, so you might as well benchmark it....
For cl_I a 2-minute dig suggests the data's stored in a union - presumably word is used for small values and one of the pointers when necessary... that looks like a pretty good design - hard to improve on.
If numbers tend to repeat a lot (a big if) you could experiment with e.g. keeping a registry of big cl_Is with a bi-directional mapping to packed integer ids, which you'd store in My_Map::map - fussy though. To explain: say you get 987123498723489 - you push_back it on a vector<cl_I>, then in a hash_map<cl_I, int> map 987123498723489 to that index (i.e. vector.size() - 1). Keep going as new numbers are encountered. You can always map from an int id back to a cl_I using direct indexing in the vector, and the other way is an O(1) amortised hash table lookup.
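A rough sketch of that interning idea, using std::map instead of a hash map only to avoid having to define a hash for cl_I (the class and method names are invented for illustration):
#include <cln/integer.h>
#include <cstdint>
#include <map>
#include <vector>

class BigIntRegistry {
public:
    std::uint32_t intern(const cln::cl_I& n) {
        auto it = ids_.find(n);
        if (it != ids_.end()) return it->second;          // already registered
        values_.push_back(n);
        std::uint32_t id = static_cast<std::uint32_t>(values_.size() - 1);
        ids_.emplace(n, id);
        return id;
    }
    const cln::cl_I& lookup(std::uint32_t id) const { return values_[id]; }  // O(1) direct indexing
private:
    std::vector<cln::cl_I> values_;                       // id -> value
    std::map<cln::cl_I, std::uint32_t> ids_;              // value -> id
};
You would then store the small integer ids in My_Map::map instead of the cl_I values themselves.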

Related

Fastest way to "resurrect" (serialize and deserialize) an std::map

As part of my test code I need to build a complex structure that uses, among other things, 2 std::map instances; both of them have around 1 million elements. In optimized builds it's OK; however, in debug un-optimized builds it takes almost a minute. I use the same data to build that map; basically, if I could save a chunk of RAM and restore it in 20 ms, then I'd effectively get the same map in my app without waiting a minute every time. What can I do to speed it up? I could try to use a custom allocator and save/restore its allocated storage, or is there perhaps a way to construct a std::map from data that's already sorted, so that it would be linear in time?
The technical difficulty is that for std::map in debug mode, the Visual Studio compiler inserts correctness checks, and in some revisions has inserted extra elements into the structure to make checking easier.
There are 2 possible solutions :-
Abstraction
If the information provided by the std::map is replaceable by an interface class, then the internals of the std::map can be hidden and moved into a separate compilation unit. This can be compiled outside of the debug environment and performance restored.
alternative data structure
For a piece of information which is broadly static (e.g. a static piece of data you need to retrieve quickly), std::map is not the fastest way to achieve this; a sorted std::vector of std::pair<key,value> would perform better in operation.
The advantage of std::vector is that there are guarantees about its layout. If the data is plain old data, then it can be loaded with a std::vector::resize and a memcpy. Otherwise, filling the elements in the std::vector would still avoid the significant time spent by Visual Studio tracking the memory and structure of the std::map for issues.
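A minimal sketch of the sorted-vector alternative, with placeholder Key/Value types (not the asker's actual data):
#include <algorithm>
#include <utility>
#include <vector>

using Key = int;
using Value = double;
using FlatMap = std::vector<std::pair<Key, Value>>;       // kept sorted by Key

const Value* find_value(const FlatMap& m, const Key& k) {
    auto it = std::lower_bound(m.begin(), m.end(), k,
        [](const std::pair<Key, Value>& e, const Key& key) { return e.first < key; });
    return (it != m.end() && it->first == k) ? &it->second : nullptr;
}
// Building: if the input is already sorted by key the fill is linear;
// otherwise call std::sort once before using find_value().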
Eventually, after trying different approaches, I ended up using a custom allocator.
std::map was one of many containers used by my struct to hold the data. The total size of allocated RAM was actually around 400 MB; the struct contained lists, maps, and vectors of different data, and many of the members of these containers were pointers into other containers. So I took a radical approach and made it extremely fast to 'resurrect' my entire structure with all the maps and internal pointers. While my post was originally about making it 'fast' in debug builds, after modifying the code and adding extra complexity this became equally applicable to release builds as well: construction time grew to around 10 seconds in a release build.
So, at first I modified all structure members to use my custom allocator; this way I saw how much RAM was actually allocated:
total allocated: 1970339320 bytes, total freed: 1437565512 bytes
This way I could estimate that I'd need around 600 MB in total. Then, in my custom allocator, I added a static method my_aloc::start_recording(); this method would allocate that 600 MB chunk of RAM, and after start_recording was called my custom allocator would simply return addresses from that 600 MB block. After calling start_recording I'd make a copy of my entire structure (it was actually a vector of structures). When making a copy there is no over-allocation: each structure member allocates only enough RAM for its storage. Basically, by copying the structs it actually allocated only around 400 MB instead of 600 MB.
I said that internally my structure has lots of pointers to internal members, so how do I reuse this 400 MB recorded "snapshot" from my custom allocator? I could write code to "patch" the pointers, but perhaps it wouldn't even work: I had lots of maps that also use pointers as keys with a custom compare struct that dereferences the pointers to compare the actual pointed-to values. Also, some maps contained iterators into lists; it would be quite messy to deal with all that. And my overall structure isn't set in stone, it's work in progress, so if something gets changed the patching code would also need to be changed. So, the answer was quite obvious: I simply needed to load that entire 400 MB snapshot at the same base address. On Windows I use VirtualAlloc; on Linux something like mmap would perhaps need to be used, or alternatively the Boost shared-memory library could be used to make it more portable. In the end the overall load time went down to 150 ms, while in release it takes more than 10 seconds and in debug builds it's probably somewhere in minutes by now.
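For illustration only, a stripped-down sketch of the "recording" arena idea described above; the names and sizes are invented, and the real code would reserve the block at a fixed base address with VirtualAlloc or mmap rather than malloc so the snapshot can be reloaded at the same address:
#include <cstddef>
#include <cstdlib>
#include <new>

class RecordingArena {
public:
    void start_recording(std::size_t capacity) {
        base_ = static_cast<char*>(std::malloc(capacity));  // real code: VirtualAlloc/mmap at a fixed address
        if (!base_) throw std::bad_alloc();
        capacity_ = capacity;
        used_ = 0;
    }
    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // bump with alignment
        if (p + n > capacity_) throw std::bad_alloc();
        used_ = p + n;
        return base_ + p;
    }
    void deallocate(void*, std::size_t) {}   // snapshot allocations are never freed individually
    const char* data() const { return base_; }
    std::size_t size() const { return used_; }
private:
    char* base_ = nullptr;
    std::size_t capacity_ = 0, used_ = 0;
};
Everything allocated after start_recording lives in one contiguous block, which is what makes writing it to disk and mapping it back in one piece possible.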

Memory usage in C++ program

I wrote a program that needs to handle a very large data with the following libraries:
vector
boost::unordered_map
boost::unordered_multimap
So, I'm having memory problems (the program uses a LOT) and I was thinking maybe I can replace these libraries (with something that already exists or my own implementations):
So, three questions:
How much memory would I save if I replaced vector with a C array? Is it worth it?
Can someone explain how the memory is used in boost::unordered_map and boost::unordered_multimap in the current implementation? Like what's stored in order to achieve their performance.
Can you recommend some libraries that outperform boost::unordered_map and boost::unordered_multimap in memory usage (but not something too slow)?
std::vector is memory efficient. I don't know about the boost maps, but the Boost people usually know what they're doing, I doubt you'll save a lot of memory by creating your own variants.
You can do a few other things to help with memory issues:
Compile in 64 bit. Running out of memory in a 64 bit process is very hard.
You won't run out of memory, but memory might get swapped out. You should instead see if you need to load everything into memory at once; perhaps you can work on chunks of the data at a time.
As a side benefit, working on a chunk of the data at a time allows you to run your code in parallel.
With memory being so cheap nowadays that allocating 10 GB of RAM is very simple, I guess your bottleneck will be the processing you do on the data, not allocating the data.
These two articles explain the data structures underlying some common implementations of unordered associative containers:
Implementation of C++ unordered associative containers
Implementation of C++ unordered associative containers with duplicate elements
Even though there are some differences between implementations, they are modest: one word per element at most. If you go with minimum-overhead solutions such as sorted vectors, this would gain you 2-3 words per element, not even a 2x improvement if your objects are large. So, you'd probably be better off resorting to an environment with more memory or radically changing your approach by using a database or something.
If you have only one set of data and multiple ways of accessing it, you can try boost::multi_index; here is the documentation.
std::vector is basically a contiguous array, plus a few bytes of overhead. About the only way you'll improve with vector is by using a smaller element type. Can you store a short int instead of a regular int? If so, you can cut the vector memory down by half.
Are you perhaps using these containers to hold pointers to many objects on the heap? If so, you may have a lot of wasted space in the heap that could be saved by writing custom allocators, or by doing away with a pointer to a heap element altogether, and by storing a value type within the container.
Look within your class types. Consider all pointer types, and whether they need to be dynamic storage or not. The typical class often has pointer members hanging off a base object, which means a single object is a graph of memory chunks in itself. The more you can inline your class members, the more efficient your use of the heap will be.
RAM is cheap in 2014. Easy enough to build x86-64 Intel boxes with 64-256GB of RAM and Solid State Disk as fast swap if your current box isn't cutting it for the project. Hopefully this isn't a commercial desktop app we are discussing. :)
I ended up replacing boost::unordered_multimap with std::unordered_map of vector.
boost::unordered_multimap consumes more than two times the memory consumed by std::unordered_map of vector, due to the extra pointers it keeps (at least one extra pointer per element) and the fact that it stores the key and the value of each element, while unordered_map of vector only stores the key once for a vector that contains all the colliding elements.
In my particular case, I was trying to store about 4 billion integers, consuming about 15 GB of memory in the ideal case. Using the multimap I ended up consuming more than 40 GB, while using the map I use about 15 GB (a little bit more due to the pointers and other structures, but their size is negligible).
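For illustration, a minimal sketch of that replacement with placeholder key/value types (not the poster's actual types):
// One map entry per key, with all colliding values packed into a vector,
// instead of one multimap node per value.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Key = std::uint64_t;
using Value = std::uint32_t;
using GroupedMap = std::unordered_map<Key, std::vector<Value>>;

void insert(GroupedMap& m, Key k, Value v) {
    m[k].push_back(v);          // key stored once, values stored contiguously
}

// Iterating all values for a key:
//   auto it = m.find(k);
//   if (it != m.end())
//       for (Value v : it->second) { /* ... */ }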

Why is deque using so much more RAM than vector in C++?

I have a problem I am working on where I need to use some sort of 2 dimensional array. The array is fixed width (four columns), but I need to create extra rows on the fly.
To do this, I have been using vectors of vectors, and I have been using some nested loops that contain this:
array.push_back(vector<float>(4));
array[n][0] = a;
array[n][1] = b;
array[n][2] = c;
array[n][3] = d;
n++;
to add the rows and their contents. The trouble is that I appear to be running out of memory with the number of elements I was trying to create, so I reduced the number that I was using. But then I started reading about deque, and thought it would allow me to use more memory because it doesn't have to be contiguous. I changed all mentions of "vector" to "deque" in this loop, as well as all declarations. But then it appeared that I ran out of memory again, this time even with the reduced number of rows.
I looked at how much memory my code is using, and when I am using deque, the memory rises steadily to above 2GB, and the program closes soon after, even when using the smaller number of rows. I'm not sure exactly where in this loop it is when it runs out of memory.
When I use vectors, the memory usage (for the same number of rows) is still under 1GB, even when the loop exits. It then goes on to a similar loop where more rows are added, still only reaching about 1.4GB.
So my question is. Is this normal for deque to use more than twice the memory of vector, or am I making an erroneous assumption in thinking I can just replace the word "vector" with "deque" in the declarations/initializations and the above code?
Thanks in advance.
I'm using:
MS Visual C++ 2010 (32-bit)
Windows 7 (64-bit)
The real answer here has little to do with the core data structure. The answer is that MSVC's implementation of std::deque is especially awful and degenerates to an array of pointers to individual elements, rather than the array of arrays it should be. Frankly, it's surprising that it only uses twice the memory of vector. If you had a better implementation of deque you'd get better results.
It all depends on the internal implementation of deque (I won't speak about vector since it is relatively straightforward).
Fact is, deque has completely different guarantees than vector (the most important one being that it supports O(1) insertion at both ends while vector only supports O(1) insertion at the back). This in turn means the internal structures managed by deque have to be more complex than vector's.
To allow that, a typical deque implementation will split its memory into several non-contiguous blocks. But each individual memory block has a fixed overhead to allow the memory management to work (e.g. whatever the size of the block, the system may need another 16 or 32 bytes or whatever in addition, just for bookkeeping). Since, contrary to a vector, a deque requires many small, independent blocks, the overhead stacks up, which can explain the difference you see. Also note that those individual memory blocks need to be managed (maybe in separate structures?), which probably means some (or a lot of) additional overhead too.
As for a way to solve your problem, you could try what @BasileStarynkevitch suggested in the comments; this will indeed reduce your memory usage, but it will get you only so far because at some point you'll still run out of memory. And what if you try to run your program on a machine that only has 256 MB of RAM? Any other solution whose goal is to reduce your memory footprint while still trying to keep all your data in memory will suffer from the same problems.
A proper solution when handling large datasets like yours would be to adapt your algorithms and data structures in order to be able to handle small partitions at a time of your whole dataset, and load/save those partitions as needed in order to make room for the other partitions. Unfortunately since it probably means disk access, it also means a big drop in performance but hey, you can't eat the cake and have it too.
Theory
There are two common ways to efficiently implement a deque: either with a modified dynamic array or with a doubly linked list.
The modified dynamic array approach is basically a dynamic array that can grow from both ends, sometimes called an array deque. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end.
There are several implementations of the modified dynamic array:
1. Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
2. Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
3. Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
Conclusion
Different libraries may implement deques in different ways, but generally as a modified dynamic array. Most likely your standard library uses approach #1 to implement std::deque, and since you append elements only at one end, you ultimately waste a lot of space. That creates the illusion that std::deque takes up more space than the usual std::vector.
Furthermore, if std::deque were implemented as a doubly-linked list, that would waste space too, since each element would need to accommodate 2 pointers in addition to your custom data.
An implementation with approach #3 (also a modified dynamic array) would again waste some space on additional metadata such as pointers to all those small arrays.
In any case, std::deque is less efficient in terms of storage than plain old std::vector. Without knowing what you want to achieve I cannot confidently suggest which data structure you need. However, it seems like you don't even know what deques are for; therefore, what you really want in your situation is std::vector. Deques, in general, have a different application.
Deque can have additional memory overhead over vector because it's made of a few blocks instead of one contiguous block.
From en.cppreference.com/w/cpp/container/deque:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The primary issue is running out of memory.
So, do you need all the data in memory at once?
You may never be able to accomplish this.
Partial Processing
You may want to consider processing the data into "chunks" or smaller sub-matrices. For example, using the standard rectangular grid:
Read data of first quadrant.
Process data of first quadrant.
Store results (in a file) of first quadrant.
Repeat for remaining quadrants.
Searching
If you are searching for a particular datum or a set of data, you can do that without reading the entire data set into memory (a sketch follows the steps below).
Allocate a block (array) of memory.
Read a portion of the data into this block of memory.
Search the block of data.
Repeat steps 2 and 3 until the data is found.
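A minimal sketch of those steps, assuming a flat binary file of doubles and a simple equality test (both are placeholders):
#include <cstddef>
#include <fstream>
#include <vector>

bool contains_value(const char* path, double wanted) {
    std::ifstream in(path, std::ios::binary);
    std::vector<double> block(1 << 20);                    // step 1: ~8 MB buffer, reused each pass
    while (in) {
        in.read(reinterpret_cast<char*>(block.data()),     // step 2: read a portion of the data
                static_cast<std::streamsize>(block.size() * sizeof(double)));
        std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(double);
        for (std::size_t i = 0; i < got; ++i)              // step 3: search the block
            if (block[i] == wanted) return true;
    }
    return false;                                          // step 4 ends when the file is exhausted
}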
Streaming Data
If your application is receiving the raw data from an input source (other than a file), you will want to store the data for later processing.
This will require more than one buffer and is more efficient using at least two threads of execution.
The Reading Thread will be reading data into a buffer until the buffer is full. When the buffer is full, it will read data into another empty one.
The Writing Thread will initially wait until either the first read buffer is full or the read operation is finished. Next, the Writing Thread takes data out of the read buffer and writes to a file. The Write Thread then starts writing from the next read buffer.
This technique is called Double Buffering or Multiple Buffering.
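A bare-bones sketch of the double-buffering idea with two threads and a condition variable; the input and output sides are left as comments since they depend on your data source:
#include <array>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct DoubleBuffer {
    std::array<std::vector<char>, 2> buf;
    std::mutex m;
    std::condition_variable cv;
    int ready = -1;       // index of a filled buffer, -1 if none
    bool done = false;
};

void reader(DoubleBuffer& db) {
    int filling = 0;
    for (int chunk = 0; chunk < 8; ++chunk) {            // placeholder: loop until input is exhausted
        // ... fill db.buf[filling] from the input source here ...
        std::unique_lock<std::mutex> lk(db.m);
        db.cv.wait(lk, [&] { return db.ready == -1; });  // wait until the writer took the previous buffer
        db.ready = filling;
        filling = 1 - filling;                           // switch to the other buffer
        db.cv.notify_one();
    }
    std::lock_guard<std::mutex> lk(db.m);
    db.done = true;
    db.cv.notify_one();
}

void writer(DoubleBuffer& db) {
    for (;;) {
        std::unique_lock<std::mutex> lk(db.m);
        db.cv.wait(lk, [&] { return db.ready != -1 || db.done; });
        if (db.ready == -1) return;                      // finished and nothing left to write
        int idx = db.ready;
        lk.unlock();
        // ... write db.buf[idx] to the output file here ...
        lk.lock();
        db.ready = -1;                                   // hand the buffer back to the reader
        db.cv.notify_one();
    }
}

// Usage: DoubleBuffer db; std::thread r(reader, std::ref(db)), w(writer, std::ref(db)); r.join(); w.join();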
Sparse Data
If there is a lot of zero or unused data in the matrix, you should try using Sparse Matrices. Essentially, this is a list of structures that hold the data's coordinates and the value. This also works when most of the data is a common value other than zero. This saves a lot of memory space; but costs a little bit more execution time.
Data Compression
You could also change your algorithms to use data compression. The idea here is to store the data location, value and the number of contiguous equal values (a.k.a. runs). So instead of storing 100 consecutive data points of the same value, you would store the starting position (of the run), the value, and 100 as the quantity. This saves a lot of space, but requires more processing time when accessing the data.
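A tiny run-length-encoding sketch of that scheme, storing plain (value, count) pairs (start positions follow implicitly from the counts):
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<double, std::size_t>> rle_encode(const std::vector<double>& data) {
    std::vector<std::pair<double, std::size_t>> runs;
    for (double v : data) {
        if (!runs.empty() && runs.back().first == v)
            ++runs.back().second;              // extend the current run
        else
            runs.push_back({v, 1});            // start a new run
    }
    return runs;
}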
Memory Mapped File
There are libraries that can treat a file as memory. Essentially, they read in a "page" of the file into memory. When the requests go out of the "page", they read in another page. All this is performed "behind the scenes". All you need to do is treat the file like memory.
Summary
Arrays and deques are not your primary issue, quantity of data is. Your primary issue can be resolved by processing small pieces of data at a time, compressing the data storage, or treating the data in the file as memory. If you are trying to process streaming data, don't. Ideally, streaming data should be placed into a file and then processed later.
A historical purpose of a file is to contain data that doesn't fit into memory.

how to manage large arrays

I have a C++ program that uses several very large arrays of doubles, and I want to reduce the memory footprint of this particular part of the program. Currently, I'm allocating 100 of them and they can be 100 MB each.
Now, I do have the advantage that eventually parts of these arrays become obsolete during later parts of the program's execution, and there is little need to ever have the whole of any one of them in memory at any one time.
My question is this:
Is there any way of telling the OS, after I have created the array with new or malloc, that a part of it is unnecessary anymore?
I'm coming to the conclusion that the only way to achieve this is going to be to declare an array of pointers, each of which may point to a chunk of, say, 1 MB of the desired array, so that old chunks that are not needed anymore can be reused for new bits of the array. This seems to me like writing a custom memory manager, which does seem like a bit of a sledgehammer and is going to create a bit of a performance hit as well.
I can't move the data in the array because it is going to cause too many thread contention issues. The arrays may be accessed by any one of a large number of threads at any time, though only one thread ever writes to any given array.
It depends on the operating system. POSIX - including Linux - has the system call madvise to improve memory performance. From the man page:
The madvise() system call advises the kernel about how to handle paging input/output in the address range beginning at address addr and with size length bytes. It allows an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. This call does not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. The kernel is free to ignore the advice.
See the man page of madvise for more information.
Edit: Apparently, the above description was not clear enough. So, here are some more details, and some of them are specific to Linux.
You can use mmap to allocate a block of memory (directly from the OS instead of the libc) that is not backed by any file. For large chunks of memory, malloc is doing exactly the same thing. You have to use munmap to release the memory - regardless of the usage of madvise:
void* data = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ...
::munmap(data, size);
If you want to get rid of some parts of this chunk, you can use madvise to tell the kernel to do so:
madvise(static_cast<unsigned char*>(data) + 7 * page_size,
        3 * page_size, MADV_DONTNEED);
The address range is still valid, but it is no longer backed - neither by physical RAM nor by storage. If you access the pages later, the kernel will allocate some new pages on the fly and re-initialize them to zero. Be aware that the MADV_DONTNEED pages are also part of the virtual memory size of the process. It might be necessary to make some configuration changes to the virtual memory management, e.g. activating over-commit.
It would be easier to answer if we had more details.
1°) The answer to the question "Is there any way of telling the OS after I have created the array with new or malloc that a part of it is unnecessary any more?" is "not really". That's the point of C and C++, and of any language that lets you handle memory manually.
2°) If you're using C++ and not C, you should not be using malloc.
3°) Nor arrays, unless for a very specific reason. Use a std::vector.
4°) Preferably, if you often need to change the content of the array and want to reduce the memory footprint, use a linked list (std::list), though it'll be more expensive to access individual elements of the list (but it will be almost as fast if you only iterate through it).
A std::deque with pointers to std::array<double,LARGE_NUMBER> may do the job, but you had better wrap the deque in a dedicated container, so you can remap the indexes and, most importantly, define when entries are not used anymore.
The dedicated container can also contain a read/write lock, so it can be used in a thread-safe way.
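A minimal sketch of such a dedicated container, assuming 1 MB chunks and leaving out the read/write lock mentioned above (the chunk size and the release policy are placeholders):
#include <array>
#include <cstddef>
#include <deque>
#include <memory>

class ChunkedDoubles {
    static constexpr std::size_t kChunk = 1u << 17;       // 131072 doubles = 1 MB per chunk
public:
    double& operator[](std::size_t i) {
        std::size_t c = i / kChunk;
        if (c >= chunks_.size()) chunks_.resize(c + 1);   // null slots cost only a pointer
        if (!chunks_[c]) chunks_[c] = std::make_unique<std::array<double, kChunk>>();
        return (*chunks_[c])[i % kChunk];
    }
    void release_chunk(std::size_t c) {                   // give the memory back once a chunk is obsolete
        if (c < chunks_.size()) chunks_[c].reset();
    }
private:
    std::deque<std::unique_ptr<std::array<double, kChunk>>> chunks_;
};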
You could try using lists instead of arrays. Of course a list is 'heavier' than an array, but on the other hand it is easy to reconstruct a list so that you can throw away a part of it when it becomes obsolete. You could also use a wrapper which would only contain indexes saying which part of the list is up-to-date and which part may be reused.
This will help you improve performance, but will require a little bit more (reusable) memory.
Allocating by chunk and delete[]-ing and new[]-ing along the way seems like a good solution. It may make it possible to do as little memory management as possible. Do not reuse chunks yourself; simply deallocate old ones and allocate new chunks when needed.

std::sort on container of pointers

I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
struct Foo
{
int val;
// some variables
};
std::vector<Foo*> vectorOfFoo;
// Foo objects are new-ed and pushed in vectorOfFoo
for (int i=0; i<N; i++)
{
Foo *f = new Foo;
vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector I would like to enhance locality of reference through the many iterator dereferences; for example, I very often have to perform a doubly nested loop
for (vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously if the pointers inside vectorOfFoo point far apart from each other, I think locality of reference is somewhat lost.
What about performance if I sort the vector before iterating over it? Should I get better performance in repeated dereferencing?
Am I guaranteed that consecutive 'new' calls allocate objects which are close in the memory layout?
Just to answer your last question: no, there is no guarantee whatsoever where new allocates memory. The allocations can be distributed throughout the memory. Depending on the current fragmentation of the memory you may be lucky that they are sometimes close to each other but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
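One simple way to get pooled, contiguous Foo objects is to construct them all in a single reserved std::vector and keep raw pointers into it; a sketch with a made-up size and a trivial workload:
#include <cstddef>
#include <cstdio>
#include <vector>

struct Foo { int val; /* other members */ };

int main() {
    const std::size_t N = 1000000;                 // placeholder size
    std::vector<Foo> pool;
    pool.reserve(N);                               // must never reallocate, or the pointers dangle
    std::vector<Foo*> vectorOfFoo;
    vectorOfFoo.reserve(N);
    for (std::size_t i = 0; i < N; ++i) {
        pool.push_back(Foo{static_cast<int>(i)});
        vectorOfFoo.push_back(&pool.back());
    }
    long long sum = 0;
    for (Foo* f : vectorOfFoo) sum += f->val;      // pointers walk contiguous memory in order
    std::printf("%lld\n", sum);
}
Whether this or a real pool allocator helps is exactly the kind of thing to confirm with a profiler, as the answer above says.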
It depends on many factors.
First, it depends on how the objects that are being pointed to from the vector were allocated. If they were allocated on different pages then you cannot help it, except by fixing the allocation part and/or trying to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as a part of the larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will be coming from the other node and you cannot do much in that case except transfer your program back to its "home" node.
You have to check the stride that is needed in order to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view. It will then shut off so as not to evict your data from the cache, and the best you can do there is to try software prefetching. Which may or may not help (always test it).
So if sorting the vector of pointers makes the objects they point to get accessed one after another with a relatively small stride, then yes, you will improve the memory access speed by making it more friendly for the prefetch hardware.
You also have to make sure that the cost of sorting that vector doesn't outweigh the gain.
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business and things can get worse even though in theory the performance should have been improved. There are many tools that can be used to help you profile the memory access, for example cachegrind. Intel's VTune does the same. And many other tools. So don't guess; experiment and verify the results.