I have 200 sets of about 50,000 unique integers each, in the range 0 to 500,000, that I need to map to a small value (a pair of ints; the values are unrelated to the keys, so there is no on-demand calculation).
I tried std::unordered_map, and it used around 50 MB (measured in the VS2015 heap diagnostics tool). Performance was fine, but I'd like to get the memory usage down (this is intended to run as a background service on some small 500 MB cloud servers).
Effectively my initial version was 200 separate std::unordered_map<int, std::pair<int, int>>.
One option seems to be a sorted array with binary search, but is there anything else?
I think a sorted vector should work, as long as you don't change the vector once it's sorted. It's really space-efficient, i.e. no pointer overhead.
If you need even better performance and don't mind a third-party library, you can try sparse_hash_map, which implements a hash map with very little space overhead.
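For illustration, here is a minimal sketch of the sorted-vector approach with the question's int keys and pair-of-int values (the Entry and find names are mine, not from the question):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One entry per key, kept sorted by key so std::lower_bound can find it.
struct Entry {
    int32_t key;
    std::pair<int32_t, int32_t> value;
};

// Binary search over the sorted vector; returns nullptr on a miss.
const Entry* find(const std::vector<Entry>& table, int32_t key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const Entry& e, int32_t k) { return e.key < k; });
    return (it != table.end() && it->key == key) ? &*it : nullptr;
}

Build each table once with push_back, std::sort it, and call shrink_to_fit; at about 50,000 entries, a lookup costs roughly 16 comparisons.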
I guess the most memory-efficient structure will be a std::vector<std::pair<int, std::set<Something>>>, as you already suggested.
In this case, you will only have memory overhead as a result of:
The fixed overhead from std::vector (Very limited)
Occasionally higher memory usage during a 'grow', as the old data and the new data have to be alive at the same moment
The unused space in std::vector
You indicate that after the build-up you no longer have to extend the vector, so you can either reserve up front or shrink_to_fit afterwards to get rid of the unused space. (Note that reserve also prevents the memory spikes during growth.)
If you had a denser usage, you could consider changing the storage to std::vector<std::set<Something>> or std::vector<std::unique_ptr<std::set<Something>>>. In this structure the index is implicit, though the memory gain will only show if you have a value for almost every index.
The disadvantage of using a vector is that you have to write some custom code. In that case, std::unordered_map and std::map aren't that bad if you don't mind more misses in the processor caches (L1 ...). For less standard implementations, one could check out Google's sparsehash, Google's cpp-btree, or Facebook's AtomicHashMap from Folly, though I don't have any experience with them.
Finally, one could wonder why you have this data all in memory, though I don't see a way to prevent this if you need optimal performance.
For efficient storage, depending on the precise value range, you may want to use bit operations to store the key/value pairs in a single value: for example, if the values are really small, you could even use 24 bits for the key and 8 bits for the value, resulting in a single 32-bit entry. I believe most compilers nowadays use 32- or 64-bit alignment, so storing, for example, 32-bit keys and 16-bit values may still require 64 bits per entry. Using simple compression like this can also be beneficial for performance if the bottleneck is the memory bus and cache misses rather than the CPU itself.
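As a sketch of that packing idea, assuming 24-bit keys and 8-bit values as in the example above (the question's keys fit in 19 bits, but its values are a pair of ints, so 8 bits would only work if the values can be encoded that compactly):

#include <cstdint>

// Pack a 24-bit key and an 8-bit value into one 32-bit entry.
uint32_t pack(uint32_t key24, uint8_t value) {
    return (key24 << 8) | value;           // key in the upper 24 bits
}

uint32_t key_of(uint32_t entry)   { return entry >> 8; }
uint8_t  value_of(uint32_t entry) { return entry & 0xFFu; }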
Then it depends on the kind of operations you would like to perform. The simplest way to store the keys would be a sorted array of structs, or of the combined key/value entries that I proposed above. This is fast and very space-efficient, but requires an O(log n) lookup.
If you want to be a bit more fancy, you could use perfect hashing: the idea is to find a hash function that produces a unique hash value for each key. This allows the hash map to be a simple array, which needs to be only marginally larger than the sorted array I proposed above. Finding a good hash function should be relatively quick; you can make it even easier by making the array a little larger and allowing some unused fields in it.
Here is an implementation of perfect hashing, but I haven't used it myself.
In both cases the memory consumption would be (number of pairs) * (bits per entry) bits, plus the cost of storing the hash function if you use the second approach.
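As a toy sketch of how such a search could look (a brute-force multiplier search of my own, not the linked implementation; with a table somewhat larger than the key set a collision-free multiplier is usually found quickly, but the loop is unbounded in the worst case):

#include <cstddef>
#include <cstdint>
#include <vector>

// Try odd multipliers until h(k) = (k * a) % table_size maps every key
// to a distinct slot. table_size should be somewhat larger than keys.size().
uint64_t find_perfect_multiplier(const std::vector<uint32_t>& keys,
                                 std::size_t table_size) {
    for (uint64_t a = 1; ; a += 2) {
        std::vector<bool> used(table_size, false);
        bool collision = false;
        for (uint32_t k : keys) {
            std::size_t h = static_cast<std::size_t>((k * a) % table_size);
            if (used[h]) { collision = true; break; }
            used[h] = true;
        }
        if (!collision) return a;   // a perfect hash for this key set
    }
}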
** EDIT **
Updated after a comment from @FireLancer. Also added some words about the performance of compressed arrays.
I was considering ways to reduce memory footprint, and it is constantly mentioned that a bool takes up more memory than it logically needs to, as a byproduct of processor design.
It is also sometimes mentioned that one could store several bools within an int.
I am wondering if this would actually be more memory efficient?
If we have a use case where we can make use of a significant portion of 32 (or 64) bools, and we decide to store all of them in a single int, then on the surface we have saved:
7 (bits wasted per bool) * 32 (bools per int) = 224 bits, or 28 bytes
But in order to get each of those bits out of the int, we need to use some method of masking,
such as:
bit-shifting the int in both directions, (i << x) >> y; here we need to load and store x and y, which are probably ints, though you could make them smaller depending on the use case
masking the int, i & mask; here we also store an additional int, which is stored and loaded
Even if these aren't stored as variables but are defined statically within the code, they still end up using additional memory, as they increase the memory footprint of the instructions, as do the instructions for the masking steps themselves.
Is there any way to do this that isn't actually worse for memory usage than just taking the hit of the 7 wasted bits?
You are describing a textbook example of a trade-off.
Yes, several bools in one int is hugely more memory efficient - in itself.
Yes, you need to spend code to use that.
Yes, for only a few bools (for different values of "few"), the code might take more space than you save.
However, you could look at the kind of memory which is used. In some environments, RAM (which is saved by your idea) is much more expensive than ROM (which has to be paid for your idea).
Also, the price to pay is mostly paid once for implementation and only paid a fraction for using, especially when the using code is reused, e.g. in loops.
Altogether, in case of many bools, you can save more than you pay.
The point of actually saving needs to be determined for the special case.
On the other hand, you have missed one "currency" on the price tag for the idea. You not only pay in memory, you also pay in execution time. You focused your question on memory, so I won't elaborate here, but for anything time-critical you should take the longer execution time into consideration. You might find that saving memory is quite achievable with your idea, but the whole thing gets unbearably slow.
Again from the other side, as Eric Postpischil points out in a comment, execution speed can also improve due to cache effects from the smaller memory footprint.
I am wondering if this would actually be more memory efficient?
Potentially yes. Storing multiple bools inside a single object may use less storage than having a distinct bool object for each, if the number of bools is large enough to offset the memory cost of the extra instructions.
Also consider that there are more considerations than space efficiency. Most typically, people are concerned about time efficiency as well. In this regard, compacting bools may be more or less efficient depending on the details of the use case.
is using an integer to store many bool worth the effort?
It can be worth the effort. It can also be counterproductive. The difference can be minuscule or significant, both in terms of time and space efficiency. The most accurate way to find out is to measure it.
It's not necessary to implement this yourself, though, since there are solutions in the standard library: std::vector<bool> and std::bitset both implement compact storage of bools. Using bit-fields may also be an option (just remember not to rely on the internal representation).
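A minimal example with std::bitset:

#include <bitset>
#include <iostream>

int main() {
    std::bitset<32> flags;              // 32 bools packed into (typically) 4 bytes
    flags.set(3);                       // masking and shifting happen internally
    flags[7] = true;
    std::cout << flags.test(3) << ' '   // 1
              << flags.test(4) << ' '   // 0
              << flags.count() << '\n'; // 2
}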
I am searching for a high performance C++ structure for a table. The table will have void* as keys and uint32 as values.
The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, int32_t> or std::unordered_map<void*, int32_t>. However, that would be overkill and would not give me the performance I want (those tables are suited for a high number of items too).
So I thought about using std::vector<std::pair<void*, int32_t>>, sorting it upon creation and probing it linearly. The next idea would be to use SIMD instructions, but I am not sure that is possible with the current structure.
Another solution, which I will evaluate shortly, is like this:
struct Group
{
    void*   keys[5];     // search these using SIMD
    int32_t values[5];
};  // 60 bytes, fits in a 64-byte cache line
struct Table
{
    Group* groups;
    size_t capacity;
};
Are there any better options? I need only one operation: finding values by keys; no modification, nothing else. Thanks!
EDIT: another thing I think I should mention is the access pattern: suppose I have an array of those hash tables, and each time I will look up in a random one from the array.
Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).
If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<int32_t>> instead of std::vector<std::pair<void*, int32_t>> (which alternates void* keys and int32_t values in memory, with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vectors is not great either, because you pay their overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded.
A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into a 32-bit lower and a 32-bit upper part. Indeed, it is very unlikely that two pointers have the same lower part (least significant bits) while having a different upper part. This trick lets you check twice as many pointers at a time.
Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than what fits in a SIMD vector. For example, with AVX2 (on x86-64 processors) you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask out the unwanted values (or not even load them if the memory buffer does not contain some padding). This introduces additional overhead. This is not so much a problem with AVX-512 and SVE (only available on a small fraction of processors yet), since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not as strong with integer instructions). SIMD instructions also introduce some additional latency compared to scalar code (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try to write a scalar branchless implementation (possibly unrolled, for better performance if the number of items is known at compile time).
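As an illustration of such a scalar branchless lookup (a sketch of mine, assuming a fixed capacity of 8, a structure-of-arrays layout, and that unused key slots hold a value that is never searched for, e.g. nullptr):

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCapacity = 8;   // assumed small, fixed bound

struct Table {
    std::array<const void*, kCapacity> keys;   // structure of arrays:
    std::array<int32_t, kCapacity> values;     // keys and values kept apart
};

// Branchless probe: every slot is examined; a matching slot selects its value.
int32_t find(const Table& t, const void* key, int32_t not_found) {
    int32_t result = not_found;
    for (std::size_t i = 0; i < kCapacity; ++i) {
        // Compilers typically turn this into a conditional move, not a branch.
        result = (t.keys[i] == key) ? t.values[i] : result;
    }
    return result;
}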
You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
I think a nice alternative is an optimization of your simple linear probing idea.
Your lookup procedure would look like this:
struct Slot { const void* key; int32_t value; unsigned max_extent; };  // per the question: void* keys, int32_t values

Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;   // worst-case probe length for this hash code
for (; s < e; ++s) {
    if (s->key == key) {
        return s->value;
    }
}
return NOT_FOUND;
table[h].max_extent is the maximum number of elements you may have to look at if you're looking for an element with hash code h. You would pre-calculate this when you generate the table, so your lookup doesn't have to iterate until it gets a null. This greatly reduces the amount of probing you have to do for misses.
Of course you want max_extent to be as small as possible. Pick a hash result size (at least 2n) to make it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different P values. Fill your hash table in hash(key) order to produce the best result.
NOTE that we do not wrap around from the end to the start of the table while probing. Just allocate however many extra slots you need to avoid it.
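A build-time sketch of how max_extent could be precomputed (my own illustration, using the same Slot layout as the lookup above; it assumes the table was already filled with each entry at or after its home slot, and nullptr marking empty slots):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot { const void* key; int32_t value; unsigned max_extent; };

// For every occupied position i, find its home slot h = hash(key) and
// record how far past h this entry landed; the maximum over all entries
// with home h is the probe length the lookup loop needs.
void compute_extents(std::vector<Slot>& table,
                     std::size_t (*hash)(const void*)) {
    for (Slot& s : table) s.max_extent = 0;
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].key == nullptr) continue;   // empty slot
        std::size_t h = hash(table[i].key);
        table[h].max_extent = std::max<unsigned>(
            table[h].max_extent, static_cast<unsigned>(i - h + 1));
    }
}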
I wrote a program that needs to handle very large data, using the following libraries:
vector
boost::unordered_map
boost::unordered_multimap
So, I'm having memory problems (the program uses a LOT of memory) and I was thinking maybe I can replace these libraries (with something that already exists or my own implementations):
So, three questions:
How much memory would I save if I replaced vector with a C array? Is it worth it?
Can someone explain how memory is used in boost::unordered_map and boost::unordered_multimap in the current implementation? Like, what is stored in order to achieve their performance?
Can you recommend some libraries that outperform boost::unordered_map and boost::unordered_multimap in memory usage (but are not too slow)?
std::vector is memory efficient. I don't know about the boost maps, but the Boost people usually know what they're doing; I doubt you'll save a lot of memory by creating your own variants.
You can do a few other things to help with memory issues:
Compile in 64 bit. Running out of memory in a 64 bit process is very hard.
You won't run out of memory, but memory might get swapped out. You should instead see whether you need to load everything into memory at once; perhaps you can work on chunks of the data at a time.
As a side benefit, working on a chunk of the data at a time allows you to run your code in parallel.
With memory so cheap nowadays that allocating 10 GB of RAM is very simple, I guess your bottleneck will be the processing you do on the data, not the allocation of it.
These two articles explain the data structures underlying some common implementations of unordered associative containers:
Implementation of C++ unordered associative containers
Implementation of C++ unordered associative containers with duplicate elements
Even though there are some differences between implementations, they are modest: one word per element at most. If you go with a minimum-overhead solution such as sorted vectors, this would gain you 2-3 words per element, not even a 2x improvement if your objects are large. So you'd probably be better off resorting to an environment with more memory, or radically changing your approach by using a database or something.
If you have only one set of data and multiple ways of accessing it, you can try boost::multi_index; here is the documentation.
std::vector is basically a contiguous array plus a few bytes of overhead. About the only way you'll improve with vector is by using a smaller element type. Can you store a short int instead of a regular int? If so, you can cut the vector's memory down by half.
Are you perhaps using these containers to hold pointers to many objects on the heap? If so, you may have a lot of wasted space in the heap that could be saved by writing custom allocators, or by doing away with a pointer to a heap element altogether, and by storing a value type within the container.
Look within your class types. Consider all pointer types, and whether they need to be dynamic storage or not. The typical class often has pointer members hanging off a base object, which means a single object is a graph of memory chunks in itself. The more you can inline your class members, the more efficient your use of the heap will be.
RAM is cheap in 2014. Easy enough to build x86-64 Intel boxes with 64-256GB of RAM and Solid State Disk as fast swap if your current box isn't cutting it for the project. Hopefully this isn't a commercial desktop app we are discussing. :)
I ended up replacing boost::unordered_multimap with a std::unordered_map of vector.
boost::unordered_multimap consumes more than twice the memory consumed by a std::unordered_map of vector, due to the extra pointers it keeps (at least one extra pointer per element) and the fact that it stores the key alongside every element, while the unordered_map of vector stores each key only once, for a vector that contains all the colliding elements.
In my particular case, I was trying to store about 4,000 million (4 billion) integers, consuming about 15 GB of memory in the ideal case. Using the multimap I ended up consuming more than 40 GB, while using the map of vectors I use about 15 GB (a little more due to the pointers and other structures, but their size is negligible).
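A sketch of that replacement (my own illustration of the idea):

#include <cstdint>
#include <unordered_map>
#include <vector>

// The multimap's per-element key copies and pointers are replaced by one
// key per group plus a vector holding all the values that collide on it.
std::unordered_map<int32_t, std::vector<int32_t>> index;

void insert(int32_t key, int32_t value) {
    index[key].push_back(value);   // operator[] creates the vector on first use
}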
I'm developing a tiny search engine using TF-IDF and cosine similarity. When pages are added, I build an inverted index to keep the word frequencies in the different pages. I remove stopwords and the most common words, and plural/verb/etc. forms are stemmed.
My inverted index looks like:
map<string, map<int, float>> index
[
    word_a => [ id_doc => frequency, id_doc2 => frequency2, ... ],
    word_b => [ id_doc => frequency, id_doc2 => frequency2, ... ],
    ...
]
With this data structure, I can get the idf weight with word_a.size(). Given a query, the program loops over the keywords and scores the documents.
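A sketch of that scoring loop (the names and the exact idf formula are mine; total_docs is assumed to be tracked separately):

#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Index = std::map<std::string, std::map<int, float>>;

// Accumulate tf * idf per document for every query keyword found in the index.
std::map<int, float> score(const Index& index,
                           const std::vector<std::string>& query,
                           std::size_t total_docs) {
    std::map<int, float> scores;   // id_doc -> accumulated score
    for (const std::string& word : query) {
        auto it = index.find(word);
        if (it == index.end()) continue;
        // idf from the posting-list size, as described above
        float idf = std::log(static_cast<float>(total_docs) / it->second.size());
        for (const auto& [id_doc, tf] : it->second)
            scores[id_doc] += tf * idf;
    }
    return scores;
}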
I don't know well data structures and my questions are:
How should I store a 500 MB inverted index so that it can be loaded at search time? Currently, I use boost to serialize the index:
ofstream ofs_index("index.sr", ios::binary);
boost::archive::binary_oarchive oa(ofs_index);
oa << index;
And then I load it at search time:
ifstream ifs_index("index.sr", ios::binary);
boost::archive::binary_iarchive ia(ifs_index);
ia >> index;
But it is very slow; it sometimes takes 10 seconds to load.
I don't know if std::map is efficient enough for an inverted index.
In order to cluster documents, I get all the keywords from each document and loop over these keywords to score similar documents, but I would like to avoid reading each document again and use only this inverted index. I fear, though, that this data structure would be costly.
Thank you in advance for any help!
The answer will pretty much depend on whether you need to support data comparable to or larger than your machine's RAM and whether in your typical use case you are likely to access all of the indexed data or rather only a small fraction of it.
If you are certain that your data will fit into your machine's memory, you can try to optimize the map-based structure you are using now. Storing your data in a map should give pretty fast access, but there will always be some initial overhead when you load the data from disk into memory. Also, if you only use a small fraction of the index, this approach may be quite wasteful as you always read and write the whole index, and keep all of it in memory.
Below I list some suggestions you could try out, but before you commit too much time to any of them, make sure that you actually measure what improves the load and run times and what does not. Without profiling the actual working code on actual data you use, these are just guesses which may be completely wrong.
map is implemented as a tree (usually a red-black tree). In many cases a hash map may give you better performance as well as better memory usage (fewer allocations and less fragmentation, for example).
Try reducing the size of the data: less data means it will be faster to read from disk, potentially fewer memory allocations, and sometimes better in-memory performance due to better locality. For example, you currently use a float to store the frequency, but perhaps you could store only the number of occurrences as an unsigned short in the map values and, in a separate map, store the total number of words for each document (also as a short). Using the two numbers, you can recalculate the frequency, yet use less disk space when you save the data, which could result in faster load times.
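For example, the two-map representation could look like this (a sketch with invented names):

#include <cstdint>
#include <map>
#include <string>

// Raw counts as 16-bit integers instead of a float per posting.
std::map<std::string, std::map<int, uint16_t>> occurrences; // word -> (doc -> count)
std::map<int, uint16_t> doc_words;                          // doc -> total words

// The frequency is recomputed on demand rather than stored.
float frequency(const std::string& word, int id_doc) {
    return static_cast<float>(occurrences.at(word).at(id_doc))
         / doc_words.at(id_doc);
}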
Your map has quite a few entries, and sometimes using custom memory allocators helps improve performance in such a case.
If your data could potentially grow beyond the size of your machine's RAM, I would suggest you use memory-mapped files for storing the data. Such an approach may require re-modelling your data structures and either using custom STL allocators or using completely custom data structures instead of std::map, but it may improve your performance by an order of magnitude if done well. In particular, this approach frees you from having to load the whole structure into memory at once, so your startup times will improve dramatically, at the cost of slight delays related to disk accesses distributed over time as you touch different parts of the structure for the first time. The subject is quite broad and requires much deeper changes to your code than just tuning the map, but if you plan on handling huge data you should certainly have a look at mmap and friends.
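As a POSIX-flavoured sketch of the memory-mapped approach (my own illustration with an invented fixed-size record; a real index would need a more elaborate on-disk layout):

#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Record { int32_t id_doc; float frequency; };  // invented on-disk record

// Map a file of Records; the OS pages data in lazily on first access,
// so startup no longer reads the whole index from disk.
const Record* map_records(const char* path, std::size_t& count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           // the mapping remains valid
    if (p == MAP_FAILED) return nullptr;
    count = st.st_size / sizeof(Record);
    return static_cast<const Record*>(p);
}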
Imagine there's a fixed and constant set of 'options' (e.g. skills). Every object (e.g. human) can either have or not have any of the options.
Should I maintain a member list-of-options for every object and fill it with options?
OR:
Is it more efficient (faster) to use a bitarray where each bit represents the respective option's taken (or not taken) status?
EDIT:
To be more specific, the list of skills is a vector of strings (option names), definitely shorter than 256.
The target is for the program to be AS FAST as possible (no memory concerns).
That rather depends. If the number of options is small, then use several bool members to represent them. If the list grows large, then both your options become viable:
a bitset (with an appropriate enum to symbolically represent the options) takes a constant, and very small, amount of space, and getting a certain option takes O(1) time;
a list of options, or rather an std::set or unordered_set of them, might be more space-efficient, but only if the number of options is huge, and it is expected that a very small number of them will be set per object.
When in doubt, use either a bunch of bool members, or a bitset. Only if profiling shows that storing options becomes a burden, consider a dynamic list or set representation (and even then, you might want to reconsider your design).
Edit: with fewer than 256 options, a bitset would take at most 32 bytes, which will definitely beat any list or set representation in terms of memory, and likely in speed. A bunch of bools, or even an array of unsigned char, might still be faster because accessing a byte is commonly faster than accessing a bit. But copying the structure will be slower, so try several options and measure the result. YMMV.
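For instance (a sketch with invented option names):

#include <bitset>

// Symbolic names for the options, as suggested above (names are invented).
enum Skill { Archery, Swimming, Smithing, SkillCount };

struct Human {
    std::bitset<SkillCount> skills;   // one bit per option, O(1) access
};

bool can_swim(const Human& h) { return h.skills.test(Swimming); }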
Using a bit array is faster when testing for the presence of multiple skills in a person in a single operation.
If you use a list of options, then you'll have to go over the list one item at a time to find whether a skill exists, which would obviously take more time and require many comparison operations.
The bit array will generally be faster to edit and faster to search. As for the space required, just do the math: a list of options requires a dynamically sized array (which carries some overhead beyond the set of options itself), but if there are a large number of options it may be smaller, provided (as is typical) only a small number of options are set.