Imagine there's a fixed and constant set of 'options' (e.g. skills). Every object (e.g. human) can either have or not have any of the options.
Should I maintain a member list-of-options for every object and fill it with options?
OR:
Is it more efficient (faster) to use a bitarray where each bit represents the respective option's taken (or not taken) status?
-edited:-
To be more specific, the list of skills is a vector of strings (option names), definitely shorter than 256.
The target is for the program to be AS FAST as possible (no memory concerns).
That rather depends. If the number of options is small, then use several bool members to represent them. If the list grows large, then both your options become viable:
a bitset (with an appropriate enum to symbolically represent the options) takes a constant, and very small, amount of space, and getting a certain option takes O(1) time;
a list of options, or rather an std::set or unordered_set of them, might be more space-efficient, but only if the number of options is huge, and it is expected that a very small number of them will be set per object.
When in doubt, use either a bunch of bool members, or a bitset. Only if profiling shows that storing options becomes a burden, consider a dynamic list or set representation (and even then, you might want to reconsider your design).
Edit: with fewer than 256 options, a bitset would take at most 32 bytes, which will definitely beat any list or set representation in terms of memory and likely speed. A bunch of bools, or even an array of unsigned char, might still be faster because accessing a byte is commonly faster than accessing a bit. But copying the structure will be slower, so try several options and measure the result. YMMV.
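For illustration, a minimal sketch of the bitset-plus-enum idea (the skill names are invented):

    #include <bitset>

    // Each enumerator is the bit index of one option; SkillCount is the total.
    enum Skill { Cooking, Driving, Swimming, SkillCount };

    struct Person {
        std::bitset<SkillCount> skills;               // fixed size, stored inline in the object

        void learn(Skill s)     { skills.set(s); }            // O(1) set
        bool has(Skill s) const { return skills.test(s); }    // O(1) query
    };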
Using a bit array is faster when testing for the presence of multiple skills in a person in a single operation.
If you use a list of options then you'll have to walk the list one item at a time to find whether a given skill exists, which obviously takes more time and requires many comparison operations.
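For example, if the skills live in a single integer, one AND against a pre-built mask checks a whole set of required skills at once (the bit positions here are just an illustration):

    #include <cstdint>

    const std::uint64_t required = (1ull << 3) | (1ull << 7);   // e.g. skills 3 and 7

    bool hasAllRequired(std::uint64_t skills) {
        return (skills & required) == required;   // one AND + one compare
    }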
The bitarray will be generally faster to edit and faster to search. As for space required, just do the math. A list of options requires a dynamically sized array (which suffers some overhead over the set of options itself); but if there are a large number of options, it may be smaller if (typically) only a small number of options are set.
I am searching for a high performance C++ structure for a table. The table will have void* as keys and uint32 as values.
The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, int32_t> or std::unordered_map<void*, int32_t>. However, that would be overkill and would not provide the performance I want (those tables are suited for a high number of items, too).
So I thought about using std::vector<std::pair<void*, int32_t>>, sorting it upon creation and probing it linearly. The next idea would be to use SIMD instructions, but I am not sure that is possible with the current structure.
Another solution which I will shortly evaluate is like that:
struct Group
{
void* keys[5]; // search these with SIMD
int32_t values[5];
}; // fits in a 64-byte cache line
struct Table
{
Group* groups;
size_t capacity;
};
Are there any better options? I need only 1 operation: finding values by keys, not modifying them, not anything. Thanks!
EDIT: another thing I think I should mention are the access patterns: suppose I have an array of those hash tables, each time I will look up from a random one in the array.
Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).
If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<int32_t>> instead of std::vector<std::pair<void*, int32_t>> (which alternates void* keys and int32_t values in memory, with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vector is not great either, because you pay their overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded.
A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into a 32-bit lower and a 32-bit upper part. Indeed, it is very unlikely that two pointers have the same lower part (least significant bits) while having a different upper part. This trick lets you check twice as many pointers at a time.
Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than the number fitting in a SIMD vector. For example, with AVX2 (on x86-64 processors), you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask the unwanted values to check (or even not load them if the memory buffer does not contain some padding). This introduces an additional overhead. This is not much of a problem with AVX-512 and SVE (only available on a small fraction of processors yet) since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not so strong with integer instructions). SIMD instructions also introduce some additional latency compared to scalar versions (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try to write a scalar branchless implementation (possibly unrolled for better performance if the number of items is known at compile time).
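As a rough illustration of the structure-of-arrays layout with a scalar, branch-light lookup (the capacity, names and sentinel are assumptions, not taken from the question):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct SmallTable {
        std::array<void*, 8>        keys{};     // all keys together...
        std::array<std::int32_t, 8> values{};   // ...then all values (structure of arrays)
        std::size_t                 size = 0;

        // Scans every used slot; the conditional select avoids an early-exit branch.
        std::int32_t find(void* key, std::int32_t not_found) const {
            std::int32_t result = not_found;
            for (std::size_t i = 0; i < size; ++i)
                result = (keys[i] == key) ? values[i] : result;
            return result;
        }
    };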
You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
I think a nice alternative is an optimization of your simple linear probing idea.
Your lookup procedure would look like this:
Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;
for (;s<e; ++s) {
if (s->key == key) {
return s->value;
}
}
return NOT_FOUND;
table[h].max_extent is the maximum number of elements you may have to look at if you're looking for an element with hash code h. You would pre-calculate this when you generate the table, so your lookup doesn't have to iterate until it gets a null. This greatly reduces the amount of probing you have to do for misses.
Of course you want max_extent to be as small as possible. Pick a hash range (at least 2n, i.e. at least twice the number of keys) to make it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different P values. Fill your hash table in hash(key) order to produce the best result.
NOTE that we do not wrap around from the end to the start of the table while probing. Just allocate however many extra slots you need to avoid it.
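The Slot layout and the max_extent pre-computation are not spelled out above; one plausible shape, purely as a sketch, is:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Slot {
        void*        key   = nullptr;
        std::int32_t value = 0;
        std::uint8_t max_extent = 0;   // slots to scan for keys whose home slot is this one
    };

    // Insert in hash(key) order: place each key at the first free slot at or after its home
    // slot (the table is allocated with spare slots at the end, so there is no wrap-around),
    // then record the furthest distance at the home slot.
    void place(std::vector<Slot>& table, std::size_t home, void* key, std::int32_t value) {
        std::size_t i = home;
        while (table[i].key != nullptr) ++i;
        table[i].key   = key;
        table[i].value = value;
        std::uint8_t extent = static_cast<std::uint8_t>(i - home + 1);
        if (extent > table[home].max_extent) table[home].max_extent = extent;
    }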
I have 200 sets of about 50,000 unique integers in the range 0 to 500,000 I need to map to another small value (pair of ints, values are unrelated so no on-demand calculation).
I tried using std::unordered_maps, and this used around 50MB (measured in the VS2015 heap diagnostics tool), and while performance was fine, I'd like to get this memory usage down (the intent is to run as a background service on some small 500MB cloud servers).
Effectively my initial version was 200 separate std::unordered_map<int, std::pair<int, int>>.
One option seems to be a sorted array and use binary search, but is there anything else?
I think a sorted vector should work, if you won't change the vector once it's sorted. It's really space-efficient, i.e. there is no pointer overhead.
If you need even better performance and don't mind a third-party library, you can try sparse_hash_map, which implements a hash map with very little space overhead.
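A sketch of the sorted-vector idea, with the key and value types taken from the question (the function name is made up):

    #include <algorithm>
    #include <utility>
    #include <vector>

    using Entry = std::pair<int, std::pair<int, int>>;   // key -> (value1, value2)

    // The vector is sorted by key once after it is built; lookups are O(log n)
    // and there is no per-node pointer overhead.
    const std::pair<int, int>* find_value(const std::vector<Entry>& table, int key) {
        auto it = std::lower_bound(table.begin(), table.end(), key,
                                   [](const Entry& e, int k) { return e.first < k; });
        return (it != table.end() && it->first == key) ? &it->second : nullptr;
    }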
I guess that the most memory-efficient option will be a std::vector<std::pair<int, std::set<Something>>>, like you already suggested.
In this case, you will only have memory overhead as a result of:
The fixed overhead from std::vector (Very limited)
Sometimes a higher memory usage during a 'grow', as the old data and the new data have to be alive at the same time
The unused capacity in the std::vector
You kinda indicate that after the build-up you no longer have to extend the vector, so you can either reserve or shrink_to_fit to get rid of the unused space. (Note that reserve also avoids the spikes in memory usage during a grow.)
If you would have a denser usage, you could consider changing the storage to std::vector<std::set<Something>> or std::vector<std::unique_ptr<std::set<Something>>>. In this structure, the index is implicit, though the memory gain will only show if you would have a value for every index.
The disadvantage of using a vector is that you have to write some custom code. In that case, std::unordered_map and std::map aren't that bad if you don't mind more cache misses on the processor caches (L1 ...). For less standard implementations, one could check out Google's sparsehash, Google's cpp-btree or Facebook's AtomicHashMap from Folly, though I don't have any experience with them.
Finally, one could wonder why you have this data all in memory, though I don't see a way to prevent this if you need optimal performance.
For efficient storage, depending on the precise value range, you may want to use bit operations to store the key/value pairs in a single value: for example, if the values are really small, you could even use 24 bits for the keys and 8 bits for the values, resulting in a single 32-bit entry. I believe most compilers nowadays use 32- or 64-bit alignment, so storing, for example, 32-bit keys and 16-bit values may still require 64 bits per entry. Using simple compression can also be beneficial for performance if the bottleneck is the memory bus and cache misses, rather than the CPU itself.
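A sketch of that packing, under the stated assumption that the key fits in 24 bits and the value in 8 bits:

    #include <cstdint>

    // One 32-bit entry: key in the upper 24 bits, value in the lower 8 bits.
    inline std::uint32_t pack(std::uint32_t key, std::uint32_t value) {
        return (key << 8) | (value & 0xFFu);
    }
    inline std::uint32_t key_of(std::uint32_t entry)   { return entry >> 8; }
    inline std::uint32_t value_of(std::uint32_t entry) { return entry & 0xFFu; }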
Then it depends on the kind of operations you would like to perform. The simplest way to store the keys would be a sorted array of structs, or of the combined key/value entries that I proposed above. This is fast and very space-efficient, but requires O(log n) lookup.
If you want to be a bit more fancy, you could use perfect hashing, the idea is to find a hash function that produces unique hash values for each key. This allows the hashmap to be a simple array which needs to be only marginally larger than the sorted array that I proposed above. Finding a good hash function should be relatively quick, you can make it even easier by making the array a little bit larger and allowing for some unused fields in the array.
Here is an implementation of perfect hashing, but I haven't used it myself.
In both cases the memory consumption would be (number of pairs) * (bits per entry) bits, plus the storage of the hash function when you use the second approach.
EDIT: Updated after a comment from @FireLancer. Also added some words about the performance of compressed arrays.
Suppose I created a class that took a template parameter equal to the number of uint8_ts I want to string together into a big int.
This way I can create a huge int like this:
SizedInt<1000> unspeakablyLargeNumber; //A 1000 byte number
Now the question arises: am I killing my speed by using uint8_ts instead of a larger built-in type?
For example:
SizedInt<2> num1;
uint16_t num2;
Are num1 and num2 the same speed, or is num2 faster?
It would undoubtedly be slower to use uint8_t[2] instead of uint16_t.
Take addition, for example. In order to get the uint8_t[2] speed up to the speed of uint16_t, the compiler would have to figure out how to translate your add-with-carry logic and fuse those multiple instructions into a single, wider addition. I'm sure that some compilers out there are capable of such optimizations sometimes, but there are many circumstances which could make the optimization unlikely or impossible.
On some architectures, this will even apply to loading / storing, since uint8_t[2] usually has different alignment requirements than uint16_t.
Typical bignum libraries, like GMP, work on the largest words that are convenient for the architecture. On x64, this means using an array of uint64_t instead of an array of something smaller like uint8_t. Adding two 64-bit numbers is quite fast on modern microprocessors; in fact, it is usually the same speed as adding two 8-bit numbers, to say nothing of the data dependencies that are introduced by propagating carry bits through arrays of small numbers. These data dependencies mean that you will often only be able to add one element of your array per clock cycle, so you want those elements to be as large as possible. (At the hardware level, there are special tricks which allow carry bits to quickly move across the entire 64-bit operation, but those tricks are unavailable in software.)
If you desire, you can always use template specialization to choose the right sized primitives to make the most space-efficient bignums you want. Otherwise, using an array of uint64_t is much more typical.
If you have the choice, it is usually best to simply use GMP. Portions of GMP are written in assembly to make bignum operations much faster than they would be otherwise.
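For illustration, the carry chain that makes narrow limbs expensive looks roughly like this schoolbook addition over uint64_t limbs (a sketch, not GMP's actual code):

    #include <cstddef>
    #include <cstdint>

    // a += b, limb by limb, least-significant limb first. Every iteration depends on the
    // carry from the previous one, which is why you want each limb to be as wide as possible.
    void add_in_place(std::uint64_t* a, const std::uint64_t* b, std::size_t limbs) {
        std::uint64_t carry = 0;
        for (std::size_t i = 0; i < limbs; ++i) {
            std::uint64_t sum = a[i] + b[i];
            std::uint64_t overflow1 = (sum < a[i]) ? 1u : 0u;   // wrap on the first add
            a[i] = sum + carry;
            std::uint64_t overflow2 = (a[i] < sum) ? 1u : 0u;   // wrap on adding the carry
            carry = overflow1 + overflow2;                      // at most 1 in total
        }
        // a nonzero carry here means the fixed-width number overflowed
    }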
You may get better performance from larger types due to a decreased loop overhead. However, the tradeoff here is a better speed vs. less flexibility in choosing the size.
For example, if most of your numbers are, say, 5 bytes in length, switching to uint16_t would require an overhead of an extra byte. This means a memory overhead of 20%. On the other hand, if we are talking about really large numbers, say, 50 bytes or more, the memory overhead would be much smaller - on the order of 2% - so the increase in speed would be achieved at a much smaller cost.
I have to compare two larger objects for equality.
Properties of the objects:
Contain all their members by value (so no pointers to follow).
They also contain some std::arrays.
They contain some other objects for which 1 and 2 hold.
Size is up to several kB.
Some of the members will more likely differ than others, therefore lead to a quicker break of the comparison operation if compared first.
The objects do not change. Basically, the algorithm is just to count how many objects are the same. Each object is only compared once against several "master" objects.
What is the best way to compare these objects? I see three options:
Just use plain, non-overloaded operator==.
Overload == and perform a member-by-member comparison, beginning with members likely to differ.
Overload == and view the object as a plain byte field and compare word by word.
Some thoughts:
Option 1 seems good because it means the least amount of work (and opportunities to introduce errors).
Option 2 seems good, because I can exploit the heuristic about which elements are most likely to differ. But maybe it's still slower because the built-in == of option 1 is ridiculously fast.
Option 3 seems to be most "low-level" optimized, but that's what the compiler probably also does for option 1.
So the questions are:
Is there a well-known best way to solve the task?
Is one of the options an absolute no-go?
Do I have to consider something else?
The default == is fast for small objects, but if you have big data members to compare, try to find some optimizations by thinking about the specific data stored and the way it is updated, and define an overloaded == comparison operator that is smarter than the default one.
As many have already said, option 3 is wrong, because fields are generally padded to respect data alignment, and for optimization reasons the padding bytes are not initialized to 0 (maybe this is done in DEBUG builds).
I can suggest exploring the option of dividing the check into two stages:
first stage, create some sort of small & fast member that 'compresses' the status of the instance (think of it like a hash); this field could be updated every time some big field changes, for example the elements of the std::array. Then check the frequently changed fields and this 'compressed' status first to make a conservative comparison (for example, the sum of all ints in the array, or maybe the xor)
second stage, use an in-depth test on every member. This is the slowest but complete check, and it will likely be triggered only sometimes
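A sketch of that two-stage idea (the member names and the running-sum digest are invented for illustration):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct Big {
        std::array<std::int32_t, 1024> data{};
        std::int64_t summary = 0;   // cheap digest, kept in sync on every update

        void set(std::size_t i, std::int32_t v) {
            summary += static_cast<std::int64_t>(v) - data[i];   // keep the running sum valid
            data[i] = v;
        }

        friend bool operator==(const Big& a, const Big& b) {
            if (a.summary != b.summary) return false;   // stage 1: fast, conservative reject
            return a.data == b.data;                    // stage 2: full, exact comparison
        }
    };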
A good question.
If you have some heuristic about which members are likely to differ - use it. So overloading operator == and checking the suspected members first seems to be a good idea.
About byte-wise comparison (aka memcmp and friends) - it may be problematic due to struct member alignment. That is, the compiler sometimes puts "empty spaces" in your struct layout so that each member has the required alignment. These are not initialized and usually contain garbage.
This may be solved by explicitly zero-initializing your whole object. But I don't see any advantage of memcmp over the automatic operator ==, which is a member-wise comparison. It may save some code size (a single call to memcmp vs explicit reads and comparisons), but from the performance perspective the two seem to be pretty much the same.
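A sketch of the "likely-to-differ members first" ordering in an overloaded operator== (member names invented); the && short-circuits, so the cheap, frequently-differing member is tested before the big one:

    #include <array>

    struct Object {
        int version = 0;                     // changes often: compare first
        std::array<double, 512> payload{};   // large and rarely differs: compare last

        friend bool operator==(const Object& a, const Object& b) {
            return a.version == b.version && a.payload == b.payload;
        }
    };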
I'm currently evaluating whether I should utilize a single large bitset or many 64-bit unsigned integers (uint64_t) to store a large amount of bitmap information. In this case, the bitmap represents the current status of a few GB of memory pages (dirty / not dirty), and has thousands of entries.
The work which I am performing requires that I be able to query and update the dirty pages, including performing OR operations between two dirty page bitmaps.
To be clear, I will be performing the following:
Importing a bitmap from a file, and performing a bitwise OR operation with the existing bitmap
Computing the hamming weight (counting the number of bits set to 1, which represents the number of dirty pages)
Resetting / clearing a bit, to mark it as updated / clean
Checking the current status of a bit, to determine if it is clean
It looks like it is easy to perform bitwise operations on a C++ bitset, and easily compute the hamming weight. However, I imagine there is no magic here -- the CPU can only perform bitwise operations on as many bytes as it can store in a register -- so the routine utilized by the bitset is likely the same as the one I would implement myself. This is probably also true for the hamming weight.
In addition, importing the bitmap data from the file to the bitset looks ugly -- I need to perform bitshifts multiple times, as shown here. I imagine given the size of the bitsets I would be working with, this would have a negative performance impact. Of course, I imagine I could just use many small bitsets instead, but there may be no advantage to this (other than perhaps ease of implementation).
Any advice is appreciated, as always. Thanks!
Sounds like you have a very specific single-use application. Personally, I've never used a bitset, but from what I can tell its advantages are in being accessible as if it were an array of bools and, in the dynamic variants such as boost::dynamic_bitset, in being able to grow like a vector.
From what I can gather, you don't really have a need for either of those. If that's the case and if populating the bitset is a drama, I would tend towards doing it myself, given that it really is quite simple to allocate a whole bunch of integers and do bit operations on them.
Given that you have very specific requirements, you will probably benefit from making your own optimizations. Having access to the raw bit data is kind of crucial for this (for example, using pre-calculated tables of Hamming weights for a single byte, or even two bytes if you have memory to spare).
I don't generally advocate reinventing the wheel... But if you have special optimization requirements, it might be best to tailor your solution towards those. In this case, the functionality you are implementing is pretty simple.
Thousands of bits does not sound like a lot. But maybe you have millions.
I suggest you write your code as if you had the ideal implementation, by abstracting it (to begin with, use whatever implementation is easiest to code, ignoring any performance and memory problems), then try several alternative concrete implementations and verify, by measuring them, which performs best.
One solution that you did not even consider is to use Judy arrays (specifically Judy1 arrays).
I think if I were you I would probably just save myself the hassle of any DIY and use boost::dynamic_bitset. They've got all the bases covered in terms of functionality, including stream operator overloads which you could use for file IO (or just read your data in as unsigned ints and use their conversions, see their examples) and a count method for your Hamming weight. Boost is very highly regarded, at least by Sutter & Alexandrescu, and dynamic_bitset lives entirely in a header file--no linking, just #include the appropriate file. In addition, unlike the Standard Library bitset, you can wait until runtime to specify the size of the bitset.
Edit: Boost does seem to allow for the fast input reading that you need. dynamic_bitset supplies the following constructor:
template <typename BlockInputIterator>
dynamic_bitset(BlockInputIterator first, BlockInputIterator last,
const Allocator& alloc = Allocator());
The underlying storage is a std::vector (or something almost identical to it) of Blocks, e.g. uint64_t. So if you read in your bitmap as a std::vector of uint64_t, this constructor will write them directly into memory without any bit shifting.
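A minimal usage sketch of that constructor for this use case (assuming the file's 64-bit words have already been read into a vector):

    #include <boost/dynamic_bitset.hpp>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::size_t count_dirty(const std::vector<std::uint64_t>& blocks) {
        // The blocks are copied straight into the bitset's storage; no per-bit shifting.
        boost::dynamic_bitset<std::uint64_t> dirty(blocks.begin(), blocks.end());
        // dirty |= other;              // OR in another bitmap of the same size
        // dirty.reset(page);           // mark one page clean
        // bool d = dirty.test(page);   // query one page
        return dirty.count();           // Hamming weight = number of dirty pages
    }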