I want to store integers longer than 64 bits. The number of bits per integer can grow into the millions as entries get added in the application.
Then, for 64 such integers (all of equal length), a bitwise AND operation has to be performed.
So what would be the best C++ data structure for these operations to be time-efficient?
Earlier I had considered vectors, since they allow the length to grow dynamically. The other option is std::bitset.
But I am not sure how to perform the bitwise ANDs with either approach so that it is done in the most time-efficient manner.
Thanks
The GNU Multiple Precision Arithmetic Library (GMP) is a good arbitrary-precision integer library. It is most likely heavily optimized for your compiler/CPU, so I'd go with it as a first step and roll your own specific implementation only if it turns out not to be fast enough.
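For instance, bitwise AND comes straight from GMP's C++ interface; a minimal sketch, assuming the gmpxx bindings are installed (link with -lgmpxx -lgmp):

#include <gmpxx.h>
#include <iostream>

int main() {
    // Construct two large integers from base-2 strings; these can be
    // millions of bits long in practice.
    mpz_class a("1111000011110000", 2);
    mpz_class b("1010101010101010", 2);
    mpz_class c = a & b;               // arbitrary-precision bitwise AND
    std::cout << c.get_str(2) << '\n'; // prints 1010000010100000
}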
It is quite expensive to reallocate memory for a vector once the data grows large, so I would define

struct int_node {
    std::bitset<256> holder;
    int_node *next_node;
};

I think this approach would save time on memory management and save some cycles on the bitwise operations.
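To AND two such numbers you would then walk both lists in lock-step. A minimal sketch, assuming both lists have the same number of nodes (it uses the int_node struct defined above):

#include <bitset>

// AND `other` into `self`, 256 bits per node.
void and_inplace(int_node *self, const int_node *other) {
    for (; self && other; self = self->next_node, other = other->next_node)
        self->holder &= other->holder;
}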
I am searching for a high-performance C++ structure for a table. The table will have void* as keys and uint32_t as values.
The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, uint32_t> or std::unordered_map<void*, uint32_t>. However, that would be overkill and would not give me the performance I want (those tables are designed to handle large numbers of items too).
So I thought about using std::vector<std::pair<void*, uint32_t>>, sorting it upon creation, and probing it linearly. My next idea was to use SIMD instructions, but I am not sure whether that is possible with the current structure.
Another solution, which I will evaluate shortly, is this:

struct Group
{
    void*    keys[5];   // searched using SIMD
    uint32_t values[5];
}; // fits in a cache line

struct Table
{
    Group* groups;
    size_t capacity;
};
Are there any better options? I need only one operation: looking up values by key; no modification, nothing else. Thanks!
EDIT: another thing I think I should mention is the access pattern: suppose I have an array of these hash tables; each lookup goes to a random one in the array.
Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).
If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<uint32_t>> instead of std::vector<std::pair<void*, uint32_t>> (which alternates void* keys and uint32_t values in memory, with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vectors is not great either, because you pay their overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded.
A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into 32-bit lower and upper parts. Indeed, it is very unlikely that two pointers have the same lower part (least significant bits) while having different upper parts. This trick lets you check twice as many pointers at a time.
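As an illustration, here is a hedged AVX2 sketch of that low-half comparison, assuming the 32-bit lower parts of the keys are stored contiguously (structure-of-arrays layout); a hit on the low half would still be confirmed against the full 64-bit pointer, since collisions are unlikely but possible:

#include <immintrin.h>
#include <cstdint>

// Compare the low 32 bits of `key` against 8 stored low halves at once.
// Returns an 8-bit mask with one bit set per matching lane (0 if no match).
int match_low_halves(const uint32_t *lows, const void *key) {
    __m256i needle   = _mm256_set1_epi32((uint32_t)(std::uintptr_t)key);
    __m256i haystack = _mm256_loadu_si256((const __m256i *)lows);
    __m256i eq       = _mm256_cmpeq_epi32(needle, haystack);
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}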
Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than what fits in one SIMD vector. For example, with AVX2 (on x86-64 processors), you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask off the unwanted values (or even avoid loading them if the memory buffer does not contain padding). This introduces an additional overhead. This is less of a problem with AVX-512 and SVE (only available on a small fraction of processors yet), since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not as strong for integer instructions). SIMD instructions also introduce some additional latency compared to scalar code (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try to write a scalar branchless implementation (possibly unrolled for better performance if the number of items is known at compile time), as sketched below.
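A hedged sketch of such a scalar branchless lookup, assuming the keys are kept in a fixed-size std::array; compilers typically turn the conditional select into a branchless conditional move:

#include <array>
#include <cstddef>

// Returns the index of `key` in `keys`, or -1 if absent, without branching
// on the comparison result.
template <std::size_t N>
int find_key(const std::array<void *, N> &keys, const void *key) {
    int found = -1;
    for (std::size_t i = 0; i < N; ++i)
        found = (keys[i] == key) ? (int)i : found; // select, don't branch
    return found;
}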
You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
I think a nice alternative is an optimization of your simple linear probing idea.
Your lookup procedure would look like this:
Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;    // precomputed probe length for this hash
for (; s < e; ++s) {
    if (s->key == key) {
        return s->value;
    }
}
return NOT_FOUND;
table[h].max_extent is the maximum number of elements you may have to look at when searching for an element with hash code h. You would precalculate this when you generate the table, so your lookup doesn't have to iterate until it hits a null. This greatly reduces the amount of probing you have to do for misses.
Of course you want max_extent to be as small as possible. Pick a hash result size (at least 2n, where n is the number of elements) to make it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different values of P. Fill your hash table in hash(key) order to produce the best result.
NOTE that we do not wrap around from the end to the start of the table while probing. Just allocate however many extra slots you need to avoid it.
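A hedged sketch of what the precomputation might look like; the Slot layout and compute_extents are hypothetical names, and the table is assumed to be zero-initialized and already filled in hash(key) order:

#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    void    *key;        // nullptr marks an empty slot
    uint32_t value;
    uint32_t max_extent; // probe length for keys whose home slot is here
};

// For each occupied slot i, extend the probe length recorded at the home
// index hash(key) so the lookup loop above covers slot i.
void compute_extents(std::vector<Slot> &table, size_t (*hash)(const void *)) {
    for (size_t i = 0; i < table.size(); ++i) {
        if (table[i].key == nullptr) continue;
        size_t h = hash(table[i].key);
        uint32_t ext = (uint32_t)(i - h + 1);
        if (ext > table[h].max_extent) table[h].max_extent = ext;
    }
}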
Suppose I created a class that takes a template parameter equal to the number of uint8_ts I want to string together into a big int.
This way I can create a huge int like this:
SizedInt<1000> unspeakablyLargeNumber; //A 1000 byte number
Now the question arises: am I killing my speed by using uint8_ts instead of a larger built-in type?
For example:
SizedInt<2> num1;
uint16_t num2;
Are num1 and num2 the same speed, or is num2 faster?
It would undoubtedly be slower to use uint8_t[2] instead of uint16_t.
Take addition, for example. In order to get the uint8_t[2] speed up to the speed of uint16_t, the compiler would have to figure out how to translate your add-with-carry logic and fuse those multiple instructions into a single, wider addition. I'm sure that some compilers out there are capable of such optimizations sometimes, but there are many circumstances which could make the optimization unlikely or impossible.
On some architectures, this will even apply to loading / storing, since uint8_t[2] usually has different alignment requirements than uint16_t.
Typical bignum libraries, like GMP, work with the largest words that are convenient for the architecture. On x64, this means using an array of uint64_t instead of an array of something smaller like uint8_t. Adding two 64-bit numbers is quite fast on modern microprocessors; in fact, it is usually the same speed as adding two 8-bit numbers, to say nothing of the data dependencies introduced by propagating carry bits through arrays of small numbers. These data dependencies mean that you will often only be able to add one element of your array per clock cycle, so you want those elements to be as large as possible. (At the hardware level, there are special tricks which allow carry bits to move quickly across an entire 64-bit addition, but these tricks are unavailable in software.)
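To make the carry chain concrete, here is a minimal sketch of schoolbook addition over 64-bit limbs, assuming both numbers have the same limb count; each limb's carry feeds the next iteration, which is exactly the data dependency described above:

#include <cstddef>
#include <cstdint>
#include <vector>

// a += b, least significant limb first.
void add_limbs(std::vector<uint64_t> &a, const std::vector<uint64_t> &b) {
    uint64_t carry = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        uint64_t sum = a[i] + carry;
        uint64_t c1  = sum < carry;    // overflow from adding the carry
        a[i] = sum + b[i];
        carry = c1 + (a[i] < sum);     // overflow from adding b's limb
    }
}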
If you desire, you can always use template specialization to choose the right sized primitives to make the most space-efficient bignums you want. Otherwise, using an array of uint64_t is much more typical.
If you have the choice, it is usually best to simply use GMP. Portions of GMP are written in assembly to make bignum operations much faster than they would be otherwise.
You may get better performance from larger types due to a decreased loop overhead. However, the tradeoff here is a better speed vs. less flexibility in choosing the size.
For example, if most of your numbers are, say, 5 bytes long, switching to uint16_t would require an extra byte per number (5 bytes round up to three 16-bit words, i.e. 6 bytes). That is a memory overhead of 20%. On the other hand, if we are talking about really large numbers, say 50 bytes or more, the memory overhead would be much smaller, on the order of 2%, so a speed increase would be achieved at a much smaller cost.
I'm creating a chess solver and have decided to use bitboards. Conveniently, there are 64 squares on a standard chess board, so given the prevalence of 64-bit processors, a single bitboard can fit into a single register.
That said, are there fundamental differences (size (memory and code), speed, complexity, memory usage, etc.) between using a std::bitset<64> with its member functions and using the fundamental type of the "same" size, unsigned long long, and performing the bit twiddling manually?
uint64_t, probably. You'll want to perform operations that are not available on std::bitset, including almost all arithmetic operations, bit scans, using parts of the board as an index into an array, and SSE intrinsics if you get serious about it.
For example (not an exhaustive list by any means, just some simple examples): in o^(o-2r) (and its cousin), in the more advanced Hyperbola Quintessence, as part of extracting the lowest set bit, etc.
You could use std::bitset but you'd be converting it back to some type of integer a lot.
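For instance, the o^(o-2r) trick mentioned above is plain integer arithmetic on the whole board, which std::bitset cannot express directly. An illustrative sketch, assuming occupied is the occupancy bitboard already masked to the ray's line and slider has a single bit set on the sliding piece's square:

#include <cstdint>

// Attacks along one positive ray: subtracting 2*slider borrows through the
// empty squares up to the first blocker; XOR recovers the changed squares.
uint64_t positive_ray_attacks(uint64_t occupied, uint64_t slider) {
    return occupied ^ (occupied - 2 * slider);
}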
Say I have 2 bitsets
bitset<1024> test, current;
How am I supposed to compute current modulo test and output the result in another bitset<1024>? Note that test may be of any form, not just a power of two.
I'm looking for an answer with either complete code or complete pseudocode. I will not accept answers that involve converting to any type other than bitset, because although using bitsets here may be slower, later in the program the bitsets are going to be very fast.
Here's something you could try if you don't want to implement the modulo algorithm yourself:
1. Instead of std::bitset, use boost::dynamic_bitset.
2. Use boost::to_block_range to copy the bitset's blocks to a buffer.
3. Use one of the many bigint libraries to represent a 128-byte integer.
4. Make the 128-byte bigint use the blocks copied in step 2.
5. Perform the modulo operation on the bigint.
6. Convert the result back to a dynamic_bitset.
7. Profit
Hopefully, there's a bigint library out there that lets you access its internal buffer, so that you can copy the bytes from the dynamic_bitset directly into the bigint.
And hopefully, the overhead of copying 128 bytes around is negligible compared to the modulo operation itself.
Oh, and the bigint representation should have the same byte order as the dynamic_bitset.
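A hedged sketch of the whole recipe, using boost::multiprecision::cpp_int as the bigint library (one choice among many; import_bits/export_bits are its block-transfer functions):

#include <boost/dynamic_bitset.hpp>
#include <boost/multiprecision/cpp_int.hpp>
#include <iterator>
#include <limits>
#include <vector>

boost::dynamic_bitset<> mod_bitset(const boost::dynamic_bitset<> &current,
                                   const boost::dynamic_bitset<> &test) {
    namespace mp = boost::multiprecision;
    const unsigned bits = std::numeric_limits<unsigned long>::digits;
    // Step 2: copy each bitset's blocks to a buffer.
    std::vector<unsigned long> cur(current.num_blocks()), tst(test.num_blocks());
    boost::to_block_range(current, cur.begin());
    boost::to_block_range(test, tst.begin());
    // Steps 3-4: build bigints from the blocks (least significant first).
    mp::cpp_int a, b;
    mp::import_bits(a, cur.begin(), cur.end(), bits, false);
    mp::import_bits(b, tst.begin(), tst.end(), bits, false);
    // Step 5: the modulo operation itself.
    mp::cpp_int r = a % b;
    // Step 6: convert the result back to a dynamic_bitset.
    std::vector<unsigned long> out;
    mp::export_bits(r, std::back_inserter(out), bits, false);
    boost::dynamic_bitset<> result(out.begin(), out.end());
    result.resize(current.size());
    return result;
}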
What I need to do is open a text file with 0s and 1s to find patterns between the columns in the file.
So my first thought was to parse each column into a big array of bools and then do the logic between the columns (now in arrays). Then I found that a bool is actually a byte, not a bit, so I would be wasting 7/8 of the memory by assigning each value to a bool.
Is it even relevant in a grid of 800x800 values? What would be the best way to handle this?
I would appreciate a code snippet in case it's a complicated answer.
You could use std::bitset or Boost's dynamic_bitset, which provide various methods to help you manage your bits.
For example, they support constructors which create bitsets from other built-in types like int or char. You can also export the bitset to an unsigned long or to a string (which could then be turned into a bitset again, and so on).
I once asked about concatenating bitsets, which turned out not to be possible to do efficiently. But perhaps you can use the information in that question too.
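A small illustration of those conversions with std::bitset, round-tripping through a string and back to an integer:

#include <bitset>
#include <iostream>
#include <string>

int main() {
    std::bitset<16> b(0xAC35);          // construct from an integer
    std::string s = b.to_string();      // "1010110000110101"
    std::bitset<16> again(s);           // construct from the string
    unsigned long v = again.to_ulong(); // back to an integer (44085)
    std::cout << s << " == " << v << '\n';
}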
You can use std::vector<bool>, which is a specialization of vector that uses a compact store for booleans: 1 bit per value, not 8 bits.
I think it was Knuth who said "premature optimization is the root of all evil." Let's find out a little bit more about the problem. Your array is 800**2 == 640,000 bytes, which is no big deal on anything more powerful than a digital watch.
Storing it as bytes may seem wasteful -- as you say, 7/8ths of the memory is redundant -- but on the other hand, most machines don't do bit operations as efficiently as byte operations; by saving the memory, you might waste so much effort masking and testing that you would have been better off with the byte model.
Then again, if what you want to do is look for larger patterns, a bitwise representation might be worthwhile because you can work with 8 bits at a time.
The real point here is that there are several possibilities, but no one can tell you the "right" representation without knowing what the problem is.
For that size of grid, your array of bools would be about 640KB. Whether that is a problem depends on how much memory you have. It would probably be the simplest representation for the logic-analysis code.
By grouping the bits and storing them in an array of ints, you could drop the memory requirement to 80KB, but the logic code would be more complicated, as you would constantly be isolating the bits you want to check.
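A minimal sketch of that packed layout (BitGrid is a hypothetical name), assuming an 800x800 grid stored one bit per cell, 32 cells per uint32_t, with the masking mentioned above:

#include <cstddef>
#include <cstdint>
#include <vector>

struct BitGrid {
    static const size_t W = 800;
    std::vector<uint32_t> words;
    BitGrid() : words((W * W + 31) / 32, 0) {}

    // Read one cell: pick the word, shift the bit down, mask it off.
    bool get(size_t row, size_t col) const {
        size_t i = row * W + col;
        return (words[i / 32] >> (i % 32)) & 1u;
    }
    // Write one cell: set or clear a single bit within its word.
    void set(size_t row, size_t col, bool v) {
        size_t i = row * W + col;
        uint32_t mask = 1u << (i % 32);
        if (v) words[i / 32] |= mask;
        else   words[i / 32] &= ~mask;
    }
};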