Fast hamming distance between 2 bitset - c++

I'm writing software that heavily relies on (1) accessing single bits and (2) Hamming distance computations between two bitsets A and B (i.e. the number of bits that differ between A and B). The bitsets are quite big, between 10K and 1M bits, and I have a bunch of them. Since it is impossible to know the bitset sizes at compile time, I'm using vector<bool>, but I plan to migrate to boost::dynamic_bitset soon.
Here are my questions:
(1) Any ideas about which implementations have the fastest single bit access time?
(2) To compute the Hamming distance, the naive approach is to loop over the single bits and count the differences between the 2 bitsets. But my feeling is that it might be much faster to loop over bytes instead of bits, compute R = byteA XOR byteB, and look up in a 256-entry table the "local" distance associated with R. Another solution would be to store a 256 x 256 matrix and directly look up the distance between byteA and byteB without any operation. So my question: any idea how to implement that from std::vector<bool> or boost::dynamic_bitset? In other words, do you know if there is a way to get access to the underlying byte array, or do I have to recode everything from scratch?

(1) Probably vector<char> (or even vector<int>), but that wastes at least 7/8 of the space on typical hardware. You don't need to unpack the bits if you use a byte or more to store each of them. Which of vector<bool> or dynamic_bitset is faster, I don't know; that might depend on the C++ implementation.
(2) boost::dynamic_bitset has operator^ and a count member, which together can be used to compute the Hamming distance in a probably fast, though memory-wasting way. You can also get to the underlying buffer with to_block_range; to use that, you need to implement a Hamming distance calculator as an OutputIterator.
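For instance, a minimal sketch of that operator^/count() approach (the function name is just illustrative; note the temporary bitset allocated by the XOR, which is the memory-wasting part):

#include <boost/dynamic_bitset.hpp>
#include <cstddef>

// Hamming distance via XOR and popcount of the result.
std::size_t hamming_distance(const boost::dynamic_bitset<>& a,
                             const boost::dynamic_bitset<>& b)
{
    return (a ^ b).count();   // (a ^ b) builds a temporary bitset
}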

If you do code from scratch, you can probably do even better than a byte at a time: take a word at a time from each bitset. The cost of XOR should be very low, then use either an implementation-specific builtin popcount, or else the fastest bit-twiddling popcount you can find (which may or may not involve a 256-entry lookup).
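As a rough sketch of the word-at-a-time idea, assuming the two bitsets have already been exposed as equally sized arrays of 64-bit blocks, and assuming a GCC/Clang-style __builtin_popcountll (on other compilers, substitute a table-based popcount):

#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t hamming_words(const std::vector<std::uint64_t>& a,
                          const std::vector<std::uint64_t>& b)
{
    std::size_t dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        dist += __builtin_popcountll(a[i] ^ b[i]);   // XOR, then count set bits
    return dist;
}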
[Edit: looks as if this could apply to boost::dynamic_bitset::to_block_range, with the Block chosen as either int or long. It's a shame that it writes to an OutputIterator rather than giving you an InputIterator -- I can't immediately see how to use it to iterate over two bitsets together, except by using an extra thread or else copying one of the bitsets out to an int array first. Either way you'll take some copy overhead that could have been avoided if it had left the program control to you. The thread is pretty complicated for this task, and of course has its own overheads, and copying out the data probably isn't any better than using operator^ and count().]

I know this will get downvoted for heresy, but here it is: you can get a pointer to the actual data from a vector using &vector[0] (for vector<bool>, YMMV). Then you can iterate over it using C-style functions; meaning, cast your pointer to an int pointer or something bigger, perform your Hamming arithmetic as above, and move the pointer one word length at a time. This only works because you know that the bits are packed together contiguously, and it is vulnerable to changes (for example, if the vector is modified, it could move memory locations).

Related

Why pair numbers using bit interleaving instead of separating by high and low bits?

I recently learned of Morton coding (Z-order curve) as a bitwise pairing function. It was presented to me as a computationally faster way to pair numbers compared to the Cantor pairing function.
The way Morton coding works is to combine two numbers by interleaving their bits and storing the result in a wider data type. For example, interleave the bits of two 8-bit integers and store the result as a 16-bit integer.
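For concreteness, a simple (unoptimized) sketch of that 8-bit-into-16-bit interleaving, one bit at a time:

#include <cstdint>

// Morton-encode two 8-bit values into one 16-bit value by interleaving bits.
std::uint16_t morton_encode_8x8(std::uint8_t x, std::uint8_t y)
{
    std::uint16_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result |= static_cast<std::uint16_t>((x >> i) & 1u) << (2 * i);      // even bits from x
        result |= static_cast<std::uint16_t>((y >> i) & 1u) << (2 * i + 1);  // odd bits from y
    }
    return result;
}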
Why would you want to interleave the bits instead of splitting the two numbers among the high and low bits of the target data type? I would expect using high and low bits to be faster still. When might there be an advantage in interleaving them?
Like the Cantor pairing function, and unlike concatenation, it does not place an a-priori bound on the coordinates. In other words, Morton coding can also be formulated for arbitrary-length integers. That is not really the case for concatenation, because while anything can be concatenated, the result would be ambiguous and its interpretation would depend on the original sizes of the coordinates. The sizes of all but one dimension have to be fixed to avoid ambiguity.
If it is used in a context where there is an a-priori bound anyway, and locality is not an issue, then of course concatenation is an even simpler option.
Locality is a commonly used advantage though. Two coordinates that are close by are mostly mapped relatively close by in terms of their Z-values as well. The Hilbert curve works even better for that purpose, but is harder to encode, decode, and offset (and like concatenation, it also depends on the size of the space which must be fixed in advance). Concatenated coordinates preserve locality in only one dimension (but really well) and not the other(s), but are the easiest to encode/decode/offset (when these things are possible at all, which means the size of all but one dimension must be predetermined).

What's the most performant way to create a sorted copy of a C++ vector?

Given a C++ vector (let's say it's of doubles, and let's call it unsorted), what is the most performant way to create a new vector sorted which contains a sorted copy of unsorted?
Consider the following naïve solution:
std::vector<double> sorted = unsorted;
std::sort(sorted.begin(), sorted.end());
This solution has two steps:
Create an entire copy of unsorted.
Sort it.
However, there is potentially a lot of wasted effort in the initial copy of step 1, particularly for a large vector that is (for example) already mostly sorted.
Were I writing this code by hand, I could combine the first pass of my sorting algorithm with step 1, by having the first pass read the values from the unsorted vector while writing them, partially sorted as necessary, into sorted. Depending on the algorithm, subsequent steps might then work just with the data in sorted.
Is there a way to do such a thing with the C++ standard library, Boost, or a third-party cross-platform library?
One important point would be to ensure that the memory for the sorted C++ vector isn't needlessly initialized to zeroes before the sorting begins. Many sorting algorithms would require immediate random-write access to the sorted vector, so using reserve() and push_back() won't work for that first pass, yet resize() would waste time initializing the vector.
Edit: As the answers and comments don't necessarily see why the "naïve solution" is inefficient, consider the case where the unsorted array is in fact already in sorted order (or just needs a single swap to become sorted). In that case, regardless of the sort algorithm, with the naïve solution every value will need to be read at least twice--once while copying, and once while sorting. But with a copy-while-sorting solution, the number of reads could potentially be halved, and thus the performance approximately doubled. A similar situation arises, regardless of the data in unsorted, when using sorting algorithms that are more performant than std::sort (which may be O(n) rather than O(n log n)).
The standard library - on purpose - doesn't have a sort-while-copying function, because the copy is O(n) while std::sort is O(n log n).
So the sort will totally dominate the cost for any larger values of n. (And if n is small, it doesn't matter anyway).
Assuming the vector of doubles doesn't contain special numbers like NaN or infinity, the doubles can be treated as 64-bit sign + magnitude integers, which can be converted for use in a radix sort, which is fastest here. These "sign + magnitude integers" need to be converted into 64-bit unsigned integers. The following macros can be used to convert back and forth; SM stands for sign + magnitude, ULL for unsigned long long (uint64_t). It's assumed that the doubles' bit patterns are reinterpreted as unsigned long long before using these macros:
#define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
#define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))
Note that using these macros will treat negative zero as less than positive zero, but this is normally not an issue.
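A hypothetical helper (the name is mine) showing how a double might be fed through SM2ULL; memcpy is one standards-friendly way to reinterpret the 64-bit pattern:

#include <cstdint>
#include <cstring>

// Bit-copy the double into a uint64_t, then remap so that unsigned integer
// order matches floating-point order.
static inline std::uint64_t double_to_sortable(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return SM2ULL(bits);
}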
Since radix sort needs an initial read pass to generate a matrix of counts (which are then converted into the starting or ending indices of the logical bucket boundaries), in this case the initial read pass would be a copy pass that also generates the matrix of counts. A base-256 sort would use a matrix of size [8][256], and after the copy, 8 radix sort passes would be performed. If the vector is much larger than cache size, the dominant time factor will be the random-access writes during each radix sort pass.
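A rough sketch of that combined copy-and-count pass, under the assumptions above (values already remapped to uint64_t, base-256 digits, a caller-provided counts[8][256] matrix zeroed beforehand):

#include <cstddef>
#include <cstdint>
#include <vector>

void copy_and_count(const std::vector<std::uint64_t>& src,
                    std::vector<std::uint64_t>& dst,
                    std::size_t counts[8][256])   // must be zero-initialized by caller
{
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const std::uint64_t v = src[i];
        dst[i] = v;                               // the "copy" half of the pass
        for (int pass = 0; pass < 8; ++pass)      // count each base-256 digit
            ++counts[pass][(v >> (8 * pass)) & 0xFF];
    }
}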

Advantages of a bit matrix over a bitmap

I want to create a simple representation of an environment that basically just records whether or not there is an object at each position.
I would thus only need a big matrix filled with 1's and 0's. It is important to work on this matrix efficiently, since I am going to perform randomly positioned get and set operations on it, but also iterate over the whole matrix.
What would be the best solution for this?
My approach would be to create a vector of vectors containing bit elements. Otherwise, would there be an advantage of using a bitmap?
Note that while std::vector<bool> may consume less memory, it is also slower than std::vector<char> (depending on the use case), because of all the bitwise operations. As with any optimization question, there is only one answer: try different solutions and profile properly.
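If you do go with vector<char>, one common layout is a single flat vector indexed as a 2D grid. A minimal sketch (the struct and names here are just illustrative):

#include <cstddef>
#include <vector>

struct OccupancyGrid {
    std::size_t width, height;
    std::vector<char> cells;   // 0 = empty, 1 = occupied, one byte per cell

    OccupancyGrid(std::size_t w, std::size_t h)
        : width(w), height(h), cells(w * h, 0) {}

    char get(std::size_t x, std::size_t y) const  { return cells[y * width + x]; }
    void set(std::size_t x, std::size_t y, char v) { cells[y * width + x] = v; }
};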

How to make a fast dictionary that contains another dictionary?

I have a map<size_t, set<size_t>>, which, for better performance, I'm actually representing as a lexicographically-sorted vector<pair<size_t, vector<size_t>>>.
What I need is a set<T> with fast insertion times (removal doesn't matter), where T is the data type above, so that I can check for duplicates (my program runs until there are no more unique T's being generated).
So far, switching from set to unordered_set has turned out to be quite beneficial (it makes my program run > 25% faster), but even now, inserting T still seems to be one of the main bottlenecks.
The maximum number of integers in a given T is around ~1000, and each integer is also <= ~1000, so the numbers are quite small (but there are thousands of these T's being generated).
What I have already tried:
Using unsigned short. It actually decreases performance slightly.
Using Google's btree::btree_map.
It's actually much slower because I have to work around the iterator invalidation.
(I have to copy the keys, and I think that's why it turned out slow. It was at least twice as slow.)
Using a different hash function. I haven't found any measurable difference as long as I use something reasonable, so it seems like this can't be improved.
What I have not tried:
Storing "fingerprints"/hashes instead of the actual sets.
This sounds like the perfect solution, except that the fingerprinting function needs to be fast, and I need to be extremely confident that collisions won't happen, or they'll screw up my program.
(It's a deterministic program that needs exact results; collisions render it useless.)
Storing the data in some other compact, CPU-friendly way.
I'm not sure how beneficial this would be, because it might involve copying around data, and most of the performance I've gained so far is by (cleverly) avoiding copying data in many situations.
What else can I do to improve the speed, if anything?
I am under the impression that you have 3 different problems here:
you need the T itself to be relatively compact and easy to move around
you need to quickly check whether a T is a possible duplicate of an already existing one
you finally need to quickly insert the new T in whatever data structure you have to check for duplicates
Regarding T itself, it is not yet as compact as it could be. You could probably use a single std::vector<size_t> to represent it:
N pairs
N Indexes
N "Ids" of I elements each
all that can be linearized:
[N, I(0), ..., I(N-1),
R(0) = Size(Id(0)), Id(0, 0), ... , Id(0, R(0)-1),
R(1) = ... ]
and this way you have a single chunk of memory.
Note: depending on the access pattern you may have to tweak it, specifically if you need random access to any ID.
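A sketch of that linearization, assuming the current representation is the vector<pair<size_t, vector<size_t>>> from the question (the flatten helper is mine, purely illustrative):

#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::size_t>
flatten(const std::vector<std::pair<std::size_t, std::vector<std::size_t>>>& t)
{
    std::vector<std::size_t> out;
    out.push_back(t.size());                       // N
    for (const auto& p : t)
        out.push_back(p.first);                    // I(0) .. I(N-1)
    for (const auto& p : t) {
        out.push_back(p.second.size());            // R(i)
        out.insert(out.end(), p.second.begin(), p.second.end());   // the ids
    }
    return out;
}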
Regarding the possibility of duplicates, a hash map seems indeed quite appropriate. You will need a good hash function, but with a single array of size_t (or unsigned short if you can, as it is smaller), you can just pick MurmurHash, CityHash or SipHash. They are all blazing fast and do their damnedest to produce good-quality hashes (not cryptographic ones; the emphasis is on speed).
Now, the question is where the time goes when checking for duplicates.
If you spend too much time checking for non-existing duplicates because the hash-map is too big, you might want to invest in a Bloom Filter in front of it.
Otherwise, check your hash function to make sure that it is really fast and has a low collision rate and your hash-map implementation to make sure it only ever computes the hash once.
Regarding insertion speed: normally a hash map, especially if well-balanced and pre-sized, should have one of the quickest insertions. Make sure you move data into it and do not copy it; if you cannot move, it might be worth using a shared_ptr to limit the cost of copying.
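For illustration only (the functor, constants, and names are mine, and a much simpler stand-in than the MurmurHash/CityHash suggested above): a minimal hash for a flattened vector<size_t> so it can sit in an unordered_set, plus a move-insert so the buffer isn't copied:

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

struct VecHash {
    std::size_t operator()(const std::vector<std::size_t>& v) const {
        std::uint64_t h = 0xcbf29ce484222325ull;          // FNV-1a style mixing
        for (std::size_t x : v) { h ^= x; h *= 0x100000001b3ull; }
        return static_cast<std::size_t>(h);
    }
};

void record(std::unordered_set<std::vector<std::size_t>, VecHash>& seen,
            std::vector<std::size_t> t)
{
    seen.insert(std::move(t));   // move the buffer in, do not copy it
}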
Don't be afraid of collisions: use a cryptographic hash, but choose a fast one. A 256-bit collision is MUCH less probable than a hardware error. Sun used this to deduplicate blocks in ZFS; ZFS uses SHA-256. You can probably use a less secure hash: if it takes $1,000,000 to find one collision, the hash isn't secure, but one collision doesn't seem likely to hurt your performance, and many collisions would cost many times $1,000,000. You can use something like an (unordered) multimap<SHA, T> to deal with collisions. By the way, ANY hash table suffers from collisions (or takes too much memory), so an ordered map (an rbtree in gcc) or btree_map has better guaranteed time. Also, a hash table can be DoSed via hash collisions; a secret salt can probably solve this problem. This is because the table size is much smaller than the number of possible hashes.
You can also:
1) use short ints
2) reinterpret your array as an array of something like uint64_t for fast comparison (plus some trailing elements), as in the sketch below, or even reinterpret it as an array of 128-bit (or 256-bit, depending on your CPU) values and compare via SSE. This should push your performance to the memory speed limit.
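A sketch of the chunked comparison (plain uint64_t loads rather than explicit SSE; the memcpy sidesteps alignment and strict-aliasing issues and typically compiles down to single loads):

#include <cstddef>
#include <cstdint>
#include <cstring>

bool equal_chunks(const void* pa, const void* pb, std::size_t n)
{
    const unsigned char* a = static_cast<const unsigned char*>(pa);
    const unsigned char* b = static_cast<const unsigned char*>(pb);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                 // compare 8 bytes at a time
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        if (wa != wb) return false;
    }
    return std::memcmp(a + i, b + i, n - i) == 0;   // the trailing bytes
}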
In my experience, SSE works fast only with aligned memory access. uint64_t comparison probably needs alignment for speed too, so you may have to allocate memory manually with proper alignment (allocate more and skip the first bytes). tcmalloc is 16-byte aligned, so it is uint64_t-ready. It is strange that you have to copy keys in the btree; you could avoid that using shared_ptr. With fast comparisons and a slow hash, a btree or std::map may turn out to be faster than a hash table. I guess any hash is slower than memory. You can also calculate the hash via SSE and can probably find a library that does it.
PS: I strongly recommend you use a profiler if you don't already. Please tell us what percentage of time your program spends inserting, comparing during insertion, and calculating hashes.

Best Data Structure for Genetic Algorithm in C++?

I need to implement a genetic algorithm customized for my problem (a college project), and the first version had it coded as a matrix of short (bits per chromosome x size of population).
That was a bad design, since I am declaring a short but only using the "0" and "1" values... but it was just a prototype and it worked as intended, and now it is time for me to develop a new, improved version. Performance is important here, but simplicity is also appreciated.
I researched around and came up with:
for the chromosome :
- String class (like "0100100010")
- Array of bool
- Vector<bool> (vector appears to be optimized for bool)
- Bitset (sounds the most natural one)
and for the population:
- C Array[]
- Vector
- Queue
I am inclined to pick vector for the chromosome and array for the population, but I would like the opinion of anyone with experience on the subject.
Thanks in advance!
I'm guessing you want random access to the population and to the genes. You say performance is important, which I interpret as execution speed. So you're probably best off using a vector of chromosomes for the population and a vector<char> for the genes. The reason for vector<char> is that bitset<> and vector<bool> are optimized for memory consumption, and are therefore slow. vector<char> will give you higher speed at the cost of 8x the memory (assuming char = byte on your system). So if you want speed, go with vector<char>. If memory consumption is paramount, then use vector<bool> or bitset<>. bitset<> would seem like a natural choice here; however, bear in mind that it is templated on the number of bits, which means that a) the number of genes must be fixed and known at compile time (which I would guess is a big no-no), and b) if you use different sizes, you end up with one copy of each of the bitset methods you use per bitset size (though inlining might negate this), i.e., code bloat. Overall, I would guess vector<bool> is better for you if you don't want vector<char>.
If you're concerned about the aesthetics of vector<char>, you could typedef char gene; and then use vector<gene>, which looks more natural.
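That is, something along these lines (the chromosome/population names are just illustrative, extending the typedef idea one level up):

#include <vector>

typedef char gene;                      // one gene per byte, values 0 or 1
typedef std::vector<gene> chromosome;   // a chromosome is a sequence of genes
typedef std::vector<chromosome> population;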
A string is just like a vector<char> but more cumbersome.
Specifically to answer your question: I am not exactly sure what you are suggesting. You talk about an array and a string class. Are you talking about the STL container classes, where you can have a queue, bitset, vector, linked list etc.? I would suggest a vector for your population (the closest thing to a C array there is) and a bitset for your chromosome if you are worried about memory capacity; otherwise stick with what you are already using, a vector of your string representation of your DNA ("10110110").
For ideas and a good tool to dabble with, I recommend you download and initially use this library. It works with the major compilers, works on Unix variants, and has all the source code.
All the framework stuff is done for you and you will learn a lot. Later on you could write your own code from scratch or inherit from these classes. You can also use them in commercial code if you want.
Because they are objects, you can change the representation of your DNA easily, from integers to reals to structures to trees to bit arrays, etc.
There is always a learning curve involved, but it is worth it.
I use it to generate thousands of neural nets then weed them out with a simple fitness function then run them for real.
galib
http://lancet.mit.edu/ga/
Assuming that you want to code this yourself (if you want an external library, kingchris seems to have a good one there), it really depends on what kind of manipulation you need to do. To get the most bang for your buck in terms of memory, you could use any integer type and set/manipulate individual bits via bitmasks etc. Now, this approach is likely not optimal in terms of ease of use. The string example above would work OK; however, again, it's not significantly different from the shorts, since here you are just representing either '0' or '1' with an 8-bit value as opposed to a 16-bit value. Also, again depending on the manipulation, the string case will probably prove unwieldy. So if you could give some more info on the algorithm, we could maybe give more feedback. Myself, I like the individual bits as part of an integer (a bitset), but if you aren't used to masks, shifts, and all that good stuff, it may not be right for you.
I suggest writing a class for each member of the population; that simplifies things considerably, since you can keep all your member-relevant functions in the same place, nicely wrapped with the actual data.
If you need an "array of bools", I suggest using an int or several ints (then use masks and bitwise operations to access (modify / flip) each bit), depending on the number of your chromosomes.
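For example, a minimal sketch of those mask-and-shift operations on a single 32-bit integer used as a 32-slot array of bools (helper names are just illustrative):

#include <cstdint>

inline bool get_bit  (std::uint32_t bits, unsigned i)  { return (bits >> i) & 1u; }
inline void set_bit  (std::uint32_t& bits, unsigned i) { bits |=  (1u << i); }
inline void clear_bit(std::uint32_t& bits, unsigned i) { bits &= ~(1u << i); }
inline void flip_bit (std::uint32_t& bits, unsigned i) { bits ^=  (1u << i); }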
I usually used some sort of collection class for the population, because just an array of population members doesn't allow you to simply add to your population. I would suggest implementing some sort of dynamic list (if you are familiar with ArrayList then that is a good example).
I had major success with genetic algorithms with the recipe above. If you prepare your member class properly it can really simplify things and allows you to focus on coding better genetic algorithms instead of worrying about your data structures.