How to efficiently implement bitwise rotate of an arbitrary sequence? - c++

"The permutation p of n elements defined by an index permutation p(i) = (i + k) mod n is called the k-rotation." -- Stepanov & McJones
std::rotate has become a well known algorithm thanks to Sean Parent, but how to efficiently implement it for an arbitrary sequence of bits?
By efficient, I mean it minimizes at least two things: i) the number of writes, and ii) the worst-case space complexity.
That is, the input should be similar to std::rotate but bit-wise specific, I guess like this:
A pointer to the memory where the bit sequence starts.
Three bit indices: first, middle and last.
The pointed-to type could be any unsigned integer type, and presumably the larger the better. (Boost.Dynamic Bitset calls it the "block".)
It's important to note that the indices may all be offset from the start of a block by different amounts.
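Something like this sketch, perhaaps aside, by analogy with std::rotate (the function name and exact parameter spelling here are my own, not from any existing library):

#include <cstddef>

// Hypothetical interface: Block is the unsigned integer block type, and
// first/middle/last are bit indices measured from the first bit of *data,
// so each may sit at a different offset within its block.
template <typename Block>
void rotate_bits(Block* data, std::size_t first, std::size_t middle, std::size_t last);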
According to Stepanov and McJones, rotate on random access data can be implemented in n + gcd(n, k) assignments. The algorithm that reverses each subrange followed by reversing the entire range takes 3n assignments. (However, I agree with the comments below that it is effectively 2n assignments.) Since the bits in an array can be accessed randomly, I assume the same optimal bound applies. Each assignment will usually require two reads because of different subrange block offsets but I'm less concerned about reads than writes.
Does an efficient or optimal implementation of this algorithm already exist out in the open source wild?
If not, how would one do it?
I've looked through Hacker's Delight and Volume 4A of Knuth but can't find an algorithm for it.

Using a vector<uint32_t>, for example, it's easy and reasonably efficient to do the fractional-element part of the rotation in one pass yourself (shift_amount%32), and then call std::rotate to do the rest. The fractional part is easy and only operates on adjacent elements, except at the ends, so you only need to remember one partial element while you're working.
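A minimal sketch of that two-step approach, assuming an LSB-first bit order (bit i of the sequence is bit i % 32 of word i / 32) and a rotation of the whole vector; the helper name is mine:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result bit i = input bit (i + k) mod (32 * v.size()).
void rotate_bits_left(std::vector<uint32_t>& v, std::size_t k) {
    if (v.empty()) return;
    k %= 32 * v.size();
    const std::size_t s = k % 32;      // fractional (sub-word) part
    const std::size_t q = k / 32;      // whole-word part
    if (s != 0) {
        const uint32_t saved = v[0];   // the one partial element to remember
        for (std::size_t j = 0; j + 1 < v.size(); ++j)
            v[j] = (v[j] >> s) | (v[j + 1] << (32 - s));
        v.back() = (v.back() >> s) | (saved << (32 - s));
    }
    std::rotate(v.begin(), v.begin() + q, v.end());  // let std::rotate do the rest
}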
If you want to do the whole thing yourself, then you can do the rotation by reversing the order of the entire vector, and then reversing the order of the front and back sections. The trick to doing this efficiently is that when you reverse the whole vector, you don't actually bit-reverse each element -- you just think of them as being in the opposite order. The reversal of the front and back sections is trickier and requires you to remember 4 partial elements while you work.
In terms of writes to memory or cache, both of the above methods make 2N writes. The optimal rotation you refer to in the question takes N, but if you extend it to work with fractional-word rotations, then each write spans two words and it then takes 2N writes. It provides no advantage and I think it would turn out to be complicated.
That said... I'm sure you could get closer to N writes with a fixed amount of register storage by doing m words at a time, but that's a lot of code for a simple rotation, and your time (or at least my time :) would be better spent elsewhere.

Related

performance: find the indices of the max value in an array (ties allowed)

Just as the title says, and BTW, it's just out of curiosity, not a homework question. It might seem trivial to people with a CS background. The problem is that I would like to find the indices of the max value in an array. Basically I have two approaches.
scan once to find the maximum, then scan again to collect the vector of indices
scan once to find the maximum, and along this scan build the indices array, abandoning it whenever a better value appears
May I know how I should weigh these two approaches against each other in terms of performance (mainly time complexity, I suppose)? It is hard for me because I have no idea what the worst case is for the second approach! It's not a hard problem per se, but I just want to know how to approach this kind of problem, or how I should google this type of problem to get the answer.
In terms of complexity:
scan once to find the maximum,
then scan again to collect the vector of indices
First scan is O(n).
Second scan is O(n) + k insertions (with k the number of occurrences of the max value).
vector::push_back has amortized complexity O(1),
so the total is O(2 * n + k), which simplifies to O(n) since k <= n.
scan once to find the maximum,
and along this scan build the indices array, abandoning it whenever a better value appears
Scan is O(n).
The number of insertions is harder to compute.
The number of clears (and the number of elements cleared) is harder to compute too (clear's cost is at most the number of elements removed).
But both are bounded above by n, so the total is at most O(3 * n) = O(n); it is also at least O(n) for the scan itself, so it is O(n) too.
For performance timing, as always, you have to measure.
For the single-pass method, you can set a condition for adding an index to the array: whenever the max changes, you clear the array first. You don't need to iterate twice. (A sketch follows below.)
For the two-scan method, the implementation is easier: you just find the max on the first pass, then collect the matching indices on the second.
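A sketch of the single-pass version under those rules (the helper name is mine; int elements for concreteness):

#include <cstddef>
#include <vector>

std::vector<std::size_t> max_indices(const std::vector<int>& a) {
    std::vector<std::size_t> idx;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (idx.empty() || a[i] > a[idx[0]]) {
            idx.clear();       // a new maximum invalidates the old indices
            idx.push_back(i);
        } else if (a[i] == a[idx[0]]) {
            idx.push_back(i);  // a tie: remember this index too
        }
    }
    return idx;
}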
As stated in a previous answer, complexity is O(n) in both cases, and measurements are needed to compare performance.
However, I would like to add two points:
The first one is that the performance comparison may depend on the compiler and how optimisation is performed.
The second point is more critical: performance may depend on the input array.
For example, let us consider the corner case: 1, 1, 1, ..., 1, 2, i.e. a huge number of 1s followed by a single 2. With your second approach, you will build a huge temporary array of indices, only to end up with an array of one element. You could shrink the memory allocated to this array at the end, but I don't like the idea of creating an unnecessarily huge temporary vector, independently of the time-performance concern. Note that such an array could also suffer several reallocations, which would hurt time performance.
This is why, in the general case, without any knowledge of the input, I would prefer your first approach, two scans (sketched below). The situation could be different if you want to implement a function dedicated to a specific type of data.
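For comparison, a sketch of the two-scan approach preferred here (again, the helper name is mine); the result vector never grows beyond the final answer:

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<std::size_t> max_indices_two_scans(const std::vector<int>& a) {
    std::vector<std::size_t> idx;
    if (a.empty()) return idx;
    const int mx = *std::max_element(a.begin(), a.end());  // first scan
    for (std::size_t i = 0; i < a.size(); ++i)             // second scan
        if (a[i] == mx) idx.push_back(i);
    return idx;
}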

What's the most performant way to create a sorted copy of a C++ vector?

Given a C++ vector (let's say it's of doubles, and let's call it unsorted), what is the most performant way to create a new vector sorted which contains a sorted copy of unsorted?
Consider the following naïve solution:
std::vector<double> sorted = unsorted;
std::sort(sorted.begin(), sorted.end());
This solution has two steps:
Create an entire copy of unsorted.
Sort it.
However, there is potentially a lot of wasted effort in the initial copy of step 1, particularly for a large vector that is (for example) already mostly sorted.
Were I writing this code by hand, I could combine the first pass of my sorting algorithm with step 1, by having the first pass read the values from the unsorted vector while writing them, partially sorted as necessary, into sorted. Depending on the algorithm, subsequent steps might then work just with the data in sorted.
Is there a way to do such a thing with the C++ standard library, Boost, or a third-party cross-platform library?
One important point would be to ensure that the memory for the sorted C++ vector isn't needlessly initialized to zeroes before the sorting begins. Many sorting algorithms would require immediate random-write access to the sorted vector, so using reserve() and push_back() won't work for that first pass, yet resize() would waste time initializing the vector.
Edit: As the answers and comments don't necessarily see why the "naïve solution" is inefficient, consider the case where the unsorted array is in fact already in sorted order (or just needs a single swap to become sorted). In that case, regardless of the sort algorithm, with the naïve solution every value will need to be read at least twice--once while copying, and once while sorting. But with a copy-while-sorting solution, the number of reads could potentially be halved, and thus the performance approximately doubled. A similar situation arises, regardless of the data in unsorted, when using sorting algorithms that are more performant than std::sort (which may be O(n) rather than O(n log n)).
The standard library - on purpose - doesn't have a sort-while-copying function, because the copy is O(n) while std::sort is O(n log n).
So the sort will totally dominate the cost for any larger values of n. (And if n is small, it doesn't matter anyway).
Assuming the vector of doubles doesn't contain special numbers like NaN or infinity, the doubles can be treated as 64-bit sign + magnitude integers, which can be converted into a form usable by a radix sort, the fastest approach here. These sign + magnitude values need to be converted into 64-bit unsigned integers whose ordering matches the doubles'. The following macros convert back and forth (SM stands for sign + magnitude, ULL for unsigned long long, i.e. uint64_t); it's assumed that the doubles' bits are reinterpreted as unsigned long long before the macros are applied:
#define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
#define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))
Note that using these macros will treat negative zero as less than positive zero, but this is normally not an issue.
Since radix sort needs an initial read pass to generate a matrix of counts (which are then converted into the starting or ending indices of the logical bucket boundaries), that initial read pass can here double as the copy pass that also generates the matrix of counts. A base-256 sort would use a matrix of size [8][256], and after the copy, 8 radix sort passes would be performed. If the vector is much larger than the cache, the dominant cost will be the random-access writes during each radix sort pass.
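A sketch of that combined first pass (the function name and buffer layout are mine), using the SM2ULL macro above to turn each double into a radix-sortable key while the [8][256] count matrix is accumulated:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void copy_and_count(const std::vector<double>& unsorted,
                    std::vector<uint64_t>& keys,
                    uint64_t counts[8][256]) {
    keys.resize(unsorted.size());
    std::memset(counts, 0, 8 * 256 * sizeof(uint64_t));
    for (std::size_t i = 0; i < unsorted.size(); ++i) {
        uint64_t x;
        std::memcpy(&x, &unsorted[i], sizeof x);  // reinterpret the double's bits
        x = SM2ULL(x);                            // sign+magnitude -> unsigned order
        keys[i] = x;
        for (int d = 0; d < 8; ++d)               // one histogram per byte (digit)
            ++counts[d][(x >> (8 * d)) & 0xFF];
    }
}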

O(N) Identification of Permutations

This answer determines if two strings are permutations by comparing their contents. If they contain the same number of each character, they are obviously permutations. This is accomplished in O(N) time.
I don't like the answer though because it reinvents what is_permutation is designed to do. That said, is_permutation has a complexity of:
At most O(N²) applications of the predicate, or exactly N if the sequences are already equal, where N = std::distance(first1, last1)
So I cannot advocate the use of is_permutation where it is orders of magnitude slower than a hand-spun algorithm. But surely the implementers of the standard would not miss such an obvious improvement? So why is is_permutation O(N²)?
is_permutation works on almost any data type. The algorithm in your link works for data types with a small number of values only.
It's the same reason why std::sort is O(N log N) but counting sort is O(N).
It was I who wrote that answer.
When the string's value_type is char, the number of elements required in a lookup table is 256. For a two-byte encoding, 65536. For a four-byte encoding, the lookup table would have just over 4 billion entries, at a likely size of 16 GB! And most of it would be unused.
So the first thing is to recognize that even if we restrict the types to char and wchar_t, it may still be untenable. Likewise if we want to do is_permutation on sequences of type int.
We could have a specialization of std::is_permutation<> for integral types of size 1 or 2 bytes. But this is somewhat reminiscent of std::vector<bool>, which not everyone thinks was a good idea in retrospect.
We could also use a lookup table based on std::map<T, size_t>, but this is likely to be allocation-heavy so it might not be a performance win (or at least, not always). It might be worth implementing one for a detailed comparison though.
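A sketch of what that map-based counting might look like (generic, assuming the value type is less-than comparable so it can key a std::map; the function name is mine):

#include <cstddef>
#include <iterator>
#include <map>

template <typename It>
bool is_permutation_counting(It first1, It last1, It first2, It last2) {
    // Tally the first range up and the second range down; a permutation
    // leaves every tally at zero.
    std::map<typename std::iterator_traits<It>::value_type, std::ptrdiff_t> counts;
    for (It it = first1; it != last1; ++it) ++counts[*it];
    for (It it = first2; it != last2; ++it) --counts[*it];
    for (const auto& kv : counts)
        if (kv.second != 0) return false;
    return true;
}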
In summary, I don't fault the C++ standard for not including a high-performance version of is_permutation for char. First because in the real world I'm not sure it's the most common use of the template, and second because the STL is not the be-all and end-all of algorithms, especially where domain knowledge can be used to accelerate computation for special cases.
If it turns out that is_permutation for char is quite common in the wild, C++ library implementors would be within their rights to provide a specialization for it.
The answer you cite works on chars. It assumes they are 8 bit (not necessarily the case) and so there are only 256 possibilities for each value, and that you can cheaply go from each value to a numeric index to use for a lookup table of counts (for char in this case, the value and the index are the same thing!)
It generates a count of how many times each char value occurs in each string; then, if these distributions are the same for both strings, the strings are permutations of each other.
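In C++, one common variant of that counting approach looks roughly like this (a sketch assuming 8-bit chars; it tallies one string up and the other down, then checks that every tally returned to zero, which is equivalent to comparing the two distributions):

#include <array>
#include <string>

bool is_char_permutation(const std::string& s1, const std::string& s2) {
    if (s1.size() != s2.size()) return false;
    std::array<int, 256> counts{};           // one counter per possible byte value
    for (unsigned char c : s1) ++counts[c];
    for (unsigned char c : s2) --counts[c];
    for (int n : counts)
        if (n != 0) return false;
    return true;
}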
What is the time complexity?
you have to walk each character of each string, so M+N steps for two inputs of lengths M and N
each of these steps involves incrementing a count in a fixed-size table at an index given by the char, so it is constant time
So the overall time complexity is O(N+M): linear, as you describe.
Now, std::is_permutation makes no such assumptions about its input. It doesn't know that there are only 256 possibilities, or indeed that they are bounded at all. It doesn't know how to go from an input value to a number it can use as an index, never mind how to do that in constant time. The only thing it knows is how to compare two values for equality, because the caller supplies that information.
So, the time complexity:
we know it has to consider each element of each input at some point
we know that, for each element it hasn't seen before (I'll leave how that's determined, and why it doesn't affect the big-O complexity, as an exercise), it has no way to turn the element into any kind of index or key for a table of counts, so the only way it can count the occurrences of that element is a linear walk through both inputs to see how many elements match
so the complexity is going to be quadratic in the worst case.

Performance of doing bitwise operations on bitsets

In C++ if I do a logical OR (or AND) on two bitsets, for example:
bitset<1000000> b1, b2;
//some stuff
b1 |= b2;
Does this happen in O(n) or O(1) time? Why?
Also, can this be accomplished using an array of bools in O(1) time?
Thanks.
It has to happen in O(N) time since there is a finite number of bits that can be processed in any given chunk of time by a given processor platform. In other words, the larger the bit-set, the longer the amount of time each operation will take, and the increase will be linear with respect to the number of bits in the bitset.
You also end up with the same problem using the array of bool types. While each individual operation itself will take O(1) time, the total amount of time for N objects will be O(N).
It's impossible to perform a logical operation (e.g. OR or AND) on arbitrary arrays of flags in unit time. True Big-O analysis deals with runtime as the size of the data tends to infinity; a Core i7 is never going to OR together a billion bits in the same time it takes to OR together two bits.
I think it needs to be made clear that Big O describes an asymptotic bound on how the running time of a computation grows with the input size. If you can do the whole operation in a single computation or so, or in some small known number of steps that is much less than N, then it is constant time. If you need to iterate in some manner, think about how much of the data must be visited: in this case all the bits need to be checked, and there is no shortcut for bitwise OR, so N bits need to be computed, and therefore it's O(n). (It's actually a tighter bound than that, but we're dealing with just Big O.) An array itself stores N bits in it.
In fact, few things are really O(1). An index lookup at a known address through a pointer can be O(1), if you already know what you are looking up. But if you have M things that need to be looked up in constant time, then it is O(M) * O(1) = O(M).
This is a consequence of how modern computers work, since most things are processed sequentially. (Multi-core helps, but doesn't come close to affecting Big O notation yet.) There is, of course, the computer's ability to process whole words in parallel, but even that is just a constant factor: O(n) / O(64) is still O(n).
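To make the word-parallelism point concrete, here is roughly what the bitset OR boils down to internally (a sketch, not any particular library's actual code): 64 bits per iteration is a constant factor, and the loop is still linear in the number of bits.

#include <cstddef>
#include <cstdint>

void or_blocks(uint64_t* dst, const uint64_t* src, std::size_t nblocks) {
    for (std::size_t i = 0; i < nblocks; ++i)
        dst[i] |= src[i];  // 64 bits per step: O(n/64) is still O(n)
}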

Fast hamming distance between 2 bitset

I'm writing software that relies heavily on (1) accessing single bits and (2) computing Hamming distances between two bitsets A and B (i.e. the number of bits that differ between A and B). The bitsets are quite big, between 10K and 1M bits, and I have a bunch of them. Since it is impossible to know the bitset sizes at compile time, I'm using vector<bool>, but I plan to migrate to boost::dynamic_bitset soon.
Hereafter are my questions:
(1) Any ideas about which implementations have the fastest single bit access time?
(2) To compute the Hamming distance, the naive approach is to loop over the single bits and count the differences between the 2 bitsets. But my feeling is that it might be much faster to loop over bytes instead of bits, compute R = byteA XOR byteB, and look up in a 256-entry table the "local" distance associated with R. Another solution would be to store a 256 x 256 matrix and look up the distance between byteA and byteB directly, without any operation. So my question: any idea how to implement that from std::vector<bool> or boost::dynamic_bitset? In other words, do you know if there is a way to get access to the underlying byte array, or do I have to recode everything from scratch?
(1) Probably vector<char> (or even vector<int>), though that wastes at least 7/8 of the space on typical hardware. You don't need to unpack the bits if you use a byte or more to store each of them. Which of vector<bool> or dynamic_bitset is faster, I don't know; that might depend on the C++ implementation.
(2) boost::dynamic_bitset has operator^ and a count member, which together can be used to compute the Hamming distance in a probably fast, though memory-wasting way. You can also get to the underlying buffer with to_block_range; to use that, you need to implement a Hamming distance calculator as an OutputIterator.
If you do code from scratch, you can probably do even better than a byte at a time: take a word at a time from each bitset. The cost of XOR should be very low, then use either an implementation-specific builtin popcount, or else the fastest bit-twiddling popcount you can find (which may or may not involve a 256-entry lookup).
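A sketch of that word-at-a-time idea, assuming the bits are already packed into uint64_t blocks. __builtin_popcountll is a GCC/Clang builtin; substitute your compiler's popcount intrinsic or a lookup table if needed:

#include <cstddef>
#include <cstdint>

std::size_t hamming(const uint64_t* a, const uint64_t* b, std::size_t nblocks) {
    std::size_t dist = 0;
    for (std::size_t i = 0; i < nblocks; ++i)
        dist += __builtin_popcountll(a[i] ^ b[i]);  // differing bits in this block
    return dist;
}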
[Edit: it looks as if this could apply to boost::dynamic_bitset::to_block_range, with the Block chosen as either int or long. It's a shame that it writes to an OutputIterator rather than giving you an InputIterator: I can't immediately see how to use it to iterate over two bitsets together, except by using an extra thread or by copying one of the bitsets out to an int array first. Either way you'll pay some copy overhead that could have been avoided if it had left program control to you. The thread approach is pretty complicated for this task, and of course has its own overheads, and copying out the data probably isn't any better than just using operator^ and count().]
I know this will get downvoted for heresy, but here it is: you can get a pointer to the actual data from a vector using &vector[0] (for vector<bool>, YMMV). Then you can iterate over it C-style; meaning, cast your pointer to an int pointer or something big like that, perform your Hamming arithmetic as above, and move the pointer one word-length at a time. This only works because you know the bits are packed together contiguously, and it would be fragile (for example, if the vector is modified, it could move in memory).