I need to do a large amount of simple arithmetic on equal-sized arrays of small integers. There are only three kinds of operations: (i) element-wise addition of arrays, (ii) element-wise subtraction of arrays, and (iii) comparison of whether all elements in one array are no less than / no greater than their counterparts in another.
To boost cache locality and computation speed, I cram the small integers of every array, bit by bit, into a certain number of 64-bit integers. How many 64-bit integers are needed is determined by the number of bits assigned to each array element. Let a[j] denote an array element. My bit layout for a[j] consists of (i) enough bits to hold the largest absolute value a[j] could reach during computation, (ii) a sign bit, and (iii) a bit to the left of the sign bit. The leftmost bit holds the possible carry from the right and gets zeroed after every addition or subtraction.
Below is a toy example of adding, subtracting and comparing two 64-bit integers, each of which packs five small integers: the first 10 bits, the next 5 bits, the next 10 bits, the next 13 bits, and the next 20 bits. The remaining bits are unused and set to 0.
// leftmostBitMask =
// 0b0111111111011110111111111011111111111101111111111111111111000000
//   ^         ^    ^         ^            ^
//   leftmost (carry) bit of each field
std::size_t add(std::size_t x, std::size_t y, std::size_t leftmostBitMask)
{
return (x + y) & leftmostBitMask;
}
std::size_t minus(std::size_t x, std::size_t y, std::size_t leftmostBitMask)
{
return (x - y + ((~leftmostBitMask) << 1)) & leftmostBitMask;
}
bool notAllGreaterEqual(std::size_t x, std::size_t y, std::size_t leftmostBitMask)
{
// return (minus(x, y, leftmostBitMask) & (leftmostBitMask >> 1)) == 0;
return (x - y) & ((~leftmostBitMask) >> 1);
}
My algorithms seem complex, especially the comparison function. Are there any faster solutions?
Thanks!
BTW, SIMD is not what I am describing. My question is about a level of optimization one step below SIMD.
More background: the idea serves a fairly complex search algorithm in multidimensional space. We observed large differences between the magnitudes of values in different dimensions. For instance, in an important 6-dimensional test case, one dimension could reach 50000 in absolute value while all the others stayed well below 1000. Without integer compression, each object requires a 32-bit array of size 6, whereas integer compression reduces it to a single 64-bit integer. That reduction is what prompted me to think about cramming integers.
After careful thought and comprehensive simulations, the algorithms listed in the question turned out to be largely over-engineered. The leftmost bit for receiving the carry is unnecessary. The code below works:
// signBitMask =
// 0b1000000000100001000000000100000000000010000000000000000000000000
//   ^         ^    ^         ^            ^
//   sign bit of each element
std::size_t add(std::size_t x, std::size_t y)
{
return x + y;
}
std::size_t subtract(std::size_t x, std::size_t y)
{
return x - y;
}
bool notAllGreaterEqual(std::size_t x, std::size_t y, std::size_t signBitMask)
{
return ((x - y) & signBitMask) != 0;
}
The key factor here is that every comparison made on two arrays is AND-based. We require notAllGreaterEqual() to return true whenever at least one elemental integer in x is below its counterpart in y. At first glance the solution above can hardly be correct: what happens when a negative elemental integer is added to a positive counterpart and the result stays positive? There must be a carry over the sign bit, so doesn't that contaminate the next elemental integer? The answer is yes, but it does not matter: collectively, notAllGreaterEqual() still fully serves its purpose. Instead of thinking in bits, one can easily prove notAllGreaterEqual() correct with elementary algebra. Problems arise only if we want to recover the integer array from those 64-bit buffers.
Creating the 64-bit buffer consists of (i) casting each integer to std::size_t, (ii) shifting it by its pre-computed bit offset, and (iii) adding the shifted integers together. If an integer is negative, the bits to its left must be filled with 1s (sign extension).
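To make the packing and the comparison concrete, here is a minimal sketch with a hypothetical two-element layout (8-bit fields with sign bits at bits 7 and 15, not the layout from the question); the asserts check that the sign-bit test keeps giving the element-wise answer even though individual fields cannot always be read back directly:
#include <cassert>
#include <cstddef>

// Hypothetical toy layout (not the question's): element 0 in bits 0..7,
// element 1 in bits 8..15, sign bits at bits 7 and 15.
constexpr std::size_t toySignBitMask = 0x8080;

// Pack as described above: cast to std::size_t (which sign-extends negative
// values), shift into place, add.
std::size_t pack2(int e0, int e1)
{
    return static_cast<std::size_t>(e0) + (static_cast<std::size_t>(e1) << 8);
}

bool notAllGreaterEqual(std::size_t x, std::size_t y, std::size_t signBitMask)
{
    return ((x - y) & signBitMask) != 0; // true iff some element of x is below its counterpart in y
}

int main()
{
    // pack2(-1, 3) is 0x2FF: its upper field reads 2 rather than 3 because the
    // sign extension of -1 borrows from it, yet the buffer still behaves
    // algebraically like {-1, 3}.
    std::size_t s = pack2(-1, 3) + pack2(2, 1); // element-wise {1, 4}, stored as 0x0401

    assert(!notAllGreaterEqual(s, pack2(0, 4), toySignBitMask)); // 1 >= 0 and 4 >= 4
    assert( notAllGreaterEqual(s, pack2(0, 5), toySignBitMask)); // 4 < 5
    assert( notAllGreaterEqual(s, pack2(2, 0), toySignBitMask)); // 1 < 2
}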
I am attempting to vectorize this fairly expensive function (scalar version now working!):
template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
return (x - std::trunc(x)) < 0.000001;
}//End of isPower
Here's what I have so far (for 32-bit int only):
template<typename RETURN_T>
inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
{
RETURN_T cnt = 0;
const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(power));
const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);
__m256i _count = _mm256_setzero_si256();
__m256i _N_INT = _mm256_setzero_si256();
__m256 _N_DBL = _mm256_setzero_ps();
__m256 LOG_OF_N = _mm256_setzero_ps();
__m256 DIVIDE_LOG = _mm256_setzero_ps();
__m256 TRUNCATED = _mm256_setzero_ps();
__m256 CMP_MASK = _mm256_setzero_ps();
for (size_t i = 0uz; (i + 8uz) < arr.size(); i += 8uz)
{
//Set Values
_N_INT = _mm256_load_si256((__m256i*) &arr[i]);
_N_DBL = _mm256_cvtepi32_ps(_N_INT);
LOG_OF_N = _mm256_log_ps(_N_DBL);
DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);
TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));
CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);
_count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
}//End for
cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));
return cnt;
}//End of count_powers_of
The scalar version runs in about 14.1 seconds.
The scalar version called from std::count_if with par_unseq runs in 4.5 seconds.
The vectorized version runs in just 155 milliseconds but produces the wrong result, albeit one that is now much closer.
Testing:
int64_t count = 0;
for (size_t i = 0; i < vec.size(); ++i)
{
if (isPower(vec[i], 4))
{
++count;
}//End if
}//End for
std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000
std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array
This new, vastly simplified code often produces results that are slightly off from the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to __m256i, but when I try to use a conversion (with floor) instead I get a number that's way off (in the billions again).
It could also be this sum function (based on code by @PeterCordes):
// Horizontal sum of the four 32-bit lanes of a 128-bit vector.
inline uint32_t _mm_sum_epi32(__m128i& x)
{
__m128i hi64 = _mm_unpackhi_epi64(x, x); // bring the high 64 bits down
__m128i sum64 = _mm_add_epi32(hi64, x); // pairwise sums in the low lanes
__m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1)); // swap adjacent 32-bit elements
__m128i sum32 = _mm_add_epi32(sum64, hi32); // total now in lane 0
return _mm_cvtsi128_si32(sum32); // extract lane 0
}
// Horizontal sum of the eight 32-bit lanes of a 256-bit vector.
inline uint32_t _mm256_sum_epi32(__m256i& v)
{
__m128i sum128 = _mm_add_epi32(
_mm256_castsi256_si128(v), // low 128 bits
_mm256_extracti128_si256(v, 1)); // high 128 bits
return _mm_sum_epi32(sum128);
}
I know this has got to be a floating-point precision/comparison issue. Is there a better way to approach this?
Thanks for all your insights and suggestions thus far.
A more sensible unit test would be non-random: check all powers in a loop to make sure they're all reported true, like x *= base;, and count how many powers there are <= n. Then check all numbers from 0..n in a loop, once each, to verify the right total. If both those checks succeed, that means it returned false in all the cases it should have; otherwise the count would be wrong.
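A minimal sketch of such a test, reusing the scalar isPower from the question (the test function itself is just an illustration, and with the current isPower it will report a wrong total):
#include <cmath>
#include <cstdint>
#include <iostream>

// The scalar isPower from the question, reproduced so this sketch is self-contained.
template<typename N, typename POW>
bool isPower(const N n, const POW p)
{
    double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
    return (x - std::trunc(x)) < 0.000001;
}

// Check every exact power of `base` up to n individually, then check the count
// over the whole 0..n range against that total.
void test_powers_of(int64_t base, int64_t n)
{
    int64_t expected = 0;
    for (int64_t x = 1; x <= n; ) {
        if (!isPower(x, base))
            std::cout << "missed power: " << x << '\n';
        ++expected;
        if (x > n / base) break; // next x *= base would exceed n
        x *= base;
    }

    int64_t counted = 0;
    for (int64_t x = 0; x <= n; ++x)
        counted += isPower(x, base) ? 1 : 0;

    if (counted != expected)
        std::cout << "wrong total: counted " << counted << ", expected " << expected << '\n';
}

int main() { test_powers_of(4, 1000); }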
Re: the original version:
This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.
Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.
I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.
I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
You might want to special-case integral N with a power-of-2 pow. That can go vastly faster by checking that n has a single bit set ((n & (n-1)) == 0) and that the bit is at a valid position (e.g. for pow=4, the bit must be at an even position, i.e. (n & 0b...10101010) == 0). You can construct that constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
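For pow = 4 that per-lane test might look something like this (untested sketch, AVX2 intrinsics; the helper name is made up):
#include <immintrin.h>

// A lane counts as a power of 4 if it is non-zero, has a single bit set, and
// that bit sits at an even position.  Compare results are 0 / -1 per lane, so
// subtracting them accumulates counts.
static inline __m256i accumulate_pow4(__m256i v, __m256i counts)
{
    const __m256i zero     = _mm256_setzero_si256();
    const __m256i odd_bits = _mm256_set1_epi32(int(0xAAAAAAAAu)); // bits at odd positions

    __m256i nm1        = _mm256_sub_epi32(v, _mm256_set1_epi32(1));
    __m256i single_bit = _mm256_cmpeq_epi32(_mm256_and_si256(v, nm1), zero); // n & (n-1) == 0
    __m256i even_pos   = _mm256_cmpeq_epi32(_mm256_and_si256(v, odd_bits), zero);
    __m256i is_zero    = _mm256_cmpeq_epi32(v, zero);

    __m256i is_pow4 = _mm256_andnot_si256(is_zero, _mm256_and_si256(single_bit, even_pos));
    return _mm256_sub_epi32(counts, is_pow4);
}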
Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t. e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, has powers up to 2^31 that still fit in an unsigned int.) log_3(2^31) is 19, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5 so you'd only need 2 vectors to hold every non-overflowing power.
That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!
Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
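A simplified, untested sketch of that broadcast-and-compare idea for base 4 (single accumulator instead of the multiple accumulators suggested above; the function name is made up):
#include <immintrin.h>
#include <cstdint>
#include <vector>

// Only 16 powers of 4 fit in int32_t, so two __m256i vectors hold them all.
static int64_t count_powers_of_4(const std::vector<int32_t>& arr)
{
    alignas(32) int32_t powers[16];
    int64_t p = 1;
    for (int i = 0; i < 16; ++i) { powers[i] = (int32_t)p; p *= 4; }
    const __m256i pow_lo = _mm256_load_si256((const __m256i*)&powers[0]);
    const __m256i pow_hi = _mm256_load_si256((const __m256i*)&powers[8]);

    __m256i counts = _mm256_setzero_si256();   // per-lane counters (fine while counts stay below 2^31)
    for (int32_t x : arr) {
        __m256i v  = _mm256_set1_epi32(x);     // broadcast one element
        __m256i eq = _mm256_or_si256(_mm256_cmpeq_epi32(v, pow_lo),
                                     _mm256_cmpeq_epi32(v, pow_hi));
        counts = _mm256_sub_epi32(counts, eq); // -1 in at most one lane per match
    }

    // horizontal sum of the 8 lanes
    alignas(32) int32_t lanes[8];
    _mm256_store_si256((__m256i*)lanes, counts);
    int64_t total = 0;
    for (int i = 0; i < 8; ++i) total += lanes[i];
    return total;
}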
Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:
_mm256_rem_epi32 is a slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative. Or if you need to handle negative remainders, you can use the tricks compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division.
Updated version using (x - std::trunc(x)) < 0.000001;
This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.
I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.
__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's not part of a standard library header, so it shouldn't be using a name with a leading underscore followed by a capital letter; such names are reserved for the implementation.)
Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.
At runtime I have 2 ranges defined by their uint32_t borders a..b and c..d. The first range tends to be much greater than the second: 8 < (b - a) / (d - c) < 64.
Exact limits: a >= 0, b <= 2^31 - 1, c >= 0, d <= 2^20 - 1.
I need a routine that performs linear mapping of an integer from the first range onto the second one: f(uint32_t x) -> round_to_uint32_t((float)(x - a) / (b - a) * (d - c) + c).
When b - a >= d - c it is important to maintain the ratio as close to ideal as possible; otherwise, in cases where an element from [a; b] can be mapped onto more than one integer from [c; d], it is okay to return any of those integers.
This sounds like a simple ratio problem that has already been answered in many questions like
Convert a number range to another range, maintaining ratio
but here I need a really really fast solution.
This routine is a pivotal part of a specialized sorting algorithm and will be called at least once for every element of a sorted array.
SIMD solution is also acceptable if it doesn't drop overall performance.
Actual runtime division (FP and integer) is very slow so you definitely want to avoid that. The way you wrote that expression probably compiles to include a division because FP math is not associative (without -ffast-math); the compiler can't turn x / foo * bar into x * (bar/foo) for you, even though that's very good with loop-invariant bar/foo. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
_mm256_fmadd_ps looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier (d - c) / (b - a). If float rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop. Like
_mm256_set1_ps((d - c) / (double)(b - a)). Using double for this calculation avoids rounding error during conversion to FP of the division operands.
You're reusing the same a,b,c,d for many x, presumably coming from contiguous memory. You're using the result as part of a memory address so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)
Modern x86 CPUs have 2/clock load throughput so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x __m256i before looping through that scalar.
Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.
Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.
If not SIMD, then even scalar C++ fma() is still good, for vfmadd213sd, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (vcvtps2dq rather than vcvttps2dq).
Note that uint32_t <-> float conversion isn't directly available until AVX512. For scalar you can just convert to/from int64_t with truncation / zero-extension for the unsigned low half.
It's very convenient that (as discussed in comments) your inputs are range-limited so if you interpret them as signed integers they have the same value (signed non-negative). Both x and x-a (and b-a) are known to be positive and <= INT32_MAX i.e 0x7FFFFFFF. (Or at least non-negative. Zero is fine.)
Float Rounding
For SIMD, single-precision float is very good for throughput, with efficient packed conversion to/from signed int32_t. But not every int32_t can be exactly represented as a float: larger values get rounded to the nearest multiple of 2, then 2^2, 2^3, or more, the farther above 2^24 the value is.
Using SIMD double is possible but requires some shuffling.
I don't think float is usually a problem for the formula as-written with (float)(x-a). If the b-a input range is large, that means both ranges are large and rounding error isn't going to map all possible x values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher x-a values.
But if we want to factor out the -a * (d - c) / (b - a) part and combine it with the +c at the end, then
We potentially have precision loss from catastrophic cancellation in that value to be added.
We need to do (float)x on the raw input value. If a is huge and b-a is small, i.e. a small range near the top of the possible input range, rounding error can map all possible x values to the same float.
To make best use of FMA, we want to do the +c before converting back to integer, which again risks output rounding error if the d-c is a small output range but c is huge. In your case not a problem; with d <= 2^20 - 1 we know that float can exactly represent every output integer value in that c..d range.
If you didn't have the input range constraint, you could range-shift to/from signed before the scaling by using integer (x-a)+0x80000000U on input and ...+c+0x80000000U on output (after rounding to nearest int32_t). But that would introduce huge float rounding error for small uint32_t inputs (close to 0) which get range-shifted to close to INT_MIN.
We don't need to range-shift for the b-a or d-c because the + or - or XOR with 0x80000000U would cancel out in the subtractions.
Example:
The const vectors should be hoisted out of a loop by the compiler after this inlines,
or you can do that manually.
This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.
// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
double scale_scalar = (d - c) / (double)(b - a);
const __m256 scale = _mm256_set1_ps(scale_scalar);
const __m256 add = _mm256_set1_ps(c - a*scale_scalar); // i.e. -a*scale + c, written to avoid negating the unsigned a
// (x-a) * scale + c
// = x * scale + (-a*scale + c) but with different rounding error from doing -a*scale + c
__m256 in = _mm256_cvtepi32_ps(data);
__m256 out = _mm256_fmadd_ps(in, scale, add);
return _mm256_cvtps_epi32(out); // convert back with round to nearest-even
// _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}
Or a safer version, doing the input range-shift with integer math: You could easily avoid FMA here if necessary for portability (to just AVX1) and use an integer add for the output, too. But we know the output range is small enough that float can always exactly represent any integer in it.
static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
const __m256 cvec = _mm256_set1_ps(c);
__m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-a)); // add can more easily fold a load of a memory operand than sub because it's commutative. Only some compilers will do this for you.
__m256 in_fp = _mm256_cvtepi32_ps(in_offset);
__m256 out = _mm256_fmadd_ps(in_fp, scale, cvec); // in*scale + c
return _mm256_cvtps_epi32(out);
}
Without FMA you could still use vmulps. You might as well convert back to integer before adding c if you're doing that, although vaddps would be safe.
You might use this in a loop like
void foo(uint32_t *arr, ptrdiff_t len)
{
if (len < 24) special case;
alignas(32) uint32_t tmpbuf[16];
// peel half of first iteration for software pipelining / loop rotation
__m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
__m256i outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)tmpbuf, outrange);
// could have used an unsigned loop counter
// since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
for (ptrdiff_t i = 0 ; i < len-(15+8) ; i+=16 ) {
// prep next 8 elements
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)&tmpbuf[8], outrange);
// use first 8 elements
for (int j=0 ; j<8 ; j++) {
use tmpbuf[j] which corresponds to arr[i+j]
}
// prep 8 more for next iteration
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)&tmpbuf[0], outrange);
// use 2nd 8 elements
for (int j=8 ; j<16 ; j++) {
use tmpbuf[j] which corresponds to arr[i+j]
}
}
// use tmpbuf[0..7]
// then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}
These variable-names sound dumb but I couldn't think of anything better.
This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time used right away. (Optimize the reload of the first element from a reload to vmovd using _mm_cvtsi128_si32(_mm256_castsi256_si128(outrange)) if you want.)
Special cases
If there cases where you know (b - a) is a power of 2, you could bitscan with tzcnt or bsf, then multiply. (There are intrinsics for those, like GNU C __builtin_ctz() to count trailing zeros.)
Or can you ensure that (b - a) is always a power of 2?
Or better, if (b - a) / (d - c) is an exact power of 2 the whole thing can just be sub / right shift / add.
If you can't always ensure that, you'd still need the general case sometimes, but it may be possible to handle the power-of-2 case efficiently when it does occur.
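An untested sketch of that power-of-2 special case (the struct is just an illustration; it assumes the caller has already verified that (b - a) / (d - c) is exactly a power of 2):
#include <cstdint>

// If (b - a) / (d - c) is exactly 2^k, the whole mapping collapses to
// subtract / shift / add, with no FP at all.
struct Pow2RangeMapper {
    uint32_t a, c;
    unsigned shift;   // k, e.g. from __builtin_ctz((b - a) / (d - c))

    uint32_t operator()(uint32_t x) const { return c + ((x - a) >> shift); }
};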
I'm using the following code to map two signed 16-bit integers to the upper and lower 16 bits of an unsigned 32 bit integer.
inline uint32_t to_score(int16_t mg, int16_t eg) {
return ((1u * mg) << 16 | (eg & 0xFFFF));
}
inline int16_t extract_mg(uint32_t score) {
return int16_t(score >> 16);
}
inline int16_t extract_eg(uint32_t score) {
return int16_t(score & 0xFFFF);
}
I need to perform various calculations on both the mg and eg parts simultaneously, before interpolating the two parts at the end of a function.
As I understand it, as long as there is no overflow, it should be safe to add two uint32_ts created by to_score, and then extract the int16_ts to find the results of the individual calculations, i.e. the results I would get if I added the values for mg and eg separately.
I'm not sure whether this assumption holds if either mg or eg is negative, or whether this method can be used for subtraction, multiplication and/or division.
Which operations can I expect to function correctly? Are there alternative ways of representing two integers which can be added/subtracted/multiplied quickly?
There will be a problem with a carry going from the low half into the high half, but it can be avoided with extra operations, as detailed, for example, on chessprogramming.org/SIMD_and_SWAR_Techniques:
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
Where in this case H = 0x80008000.
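A minimal sketch of that formula applied to the question's packing (the question's helpers are reproduced so the check is self-contained; the chosen values exercise a case where a plain x + y would carry from the eg half into the mg half):
#include <cassert>
#include <cstdint>

// The question's packing helpers, reproduced for the check below.
inline uint32_t to_score(int16_t mg, int16_t eg) { return (1u * mg) << 16 | (eg & 0xFFFF); }
inline int16_t extract_mg(uint32_t s) { return int16_t(s >> 16); }
inline int16_t extract_eg(uint32_t s) { return int16_t(s & 0xFFFF); }

// Carry-blocking SWAR add: add the halves with their top bits masked off, then
// restore the correct top bits with XOR, so no carry crosses from eg into mg.
inline uint32_t swar_add(uint32_t x, uint32_t y)
{
    const uint32_t H = 0x80008000u;
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

int main()
{
    // eg halves: -1 + 2 = 1 (a plain x + y would carry into the mg half here)
    uint32_t z = swar_add(to_score(3, -1), to_score(5, 2));
    assert(extract_mg(z) == 8 && extract_eg(z) == 1);
}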
As another alternative, it could be done with two additions, but with optimized extraction/recombination:
// low half addition, leaving upper half corrupted but it will be ignored
l = x + y
// high half addition, adding 0 to the bottom so no carry
h = x + (y & 0xFFFF0000)
// recombine
z = (l & 0xFFFF) | (h & 0xFFFF0000)
Subtraction is a minor variation on addition.
Multiplication unfortunately cares about absolute bit-positions, so values have to be moved (shifted) to their notional position for it to work. Actual SIMD can still be used though, such as _mm_mullo_epi16 with SSE2.
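For example, an untested sketch of multiplying both halves of one packed score by the same factor with SSE2 (the function name is made up):
#include <cstdint>
#include <emmintrin.h> // SSE2

// _mm_mullo_epi16 multiplies each 16-bit lane independently and keeps the low
// 16 bits of each product, so nothing leaks between the two halves.
inline uint32_t swar_mul(uint32_t score, int16_t factor)
{
    __m128i v = _mm_cvtsi32_si128((int)score); // 16-bit lane 0 = eg, lane 1 = mg
    __m128i f = _mm_set1_epi16(factor);
    return (uint32_t)_mm_cvtsi128_si32(_mm_mullo_epi16(v, f));
}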
C++ signed integers are two's complement; this is on its way to being standardized in C++20, and in practice you may already assume it.
Some cases of addition and subtraction would work: those that don't cause any of the following: eg overflowing, mg overflowing, or mg changing sign.
The optimization does not make much sense.
If there's a larger array, you can try to get your operations vectorized with proper SIMD instructions, if they are available for your platform, by enabling compiler optimizations or by using intrinsics (_mm_adds_pi16 might be the one you need).
If you have just two integers, just compute them one by one.
Is there a way to properly left rotate (not just shift) BigIntegers of a fixed size?
I tried writing a method which resembles the classic rotation method which is used to rotate integers, but it does not work on BigIntegers. It just shifts the bits to the left by r positions, filling zeros at the end.
public static BigInteger rotate(BigInteger n, int r){
return n.shiftLeft(r).or(n.shiftRight(128-r));
}
EDIT: Not using BigIntegers and using arrays of longs or integers looks like another option, but I'm not sure how you'd be able to combine them (except using BigIntegers) to perform the rotation.
That is actually not so easy. Where would the rotation point be? That is easy for fixed size numbers like 32 bit or 64 bit integers, but not for BigIntegers.
But... in theory, BigIntegers are unlimited in size and two's complement (or at least, they behave as if they are; in reality they are usually sign-magnitude). So positive numbers are (virtually) preceded by an unlimited number of 0 bits and negative numbers by an unlimited number of 1 bits.
So rotating left by 1 would actually mean that you shift left by 1, and if the number was/is negative, the lowest bit is set to 1.
UPDATE
If the BigInteger is just used to represent a fixed size integer (BigIntegers themselves do not have a fixed size), you will have to move the top bits to the bottom. Then you can do something like:
public static BigInteger rotateLeft(BigInteger value, int shift, int bitSize)
{
// Note: shift must be positive, if necessary add checks.
BigInteger topBits = value.shiftRight(bitSize - shift);
BigInteger mask = BigInteger.ONE.shiftLeft(bitSize).subtract(BigInteger.ONE);
return value.shiftLeft(shift).or(topBits).and(mask);
}
And you call it like:
public static void main(String[] args)
{
BigInteger rotated = rotateLeft(new BigInteger(
"1110000100100011010001010110011110001001101010111100110111101111" +
"1111111011011100101110101001100001110110010101000011001000010010",
2), 7, 128);
System.out.println(rotated.toString(2));
}
Note: I did test this and it seems to produce the desired result:
10010001101000101011001111000100110101011110011011110111111111110110111001011101010011000011101100101010000110010000100101110000
If the bitSize is fixed (e.g. always 128), you can pre-calculate the mask and do not have to pass the bitSize to the function, of course.
EDIT:
To obtain the mask, instead of shifting BigInteger.ONE left, you can just as well do:
BigInteger.ZERO.setBit(bitSize).subtract(BigInteger.ONE);
That is probably a little faster.
So I'm writing a program where I need to produce strings of binary numbers that are not only a specific length, but also have a specific number of 1's and 0's. In addition, these strings are compared to a higher and a lower value to see if they are in that specific range. The issue I'm having is that I'm dealing with 64-bit unsigned integers. So sometimes, very large numbers that require all 64 bits produce a lot of permutations of binary strings for values which are not in the range at all, and it's taking a ton of time.
I'm curious if it is possible for an algorithm to take in two bound values, a number of ones, and only produce binary strings in between the bound values with that specific number of ones.
This is what I have so far, but it's producing way too many numbers.
void generatePermutations(int no_ones, int length, uint64_t smaller, uint64_t larger, uint64_t& accum){
char charArray[length+1];
for(int i = length - 1; i > -1; i--){
if(no_ones > 0){
charArray[i] = '1';
no_ones--;
}else{
charArray[i] = '0';
}
}
charArray[length] = '\0';
do {
std::string val(charArray);
uint64_t num = convertToNum(val);
if(num >= smaller && num <= larger){
accum ++;
}
} while ( std::next_permutation(charArray, (charArray + length)));
}
(Note: The number of 1-bits in a binary value is generally called the population count -- popcount, for short -- or Hamming weight.)
There is a well-known bit-hack to cycle through all binary words with the same population count, which basically does the following:
Find the longest suffix of the word consisting of a 0, a non-empty sequence of 1s, and finally a possibly empty sequence of 0s.
Change the first 0 to a 1 and the following 1 to a 0, and then shift all the other 1s (if any) to the end of the word.
Example:
00010010111100
       ^          beginning of the suffix
00010011000111
       ^          0 becomes 1
        ^         1 becomes 0
           ^^^    remaining 1s right-shifted to the end
That can be done quite rapidly by using the fact that the lowest-order set bit in x is x & -x (where - represents the 2s-complement negative of x). To find the beginning of the suffix, it suffices to add the lowest-order set bit to the number, and then find the new lowest-order set bit. (Try this with a few numbers and you should see how it works.)
The biggest problem is performing the right shift, since we don't actually know the bit count. The traditional solution is to do the right-shift with a division (by the original low-order 1 bit), but it turns out that divide on modern hardware is really slow relative to other operations. Looping a one-bit shift is generally faster than dividing, but in the code below I use gcc's __builtin_ffsll, which normally compiles into an appropriate opcode if one exists on the target hardware. (See man ffs for details; I use the builtin to avoid feature-test macros, but it's a bit ugly and limits the range of compilers you can use. OTOH, ffsll is also an extension.)
I've included the division-based solution as well for portability; however, it takes almost three times as long on my i5 laptop.
template<typename UInt>
static inline UInt last_one(UInt ui) { return ui & -ui; }
// next_with_same_popcount(ui) finds the next larger integer with the same
// number of 1-bits as ui. If there isn't one (within the range
// of the unsigned type), it returns 0.
template<typename UInt>
UInt next_with_same_popcount(UInt ui) {
UInt lo = last_one(ui);
UInt next = ui + lo;
UInt hi = last_one(next);
if (next) next += (hi >> __builtin_ffsll(lo)) - 1;
return next;
}
/*
template<typename UInt>
UInt next_with_same_popcount(UInt ui) {
UInt lo = last_one(ui);
UInt next = ui + lo;
UInt hi = last_one(next) >> 1;
if (next) next += hi/lo - 1;
return next;
}
*/
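As a quick sanity check, the worked example above can be traced through exactly these steps:
#include <cassert>
#include <cstdint>

// Hand-worked trace of the example above (1212 = 0b00010010111100), following
// the same steps as next_with_same_popcount.
int main()
{
    uint64_t x    = 0b00010010111100;        // popcount 6
    uint64_t lo   = x & -x;                  // 0b100: lowest set bit
    uint64_t next = x + lo;                  // 0b00010011000000: suffix collapsed
    uint64_t hi   = next & -next;            // 0b1000000
    next += (hi >> __builtin_ffsll(lo)) - 1; // shift the remaining 1s to the end
    assert(next == 0b00010011000111);        // 1223, popcount still 6
}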
The only remaining problem is to find the first number with the correct popcount inside of the given range. To help with this, the following simple algorithm can be used:
Start with the first value in the range.
As long as the popcount of the value is too high, eliminate the last run of 1s by adding the low-order 1 bit to the number (using exactly the same x&-x trick as above). Since this works right-to-left, it cannot loop more than 64 times, once per bit.
While the popcount is too small, add the smallest possible bit by changing the low-order 0 bit to a 1. Since this adds a single 1-bit on each loop, it also cannot loop more than k times (where k is the target popcount), and it is not necessary to recompute the population count on each loop, unlike the first step.
In the following implementation, I again use a GCC builtin, __builtin_popcountll. This one doesn't have a corresponding Posix function. See the Wikipedia page for alternative implementations and a list of hardware which does support the operation. Note that it is possible that the value found will exceed the end of the range; also, the function might return a value less than the supplied argument, indicating that there is no appropriate value. So you need to check that the result is inside the desired range before using it.
// next_with_popcount_k returns the smallest integer >= ui whose popcnt
// is exactly k. If ui has exactly k bits set, it is returned. If there
// is no such value, returns the smallest integer with exactly k bits.
template<typename UInt>
UInt next_with_popcount_k(UInt ui, int k) {
int count;
while ((count = __builtin_popcountll(ui)) > k)
ui += last_one(ui);
for (int i = count; i < k; ++i)
ui += last_one(~ui);
return ui;
}
It's possible to make this slightly more efficient by changing the first loop to:
while ((count = __builtin_popcountll(ui)) > k) {
UInt lo = last_one(ui);
ui += last_one(ui - lo) - lo;
}
That shaved about 10% off of the execution time, but I doubt whether the function will be called often enough to make that worthwhile. Depending on how efficiently your CPU implements the POPCOUNT opcode, it might be faster to do the first loop with a single bit sweep in order to be able to track the popcount instead of recomputing it. That will almost certainly be the case on hardware without a POPCOUNT opcode.
Once you have those two functions, iterating over all k-bit values in a range becomes trivial:
void all_k_bits(uint64_t lo, uint64_t hi, int k) {
uint64_t i = next_with_popcount_k(lo, k);
if (i >= lo) {
for (; i > 0 && i < hi; i = next_with_same_popcount(i)) {
// Do what needs to be done
}
}
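An illustrative sketch of the asker's original counting task built on those two helpers (they must be in scope; the range is treated as inclusive, as in the question):
#include <cstdint>

// Counts the values in [lo, hi] (inclusive) that have exactly k bits set,
// using next_with_popcount_k and next_with_same_popcount from above.
uint64_t count_k_bits_in_range(uint64_t lo, uint64_t hi, int k)
{
    uint64_t count = 0;
    uint64_t i = next_with_popcount_k(lo, k);
    if (i >= lo) {
        for (; i > 0 && i <= hi; i = next_with_same_popcount(i))
            ++count;
    }
    return count;
}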
}