Given an integer n (1 ≤ n ≤ 10^18), I need to set all the unset bits in this number (i.e. only the bits that are meaningful for the number, not the padding bits required to fit in an unsigned long long).
My approach: if the most significant bit is at position p, then n with all bits set is 2^(p+1) - 1.
All my test cases passed except the one shown below.
Input
288230376151711743
My output
576460752303423487
Expected output
288230376151711743
Code
#include<bits/stdc++.h>
using namespace std;
typedef long long int ll;
int main() {
ll n;
cin >> n;
ll x = log2(n) + 1;
cout << (1ULL << x) - 1;
return 0;
}
The precision of a typical double is only about 15 decimal digits.
The value of log2(288230376151711743) is 57.999999999999999994994646087789191106964114967902921472132432244... (calculated using Wolfram Alpha)
Therefore, this value is rounded to 58, and this results in a 1 bit being placed one position higher than expected.
As general advice, you should avoid using floating-point values as much as possible when dealing with integer values.
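The rounding can be observed even before log2 is called, because the integer itself is converted to double first. Here is a minimal sketch, assuming double is an IEEE 754 binary64 type with a 53-bit significand; the helper name through_double is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Converts n to double and back again. With a 53-bit significand, values
// above 2^53 may be rounded to a neighbouring representable value.
// Precondition: n < 2^63, so the rounded value still fits in 64 bits.
std::uint64_t through_double(std::uint64_t n) {
    assert(n < (std::uint64_t{1} << 63));
    const double d = static_cast<double>(n);
    return static_cast<std::uint64_t>(d);
}
```

Under that assumption the failing input 288230376151711743 (which is 2^58 - 1) comes back as 288230376151711744 (2^58), so log2 sees an exact power of two.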
You can solve this with shift and or.
uint64_t n = 36757654654;
int i = 1;
while ((n & (n + 1)) != 0) {
n |= n >> i;
i *= 2;
}
Any set bit will be duplicated to the next lower bit, then pairs of bits will be duplicated 2 bits lower, then quads, bytes, shorts, int until all meaningful bits are set and (n + 1) becomes the next power of 2.
Just hardcoding the maximum of 6 shifts and ors might be faster than the loop.
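The hardcoded variant could look roughly like this for a 64-bit value. It is only a sketch, and the function name fill_below_msb is my own choice.

```cpp
#include <cassert>
#include <cstdint>

// Sets every bit below the most significant set bit of n.
// Six shift-or steps are enough for 64 bits: 1 + 2 + 4 + 8 + 16 + 32 = 63.
std::uint64_t fill_below_msb(std::uint64_t n) {
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n |= n >> 32;
    assert((n & (n + 1)) == 0);  // the result always has the form 2^k - 1
    return n;
}
```

There is no data-dependent branch here, which is why it may beat the loop; whether it actually does is something to measure.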
If you need to do integer arithmetics and count bits, you'd better count them properly, and avoid introducing floating point uncertainty:
unsigned x=0;
for (;n;x++)
n>>=1;
...
(demo)
The good news is that for n <= 1E18, x will never reach the number of bits in an unsigned long long. So the rest of your code is not at risk of UB, and you could stick to your minus-1 approach (although in theory it might not be portable before C++20) ;-)
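Put together, the counting loop and the minus-1 step might be sketched as follows. The names bit_length and all_ones_up_to_msb are hypothetical, and the extra branch for a full 64-bit width is only there so the sketch also behaves for inputs larger than the 1E18 limit of the question.

```cpp
#include <cassert>
#include <cstdint>

// Number of bits needed to represent n (0 for n == 0), integer-only.
unsigned bit_length(std::uint64_t n) {
    unsigned x = 0;
    for (; n != 0; ++x) n >>= 1;
    return x;
}

// Value with all bits set up to and including the most significant bit of n.
std::uint64_t all_ones_up_to_msb(std::uint64_t n) {
    const unsigned x = bit_length(n);
    assert(x <= 64);
    // Shifting a 64-bit value by 64 is undefined, so treat that case separately.
    return x == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << x) - 1;
}
```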
Btw, here are more ways to efficiently find the most significant bit, and the simple log2() is not among them.
I found this loop in the source code of an algorithm. I think that details about the problems aren't relevant here, because this is a really small part of the solution.
void update(int i, int value, int array[], int n) {
for(; i < n; i += ~i & (i + 1)) {
array[i] += value;
}
}
I don't really understand what happens in that for loop, is it some sort of trick? I found something similar named Fenwick trees, but they look a bit different than what I have here.
Any ideas what this loop means?
Also, found this :
"Bit Hack #9. Isolate the rightmost 0-bit.
y = ~x & (x+1)
"
You are correct: the bit-hack ~i & (i + 1) should evaluate to an integer which is all binary 0's, except the one corresponding to the rightmost zero-bit of i, which is set to binary 1.
So at the end of each pass of the for loop, it adds this value to itself. Since the corresponding bit in i is zero, this has the effect of setting it, without affecting any other bits in i. This will strictly increase the value of i at each pass, until i overflows (or becomes -1, if you started with i<0). In context, you can probably expect that it is called with i>=0, and that the i < n condition is there to terminate the loop before your index walks off the array.
The overall function should have the effect of iterating through the zero-bits of the original value of i from least- to most-significant, setting them one by one, and incrementing the corresponding elements of the array.
Fenwick trees are a clever way to accumulate and query statistics efficiently; as you say, their update loop looks a bit like this, and typically uses a comparable bit-hack. There are bound to be multiple ways to accomplish this kind of bit-fiddling, so it is certainly possible that your source code is updating a Fenwick tree, or something comparable.
Assume that from the right to the left, you have some number of 1 bits, a 0 bit, and then more bits in x.
If you add x + 1, then all the 1's at the right are changed to 0, the 0 is changed to 1, the rest is unchanged. For example xxxx011 + 1 = xxxx100.
In ~x, you have the same number of 0 bits, a 1 bit, and the inverses of the other bits. The bitwise and produces the 0 bits, one 1 bit, and since the remaining bits are and'ed with their negation, those bits are 0.
So the result of ~x & (x + 1) is a number with one 1 bit where x had its rightmost zero bit.
If you add this to x, you change the rightmost 0 to a 1. So if you do this repeatedly, you change the 0 bits in x to 1, from the right to the left.
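The two steps described above can be written as small helpers to experiment with. This is just a sketch on 32-bit unsigned values, and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Returns a value with a single 1 bit at the position of the lowest 0 bit of x.
// Precondition: x has at least one 0 bit.
std::uint32_t lowest_zero_bit(std::uint32_t x) {
    assert(x != 0xFFFFFFFFu);
    return ~x & (x + 1);
}

// Adding the isolated bit sets the lowest 0 bit and leaves all other bits alone.
std::uint32_t set_lowest_zero_bit(std::uint32_t x) {
    return x + lowest_zero_bit(x);
}
```

For example, 0xB is 1011 in binary, so lowest_zero_bit(0xB) gives 0x4 and set_lowest_zero_bit(0xB) gives 0xF.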
The update function iterates over the 0 bits of i from the rightmost zero to the leftmost, setting them one at a time, and adds value to the ith element of array at each step.
The for loop checks whether i is less than n; if so, ~i & (i + 1) is an integer that is all binary 0's except for a single 1 at the position of the rightmost 0 bit of i. Then array[i] += value adds value to the current element.
Setting i to 8 and stepping through the iterations may make things clearer.
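To step through it without a debugger, one could record the indices the loop touches. A small sketch follows; the helper visited_indices is hypothetical and simply reuses the loop header from the question.

```cpp
#include <cassert>
#include <vector>

// Returns, in order, the indices that
//   for (; i < n; i += ~i & (i + 1))
// would touch. Precondition: i is not negative.
std::vector<int> visited_indices(int i, int n) {
    assert(i >= 0);
    std::vector<int> out;
    for (; i < n; i += ~i & (i + 1)) {
        out.push_back(i);
    }
    return out;
}
```

Starting from i = 8 with n = 16 this yields 8, 9, 11, 15 (binary 1000, 1001, 1011, 1111): one more zero bit is filled in at each step.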
I was given this algorithm to write a hash function:
BEGIN Hash (string)
UNSIGNED INTEGER key = 0;
FOR_EACH character IN string
key = ((key << 5) + key) ^ character;
END FOR_EACH
RETURN key;
END Hash
The << operator shifts bits to the left, ^ is the XOR operation, and character is the ASCII value of the character. Seems pretty straightforward.
Below is my code
unsigned int key = 0;
for (int i = 0; i < data.length(); i++) {
key = ((key<<5) + key) ^ (int)data[i];
}
return key;
However, I keep getting ridiculous positive and negative huge numbers when i should actually get a hash value from 0 - n. n is a value set by the user beforehand. I'm not sure where things went wrong but I'm thinking it could be the XOR operation.
Any suggestions or opinions will be greatly appreciated. Thanks!
The output of this code is a 32-bit (or 64-bit or however wide your unsigned int is) unsigned integer. To restrict it to the range from 0 to n−1, simply reduce it modulo n, using the % operator:
unsigned int hash = key % n;
(It should be obvious that your code, as written, cannot return "a hash value from 0 - n", since n does not appear anywhere in your code.)
In fact, there's a good reason not to reduce the hash value modulo n too soon: if you ever need to grow your hash, storing the unreduced hash codes of your strings saves you the effort of recalculating them whenever n changes.
Finally, a few general notes on your hash function:
As Joachim Pileborg comments above, the explicit (int) cast is unnecessary. If you want to keep it for clarity, it really should say (unsigned int) to match the type of key, since that's what the value actually gets converted into.
For unsigned integer types, ((key<<5) + key) is equal to 33 * key (since shifting left by 5 bits is the same as multiplying by 2^5 = 32). On modern CPUs, using multiplication is almost certainly faster; on old or very low-end processors with slow multiplication, it's likely that any decent compiler will optimize multiplication by a constant into a combination of shifts and adds anyway. Thus, either way, expressing the operation as a multiplication is IMO preferable.
You don't want to call data.length() on every iteration of the loop. Call it once before the loop and store the result in a variable.
Initializing key to zero means that your hash value is not affected by any leading zero bytes in the string. The original version of your hash function, due to Dan Bernstein, uses a (more or less random) initial value of 5381 instead.
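Taken together, the notes above could lead to something like the following. It is only a sketch: the function names hash_string and bucket_for are my own, and going through unsigned char is an extra precaution (not mentioned above) against negative char values on platforms where char is signed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// XOR variant of the multiply-by-33 hash, seeded with 5381.
unsigned int hash_string(const std::string& data) {
    unsigned int key = 5381u;
    const std::size_t len = data.length();  // computed once, outside the loop
    for (std::size_t i = 0; i < len; ++i) {
        // unsigned char keeps characters above 127 from sign-extending.
        key = (key * 33u) ^ static_cast<unsigned char>(data[i]);
    }
    return key;
}

// Maps a string to a bucket index in the range [0, n). Precondition: n > 0.
unsigned int bucket_for(const std::string& data, unsigned int n) {
    assert(n > 0);
    return hash_string(data) % n;
}
```

Keeping hash_string separate from bucket_for means the unreduced value can be stored and reused if n changes later.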
My goal is as the following,
Generate successive values such that each new one was never generated before, until all possible values have been generated. At that point, the counter starts the same sequence again. The main point is that all possible values are generated without repetition (until the period is exhausted). It does not matter whether the sequence is simply 0, 1, 2, 3, ..., or in some other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64 bits. For example, in one place I need to generate a 256-bit sequence. A simple implementation looks like the following, just to illustrate what I am trying to do:
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr start with all zeros, then first ctr[0] is increased one by one until it reach max, and then ctr[1], and so on. If all 256-bits are set, then we reset it to all zero, and start again.
The problem is that, such implementation is surprisingly slow. My current improved version is sort of equivalent to the following,
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always start with zero, then ctr[k] == 0 if ctr[k - 1] == 0. And thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since a branch misprediction should happen only once in every 2^64 iterations. In the second, a misprediction happens only once every 2^256 iterations, but that should not make a difference. And apart from the branching, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.
However, clang, gcc, and Intel icpc all generate binaries in which the second is much faster.
My main question is that does anyone know if there any faster way to implement such a counter? It does not matter if the counter start by increasing the first integers or if it is implemented as an array of integers at all, as long as the algorithm generate all 2^256 combinations of 256-bits.
What makes things more complicated, I also need non uniform increment. For example, each time the counter is incremented by K where K > 1, but almost always remain a constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is using them as input to AES-NI aesenc instructions. So distinct 128-bits integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instructions, a distinct 128-bits integer is generated. If I generate one __m128i integer at once, then the cost of increment matters little. However, since aesenc has quite a bit latency, I generate integers by blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalent to the following,
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment each block[i] by 4, and generate 4 __m128i output at once. By interleaving instructions, overall I was able to increase the throughput, and reduce the cycles per bytes of output (cpB) from 6 to 0.9 when using 2 64-bits integers as the counter and 8 blocks. However, if instead, use 4 32-bits integers as counter, the throughput, measured as bytes per sec is reduced to half. I know for a fact that on x86-64, 64-bits integers could be faster than 32-bits in some situations. But I did not expect such simple increment operation makes such a big difference. I have carefully benchmarked the application, and the increment is indeed the one slow down the program. Since the loading into __m128i and store the __m128i output into usable 32-bits or 64-bits integers are done through aligned pointers, the only difference between the 32-bits and 64-bits version is how the counter is incremented. I expected that the AES-NI expected, after loading the integers into __m128i, shall dominate the performance. But when using 4 or 8 blocks, it was clearly not the case.
So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.
It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes. And that would require a Gray code counter.
Next thing before optimization is to fix the original implementation
void increment (ctr_type &ctr)
{
if (++ctr[0] != 0) return;
if (++ctr[1] != 0) return;
if (++ctr[2] != 0) return;
++ctr[3];
}
If each ctr[i] was not allowed to overflow to zero, the period would be just 4*(2^64), as in 0-9, 19,29,39,49,...99, 199,299,... and 1999,2999,3999,..., 9999.
As a reply to the comment -- it takes 2^64 iterations to have the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to produce the first carry-out. That's about 136 years.
EDIT
If the original implementation with 2^66 states is really what is wanted, then I'd suggest to change the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)
{
counter++; // Move to next word
if (counter > tail_of_array) {
counter = head_of_array;
memset(counter,0, 16);
break;
}
}
The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.
If you're using GCC or compilers with __int128 like Clang or ICC
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available
std::array<uint64_t, 4> c{};
c[0]++;
if (c[0] == 0)
{
c[1]++;
if (c[1] == 0)
{
c[2]++;
if (c[2] == 0)
{
c[3]++;
}
}
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately most high-level languages don't have a means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80 and you cannot even count through a 64-bit counter in your lifetime.
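For completeness, here is roughly how the overflow builtin could be used to add a small step K to a four-word counter. This is a sketch that assumes a GCC-compatible compiler; the type name ctr256 and the function add_small are made up, and the words are stored least significant first.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 256-bit counter type, least significant word first.
typedef std::array<unsigned long long, 4> ctr256;

// Adds k to the counter, wrapping around modulo 2^256.
void add_small(ctr256& ctr, unsigned long long k) {
    unsigned long long carry = k;
    // Stop as soon as nothing is left to propagate, which is almost always
    // after the first word.
    for (std::size_t i = 0; i < ctr.size() && carry != 0; ++i) {
        // The builtin stores the wrapped sum and returns true on overflow.
        const bool overflow = __builtin_uaddll_overflow(ctr[i], carry, &ctr[i]);
        carry = overflow ? 1ULL : 0ULL;
    }
}
```

A quick check such as starting from {~0ULL, ~0ULL, 0, 0}, calling add_small(c, 1) and asserting that c is {0, 0, 1, 0} shows the carry rippling through two words.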
Neither of your counter versions increment correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the indices that has reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.
You mention "Generate successive values, such that each new one was never generated before"
To generate a set of such values, look at linear congruential generators
the sequence x = (x*1 + 1) % (power_of_2): as you already realized, this is simply the sequence of consecutive numbers.
the sequence x = (x*13 + 137) % (power_of_2): this generates unique numbers with a predictable period (equal to power_of_2, since the multiplier is congruent to 1 modulo 4 and the increment is odd), and the numbers look more "random", in a pseudo-random way. You need to resort to arbitrary-precision arithmetic to get it working, along with all the tricks for multiplying by constants. This is a good starting point.
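For a small modulus the period can simply be counted by brute force before scaling the idea up to wide integers. A sketch with 32-bit arithmetic follows; lcg_period is a made-up helper, and its assertions encode the Hull-Dobell full-period conditions for a power-of-two modulus.

```cpp
#include <cassert>
#include <cstdint>

// Counts the steps until x = (a*x + c) mod m returns to its starting value.
// Preconditions: m is a power of two and at least 4, a % 4 == 1, c is odd.
std::uint32_t lcg_period(std::uint32_t a, std::uint32_t c,
                         std::uint32_t m, std::uint32_t seed) {
    assert(m >= 4 && (m & (m - 1)) == 0);
    assert(a % 4 == 1 && c % 2 == 1);
    const std::uint32_t start = seed & (m - 1);
    std::uint32_t x = start;
    std::uint32_t steps = 0;
    do {
        // Unsigned wrap-around modulo 2^32 is harmless because m divides 2^32.
        x = (a * x + c) & (m - 1);
        ++steps;
    } while (x != start);
    return steps;
}
```

With a = 13, c = 137 and m = 256, every one of the 256 values is visited exactly once before the sequence repeats.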
You also complain that your simple code is "slow"
At a 4.2 GHz clock frequency, running 4 instructions per cycle and using AVX-512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing but increments, you get only 64 x 8 x 4 x 2^32 = 8796093022208 increments per second, so 2^64 increments are reached in 25 days. This post is old, you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128 or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
running out of money
end of humankind (nobody will get the outcome of your software)
burning the energy from the last particles of the universe
Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
ADDcc adds two words, and sets the carry if there was unsigned overflow
ADDC adds two words plus carry (from a previous addition)
ADDCcc adds two words plus carry, and sets the carry if there was unsigned overflow
A multi-word addition with two words uses ADDcc of the least significant words followed by ADDC of the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef uint16_t T;
/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
T cy, t0, t1;
counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
for (int i = 1; i < (n - 1); i++) {
counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
}
counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}
#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)
int main (void)
{
uint32_t count32 = 0, incr32 = INCREMENT;
T count_arr2 [UINT32_ARRAY_LEN] = {0};
T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
do {
count32 = count32 + incr32;
inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
} while (count32 < (0U - INCREMENT - 1));
printf ("count32 = %08x arr_count = %08x\n",
count32, (((uint32_t)count_arr2 [1] << 16) +
((uint32_t)count_arr2 [0] << 0)));
uint64_t count64 = 0, incr64 = INCREMENT;
T count_arr4 [UINT64_ARRAY_LEN] = {0};
T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
do {
count64 = count64 + incr64;
inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
} while (count64 < 0xa987654321ULL);
printf ("count64 = %016llx arr_count = %016llx\n",
count64, (((uint64_t)count_arr4 [3] << 48) +
((uint64_t)count_arr4 [2] << 32) +
((uint64_t)count_arr4 [1] << 16) +
((uint64_t)count_arr4 [0] << 0)));
return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
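Applied to the 256-bit counter from the question, the same carry-detection idea might be sketched in C++ as below. It uses plain functions instead of the macros, reuses the ctr_type name from the question, and the function name add_256 is my own; words are assumed to be stored least significant first.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::array<std::uint64_t, 4> ctr_type;  // least significant word first

// counter += increment (mod 2^256); the carry is detected by unsigned wrap-around.
void add_256(ctr_type& counter, const ctr_type& increment) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < counter.size(); ++i) {
        const std::uint64_t partial = increment[i] + carry;  // wraps only if carry == 1
        const std::uint64_t carry1 = partial < carry;
        const std::uint64_t sum = counter[i] + partial;
        const std::uint64_t carry2 = sum < partial;
        counter[i] = sum;
        carry = carry1 + carry2;
        assert(carry <= 1);  // the two partial carries can never both be set
    }
}
```

Unlike the macro version, this does not special-case the first and last word, which costs a few redundant operations but keeps the loop uniform.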
I am trying to convert a binary array to decimal in following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1} ;
int decimal = 0 ;
for(int i = 0 ; i < 8 ; i++)
decimal = (decimal << 1) + array[i] ;
Actually, I have to convert a 64-bit binary array to decimal, and I have to do it millions of times.
Can anybody tell me whether there is a faster way to do the above? Or is the above fine?
Your method is adequate. To call it nice, I would just avoid mixing bitwise operations with the "mathematical" way of converting to decimal, i.e. use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];
It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Furthermore, it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;
unsigned int x1 = *(unsigned int *)(array+4);
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.
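A 64-bit variant that handles all eight bytes with one multiplication might look like the sketch below. It is my own extrapolation of the same trick, not code from the answer: it assumes a little-endian target and input bytes that are exactly 0 or 1, and it uses memcpy instead of the pointer cast to stay clear of aliasing problems. The name pack8 is made up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Packs eight 0/1 bytes into one byte, array[0] becoming the most significant bit.
// Assumes a little-endian target.
std::uint8_t pack8(const std::uint8_t array[8]) {
    std::uint64_t x;
    std::memcpy(&x, array, sizeof x);
    assert((x & ~0x0101010101010101ULL) == 0);  // every byte must be 0 or 1
    // The multiplier has one bit every 9 positions, so byte i (bit 8*i) lands
    // on bit 63 - i of the product and no two terms collide.
    return static_cast<std::uint8_t>((x * 0x8040201008040201ULL) >> 56);
}
```

For the array {1,1,1,1,0,1,1,1} from the question this gives 247, the same as the loop.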
The binary 'compression' can be generalized as a problem of weighted sum -- and for that there are some interesting techniques.
X mod (255) means essentially summing of all independent 8-bit numbers.
X mod 254 means summing each digit with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding was big endian, then *(unsigned long long *)array % 254 would produce a weighted sum (with truncated range of 0..253). Then removing the value with weight 2 and adding it manually would produce the correct result:
uint64_t a = *(uint64_t *)array;
return (a & ~256) % 254 + ((a>>9) & 2);
Other mechanism to get the weight is to premultiply each binary digit by 255 and masking the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder of 255 (and correct now with weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from the answer of JasonD, parallel bit selection can be iteratively utilized.
But first expressing the equation in full form would help the compiler to remove the artificial dependency created by the iterative approach using accumulation:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
uint32_t HI = *(uint32_t *)array, LO = *(uint32_t *)&array[4];
LO |= (HI<<4); // The HI dword has a weight 16 relative to Lo bytes
LO |= (LO>>14); // High word has 4x weight compared to low word
LO |= (LO>>9); // high byte has 2x weight compared to lower byte
return LO & 255;
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.
You could use accumulate, with a doubling and adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
return (sum * 2) + next;
}
int decimal = accumulate(array, array+ARRAY_SIZE, 0,
doubleSumAndAdd);
This treats array[0] as the most significant bit, which is the same order as the OP's loop.
Try this, I converted a binary digit of up to 1020 bits
#include <sstream>
#include <string>
#include <math.h>
#include <iostream>
using namespace std;
long binary_decimal(string num) /* Function to convert binary to dec */
{
long dec = 0, n = 1, exp = 0;
string bin = num;
if(bin.length() > 1020){
cout << "Binary Digit too large" << endl;
}
else {
for(int i = bin.length() - 1; i > -1; i--)
{
n = pow(2,exp++);
if(bin.at(i) == '1')
dec += n;
}
}
return dec;
}
Theoretically this method will work for a binary number of any length, but in practice the result has to fit in a long, so longer inputs overflow.
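An integer-only variant that avoids pow and reports inputs it cannot represent could be sketched like this. The name parse_binary and the 64-digit limit (the width of std::uint64_t) are my own choices, not part of the answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Parses a string of '0'/'1' characters, most significant digit first.
// Returns false for an empty string, any other character, or more than
// 64 digits; in that case out is left unchanged.
bool parse_binary(const std::string& bin, std::uint64_t& out) {
    if (bin.empty() || bin.length() > 64) return false;
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < bin.length(); ++i) {
        if (bin[i] != '0' && bin[i] != '1') return false;
        value = (value << 1) | static_cast<std::uint64_t>(bin[i] - '0');
    }
    out = value;
    return true;
}
```

Usage would be along the lines of std::uint64_t v; assert(parse_binary("11110111", v) && v == 247);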