The largest n-bit integer - C++

I thought that computing the largest n-bit integer would be trivial by using bit-shifts. Specifically, my idea was to set all of the bits to 1, and then shift them to the right:
#include <cassert>
#include <cstdint>

template <typename T = uint16_t>
auto largest(uint8_t n) {
    constexpr auto bits = 8 * sizeof(T);
    assert(n <= bits);
    return static_cast<T>(-1) >> (bits - n);
}
Generally, this idea seems to work. If I print out the result for 0, 1, ..., I get 0, 1, 3,..., 65535 (as expected).
However, this is where things get strange...
If instead of a uint16_t, I use a uint32_t or uint64_t then I find that
largest<uint16_t>(1) = 1
largest<uint32_t>(1) = 1
largest<uint64_t>(1) = 1
which is saying, "The largest 1-bit integer is 1" (as expected). However...
largest<uint16_t>(0) = 0
largest<uint32_t>(0) = 4294967295
largest<uint64_t>(0) = 18446744073709551615
So the value of 0 seems to be an edge case if I use uint32_t or uint64_t to hold the integer type.
To diagnose further, I hard-coded those edge cases so that the compiler can better see it:
static_cast<uint16_t>(-1) >> 16;
static_cast<uint32_t>(-1) >> 32;
static_cast<uint64_t>(-1) >> 64;
and now, for the 32- and 64-bit cases, both GCC and Clang throw a warning:
prog.cc:22:31: warning: right shift count >= width of type [-Wshift-count-overflow]
22 | static_cast<uint64_t>(-1) >> 64;
I couldn't find any documentation about why this isn't allowed, or why it only happens in the 32- and 64-bit cases. I understand why it might complain when the count > width, but the count == width case seems valid to me.
Does anybody have some insight as to what is going on?
Also, I would like to hear suggestions for how to compute the largest n-bit integer without having to put in a branch (obviously I could handle the case of n==0 specially).
Here is a code link so that you don't have to retype everything: https://wandbox.org/permlink/3oqxqQR9ypP5q7yw

You can use a lookup table to avoid branching.
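For example, a minimal sketch of that idea for uint16_t (the table and helper names are illustrative, not from the post; the table just precomputes largest<uint16_t>(n) for every valid n, so n == 0 needs no special case):

#include <cstdint>

static const uint16_t kLargest16[17] = {
    0x0000, 0x0001, 0x0003, 0x0007, 0x000F, 0x001F, 0x003F, 0x007F,
    0x00FF, 0x01FF, 0x03FF, 0x07FF, 0x0FFF, 0x1FFF, 0x3FFF, 0x7FFF,
    0xFFFF
};

uint16_t largest16(uint8_t n) {
    return kLargest16[n]; // valid for n in [0, 16]; no shift, no branch
}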

Set all meaningful unset bits of a number

Given an integer n (1 ≤ n ≤ 10^18), I need to make all the unset bits in this number set (i.e. only the bits meaningful for the number, not the padding bits required to fit in an unsigned long long).
My approach: Let the most significant bit be at position p; then n with all bits set will be 2^(p+1) - 1.
All my test cases passed except the one shown below.
Input
288230376151711743
My output
576460752303423487
Expected output
288230376151711743
Code
#include <bits/stdc++.h>
using namespace std;
typedef long long int ll;

int main() {
    ll n;
    cin >> n;
    ll x = log2(n) + 1;
    cout << (1ULL << x) - 1;
    return 0;
}
The precision of a typical double is only about 15 decimal digits.
The value of log2(288230376151711743) is 57.999999999999999994994646087789191106964114967902921472132432244... (calculated using Wolfram Alpha)
Therefore, this value is rounded up to 58, and the result puts a 1 bit one position higher than expected.
As general advice, you should avoid using floating-point values as much as possible when dealing with integer values.
You can solve this with shift and or:
uint64_t n = 36757654654;
int i = 1;
while ((n & (n + 1)) != 0) { // note the parentheses: != binds tighter than &
    n |= n >> i;
    i *= 2;
}
Any set bit will be duplicated to the next lower bit, then pairs of bits will be duplicated 2 bits lower, then quads, bytes, shorts, and ints, until all meaningful bits are set and (n + 1) becomes the next power of 2.
Just hardcoding the maximum of 6 shifts and ors might be faster than the loop.
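For instance, the unrolled version might look like this (a sketch, not the answerer's code; six fixed shift/or steps cover all 64 bits):

#include <cstdint>

uint64_t set_low_bits(uint64_t n) {
    n |= n >> 1;  // pairs
    n |= n >> 2;  // quads
    n |= n >> 4;  // bytes
    n |= n >> 8;  // 16-bit groups
    n |= n >> 16; // 32-bit groups
    n |= n >> 32; // both halves
    return n;     // every bit at or below the original MSB is now set
}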
If you need to do integer arithmetic and count bits, you'd better count them properly and avoid introducing floating-point uncertainty:
unsigned x = 0;
for (; n; x++)
    n >>= 1;
...
(demo)
The good news is that for n <= 1E18, x will never reach the number of bits in an unsigned long long. So the rest of your code is not at risk of UB, and you could stick to your minus-1 approach (although in theory it might not be portable for C++ before C++20) ;-)
Btw, here are more ways to efficiently find the most significant bit, and the simple log2() is not among them.
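For example, GCC and Clang expose a leading-zero-count builtin; a sketch (my example, not code from the linked page; compiler-specific, and undefined for n == 0):

unsigned msb_position(unsigned long long n) {
    // __builtin_clzll counts leading zero bits, so the MSB's index is
    // 63 minus that count; the caller must guarantee n > 0
    return 63u - __builtin_clzll(n);
}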

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that (b >> i) & 1 == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:
#include <immintrin.h>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return _mm256_movemask_epi8(tmp);
}
Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
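Something along these lines might do for SSE2 (my sketch, with the same sizeof(bool) == 1 assumption; untested):

#include <emmintrin.h>
#include <cstdint>

uint32_t bitpack_sse2(const bool array[32]) {
    __m128i lo = _mm_loadu_si128((const __m128i *) array);
    __m128i hi = _mm_loadu_si128((const __m128i *) (array + 16));
    lo = _mm_cmpgt_epi8(lo, _mm_setzero_si128()); // 0xFF where nonzero
    hi = _mm_cmpgt_epi8(hi, _mm_setzero_si128());
    return (uint32_t) _mm_movemask_epi8(lo)          // bits 0..15
         | ((uint32_t) _mm_movemask_epi8(hi) << 16); // bits 16..31
}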
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here, on a machine with fast multiplication, like this:
inline int pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);
    return (0x8040201008040201 * t >> 56) & 0xFF;
}

int pack32b(bool* a)
{
    return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
           (pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading them, we'll get the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits):
| a7 || a6 || a5 || a4 || a3 || a2 || a1 || a0 |
.......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
────────────────────────────────────────────────────────────────
↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
↑..d.....↑.c......↑b.......a
↑.c......↑b.......a
↑b.......a
a
────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the above code.
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, e.g. by shifting only once at the end instead of shifting by 56 bits in each call.
Sorry, I overlooked the question, saw doynax's bool array, and misread "32 0/1 values" as 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
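Putting that together, a complete 32-value packer in the question's bit order might look like this (my sketch; assumes sizeof(bool) == 1 and a BMI2-capable CPU, compiled with BMI2 enabled):

#include <immintrin.h>
#include <cstdint>
#include <cstring>

uint32_t bitpack_pext(const bool array[32]) {
    uint32_t b = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t chunk;
        std::memcpy(&chunk, array + 8 * i, 8); // 8 bools per chunk, no aliasing trouble
        // extract bit 0 of each byte and pack the 8 bits contiguously
        b |= (uint32_t) _pext_u64(chunk, 0x0101010101010101ULL) << (8 * i);
    }
    return b; // (b >> i) & 1 == array[i]
}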
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result = 0;
for (unsigned i = 0; i < 32; ++i)
    result = (result << 1) + a[i];
On modern x86 CPUs, I think shifts of any distance in a register take constant time, and this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does about 900 (sum on 32) 1-bit shifts, by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in comments; apparently long shifts on x86 are not unit time).
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.
Then the following code packs m*2 booleans into an int:
(i1<<m)+i2
Using this we can pack 2^n bits as follows:
unsigned int a2[16], a4[8], a8[4], a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0] = (a1[0] << 1) + a1[1];  // the original bits are a1[k]; can be scalar variables or ints
a2[1] = (a1[2] << 1) + a1[3];  // yes, you can use "|" instead of "+"
...
a2[15] = (a1[30] << 1) + a1[31];
a4[0] = (a2[0] << 2) + a2[1];
a4[1] = (a2[2] << 2) + a2[3];
...
a4[7] = (a2[14] << 2) + a2[15];
a8[0] = (a4[0] << 4) + a4[1];
a8[1] = (a4[2] << 4) + a4[3];
a8[2] = (a4[4] << 4) + a4[5];
a8[3] = (a4[6] << 4) + a4[7];
a16[0] = (a8[0] << 8) + a8[1];
a16[1] = (a8[2] << 8) + a8[3];
a32[0] = (a16[0] << 16) + a16[1];
Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; in effect it does 64 1-bit shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
If the starting booleans are really in an array, with each value occupying the bottom bit of an otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
a4[0] = ((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1] = ((unsigned int*)a1)[1];
...
a4[7] = ((unsigned int*)a1)[7];
a8[0] = (a4[0] << 1) + a4[1];
a8[1] = (a4[2] << 1) + a4[3];
a8[2] = (a4[4] << 1) + a4[5];
a8[3] = (a4[6] << 1) + a4[7];
a16[0] = (a8[0] << 2) + a8[1];
a16[1] = (a8[2] << 2) + a8[3];
a32[0] = (a16[0] << 4) + a16[1];
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <cstdio>

unsigned a[32] =
{
    1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
    1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};

int main()
{
    unsigned b = 0;
    for (unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;
    printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
    unsigned b = 0;
    b |= a[0];
    b |= a[1] << 1;
    b |= a[2] << 2;
    b |= a[3] << 3;
    // ... etc
    b |= a[31] << 31;
    printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the like):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (int i = 0; i < 32; i++)
    b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing this proof-of-concept with some rough timings shows it is indeed not orders of magnitude better than the straightforward loop with b |= (a[i] << (31-i)):
Ira                   3618 ticks
naive, unrolled       5620 ticks
Ira, 1-shifted       10044 ticks
Galik                10265 ticks
Jongware, using adds 12536 ticks
Jongware             12682 ticks
naive                13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b = 0;
for (int i = 31; i >= 0; --i) {
    b <<= 1;
    b |= a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
    b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ

C++: Binary to Decimal Conversion

I am trying to convert a binary array to decimal in the following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1};
int decimal = 0;
for (int i = 0; i < 8; i++)
    decimal = (decimal << 1) + array[i];
Actually I have to convert a 64-bit binary array to decimal, and I have to do it millions of times.
Can anybody help me: is there any faster way to do the above? Or is the above fine?
Your method is adequate; to call it nice, I would just not mix bitwise operations with the "mathematical" way of converting to decimal, i.e. use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];
It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Furthermore, it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;
unsigned int x1 = *(unsigned int *)(array+4);
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.
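For the curious, that 64-bit version might look like this (my sketch, reusing the 0x8040201008040201 multiply trick shown elsewhere on this page; assumes a little-endian machine and bytes holding only 0 or 1):

#include <cstdint>
#include <cstring>

int bin_to_int(const uint8_t array[8]) {
    uint64_t x;
    std::memcpy(&x, array, 8); // load all eight 0/1 bytes as one word
    // the multiply funnels each input bit into the top byte, with
    // array[0] ending up as the most significant bit of the result
    return (int)((x * 0x8040201008040201ULL) >> 56);
}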
The binary 'compression' can be generalized as a problem of weighted sum -- and for that there are some interesting techniques.
X mod (255) means essentially summing of all independent 8-bit numbers.
X mod 254 means summing each digit with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding were big-endian, then *(unsigned long long *)array % 254 would produce a weighted sum (with truncated range of 0..253). Then removing the value with weight 2 and adding it manually would produce the correct result:
uint64_t a = *(uint64_t *)array;
return (a & ~256) % 254 + ((a>>9) & 2);
Another mechanism to get the weights is to premultiply each binary digit by 255 and mask the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder of 255 (and correct now with weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from the answer of JasonD, parallel bit selection can be iteratively utilized.
But first expressing the equation in full form would help the compiler to remove the artificial dependency created by the iterative approach using accumulation:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
uint32_t HI = *(uint32_t *)array, LO = *(uint32_t *)&array[4];
LO |= (HI << 4);  // the HI dword has a weight 16 relative to the LO bytes
LO |= (LO >> 14); // the high word has 4x the weight of the low word
LO |= (LO >> 9);  // the high byte has 2x the weight of the lower byte
return LO & 255;
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.
You could use accumulate, with a doubling-and-adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
    return (sum * 2) + next;
}

int decimal = accumulate(array, array + ARRAY_SIZE,
                         0, doubleSumAndAdd);
Note that std::accumulate requires the initial value (0 here) as its third argument. Like the OP's code, this treats array[0] as the most significant bit.
Try this; I converted binary numbers of up to 1020 digits:
#include <sstream>
#include <string>
#include <math.h>
#include <iostream>
using namespace std;

long binary_decimal(string num) /* Function to convert binary to dec */
{
    long dec = 0, n = 1, exp = 0;
    string bin = num;
    if (bin.length() > 1020) {
        cout << "Binary Digit too large" << endl;
    }
    else {
        for (int i = bin.length() - 1; i > -1; i--)
        {
            n = pow(2, exp++);
            if (bin.at(i) == '1')
                dec += n;
        }
    }
    return dec;
}
In theory this method would work for a binary number of indefinite length, though note that long will overflow well before 1020 bits.

Constrain a 16 bit signed value between 0 and 4095 using Bit Manipulation only (without branching)

I want to constrain the value of a signed short variable between 0 and 4095, after which I take the most significant 8 bits as my final value for use elsewhere. Right now I'm doing it in a basic manner as below:
short color = /* some external source */;
/*
 * I get the color value as a 16-bit signed integer from an
 * external source I cannot trust. 16 bits are being used here
 * for higher precision.
 */
if (color < 0) {
    color = 0;
}
else if (color > 4095) {
    color = 4095;
}
unsigned char color8bit = 0xFF & (color >> 4);
/*
 * color8bit is my final value which I would actually use
 * in my application.
 */
Is there any way this can be done using bit manipulation only, i.e. without using any conditionals? It might help quite a bit in speeding things up, as this operation happens thousands of times in the code.
The following won't help as it doesn't take care of edge cases such as negative values and overflows:
unsigned char color8bit = 0xFF & (( 0x0FFF & color ) >> 4 );
Edit: Adam Rosenfield's answer is the one which takes the correct approach, but it's incorrectly implemented. ouah's answer gives correct results, but takes a different approach than what I originally intended to find out.
This is what I ended up using:
const static short min = 0;
const static short max = 4095;
color = min ^ (( min ^ color ) & -( min < color ));
color = max ^ (( color ^ max ) & -( color < max ));
unsigned char color8bit = 0xFF & (( 0x0FFF & color ) >> 4 );
Yes, see these bit-twiddling hacks:
short color = ...;
color = color ^ (color & -(color < 0)); // color = max(color, 0)
color = 4096 ^ ((color ^ 4096) & -(color < 4096)); // color = min(color, 4096)
unsigned char color8bit = 0xFF & (color >> 4);
Whether this actually turns out to be faster, I don't know -- you should profile. Most modern x86 and x86-64 chips these days support "conditional move" instructions (cmov) which conditionally store a value depending on the EFLAGS status bits, and optimizing compilers will often produce these instructions from ternary expressions like color >= 0 ? color : 0. Those will likely be fastest, but they won't run on older x86 chips.
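For comparison, the ternary form such compilers turn into cmov would be simply (a sketch of the obvious version the answer refers to, not new logic):

short clamp_ternary(short color) {
    color = color < 0 ? 0 : color;       // clamp below
    color = color > 4095 ? 4095 : color; // clamp above
    return color;
}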
You can do the following:
BYTE data[0x10000] = { ..... };
BYTE byte_color = data[(unsigned short)short_color];
These days a 64 KB table is not something outrageous and may be acceptable. The number of assembly instructions in this variant of the code will be the absolute minimum compared to other possible approaches.
short color = /* ... */;
color = ((((!!(color >> 12)) * 0xFFF)) | (!(color >> 12) * color))
        & (!(color >> 15) * 0xFFF);
unsigned char color8bit = 0xFF & (color >> 4);
It assumes two's complement representation.
This has the advantage of not using any equality or relational operators. There are situations you want to avoid branches at all costs: in some security applications you don't want the attackers to perform branch predictions. Without branches (in embedded processors particularly) you can make your function run in constant time for all inputs.
Note that: x * 0xFFF can be further reduced to (x << 12) - x. Also the multiplication in (!(color >> 12) * color ) can also be further optimized as the left operand of * here is 0 or 1.
EDIT:
I add a little explanation: the expression above simply does the same as below without the use of the conditional and relational operators:
y = ((y > 4095 ? 4095 : 0) | (y > 4095 ? 0 : y))
& (y < 0 ? 0 : 4095);
EDIT2:
as @HotLicks correctly noted in his comment, the ! is still a conceptual branch. Nevertheless it can also be computed with bitwise operators. For example !!a can be done with the trivial:
b = (a >> 15 | a >> 14 | ... | a >> 1 | a) & 1
and !a can be done as b ^ 1. And I'm sure there is a nice hack to do it more effectively.
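One such hack, sketched as a function (my example; it folds the bits downward so the bottom bit is set exactly when any bit was set):

#include <cstdint>

int not_not(uint16_t a) { // computes !!a without ! or comparisons
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return a & 1;
}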
I assume a short is 16 bits.
Remove negative values:
int16_t mask = -(int16_t)((uint16_t)color >> 15); // 0xFFFF if -ve, 0 if +ve
short value = color & ~mask;                      // 0 if -ve, color if +ve
value is now between 0 and 32767 inclusive.
You can then do something similar to clamp the value:
mask = (uint16_t)(value - 4096) >> 15; // 1 if <= 4095, 0 if > 4095
--mask;                                // 0 if <= 4095, 0xFFFF if > 4095
mask &= 0xFFF;                         // 0 if <= 4095, 4095 if > 4095
value |= mask;                         // 4095 if > 4095, value if <= 4095
You could also easily vectorize this using Intel's SSE intrinsics. One 128-bit register holds 8 of your shorts, and there are functions to min/max/shift/mask all of them in parallel. In a loop, the constants for min/max can be preloaded into a register. The pshufb instruction (part of SSSE3) will even pack the bytes for you.
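A rough sketch of that idea (my code, not the answerer's; SSE2's _mm_max_epi16/_mm_min_epi16 clamp 8 signed shorts at once):

#include <emmintrin.h>

void clamp8(short *colors) { // clamps 8 values in place to [0, 4095]
    __m128i v = _mm_loadu_si128((const __m128i *) colors);
    v = _mm_max_epi16(v, _mm_setzero_si128());  // clamp below at 0
    v = _mm_min_epi16(v, _mm_set1_epi16(4095)); // clamp above at 4095
    _mm_storeu_si128((__m128i *) colors, v);
}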
I'm going to leave an answer even though it doesn't directly answer the original question, because in the end I think you'll find it much more useful.
I'm assuming that your color is coming from a camera or image scanner running at 12 bits, followed by some undetermined processing step that might create values beyond the 0 to 4095 range. If that's the case the values are almost certainly derived in a linear fashion. The problem is that displays are gamma corrected, so the conversion from 12 bit to 8 bit will require a non-linear gamma function rather than a simple right shift. This will be much slower than the clamping operation your question is trying to optimize. If you don't use a gamma function the image will appear too dark.
short color = /* some external source */;
unsigned char color8bit;

if (color <= 0)
    color8bit = 0;
else if (color >= 4095)
    color8bit = 255;
else
    color8bit = (unsigned char)(255.99 * pow(color / 4095.0, 1 / 2.2));
At this point you might consider a lookup table as suggested by Kirill Kobelev.
This is somewhat akin to Tom Seddon's answer, but uses a slightly cleaner way to do the clamp above. Note that both Mr. Seddon's answer and mine avoid the issue in ouah's answer that shifting a signed value to the right is implementation-defined behavior, and hence not guaranteed to work on all architectures.
#include <inttypes.h>
#include <iostream>

int16_t clamp(int16_t value)
{
    // clampBelow is 0xffff for -ve, 0x0000 for +ve
    int16_t const clampBelow = -static_cast<int16_t>(static_cast<uint16_t>(value) >> 15);

    // value is now clamped below at zero
    value &= ~clampBelow;

    // subtract 4095 so we can do the same trick again
    value -= 4095;

    // clampAbove is 0xffff for -ve, 0x0000 for +ve,
    // i.e. 0xffff for original value < 4095, 0x0000 for original >= 4096
    int16_t const clampAbove = -static_cast<int16_t>(static_cast<uint16_t>(value) >> 15);

    // adjusted value now clamped above at zero
    value &= clampAbove;

    // and restore to the original range
    value += 4095;

    return value;
}

void verify(int16_t value)
{
    int16_t const clamped = clamp(value);
    int16_t const check = (value < 0 ? 0 : value > 4095 ? 4095 : value);
    if (clamped != check)
    {
        std::cout << "Verification failure for value: " << value
                  << ", clamped: " << clamped << ", check: " << check << std::endl;
    }
}

int main()
{
    for (int16_t i = 0x4000; i != 0x3fff; i++)
    {
        verify(i);
    }
    return 0;
}
That's a full test program (OK, so it doesn't test 0x3fff - sue me. ;) ) from which you can extract the clamp() routine for whatever you need.
I've also broken clamp out to "one step per line" for the sake of clarity. If your compiler has a half way decent optimizer, you can leave it as is and rely on the compiler to produce the best possible code. If your compiler's optimizer is not that great, then by all means, it can be reduced in line count, albeit at the cost of a little readability.
"Never sacrifice clarity for efficiency" -- Bob Buckley, comp sci professor, U-Warwick, Coventry, England, 1980.
Best piece of advice I ever got. ;)

Checking whether a number is positive or negative using bitwise operators

I can check whether a number is odd or even using bitwise operators. Can I check whether a number is positive/zero/negative without using any conditional statements/operators like if/ternary, etc.?
Can the same be done using bitwise operators and some trick in C or in C++?
Can I check whether a number is positive/zero/negative without using any conditional statements/operators like if/ternary etc.
Of course:
bool is_positive = number > 0;
bool is_negative = number < 0;
bool is_zero = number == 0;
If the high bit is set on a signed integer (byte, long, etc., but not a floating point number), that number is negative.
int x = -2300; // assuming a 32-bit int

if ((x & 0x80000000) != 0)
{
    // number is negative
}
ADDED:
You said that you don't want to use any conditionals. I suppose you could do this:
int isNegative = (x & 0x80000000);
And at some later time you can test it with if (isNegative).
Or, you could use signbit() and the work's done for you.
I'm assuming that under the hood, the math.h implementation is an efficient bitwise check (possibly solving your original goal).
Reference: http://en.cppreference.com/w/cpp/numeric/math/signbit
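A minimal usage sketch (my example, not from the reference):

#include <cmath>

bool negative = std::signbit(-0.0); // true: signbit reads the sign bit itself,
                                    // so it even distinguishes -0.0 from +0.0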
There is a detailed discussion on the Bit Twiddling Hacks page.
int v; // we want to find the sign of v
int sign; // the result goes here
// CHAR_BIT is the number of bits per byte (normally 8).
sign = -(v < 0); // if v < 0 then -1, else 0.
// or, to avoid branching on CPUs with flag registers (IA32):
sign = -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
// or, for one less instruction (but not portable):
sign = v >> (sizeof(int) * CHAR_BIT - 1);
// The last expression above evaluates to sign = v >> 31 for 32-bit integers.
// This is one operation faster than the obvious way, sign = -(v < 0). This
// trick works because when signed integers are shifted right, the value of the
// far left bit is copied to the other bits. The far left bit is 1 when the value
// is negative and 0 otherwise; all 1 bits gives -1. Unfortunately, this behavior
// is architecture-specific.
// Alternatively, if you prefer the result be either -1 or +1, then use:
sign = +1 | (v >> (sizeof(int) * CHAR_BIT - 1)); // if v < 0 then -1, else +1
// On the other hand, if you prefer the result be either -1, 0, or +1, then use:
sign = (v != 0) | -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
// Or, for more speed but less portability:
sign = (v != 0) | (v >> (sizeof(int) * CHAR_BIT - 1)); // -1, 0, or +1
// Or, for portability, brevity, and (perhaps) speed:
sign = (v > 0) - (v < 0); // -1, 0, or +1
// If instead you want to know if something is non-negative, resulting in +1
// or else 0, then use:
sign = 1 ^ ((unsigned int)v >> (sizeof(int) * CHAR_BIT - 1)); // if v < 0 then 0, else 1
// Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C
// specification leaves the result of signed right-shift implementation-defined,
// so on some systems this hack might not work. For greater portability, Toby
// Speight suggested on September 28, 2005 that CHAR_BIT be used here and
// throughout rather than assuming bytes were 8 bits long. Angus recommended
// the more portable versions above, involving casting on March 4, 2006.
// Rohit Garg suggested the version for non-negative integers on September 12, 2009.
#include <stdio.h>

int main()
{
    int n; // assuming int to be 32 bits long
    scanf("%d", &n);
    // shift right 31 times so that the MSB comes to the LSB's position,
    // then AND it with 0x1
    if (((n >> 31) & 0x1) == 1) {
        printf("negative number\n");
    } else {
        printf("positive number\n");
    }
    return 0;
}
Signed integers and floating points normally use the most significant bit for storing the sign so if you know the size you could extract the info from the most significant bit.
There is generally little benefit in doing this, since some sort of comparison will need to be made to use this information, and it is just as easy for a processor to test whether something is negative as to test whether it is not zero. In fact, on ARM processors, checking the most significant bit will normally be MORE expensive than checking whether it is negative up front.
It is quite simple. It can easily be done with (assuming a 32-bit int and an arithmetic right shift):
return ((!!x) | (x >> 31));
it returns
1 for a positive number,
-1 for a negative, and
0 for zero
This cannot be done in a portable way with bit operations in C. The representations for signed integer types that the standard allows can be much weirder than you might suspect. In particular, the value with the sign bit on and otherwise zero need not be a permissible value for the signed type nor the unsigned type, but a so-called trap representation for both types.
Any computation you do with bit operators might therefore have a result that leads to undefined behavior.
In any case as some of the other answers suggest, this is not really necessary and comparison with < or > should suffice in any practical context, is more efficient, easier to read... so just do it that way.
// if (x < 0) return -1
// else if (x == 0) return 0
// else return 1
int sign(int x) {
    // x_is_not_zero = 0 if x is 0, else 1
    int x_is_not_zero = ((x | (~x + 1)) >> 31) & 0x1;
    return ((x & (0x1 << 31)) >> 31) | x_is_not_zero; // for negative x, the right operand doesn't matter
}
Here's exactly what you want!
Here is an update related to C++11 for this old question. It is also worth considering std::signbit.
On Compiler Explorer using gcc 7.3 64bit with -O3 optimization, this code
bool s1(double d)
{
    return d < 0.0;
}
generates
s1(double):
    pxor    xmm1, xmm1
    ucomisd xmm1, xmm0
    seta    al
    ret
And this code
bool s2(double d)
{
    return std::signbit(d);
}
generates
s2(double):
    movmskpd eax, xmm0
    and      eax, 1
    ret
You would need to profile to ensure that there is any speed difference, but the signbit version does use 1 less opcode.
When you're sure about the size of an integer (assuming 16-bit int):
bool is_negative = (unsigned) signed_int_value >> 15;
When you are unsure of the size of integers:
bool is_negative = (unsigned) signed_int_value >> (sizeof(int)*8)-1; //where 8 is bits
The cast to unsigned avoids the implementation-defined result of right-shifting a negative signed value.
if ((num >> (sizeof(int)*8 - 1)) == 0)
    // number is positive
else
    // number is negative
If the shifted value is 0, the number is positive; otherwise it is negative.
A simpler way to find out if a number is positive or negative:
Let the number be x. Check whether (x * (-1)) > x. If true, x is negative; otherwise it is positive (note this doesn't distinguish zero, and can overflow for the most negative value).
You can differentiate between negative/non-negative by looking at the most significant bit.
In all representations for signed integers, that bit will be set to 1 if the number is negative.
There is no test to differentiate between zero and positive, except for a direct test against 0.
To test for negative, you could use
#define IS_NEGATIVE(x) ((x) & (1U << ((sizeof(x)*CHAR_BIT)-1)))
Suppose your number is a = 10 (positive). If you shift a right by a bits it will give zero, i.e.:
10 >> 10 == 0
So you can check if the number is positive. But in the case a = -10 (negative):
-10 >> -10 == -1
So you can combine those in an if:
if (!(a >> a))
    print number is positive
else
    print number is negative
#include <stdio.h>

int checksign(int n)
{
    return (n >= 0 && (n & (1 << (32 - 1))) >= 0);
}

int main()
{
    int num = 11;
    if (checksign(num))
    {
        printf("Unsigned number");
    }
    else
    {
        printf("signed Number");
    }
    return 0;
}
Without if:
string pole[2] = {"+", "-"};
long long x;
while (true) {
    cin >> x;
    cout << pole[x / -((x * (-1)) - 1)] << "\n\n";
}
(not working for 0)
if (n & (1 << 31))
{
    printf("Negative number");
}
else {
    printf("positive number");
}
This checks the most significant bit of the number n using the & operation: if that bit is 1, the number is negative; otherwise it is positive.