Is ((a + (b & 255)) & 255) the same as ((a + b) & 255)? - c++

I was browsing some C++ code, and found something like this:
(a + (b & 255)) & 255
The double AND annoyed me, so I thought of:
(a + b) & 255
(a and b are 32-bit unsigned integers)
I quickly wrote a test script (JS) to confirm my theory:
for (var i = 0; i < 100; i++) {
var a = Math.ceil(Math.random() * 0xFFFF),
b = Math.ceil(Math.random() * 0xFFFF);
var expr1 = (a + (b & 255)) & 255,
expr2 = (a + b) & 255;
if (expr1 != expr2) {
console.log("Numbers " + a + " and " + b + " mismatch!");
break;
}
}
While the script confirmed my hypothesis (both operations are equal), I still don't trust it, because 1) it relies on random inputs and 2) I'm not a mathematician and have no idea what I am doing.
Also, sorry for the Lisp-y title. Feel free to edit it.

They are the same. Here's a proof:
First note the identity (A + B) mod C = (A mod C + B mod C) mod C
Let's restate the problem by regarding a & 255 as standing in for a % 256. This is true since a is unsigned.
So (a + (b & 255)) & 255 is (a + (b % 256)) % 256
This is the same as (a % 256 + b % 256 % 256) % 256 (I've applied the identity stated above: note that mod and % are equivalent for unsigned types.)
This simplifies to (a % 256 + b % 256) % 256 which becomes (a + b) % 256 (reapplying the identity). You can then put the bitwise operator back to give
(a + b) & 255
completing the proof.

Lemma: a & 255 == a % 256 for unsigned a.
Unsigned a can be rewritten as m * 0x100 + b for some unsigned m, b with 0 <= b <= 0xff and 0 <= m <= 0xffffff. It follows from both definitions that a & 255 == b == a % 256.
Additionally, we need:
the distributive property: (a + b) mod n = [(a mod n) + (b mod n)] mod n
the definition of unsigned addition, mathematically: (a + b) ==> (a + b) % (2 ^ 32)
Thus:
(a + (b & 255)) & 255 = ((a + (b & 255)) % (2^32)) & 255 // def'n of addition
= ((a + (b % 256)) % (2^32)) % 256 // lemma
= (a + (b % 256)) % 256 // because 256 divides (2^32)
= ((a % 256) + (b % 256 % 256)) % 256 // Distributive
= ((a % 256) + (b % 256)) % 256 // a mod n mod n = a mod n
= (a + b) % 256 // Distributive again
= (a + b) & 255 // lemma
So yes, it is true. For 32-bit unsigned integers.
What about other integer types?
For 64-bit unsigned integers, all of the above applies just as well, just substituting 2^64 for 2^32.
For 8- and 16-bit unsigned integers, addition involves promotion to int. This int will definitely neither overflow nor be negative in any of these operations, so they all remain valid.
For signed integers, if either a+b or a+(b&255) overflow, it's undefined behavior. So the equality can't hold — there are cases where (a+b)&255 is undefined behavior but (a+(b&255))&255 isn't.
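As a quick sanity check on the lemma and the final identity, the two expressions can be compared at compile time. This is only a sketch: the names masked_twice and masked_once are made up for illustration, and a handful of static_asserts is a spot check, not a proof.

```cpp
#include <cstdint>

// Hypothetical helper names; these are the two expressions from the question.
constexpr std::uint32_t masked_twice(std::uint32_t a, std::uint32_t b) {
    return (a + (b & 255u)) & 255u;
}
constexpr std::uint32_t masked_once(std::uint32_t a, std::uint32_t b) {
    return (a + b) & 255u;
}

// Lemma: for unsigned values, & 255 and % 256 agree.
static_assert((0xDEADBEEFu & 255u) == (0xDEADBEEFu % 256u), "lemma");

// Inputs chosen so that the 32-bit addition wraps around.
static_assert(masked_twice(0xFFFFFFFFu, 0xFFFFFFFFu) ==
              masked_once(0xFFFFFFFFu, 0xFFFFFFFFu), "wraparound case");
static_assert(masked_twice(0xFFFFFF01u, 0x000001FFu) ==
              masked_once(0xFFFFFF01u, 0x000001FFu), "carry out of the low byte");
```

If the file compiles, the checked cases agree; both expressions give 0xFE for the first pair and 0 for the second.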

In positional addition, subtraction and multiplication of unsigned numbers to produce unsigned results, more significant digits of the input don't affect less-significant digits of the result. This applies to binary arithmetic as much as it does to decimal arithmetic. It also applies to "twos complement" signed arithmetic, but not to sign-magnitude signed arithmetic.
However, we have to be careful when taking rules from binary arithmetic and applying them to C (I believe C++ has the same rules as C on this stuff, but I'm not 100% sure), because C arithmetic has some arcane rules that can trip us up. Unsigned arithmetic in C follows simple binary wraparound rules, but signed arithmetic overflow is undefined behaviour. Worse, under some circumstances C will automatically "promote" an unsigned type to (signed) int.
Undefined behaviour in C can be especially insidious. A dumb compiler (or a compiler on a low optimisation level) is likely to do what you expect based on your understanding of binary arithmetic, while an optimising compiler may break your code in strange ways.
So, getting back to the formula in the question, the equivalence depends on the operand types.
If they are unsigned integers whose size is greater than or equal to the size of int then the overflow behaviour of the addition operator is well-defined as simple binary wraparound. Whether or not we mask off the high 24 bits of one operand before the addition operation has no impact on the low bits of the result.
If they are unsigned integers whose size is less than int then they will be promoted to (signed) int. Overflow of signed integers is undefined behaviour, but at least on every platform I have encountered the difference in size between different integer types is large enough that a single addition of two promoted values will not cause overflow. So again we can fall back on the simple binary arithmetic argument to deem the statements equivalent.
If they are signed integers whose size is less than int then again overflow can't happen, and on twos-complement implementations we can rely on the standard binary arithmetic argument to say they are equivalent. On sign-magnitude or ones-complement implementations they would not be equivalent.
OTOH if a and b were signed integers whose size was greater than or equal to the size of int then even on twos complement implementations there are cases where one statement would be well-defined while the other would be undefined behaviour.
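The promotion rule described above can be observed directly from the type of the sum. The following is a small sketch; the constant names are invented for this example, and the first check assumes a platform where int is wider than 16 bits.

```cpp
#include <cstdint>
#include <type_traits>

// True when uint16_t operands are promoted to (signed) int before the addition.
// This holds wherever int is wider than 16 bits; with a 16-bit int the result
// type would be unsigned int instead.
constexpr bool u16_sum_is_int =
    std::is_same<decltype(std::uint16_t{} + std::uint16_t{}), int>::value;

// Operands of type unsigned int are not promoted, so the sum stays unsigned
// and wraparound is well-defined.
constexpr bool unsigned_sum_is_unsigned =
    std::is_same<decltype(0u + 0u), unsigned int>::value;

static_assert(unsigned_sum_is_unsigned, "unsigned int + unsigned int is unsigned int");
```

On a typical desktop compiler both constants should be true.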

Yes, (a + b) & 255 is fine.
Remember addition in school? You add numbers digit by digit, and add a carry value to the next column of digits. There is no way for a later (more significant) column of digits to influence an already processed column. Because of this, it does not make a difference if you zero-out the digits only in the result, or also first in an argument.
The above is not always true: the C++ standard allows an implementation that would break this.
Such a Deathstation 9000 :-) would have to use a 33-bit int, if the OP meant unsigned short with "32-bit unsigned integers". If unsigned int was meant, the DS9K would have to use a 32-bit int, and a 32-bit unsigned int with a padding bit. (The unsigned integers are required to have the same size as their signed counterparts as per §3.9.1/3, and padding bits are allowed in §3.9.1/1.) Other combinations of sizes and padding bits would work too.
As far as I can tell, this is the only way to break it, because:
The integer representation must use a "purely binary" encoding scheme (§3.9.1/7 and the footnote), all bits except padding bits and the sign bit must contribute a value of 2^n
int promotion is allowed only if int can represent all the values of the source type (§4.5/1), so int must have at least 32 bits contributing to the value, plus a sign bit.
the int can not have more value bits (not counting the sign bit) than 32, because else an addition can not overflow.

You already have the smart answer: unsigned arithmetic is modulo arithmetic and therefore the results will hold, you can prove it mathematically...
One cool thing about computers, though, is that computers are fast. Indeed, they are so fast that enumerating all valid combinations of 32 bits is possible in a reasonable amount of time (don't try with 64 bits).
So, in your case, I personally like to just throw it at a computer; it takes me less time to convince myself that the program is correct than it takes to convince myself that the mathematical proof is correct and that I didn't overlook a detail in the specification¹:
#include <cstdint>
#include <iostream>
int main() {
std::uint64_t const MAX = std::uint64_t(1) << 32;
for (std::uint64_t i = 0; i < MAX; ++i) {
for (std::uint64_t j = 0; j < MAX; ++j) {
std::uint32_t const a = static_cast<std::uint32_t>(i);
std::uint32_t const b = static_cast<std::uint32_t>(j);
auto const champion = (a + (b & 255)) & 255;
auto const challenger = (a + b) & 255;
if (champion == challenger) { continue; }
std::cout << "a: " << a << ", b: " << b << ", champion: " << champion << ", challenger: " << challenger << "\n";
return 1;
}
}
std::cout << "Equality holds\n";
return 0;
}
This enumerates all possible values of a and b in the 32-bit space and checks whether the equality holds. If it does not, it prints the case that didn't work, which you can use as a sanity check.
And, according to Clang: Equality holds.
Furthermore, given that the arithmetic rules are bit-width agnostic (above int bit-width), this equality will hold for any unsigned integer type of 32 bits or more, including 64 bits and 128 bits.
Note: How can a compiler enumerate all 64-bit patterns in a reasonable time frame? It cannot. The loops were optimized out. Otherwise we would all have died before execution terminated.
I initially only proved it for 16-bit unsigned integers; unfortunately C++ is an insane language where small integers (smaller bit widths than int) are first converted to int.
#include <cstdint>
#include <iostream>
int main() {
unsigned const MAX = 65536;
for (unsigned i = 0; i < MAX; ++i) {
for (unsigned j = 0; j < MAX; ++j) {
std::uint16_t const a = static_cast<std::uint16_t>(i);
std::uint16_t const b = static_cast<std::uint16_t>(j);
auto const champion = (a + (b & 255)) & 255;
auto const challenger = (a + b) & 255;
if (champion == challenger) { continue; }
std::cout << "a: " << a << ", b: " << b << ", champion: "
<< champion << ", challenger: " << challenger << "\n";
return 1;
}
}
std::cout << "Equality holds\n";
return 0;
}
And once again, according to Clang: Equality holds.
Well, there you go :)
1 Of course, if a program ever inadvertently triggers Undefined Behavior, it would not prove much.

The quick answer is: both expressions are equivalent.
Since a and b are 32-bit unsigned integers, the result is the same even in case of overflow. Unsigned arithmetic guarantees this: a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
The long answer is: there are no known platforms where these expressions would differ, but the Standard does not guarantee it, because of the rules of integral promotion.
If the type of a and b (unsigned 32-bit integers) has a higher rank than int, the computation is performed as unsigned, modulo 2^32, and it yields the same defined result for both expressions for all values of a and b.
Conversely, if the type of a and b is smaller than int, both are promoted to int and the computation is performed using signed arithmetic, where overflow invokes undefined behavior.
If int has at least 33 value bits, neither of the above expressions can overflow, so the result is perfectly defined and has the same value for both expressions.
If int has exactly 32 value bits, the computation can overflow for both expressions: for example, a=0xFFFFFFFF and b=1 would cause an overflow in both. To avoid this, you would need to write ((a & 255) + (b & 255)) & 255.
The good news is there are no such platforms¹.
1 More precisely, no such real platform exists, but one could configure a DS9K to exhibit such behavior and still conform to the C Standard.
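The fully masked form ((a & 255) + (b & 255)) & 255 can be wrapped in a small helper. This is a sketch only: low_byte_sum is a made-up name, and for negative signed inputs the meaning of & 255 relies on two's complement (which C++20 guarantees).

```cpp
#include <climits>
#include <cstdint>

// Hypothetical helper: both operands are masked first, so the intermediate
// sum is at most 255 + 255 = 510 and cannot overflow for any integer type.
template <typename T>
constexpr T low_byte_sum(T a, T b) {
    return static_cast<T>(((a & 255) + (b & 255)) & 255);
}

// INT_MAX + 1 would be undefined behaviour, but the masked form is fine.
static_assert(low_byte_sum(INT_MAX, 1) == 0, "signed case without overflow");
static_assert(low_byte_sum(0xFFFFFFFFu, 1u) == 0u, "unsigned case");
static_assert(low_byte_sum<std::uint16_t>(0xFFFF, 0xFFFF) == 254, "promoted case");
```

The price is one extra AND compared with the original expression, in exchange for not depending on promotion or overflow rules at all.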

Identical assuming no overflow. Neither version is truly immune to overflow, but the double-AND version is more resistant to it. I am not aware of a system where an overflow in this case is a problem, but I can see the author doing this in case there is one.

Yes you can prove it with arithmetic, but there is a more intuitive answer.
When adding, every bit only influences those more significant than itself; never those less significant.
Therefore, whatever you do to the higher bits before the addition won't change the result, as long as you only keep bits less significant than the lowest bit modified.
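To make the "carries only travel upward" argument concrete, here is a sketch of a bit-by-bit adder that never looks at anything above bit 7 of either input. The function name ripple_low8 is invented for this example.

```cpp
#include <cstdint>

// Adds the low 8 bits of a and b one column at a time, the way addition is
// done on paper. Bits 8 and above of the inputs are never read.
inline std::uint32_t ripple_low8(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    std::uint32_t carry = 0;
    for (int i = 0; i < 8; ++i) {
        const std::uint32_t x = (a >> i) & 1u;
        const std::uint32_t y = (b >> i) & 1u;
        const std::uint32_t sum = x + y + carry;  // 0..3
        result |= (sum & 1u) << i;                // digit for this column
        carry = sum >> 1;                         // carry into the next column
    }
    return result;  // the carry out of column 7 is dropped
}
```

Since the high bits of the inputs are never read, ripple_low8(a, b) should equal (a + b) & 255 whatever those bits contain; for example ripple_low8(0xFFFFFFFF, 1) gives 0.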

The proof is trivial and left as an exercise for the reader
But to actually legitimize this as an answer, your first line of code says take the last 8 bits of b** (all higher bits of b set to zero) and add this to a and then take only the last 8 bits of the result setting all higher bits to zero.
The second line says add a and b and take the last 8 bits with all higher bits zero.
Only the last 8 bits are significant in the result. Therefore only the last 8 bits are significant in the input(s).
** last 8 bits = 8 LSB
Also it is interesting to note that the output would be equivalent to
unsigned char a = something;
unsigned char b = something;
return (unsigned int)(unsigned char)(a + b);
As above, only the 8 LSB are significant, but the result is an unsigned int with all other bits zero. Note that a + b itself is computed in int after integral promotion, so it is the conversion back to unsigned char (assuming 8-bit chars) that discards the higher bits.

Related

Modulo Multiplication Function: Multiplying two integers under a modulus

I came across this modulo multiplication function in code for the Miller-Rabin primality test. It is supposed to eliminate the integer overflow that occurs when calculating ( a * b ) % m.
I need some help in understanding what is going on here. Why does this work, and what is the significance of the number literal 0x8000000000000000ULL?
unsigned long long mul_mod(unsigned long long a, unsigned long long b, unsigned long long m) {
unsigned long long d = 0, mp2 = m >> 1;
if (a >= m) a %= m;
if (b >= m) b %= m;
for (int i = 0; i < 64; i++)
{
d = (d > mp2) ? (d << 1) - m : d << 1;
if (a & 0x8000000000000000ULL)
d += b;
if (d >= m) d -= m;
a <<= 1;
}
return d;
}
This code, which currently appears on the modular arithmetic Wikipedia page, only works for arguments of up to 63 bits -- see bottom.
Overview
One way to compute an ordinary multiplication a * b is to add left-shifted copies of b -- one for each 1-bit in a. This is similar to how most of us did long multiplication in school, but simplified: Since we only ever need to "multiply" each copy of b by 1 or 0, all we need to do is either add the shifted copy of b (when the corresponding bit of a is 1) or do nothing (when it's 0).
This code does something similar. However, to avoid overflow (mostly; see below), instead of shifting each copy of b and then adding it to the total, it adds an unshifted copy of b to the total, and relies on later left-shifts performed on the total to shift it into the correct place. You can think of these shifts "acting on" all the summands added to the total so far. For example, the first loop iteration checks whether the highest bit of a, namely bit 63, is 1 (that's what a & 0x8000000000000000ULL does), and if so adds an unshifted copy of b to the total; by the time the loop completes, the previous line of code will have shifted the total d left 1 bit 63 more times.
The main advantage of doing it this way is that we are always adding two numbers (namely b and d) that we already know are less than m, so handling the modulo wraparound is cheap: We know that b + d < 2 * m, so to ensure that our total so far remains less than m, it suffices to check whether b + d < m, and if not, subtract m. If we were to use the shift-then-add approach instead, we would need a % modulo operation per bit, which is as expensive as division -- and usually much more expensive than subtraction.
One of the properties of modulo arithmetic is that, whenever we want to perform a sequence of arithmetic operations modulo some number m, performing them all in usual arithmetic and taking the remainder modulo m at the end always yields the same result as taking remainders modulo m for each intermediate result (provided no overflows occur).
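The overview can be sketched on a smaller scale. The function below is not the code from the question; it is a simplified 16-bit analogue (the name shift_add_mul_mod is made up) with 32-bit intermediates, so nothing can overflow and the result is easy to compare against a plain (a * b) % m. It assumes m is nonzero.

```cpp
#include <cstdint>

// MSB-first shift-and-add: double the running total, add an unshifted copy of
// b when the current bit of a is 1, and reduce with a subtraction each time.
inline std::uint32_t shift_add_mul_mod(std::uint16_t a, std::uint16_t b,
                                       std::uint16_t m) {
    std::uint32_t total = 0;
    const std::uint32_t addend = b % m;    // keep the addend below m
    for (int bit = 15; bit >= 0; --bit) {
        total <<= 1;                       // total < 2 * m after doubling
        if (total >= m) total -= m;
        if ((a >> bit) & 1u) {
            total += addend;               // total < 2 * m again
            if (total >= m) total -= m;
        }
    }
    return total;
}
```

For instance, shift_add_mul_mod(13, 11, 10) returns 3, the same as 143 % 10, even though only subtractions were used inside the loop.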
Code
Before the first line of the loop body, we have the invariants d < m and b < m.
The line
d = (d > mp2) ? (d << 1) - m : d << 1;
is a careful way of shifting the total d left by 1 bit, while keeping it in the range 0 .. m and avoiding overflow. Instead of first shifting it and then testing whether the result is m or greater, we test whether it is currently strictly above RoundDown(m/2) -- because if so, after doubling, it will surely be strictly above 2 * RoundDown(m/2) >= m - 1, and so require a subtraction of m to get back in range. Note that even though the (d << 1) in (d << 1) - m may overflow and lose the top bit of d, this does no harm as it does not affect the lowest 64 bits of the subtraction result, which are the only ones we are interested in. (Also note that if d == m/2 exactly, we wind up with d == m afterward, which is slightly out of range -- but changing the test from d > mp2 to d >= mp2 to fix this would break the case where m is odd and d == RoundDown(m/2), so we have to live with this. It doesn't matter, because it will be fixed up below.)
Why not simply write d <<= 1; if (d >= m) d -= m; instead? Suppose that, in infinite-precision arithmetic, d << 1 >= m, so we should perform the subtraction -- but the high bit of d is on and the rest of d << 1 is less than m: In this case, the initial shift will lose the high bit and the if will fail to execute.
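The lost-high-bit case is easier to see with 8-bit values. The two functions below are an illustrative 8-bit analogue of the naive and the careful doubling steps; the names are made up, and the casts to std::uint8_t stand in for the truncation that happens naturally in 64-bit arithmetic.

```cpp
#include <cstdint>

// Naive version: shift first, then compare. The shift can lose the top bit.
inline std::uint8_t naive_double(std::uint8_t d, std::uint8_t m) {
    d = static_cast<std::uint8_t>(d << 1);             // 150 << 1 = 300 -> 44
    if (d >= m) d = static_cast<std::uint8_t>(d - m);  // 44 >= 200 is false
    return d;
}

// Careful version: decide before shifting, as the code in the question does.
inline std::uint8_t careful_double(std::uint8_t d, std::uint8_t m) {
    const std::uint8_t mp2 = static_cast<std::uint8_t>(m >> 1);
    const std::uint8_t shifted = static_cast<std::uint8_t>(d << 1);
    // The subtraction wraps around as well, which restores the lost top bit.
    return d > mp2 ? static_cast<std::uint8_t>(shifted - m) : shifted;
}
```

With d = 150 and m = 200 the true result is 300 - 200 = 100. The naive version returns 44, the careful one returns 100.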
Restriction to inputs of 63 bits or fewer
The above edge case can only occur when d's high bit is on, which can only occur when m's high bit is also on (since we maintain the invariant d < m). So it looks like the code is taking pains to work correctly even with very high values of m. Unfortunately, it turns out that it can still overflow elsewhere, resulting in incorrect answers for some inputs that set the top bit. For example, when a = 3, b = 0x7FFFFFFFFFFFFFFFULL and m = 0xFFFFFFFFFFFFFFFFULL, the correct answer should be 0x7FFFFFFFFFFFFFFEULL, but the code will return 0x7FFFFFFFFFFFFFFDULL (an easy way to see the correct answer is to rerun with the values of a and b swapped). Specifically, this behaviour occurs whenever the line d += b overflows and leaves the truncated d less than m, causing a subtraction to be erroneously skipped.
Provided this behaviour is documented (as it is on the Wikipedia page), this is just a limitation, not a bug.
Removing the restriction
If we replace the lines
if (a & 0x8000000000000000ULL)
d += b;
if (d >= m) d -= m;
with
unsigned long long x = -(a >> 63) & b;
if (d >= m - x) d -= m;
d += x;
the code will work for all inputs, including those with top bits set. The cryptic first line is just a conditional-free (and thus usually faster) way of writing
unsigned long long x = (a & 0x8000000000000000ULL) ? b : 0;
The test d >= m - x operates on d before it has been modified -- it's like the old d >= m test, but b (when the top bit of a is on) or 0 (otherwise) has been subtracted from both sides. This tests whether d would be m or larger once x is added to it. We know that the RHS m - x never underflows, because the largest x can be is b and we have established that b < m at the top of the function.
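Putting the replacement lines into the original function gives something like the following. Treat it as a sketch: it uses std::uint64_t instead of unsigned long long, writes the mask as a conditional for readability, and assumes m is nonzero.

```cpp
#include <cstdint>

// Computes (a * b) % m without needing a 128-bit intermediate.
inline std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b, std::uint64_t m) {
    std::uint64_t d = 0;
    const std::uint64_t mp2 = m >> 1;
    if (a >= m) a %= m;
    if (b >= m) b %= m;
    for (int i = 0; i < 64; i++) {
        // Double d modulo m; the shift may wrap, the subtraction wraps back.
        d = (d > mp2) ? (d << 1) - m : d << 1;
        // x is b when the top bit of a is set, otherwise 0.
        const std::uint64_t x = (a >> 63) ? b : 0;
        // Test against m - x before adding, so d + x never has to overflow.
        if (d >= m - x) d -= m;
        d += x;   // if d wrapped in the line above, this wraps it back
        a <<= 1;
    }
    return d;
}
```

With the inputs from the example above, mul_mod(3, 0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL) returns 0x7FFFFFFFFFFFFFFEULL, and swapping a and b gives the same value.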

How does 0 flip back to max integer value when subtracting -1? [duplicate]

I have come across code from someone who appears to believe there is a problem subtracting an unsigned integer from another integer of the same type when the result would be negative. So that code like this would be incorrect even if it happens to work on most architectures.
unsigned int To, Tf;
To = getcounter();
while (1) {
Tf = getcounter();
if ((Tf-To) >= TIME_LIMIT) {
break;
}
}
This is the only vaguely relevant quote from the C standard I could find.
A computation involving unsigned operands can never overflow, because a
result that cannot be represented by the resulting unsigned integer
type is reduced modulo the number that is one greater than the largest
value that can be represented by the resulting type.
I suppose one could take that quote to mean that when the right operand is larger the operation is adjusted to be meaningful in the context of modulo truncated numbers.
i.e.
0x0000 - 0x0001 == 0x 1 0000 - 0x0001 == 0xFFFF
as opposed to using the implementation dependent signed semantics:
0x0000 - 0x0001 == (unsigned)(0 + -1) == (0xFFFF but also 0xFFFE or 0x8001)
Which or what interpretation is right? Is it defined at all?
When you work with unsigned types, modular arithmetic (also known as "wrap around" behavior) is taking place. To understand this modular arithmetic, picture a 12-hour clock face:
9 + 4 = 1 (13 mod 12), so to the other direction it is: 1 - 4 = 9 (-3 mod 12). The same principle is applied while working with unsigned types. If the result type is unsigned, then modular arithmetic takes place.
Now look at the following operations storing the result as an unsigned int:
unsigned int five = 5, seven = 7;
unsigned int a = five - seven; // a = (-2 % 2^32) = 4294967294
int one = 1, six = 6;
unsigned int b = one - six; // b = (-5 % 2^32) = 4294967291
When you want to make sure that the result is signed, store it in a signed variable or cast it to a signed type. When you want the difference between two numbers without modular arithmetic being applied, consider using the abs() function declared in stdlib.h:
int c = five - seven; // c = -2
int d = abs(five - seven); // d = 2
Be very careful, especially while writing conditions, because:
if (abs(five - seven) < seven) // = if (2 < 7)
// ...
if (five - seven < -1) // = if (-2 < -1)
// ...
if (one - six < 1) // = if (-5 < 1)
// ...
if ((int)(five - seven) < 1) // = if (-2 < 1)
// ...
but
if (five - seven < 1) // = if ((unsigned int)-2 < 1) = if (4294967294 < 1)
// ...
if (one - six < five) // = if ((unsigned int)-5 < 5) = if (4294967291 < 5)
// ...
The result of a subtraction generating a negative number in an unsigned type is well-defined:
[...] A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
(ISO/IEC 9899:1999 (E) §6.2.5/9)
As you can see, (unsigned)0 - (unsigned)1 equals -1 modulo UINT_MAX+1, or in other words, UINT_MAX.
Note that although it does say "A computation involving unsigned operands can never overflow", which might lead you to believe that it applies only for exceeding the upper limit, this is presented as a motivation for the actual binding part of the sentence: "a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type." This phrase is not restricted to overflow of the upper bound of the type, and applies equally to values too low to be represented.
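Applied to the counter loop from the question, this means the difference of two readings is still the elapsed tick count even when the counter wrapped in between. A minimal sketch, assuming a free-running 32-bit counter that wraps at most once between the two readings (elapsed is a made-up helper name):

```cpp
#include <climits>
#include <cstdint>

// Ticks between two readings of a wrapping 32-bit counter.
constexpr std::uint32_t elapsed(std::uint32_t start, std::uint32_t now) {
    return static_cast<std::uint32_t>(now - start);
}

// The counter wrapped: 6 ticks up to the wrap point plus 10 after it.
static_assert(elapsed(0xFFFFFFFAu, 10u) == 16u, "difference across a wrap");
static_assert(elapsed(100u, 250u) == 150u, "ordinary case");

// The textbook case from the quote above.
static_assert(0u - 1u == UINT_MAX, "0 - 1 wraps to UINT_MAX");
```

The comparison (Tf - To) >= TIME_LIMIT in the question relies on exactly this property.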
Well, the first interpretation is correct. However, your reasoning about the "signed semantics" in this context is wrong.
Again, your first interpretation is correct. Unsigned arithmetic follows the rules of modulo arithmetic, meaning that 0x0000 - 0x0001 evaluates to 0xFFFF for 16-bit unsigned types (and to 0xFFFFFFFF for 32-bit ones).
However, the second interpretation (the one based on "signed semantics") is also required to produce the same result. I.e. even if you evaluate 0 - 1 in the domain of signed type and obtain -1 as the intermediate result, this -1 is still required to produce 0xFFFF when later it gets converted to unsigned type. Even if some platform uses an exotic representation for signed integers (1's complement, signed magnitude), this platform is still required to apply rules of modulo arithmetic when converting signed integer values to unsigned ones.
For example, this evaluation
signed int a = 0, b = 1;
unsigned int c = a - b;
is still guaranteed to produce UINT_MAX in c, even if the platform is using an exotic representation for signed integers.
With unsigned numbers of type unsigned int or larger, in the absence of type conversions, a-b is defined as yielding the unsigned number which, when added to b, will yield a. Conversion of a negative number to unsigned is defined as yielding the number which, when added to the sign-reversed original number, will yield zero (so converting -5 to unsigned will yield a value which, when added to 5, will yield zero).
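Both definitions can be checked at compile time. A short sketch (to_unsigned is a made-up helper name):

```cpp
#include <climits>

// Signed-to-unsigned conversion is defined modulo UINT_MAX + 1.
constexpr unsigned int to_unsigned(int value) {
    return static_cast<unsigned int>(value);
}

// Converting -5 yields the value which, added to 5, gives zero.
static_assert(to_unsigned(-5) + 5u == 0u, "conversion of a negative number");
static_assert(to_unsigned(-1) == UINT_MAX, "-1 converts to UINT_MAX");

// a - b is the unsigned number which, added to b, gives a.
static_assert((5u - 7u) + 7u == 5u, "subtraction that wraps");
```

None of these checks depend on how the platform represents negative integers.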
Note that unsigned numbers smaller than unsigned int may get promoted to type int before the subtraction; in that case the behavior of a-b will depend upon the size of int.
Unsigned integer subtraction has defined behavior, but it is also tricky. When you subtract two unsigned integers narrower than int, the operands are promoted to int, so the result has type int unless you convert it explicitly. If you then store it in a narrow signed type, for example int8_t result = a - b; (where a and b have uint8_t type), you can get very weird behavior: you may lose the transitivity property (i.e. that a > b and b > c implies a > c).
The loss of transitivity can break a tree-type data structure. Take care not to give sorting, searching, or tree-building code a comparison function that uses unsigned integer subtraction to decide which key is higher or lower.
See example below.
#include <stdint.h>
#include <stdio.h>
int main(void)
{
uint8_t a = 255;
uint8_t b = 100;
uint8_t c = 150;
printf("uint8_t a = %+d, b = %+d, c = %+d\n\n", a, b, c);
printf(" b - a = %+d\tpromotion to int type\n"
" (int8_t)(b - a) = %+d\n\n"
" b + a = %+d\tpromotion to int type\n"
"(uint8_t)(b + a) = %+d\tmodular arithmetic\n"
" b + a %% %d = %+d\n\n",
b - a, (int8_t)(b - a),
b + a, (uint8_t)(b + a),
UINT8_MAX + 1,
(b + a) % (UINT8_MAX + 1));
printf("c %s b (c - b = %d), b %s a (b - a = %d), AND c %s a (c - a = %d)\n",
(int8_t)(c - b) < 0 ? "<" : ">", (int8_t)(c - b),
(int8_t)(b - a) < 0 ? "<" : ">", (int8_t)(b - a),
(int8_t)(c - a) < 0 ? "<" : ">", (int8_t)(c - a));
}
$ ./a.out
uint8_t a = +255, b = +100, c = +150
b - a = -155 promotion to int type
(int8_t)(b - a) = +101
b + a = +355 promotion to int type
(uint8_t)(b + a) = +99 modular arithmetic
b + a % 256 = +99
c > b (c - b = 50), b > a (b - a = 101), AND c < a (c - a = -105)
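A comparison function that avoids the problem simply compares instead of subtracting. The sketch below contrasts the two approaches; both function names are invented, and the int8_t conversion in the first one assumes the usual wraparound conversion (guaranteed since C++20, implementation-defined before).

```cpp
#include <cstdint>

// Subtraction-based comparison, as warned against above: the int8_t cast
// folds large differences onto the wrong sign.
inline int compare_by_subtraction(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::int8_t>(a - b);
}

// Direct comparison: returns -1, 0 or 1 and gives a consistent ordering.
constexpr int compare_directly(std::uint8_t a, std::uint8_t b) {
    return (a > b) - (a < b);
}
```

With the values from the example, compare_by_subtraction(100, 255) is positive (claiming 100 > 255), while compare_directly(100, 255) is -1.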
int d = abs(five - seven); // d = 2
std::abs has no overload for unsigned integers, so the argument needs a cast to a signed type first.

implicit conversion of unsigned and signed [duplicate]

I have come across code from someone who appears to believe there is a problem subtracting an unsigned integer from another integer of the same type when the result would be negative. So that code like this would be incorrect even if it happens to work on most architectures.
unsigned int To, Tf;
To = getcounter();
while (1) {
Tf = getcounter();
if ((Tf-To) >= TIME_LIMIT) {
break;
}
}
This is the only vaguely relevant quote from the C standard I could find.
A computation involving unsigned operands can never overflow, because a
result that cannot be represented by the resulting unsigned integer
type is reduced modulo the number that is one greater than the largest
value that can be represented by the resulting type.
I suppose one could take that quote to mean that when the right operand is larger the operation is adjusted to be meaningful in the context of modulo truncated numbers.
i.e.
0x0000 - 0x0001 == 0x 1 0000 - 0x0001 == 0xFFFF
as opposed to using the implementation dependent signed semantics:
0x0000 - 0x0001 == (unsigned)(0 + -1) == (0xFFFF but also 0xFFFE or 0x8001)
Which or what interpretation is right? Is it defined at all?
When you work with unsigned types, modular arithmetic (also known as "wrap around" behavior) is taking place. To understand this modular arithmetic, just have a look at these clocks:
9 + 4 = 1 (13 mod 12), so to the other direction it is: 1 - 4 = 9 (-3 mod 12). The same principle is applied while working with unsigned types. If the result type is unsigned, then modular arithmetic takes place.
Now look at the following operations storing the result as an unsigned int:
unsigned int five = 5, seven = 7;
unsigned int a = five - seven; // a = (-2 % 2^32) = 4294967294
int one = 1, six = 6;
unsigned int b = one - six; // b = (-5 % 2^32) = 4294967291
When you want to make sure that the result is signed, then stored it into signed variable or cast it to signed. When you want to get the difference between numbers and make sure that the modular arithmetic will not be applied, then you should consider using abs() function defined in stdlib.h:
int c = five - seven; // c = -2
int d = abs(five - seven); // d = 2
Be very careful, especially while writing conditions, because:
if (abs(five - seven) < seven) // = if (2 < 7)
// ...
if (five - seven < -1) // = if (-2 < -1)
// ...
if (one - six < 1) // = if (-5 < 1)
// ...
if ((int)(five - seven) < 1) // = if (-2 < 1)
// ...
but
if (five - seven < 1) // = if ((unsigned int)-2 < 1) = if (4294967294 < 1)
// ...
if (one - six < five) // = if ((unsigned int)-5 < 5) = if (4294967291 < 5)
// ...
The result of a subtraction generating a negative number in an unsigned type is well-defined:
[...] A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
(ISO/IEC 9899:1999 (E) §6.2.5/9)
As you can see, (unsigned)0 - (unsigned)1 equals -1 modulo UINT_MAX+1, or in other words, UINT_MAX.
Note that although it does say "A computation involving unsigned operands can never overflow", which might lead you to believe that it applies only for exceeding the upper limit, this is presented as a motivation for the actual binding part of the sentence: "a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type." This phrase is not restricted to overflow of the upper bound of the type, and applies equally to values too low to be represented.
Well, the first interpretation is correct. However, your reasoning about the "signed semantics" in this context is wrong.
Again, your first interpretation is correct. Unsigned arithmetic follow the rules of modulo arithmetic, meaning that 0x0000 - 0x0001 evaluates to 0xFFFF for 32-bit unsigned types.
However, the second interpretation (the one based on "signed semantics") is also required to produce the same result. I.e. even if you evaluate 0 - 1 in the domain of signed type and obtain -1 as the intermediate result, this -1 is still required to produce 0xFFFF when later it gets converted to unsigned type. Even if some platform uses an exotic representation for signed integers (1's complement, signed magnitude), this platform is still required to apply rules of modulo arithmetic when converting signed integer values to unsigned ones.
For example, this evaluation
signed int a = 0, b = 1;
unsigned int c = a - b;
is still guaranteed to produce UINT_MAX in c, even if the platform is using an exotic representation for signed integers.
With unsigned numbers of type unsigned int or larger, in the absence of type conversions, a-b is defined as yielding the unsigned number which, when added to b, will yield a. Conversion of a negative number to unsigned is defined as yielding the number which, when added to the sign-reversed original number, will yield zero (so converting -5 to unsigned will yield a value which, when added to 5, will yield zero).
Note that unsigned numbers smaller than unsigned int may get promoted to type int before the subtraction, the behavior of a-b will depend upon the size of int.
Well, an unsigned integer subtraction has defined behavior, also it is a tricky thing. When you subtract two unsigned integers, result is promoted to higher type int if result (lvalue) type is not specified explicitly. In the latter case, for example, int8_t result = a - b; (where a and b have int8_t type) you can obtain very weird behavior. I mean you may loss transitivity property (i.e. if a > b and b > c it is true that a > c).
The loss of transitivity can destroy a tree-type data structure work. Care must be taken not to provide comparison function for sorting, searching, tree building that uses unsigned integer subtraction to deduce which key is higher or lower.
See example below.
#include <stdint.h>
#include <stdio.h>
void main()
{
uint8_t a = 255;
uint8_t b = 100;
uint8_t c = 150;
printf("uint8_t a = %+d, b = %+d, c = %+d\n\n", a, b, c);
printf(" b - a = %+d\tpromotion to int type\n"
" (int8_t)(b - a) = %+d\n\n"
" b + a = %+d\tpromotion to int type\n"
"(uint8_t)(b + a) = %+d\tmodular arithmetic\n"
"(b + a) %% %d = %+d\n\n",
b - a, (int8_t)(b - a),
b + a, (uint8_t)(b + a),
UINT8_MAX + 1,
(b + a) % (UINT8_MAX + 1));
printf("c %s b (c - b = %d), b %s a (b - a = %d), AND c %s a (c - a = %d)\n",
(int8_t)(c - b) < 0 ? "<" : ">", (int8_t)(c - b),
(int8_t)(b - a) < 0 ? "<" : ">", (int8_t)(b - a),
(int8_t)(c - a) < 0 ? "<" : ">", (int8_t)(c - a));
}
$ ./a.out
uint8_t a = +255, b = +100, c = +150
b - a = -155 promotion to int type
(int8_t)(b - a) = +101
b + a = +355 promotion to int type
(uint8_t)(b + a) = +99 modular arithmetic
(b + a) % 256 = +99
c > b (c - b = 50), b > a (b - a = 101), AND c < a (c - a = -105)
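For contrast, here is a minimal sketch of the two kinds of comparator, using the same key values as the program above. Both function names are hypothetical. The narrowing cast to int8_t is assumed to wrap modulo 256, which is what mainstream compilers do (and what C++20 requires).

```cpp
#include <cstdint>

// Fragile: the difference is computed in int, then narrowed to int8_t.
// Once two keys are more than 127 apart the sign of the result is wrong.
constexpr bool less_by_subtraction(std::uint8_t x, std::uint8_t y) {
    return static_cast<std::int8_t>(x - y) < 0;
}

// Robust: compare the keys directly.
constexpr bool less_direct(std::uint8_t x, std::uint8_t y) {
    return x < y;
}

// With a = 255, b = 100, c = 150 the subtraction comparator claims
// a < b and b < c, yet denies a < c: transitivity is lost.
static_assert(less_by_subtraction(255, 100), "claims a < b");
static_assert(less_by_subtraction(100, 150), "claims b < c");
static_assert(!less_by_subtraction(255, 150), "but not a < c");

// The direct comparison orders the same keys consistently: b < c < a.
static_assert(less_direct(100, 150) && less_direct(150, 255) &&
              less_direct(100, 255), "consistent ordering");
```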
int d = abs(five - seven); // d = 2
std::abs is not "suitable" for unsigned integers, so a cast to a signed type is needed.
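If the goal is only the distance between two unsigned values, one alternative that avoids both std::abs and the cast is to subtract the smaller value from the larger. This is a different technique from the cast mentioned above, and abs_diff is a name made up for this sketch.

```cpp
// Hypothetical helper: absolute difference of two unsigned values.
// Always subtracts the smaller from the larger, so it never wraps.
constexpr unsigned int abs_diff(unsigned int a, unsigned int b) {
    return a > b ? a - b : b - a;
}

static_assert(abs_diff(5u, 7u) == 2u, "|5 - 7| == 2");
static_assert(abs_diff(7u, 5u) == 2u, "|7 - 5| == 2");
```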

reversing two's complement for 18bit int

I have an 18 bit integer that is in two's complement and I'd like to convert it to a signed number so I can better use it. On the platform I'm using, ints are 4 bytes (i.e. 32 bits). Based on this post:
Convert Raw 14 bit Two's Complement to Signed 16 bit Integer
I tried the following to convert the number:
using SomeType = uint64_t;
SomeType largeNum = 0x32020e6ed2006400;
int twosCompNum = (largeNum & 0x3FFFF);
int regularNum = (int) ((twosCompNum << 14) / 8192);
I shifted the number left 14 places to get the sign bit as the most significant bit and then divided by 8192 (in binary, it's 1 followed by 13 zeroes) to restore the magnitude (as mentioned in the post above). However, this doesn't seem to work for me. As an example, inputting 249344 gives me -25600, which prima facie doesn't seem correct. What am I doing wrong?
The almost-portable way (with assumption that negative integers are natively 2s-complement) is to simply inspect bit 17, and use that to conditionally mask in the sign bits:
constexpr SomeType sign_bits = ~SomeType{} << 18;
int regularNum = (twosCompNum & (1 << 17)) ? (twosCompNum | sign_bits) : twosCompNum;
Note that this doesn't depend on the size of your int type.
The constant 8192 is wrong, it should be 16384 = (1<<14).
int regularNum = (twosCompNum << 14) / (1<<14);
With this, the answer is correct, -12800.
It is correct, because the input (unsigned) number is 249344 (0x3CE00). It has its highest bit set, so it is a negative number. We can calculate its signed value by subtracting "max unsigned value+1" from it: 0x3CE00-0x40000=-12800.
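The subtraction described above can also be written out directly, without any shift into the sign bit. This is only a sketch: from_18bit is a hypothetical name, and the input is assumed to arrive as an unsigned 32-bit value whose low 18 bits hold the two's complement number.

```cpp
#include <cstdint>

// Hypothetical helper: interpret the low 18 bits of raw as a signed value.
// If bit 17 (the 18-bit sign bit) is set, subtract 2^18 = 0x40000,
// i.e. "max unsigned value + 1".
constexpr std::int32_t from_18bit(std::uint32_t raw) {
    return (raw & 0x20000u)
        ? static_cast<std::int32_t>(raw & 0x3FFFFu) - 0x40000
        : static_cast<std::int32_t>(raw & 0x3FFFFu);
}

// 249344 == 0x3CE00, and 0x3CE00 - 0x40000 == -12800.
static_assert(from_18bit(249344u) == -12800, "matches the worked example");
static_assert(from_18bit(0x1FFFFu) == 131071, "largest positive value");
static_assert(from_18bit(0x20000u) == -131072, "most negative value");
```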
Note that if you are on a platform where a right shift of a signed value is an arithmetic shift (as on x86), you can avoid the division:
int regularNum = (twosCompNum << 14) >> 14;
This version can be slightly faster (but has implementation-defined behavior), if the compiler doesn't notice that division can be exactly replaced by a shift (clang 7 notices, but gcc 8 doesn't).
Two problems: first, your test input is not a non-negative 18-bit two's complement number. With n bits, two's complement permits -(2 ^ (n - 1)) <= value < 2 ^ (n - 1). In the case of 18 bits, that's -131072 <= value < 131072. You say you input 249344, which is outside of this range and would actually be interpreted as -12800.
The second problem is that your powers of two are off. In the answer you cite, the solution offered is of the form
mBitOutput = (mBitCast)(nBitInput << (m - n)) / (1 << (m - n));
For your particular problem, you desire
int output = (nBitInput << (32 - 18)) / (1 << (32 - 18));
// or equivalent
int output = (nBitInput << 14) / 16384;
Try this out.
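For an arbitrary width, a common alternative to the shift-and-divide form is the XOR-and-subtract idiom, which stays entirely in unsigned arithmetic until the final step. This is a swapped-in technique rather than the one from the cited answer; sign_extend is a hypothetical name and the sketch assumes 1 <= bits <= 31.

```cpp
#include <cstdint>

// Hypothetical helper: sign-extend the low `bits` bits of value.
// Assumes 1 <= bits <= 31.
// Flipping the sign bit and then subtracting its weight maps
// [2^(bits-1), 2^bits) onto [-2^(bits-1), 0) and leaves smaller values alone.
constexpr std::int32_t sign_extend(std::uint32_t value, unsigned bits) {
    return static_cast<std::int32_t>(
               (value & ((1u << bits) - 1u)) ^ (1u << (bits - 1)))
         - static_cast<std::int32_t>(1u << (bits - 1));
}

static_assert(sign_extend(249344u, 18) == -12800, "18-bit example");
static_assert(sign_extend(0x3FFFu, 14) == -1, "14-bit all ones is -1");
static_assert(sign_extend(0x1FFFu, 14) == 8191, "14-bit maximum");
```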

Carry bits in incidents of overflow

/*
* isLessOrEqual - if x <= y then return 1, else return 0
* Example: isLessOrEqual(4,5) = 1.
* Legal ops: ! ~ & ^ | + << >>
* Max ops: 24
* Rating: 3
*/
int isLessOrEqual(int x, int y)
{
int msbX = x>>31;
int msbY = y>>31;
int sum_xy = (y+(~x+1));
int twoPosAndNegative = (!msbX & !msbY) & sum_xy; //isLessOrEqual is FALSE.
// if = true, twoPosAndNegative = 1; Overflow true
// twoPos = Negative means y < x which means that this
int twoNegAndPositive = (msbX & msbY) & !sum_xy;//isLessOrEqual is FALSE
//We started with two negative numbers, and subtracted X, resulting in positive. Therefore, x is bigger.
int isEqual = (!x^!y); //isLessOrEqual is TRUE
return (twoPosAndNegative | twoNegAndPositive | isEqual);
}
Currently, I am trying to work through how to carry bits in this operator.
The purpose of this function is to identify whether or not int y >= int x.
This is part of a class assignment, so there are restrictions on casting and which operators I can use.
I'm trying to account for a carried bit by applying a mask of the complement of the MSB, to try and remove the most significant bit from the equation, so that they may overflow without causing an issue.
I am under the impression that, ignoring cases of overflow, the returned expression would work.
EDIT: Here is my adjusted code, still not working. But, I think this is progress? I feel like I'm chasing my own tail.
int isLessOrEqual(int x, int y)
{
int msbX = x >> 31;
int msbY = y >> 31;
int sign_xy_sum = (y + (~x + 1)) >> 31;
return ((!msbY & msbX) | (!sign_xy_sum & (!msbY | msbX)));
}
I figured it out with the assistance of one of my peers, alongside the commentators here on StackOverflow.
The solution is as seen above.
The asker has self-answered their question (a class assignment), so providing alternative solutions seems appropriate at this time. The question clearly assumes that integers are represented as two's complement numbers.
One approach is to consider how CPUs compute predicates for conditional branching by means of a compare instruction. "signed less than" as expressed in processor condition codes is SF ≠ OF. SF is the sign flag, a copy of the sign-bit, or most significant bit (MSB) of the result. OF is the overflow flag which indicates overflow in signed integer operations. This is computed as the XOR of the carry-in and the carry-out of the sign-bit or MSB. With two's complement arithmetic, a - b = a + ~b + 1, and therefore a <= b is equivalent to a + ~b < 0, provided the addition is evaluated without overflow. It remains to separate computation on the sign bit (MSB) sufficiently from the lower order bits. This leads to the following code:
#include <climits> // CHAR_BIT
int isLessOrEqual (int a, int b)
{
int nb = ~b;
int ma = a & ((1U << (sizeof(a) * CHAR_BIT - 1)) - 1);
int mb = nb & ((1U << (sizeof(b) * CHAR_BIT - 1)) - 1);
// for the following, only the MSB is of interest, other bits are don't care
int cyin = ma + mb;
int ovfl = (a ^ cyin) & (a ^ b);
int sign = (a ^ nb ^ cyin);
int lteq = sign ^ ovfl;
// desired predicate is now in the MSB (sign bit) of lteq, extract it
return (int)((unsigned int)lteq >> (sizeof(lteq) * CHAR_BIT - 1));
}
The casting to unsigned int prior to the final right shift is necessary because right-shifting of signed integers with negative value is implementation-defined, per the ISO-C++ standard, section 5.8. Asker has pointed out that casts are not allowed. When right shifting signed integers, C++ compilers will generate either a logical right shift instruction, or an arithmetic right shift instruction. As we are only interested in extracting the MSB, we can isolate ourselves from the choice by shifting then masking out all other bits besides the LSB, at the cost of one additional operation:
return (lteq >> (sizeof(lteq) * CHAR_BIT - 1)) & 1;
The above solution requires a total of eleven or twelve basic operations. A significantly more efficient solution is based on the 1972 MIT HAKMEM memo, which contains the following observation:
ITEM 23 (Schroeppel): (A AND B) + (A OR B) = A + B = (A XOR B) + 2 (A AND B).
This is straightforward, as A AND B represent the carry bits, and A XOR B represent the sum bits. In a newsgroup posting to comp.arch.arithmetic on February 11, 2000, Peter L. Montgomery provided the following extension:
If XOR is available, then this can be used to average
two unsigned variables A and B when the sum might overflow:
(A+B)/2 = (A AND B) + (A XOR B)/2
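Both identities are easy to spot-check on unsigned 32-bit values. The sketch below uses invented names (item23_holds, average_floor) and only checks a few hand-picked operands; it is not a proof.

```cpp
#include <cstdint>

// HAKMEM item 23: (A AND B) + (A OR B) == A + B == (A XOR B) + 2 (A AND B).
constexpr bool item23_holds(std::uint32_t a, std::uint32_t b) {
    return (a & b) + (a | b) == a + b
        && a + b == (a ^ b) + 2u * (a & b);
}

// Montgomery's extension: floor((A + B) / 2) without forming A + B,
// so it stays correct even when the sum would not fit in 32 bits.
constexpr std::uint32_t average_floor(std::uint32_t a, std::uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}

static_assert(item23_holds(0xDEADBEEFu, 0x12345678u), "identity holds");
static_assert(average_floor(7u, 10u) == 8u, "floor(17 / 2) == 8");
// The naive (a + b) / 2 would wrap here; the AND/XOR form does not.
static_assert(average_floor(0xFFFFFFFFu, 0xFFFFFFFDu) == 0xFFFFFFFEu,
              "average of two large values");
```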
In the context of this question, this allows us to compute (a + ~b) / 2 without overflow, then inspect the sign bit to see if the result is less than zero. While Montgomery only referred to unsigned integers, the extension to signed integers is straightforward by use of an arithmetic right shift, keeping in mind that right shifting is an integer division which rounds towards negative infinity, rather than towards zero as regular integer division.
int isLessOrEqual (int a, int b)
{
int nb = ~b;
// compute avg(a,~b) without overflow, rounding towards -INF; lteq(a,b) = SF
int lteq = (a & nb) + arithmetic_right_shift (a ^ nb, 1);
return (int)((unsigned int)lteq >> (sizeof(lteq) * CHAR_BIT - 1));
}
Unfortunately, C++ itself provides no portable way to code an arithmetic right shift, but we can emulate it fairly efficiently using this answer:
int arithmetic_right_shift (int a, int s)
{
unsigned int mask_msb = 1U << (sizeof(mask_msb) * CHAR_BIT - 1);
unsigned int ua = a;
ua = ua >> s;
mask_msb = mask_msb >> s;
return (int)((ua ^ mask_msb) - mask_msb);
}
When inlined, this adds just a couple of instructions to the code when the shift count is a compile-time constant. If the compiler documentation indicates that the implementation-defined right shift of signed integers of negative value is performed by an arithmetic right shift instruction, it is safe to simplify to this six-operation solution:
int isLessOrEqual (int a, int b)
{
int nb = ~b;
// compute avg(a,~b) without overflow, rounding towards -INF; lteq(a,b) = SF
int lteq = (a & nb) + ((a ^ nb) >> 1);
return (int)((unsigned int)lteq >> (sizeof(lteq) * CHAR_BIT - 1));
}
The previously made comments regarding use of a cast when converting the sign bit into a predicate apply here as well.
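To gain some confidence in the averaging approach, the portable variant can be restated as constexpr functions and checked at compile time against a few boundary cases. This is a sketch only: the snake_case names are made up to avoid clashing with the functions above, and the unsigned-to-int conversion is assumed to wrap modulo 2^N, as mainstream compilers do.

```cpp
#include <climits>

// Mask with only the most significant bit of unsigned int set.
constexpr unsigned int kMsb = 1u << (sizeof(unsigned int) * CHAR_BIT - 1);

// Emulated arithmetic right shift, same idea as arithmetic_right_shift above:
// shift as unsigned, then sign-extend from the shifted position of the MSB.
constexpr int asr(int a, int s) {
    return static_cast<int>(
        ((static_cast<unsigned int>(a) >> s) ^ (kMsb >> s)) - (kMsb >> s));
}

// a <= b exactly when floor((a + ~b) / 2) is negative; the average is
// formed as (a & ~b) + ((a ^ ~b) >> 1), which cannot overflow.
constexpr int is_less_or_equal(int a, int b) {
    return static_cast<int>(
        static_cast<unsigned int>((a & ~b) + asr(a ^ ~b, 1))
        >> (sizeof(int) * CHAR_BIT - 1));
}

static_assert(is_less_or_equal(4, 5) == 1, "4 <= 5");
static_assert(is_less_or_equal(5, 5) == 1, "5 <= 5");
static_assert(is_less_or_equal(5, 4) == 0, "not 5 <= 4");
static_assert(is_less_or_equal(INT_MIN, INT_MAX) == 1, "extremes, true");
static_assert(is_less_or_equal(INT_MAX, INT_MIN) == 0, "extremes, false");
```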