Which arithmetic properties do two's complement integers have? - twos-complement

There are several arithmetic properties that hold for whole numbers, the ones we're used to when performing calculations:
'+' Commutativity: a+b == b+a
'+' Associativity: (a+b)+c == a+(b+c)
'*' Commutativity: a*b == b*a
'*' Associativity: (a*b)*c == a*(b*c)
Distributivity of '*' over '+': a*(b+c) == a*b + a*c
I know that the commutativity relations hold for two's complement integers, but I'm not sure about the other properties.
Which of the properties above are satisfied by two's complement integers? If the result depends on signedness, I would like to know that as well.

I wrote a program that checks every 8-bit integer. Since the search space is not that big, I could check every single 8-bit two's complement value. It's not a rigorous mathematical proof, but it worked well for my purposes.
#include <stdint.h>
#include <stdio.h>

typedef int8_t I8;

int main(void) {
    /* Exhaustively check distributivity of * over + for every 8-bit value,
       including negatives; the casts back to I8 model the two's complement
       wrap-around of the 8-bit type. */
    for (int x = INT8_MIN; x <= INT8_MAX; ++x) {
        for (int y = INT8_MIN; y <= INT8_MAX; ++y) {
            for (int z = INT8_MIN; z <= INT8_MAX; ++z) {
                I8 t1 = (I8)(x + y);
                I8 r1 = (I8)(t1 * z);
                I8 t2 = (I8)(x * z);
                I8 t3 = (I8)(y * z);
                I8 r2 = (I8)(t2 + t3);
                if (r1 != r2) {
                    printf("Achtung!!\n");
                    return 1;
                }
            }
        }
    }
    return 0;
}
I used a program like this to verify each property, slightly modifying it for each case.
I found that both signed and unsigned integers have all of the properties listed above:
Associativity for + and * operations
Commutativity for + and * operations
Distributivity of * over +
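For intuition, here is a small worked example (the values are mine, not from the original post) showing distributivity surviving an intermediate wrap-around in 8-bit two's complement:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 8-bit two's complement wrap-around, modeled by casting back to int8_t */
    int8_t x = 100, y = 100, z = 2;

    int8_t lhs = (int8_t)((int8_t)(x + y) * z);                 /* (100+100) wraps to -56; -56*2 = -112 */
    int8_t rhs = (int8_t)((int8_t)(x * z) + (int8_t)(y * z));   /* -56 + -56 = -112 */

    printf("lhs = %d, rhs = %d\n", lhs, rhs);                   /* both print -112 */
    return 0;
}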

Related

How does 0 flip back to max integer value when subtracting -1? [duplicate]

I have come across code from someone who appears to believe there is a problem subtracting an unsigned integer from another integer of the same type when the result would be negative, so that code like this would be incorrect even if it happens to work on most architectures.
unsigned int To, Tf;
To = getcounter();
while (1) {
    Tf = getcounter();
    if ((Tf - To) >= TIME_LIMIT) {
        break;
    }
}
This is the only vaguely relevant quote from the C standard I could find.
A computation involving unsigned operands can never overflow, because a
result that cannot be represented by the resulting unsigned integer
type is reduced modulo the number that is one greater than the largest
value that can be represented by the resulting type.
I suppose one could take that quote to mean that when the right operand is larger the operation is adjusted to be meaningful in the context of modulo truncated numbers.
i.e.
0x0000 - 0x0001 == 0x10000 - 0x0001 == 0xFFFF
as opposed to using the implementation dependent signed semantics:
0x0000 - 0x0001 == (unsigned)(0 + -1) == (0xFFFF but also 0xFFFE or 0x8001)
Which or what interpretation is right? Is it defined at all?
When you work with unsigned types, modular arithmetic (also known as "wrap around" behavior) is taking place. To understand this modular arithmetic, think of a clock face:
9 + 4 = 1 (13 mod 12), so to the other direction it is: 1 - 4 = 9 (-3 mod 12). The same principle is applied while working with unsigned types. If the result type is unsigned, then modular arithmetic takes place.
Now look at the following operations storing the result as an unsigned int:
unsigned int five = 5, seven = 7;
unsigned int a = five - seven; // a = (-2 % 2^32) = 4294967294
int one = 1, six = 6;
unsigned int b = one - six; // b = (-5 % 2^32) = 4294967291
When you want to make sure that the result is signed, store it in a signed variable or cast it to signed. When you want to get the difference between numbers and make sure that modular arithmetic will not be applied, then you should consider using the abs() function defined in stdlib.h:
int c = five - seven; // c = -2
int d = abs(five - seven); // d = 2
Be very careful, especially while writing conditions, because:
if (abs(five - seven) < seven) // = if (2 < 7)
// ...
if (five - seven < -1) // = if (-2 < -1)
// ...
if (one - six < 1) // = if (-5 < 1)
// ...
if ((int)(five - seven) < 1) // = if (-2 < 1)
// ...
but
if (five - seven < 1) // = if ((unsigned int)-2 < 1) = if (4294967294 < 1)
// ...
if (one - six < five) // = if ((unsigned int)-5 < 5) = if (4294967291 < 5)
// ...
The result of a subtraction generating a negative number in an unsigned type is well-defined:
[...] A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
(ISO/IEC 9899:1999 (E) §6.2.5/9)
As you can see, (unsigned)0 - (unsigned)1 equals -1 modulo UINT_MAX+1, or in other words, UINT_MAX.
Note that although it does say "A computation involving unsigned operands can never overflow", which might lead you to believe that it applies only for exceeding the upper limit, this is presented as a motivation for the actual binding part of the sentence: "a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type." This phrase is not restricted to overflow of the upper bound of the type, and applies equally to values too low to be represented.
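As a minimal sketch of both directions of the wrap-around (my own example, not part of the quoted answer):

#include <limits.h>
#include <stdio.h>

int main(void) {
    unsigned int below = 0u - 1u;          /* too low to represent: reduced modulo UINT_MAX + 1 */
    unsigned int above = UINT_MAX + 1u;    /* too high to represent: reduced the same way */

    printf("%u %u\n", below, above);       /* prints UINT_MAX and 0 */
    return 0;
}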
Well, the first interpretation is correct. However, your reasoning about the "signed semantics" in this context is wrong.
Again, your first interpretation is correct. Unsigned arithmetic follows the rules of modulo arithmetic, meaning that 0x0000 - 0x0001 evaluates to 0xFFFF for a 16-bit unsigned type (and to 0xFFFFFFFF for a 32-bit one).
However, the second interpretation (the one based on "signed semantics") is also required to produce the same result. I.e. even if you evaluate 0 - 1 in the domain of signed type and obtain -1 as the intermediate result, this -1 is still required to produce 0xFFFF when later it gets converted to unsigned type. Even if some platform uses an exotic representation for signed integers (1's complement, signed magnitude), this platform is still required to apply rules of modulo arithmetic when converting signed integer values to unsigned ones.
For example, this evaluation
signed int a = 0, b = 1;
unsigned int c = a - b;
is still guaranteed to produce UINT_MAX in c, even if the platform is using an exotic representation for signed integers.
With unsigned numbers of type unsigned int or larger, in the absence of type conversions, a-b is defined as yielding the unsigned number which, when added to b, will yield a. Conversion of a negative number to unsigned is defined as yielding the number which, when added to the sign-reversed original number, will yield zero (so converting -5 to unsigned will yield a value which, when added to 5, will yield zero).
Note that unsigned numbers smaller than unsigned int may get promoted to type int before the subtraction; in that case the behavior of a-b will depend upon the size of int.
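A small sketch of both rules (the values are mine): the subtraction result is the unsigned value that, added back to b, yields a, and converting a negative value to unsigned follows the same modular rule:

#include <stdio.h>

int main(void) {
    unsigned int a = 3u, b = 7u;
    unsigned int d = a - b;             /* the unsigned value satisfying d + b == a (mod UINT_MAX + 1) */
    printf("%d\n", d + b == a);         /* prints 1 */

    unsigned int u = (unsigned int)-5;  /* defined so that u + 5 == 0 (mod UINT_MAX + 1) */
    printf("%d\n", u + 5u == 0u);       /* prints 1 */
    return 0;
}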
Well, unsigned integer subtraction has defined behavior, but it is a tricky thing. When you subtract two unsigned integers, the result is promoted to the wider type int unless the result (lvalue) type is specified explicitly. In the latter case, for example int8_t result = a - b; (where a and b have uint8_t type), you can obtain very weird behavior: you may lose the transitivity property (i.e. if a > b and b > c then it is true that a > c).
The loss of transitivity can break a tree-based data structure. Care must be taken not to provide a comparison function for sorting, searching, or tree building that uses unsigned integer subtraction to deduce which key is higher or lower.
See example below.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 255;
    uint8_t b = 100;
    uint8_t c = 150;

    printf("uint8_t a = %+d, b = %+d, c = %+d\n\n", a, b, c);
    printf("           b - a = %+d\tpromotion to int type\n"
           " (int8_t)(b - a) = %+d\n\n"
           "           b + a = %+d\tpromotion to int type\n"
           "(uint8_t)(b + a) = %+d\tmodular arithmetic\n"
           "     b + a %% %d = %+d\n\n",
           b - a, (int8_t)(b - a),
           b + a, (uint8_t)(b + a),
           UINT8_MAX + 1,
           (b + a) % (UINT8_MAX + 1));
    printf("c %s b (c - b = %d), b %s a (b - a = %d), AND c %s a (c - a = %d)\n",
           (int8_t)(c - b) < 0 ? "<" : ">", (int8_t)(c - b),
           (int8_t)(b - a) < 0 ? "<" : ">", (int8_t)(b - a),
           (int8_t)(c - a) < 0 ? "<" : ">", (int8_t)(c - a));
    return 0;
}
$ ./a.out
uint8_t a = +255, b = +100, c = +150
b - a = -155 promotion to int type
(int8_t)(b - a) = +101
b + a = +355 promotion to int type
(uint8_t)(b + a) = +99 modular arithmetic
b + a % 256 = +99
c > b (c - b = 50), b > a (b - a = 101), AND c < a (c - a = -105)
int d = abs(five - seven); // d = 2
std::abs is not "suitable" for unsigned integers; a cast to a signed type is needed first.
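A hedged sketch of that cast, using C's abs() from <stdlib.h> as in the answer above; note that converting the out-of-range unsigned value to int is implementation-defined, though it yields -2 on typical two's complement platforms:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned int five = 5u, seven = 7u;

    /* five - seven is the huge unsigned value UINT_MAX - 1; convert to int before abs() */
    int diff = (int)(five - seven);   /* implementation-defined; -2 on typical platforms */
    int d = abs(diff);                /* 2 */

    printf("%d %d\n", diff, d);
    return 0;
}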


Implementing safe shift-left

I would like to implement a shift-left function that would trigger a failure upon overflow.
Here is my code:
uint32_t safe_shl(uint32_t x, uint8_t y) {
    uint32_t z = x << y;
    assert((z >> y) == x);
    return z;
}
Please assume that the assert function registers an error in my system.
I would like to ensure that my method is bullet-proof (i.e., fails on every erroneous input and only on erroneous input).
And I would also like to ask if you know of a more efficient way to implement this (assuming that it is indeed bullet-proof).
If x << y is undefined, all bets are off.
The only safe way is to check that it's a valid shift before attempting it.
uint32_t safe_shl(uint32_t x, uint8_t y) {
    assert(y < 32);
    if (y < 32)
    {
        uint32_t z = x << y;
        assert((z >> y) == x);
        return z;
    }
    return 0;
}
Note that you need the condition – shifting unconditionally lets the compiler assume that y < 32 is true.
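A brief usage sketch (mine, not from the answer) of the guarded safe_shl above; an in-range shift goes through, while an out-of-range count trips the assert, and with NDEBUG the guard still avoids the undefined shift:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* the guarded version from the answer above */
uint32_t safe_shl(uint32_t x, uint8_t y) {
    assert(y < 32);
    if (y < 32)
    {
        uint32_t z = x << y;
        assert((z >> y) == x);
        return z;
    }
    return 0;
}

int main(void) {
    printf("%u\n", safe_shl(1u, 3));   /* prints 8 */
    printf("%u\n", safe_shl(1u, 40));  /* assert(y < 32) fires; with NDEBUG the guard returns 0 */
    return 0;
}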
Step 1. If x == 0, then for any shift amount the result is conceptually still 0 and is not a problem.
Step 2. Do not attempt excessive shifts.
If the value of the right operand is negative or greater than or equal to the width of the promoted left operand, the behavior is undefined. C11 §6.5.7 3
Step 3. Ensure unsigned math while shifting.
If int/unsigned is wider than uintN_t x, then x << y is done with int math. This is rare with N==32 yet possible. Signed math overflow is possible and leads to UB. By writing 1u*x or (0u+x), code can ensure the shift uses the wider of unsigned and uintN_t math. Good compilers will still make optimal code.
Step 4. Detect if a reduction occurred.
If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. §6.5.7 4
uint32_t safe_shl(uint32_t x, uint8_t y) {
    if (x == 0) {
        return 0;
    }
    assert(y < 32);
    uint32_t z = (1u*x) << y;
    assert((z >> y) == x);
    return z;
}
Are you asking to assert if the shift would cause a carry?
In which case it's a bit nasty in C++ without resorting to intrinsics or assembler.
#include <cassert>
#include <cstdint>
#include <limits>

bool shl_would_carry(uint32_t x, uint8_t y)
{
    constexpr auto nof_bits = std::numeric_limits<decltype(x)>::digits;
    if (y >= nof_bits)
    {
        if (x != 0) return true;
    }
    else if (y > 0)   // y == 0 never carries (and would make the shift below undefined)
    {
        auto limit = decltype(x)(1) << (nof_bits - y);
        if (x >= limit) return true;
    }
    return false;
}

uint32_t safe_shl(uint32_t x, uint8_t y)
{
    assert(!shl_would_carry(x, y));
    return x << y;
}
I think that's right.
This might be better:
#include <tuple>

// assumes 0 < y < 32; y == 0 or y >= 32 would make one of the shifts undefined
std::tuple<uint32_t, uint32_t> shl(uint32_t x, uint8_t y)
{
    constexpr auto nof_bits = std::numeric_limits<decltype(x)>::digits;
    uint32_t overflow = x >> (nof_bits - y);   // the bits shifted out of the top
    uint32_t result = x << y;
    return std::make_tuple(overflow, result);
}

uint32_t safe_shl(uint32_t x, uint8_t y)
{
    auto t = shl(x, y);
    assert(!std::get<0>(t));
    return std::get<1>(t);
}
In C, x << y is defined for uint32_t provided y < 32. From the n1570 draft for C11, in 6.5.7 Bitwise shift operators:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
The result is then required to be: x × 2^y, reduced modulo one more than the maximum value representable in the result type.
Let's call that value z, as in your proposed code. Since you use an unsigned type, the value of z >> y is required to be the integral part of z / 2^y.
That means that, provided y < 32, if there is an overflow the value of z >> y will be strictly less than x because of the modulo, and if there is no overflow you get exactly x.
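A small worked instance of that argument (the numbers are mine, not from the answer): with x = 0x80000001 and y = 4, the shifted value is reduced modulo 2^32, so shifting it back yields something strictly less than x:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t x = 0x80000001u;
    uint8_t  y = 4;

    uint32_t z = x << y;   /* 0x800000010 reduced mod 2^32 gives 0x00000010 */
    printf("%#x %#x %d\n", (unsigned)z, (unsigned)(z >> y),
           (z >> y) == x); /* prints 0x10 0x1 0: the check detects the overflow */
    return 0;
}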
Full reference from 6.5.7 Bitwise shift operators:
...4 The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
5 The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
It is exactly the same in C++ from n4296 draft for C++14 in 5.8 Shift operators [expr.shift]:
...The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.
2 The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2^E2 is representable in the corresponding unsigned type of the result type, then that value, converted to the result type, is the resulting value; otherwise, the behavior is undefined.
3 The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a non-negative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
So in both languages, and assuming that the assert function registers an error in [your] system, code should be:
uint32_t safe_shl(uint32_t x, uint8_t y) {
    assert(y < 32);
    uint32_t z = x << y;
    assert((z >> y) == x);
    return z;
}
In order to write a safe function, you must first identify what isn't safe. If you don't do this, the task is nonsense. The kind of "overflow" you mention is actually well-defined. But the following cases of dangerous behavior exist:
Left-shifting further than the size of the variable, including shifting data into the sign bit of a signed variable. (Undefined behavior)
The right operand is a negative number. (Undefined behavior)
Right-shifting a negative number. (Impl.-defined behavior)
Implicit integer promotion of the left operand causing it to silently change signedness and thereby invoking one of the above errors.
To avoid this, you need to ensure that:
The left operand must be unsigned.
The right operand must be valid and in range of the type of the left operand.
The left operand must not be a small integer type.
1) and 3) are solved by using uint32_t. There exists no system where uint32_t is smaller than int.
2) is solved by using an unsigned type and checking that it isn't too large.
In addition, you seem to have a requirement that shifting out of bounds of the left operand should not be allowed. This is weird, but ok, let's implement that too. It can be done by checking whether the MSB bit position plus the number of shifts is larger than 31.
uint8_t msb_pos32 (uint32_t data)
{
    uint8_t result = 0;
    while ((data >>= 1) != 0)
    {
        result++;
    }
    return result;
}

uint32_t safe_LSL32 (uint32_t x, uint8_t y)
{
    if (y > 31 || y + msb_pos32(x) > 31)
    {
        __asm HCF; // error handling here
    }
    return x << y;
}
Note that this code can be further optimized.

Is ((a + (b & 255)) & 255) the same as ((a + b) & 255)?

I was browsing some C++ code, and found something like this:
(a + (b & 255)) & 255
The double AND annoyed me, so I thought of:
(a + b) & 255
(a and b are 32-bit unsigned integers)
I quickly wrote a test script (JS) to confirm my theory:
for (var i = 0; i < 100; i++) {
    var a = Math.ceil(Math.random() * 0xFFFF),
        b = Math.ceil(Math.random() * 0xFFFF);
    var expr1 = (a + (b & 255)) & 255,
        expr2 = (a + b) & 255;
    if (expr1 != expr2) {
        console.log("Numbers " + a + " and " + b + " mismatch!");
        break;
    }
}
While the script confirmed my hypothesis (both operations are equal), I still don't trust it, because 1) random and 2) I'm not a mathematician, I have no idea what I am doing.
Also, sorry for the Lisp-y title. Feel free to edit it.
They are the same. Here's a proof:
First note the identity (A + B) mod C = (A mod C + B mod C) mod C
Let's restate the problem by regarding a & 255 as standing in for a % 256. This is true since a is unsigned.
So (a + (b & 255)) & 255 is (a + (b % 256)) % 256
This is the same as (a % 256 + b % 256 % 256) % 256 (I've applied the identity stated above: note that mod and % are equivalent for unsigned types.)
This simplifies to (a % 256 + b % 256) % 256 which becomes (a + b) % 256 (reapplying the identity). You can then put the bitwise operator back to give
(a + b) & 255
completing the proof.
Lemma: a & 255 == a % 256 for unsigned a.
Unsigned a can be rewritten as m * 0x100 + b for some unsigned m, b with 0 <= b <= 0xff and 0 <= m <= 0xffffff. It follows from both definitions that a & 255 == b == a % 256.
Additionally, we need:
the distributive property: (a + b) mod n = [(a mod n) + (b mod n)] mod n
the definition of unsigned addition, mathematically: (a + b) ==> (a + b) % (2 ^ 32)
Thus:
(a + (b & 255)) & 255 = ((a + (b & 255)) % (2^32)) & 255 // def'n of addition
= ((a + (b % 256)) % (2^32)) % 256 // lemma
= (a + (b % 256)) % 256 // because 256 divides (2^32)
= ((a % 256) + (b % 256 % 256)) % 256 // Distributive
= ((a % 256) + (b % 256)) % 256 // a mod n mod n = a mod n
= (a + b) % 256 // Distributive again
= (a + b) & 255 // lemma
So yes, it is true. For 32-bit unsigned integers.
What about other integer types?
For 64-bit unsigned integers, all of the above applies just as well, just substituting 2^64 for 2^32.
For 8- and 16-bit unsigned integers, addition involves promotion to int. This int will definitely neither overflow nor be negative in any of these operations, so they all remain valid.
For signed integers, if either a+b or a+(b&255) overflow, it's undefined behavior. So the equality can't hold — there are cases where (a+b)&255 is undefined behavior but (a+(b&255))&255 isn't.
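A minimal sketch of such a case (the values are mine): with a signed 32-bit int, a + b overflows while a + (b & 255) stays in range:

#include <limits.h>
#include <stdio.h>

int main(void) {
    int a = INT_MAX - 300;
    int b = 1000;

    /* a + b would exceed INT_MAX: signed overflow, undefined behavior */
    /* a + (b & 255) is fine: b & 255 == 232, and a + 232 <= INT_MAX   */
    int ok = (a + (b & 255)) & 255;

    printf("%d\n", ok);
    return 0;
}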
In positional addition, subtraction and multiplication of unsigned numbers to produce unsigned results, more significant digits of the input don't affect less-significant digits of the result. This applies to binary arithmetic as much as it does to decimal arithmetic. It also applies to "twos complement" signed arithmetic, but not to sign-magnitude signed arithmetic.
However we have to be careful when taking rules from binary arithmetic and applying them to C (I believe C++ has the same rules as C on this stuff but I'm not 100% sure) because C arithmetic has some arcane rules that can trip us up. Unsigned arithmetic in C follows simple binary wraparound rules but signed arithmetic overflow is undefined behaviour. Worse, under some circumstances C will automatically "promote" an unsigned type to (signed) int.
Undefined behaviour in C can be especially insidious. A dumb compiler (or a compiler on a low optimisation level) is likely to do what you expect based on your understanding of binary arithmetic while an optimising compiler may break your code in strange ways.
So getting back to the formula in the question, the equivalence depends on the operand types.
If they are unsigned integers whose size is greater than or equal to the size of int then the overflow behaviour of the addition operator is well-defined as simple binary wraparound. Whether or not we mask off the high 24 bits of one operand before the addition operation has no impact on the low bits of the result.
If they are unsigned integers whose size is less than int then they will be promoted to (signed) int. Overflow of signed integers is undefined behaviour but at least on every platform I have encountered the difference in size between different integer types is large enough that a single addition of two promoted values will not cause overflow. So again we can fall back to the simple binary arithmetic argument to deem the statements equivalent.
If they are signed integers whose size is less than int then again overflow can't happen and on twos-complement implementations we can rely on the standard binary arithmetic argument to say they are equivalent. On sign-magnitude or ones complement implementations they would not be equivalent.
OTOH if a and b were signed integers whose size was greater than or equal to the size of int then even on twos complement implementations there are cases where one statement would be well-defined while the other would be undefined behaviour.
Yes, (a + b) & 255 is fine.
Remember addition in school? You add numbers digit by digit, and add a carry value to the next column of digits. There is no way for a later (more significant) column of digits to influence an already processed column. Because of this, it does not make a difference if you zero-out the digits only in the result, or also first in an argument.
The above is not always true; the C++ standard allows an implementation that would break this.
Such a Deathstation 9000 :-) would have to use a 33-bit int, if the OP meant unsigned short with "32-bit unsigned integers". If unsigned int was meant, the DS9K would have to use a 32-bit int, and a 32-bit unsigned int with a padding bit. (The unsigned integers are required to have the same size as their signed counterparts as per §3.9.1/3, and padding bits are allowed in §3.9.1/1.) Other combinations of sizes and padding bits would work too.
As far as I can tell, this is the only way to break it, because:
The integer representation must use a "purely binary" encoding scheme (§3.9.1/7 and the footnote); all bits except padding bits and the sign bit must contribute a value of 2^n
int promotion is allowed only if int can represent all the values of the source type (§4.5/1), so int must have at least 32 bits contributing to the value, plus a sign bit.
the int cannot have more value bits (not counting the sign bit) than 32, because otherwise an addition could not overflow.
You already have the smart answer: unsigned arithmetic is modulo arithmetic and therefore the results will hold; you can prove it mathematically...
One cool thing about computers, though, is that computers are fast. Indeed, they are so fast that enumerating all valid combinations of 32 bits is possible in a reasonable amount of time (don't try with 64 bits).
So, in your case, I personally like to just throw it at a computer; it takes me less time to convince myself that the program is correct than it takes to convince myself that the mathematical proof is correct and that I didn't overlook a detail in the specification[1]:
#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t const MAX = std::uint64_t(1) << 32;
    for (std::uint64_t i = 0; i < MAX; ++i) {
        for (std::uint64_t j = 0; j < MAX; ++j) {
            std::uint32_t const a = static_cast<std::uint32_t>(i);
            std::uint32_t const b = static_cast<std::uint32_t>(j);
            auto const champion = (a + (b & 255)) & 255;
            auto const challenger = (a + b) & 255;
            if (champion == challenger) { continue; }
            std::cout << "a: " << a << ", b: " << b
                      << ", champion: " << champion
                      << ", challenger: " << challenger << "\n";
            return 1;
        }
    }
    std::cout << "Equality holds\n";
    return 0;
}
This enumerates all possible values of a and b in the 32-bit space and checks whether the equality holds, or not. If it does not, it prints the case which didn't work, which you can use as a sanity check.
And, according to Clang: Equality holds.
Furthermore, given that the arithmetic rules are bit-width agnostic (above int bit-width), this equality will hold for any unsigned integer type of 32 bits or more, including 64 bits and 128 bits.
Note: How can a compiler enumerate all 64-bit patterns in a reasonable time frame? It cannot. The loops were optimized out. Otherwise we would all have died before execution terminated.
I initially only proved it for 16-bit unsigned integers; unfortunately C++ is an insane language where small integers (smaller bit-widths than int) are first converted to int.
#include <cstdint>
#include <iostream>

int main() {
    unsigned const MAX = 65536;
    for (unsigned i = 0; i < MAX; ++i) {
        for (unsigned j = 0; j < MAX; ++j) {
            std::uint16_t const a = static_cast<std::uint16_t>(i);
            std::uint16_t const b = static_cast<std::uint16_t>(j);
            auto const champion = (a + (b & 255)) & 255;
            auto const challenger = (a + b) & 255;
            if (champion == challenger) { continue; }
            std::cout << "a: " << a << ", b: " << b << ", champion: "
                      << champion << ", challenger: " << challenger << "\n";
            return 1;
        }
    }
    std::cout << "Equality holds\n";
    return 0;
}
And once again, according to Clang: Equality holds.
Well, there you go :)
[1] Of course, if a program ever inadvertently triggers Undefined Behavior, it would not prove much.
The quick answer is: both expressions are equivalent.
Since a and b are 32-bit unsigned integers, the result is the same even in case of overflow; unsigned arithmetic guarantees this: a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
The long answer is: there are no known platforms where these expressions would differ, but the Standard does not guarantee it, because of the rules of integral promotion.
If the type of a and b (unsigned 32-bit integers) has rank greater than or equal to that of int, the computation is performed as unsigned, modulo 2^32, and it yields the same defined result for both expressions for all values of a and b.
Conversely, if the type of a and b is smaller than int, both are promoted to int and the computation is performed using signed arithmetic, where overflow invokes undefined behavior.
If int has at least 33 value bits, neither of the above expressions can overflow, so the result is perfectly defined and has the same value for both expressions.
If int has exactly 32 value bits, the computation can overflow for both expressions, for example values a=0xFFFFFFFF and b=1 would cause an overflow in both expressions. In order to avoid this, you would need to write ((a & 255) + (b & 255)) & 255.
The good news is there are no such platforms[1].
[1] More precisely, no such real platform exists, but one could configure a DS9K to exhibit such behavior and still conform to the C Standard.
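As a minimal sketch of the fully defensive form mentioned in the case analysis above (using the problematic pair from that answer): masking both operands before the addition keeps every intermediate value below 256, so no promotion-related signed overflow is possible:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 0xFFFFFFFFu, b = 1u;

    /* both operands reduced first: the sum is at most 255 + 255 */
    uint32_t low8 = ((a & 255u) + (b & 255u)) & 255u;

    printf("%u\n", low8);   /* prints 0 */
    return 0;
}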
Identical, assuming no overflow. Neither version is truly immune to overflow, but the double-AND version is more resistant to it. I am not aware of a system where an overflow in this case is a problem, but I can see the author doing this in case there is one.
Yes you can prove it with arithmetic, but there is a more intuitive answer.
When adding, every bit only influences those more significant than itself; never those less significant.
Therefore, whatever you do to the higher bits before the addition won't change the result, as long as you only keep bits less significant than the lowest bit modified.
The proof is trivial and left as an exercise for the reader
But to actually legitimize this as an answer, your first line of code says: take the last 8 bits of b** (all higher bits of b set to zero), add this to a, and then take only the last 8 bits of the result, setting all higher bits to zero.
The second line says add a and b and take the last 8 bits with all higher bits zero.
Only the last 8 bits are significant in the result. Therefore only the last 8 bits are significant in the input(s).
** last 8 bits = 8 LSB
Also it is interesting to note that the output would be equivalent to
unsigned char a = something;
unsigned char b = something;
return (unsigned int)(unsigned char)(a + b);
As above, only the 8 LSB are significant, but the result is an unsigned int with all higher bits zero. The narrowing back to unsigned char wraps a + b modulo 256, producing the expected result.
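As a hedged illustration of the carry argument above (example values are mine, not from the answer): clearing the high bits of one operand before the addition cannot change the low 8 bits of the sum:

#include <stdio.h>

int main(void) {
    unsigned int a = 0x1A3u, b = 0x2B7u;

    unsigned int masked_first = (a + (b & 255u)) & 255u;   /* 0x1A3 + 0xB7 = 0x25A -> 0x5A */
    unsigned int masked_last  = (a + b) & 255u;            /* 0x1A3 + 0x2B7 = 0x45A -> 0x5A */

    printf("%#x %#x\n", masked_first, masked_last);        /* both print 0x5a */
    return 0;
}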

How to check dependencies of floats

I want to determine (in C++) if one float number is the multiplicative inverse of another float number. The problem is that I have to use a third variable to do it. For instance this code:
float x=5,y=0.2;
if(x==(1/y)) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;
will output: "they are not..." which is wrong and this code:
float x=5,y=0.2,z;
z=1/y;
if(x==z) cout<<"They are the multiplicative inverse of eachother"<<endl;
else cout<<"They are NOT the multiplicative inverse of eachother"<<endl;
will output: "they are..." which is right.why is this happening?
The Float Precision Problem
You have two problems here, but both come from the same root.
You can't compare floats precisely. You can't subtract or divide them precisely. You can't compute anything with them precisely. Any operation with them could (and almost always does) bring some error into the result. Even a=0.2f is not a precise operation. The deeper reasons for that are very well explained by the authors of the other answers here. (My thanks and votes to them for that.)
Here comes your first and simpler error: you should never, never, never, never, NEVER use == on them, or its equivalent in any language.
Instead of a==b, use Abs(a-b)<HighestPossibleError.
But this is not the sole problem in your task.
Abs(1/y-x)<HighestPossibleError won't work, either. At least, it won't work often enough. Why?
Let's take the pair x=1000 and y=0.001, and take the "starting" relative error of y as 10^-6.
(Relative error = error/value).
Relative errors add up under multiplication and division.
1/y is about 1000. Its relative error is the same 10^-6. ("1" has no error.)
That makes the absolute error = 1000*10^-6 = 0.001. When you subtract x later, that error will be all that remains. (Absolute errors add up under addition and subtraction, and the error of x is negligibly small.) Surely you are not counting on so large an error; HighestPossibleError would surely be set lower, and your program would throw off a good pair of x, y.
So, the next two rules for float operations: try not to divide a greater value by a lesser one, and God save you from subtracting close values after that.
There are two simple ways to escape this problem.
Find which of x, y has the greater absolute value, divide 1 by the greater one, and only later subtract the lesser one.
If you want to compare 1/y against x, then while you are still working with symbols, not values, and your operations introduce no errors, multiply both sides of the comparison by y,
and you have 1 against x*y. (Usually you should check signs in that operation, but here we use absolute values, so it is clean.) The resulting comparison has no division at all.
In a shorter way:
1/y V x <=> y*(1/y) V x*y <=> 1 V x*y
We already know that a comparison such as 1 against x*y should be done like this:
const float HighestPossibleError=1e-10;
if(Abs(x*y-1.0)<HighestPossibleError){...
That is all.
P.S. If you really need it all on one line, use:
if(Abs(x*y-1.0)<1e-10){...
But it is bad style. I wouldn't advise it.
P.P.S. In your second example the compiler optimizes the code so that it sets z to 5 before running any code. So checking 5 against 5 works even for floats.
The problem is that 0.2 cannot be represented exactly in binary, because its binary expansion has an infinite number of digits:
1/5: 0.0011001100110011001100110011001100110011...
This is similar to how 1/3 cannot be represented exactly in decimal. Since x is stored in a float which has a finite number of bits, these digits will get cut off at some point, for example:
x: 0.0011001100110011001100110011001
The problem arises because CPUs often use a higher precision internally, so when you've just calculated 1/y, the result will have more digits, and when you load x to compare them, x will get extended to match the internal precision of the CPU.
1/y: 0.0011001100110011001100110011001100110011001100110011
x: 0.0011001100110011001100110011001000000000000000000000
So when you do a direct bit-by-bit comparison, they are different.
In your second example, however, storing the result into a variable means it gets truncated before doing the comparison, so comparing them at this precision, they're equal:
x: 0.0011001100110011001100110011001
z: 0.0011001100110011001100110011001
Many compilers have switches you can enable to force intermediate values to be truncated at every step for consistency; however, the usual advice is to avoid doing direct comparisons between floating-point values and instead check if they differ by less than some epsilon value, which is what Gangnus is suggesting.
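A hedged sketch of the epsilon-based comparison suggested above (the tolerance here is arbitrary and would have to be chosen for the actual magnitudes involved), comparing x*y against 1 as the first answer recommends:

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 5.0f, y = 0.2f;
    const float eps = 1e-6f;   /* arbitrary tolerance for values near 1 */

    if (fabsf(x * y - 1.0f) < eps)
        printf("They are (approximately) multiplicative inverses\n");
    else
        printf("They are NOT multiplicative inverses\n");
    return 0;
}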
You will have to precisely define what it means for two approximations to be multiplicative inverses. Otherwise, you won't know what it is you're supposed to be testing.
0.2 has no exact binary representation. If you store numbers that have no exact representation with limited precision, you won't get answers that are exactly correct.
The same things happens in decimal. For example, 1/3 has no exact decimal representation. You can store it as .333333. But then you have a problem. Are 3 and .333333 multiplicative inverses? If you multiply them, you get .999999. If you want the answer to be "yes" you'll have to create a test for multiplicative inverses that isn't as simple as multiplying and testing for equality to 1.
The same thing happens with binary.
The discussions in other replies are great and so I won't repeat any of them, but there's no code. Here's a little bit of code to actually check if a pair of floats gives exactly 1.0 when multiplied.
The code makes a few assumptions/assertions (which are normally met on the x86 platform):
- floats are 32-bit binary (AKA single precision) IEEE-754
- either ints or longs are 32-bit (I decided not to rely on the availability of uint32_t)
- memcpy() copies floats to ints/longs such that 8873283.0f becomes 0x4B076543 (i.e. certain "endianness" is expected)
One extra assumption is this:
- it receives the actual floats that * would multiply (i.e. multiplication of floats wouldn't use higher precision values that the math hardware/library can use internally)
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <assert.h>

#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]

#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned int uint32;
#else
typedef unsigned long uint32;
#endif
typedef unsigned long long uint64;

C_ASSERT(CHAR_BIT == 8);
C_ASSERT(sizeof(uint32) == 4);
C_ASSERT(sizeof(float) == 4);

int ProductIsOne(float f1, float f2)
{
    uint32 m1, m2;
    int e1, e2, s1, s2;
    int e;
    uint64 m;

    // Make sure floats are 32-bit IEEE-754 and
    // reinterpreted as integers as we expect
    {
        static const float testf = 8873283.0f;
        uint32 testi;
        memcpy(&testi, &testf, sizeof(testf));
        assert(testi == 0x4B076543);
    }

    memcpy(&m1, &f1, sizeof(f1));
    s1 = m1 >= 0x80000000;
    m1 &= 0x7FFFFFFF;
    e1 = m1 >> 23;
    m1 &= 0x7FFFFF;
    if (e1 > 0) m1 |= 0x800000;

    memcpy(&m2, &f2, sizeof(f2));
    s2 = m2 >= 0x80000000;
    m2 &= 0x7FFFFFFF;
    e2 = m2 >> 23;
    m2 &= 0x7FFFFF;
    if (e2 > 0) m2 |= 0x800000;

    if (e1 == 0xFF || e2 == 0xFF || s1 != s2) // Inf, NaN, different signs
        return 0;

    m = (uint64)m1 * m2;
    if (!m || (m & (m - 1))) // not a power of 2
        return 0;

    e = e1 + !e1 - 0x7F - 23 + e2 + !e2 - 0x7F - 23;
    while (m > 1) m >>= 1, e++;

    return e == 0;
}

const float testData[][2] =
{
    { .1f, 10.0f },
    { 0.5f, 2.0f },
    { 0.25f, 2.0f },
    { 4.0f, 0.25f },
    { 0.33333333f, 3.0f },
    { 0.00000762939453125f, 131072.0f }, // 2^-17 * 2^17
    { 1.26765060022822940E30f, 7.88860905221011805E-31f }, // 2^100 * 2^-100
    { 5.87747175411143754E-39f, 1.70141183460469232E38f }, // 2^-127 (denormalized) * 2^127
};

int main(void)
{
    int i;
    for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
        printf("%g * %g %c= 1\n",
               testData[i][0], testData[i][1],
               "!="[ProductIsOne(testData[i][0], testData[i][1])]);
    return 0;
}
Output (see at ideone.com):
0.1 * 10 != 1
0.5 * 2 == 1
0.25 * 2 != 1
4 * 0.25 == 1
0.333333 * 3 != 1
7.62939e-06 * 131072 == 1
1.26765e+30 * 7.88861e-31 == 1
5.87747e-39 * 1.70141e+38 == 1
What is striking is that whatever the rounding rule is, you expect the outcome of the two versions to be the same (either twice wrong or twice right)!
Most probably, in the first case a promotion to higher accuracy in the FPU registers takes place when evaluating x==1/y, whereas z= 1/y really stores the single-precision result.
Other contributors have explained why 5==1/0.2 can fail, so I needn't repeat that.