Algebraic reductions of signed integer expressions in C/C++ - c++

I wanted to see if GCC would reduce a - (b - c) to (a + c) - b for signed and unsigned integers, so I created two tests:
//test1.c
unsigned fooau(unsigned a, unsigned b, unsigned c) { return a - (b - c); }
signed fooas(signed a, signed b, signed c) { return a - (b - c); }
signed fooms(signed a) { return a*a*a*a*a*a; }
unsigned foomu(unsigned a) { return a*a*a*a*a*a; }
//test2.c
unsigned fooau(unsigned a, unsigned b, unsigned c) { return (a + c) - b; }
signed fooas(signed a, signed b, signed c) { return (a + c) - b; }
signed fooms(signed a) { return (a*a*a)*(a*a*a); }
unsigned foomu(unsigned a) { return (a*a*a)*(a*a*a); }
I first compiled with gcc -O3 test1.c test2.c -S and looked at the assembly. Between the two tests, fooau was identical; fooas, however, was not.
As far as I understand, unsigned arithmetic can be derived from the following formula
(a%n + b%n)%n = (a+b)%n
which can be used to show that unsigned addition is associative. But since signed overflow is undefined behavior, this equality does not necessarily hold for signed addition (i.e. signed addition is not associative), which explains why GCC did not reduce a - (b - c) to (a + c) - b for signed integers. We can, however, tell GCC to use this formula with -fwrapv; with that option, fooas is identical for both tests.
But what about multiplication? For both tests, fooms and foomu were simplified to three multiplications (a*a*a*a*a*a became (a*a*a)*(a*a*a)). But multiplication can be written as repeated addition, so using the formula above I think it can be shown that
((a%n)*(b%n))%n = (a*b)%n
which I think also shows that unsigned modular multiplication is associative. But since GCC used only three multiplications for fooms as well, this shows that GCC assumes signed integer multiplication is associative.
This seems like a contradiction to me: for addition, signed arithmetic was not treated as associative, but for multiplication it was.
Two questions:
Is it true that addition is not associative with signed integers but multiplication is in C/C++?
If signed overflow is exploited for optimization, isn't GCC's refusal to reduce the algebraic expression itself a failure to optimize? Wouldn't it be better for optimization to use -fwrapv (I understand that a - (b - c) to (a + c) - b is not much of a reduction, but I'm worried about more complicated cases)? Does this mean that for optimization purposes -fwrapv is sometimes more efficient and sometimes not?

No, multiplication is not associative in signed integers. Consider (0 * x) * x vs. 0 * (x * x) - the latter has potentially undefined behavior while the former is always defined.
The potential for undefined behavior only ever introduces new optimization opportunities; the classic example is optimizing x + 1 > x to true for signed x, an optimization that is not available for unsigned integers.
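For illustration, a minimal pair of functions for that example (the names are made up; with gcc -O2 the signed version is typically folded to a constant, while the unsigned one must keep the comparison):
int cmp_signed(int x)        { return x + 1 > x; }   /* compiler may assume no overflow: usually becomes "return 1" */
int cmp_unsigned(unsigned x) { return x + 1 > x; }   /* wrap-around is defined: false when x == UINT_MAX */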
I don't think you can assume that gcc failing to change a - (b - c) to (a + c) - b represents a missed optimization opportunity; the two calculations compile to the same two instructions on x86-64 (leal and subl), just in a different order.
Indeed, the implementation is entitled to assume that arithmetic is associative, and use that for optimizations, since anything can happen on UB including modulo arithmetic or infinite-range arithmetic. However, you as the programmer are not entitled to assume associativity unless you can guarantee that no intermediate result overflows.
As another example, try (a + a) - a: gcc will optimize this to a for signed a as well as for unsigned.

Algebraic reduction of signed integer expressions can be performed provided it has the same result for any defined set of inputs. So if the expression
a * a * a * a * a * a
is defined -- that is, a is small enough that no signed overflow occurs during the computation -- then any regrouping of the multiplications will produce the same value, because no product of fewer than six copies of a can overflow if the full product does not.
The same would be true for a + a + a + a + a + a.
Things change if the variables multiplied (or added) are not all the same, or if the additions are intermingled with subtractions. In those cases, regrouping and rearranging the computation could lead to a signed overflow which did not occur in the canonical computation.
For example, take the expression
a - (b - c)
Algebraically, that's equivalent to
(a + c) - b
But the compiler can not do that rearrangement because it is possible that the intermediate value a+c will overflow with inputs which would not cause an overflow in the original. Suppose we had a=INT_MAX-1; b=1; c=2; then a+c results in an overflow, but a - (b - c) is computed as a - (-1), which is INT_MAX, without overflow.
If the compiler can assume that signed overflow does not trap but instead wraps modulo 2^N (where N is the width of the type), then these rearrangements are possible. The -fwrapv option allows gcc to make that assumption.

Related

Can integer division in C/C++ run into loss of precision issues?

Suppose we have three integer (int, long, long long, unsigned int, etc) variables a, b, c. Normally, performing
c = a / b;
would truncate any fractional part. However, is it possible for c to end up with an incorrect value?
I am not talking about a / b possibly being out of range for c's type. Rather, I am talking about how integer division is implemented in C. Does performing a / b first generate a floating-point intermediate result, which is then truncated?
If so, I wonder if loss of precision of the intermediate value can lead to an incorrect value of c. For example, suppose the precise value for a / b is 2, but somehow the intermediate result is 1.9999..., thus c will end up with an incorrect value of 1. Can such cases happen, or does integer division always result in a correct value if the expected value is in the range of c's type?
Does performing a / b first generate a float type intermediate result
As far as the language is concerned, there are no intermediate results.
does integer division always result in a correct value if the expected value is in the range of c's type?
Yes.
Section 6.5.5 of the C11 standard states
When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded. If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a;
Which means there's no way, mathematically, that you'll get wrong results.
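As a quick sanity check of that guarantee (a throwaway sketch; it holds for any b != 0 whose quotient is representable, so e.g. not INT_MIN / -1):
#include <assert.h>

static void check(int a, int b)
{
    assert((a / b) * b + a % b == a);   /* the identity quoted from 6.5.5 */
}

int main(void)
{
    check(7, 2);    /*  3 * 2 +  1 ==  7 */
    check(-7, 2);   /* -3 * 2 + -1 == -7 (truncation toward zero) */
    check(7, -2);   /* -3 * -2 + 1 ==  7 */
    return 0;
}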
Suppose we have three integer (int, long, long long, unsigned int, etc) variables a, b, c. Normally, performing
c = a / b;
would truncate any fractional part. However, is it possible for c to end up with an incorrect value? I am not talking about a / b possibly being out of range for c's type.
It should not be possible for, say, the last digit of the division to be wrong, provided all the other rules are followed. C11 6.5.5p6:
When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded.
i.e. the result is not merely "close to" but exactly the algebraic value of a / b, with everything after the decimal point discarded.
That does not mean there won't be any gotchas: the value of a / b might be within range for c's type yet out of range for the type in which the division itself is performed, which can cause a wrong value to be stored in c.
Consider this example:
#include <stdio.h>
#include <inttypes.h>
int main(void) {
    int32_t a = INT32_MIN;
    int32_t b = -1;
    int64_t c = a / b;
    printf("%" PRId64, c);
}
The result of the division INT32_MIN / -1 is representable in c: it is INT32_MAX + 1, which is positive. However, on platforms where int is 32 bits the arithmetic happens in 32 bits, and this division produces an integer overflow, which makes the behaviour undefined. What happens on my computer is that if I compile without optimizations the program aborts. If I compile with optimizations enabled (-O3), the compiler resolves the calculation at compile time, handles the overflow in its own peculiar way, and produces the result -2147483648, which is negative.
Likewise, if you do this:
uint16_t a = 16;
int16_t b = -1;
int32_t result = a / b;
printf("%" PRId32 "\n", result);
the result on a 32-bit int machine is -16. If you change the type of a to uint32_t the math happens in unsigned:
uint32_t a = 16;
int16_t b = -1;
int32_t result = a / b;
printf("%" PRId32 "\n", result);
The result is of course 0. And you would get 0 from the former calculation too on a machine where int is 16 bits.

Safest and most efficient way to compute an integer operation that may overflow

Suppose we have 2 constants A & B and a variable i, all 64 bits integers. And we want to compute a simple common arithmetic operation such as:
i * A / B (1)
To simplify the problem, let's assume that variable i is always in the range [INT64_MIN*B/A, INT64_MAX*B/A], so that the final result of the arithmetic operation (1) does not overflow (i.e.: fits in the range [INT64_MIN, INT64_MAX]).
In addition, i is assumed to be more likely in the friendly range Range1 = [INT64_MIN/A, INT64_MAX/A] (i.e.: close to 0), however i may be (less likely) outside this range. In the first case, a trivial integer computation of i * A would not overflow (that's why we called the range friendly); and in the latter case, a trivial integer computation of i * A would overflow, leading to an erroneous result in computation of (1).
What would be the "safest" and "most efficient" way to compute operation (1) (where "safest" means preserving exactness, or at least decent precision, and "most efficient" means lowest average computation time), provided i is more likely to be in the friendly range Range1?
The solution currently implemented in the code is the following:
(int64_t)((double)A / B * i)
This solution is quite safe (no overflow) though inaccurate (precision loss due to the 53-bit double significand), and quite fast because the double division (double)A / B is precomputed at compile time, leaving only a double multiplication to be computed at runtime.
If you cannot get better bounds on the ranges involved then you're best off following iammilind's advice to use __int128.
The reason is that otherwise you would have to implement the full logic of word-to-double-word multiplication and double-word-by-word division. The Intel and AMD processor manuals contain helpful information and ready-made code, but it gets quite involved, and doing it in C/C++ instead of assembler makes things doubly complicated.
All good compilers expose useful primitives as intrinsics. Microsoft's list doesn't seem to include a muldiv-like primitive, but the __mul128 intrinsic gives you the two halves of the 128-bit product as two 64-bit integers. Based on that you can perform long division of two digits by one digit, where one 'digit' would be a 64-bit integer (usually called a 'limb' because it is bigger than a digit but still only part of the whole). Still quite involved, but a lot better than doing it in pure C/C++. However, portability-wise it is no better than using __int128 directly; at least that way the compiler implementers have already done all the hard work for you.
If your application domain can give you useful bounds, such as a guarantee that (u % d) * v will not overflow, then you can use the identity
(u * v) / d = (u / d) * v + ((u % d) * v) / d
where / signifies integer division, as long as u is non-negative and d is positive (otherwise you might run afoul of the leeway allowed for the semantics of operator %).
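A minimal sketch of that identity for unsigned 64-bit operands, assuming the caller has already established that (u % d) * v cannot overflow and that the true quotient fits in uint64_t (muldiv_u64 is a name made up for illustration):
#include <stdint.h>

static uint64_t muldiv_u64(uint64_t u, uint64_t v, uint64_t d)
{
    /* (u / d) * v never exceeds the final quotient, so it cannot overflow
       if the result itself is representable */
    return (u / d) * v + ((u % d) * v) / d;
}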
In any case you may have to separate out the signs of the operands and use unsigned operations in order to find more useful mechanisms that you can exploit, or to circumvent sabotage by the compiler, like the saturating multiplication that you mentioned. Overflow of signed integer operations invokes undefined behaviour and compilers are free to do whatever they please; by contrast, overflow for unsigned types is well-defined.
Also, with unsigned types you can fall back on rules such as: with s = a (+) b (where (+) is possibly-overflowing unsigned addition) you will have either s == a + b or s < a && s < b, which lets you detect overflow after the fact with cheap operations.
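For example, a small sketch of that after-the-fact check for unsigned 64-bit addition (add_overflows is a made-up helper name):
#include <stdbool.h>
#include <stdint.h>

static bool add_overflows(uint64_t a, uint64_t b, uint64_t *sum)
{
    *sum = a + b;      /* unsigned addition wraps, which is well defined */
    return *sum < a;   /* the sum wrapped iff it is smaller than an operand */
}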
However, it is unlikely that you will get much farther on this road because the effort required quickly approaches - or even exceeds - the effort of implementing the double-limb operations I alluded to earlier. Only a thorough analysis of the application domain could give the information required for planning/deploying such shortcuts. In the general case and with the bounds you have given you're pretty much out of luck.
In order to provide a quantified answer to the question, I benchmarked several of the solutions proposed in this post (thanks to the comments and answers).
The benchmark measures computation time of different implementations, when i is inside the friendly range Range1 = [INT64_MIN/A, INT64_MAX/A], and when i is outside the friendly range (yet in the safe range Range2 = [INT64_MIN*B/A, INT64_MAX*B/A]).
Each implementation performs a "safe" (i.e. overflow-free) computation of the operation i * A / B (except the first implementation, which is given as the reference computation time). However, some implementations may return infrequent inaccurate results (this behavior is noted).
Some of the proposed solutions have not been tested or are not listed below: the solution using __int128 (unsupported by the MS VC compiler; boost's int128_t has been used instead), solutions using the extended 80-bit long double (unsupported by the MS VC compiler), and the solution using InfInt (working and tested, though too slow to be a decent competitor).
Time measurements are given in ps/op (picoseconds per operation). The benchmark platform is an Intel Q6600 @ 3GHz under Windows 7 x64, with the executable compiled with MS VC14, x64/Release target. The variables, constants and function referenced below are defined as:
int64_t i;
const int64_t A = 1234567891;
const int64_t B = 4321987;
inline bool in_safe_range(int64_t i) { return (INT64_MIN/A <= i) && (i <= INT64_MAX/A); }
(1) (i * A / B) [reference]
i in Range1: 1469 ps/op; i outside Range1: irrelevant (overflows)
(2) ((int64_t)((double)i * A / B))
i in Range1: 10613 ps/op; i outside Range1: 10606 ps/op
Note: infrequent inaccurate result (max error = 1 bit) in the whole range Range2
(3) ((int64_t)((double)A / B * i))
i in Range1: 1073 ps/op; i outside Range1: 1071 ps/op
Note: infrequent inaccurate result (max error = 1 bit) in the whole range Range2
Note: the compiler likely precomputes (double)A / B, which explains the performance boost over the previous solution.
(4) (!in_safe_range(i) ? (int64_t)((double)A / B * i) : (i * A / B))
i in Range1: 2009 ps/op; i outside Range1: 1606 ps/op
Note: rare inaccurate result (max error = 1 bit) outside Range1
(5) ((int64_t)((int128_t)i * A / B)) [boost int128_t]
i in Range1: 89924 ps/op; i outside Range1: 89289 ps/op
Note: boost int128_t performs dramatically badly on the bench platform (no idea why)
(6) ((i / B) * A + ((i % B) * A) / B)
i in Range1: 5876 ps/op; i outside Range1: 5879 ps/op
(7) (!in_safe_range(i) ? ((i / B) * A + ((i % B) * A) / B) : (i * A / B))
i in Range1: 1999 ps/op; i outside Range1: 6135 ps/op
Conclusion
a) If slight computation errors are acceptable in the whole range Range2, then solution (3) is the fastest one, even faster than the direct integer computation given as reference.
b) If computation errors are unacceptable in the friendly range Range1, yet acceptable outside this range, then solution (4) is the fastest one.
c) If computation errors are unacceptable in the whole range Range2, then solution (7) performs as well as solution (4) in the friendly range Range1, and remains decently fast outside this range.
I think you can detect the overflow before it happens. In your case of i * A / B, you only need to worry about the i * A part, because the division cannot overflow.
You can detect the overflow by testing bool overflow = i > INT64_MAX / A. You will have to modify this depending on the signs of the operands and the result.
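A hedged sketch of such a pre-check, assuming A > 0 (as with the example constants in the question); mul_would_overflow is a made-up name:
#include <stdbool.h>
#include <stdint.h>

static bool mul_would_overflow(int64_t i, int64_t A)   /* assumes A > 0 */
{
    if (i > 0) return i > INT64_MAX / A;   /* i * A would exceed INT64_MAX */
    if (i < 0) return i < INT64_MIN / A;   /* i * A would fall below INT64_MIN */
    return false;                          /* i == 0 never overflows */
}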
Some implementations provide __int128_t. Check whether yours does, so that you can use it as a placeholder instead of double. See the post below:
Why isn't there int128_t?
If you are not very concerned about "fast"-ness, then for good portability I would suggest using the header-only C++ library "InfInt".
It is pretty straightforward to use the library. Just create an instance of the InfInt class and start using it:
InfInt myint1 = "15432154865413186646848435184100510168404641560358";
InfInt myint2 = 156341300544608LL;
myint1 *= --myint2 - 3;
std::cout << myint1 << std::endl;
I'm not sure about the value bounds, but will (i / B) * A + (i % B) * A / B help?

Fast branchless max for unsigned integers

I found a trick in the AGGREGATE Magic collection for quickly computing max values. The only problem is that it is written for signed integers; I have tried a few things, but I have no idea how to make a version for unsigned integers.
inline int32_t max(int32_t a, int32_t b)
{
    return a - ((a-b) & (a-b)>>31);
}
Any advice?
EDIT
Do not use this, because, as others have stated, it produces undefined behavior. On any modern architecture the compiler will be able to emit a branchless conditional move instruction from return (a > b) ? a : b, which will be faster than the function in question.
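A sketch of that recommended portable version (the name is made up); with optimization enabled, compilers for current targets typically emit a conditional move (e.g. cmov on x86-64) rather than a branch:
#include <stdint.h>

static inline uint32_t umax_portable(uint32_t a, uint32_t b)
{
    return (a > b) ? a : b;   /* well defined for the full unsigned range */
}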
What does this code do? It takes the value of a and the difference a - b. Of course, a - (a - b) is b. And (a - b) >> 31 simply creates a mask of ones iff a - b is negative.
This code is incorrect exactly when the subtraction overflows. That, however, is the same story as for unsigned integers. So if you are content with the fact that your code is not correct for the entire value range, you can simply ignore the unsignedness and use this:
inline uint32_t umax(uint32_t a, uint32_t b) {
    return (uint32_t)max((int32_t)a, (int32_t)b);
}

Subtracting unsigned long longs with signed long long result?

Suppose I have these two types:
typedef unsigned long long uint64;
typedef signed long long sint64;
And I have these variables:
uint64 a = ...;
uint64 b = ...;
sint64 c;
I want to subtract b from a and assign the result to c. Clearly, if the absolute value of the difference is greater than 2^63 then it will wrap (or be undefined), which is ok. But for cases where the absolute difference is less than 2^63 I want the result to be correct.
Of the following three ways:
c = a - b; // sign conversion warning ignored
c = sint64(a - b);
c = sint64(a) - sint64(b);
Which of them are guaranteed to work by the standard? (and why/how?)
None of the three work. The first fails if the difference is negative (no matter the absolute value), the second is the same as the first, and the third fails if either operand is too large.
It's impossible to implement without a branch.
c = b < a? a - b : - static_cast< sint64 >( b - a );
Fundamentally, unsigned types use modulo arithmetic without any kind of sign bit. They don't know they wrapped around, and the language spec doesn't identify wraparound with negative numbers. Also, assigning a value outside the range of a signed integral variable results in an implementation-defined, potentially nonsense result (integral overflow).
Consider a machine with no hardware to convert between native negative integers and two's complement. It can perform two's complement subtraction using bitwise negation and native two's complement addition, though. (Bizarre, maybe, but that is what C and C++ currently require.) The language leaves it up to the programmer, then, to convert the negative values. The only way to do that is to negate a positive value, which requires that the computed difference be positive. So…
The best solution is to avoid any attempt to represent a negative number as a large positive number in the first place.
EDIT: I forgot the cast before, which would have produced a large unsigned value, equivalent to the other solutions!
Potatoswatter's answer is probably the most pragmatic solution, but "impossible to implement without a branch" is like a red rag to a bull for me. If your hypothetical system implements undefined overflow/cast operations like that, my hypothetical system implements branches by killing puppies.
So I'm not completely familiar with what the standard(s) would say, but how about this:
sint64 c, d, r;
c = a >> 1;        /* a / 2 */
d = b >> 1;        /* b / 2 */
r = (c - d) * 2;   /* 2 * (a/2 - b/2) */
c = a & 1;         /* low bit of a */
d = b & 1;         /* low bit of b */
r += c - d;        /* add back the low-bit difference */
I've written it in a fairly verbose fashion so the individual operations are clear, but have left some implicit casts. Is anything there undefined?
Steve Jessop rightly points out that this does fail in the case where the difference is exactly 2^63-1, as the multiply overflows before the 1 is subtracted.
So here's an even uglier version which should cover all underflow/overflow conditions:
sint64 c,d,r,ov;
c = a >> 1;
d = b >> 1;
ov = a >> 63;
r = (c-d-ov) * 2;
c = a & 1;
d = b & 1;
r += ov + ov + c - d;
if the absolute value of the difference is greater than 2^63 then it will wrap (or be undefined) which is ok. But for cases where the absolute difference is less than 2^63 I want the result to be correct.
Then all three of the notations you suggest work, assuming a conventional architecture. The notable difference is that the third one, sint64(a) - sint64(b), invokes undefined behavior when the difference is not representable, whereas the first two are guaranteed to wrap around (unsigned arithmetic overflow is guaranteed to wrap around and conversion from unsigned to signed is implementation-defined, whereas signed arithmetic overflow is undefined).
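A minimal check of the wrap-around behaviour described above, assuming a conventional two's-complement platform (the out-of-range unsigned-to-signed conversion is implementation-defined, so the printed value is what typical implementations give, not a guarantee of the standard):
#include <stdio.h>

typedef unsigned long long uint64;
typedef signed long long sint64;

int main(void)
{
    uint64 a = 5, b = 8;
    sint64 c = (sint64)(a - b);   /* a - b wraps to 2^64 - 3 */
    printf("%lld\n", c);          /* prints -3 on typical platforms */
    return 0;
}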

Is this multiply-divide function correct?

I'm trying to avoid long longs and integer overflow in some calculations, so I came up with the function below to calculate (a * b) / c (order is important due to truncating integer division).
unsigned muldiv(unsigned a, unsigned b, unsigned c)
{
    return a * (b / c) + (a * (b % c)) / c;
}
Are there any edge cases for which this won't work as expected?
EDITED: This is correct for a superset of the values for which the original obvious logic was correct. It still buys you nothing if c > b, and possibly under other conditions. Perhaps you know something about your values of c, but this may not help as much as you expect. Some combinations of a, b and c will still overflow.
EDIT: Assuming you're avoiding long long for strict C++98 portability reasons, you can get about 52 bits of precision by promoting your unsigned values to doubles that happen to hold integral values and doing the math on those. Using double math may in fact be faster than doing three integer divisions.
This fails on quite a few cases. The most obvious is when a is large, so a * (b % c) overflows. You might try swapping a and b in that case, but that still fails if a, b, and c are all large. Consider a = b = 2^25-1 and c = 2^24 with a 32 bit unsigned. The correct result is 2^26-4, but both a * (b % c) and b * (a % c) will overflow. Even (a % c) * (b % c) would overflow.
By far the easiest way to solve this in general is to have a widening multiply, so you can get the intermediate product in higher precision. If you don't have that, you need to synthesize it out of smaller multiplies and divides, which is pretty much the same thing as implementing your own big-integer library.
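Where a wider type happens to be available (e.g. unsigned long long for 32-bit unsigned operands), that widening multiply is a one-liner; this is only an illustration, since the question explicitly wants to avoid long long (muldiv_wide is a made-up name):
unsigned muldiv_wide(unsigned a, unsigned b, unsigned c)
{
    /* compute the full product in 64 bits, then divide */
    return (unsigned)(((unsigned long long)a * b) / c);
}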
If you can guarantee that c is small enough that (c-1)*(c-1) will not overflow an unsigned, you can use:
unsigned muldiv(unsigned a, unsigned b, unsigned c) {
    return (a/c)*(b/c)*c + (a%c)*(b/c) + (a/c)*(b%c) + (a%c)*(b%c)/c;
}
This will actually give you the "correct" answer for ALL a and b -- that is, the mathematically exact (a * b) / c reduced modulo UINT_MAX + 1.
To avoid overflow you have to pre-divide and then post-multiply by some factor.
The best factor to use is c, as long as one (or both) of a and b is greater than c. This is what Chris Dodd's function does. Its largest intermediate value is (a % c) * (b % c), which, as Chris identifies, is at most (c-1)*(c-1).
If you could have a situation where both a and b are less than c but (a * b) could still overflow (which might be the case when c approaches the limit of the word size), then the best factor to use is a large power of two, turning the multiplies and divides into shifts. Try shifting by half the word size.
Note that using pre-divide and post-multiply is the equivalent of using longer words when you don't have them available. Assuming you don't discard the low-order bits but just add them as another term, you are simply using several words instead of one larger one.
I'll let you fill the code in.
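As a rough illustration of the "several words" idea (not the fill-in the answer leaves to the reader), here is a sketch of a 32x32 -> 64-bit widening multiply built from 16-bit halves; mul32x32 is a made-up name and the limb sizes are an assumption:
#include <stdint.h>

static void mul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* each partial product fits in 32 bits */
    uint32_t p1 = a_lo * b_hi;
    uint32_t p2 = a_hi * b_lo;
    uint32_t p3 = a_hi * b_hi;

    /* carry-propagate the middle 16-bit column */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    *lo = (p0 & 0xFFFFu) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}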