I found a trick in the Aggregate Magic Algorithms for computing max values quickly. The only problem is that it works on signed integers, and although I have tried a few things, I have no idea how to make a version for unsigned integers.
inline int32_t max(int32_t a, int32_t b)
{
    return a - ((a-b) & (a-b)>>31);
}
Any advice?
EDIT
Do not use this: as others have stated, it produces undefined behavior. On any modern architecture the compiler will be able to emit a branchless conditional move instruction from return (a > b) ? a : b, which will be faster than the function in question.
What does this code do? It takes the value of a and the difference a - b. Of course, a - (a - b) is b. And (a - b) >> 31 simply produces a mask of all ones exactly when a - b is negative.
This code is incorrect if you get an overflow on the subtraction. That, however, is the same story as for unsigned integers. So if you are content with the fact that your code is not correct for the entire value range, you can simply ignore the unsignedness and use this:
inline uint32_t umax(uint32_t a, uint32_t b) {
    return (uint32_t)max((int32_t)a, (int32_t)b);
}
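If you do want a max that stays branchless and is well defined over the whole unsigned range, there is also the classic bit-twiddling form (my addition here, not part of the Aggregate Magic trick above):

#include <cstdint>

inline uint32_t umax_branchless(uint32_t a, uint32_t b)
{
    // -(uint32_t)(a < b) is all ones when a < b and all zeros otherwise,
    // so the expression selects b in the first case and a in the second.
    return a ^ ((a ^ b) & -static_cast<uint32_t>(a < b));
}

Whether this actually beats the compiler's own conditional move for (a > b) ? a : b is worth measuring; on modern targets it usually doesn't.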
Will the machine or compiler make use of the property that zero multiplied by any integer is always zero? Consider
int f(int c, int d, int e, int ff)
{
    // some complex calculations here...
}

int foo(int a, int b, int c, int d, int e, int ff)
{
    return (a - b) * f(c, d, e, ff);
}
In run time, if we pass some arguments where a == b, then mathematically there is no need to calculate the result of f() (assuming no undefined or strange behavior). The result must always be 0.
I am not quite sure whether the machine/compiler might use some optimization technique to skip the calculation of f(). Or, asked the other way round, is it guaranteed that f() will always be called no matter what the values of a and b are?
I am tagging this question with both C and C++ to avoid the slight chance that rules differ in C and C++ in this case. If so, please elaborate respectively.
Update: Thanks for the discussion. From what I gather so far, the possible existence of side effects in the function would certainly be a factor to consider. However, I would like to clarify that the helper function f() is not essential to my intent. Code like
int foo(int a, int b, int c, int d, int e, int ff)
{
    return (a - b) * /* some complex expression with c, d, e, ff */;
}
would also qualify. Apologies for not making it clear at the beginning.
Compiler
The compiler usually (meaning most compilers) optimizes arithmetic calculations that consist of constants. For example, i = 1 + 3; will be optimized to i = 4;, but the more complex the calculation, the fewer compilers will be able to optimize it. Compilers usually work recursively on tree structures and search them for possible optimizations. So it makes no difference whether you add 2 or 20 constants, but it does make a difference if the additions are inside a loop. In this case, calling foo(a, a, x, y, z); is a bit less likely to be optimized than calling foo(1, 1, x, y, z);.
If the compiler first inlines small functions and searches for arithmetic optimizations afterwards, then it is quite likely that, if the parameters are known at compile time, the compiler will be able to optimize away all the extra instructions. After all, this is what it boils down to: can the compiler be sure that the result of foo is 0 without running the program?
Two things to note:
Compilers can selectively optimize different things (for gcc, using -O0, -O1, -O2, -O3 and other more specific flags)
Compilers themselves are written programs and not a magic black box. For the compiler to optimize foo, a developer must have written somewhere in there: check whether a subtraction involves the same variable on both sides and, if so, substitute the result with 0. And somewhere near that: check whether a multiplication has a zero operand and substitute the result with 0. All of that at compile time.
For the compiler to "optimize" at run time, the produced assembly would have to check each operand for zero before doing the multiplication. I don't think any compiler does that.
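To make the compile-time side of that concrete, here is a small sketch (g and foo2 are my own names, standing in for the question's f and foo):

static int g(int c, int d, int e) { return c * d + e; }            // stands in for f(): no side effects

static int foo2(int a, int b, int c, int d, int e) { return (a - b) * g(c, d, e); }

int with_constants(void)   { return foo2(1, 1, 7, 8, 9); }          // gcc/clang at -O2 typically fold this to: return 0
int equal_arguments(int x) { return foo2(x, x, 7, 8, 9); }          // (x - x) is 0 and g() has no side effects, so this usually folds too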
Processor
The processor is for the most part a dumb machine that does exactly what it's told. The multiplication is done by hardware that implements some bitwise logic. For the processor to optimize this multiplication, the circuits that do the calculation would also need a part that says: if one of the multiplicands is 0, then the result is 0 and we need not do the multiplication. If someone has designed the processor to do that, then it is optimized. It depends on the implementation, but I think it is quite unlikely.
This would require the compiler to generate branching, which it generally doesn't like to do. If you want the branching, make it explicit:
int f(int c, int d, int e, int ff)
{
    // some complex calculations here...
}

int foo(int a, int b, int c, int d, int e, int ff)
{
    if (a == b) return 0;
    return (a - b) * f(c, d, e, ff);
}
As noted in the comments, f may also have side effects, in which case it is guaranteed not to be "optimized" away.
Unless the compiler can determine, at compile time, that (a - b) will always evaluate to 0, it won't try to add code to perform the evaluation at runtime.
The main reason is what has been discussed in the other answers: the function that provides one of the operands can have side effects, and you don't normally want to prevent them from happening (if you did, you would have to add the check yourself).
The other reason is that the hardware can already help here: on some processors, a multiplication in which one of the operands is 0 takes fewer cycles than a regular one.
Note that this is different from what happens with short-circuit evaluation in conditional expressions: if ((a - b) == 0 || f(c, d, e, ff) == 0). In this case, the first condition may prevent the second one from executing f() at runtime.
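To make that concrete, here is a tiny sketch that relies on the question's f() (the name product_is_zero is mine):

int product_is_zero(int a, int b, int c, int d, int e, int ff)
{
    // || short-circuits: f() is evaluated only when (a - b) is non-zero.
    return (a - b) == 0 || f(c, d, e, ff) == 0;
}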
Consider the following code sample:
#include <iostream>
#include <string>
int main()
{
    std::string str("someString"); // length 10
    int num = -11;
    std::cout << num % str.length() << std::endl;
}
Running this code on http://cpp.sh, I get 5 as a result, while I was expecting it to be -1.
I know that this happens because the type of str.length() is size_t, an implementation-defined unsigned type, and because of the implicit conversions that binary operators apply to their operands, which cause num to be converted from a signed int to an unsigned size_t;
this turns the negative value into a large positive one and messes up the result of the operation.
One could think of addressing the problem with an explicit cast to int:
num % (int)str.length()
This might work, but it's not guaranteed, for instance in the case of a string whose length is larger than the maximum value of int. One could reduce the risk by using a larger type, like long long, but what if size_t is unsigned long long? Same problem.
How would you address this problem in a portable and robust way?
Since C++11, you can just cast the result of length to std::string::difference_type.
To address "But what if the size is too big?":
That won't happen on 64 bit platforms and even if you are on a smaller one: When was the last time you actually had a string that took up more than half of total RAM? Unless you are doing really specific stuff (which you would know), using the difference_type is just fine; quit fighting ghosts.
Alternatively, just use int64_t, that's certainly big enough. (Though maybe looping over one on some 32 bit processors is slower than int32_t, I don't know. Won't matter for that single modulus operation though.)
(Fun fact: Even some prominent committee members consider littering the standard library with unsigned types a mistake, for reference see
this panel at 9:50, 42:40, 1:02:50 )
Pre-C++11, the sign of % with negative operands was implementation-defined; for well-defined behavior, use std::div plus one of the casts described above.
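A minimal sketch of the two casts applied to the example from the question:

#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    std::string str("someString"); // length 10
    int num = -11;

    // C++11: difference_type is a signed type wide enough for any realistic string
    auto r1 = num % static_cast<std::string::difference_type>(str.length());

    // or simply pick a fixed 64-bit signed type
    auto r2 = num % static_cast<std::int64_t>(str.length());

    std::cout << r1 << ' ' << r2 << '\n'; // prints "-1 -1"
}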
We know that
-a % b == -(a % b)
So you could write something like this:
#include <cstdlib> // for std::llabs

template<typename T, typename T2>
constexpr T safeModulo(T a, T2 b)
{
    return (a >= 0 ? 1 : -1) * static_cast<T>(std::llabs(a) % b);
}
This won't overflow in 99.98% of the cases, because consider this
safeModulo(num, str.length());
If std::size_t is implemented as an unsigned long long, then T2 -> unsigned long long and T -> int.
As pointed out in the comments, using std::llabs instead of std::abs is important, because if a is the smallest possible value of int, removing the sign will overflow. Promoting a to a long long just before won't result in this problem, as long long has a larger range of values.
Now std::llabs(a) % b will always result in a value whose magnitude is no larger than that of a, so casting it to int will never overflow/underflow. Even if a gets promoted to an unsigned long long for the % operation, it doesn't matter, because a is already non-negative after std::llabs(a), and so the value is unchanged (i.e. it didn't wrap).
Because of the property stated above, if a is negative, multiply the result with -1 and you get the correct result.
The only case where it results in undefined behavior is when a is std::numeric_limits<long long>::min(), as removing the sign overflows a, resulting in undefined behavior. There is probably another way to implement the function, I'll think about it.
I wanted to see if GCC would reduce a - (b - c) to (a + c) - b with signed and unsigned integers so I created two tests
//test1.c
unsigned fooau(unsigned a, unsigned b, unsigned c) { return a - (b - c); }
signed fooas(signed a, signed b, signed c) { return a - (b - c); }
signed fooms(signed a) { return a*a*a*a*a*a; }
unsigned foomu(unsigned a) { return a*a*a*a*a*a; }
//test2.c
unsigned fooau(unsigned a, unsigned b, unsigned c) { return (a + c) - b; }
signed fooas(signed a, signed b, signed c) { return (a + c) - b; }
signed fooms(signed a) { return (a*a*a)*(a*a*a); }
unsigned foomu(unsigned a) { return (a*a*a)*(a*a*a); }
I compiled with gcc -O3 test1.c test2.c -S and looked at the assembly. fooau was identical for the two tests; fooas, however, was not.
As far as I understand unsigned arithmetic can be derived from the following formula
(a%n + b%n)%n = (a+b)%n
which can be used to show that unsigned addition is associative. But since signed overflow is undefined behavior, this equality does not necessarily hold for signed addition (i.e. signed addition is not associative as far as the compiler is concerned), which explains why GCC did not reduce a - (b - c) to (a + c) - b for signed integers. We can, however, tell GCC to assume this formula by using -fwrapv; with that option, fooas is identical for both tests.
But what about multiplication? For both tests fooms and foomu were simplified to three multiplications (a*a*a*a*a*a to (a*a*a)*(a*a*a)). But multiplication can be written as repeated addition so using the formula above I think it can be shown that
((a%n)*(b%n))%n = (a*b)%n
which I think can also show that unsigned modular multiplication is associative as well. But since GCC used only three multiplications for fooms, this shows that GCC assumes signed integer multiplication is associative.
This seems like a contradiction to me: for addition, signed arithmetic was not treated as associative, but for multiplication it apparently was.
Two questions:
Is it true that addition is not associative with signed integers but multiplication is in C/C++?
If signed overflow is used for optimization, isn't the fact that GCC does not reduce the algebraic expression a failure to optimize? Wouldn't it be better for optimization to use -fwrapv (I understand that a - (b - c) to (a + c) - b is not much of a reduction, but I'm worried about more complicated cases)? Does this mean that for optimization, using -fwrapv is sometimes more efficient and sometimes not?
No, multiplication is not associative in signed integers. Consider (0 * x) * x vs. 0 * (x * x) - the latter has potentially undefined behavior while the former is always defined.
The potential for undefined behavior only ever introduces new optimization opportunities, the classic example being optimizing x + 1 > x to true for signed x, an optimization that is not available for unsigned integers.
I don't think you can assume that gcc failing to change a - (b - c) to (a + c) - b represents a missed optimization opportunity; the two calculations compile to the same two instructions on x86-64 (leal and subl), just in a different order.
Indeed, the implementation is entitled to assume that arithmetic is associative, and use that for optimizations, since anything can happen on UB including modulo arithmetic or infinite-range arithmetic. However, you as the programmer are not entitled to assume associativity unless you can guarantee that no intermediate result overflows.
As another example, try (a + a) - a - gcc will optimize this to a for signed a as well as for unsigned.
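To see the classic example in action, compile something like the following at -O2 and compare the generated code (the function names are mine):

int overflow_check_signed(int x)        { return x + 1 > x; } // gcc/clang typically fold this to "return 1"
int overflow_check_unsigned(unsigned x) { return x + 1 > x; } // the comparison must stay: x + 1 wraps to 0 when x == UINT_MAX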
Algebraic reduction of signed integer expressions can be performed provided it has the same result for any defined set of inputs. So if the expression
a * a * a * a * a * a
is defined -- that is, a is small enough that no signed overflow occurs during the computation -- then any regrouping of the multiplications will produce the same value, because no product of fewer than six a's can overflow if the full product of six does not.
The same would be true for a + a + a + a + a + a.
Things change if the variables multiplied (or added) are not all the same, or if the additions are intermingled with subtractions. In those cases, regrouping and rearranging the computation could lead to a signed overflow which did not occur in the canonical computation.
For example, take the expression
a - (b - c)
Algebraically, that's equivalent to
(a + c) - b
But the compiler can not do that rearrangement because it is possible that the intermediate value a+c will overflow with inputs which would not cause an overflow in the original. Suppose we had a=INT_MAX-1; b=1; c=2; then a+c results in an overflow, but a - (b - c) is computed as a - (-1), which is INT_MAX, without overflow.
If the compiler can assume that signed overflow does not trap but instead wraps around (modulo 2^N, i.e. UINT_MAX + 1 for a 32-bit int), then these rearrangements are possible. The -fwrapv option allows gcc to make that assumption.
In my software I take input values from the user at run time and perform some mathematical operations on them. For simplicity, consider the example below:
int multiply(const int a, const int b)
{
    if (a >= INT_MAX || b >= INT_MAX)
        return 0;
    else
        return a*b;
}
I can check whether the input values are greater than the limits, but how do I check whether the result will be out of limits? It is quite possible that a = INT_MAX - 1 and b = 2. Since the inputs are perfectly valid, the multiplication invokes undefined behavior, which makes my program meaningless. Any code executed after this point may behave unpredictably and eventually crash. So how do I protect my program in such cases?
This really comes down to what you actually want to do in this case.
For a machine where long or long long (or int64_t) is a 64-bit value, and int is a 32-bit value, you could do (I'm assuming long is 64 bit here):
long x = static_cast<long>(a) * b;
if (x > INT_MAX || x < INT_MIN)
    return 0;
else
    return static_cast<int>(x);
By casting one value to long, the other will be converted as well. You can cast both if that makes you happier. The overhead here, above a normal 32-bit multiply, is a couple of clock cycles on modern CPUs, and it's unlikely that you can find a safer solution that is also faster. [You can, in some compilers, add attributes to the if saying that it's unlikely, to encourage branch prediction to "get it right" for the common case of returning x.]
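That hint could look like the following (a sketch; the [[unlikely]] attribute is C++20, __builtin_expect is the GCC/Clang equivalent for older standards, and multiply_hinted is my own name):

#include <climits>

int multiply_hinted(int a, int b)
{
    long x = static_cast<long>(a) * b;          // still assumes long is 64 bit
    if (x > INT_MAX || x < INT_MIN) [[unlikely]]
        return 0;
    return static_cast<int>(x);
}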
Obviously, this won't work for values whose type is already as big as the biggest integer type you can deal with (although you could possibly use floating point, but that may still be a bit dodgy, since the precision of float is not sufficient - it could be done with some "safety margin", e.g. compare against less than LONG_MAX / 2, if you don't need the entire range of integers). The penalty here is a bit worse, though; in particular, transitions between float and integer aren't "pleasant".
Another alternative is to actually test the relevant code with "known invalid values" and confirm that the rest of the code copes with them. Make sure you test this with the relevant compiler settings, as changing the compiler options will change the behaviour. Note that your code then has to deal with "what do we do when 65536 * 100000 is a negative number", which it perhaps didn't expect. Perhaps add something like:
int x = a * b;
if (x < 0) return 0;
[But this only works if you don't expect negative results, of course]
You could also inspect the assembly code generated and understand the architecture of the actual processor [the key here is to understand if "overflow will trap" - which it won't by default in x86, ARM, 68K, 29K. I think MIPS has an option of "trap on overflow"], and determine whether it's likely to cause a problem [1], and add something like
#if (defined(__X86__) || defined(__ARM__))
#error This code needs inspecting for correct behaviour
#endif
return a * b;
One problem with this approach, however, is that even the slightest changes in code, or compiler version may alter the outcome, so it's important to couple this with the testing approach above (and make sure you test the ACTUAL production code, not some hacked up mini-example).
[1] The "undefined behaviour" is undefined to allow C to "work" on processors that have trapping overflows of integer math, as well as the fact that that a * b when it overflows in a signed value is of course hard to determine unless you have a defined math system (two's complement, one's complement, distinct sign bit) - so to avoid "defining" the exact behaviour in these cases, the C standard says "It's undefined". It doesn't mean that it will definitely go bad.
Specifically, for the multiplication of a by b, the mathematically correct way to detect whether it will overflow is to compute the log₂ of both values. If their sum is higher than the log₂ of the highest value representable in the result type, then there is overflow; in other words, the product fits when
log₂(a) + log₂(b) < log₂(UINT_MAX)
The difficulty is computing the log₂ of an integer quickly. There are several bit-twiddling hacks that can be used for that, like counting bits or counting leading zeros (some processors even have instructions for this). This site has several implementations:
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
The simplest implementation could be:
unsigned int log2(unsigned int v)
{
    unsigned int r = 0;
    while (v >>= 1)
        r++;
    return r;
}
In your program you only need to check then
if (log2(a) + log2(b) < MYLOG2UINTMAX)
    return a*b;
else
    printf("Overflow");
The signed case is similar but has to take care of the negative case specifically.
EDIT: My solution is not complete: truncating log₂ to an integer makes the test more severe than necessary. The equation above is only exact if the log₂ function returns a floating point value; in the implementation I limited the value to unsigned integers, which means that some completely valid multiplications get refused. Why? Because log2(UINT_MAX) is truncated:
log₂(UINT_MAX) = log₂(4294967295) ≈ 31.9999999997, truncated to 31.
We therefore have to change the constant we compare against to
#define MYLOG2UINTMAX (CHAR_BIT * sizeof(unsigned int))
Note, though, that with truncated logarithms this relaxed bound admits a borderline case (the two truncated logs summing to exactly MYLOG2UINTMAX - 1) in which the product can still overflow, so that case needs an extra check.
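One way to keep the log₂ idea and still handle that borderline sum exactly (a sketch; ilog2 and mul_fits are my names):

#include <climits>

static unsigned ilog2(unsigned v)             // floor(log2(v)) for v > 0, as in the loop above
{
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
}

static int mul_fits(unsigned a, unsigned b)   // a and b assumed non-zero
{
    const unsigned bits = CHAR_BIT * sizeof(unsigned);
    unsigned s = ilog2(a) + ilog2(b);
    if (s <= bits - 2) return 1;              // a*b < 2^(s+2) <= 2^bits: fits
    if (s >= bits)     return 0;              // a*b >= 2^s >= 2^bits: overflows
    return a <= UINT_MAX / b;                 // s == bits - 1: one division decides
}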
You may try this:
if (b > INT_MAX / a)  // assumes a > 0 and b > 0; check a != 0 before this division
    return 0;         // a*b would overflow (undefined behaviour)
else
    return a*b;
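Where GCC or Clang is available, their overflow-checking builtins avoid the division altogether (compiler-specific, not standard C or C++; the function name is mine):

int multiply_checked(int a, int b)
{
    int result;
    if (__builtin_mul_overflow(a, b, &result)) // returns true when a*b does not fit in int
        return 0;
    return result;                             // exact product, no overflow occurred
}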
I'm trying to avoid long longs and integer overflow in some calculations, so I came up with the function below to calculate (a * b) / c (order is important due to truncating integer division).
unsigned muldiv(unsigned a, unsigned b, unsigned c)
{
    return a * (b / c) + (a * (b % c)) / c;
}
Are there any edge cases for which this won't work as expected?
EDITED: This is correct for a superset of the values for which the original obvious logic was correct. It still buys you nothing if c > b, and possibly under other conditions. Perhaps you know something about your values of c, but this may not help as much as you expect: some combinations of a, b and c will still overflow.
EDIT: Assuming you're avoiding long long for strict C++98 portability reasons, you can get about 52 bits of precision by promoting your unsigned values to doubles (which then hold exact integral values) and doing the math on those. Using double math may in fact be faster than doing three integral divisions.
This fails on quite a few cases. The most obvious is when a is large, so a * (b % c) overflows. You might try swapping a and b in that case, but that still fails if a, b, and c are all large. Consider a = b = 2^25-1 and c = 2^24 with a 32 bit unsigned. The correct result is 2^26-4, but both a * (b % c) and b * (a % c) will overflow. Even (a % c) * (b % c) would overflow.
By far the easiest way to solve this in general is to have a widening multiply, so you can get the intermediate product in higher precision. If you don't have that, you need to synthesize it out of smaller multiplies and divides, which is pretty much the same thing as implementing your own big-integer library.
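For reference, the widening version described above is trivial when a 64-bit type is acceptable (the question wants to avoid long long, so treat this only as the baseline; muldiv_wide is my name):

unsigned muldiv_wide(unsigned a, unsigned b, unsigned c)
{
    // The 64-bit intermediate product cannot overflow when unsigned is 32 bits wide;
    // the final cast assumes, like the original function, that the quotient itself fits.
    return static_cast<unsigned>(static_cast<unsigned long long>(a) * b / c);
}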
If you can guarantee that c is small enough that (c-1)*(c-1) will not overflow an unsigned, you could use:
unsigned muldiv(unsigned a, unsigned b, unsigned c) {
    return (a/c)*(b/c)*c + (a%c)*(b/c) + (a/c)*(b%c) + (a%c)*(b%c)/c;
}
This will actually give you the "correct" answer for ALL a and b -- (a * b)/c % (UINT_MAX+1)
To avoid overflow you have to pre-divide and then post-multiply by some factor.
The best factor to use is c, as long as one (or both) of a and b is greater than c. This is what Chris Dodd's function does. It has a greatest intermediate of ((a % c) * (b % c)), which, as Chris identifies, is less than or equal to ((c-1)*(c-1)).
If you could have a situation where both a and b are less than c, but (a * b) could still overflow, (which might be the case when c approaches the limit of the word size) then the best factor to use is a large power of two, to turn the multiply and divides into shifts. Try shifting by half the word size.
Note that using pre-divide and then post-multiplying is the equivalent of using longer words when you don't have longer words available. Assuming you don't discard the low order bits but just add them as another term, then you are just using several words instead of one larger one.
I'll let you fill the code in.