How does this float square root approximation work?

I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain to me why this code works?
float sqrt(float f)
{
    const int result = 0x1fbb4000 + (*(int*)&f >> 1);
    return *(float*)&result;
}
I've tested it a bit and its output differs from std::sqrt() by about 1 to 3%. I know of Quake III's fast inverse square root and I guess it's something similar here (without the Newton iteration), but I'd really appreciate an explanation of how it works.
(Note: I've tagged it both c and c++ since it's valid-ish (see comments) in both C and C++.)

(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost divides the exponent by two, which is approximately equivalent to taking the square root. [1]
Why almost? In IEEE-754, the actual exponent is e - 127. [2] To divide this by two, we'd need e/2 - 64, but the above approximation only gives us e/2 - 127. So we need to add on 63 to the resulting exponent. This is contributed by bits 30-23 of that magic constant (0x1fbb4000).
I'd imagine the remaining bits of the magic constant have been chosen to minimise the maximum error across the mantissa range, or something like that. However, it's unclear whether it was determined analytically, iteratively, or heuristically.
It's worth pointing out that this approach is somewhat non-portable. It makes (at least) the following assumptions:
The platform uses single-precision IEEE-754 for float.
The endianness of float representation.
That you will be unaffected by undefined behaviour due to the fact this approach violates C/C++'s strict-aliasing rules.
Thus it should be avoided unless you're certain that it gives predictable behaviour on your platform (and indeed, that it provides a useful speedup vs. sqrtf!).
[1] sqrt(a^b) = (a^b)^0.5 = a^(b/2)
[2] See e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding
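The exponent arithmetic described in this answer can be checked directly. The following is a minimal sketch, assuming IEEE-754 single precision and using memcpy() instead of the pointer cast; the helper names to_bits, from_bits, exponent_field and approx_sqrt are made up for illustration and do not appear in the question.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559, "IEEE-754 float assumed");
static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float assumed");

// Copies the object representation of a float into a 32-bit integer.
inline std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Copies a 32-bit pattern back into a float.
inline float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Extracts the biased exponent field (bits 30-23) of a bit pattern.
inline unsigned exponent_field(std::uint32_t bits) {
    return (bits >> 23) & 0xFFu;
}

// The approximation from the question; only meaningful for
// non-negative, normal inputs.
inline float approx_sqrt(float f) {
    return from_bits(0x1fbb4000u + (to_bits(f) >> 1));
}
```

With these helpers, exponent_field(0x1fbb4000u) is 63, the correction discussed above. For f = 4.0f (bit pattern 0x40800000) the shift gives 0x20400000, and adding the constant gives 0x3ffb4000, which is roughly 1.963 as a float.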

See Oliver Charlesworth’s explanation of why this almost works. I’m addressing an issue raised in the comments.
Since several people have pointed out the non-portability of this, here are some ways you can make it more portable, or at least make the compiler tell you if it won’t work.
First, C++ allows you to check std::numeric_limits<float>::is_iec559 at compile time, such as in a static_assert. You can also check that sizeof(int) == sizeof(float), which will not be true if int is 64-bits, but what you really want to do is use uint32_t, which if it exists will always be exactly 32 bits wide, will have well-defined behavior with shifts and overflow, and will cause a compilation error if your weird architecture has no such integral type. Either way, you should also static_assert() that the types have the same size. Static assertions have no run-time cost and you should always check your preconditions this way if possible.
Unfortunately, the test of whether converting the bits in a float to a uint32_t and shifting is big-endian, little-endian or neither cannot be computed as a compile-time constant expression. Here, I put the run-time check in the part of the code that depends on it, but you might want to put it in the initialization and do it once. In practice, both gcc and clang can optimize this test away at compile time.
You do not want to use the unsafe pointer cast, and there are some systems I’ve worked on in the real world where that could crash the program with a bus error. The maximally-portable way to convert object representations is with memcpy(). In my example below, I type-pun with a union, which works on any actually-existing implementation. (Language lawyers object to it, but no successful compiler will ever break that much legacy code silently.) If you must do a pointer conversion (see below) there is alignas(). But however you do it, the result will be implementation-defined, which is why we check the result of converting and shifting a test value.
Anyway, not that you’re likely to use it on a modern CPU, here’s a gussied-up C++14 version that checks those non-portable assumptions:
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>
using std::cout;
using std::endl;
using std::size_t;
using std::sqrt;
using std::uint32_t;
template <typename T, typename U>
inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T. Cannot be constexpr
* in C++14 because it reads an inactive union member.
*/
{
static_assert( sizeof(T)==sizeof(U), "" );
union tu_pun {
U u = U();
T t;
};
const tu_pun pun{x};
return pun.t;
}
constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;
const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
const bool is_little_endian = after_rshift == target;
float est_sqrt(const float x)
/* A fast approximation of sqrt(x) that works less well for subnormal numbers.
*/
{
static_assert( std::numeric_limits<float>::is_iec559, "" );
assert(is_little_endian); // Could provide alternative big-endian code.
/* The algorithm relies on the bit representation of normal IEEE floats, so
* a subnormal number as input might be considered a domain error as well?
*/
if ( std::isless(x, 0.0F) || !std::isfinite(x) )
return std::numeric_limits<float>::signaling_NaN();
constexpr uint32_t magic_number = 0x1fbb4000UL;
const uint32_t raw_bits = reinterpret<uint32_t,float>(x);
const uint32_t rejiggered_bits = (raw_bits >> 1U) + magic_number;
return reinterpret<float,uint32_t>(rejiggered_bits);
}
int main(void)
{
static const std::vector<float> test_values{
4.0F, 0.01F, 0.0F, 5e20F, 5e-20F, 1.262738e-38F };
for ( const float& x : test_values ) {
const double gold_standard = sqrt((double)x);
const double estimate = est_sqrt(x);
const double error = estimate - gold_standard;
cout << "The error for (" << estimate << " - " << gold_standard << ") is "
<< error;
if ( gold_standard != 0.0 && std::isfinite(gold_standard) ) {
const double error_pct = error/gold_standard * 100.0;
cout << " (" << error_pct << "%).";
} else
cout << '.';
cout << endl;
}
return EXIT_SUCCESS;
}
Update
Here is an alternative definition of reinterpret<T,U>() that avoids type-punning. You could also implement the type-pun in modern C, where it’s allowed by standard, and call the function as extern "C". I think type-punning is more elegant, type-safe and consistent with the quasi-functional style of this program than memcpy(). I also don’t think you gain much, because you still could have undefined behavior from a hypothetical trap representation. Also, clang++ 3.9.1 -O -S is able to statically analyze the type-punning version, optimize the variable is_little_endian to the constant 0x1, and eliminate the run-time test, but it can only optimize this version down to a single-instruction stub.
But more importantly, this code isn’t guaranteed to work portably on every compiler. For example, some old computers can’t even address exactly 32 bits of memory. But in those cases, it should fail to compile and tell you why. No compiler is just suddenly going to break a huge amount of legacy code for no reason. Although the standard technically gives permission to do that and still say it conforms to C++14, it will only happen on an architecture very different from what we expect. And if our assumptions are so invalid that some compiler is going to turn a type-pun between a float and a 32-bit unsigned integer into a dangerous bug, I really doubt the logic behind this code will hold up if we just use memcpy() instead. We want that code to fail at compile time, and to tell us why.
#include <cassert>
#include <cstdint>
#include <cstring>
using std::memcpy;
using std::uint32_t;
template <typename T, typename U> inline T reinterpret(const U &x)
/* Reinterprets the bits of x as a T. Cannot be constexpr
* in C++14 because it modifies a variable.
*/
{
static_assert( sizeof(T)==sizeof(U), "" );
T temp;
memcpy( &temp, &x, sizeof(T) );
return temp;
}
constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;
const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
extern const bool is_little_endian = after_rshift == target;
However, Stroustrup et al., in the C++ Core Guidelines, recommend a reinterpret_cast instead:
#include <cassert>
template <typename T, typename U> inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T. Cannot be constexpr
* in C++14 because it uses reinterpret_cast.
*/
{
static_assert( sizeof(T)==sizeof(U), "" );
const U temp alignas(T) alignas(U) = x;
return *reinterpret_cast<const T*>(&temp);
}
The compilers I tested can also optimize this away to a folded constant. Stroustrup’s reasoning is [sic]:
Accessing the result of an reinterpret_cast to a different type from the objects declared type is still undefined behavior, but at least we can see that something tricky is going on.
Update
From the comments: C++20 introduces std::bit_cast, which converts an object representation to a different type with unspecified, not undefined, behavior. This doesn’t guarantee that your implementation will use the same format of float and int that this code expects, but it doesn’t give the compiler carte blanche to break your program arbitrarily because there’s technically undefined behavior in one line of it. It can also give you a constexpr conversion.
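A version of the conversion based on std::bit_cast might look like the sketch below. It is only an illustration: the names bits_as and est_sqrt_bitcast are hypothetical, and the code falls back to memcpy() when the standard library does not advertise std::bit_cast through the __cpp_lib_bit_cast feature-test macro, so it should also build in pre-C++20 modes.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#if defined(__has_include)
#  if __has_include(<bit>)
#    include <bit>   // defines __cpp_lib_bit_cast when std::bit_cast is available
#  endif
#endif

static_assert(std::numeric_limits<float>::is_iec559, "IEEE-754 float assumed");
static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float assumed");

// Reinterprets the object representation of x as a To.
template <typename To, typename From>
inline To bits_as(const From& x) {
    static_assert(sizeof(To) == sizeof(From), "size mismatch");
#if defined(__cpp_lib_bit_cast)
    return std::bit_cast<To>(x);        // C++20 path
#else
    To out;
    std::memcpy(&out, &x, sizeof out);  // portable fallback
    return out;
#endif
}

// Same arithmetic as est_sqrt(), without the range and endianness checks.
inline float est_sqrt_bitcast(float x) {
    const std::uint32_t raw = bits_as<std::uint32_t>(x);
    return bits_as<float>((raw >> 1) + 0x1fbb4000u);
}
```

The test value from the earlier code still applies: bits_as<std::uint32_t>(-0.1F) >> 1 should equal 0x5ee66666 on a platform that matches the assumptions.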

Let y = sqrt(x).
It follows from the properties of logarithms that log2(y) = 0.5 * log2(x) (1)
Interpreting a normal float as an integer gives, approximately, INT(x) = Ix = L * (log2(x) + B - σ) (2)
where L = 2^N, N is the number of bits of the significand, B is the exponent bias, and σ is a free factor to tune the approximation.
Combining (1) and (2) gives: Iy = 0.5 * (Ix + (L * (B - σ)))
which is written in the code as (*(int*)&x >> 1) + 0x1fbb4000;
Find the σ for which the constant equals 0x1fbb4000 and determine whether it's optimal.
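The first half of that exercise is plain arithmetic. As a sketch, assuming single precision (N = 23, so L = 2^23, and B = 127), the constant equals 0.5 * L * (B - σ), which can be solved for σ. The function name sigma_from_magic is made up for this example; whether the resulting σ is optimal is not addressed here.

```cpp
#include <cstdint>

// Solves magic = 0.5 * L * (B - sigma) for sigma,
// assuming L = 2^23 and B = 127 (IEEE-754 single precision).
constexpr double sigma_from_magic(std::uint32_t magic) {
    return 127.0 - static_cast<double>(magic) / 4194304.0;  // 0.5 * L = 2^22
}

// 0x1fbb4000 / 2^22 = 126.92578125, so sigma = 0.07421875 (= 19/256).
static_assert(sigma_from_magic(0x1fbb4000u) == 0.07421875,
              "sigma implied by the constant in the question");
```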

Adding a community wiki test harness that tests every float.
The approximation is within 4% for normal floats, but very poor for subnormal numbers. YMMV
Worst:1.401298e-45 211749.20%
Average:0.63%
Worst:1.262738e-38 3.52%
Average:0.02%
Note that with argument of +/-0.0, the result is not zero.
printf("% e % e\n", sqrtf(+0.0), sqrt_apx(0.0)); // 0.000000e+00 7.930346e-20
printf("% e % e\n", sqrtf(-0.0), sqrt_apx(-0.0)); // -0.000000e+00 -2.698557e+19
Test code
#include <float.h>
#include <limits.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
float sqrt_apx(float f) {
const int result = 0x1fbb4000 + (*(int*) &f >> 1);
return *(float*) &result;
}
double error_value = 0.0;
double error_worst = 0.0;
double error_sum = 0.0;
unsigned long error_count = 0;
void sqrt_test(float f) {
if (f == 0) return;
volatile float y0 = sqrtf(f);
volatile float y1 = sqrt_apx(f);
double error = (1.0 * y1 - y0) / y0;
error = fabs(error);
if (error > error_worst) {
error_worst = error;
error_value = f;
}
error_sum += error;
error_count++;
}
void sqrt_tests(float f0, float f1) {
error_value = error_worst = error_sum = 0.0;
error_count = 0;
for (;;) {
sqrt_test(f0);
if (f0 == f1) break;
f0 = nextafterf(f0, f1);
}
printf("Worst:%e %.2f%%\n", error_value, error_worst*100.0);
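// Note: unlike the worst-case line above, the average is printed as a raw ratio (not scaled by 100).
printf("Average:%.2f%%\n", error_sum / error_count);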
fflush(stdout);
}
int main() {
sqrt_tests(FLT_TRUE_MIN, FLT_MIN);
sqrt_tests(FLT_MIN, FLT_MAX);
return 0;
}

Related

Efficient division of an int by intmax

I have an integer of type uint32_t and would like to divide it by a maximum value of uint32_t and obtain the result as a float (in range 0..1).
Naturally, I can do the following:
float result = static_cast<float>(static_cast<double>(value) / static_cast<double>(std::numeric_limits<uint32_t>::max()))
This is, however, quite a lot of conversions along the way, and the division itself may be expensive.
Is there a way to achieve the above operation faster, without division and excess type conversions? Or maybe I shouldn't worry because modern compilers are able to generate efficient code already?
Edit: division by MAX+1, effectively giving me a float in range [0..1) would be fine too.
A bit more context:
I use the above transformation in a time-critical loop, with the uint32_t being produced by a relatively fast random-number generator (such as pcg). I expect that the conversions/divisions from the above transformation may have some noticeable, albeit not overwhelming, negative impact on the performance of my code.

This sounds like a job for:
std::uniform_real_distribution<float> dist(0.f, 1.f);
I would trust that to give you an unbiased conversion to float in the range [0, 1) as efficiently as possible. If you want the range to be [0, 1] you could use this:
std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f));
Here's an example with two instances of a not-so-random number generator that generates min and max for uint32_t:
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
struct ui32gen {
constexpr ui32gen(uint32_t x) : value(x) {}
uint32_t operator()() { return value; }
static constexpr uint32_t min() { return 0; }
static constexpr uint32_t max() { return std::numeric_limits<uint32_t>::max(); }
uint32_t value;
};
int main() {
ui32gen min(ui32gen::min());
ui32gen max(ui32gen::max());
std::uniform_real_distribution<float> dist(0.f, 1.f);
std::cout << dist(min) << "\n";
std::cout << dist(max) << "\n";
}
Output:
0
1
Is there a way to achieve the operation faster, without division
and excess type conversions?
If you want to manually do something similar to what uniform_real_distribution does (but much faster, and slightly biased towards lower values), you can define a function like this:
// [0, 1) the common range
inline float zero_to_one_exclusive(uint32_t value) {
static const float f_mul =
std::nextafter(1.f / float(std::numeric_limits<uint32_t>::max()), 0.f);
return float(value) * f_mul;
}
It uses multiplication instead of division since that often is a bit faster (than your original suggestion) and only has one type conversion. Here's a comparison of division vs. multiplication.
If you really want the range to be [0, 1], you can do like below, which will also be slightly biased towards lower values compared to what std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f)) would produce:
// [0, 1] the not so common range
inline float zero_to_one_inclusive(uint32_t value) {
static const float f_mul = 1.f/float(std::numeric_limits<uint32_t>::max());
return float(value) * f_mul;
}
Here's a benchmark comparing uniform_real_distribution to zero_to_one_exclusive and zero_to_one_inclusive.
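For comparison, here is a different technique from the two functions above (it is not part of this answer): keep only the top 24 bits of the input and scale by 2^-24. This is a sketch that assumes float has a 24-bit significand (IEEE-754 single precision); the name to_unit_interval is made up for the example. Because every integer below 2^24 is exactly representable in such a float, both the conversion and the multiplication are exact, and each of the 2^24 possible results corresponds to exactly 256 input values.

```cpp
#include <cstdint>

// Maps a 32-bit value to [0, 1) using its top 24 bits.
// Assumes float has a 24-bit significand, so (value >> 8) converts exactly.
inline float to_unit_interval(std::uint32_t value) {
    return static_cast<float>(value >> 8) * (1.0f / 16777216.0f);  // 2^-24
}
```

The largest result is 1 - 2^-24, so 1.0f is never produced. Whether this is faster than the multiplication-only version is something to measure on the target platform.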

Two of the casts are superfluous. You don't need to cast to float when you assign to a float anyway. Also, it is sufficient to cast one of the operands to avoid integer arithmetic. So we are left with
float result = static_cast<double>(value) / std::numeric_limits<uint32_t>::max();
This last cast you cannot avoid (otherwise you would get integer arithmetic).
Or maybe I shouldn't worry because modern compilers are able to
generate an efficient code already?
Definitely a yes and no! Yes, trust that the compiler knows best how to optimize code, and write for readability first. And no, don't trust it blindly. Look at the output of the compiler. Compare different versions and measure them.
Is there a way to achieve the above operation faster, without division
[...] ?
Probably yes. Dividing by std::numeric_limits<uint32_t>::max() is so special that I wouldn't be too surprised if the compiler came up with some tricks. My first approach would again be to look at the output of the compiler and maybe compare different compilers. Only if the compiler's output turns out to be suboptimal would I bother with manual bit-fiddling.
For further reading this might be of interest: How expensive is it to convert between int and double? . TL;DR: it actually depends on the hardware.

If performance were a real concern I think I'd be inclined to represent this 'integer that is really a fraction' in its own class and perform any conversion only where necessary.
For example:
#include <iostream>
#include <cstdint>
#include <limits>
struct fraction
{
using value_type = std::uint32_t;
constexpr explicit fraction(value_type num = 0) : numerator_(num) {}
static constexpr auto denominator() -> value_type { return std::numeric_limits<value_type>::max(); }
constexpr auto numerator() const -> value_type { return numerator_; }
constexpr auto as_double() const -> double {
return double(numerator()) / denominator();
}
constexpr auto as_float() const -> float {
return float(as_double());
}
private:
value_type numerator_;
};
auto generate() -> std::uint32_t;
int main()
{
auto frac = fraction(generate());
// use/manipulate/display frac here ...
// ... and finally convert to double/float if necessary
std::cout << frac.as_double() << std::endl;
}
However if you look at code gen on godbolt you'll see that the CPU's floating point instructions take care of the conversion. I'd be inclined to measure performance before you run the risk of wasting time on early optimisation.

Why does std::chrono::duration::operator*= not act like built-in *=?

As described in std::chrono::duration::operator+= the signature is
duration& operator*=(const rep& rhs);
This makes me wonder. I would assume that a duration literal can be used like any other built-in, but it can't.
#include <chrono>
#include <iostream>
int main()
{
using namespace std::chrono_literals;
auto m = 10min;
m *= 1.5f;
std::cout << " 150% of 10min: " << m.count() << "min" << std::endl;
int i = 10;
i *= 1.5f;
std::cout << " 150% of 10: " << i << std::endl;
}
Output is
150% of 10min: 10min
150% of 10: 15
Why was the interface chosen that way? To my mind, an interface like
template<typename T>
duration& operator*=(const T& rhs);
would yield more intuitive results.
EDIT:
Thanks for your responses, I know that the implementation behaves that way and how I could handle it. My question is, why is it designed that way.
I would expect the conversion to int to take place at the end of the operation. In the following example both operands get promoted to double before the multiplication happens. The intermediate result of 4.5 is converted to int afterwards, so that the result is 4.
int i = 3;
i *= 1.5;
assert(i == 4);
My expectation for std::chrono::duration would be that it behaves the same way.

The issue here is
auto m = 10min;
gives you a std::chrono::duration where rep is a signed integer type. When you do
m *= 1.5f;
the 1.5f is converted to the type rep and that means it is truncated to 1, which gives you the same value after multiplication.
To fix this you need to use
auto m = 10.0min;
to get a std::chrono::duration that uses a floating point type for rep and won't truncate 1.5f when you do m *= 1.5f;.
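The difference between the two literals can be seen side by side in a small sketch. The function names are made up for this example, and the code needs C++14 for the chrono literals.

```cpp
#include <chrono>

// Integral rep: 1.5f is converted to the integer rep (giving 1)
// before the multiplication, so the count stays at 10.
inline long long scale_integral_minutes() {
    using namespace std::chrono_literals;
    auto m = 10min;
    m *= 1.5f;
    return static_cast<long long>(m.count());
}

// Floating-point rep: 1.5f is not truncated, so the count becomes 15.
inline double scale_floating_minutes() {
    using namespace std::chrono_literals;
    auto m = 10.0min;
    m *= 1.5f;
    return static_cast<double>(m.count());
}
```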

My question is, why is it designed that way.
It was designed this way (ironically) because the integral-based computations are designed to give exact results, or not compile. However in this case the <chrono> library exerts no control over what conversions get applied to arguments prior to binding to the parameters.
As a concrete example, consider the case where m is initialized to 11min, and presume that we had a templated operator*= as you suggest. The exact answer is now 16.5min, but the integral-based type chrono::minutes is not capable of representing this value.
A superior design would be to have this line:
m *= 1.5f; // compile-time error
not compile. That would make the library more self-consistent: Integral-based arithmetic is either exact (or requires duration_cast) or does not compile. This would be possible to implement, and the answer as to why this was not done is simply that I didn't think of it.
If you (or anyone else) feels strongly enough about this to try to standardize a compile-time error for the above statement, I would be willing to speak in favor of such a proposal in committee.
This effort would involve:
An implementation with unit tests.
Fielding it to get a feel for how much code it would break, and ensuring that it does not break code not intended.
Writing a paper and submitting it to the C++ committee, targeting C++23 (it is too late to target C++20).
The easiest way to do this would be to start with an open-source implementation such as gcc's libstdc++ or llvm's libc++.

Looking at the implementation of operator*=:
_CONSTEXPR17 duration& operator*=(const _Rep& _Right)
{ // multiply rep by _Right
_MyRep *= _Right;
return (*this);
}
the operator takes a const _Rep&. It comes from std::chrono::duration, which looks like:
template<class _Rep, //<-
class _Period>
class duration
{ // represents a time Duration
//...
So now if we look at the definition of std::chrono::minutes:
using minutes = duration<int, ratio<60>>;
It is clear that _Rep is an int.
So when you call operator*=(const _Rep& _Right), 1.5f is being cast to an int, which equals 1 and therefore has no effect when multiplied with the stored value.
So what can you do?
you can split it up into m = m * 1.5f and use std::chrono::duration_cast to cast from std::chrono::duration<float, std::ratio> to std::chrono::duration<int, std::ratio>
m = std::chrono::duration_cast<std::chrono::minutes>(m * 1.5f);
150% of 10min: 15min
if you don't like always casting it, use a float for it as the first template argument:
std::chrono::duration<float, std::ratio<60>> m = 10min;
m *= 1.5f; //> 15min
or even quicker - auto m = 10.0min; m *= 1.5f; as @NathanOliver answered :-)
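The duration_cast route can be wrapped up as in the sketch below; scale_and_cast is a hypothetical helper name. Note that duration_cast truncates toward zero, so the 11min case mentioned in another answer (exact result 16.5min) ends up as 16.

```cpp
#include <chrono>

// Multiplies an integral minute count by 1.5f and converts back to whole minutes.
inline long long scale_and_cast(int minutes_in) {
    const std::chrono::minutes m(minutes_in);
    const auto scaled = m * 1.5f;  // duration with a floating-point rep
    return std::chrono::duration_cast<std::chrono::minutes>(scaled).count();
}
```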

Rounding large float to int

Problem: I'm looking for a way of rounding some float f to the closest int in general -- especially if the f is large. Mathematically speaking I'd like to compute the following function
r(f) = argmin_{t ∈ T} |f - t|
where T denotes the set of ints representable by my machine. In case of ties (eg. .5) r(f) can be defined arbitrarily.
Current Code: Below my current solution including two unsatisfying float examples (in main):
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
template <class T>
T projection(T const min, T t, T const max) {
return std::max(std::min(t, max), min);
}
template <class Out, class In>
Out repr(In in) {
using Limits = std::numeric_limits<Out>;
auto next = [](Out val) {
auto const zero = static_cast<In>(0);
return std::nexttoward(static_cast<In>(val), zero);
};
return projection(next(Limits::lowest()), std::round(in), next(Limits::max()));
}
int main() {
std::cout
<< repr<int>(std::numeric_limits<float>::max()) << " "
<< repr<int>(static_cast<float>(std::numeric_limits<int>::max())) << "\n";
}
On my machine with 32bit ints this prints:
2147483520 2147483520
Short elaboration: For the upper bound, next computes the next smaller float that can be safely static_casted to int (analogously for lower bound). This is necessary as my float examples in main demonstrate: Without next, repr involves undefined behavior of casting (at least) std::numeric_limits<int>::max() + 1 as float to int in which this number is not representable.
The obvious downside of my repr is that it is incorrect in the mathematical sense: For large floats (eg. std::numeric_limits<float>::max()) it doesn't return std::numeric_limits<int>::max().
Questions:
Is there an easier way to solve the problem (easier in the sense of less manual number crunching and more delegating to std-functions)?
How can repr be made correct (in the mathematical sense) with fully defined behavior only (no undefined and no implementation defined behavior)?
So far I've been talking about int and float but (as templates already suggested) this should only be a start. What about combinations
double and long or
double and long long?
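One possible direction, given here only as a sketch, is to compare the rounded value against 2^digits, which is max() + 1 and, being a power of two, exactly representable in the floating-point type. The name saturating_round is made up for this example, and the code assumes a two's complement signed integer type, a binary floating-point type whose range covers 2^digits, and an arbitrary result of 0 for NaN.

```cpp
#include <cmath>
#include <limits>
#include <type_traits>

template <class Out, class In>
Out saturating_round(In in) {
    static_assert(std::is_integral<Out>::value && std::is_signed<Out>::value,
                  "Out must be a signed integer type");
    static_assert(std::is_floating_point<In>::value,
                  "In must be a floating-point type");
    using Limits = std::numeric_limits<Out>;

    // 2^digits == max() + 1; a power of two, hence exact in In.
    const In upper = std::ldexp(static_cast<In>(1), Limits::digits);

    if (std::isnan(in)) return 0;          // arbitrary choice (assumption)
    const In r = std::round(in);           // ties round away from zero
    if (r >= upper) return Limits::max();  // too large: clamp to max()
    if (r < -upper) return Limits::lowest();  // too small: clamp to lowest()
    return static_cast<Out>(r);            // r is integral and within range
}
```

With 32-bit int, saturating_round<int>(std::numeric_limits<float>::max()) returns std::numeric_limits<int>::max() rather than 2147483520. The same template accepts double with long or long long.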

native isnan check in C++

I stumbled upon this code to check for NaN:
/**
* isnan(val) returns true if val is nan.
* We cannot rely on std::isnan or x!=x, because GCC may wrongly optimize it
* away when compiling with -ffast-math (default in RASR).
* This function basically does 3 things:
* - ignore the sign (first bit is dropped with <<1)
* - interpret val as an unsigned integer (union)
* - compares val to the nan-bitmask (ones in the exponent, non-zero significand)
**/
template<typename T>
inline bool isnan(T val) {
    if (sizeof(val) == 4) {
        union { f32 f; u32 x; } u = { (f32)val };
        return (u.x << 1) > 0xff000000u;
    } else if (sizeof(val) == 8) {
        union { f64 f; u64 x; } u = { (f64)val };
        // Exponent mask 0x7ff0000000000000 shifted left by one, matching the shift of u.x.
        return (u.x << 1) > 0xffe0000000000000u;
    } else {
        std::cerr << "isnan is not implemented for sizeof(datatype)=="
                  << sizeof(val) << std::endl;
        return false;  // every path of a non-void function must return a value
    }
}
This looks arch dependent, right? However, I'm not sure about endianness: whether the machine is little or big endian, the float and the int are probably stored in the same byte order.
Also, I wonder whether something like
volatile T x = val;
return std::isnan(x);
would have worked.
This was used with GCC 4.6 in the past.

Also, I wonder whether something like std::isnan((volatile)x) would have worked.
isnan takes its argument by value so the volatile qualifier would have been discarded. In other words, no, this doesn’t work.
The code you’ve posted relies on a specific floating point representation (IEEE). It also exhibits undefined behaviour since it relies on the union hack to retrieve the underlying float representation.
On a note about code review, the function is badly written even if we ignore the potential problems of the previous paragraph (which are justifiable): why does the function use runtime checks rather than compile-time checks and compile time error handling? It would have been better and easier just to offer two overloads.
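The two-overload suggestion could look roughly like the sketch below. It assumes IEEE-754 float and double, uses memcpy() instead of the union, and the name is_nan_bits is made up. Note the 64-bit constant: the exponent mask 0x7ff0000000000000 shifted left by one bit is 0xffe0000000000000. Whether a given compiler keeps this check intact under -ffast-math is not something this sketch can promise.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 &&
              std::numeric_limits<double>::is_iec559,
              "IEEE-754 float and double assumed");

// NaN: exponent bits all ones and a non-zero significand.
// Shifting left by one drops the sign bit before comparing.
inline bool is_nan_bits(float val) {
    std::uint32_t x;
    static_assert(sizeof x == sizeof val, "32-bit float assumed");
    std::memcpy(&x, &val, sizeof x);
    return (x << 1) > 0xff000000u;
}

inline bool is_nan_bits(double val) {
    std::uint64_t x;
    static_assert(sizeof x == sizeof val, "64-bit double assumed");
    std::memcpy(&x, &val, sizeof x);
    return (x << 1) > 0xffe0000000000000ull;
}
```

An unsupported argument type is now rejected by overload resolution at compile time instead of printing a message at run time.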

Templatized branchless int max/min function

I'm trying to write a branchless function to return the MAX or MIN of two integers without resorting to if (or ?:). Using the usual technique I can do this easily enough for a given word size:
inline int32 imax( int32 a, int32 b )
{
// signed for arithmetic shift
int32 mask = a - b;
// mask < 0 means MSB is 1.
return a + ( ( b - a ) & ( mask >> 31 ) );
}
Now, assuming arguendo that I really am writing the kind of application on the kind of in-order processor where this is necessary, my question is whether there is a way to use C++ templates to generalize this to all sizes of int.
The >>31 step only works for int32s, of course, and while I could copy out overloads on the function for int8, int16, and int64, it seems like I should use a template function instead. But how do I get the size of a template argument in bits?
Is there a better way to do it than this? Can I force the mask T to be signed? If T is unsigned the mask-shift step won't work (because it'll be a logical rather than arithmetic shift).
template< typename T >
inline T imax( T a, T b )
{
// how can I force this T to be signed?
T mask = a - b;
// I hope the compiler turns the math below into an immediate constant!
mask = mask >> ( (sizeof(T) * 8) - 1 );
return a + ( ( b - a ) & mask );
}
And, having done the above, can I prevent it from being used for anything but an integer type (eg, no floats or classes)?

EDIT: This answer is from before C++11. Since then, C++11 and later has offered make_signed<T> and much more as part of the standard library
Generally, looks good, but for 100% portability, replace that 8 with CHAR_BIT (or std::numeric_limits<unsigned char>::digits) since it isn't guaranteed that characters are 8-bit.
Any good compiler will be smart enough to merge all of the math constants at compile time.
You can force it to be signed by using a type traits library, which would usually look something like this (assuming your library is called numeric_traits):
typename numeric_traits<T>::signed_type x;
An example of a manually rolled numeric_traits header could look like this: http://rafb.net/p/Re7kq478.html (there is plenty of room for additions, but you get the idea).
or better yet, use boost:
typename boost::make_signed<T>::type x;
EDIT: IIRC, signed right shifts don't have to be arithmetic. It is common, and certainly the case with every compiler I've used. But I believe that the standard leaves it up to the compiler whether right shifts are arithmetic or not on signed types. In my copy of the draft standard, the following is written:
The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation defined.
But as I said, it will work on every compiler I've seen :-p.
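Putting these points together with the standard <type_traits> facilities mentioned in the first EDIT, a templated version might look like the following sketch. The name imax_signed is made up; the code restricts T to signed integer types with a static_assert, and it carries the same caveats as the original: a - b must not overflow T, and the right shift of a negative value is assumed to be arithmetic.

```cpp
#include <climits>
#include <type_traits>

template <typename T>
inline T imax_signed(T a, T b) {
    static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                  "imax_signed requires a signed integer type");
    // Negative when a < b (assuming a - b does not overflow T).
    const T diff = static_cast<T>(a - b);
    // All ones when diff is negative, zero otherwise
    // (assumes an arithmetic right shift on signed types).
    const T mask = static_cast<T>(diff >> (sizeof(T) * CHAR_BIT - 1));
    return static_cast<T>(a + ((b - a) & mask));
}
```

Calling it with float, an unsigned type, or a class type fails at compile time with the static_assert message.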

Here's another approach for branchless max and min. What's nice about it is that it doesn't use any bit tricks and you don't have to know anything about the type.
template <typename T>
inline T imax (T a, T b)
{
return (a > b) * a + (a <= b) * b;
}
template <typename T>
inline T imin (T a, T b)
{
return (a > b) * b + (a <= b) * a;
}

tl;dr
To achieve your goals, you're best off just writing this:
template<typename T> T max(T a, T b) { return (a > b) ? a : b; }
Long version
I implemented both the "naive" implementation of max() as well as your branchless implementation. Both of them were not templated, and I instead used int32 just to keep things simple, and as far as I can tell, not only did Visual Studio 2017 make the naive implementation branchless, it also produced fewer instructions.
Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.
Admittedly, my assembly-fu isn't all that great, so while NaiveMax() had 5 fewer instructions and no apparent branching (and inlining I'm honestly not sure what's happening) I wanted to run a test case to definitively show whether the naive implementation was faster or not.
So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.
#include <chrono>
#include <cstdio>   // getchar
#include <cstdlib>  // rand
#include <iostream>
using int32 = long;
using uint32 = unsigned long;
constexpr int32 NaiveMax(int32 a, int32 b)
{
return (a > b) ? a : b;
}
constexpr int32 FastMax(int32 a, int32 b)
{
int32 mask = a - b;
mask = mask >> ((sizeof(int32) * 8) - 1);
return a + ((b - a) & mask);
}
int main()
{
int32 resInts[1000] = {};
int32 lotsOfInts[1'000];
for (uint32 i = 0; i < 1000; i++)
{
lotsOfInts[i] = rand();
}
auto naiveTime = [&]() -> auto
{
auto start = std::chrono::high_resolution_clock::now();
for (uint32 i = 1; i < 1'000'000; i++)
{
const auto index = i % 1000;
const auto lastIndex = (i - 1) % 1000;
resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
}
auto finish = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
}();
auto fastTime = [&]() -> auto
{
auto start = std::chrono::high_resolution_clock::now();
for (uint32 i = 1; i < 1'000'000; i++)
{
const auto index = i % 1000;
const auto lastIndex = (i - 1) % 1000;
resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
}
auto finish = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
}();
std::cout << "Naive Time: " << naiveTime << std::endl;
std::cout << "Fast Time: " << fastTime << std::endl;
getchar();
return 0;
}
And here's the output I get on my machine:
Naive Time: 2330174
Fast Time: 2492246
I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.
Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.
The Answer
In brief, it would seem that the best way to write a branchless templated max() function is probably to keep it simple:
template<typename T> T max(T a, T b) { return (a > b) ? a : b; }
There are additional upsides to the naive method:
It works for unsigned types.
It even works for floating types.
It expresses exactly what you intend, rather than needing to comment up your code describing what the bit-twiddling is doing.
It is a well known and recognizable pattern, so most compilers will know exactly how to optimize it, making it more portable. (This is a gut hunch of mine, only backed up by personal experience of compilers surprising me a lot. I'll be willing to admit I'm wrong here.)

You may want to look at the Boost.TypeTraits library. For detecting whether a type is signed you can use the is_signed trait. You can also look into enable_if/disable_if for removing overloads for certain types.

I don't know what are the exact conditions for this bit mask trick to work but you can do something like
#include<type_traits>
template<typename T, typename = std::enable_if_t<std::is_integral<T>{}> >
inline T imax( T a, T b )
{
...
}
Other useful candidates are std::is_[un]signed, std::is_fundamental, etc. https://en.cppreference.com/w/cpp/types

In addition to tloch14's answer "tl;dr", one can also use an index into an array. This avoids the unwieldy bit-shuffling of the "branchless min/max"; it's also generalizable to all types.
template<typename T> constexpr T OtherFastMax(const T &a, const T &b)
{
const T (&p)[2] = {a, b};
return p[a>b];
}
