Subtract extremely small number from one in C++

I need to subtract an extremely small double number x from 1, i.e. to calculate 1-x in C++ for 0 < x < 1e-16. Because of machine precision restrictions, for small enough x I will always get 1-x = 1. A simple solution is to switch from double to some more precise format like long. But because of some restrictions I can't switch to more precise number formats.
What is the most efficient way to get an accurate value of 1-x, where x is an extremely small double, if I can't use more precise formats and I need to store the result of the subtraction as a double? In practice I would like to avoid percentage errors greater than 1% (between the double representation of 1-x and its actual value).
P.S. I am using Rcpp to calculate the quantiles of the standard normal distribution via the qnorm function. This function is symmetric around 0.5 and much more accurate for values close to 0. Therefore, instead of qnorm(1-(1e-30)) I would like to calculate -qnorm(1e-30), but to derive 1e-30 from 1-(1e-30) I need to deal with a precision problem. The restriction to double is due to the fact that, as far as I know, it is not safe to use more precise numeric formats in Rcpp. Note that my inputs to qnorm are sort of exogenous, so I can't derive 1-x from x during some preliminary calculations.
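As a rough sketch of that symmetry in Rcpp (assuming the scalar R::qnorm wrapper; the exported function name is just illustrative):

#include <Rcpp.h>

// [[Rcpp::export]]
double upper_tail_quantile(double x) {
    // For tiny x, -qnorm(x) equals qnorm(1 - x) without ever forming 1 - x as a double.
    // Passing lower_tail = 0 asks qnorm for the upper-tail quantile directly.
    return R::qnorm(x, /*mu=*/0.0, /*sigma=*/1.0, /*lower_tail=*/0, /*log_p=*/0);
}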

Simple solution is to switch from double to some more precise format like long [presumably, double]
In that case you have no solution. long double is an alias for double on all modern machines. (I stand corrected: gcc and icc still support an extended-precision long double; it is only cl that has treated long double as plain double for a long time.)
So you have two solutions, and they're not mutually exclusive:
Use an arbitrary precision library instead of the built-in types. They're orders of magnitude slower, but if that's the best your algorithm can work with then that's that. (A minimal sketch of this option follows below.)
Use a better algorithm, or at least rearrange your equation variables, to not have this need in the first place. Use distribution and cancellation rules to avoid the problem entirely. Without a more in-depth description of your problem we can't help you, but I can tell you with certainty that double is more than enough to allow us to model airplane AI and flight parameters anywhere in the world.
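For instance, a rough sketch of the first option using Boost.Multiprecision (cpp_bin_float_50 is just one possible extended type, and whether it can be used safely from Rcpp is a separate question):

#include <boost/multiprecision/cpp_bin_float.hpp>
#include <iostream>

int main()
{
    using big = boost::multiprecision::cpp_bin_float_50;   // ~50 decimal digits

    big x("1e-30");
    big y = 1 - x;                     // representable at this precision
    std::cout << big(1) - y << "\n";   // recovers ~1e-30
}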

Rather than resorting to an arbitrary precision solution (which, as others have said, would potentially be extremely slow), you could simply create a class that extends the inherent precision of the double type by a factor of (approximately) two. You would then only need to implement the operations that you actually need: in your case, this may only be subtraction (and possibly addition), which are both reasonably easy to achieve. Such code will still be considerably slower than using native types, but likely much faster than libraries that use unnecessary precision.
Such an implementation is available (as open-source) in the QD_Real class, created some time ago by Yozo Hida (a PhD Student, at the time, I believe).
The linked repository contains a lot of code, much of which is likely unnecessary for your use-case. Below, I have shown an extremely trimmed-down version, which allows creation of data with the required precision, shows an implementation of the required operator-() and a test case.
#include <iostream>

class ddreal {
private:
    // Knuth's "two-sum": returns a + b and stores the rounding error in err.
    static inline double Plus2(double a, double b, double& err) {
        double s = a + b;
        double bb = s - a;
        err = (a - (s - bb)) + (b - bb);
        return s;
    }
    static inline void Plus3(double& a, double& b, double& c) {
        double t3, t2, t1 = Plus2(a, b, t2);
        a = Plus2(c, t1, t3);
        b = Plus2(t2, t3, c);
    }
public:
    double x[2];                       // high and low parts of the value
    ddreal() { x[0] = x[1] = 0.0; }
    ddreal(double hi) { x[0] = hi; x[1] = 0.0; }
    ddreal(double hi, double lo) { x[0] = Plus2(hi, lo, x[1]); }
    ddreal& operator -= (ddreal const& b) {
        double t1, t2, s2;
        x[0] = Plus2(x[0], -b.x[0], s2);
        t1 = Plus2(x[1], -b.x[1], t2);
        x[1] = Plus2(s2, t1, t1);
        t1 += t2;
        Plus3(x[0], x[1], t1);
        return *this;
    }
    inline double toDouble() const { return x[0] + x[1]; }
};

inline ddreal operator-(ddreal const& a, ddreal const& b)
{
    ddreal retval = a;
    return retval -= b;
}

int main()
{
    double sdone{ 1.0 };
    double sdwee{ 1.0e-42 };
    double sdval = sdone - sdwee;
    double sdans = sdone - sdval;
    std::cout << sdans << "\n"; // Gives zero, as expected

    ddreal ddone{ 1.0 };
    ddreal ddwee{ 1.0e-42 };
    ddreal ddval = ddone - ddwee; // Can actually hold 1 - 1.0e-42 ...
    ddreal ddans = ddone - ddval;
    std::cout << ddans.toDouble() << "\n"; // Gives 1.0e-42

    ddreal ddalt{ 1.0, -1.0e-42 }; // Alternative initialization ...
    ddreal ddsec = ddone - ddalt;
    std::cout << ddsec.toDouble() << "\n"; // Gives 1.0e-42
    return 0;
}
Note that I have deliberately neglected error-checking and other overheads that would be needed for a more general implementation. Also, the code I have shown has been 'tweaked' to work more optimally on x86/x64 CPUs, so you may need to delve into the code at the linked GitHub, if you need support for other platforms. (However, I think the code I have shown will work for any platform that conforms strictly to the IEEE-754 Standard.)
I have tested this implementation, extensively, in code I use to generate and display the Mandelbrot Set (and related fractals) at very deep zoom levels, where use of the raw double type fails completely.
Note that, though you may be tempted to 'optimize' some of the seemingly pointless operations, doing so will break the system. Also, this must be compiled using the /fp:precise (or /fp:strict) flags (with MSVC), or the equivalent(s) for other compilers; using /fp:fast will break the code, completely.

Related

Efficient division of an int by intmax

I have an integer of type uint32_t and would like to divide it by the maximum value of uint32_t and obtain the result as a float (in the range 0..1).
Naturally, I can do the following:
float result = static_cast<float>(static_cast<double>(value) / static_cast<double>(std::numeric_limits<uint32_t>::max()));
This is however quite a lot of conversions along the way, and the division itself may be expensive.
Is there a way to achieve the above operation faster, without division and excess type conversions? Or maybe I shouldn't worry because modern compilers are able to generate efficient code already?
Edit: division by MAX+1, effectively giving me a float in range [0..1) would be fine too.
A bit more context:
I use the above transformation in a time-critical loop, with the uint32_t being produced by a relatively fast random-number generator (such as pcg). I expect that the conversions/divisions in the above transformation may have some noticeable, albeit not overwhelming, negative impact on the performance of my code.
This sounds like a job for:
std::uniform_real_distribution<float> dist(0.f, 1.f);
I would trust that to give you an unbiased conversion to float in the range [0, 1) as efficiently as possible. If you want the range to be [0, 1] you could use this:
std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f));
Here's an example with two instances of a not-so-random number generator that generates min and max for uint32_t:
#include <iostream>
#include <limits>
#include <random>

struct ui32gen {
    constexpr ui32gen(uint32_t x) : value(x) {}
    uint32_t operator()() { return value; }
    static constexpr uint32_t min() { return 0; }
    static constexpr uint32_t max() { return std::numeric_limits<uint32_t>::max(); }
    uint32_t value;
};

int main() {
    ui32gen min(ui32gen::min());
    ui32gen max(ui32gen::max());
    std::uniform_real_distribution<float> dist(0.f, 1.f);
    std::cout << dist(min) << "\n";
    std::cout << dist(max) << "\n";
}
Output:
0
1
Is there a way to achieve the operation faster, without division and excess type conversions?
If you want to manually do something similar to what uniform_real_distribution does (but much faster, and slightly biased towards lower values), you can define a function like this:
// [0, 1) the common range
inline float zero_to_one_exclusive(uint32_t value) {
    static const float f_mul =
        std::nextafter(1.f / float(std::numeric_limits<uint32_t>::max()), 0.f);
    return float(value) * f_mul;
}
It uses multiplication instead of division since that often is a bit faster (than your original suggestion) and only has one type conversion. Here's a comparison of division vs. multiplication.
If you really want the range to be [0, 1], you can do like below, which will also be slightly biased towards lower values compared to what std::uniform_real_distribution<float> dist(0.f, std::nextafter(1.f, 2.f)) would produce:
// [0, 1] the not so common range
inline float zero_to_one_inclusive(uint32_t value) {
    static const float f_mul = 1.f / float(std::numeric_limits<uint32_t>::max());
    return float(value) * f_mul;
}
Here's a benchmark comparing uniform_real_distribution to zero_to_one_exclusive and zero_to_one_inclusive.
Two of the casts are superfluous. You don't need to cast to float when you are assigning to a float anyway. Also, it is sufficient to cast one of the operands to avoid integer arithmetic. So we are left with
float result = static_cast<double>(value) / std::numeric_limits<uint32_t>::max();
This last cast you cannot avoid (otherwise you would get integer arithmetic).
Or maybe I shouldn't worry because modern compilers are able to generate efficient code already?
Definitely a yes and no! Yes, trust the compiler that it knows best how to optimize code, and write for readability first. And no, don't blindly trust. Look at the output of the compiler. Compare different versions and measure them.
Is there a way to achieve the above operation faster, without division [...]?
Probably yes. Dividing by std::numeric_limits<uint32_t>::max() is so special that I wouldn't be too surprised if the compiler came up with some tricks. My first approach would again be to look at the output of the compiler and maybe compare different compilers. Only if the compiler's output turns out to be suboptimal would I bother with some manual bit-fiddling.
For further reading this might be of interest: How expensive is it to convert between int and double? TL;DR: it actually depends on the hardware.
If performance were a real concern I think I'd be inclined to represent this 'integer that is really a fraction' in its own class and perform any conversion only where necessary.
For example:
#include <iostream>
#include <cstdint>
#include <limits>

struct fraction
{
    using value_type = std::uint32_t;

    constexpr explicit fraction(value_type num = 0) : numerator_(num) {}

    static constexpr auto denominator() -> value_type { return std::numeric_limits<value_type>::max(); }
    constexpr auto numerator() const -> value_type { return numerator_; }

    constexpr auto as_double() const -> double {
        return double(numerator()) / denominator();
    }
    constexpr auto as_float() const -> float {
        return float(as_double());
    }

private:
    value_type numerator_;
};

auto generate() -> std::uint32_t;

int main()
{
    auto frac = fraction(generate());

    // use/manipulate/display frac here ...

    // ... and finally convert to double/float if necessary
    std::cout << frac.as_double() << std::endl;
}
However if you look at code gen on godbolt you'll see that the CPU's floating point instructions take care of the conversion. I'd be inclined to measure performance before you run the risk of wasting time on early optimisation.

Why does std::chrono::duration::operator*= not act like the built-in *=?

As described in std::chrono::duration::operator*=, the signature is
duration& operator*=(const rep& rhs);
This makes me wonder. I would assume that a duration literal can be used like any other built-in type, but it doesn't behave that way.
#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono_literals;

    auto m = 10min;
    m *= 1.5f;
    std::cout << " 150% of 10min: " << m.count() << "min" << std::endl;

    int i = 10;
    i *= 1.5f;
    std::cout << " 150% of 10: " << i << std::endl;
}
Output is
150% of 10min: 10min
150% of 10: 15
Why was the interface chosen that way? To my mind, an interface like
template<typename T>
duration& operator*=(const T& rhs);
would yield more intuitive results.
EDIT:
Thanks for your responses, I know that the implementation behaves that way and how I could handle it. My question is, why is it designed that way.
I would expect the conversion to int to take place at the end of the operation. In the following example both operands get promoted to double before the multiplication happens. The intermediate result of 4.5 is converted to int afterwards, so that the result is 4.
int i = 3;
i *= 1.5;
assert(i == 4);
My expectation for std::duration would be that it behaves the same way.
The issue here is
auto m = 10min;
gives you a std::chrono::duration where rep is a signed integer type. When you do
m *= 1.5f;
the 1.5f is converted to the type rep and that means it is truncated to 1, which gives you the same value after multiplication.
To fix this you need to use
auto m = 10.0min;
to get a std::chrono::duration that uses a floating-point type for rep and won't truncate 1.5f when you do m *= 1.5f;.
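A minimal check of that fix, as a sketch:

#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono_literals;

    auto m = 10.0min;                     // duration with a floating-point rep
    m *= 1.5f;                            // 1.5f is no longer truncated
    std::cout << m.count() << "min\n";    // prints 15min
}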
My question is, why is it designed that way.
It was designed this way (ironically) because the integral-based computations are designed to give exact results, or not compile. However, in this case the <chrono> library exerts no control over what conversions get applied to the arguments prior to binding them to the parameters.
As a concrete example, consider the case where m is initialized to 11min, and presume that we had a templated operator*= as you suggest. The exact answer is now 16.5min, but the integral-based type chrono::minutes is not capable of representing this value.
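To make the dilemma concrete, a small sketch: multiplying by a floating-point factor through the heterogeneous operator* gives an exact floating-point duration, and getting back to integral minutes requires an explicit duration_cast:

#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono;
    using namespace std::chrono_literals;

    auto exact = 11min * 1.5;                          // duration<double, ratio<60>>: 16.5
    auto trunc = duration_cast<minutes>(11min * 1.5);  // 16min, the rounding is explicit

    std::cout << exact.count() << " " << trunc.count() << "\n";   // 16.5 16
}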
A superior design would be to have this line:
m *= 1.5f; // compile-time error
not compile. That would make the library more self-consistent: Integral-based arithmetic is either exact (or requires duration_cast) or does not compile. This would be possible to implement, and the answer as to why this was not done is simply that I didn't think of it.
If you (or anyone else) feels strongly enough about this to try to standardize a compile-time error for the above statement, I would be willing to speak in favor of such a proposal in committee.
This effort would involve:
An implementation with unit tests.
Fielding it to get a feel for how much code it would break, and ensuring that it does not break code not intended.
Write a paper and submit it to the C++ committee, targeting C++23 (it is too late to target C++20).
The easiest way to do this would be to start with an open-source implementation such as gcc's libstdc++ or llvm's libc++.
Looking at the implementation of operator*=:
_CONSTEXPR17 duration& operator*=(const _Rep& _Right)
{   // multiply rep by _Right
    _MyRep *= _Right;
    return (*this);
}
the operator takes a const _Rep&. It comes from std::duration which looks like:
template<class _Rep, //<-
class _Period>
class duration
{ // represents a time Duration
//...
So now if we look at the definition of std::chrono::minutes:
using minutes = duration<int, ratio<60>>;
It is clear that _Rep is an int.
So when you call operator*=(const _Rep& _Right), 1.5f is being cast to an int, which equals 1, and multiplying by it therefore has no effect.
So what can you do?
you can split it up into m = m * 1.5f and use std::chrono::duration_cast to convert from std::chrono::duration<float, std::ratio<60>> back to std::chrono::duration<int, std::ratio<60>>:
m = std::chrono::duration_cast<std::chrono::minutes>(m * 1.5f);
150% of 10min: 15min
if you don't like always casting it, use a float as the first template argument (the representation type):
std::chrono::duration<float, std::ratio<60>> m = 10min;
m *= 1.5f; //> 15min
or even quicker - auto m = 10.0min; m *= 1.5f; as #NathanOliver answered :-)

Range analysis of floating point values?

I have an image processing program which uses floating point calculations. However, I need to port it to a processor which does not have floating point support in it. So, I have to change the program to use fixed point calculations. For that I need proper scaling of those floating point numbers, for which I need to know the range of all values, including intermediate values of the floating point calculations.
Is there a method where I just run the program and it automatically give me the range of all the floating point calculations in the program? Trying to figure out the ranges manually would be too cumbersome, so if there is some tool for doing it, that would be awesome!
You could use some "measuring" replacement for your floating type, along these lines (live example):
// Includes added so the snippet compiles on its own.
#include <cmath>
#include <iostream>
#include <limits>

template<typename T>
class foo
{
    T val;

    using lim = std::numeric_limits<int>;

    static int& min_val() { static int e = lim::max(); return e; }
    static int& max_val() { static int e = lim::min(); return e; }

    static void sync_min(T e) { if (e < min_val()) min_val() = int(e); }
    static void sync_max(T e) { if (e > max_val()) max_val() = int(e); }

    // Record the decimal exponent (order of magnitude) of every value seen.
    static void sync(T v)
    {
        v = std::abs(v);
        T e = v == 0 ? T(1) : std::log10(v);
        sync_min(std::floor(e)); sync_max(std::ceil(e));
    }

public:
    foo(T v = T()) : val(v) { sync(v); }
    foo& operator=(T v) { val = v; sync(v); return *this; }

    template<typename U> foo(U v) : foo(T(v)) {}
    template<typename U> foo& operator=(U v) { return *this = T(v); }

    operator T&() { return val; }
    operator const T&() const { return val; }

    static int min() { return min_val(); }
    static int max() { return max_val(); }
};
to be used like
int main ()
{
    using F = foo<float>;
    F x;
    for (F e = -10.2; e <= 30.4; e += .2)
        x = std::pow(10, e);
    std::cout << F::min() << " " << F::max() << std::endl; // -11 31
}
This means you need to define an alias (say, Float) for your floating type (float or double) and use it consistently throughout your program. This may be inconvenient but it may prove beneficial eventually (because then your program is more generic). If your code is already templated on the floating type, even better.
After this parametrization, you can switch your program to "measuring" or "release" mode by defining Float to be either foo<T> or T, where T is your float or double.
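For instance, a small sketch of that switch (MEASURE_RANGES is just an illustrative macro name, not something defined by the code above):

#ifdef MEASURE_RANGES
using Float = foo<float>;   // "measuring" build: every value is fed through sync()
#else
using Float = float;        // "release" build: plain float, no overhead
#endif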
The good thing is that you don't need external tools, your own code carries out the measurements. The bad thing is that, as currently designed, it won't catch all intermediate results. You would have to define all (e.g. arithmetic) operators on foo for this. This can be done but needs some more work.
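For example, a minimal sketch (for the foo<T> class above) of one such operator, so that the results of additions also pass through sync(); this operator is not part of the original snippet:

template<typename T>
foo<T> operator+(foo<T> const& a, foo<T> const& b)
{
    // Construct a foo<T> from the raw sum; the converting constructor calls sync(),
    // so the intermediate result's magnitude is recorded as well.
    return foo<T>(static_cast<T const&>(a) + static_cast<T const&>(b));
}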
It is not true that you cannot use floating-point code on hardware that does not support floating point: the compiler will provide software routines to perform the floating-point operations. They may be rather slow, but if they are fast enough for your application, that is the path of least resistance.
It is probably simplest to implement a fixed point data type class and have its member functions detect over/underflow as a debug option (because the checking will otherwise slow your code).
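To make that concrete, here is a minimal illustrative sketch of such a class (this is not the library recommended next; the 16.16 layout and the assert-based check are only placeholders):

#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// A 16.16 fixed-point type whose conversion from double checks for range overflow
// in debug builds (the assert compiles away in release builds).
struct fixed16_16 {
    static constexpr int frac_bits = 16;
    std::int32_t raw;                                  // stored as value * 2^16

    explicit fixed16_16(double v) {
        const double scaled = std::ldexp(v, frac_bits);   // v * 65536
        assert(scaled >= std::numeric_limits<std::int32_t>::min() &&
               scaled <= std::numeric_limits<std::int32_t>::max() &&
               "fixed-point range exceeded");
        raw = static_cast<std::int32_t>(std::lround(scaled));
    }
    double to_double() const { return std::ldexp(static_cast<double>(raw), -frac_bits); }
};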
I suggest you look at Anthony Williams' fixed-point math C++ library. It is in C++ and defines a fixed class with extensive function and operator overloading, so it can largely be used simply by replacing float or double in your existing code with fixed. It uses int64_t as the underlying integer data type, with 34 integer bits and 28 fractional bits (34Q28), so it is good for about 8 decimal places and a wider range than int32_t.
It does not have the under/overflow checking I suggested, but it is a good starting point for you to add your own.
On 32-bit ARM this library performs about 5 times faster than software floating point and is comparable in performance to ARM's VFP unit for C code.
Note that the sqrt() function in this library has poor precision for very small values, as it loses lower-order bits in intermediate calculations that could be preserved. It can be improved by replacing it with the version I presented in this question.
For self-contained C programs, you can use Frama-C's value analysis to obtain ranges for the floating-point variables, for instance for a variable h, and likewise for a variable g computed from h.
There is a specification language to describe the ranges of the inputs (information without which it is difficult to say anything informative). In the example above, I used that language to specify what function float_interval was expected to do:
/*@ ensures \is_finite(\result) && l <= \result <= u ; */
float float_interval(float l, float u);
Frama-C is easiest to install on Linux, with Debian and Ubuntu binary packages for a recent (but usually not the latest) version available from within the distribution.
If you could post your code, it would help in telling whether this approach is realistic. If your code is C++, for instance (your question does not say; it is tagged with several language tags), then the current version of Frama-C will be no help, as it only accepts C programs.

Simple way to compare doubles

I am writing a numerical code that needs to make extensive (and possibly fast) comparisons among double precision numbers. My solution to compare two numbers A and B involves shifting A to the left (or right) by an epsilon and checking whether the result is bigger (or smaller) than B. If so, the two doubles are the same. (Extra coding needs to be done for negative or zero numbers).
This is the comparing function:
#ifndef S_
#define S_

inline double s_l (double x){
    if(x>0){return 0.999999999*x;}
    else if(x<0){return 1.00000001*x;}
    else {return x-0.000000000001;}
}

inline double s_r (double x){
    if(x>0){return 1.00000001*x;}
    else if(x<0){return 0.999999999*x;}
    else{return x+0.000000000001;}
}

inline bool s_equal (double x,double y){
    if(x==y){return true;}
    else if(x<y && s_r(x)>y){return true;}
    else if(x>y && s_l(x)<y){return true;}
    else{return false;}
}

#endif
Since this is part of a Monte Carlo algorithm and s_equal(x,y) is called millions of times, I wonder if there is any better or faster way to code this, understandable at a simple level.
I do something like abs( (x-y)/x ) < 1.0e-10.
You need to divide by x in case both values are huge or tiny.
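A minimal sketch of that test (the name nearly_equal and the default tolerance are illustrative; the first branch also guards the division when x is exactly zero):

#include <cmath>

inline bool nearly_equal(double x, double y, double rel_tol = 1.0e-10)
{
    if (x == y) return true;                 // covers the case x == y == 0
    return std::abs((x - y) / x) < rel_tol;  // relative error, as described above
}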
I was surprised to find a significant speedup by avoiding all the double-precision math:
#define S_L(x) (x)+((x)<0?1024:-1024)
#define S_R(x) (x)+((x)<0?-1024:1024)
#define S_EQUAL(x,y) (S_L(x)<(y) && S_R(x)>(y))
double foo;
double bar;
long *pfoo;
long *pbar;
pfoo = (long*)&foo;
pbar = (long*)&bar;
double result1 = S_R(*pfoo);
double result2 = S_L(*pbar);
bool result3 = S_EQUAL(*pfoo, *pbar);
(In testing, I operated on randomly-generated doubles between -1M and 1M, executing each operation 100M times with different input for each iteration. Each operation was timed in an independent loop, comparing system times - not wall times. Including loop overhead and generation of random numbers, this solution was about 25% faster.)
A word of warning: there are lots of dependencies here on your hardware, the range of your doubles, the behavior of your optimizer, etc., etc. Such pitfalls are commonplace when you start second-guessing your compiler. I was shocked to see how much faster this was for me, since I'd always been told that integer and floating point units are kept so separate on hardware that the transport of bits from one to the other always requires a hardware memory operation. Who knows how well this will work for you.
You will likely have to play with the magic numbers a bit (the 1024s) to get it to do about what you want it to - if it's even possible.
If you're using C++11, then you could use the new math library functions, such as:
bool isgreater(float x, float y)
More documentation on std::isgreater can be had here.
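A small usage sketch of std::isgreater from <cmath>; unlike the built-in operator>, it is specified not to raise FE_INVALID when an argument is a quiet NaN:

#include <cmath>
#include <iostream>

int main()
{
    double a = 1.0, b = 2.0;
    std::cout << std::boolalpha
              << std::isgreater(a, b) << "\n"   // false
              << std::isgreater(b, a) << "\n";  // true
}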
Otherwise, there's always is_equal in boost. Also, SO already has a bunch of related (not sure if same) questions such as the ones here, here and here.

Templatized branchless int max/min function

I'm trying to write a branchless function to return the MAX or MIN of two integers without resorting to if (or ?:). Using the usual technique I can do this easily enough for a given word size:
inline int32 imax( int32 a, int32 b )
{
    // signed for arithmetic shift
    int32 mask = a - b;
    // mask < 0 means MSB is 1.
    return a + ( ( b - a ) & ( mask >> 31 ) );
}
Now, assuming arguendo that I really am writing the kind of application on the kind of in-order processor where this is necessary, my question is whether there is a way to use C++ templates to generalize this to all sizes of int.
The >>31 step only works for int32s, of course, and while I could copy out overloads on the function for int8, int16, and int64, it seems like I should use a template function instead. But how do I get the size of a template argument in bits?
Is there a better way to do it than this? Can I force the mask T to be signed? If T is unsigned the mask-shift step won't work (because it'll be a logical rather than arithmetic shift).
template< typename T >
inline T imax( T a, T b )
{
    // how can I force this T to be signed?
    T mask = a - b;
    // I hope the compiler turns the math below into an immediate constant!
    mask = mask >> ( (sizeof(T) * 8) - 1 );
    return a + ( ( b - a ) & mask );
}
And, having done the above, can I prevent it from being used for anything but an integer type (eg, no floats or classes)?
EDIT: This answer is from before C++11. Since then, C++11 and later have offered std::make_signed<T> and much more as part of the standard library.
Generally, looks good, but for 100% portability, replace that 8 with CHAR_BIT (or std::numeric_limits<unsigned char>::digits) since it isn't guaranteed that characters are 8-bit.
Any good compiler will be smart enough to merge all of the math constants at compile time.
You can force it to be signed by using a type traits library, which would usually look something like this (assuming your numeric_traits library is called numeric_traits):
typename numeric_traits<T>::signed_type x;
An example of a manually rolled numeric_traits header could look like this: http://rafb.net/p/Re7kq478.html (there is plenty of room for additions, but you get the idea).
or better yet, use boost:
typename boost::make_signed<T>::type x;
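Putting that together with the original template might look roughly like this (a sketch only: it uses std::make_signed from the C++11 note above, assumes an arithmetic right shift on signed types, see the caveat below, and ignores overflow for extreme input differences):

#include <climits>
#include <type_traits>

template<typename T>
inline T imax(T a, T b)
{
    typedef typename std::make_signed<T>::type S;
    S mask = static_cast<S>(a - b);                   // sign bit set when a < b
    mask = mask >> (sizeof(S) * CHAR_BIT - 1);        // all ones or all zeros
    return a + ((b - a) & static_cast<T>(mask));
}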
EDIT: IIRC, signed right shifts don't have to be arithmetic. It is common, and certainly the case with every compiler I've used. But I believe that the standard leaves it up to the compiler whether right shifts are arithmetic or not on signed types. In my copy of the draft standard, the following is written:
The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
But as I said, it will work on every compiler I've seen :-p.
Here's another approach for branchless max and min. What's nice about it is that it doesn't use any bit tricks and you don't have to know anything about the type.
template <typename T>
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

template <typename T>
inline T imin (T a, T b)
{
    return (a > b) * b + (a <= b) * a;
}
tl;dr
To achieve your goals, you're best off just writing this:
template<typename T> T max(T a, T b) { return (a > b) ? a : b; }
Long version
I implemented both the "naive" implementation of max() and your branchless implementation. Neither was templated; I used int32 just to keep things simple. As far as I can tell, not only did Visual Studio 2017 make the naive implementation branchless, it also produced fewer instructions.
Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.
Admittedly, my assembly-fu isn't all that great, so while NaiveMax() had 5 fewer instructions and no apparent branching (and with inlining I'm honestly not sure what's happening), I wanted to run a test case to definitively show whether the naive implementation was faster or not.
So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.
#include <iostream>
#include <chrono>
#include <cstdio>    // getchar
#include <cstdlib>   // rand

using int32 = long;
using uint32 = unsigned long;

constexpr int32 NaiveMax(int32 a, int32 b)
{
    return (a > b) ? a : b;
}

constexpr int32 FastMax(int32 a, int32 b)
{
    int32 mask = a - b;
    mask = mask >> ((sizeof(int32) * 8) - 1);
    return a + ((b - a) & mask);
}

int main()
{
    int32 resInts[1000] = {};

    int32 lotsOfInts[1'000];
    for (uint32 i = 0; i < 1000; i++)
    {
        lotsOfInts[i] = rand();
    }

    auto naiveTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    auto fastTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    std::cout << "Naive Time: " << naiveTime << std::endl;
    std::cout << "Fast Time: " << fastTime << std::endl;

    getchar();

    return 0;
}
And here's the output I get on my machine:
Naive Time: 2330174
Fast Time: 2492246
I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.
Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.
The Answer
In brief, it would seem that the best way to write a branchless templated max() function is probably to keep it simple:
template<typename T> T max(T a, T b) { return (a > b) ? a : b; }
There are additional upsides to the naive method:
It works for unsigned types.
It even works for floating types.
It expresses exactly what you intend, rather than needing to comment up your code describing what the bit-twiddling is doing.
It is a well known and recognizable pattern, so most compilers will know exactly how to optimize it, making it more portable. (This is a gut hunch of mine, only backed up by personal experience of compilers surprising me a lot. I'll be willing to admit I'm wrong here.)
You may want to look at the Boost.TypeTraits library. For detecting whether a type is signed you can use the is_signed trait. You can also look into enable_if/disable_if for removing overloads for certain types.
I don't know what the exact conditions are for this bit-mask trick to work, but you can do something like
#include <type_traits>

template<typename T, typename = std::enable_if_t<std::is_integral<T>{}> >
inline T imax( T a, T b )
{
    ...
}
Other useful candidates are std::is_[un]signed, std::is_fundamental, etc. https://en.cppreference.com/w/cpp/types
In addition to tloch14's answer "tl;dr", one can also use an index into an array. This avoids the unwieldy bit-shuffling of the "branchless min/max"; it's also generalizable to all types.
template<typename T> constexpr T OtherFastMax(const T &a, const T &b)
{
    const T (&p)[2] = {a, b};
    return p[a < b]; // index 1 (b) when b is the larger value, index 0 (a) otherwise
}