Best way to "Clip" (saturate) a variable [duplicate] - c++

Is there a more efficient way to clamp real numbers than using if statements or ternary operators?
I want to do this both for doubles and for a 32-bit fixpoint implementation (16.16). I'm not asking for code that can handle both cases; they will be handled in separate functions.
Obviously, I can do something like:
double clampedA;
double a = calculate();
clampedA = a > MY_MAX ? MY_MAX : a;
clampedA = clampedA < MY_MIN ? MY_MIN : clampedA;
or
double a = calculate();
double clampedA = a;
if(clampedA > MY_MAX)
clampedA = MY_MAX;
else if(clampedA < MY_MIN)
clampedA = MY_MIN;
The fixpoint version would use functions/macros for comparisons.
This is done in a performance-critical part of the code, so I'm looking for as efficient a way to do it as possible (which I suspect would involve bit manipulation).
EDIT: It has to be standard/portable C, platform-specific functionality is not of any interest here. Also, MY_MIN and MY_MAX are the same type as the value I want clamped (doubles in the examples above).

Both GCC and clang generate beautiful assembly for the following simple, straightforward, portable code:
double clamp(double d, double min, double max) {
    const double t = d < min ? min : d;
    return t > max ? max : t;
}
> gcc -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
GCC-generated assembly:
maxsd %xmm0, %xmm1 # d, min
movapd %xmm2, %xmm0 # max, max
minsd %xmm1, %xmm0 # min, max
ret
> clang -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
Clang-generated assembly:
maxsd %xmm0, %xmm1
minsd %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
Three instructions (not counting the ret), no branches. Excellent.
This was tested with GCC 4.7 and clang 3.2 on Ubuntu 13.04 with a Core i3 M 350.
On a side note, the straightforward C++ code calling std::min and std::max generated the same assembly.
This is for doubles. And for int, both GCC and clang generate assembly with five instructions (not counting the ret) and no branches. Also excellent.
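For reference, a minimal sketch of what that int version looks like (same shape as the double version above; the five-instruction figure refers to code of this form):
int clamp_int(int v, int lo, int hi) {
    const int t = v < lo ? lo : v;
    return t > hi ? hi : t;
}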
I don't currently use fixed-point, so I won't offer an opinion on that case.

Old question, but I was working on this problem today (with doubles/floats).
The best approach is to use SSE MINSS/MAXSS for floats and SSE2 MINSD/MAXSD for doubles. These are branchless and take one clock cycle each, and are easy to use thanks to compiler intrinsics. They confer more than an order of magnitude increase in performance compared with clamping with std::min/max.
You may find that surprising. I certainly did! Unfortunately VC++ 2010 uses simple comparisons for std::min/max even when /arch:SSE2 and /fp:fast are enabled. I can't speak for other compilers.
Here's the necessary code to do this in VC++:
#include <xmmintrin.h>
float minss ( float a, float b )
{
    // Branchless SSE min.
    _mm_store_ss( &a, _mm_min_ss(_mm_set_ss(a),_mm_set_ss(b)) );
    return a;
}

float maxss ( float a, float b )
{
    // Branchless SSE max.
    _mm_store_ss( &a, _mm_max_ss(_mm_set_ss(a),_mm_set_ss(b)) );
    return a;
}

float clamp ( float val, float minval, float maxval )
{
    // Branchless SSE clamp.
    // return minss( maxss(val,minval), maxval );
    _mm_store_ss( &val, _mm_min_ss( _mm_max_ss(_mm_set_ss(val),_mm_set_ss(minval)), _mm_set_ss(maxval) ) );
    return val;
}
The double precision code is the same except with the xxx_sd intrinsics (from <emmintrin.h>) instead.
Edit: Initially I wrote the clamp function as commented. But looking at the assembler output I noticed that the VC++ compiler wasn't smart enough to cull the redundant move. One less instruction. :)
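As a sketch of that substitution (the clampd name is illustrative; the _sd intrinsics are the direct SSE2 counterparts of the _ss ones used above):
#include <emmintrin.h>

double clampd ( double val, double minval, double maxval )
{
    // Branchless SSE2 clamp, mirroring the float version above.
    _mm_store_sd( &val, _mm_min_sd( _mm_max_sd(_mm_set_sd(val),_mm_set_sd(minval)), _mm_set_sd(maxval) ) );
    return val;
}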

If your processor has a fast instruction for absolute value (as the x86 does), you can do a branchless min and max which will be faster than an if statement or ternary operation.
min(a,b) = (a + b - abs(a-b)) / 2
max(a,b) = (a + b + abs(a-b)) / 2
If one of the terms is zero (as is often the case when you're clamping) the code simplifies a bit further:
max(a,0) = (a + abs(a)) / 2
When you combine both operations you can replace the two /2 steps with a single /4 or *0.25 to save a step.
The following code is over 3× faster than the ternary version on my Athlon II X2, when using the FMIN=0 optimization.
double clamp(double value)
{
    double temp = value + FMAX - fabs(value - FMAX);
#if FMIN == 0
    return (temp + fabs(temp)) * 0.25;
#else
    return (temp + (2.0*FMIN) + fabs(temp - (2.0*FMIN))) * 0.25;
#endif
}
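A quick sanity check of the FMIN == 0 path (a sketch; clamp0 and the FMAX value are illustrative, not from the original answer):
#include <assert.h>
#include <math.h>

#define FMAX 10.0

static double clamp0(double value)            /* FMIN == 0 path */
{
    double temp = value + FMAX - fabs(value - FMAX);
    return (temp + fabs(temp)) * 0.25;
}

int main(void)
{
    assert(clamp0(-5.0) == 0.0);   /* temp = -10 -> 0  */
    assert(clamp0( 5.0) == 5.0);   /* temp =  10 -> 5  */
    assert(clamp0(15.0) == 10.0);  /* temp =  20 -> 10 */
    return 0;
}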

The ternary operator is really the way to go, because most compilers can compile it into a native hardware operation that uses a conditional move instead of a branch (and thus avoids the mispredict penalty, pipeline bubbles and so on). Bit manipulation is likely to cause a load-hit-store.
In particular, PPC and x86 with SSE2 have a hardware op that could be expressed as an intrinsic something like this:
double fsel( double a, double b, double c ) {
    return a >= 0 ? b : c;
}
The advantage is that it does this inside the pipeline, without causing a branch. In fact, if your compiler uses the intrinsic, you can use it to implement your clamp directly:
inline double clamp ( double a, double min, double max )
{
    a = fsel( a - min, a, min );
    return fsel( a - max, max, a );
}
I strongly suggest you avoid bit-manipulation of doubles using integer operations. On most modern CPUs there is no direct means of moving data between double and int registers other than by taking a round trip to the dcache. This will cause a data hazard called a load-hit-store which basically empties out the CPU pipeline until the memory write has completed (usually around 40 cycles or so).
The exception to this is if the double values are already in memory and not in a register: in that case there is no danger of a load-hit-store. However your example indicates you've just calculated the double and returned it from a function which means it's likely to still be in XMM1.

For the 16.16 representation, the simple ternary is unlikely to be bettered speed-wise.
As for doubles, since you need standard/portable C, bit-fiddling of any kind will end badly.
Even if a bit-fiddle were possible (which I doubt), you'd be relying on the binary representation of doubles, and THIS (and their size) IS IMPLEMENTATION-DEPENDENT.
Possibly you could "guess" this using sizeof(double) and then comparing the layout of various double values against their common binary representations, but I think you're on a hiding to nothing.
The best rule is TELL THE COMPILER WHAT YOU WANT (ie ternary), and let it optimise for you.
EDIT: Humble pie time. I just tested quinmars' idea (below), and it works - if you have IEEE-754 floats. This gave a speedup of about 20% on the code below. Obviously non-portable, but I think there may be a standardised way of asking your compiler if it uses the IEEE-754 float format with an #if...?
double FMIN = 3.13;
double FMAX = 300.44;
double FVAL[10] = {-100, 0.23, 1.24, 3.00, 3.5, 30.5, 50 ,100.22 ,200.22, 30000};
uint64 Lfmin = *(uint64 *)&FMIN;
uint64 Lfmax = *(uint64 *)&FMAX;
DWORD start = GetTickCount();
for (int j=0; j<10000000; ++j)
{
    uint64 * pfvalue = (uint64 *)&FVAL[0];
    for (int i=0; i<10; ++i, ++pfvalue)
        *pfvalue = (*pfvalue < Lfmin) ? Lfmin : (*pfvalue > Lfmax) ? Lfmax : *pfvalue;
}
volatile DWORD hacktime = GetTickCount() - start;
for (int j=0; j<10000000; ++j)
{
    double * pfvalue = &FVAL[0];
    for (int i=0; i<10; ++i, ++pfvalue)
        *pfvalue = (*pfvalue < FMIN) ? FMIN : (*pfvalue > FMAX) ? FMAX : *pfvalue;
}
volatile DWORD normaltime = GetTickCount() - (start + hacktime);

The bits of an IEEE 754 float are ordered in such a way that if you compare the bits interpreted as an integer, you get the same result as if you compared the values as floats directly (this is the sign-magnitude ordering quoted from Kahan in a later answer; on two's-complement hardware it holds directly only for non-negative values). So if you find or know a way to clamp integers, you can use it for (IEEE 754) floats as well. Sorry, I don't know a faster way.
If you have the floats stored in an array, you can consider using some CPU extensions like SSE3, as rkj said. You can take a look at liboil; it does all the dirty work for you, keeps your program portable, and uses faster CPU instructions if possible. (I'm not sure though how OS/compiler-independent liboil is.)

Rather than testing and branching, I normally use this format for clamping:
clampedA = fmin(fmax(a,MY_MIN),MY_MAX);
Although I have never done any performance analysis on the compiled code.

Realistically, no decent compiler will generate different code for an if() statement and a ?: expression. The code is simple enough that they'll be able to spot the possible paths. That said, your two examples are not identical. The equivalent code using ?: would be
a = (a > MAX) ? MAX : ((a < MIN) ? MIN : a);
as that avoids the a < MIN test when a > MAX. Now that could make a difference, as otherwise the compiler would have to spot the relation between the two tests.
If clamping is rare, you can test the need to clamp with a single test:
if (fabs(a - (MAX+MIN)/2) > ((MAX-MIN)/2)) ...
E.g. with MIN=6 and MAX=10, this first shifts a down by 8, then checks whether the result lies outside [-2, +2]; only then is a clamp needed. Whether this saves anything depends a lot on the relative cost of branching.
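Spelled out as a sketch (assuming MY_MIN/MY_MAX are double-valued macros as in the question):
#include <math.h>

double clamp_rare(double a)
{
    /* One test covers the common in-range case; branch only when clamping. */
    if (fabs(a - (MY_MAX + MY_MIN) * 0.5) > (MY_MAX - MY_MIN) * 0.5)
        a = (a < MY_MIN) ? MY_MIN : MY_MAX;
    return a;
}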

Here's a possibly faster implementation similar to @Roddy's answer:
typedef int64_t i_t;
typedef double f_t;

static inline
i_t i_tmin(i_t x, i_t y) {
    return (y + ((x - y) & -(x < y))); // min(x, y)
}

static inline
i_t i_tmax(i_t x, i_t y) {
    return (x - ((x - y) & -(x < y))); // max(x, y)
}

f_t clip_f_t(f_t f, f_t fmin, f_t fmax)
{
#ifndef TERNARY
    assert(sizeof(i_t) == sizeof(f_t));
    //assert(not (fmin < 0 and (f < 0 or is_negative_zero(f))));
    //XXX assume IEEE-754 compliant system (lexicographically ordered floats)
    //XXX break strict-aliasing rules
    const i_t imin = *(i_t*)&fmin;
    const i_t imax = *(i_t*)&fmax;
    const i_t i = *(i_t*)&f;
    const i_t iclipped = i_tmin(imax, i_tmax(i, imin));
#ifndef INT_TERNARY
    return *(f_t *)&iclipped;
#else  /* INT_TERNARY */
    return i < imin ? fmin : (i > imax ? fmax : f);
#endif /* INT_TERNARY */
#else  /* TERNARY */
    return fmin > f ? fmin : (fmax < f ? fmax : f);
#endif /* TERNARY */
}
See Compute the minimum (min) or maximum (max) of two integers without branching and Comparing floating point numbers
The IEEE float and double formats were designed so that the numbers are "lexicographically ordered", which, in the words of IEEE architect William Kahan, means "if two floating-point numbers in the same format are ordered (say x < y), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers."
A test program:
/** gcc -std=c99 -fno-strict-aliasing -O2 -lm -Wall *.c -o clip_double && clip_double */
#include <assert.h>
#include <iso646.h>  // not, and
#include <math.h>    // isnan()
#include <stdbool.h> // bool
#include <stdint.h>  // int64_t
#include <stdio.h>

static
bool is_negative_zero(f_t x)
{
    return x == 0 and 1/x < 0;
}

static inline
f_t range(f_t low, f_t f, f_t hi)
{
    return fmax(low, fmin(f, hi));
}

static const f_t END = 0./0.;

#define TOSTR(f, fmin, fmax, ff) ((f) == (fmin) ? "min" :  \
                                  ((f) == (fmax) ? "max" : \
                                   (is_negative_zero(ff) ? "-0.": \
                                    ((f) == (ff) ? "f" : #f))))

static int test(f_t p[], f_t fmin, f_t fmax, f_t (*fun)(f_t, f_t, f_t))
{
    assert(isnan(END));
    int failed_count = 0;
    for ( ; ; ++p) {
        const f_t clipped = fun(*p, fmin, fmax), expected = range(fmin, *p, fmax);
        if (clipped != expected and not (isnan(clipped) and isnan(expected))) {
            failed_count++;
            fprintf(stderr, "error: got: %s, expected: %s\t(min=%g, max=%g, f=%g)\n",
                    TOSTR(clipped, fmin, fmax, *p),
                    TOSTR(expected, fmin, fmax, *p), fmin, fmax, *p);
        }
        if (isnan(*p))
            break;
    }
    return failed_count;
}

int main(void)
{
    int failed_count = 0;
    f_t arr[] = { -0., -1./0., 0., 1./0., 1., -1., 2,
                  2.1, -2.1, -0.1, END};
    f_t minmax[][2] = { -1, 1,  // min, max
                         0, 2, };
    for (int i = 0; i < (sizeof(minmax) / sizeof(*minmax)); ++i)
        failed_count += test(arr, minmax[i][0], minmax[i][1], clip_f_t);
    return failed_count & 0xFF;
}
In console:
$ gcc -std=c99 -fno-strict-aliasing -O2 -lm *.c -o clip_double && ./clip_double
It prints:
error: got: min, expected: -0. (min=-1, max=1, f=0)
error: got: f, expected: min (min=-1, max=1, f=-1.#INF)
error: got: f, expected: min (min=-1, max=1, f=-2.1)
error: got: min, expected: f (min=-1, max=1, f=-0.1)

I tried the SSE approach to this myself, and the assembly output looked quite a bit cleaner, so I was encouraged at first, but after timing it thousands of times, it was actually quite a bit slower. It does indeed look like the VC++ compiler isn't smart enough to know what you're really intending, and it appears to move things back and forth between the XMM registers and memory when it shouldn't. That said, I don't know why the compiler isn't smart enough to use the SSE min/max instructions on the ternary operator when it seems to use SSE instructions for all floating point calculations anyway. On the other hand, if you're compiling for PowerPC, you can use the fsel intrinsic on the FP registers, and it's way faster.

As pointed out above, the fmin/fmax functions work well (in gcc, with -ffast-math). Although gfortran has patterns to use IA instructions corresponding to max/min, g++ does not. In icc one must instead use std::min/max, because icc doesn't allow short-cutting the specification of how fmin/fmax behave with non-finite operands.

My 2 cents in C++. It's probably no different from using ternary operators, and hopefully no branching code is generated:
template <typename T>
inline T clamp(T val, T lo, T hi) {
    return std::max(lo, std::min(hi, val));
}

If I understand properly, you want to limit a value "a" to a range between MY_MIN and MY_MAX. The type of "a" is a double. You did not specify the type of MY_MIN or MY_MAX.
The simple expression:
clampedA = (a > MY_MAX)? MY_MAX : (a < MY_MIN)? MY_MIN : a;
should do the trick.
I think there may be a small optimization to be made if MY_MAX and MY_MIN happen to be integers:
int b = (int)a;
clampedA = (b > MY_MAX)? (double)MY_MAX : (b < MY_MIN)? (double)MY_MIN : a;
By changing to integer comparisons, it is possible you might get a slight speed advantage.

If you want to use fast absolute value instructions, check out this snippet of code I found in minicomputer, which clamps a float to the range [0,1]:
clamped = 0.5*(fabs(x)-fabs(x-1.0f) + 1.0f);
(I simplified the code a bit.) We can think of it as taking two values: one reflected to be >= 0,
fabs(x)
and the other reflected about 1.0 to be <= 1.0,
1.0 - fabs(x - 1.0)
and taking the average of them. If x is in range, both values equal x, so their average is again x. If x is out of range, one of the values is still x and the other is x flipped over the "boundary" point, so their average is precisely the boundary point.
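The same trick generalizes to an arbitrary range [lo, hi] (a sketch, not from the original answer; the same reflection argument applies at both ends):
#include <math.h>

/* Clamp x to [lo, hi] with two fabs calls and no branches:
   in range both reflections equal x; out of range the average
   lands exactly on the nearer boundary. */
double clamp_range(double x, double lo, double hi)
{
    return 0.5 * (fabs(x - lo) - fabs(x - hi) + lo + hi);
}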

Related

Dividing two integers and rounding up the result, without using floating point

I need to divide two numbers and round the result up. Is there a better way to do this?
int myValue = (int) ceil( (float)myIntNumber / myOtherInt );
I find it overkill to have to cast twice (the outer int cast is just to silence the warning).
Note that I have to cast to float internally, otherwise the division truncates:
int a = ceil(256/11); // Should be 24, but it is 23
Assuming that both myIntNumber and myOtherInt are positive, you could do:
int myValue = (myIntNumber + myOtherInt - 1) / myOtherInt;
For example, (256 + 11 - 1) / 11 == 266 / 11 == 24. (Note that myIntNumber + myOtherInt - 1 can overflow for values near the type's maximum.)
With help from DyP, I came up with the following branchless formula:
int idiv_ceil ( int numerator, int denominator )
{
return numerator / denominator
+ (((numerator < 0) ^ (denominator > 0)) && (numerator%denominator));
}
It avoids floating-point conversions and passes a basic suite of unit tests, as shown here:
http://ideone.com/3OrviU
Here's another version that avoids the modulo operator.
int idiv_ceil ( int numerator, int denominator )
{
int truncated = numerator / denominator;
return truncated + (((numerator < 0) ^ (denominator > 0)) &&
(numerator - truncated*denominator));
}
http://ideone.com/Z41G5q
The first one will be faster on processors where IDIV returns both quotient and remainder (and the compiler is smart enough to use that).
Maybe it is just easier to do:
int result = dividend / divisor;
if (dividend % divisor != 0)
    result++;
(This is correct for positive operands; for a negative quotient the truncation already rounds up, so the increment must be skipped.)
Benchmarks
Since a lot of different methods are shown in the answers and none of the answers actually proves an advantage in terms of performance, I tried to benchmark them myself. My plan was to write an answer containing a short table and a definite answer as to which method is the fastest.
Unfortunately it wasn't that simple. (It never is.) It seems that the performance of the rounding formulas depends on the data type used, the compiler, and the optimization level.
In one case there is a speed increase of 7.5× from one method to another. So the impact can be significant for some people.
TL;DR
For long integers the naive version using a type cast to float and std::ceil was actually the fastest. This was interesting for me personally, since I intended to use it with size_t, which is usually defined as unsigned long.
For ordinary ints it depends on your optimization level. For lower levels @Jwodder's solution performs best. For the highest levels std::ceil was optimal, with one exception: for the clang/unsigned int combination @Jwodder's was still better.
The solutions from the accepted answer never really outperformed the other two. You should keep in mind, however, that @Jwodder's solution doesn't work with negatives.
Results are at the bottom.
Implementations
To recap here are the four methods I benchmarked and compared:
@Jwodder's version (Jwodder)
template<typename T>
inline T divCeilJwodder(const T& numerator, const T& denominator)
{
    return (numerator + denominator - 1) / denominator;
}
@Ben Voigt's version using modulo (VoigtModulo)
template<typename T>
inline T divCeilVoigtModulo(const T& numerator, const T& denominator)
{
    return numerator / denominator + (((numerator < 0) ^ (denominator > 0))
                                      && (numerator % denominator));
}
@Ben Voigt's version without using modulo (VoigtNoModulo)
template<typename T>
inline T divCeilVoigtNoModulo(const T& numerator, const T& denominator)
{
    T truncated = numerator / denominator;
    return truncated + (((numerator < 0) ^ (denominator > 0))
                        && (numerator - truncated * denominator));
}
OP's implementation (TypeCast)
template<typename T>
inline T divCeilTypeCast(const T& numerator, const T& denominator)
{
    return (int)std::ceil((double)numerator / denominator);
}
Methodology
In a single batch the division is performed 100 million times. Ten batches are calculated for each combination of Compiler/Optimization level, used data type and used implementation. The values shown below are the averages of all 10 batches in milliseconds. The errors that are given are standard deviations.
The whole source code that was used can be found here. Also you might find this script useful which compiles and executes the source with different compiler flags.
The whole benchmark was performed on an i7-7700K. The compiler versions used were GCC 10.2.0 and clang 11.0.1.
Results
Now without further ado here are the results:
All times are in ms (mean of 10 batches ± standard deviation); 🏆 marks the fastest method(s) in each compiler/flag column.

unsigned

| Algorithm | GCC -O0 | -O1 | -O2 | -O3 | -Os | -Ofast | -Og | clang -O0 | -O1 | -O2 | -O3 | -Ofast | -Os | -Oz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jwodder | 264.1 ± 0.9 🏆 | 175.2 ± 0.9 🏆 | 153.5 ± 0.7 🏆 | 175.2 ± 0.5 🏆 | 153.3 ± 0.5 | 153.4 ± 0.8 | 175.5 ± 0.6 🏆 | 329.4 ± 1.3 🏆 | 220.0 ± 1.3 🏆 | 146.2 ± 0.6 🏆 | 146.2 ± 0.6 🏆 | 146.0 ± 0.5 🏆 | 153.2 ± 0.3 🏆 | 153.5 ± 0.6 🏆 |
| VoigtModulo | 528.5 ± 2.5 | 306.5 ± 1.0 | 175.8 ± 0.7 | 175.2 ± 0.5 🏆 | 175.6 ± 0.7 | 175.4 ± 0.6 | 352.0 ± 1.0 | 588.9 ± 6.4 | 408.7 ± 1.5 | 164.8 ± 1.0 | 164.0 ± 0.4 | 164.1 ± 0.4 | 175.2 ± 0.5 | 175.8 ± 0.9 |
| VoigtNoModulo | 375.3 ± 1.5 | 175.7 ± 1.3 🏆 | 192.5 ± 1.4 | 197.6 ± 1.9 | 200.6 ± 7.2 | 176.1 ± 1.5 | 197.9 ± 0.5 | 541.0 ± 1.8 | 263.1 ± 0.9 | 186.4 ± 0.6 | 186.4 ± 1.2 | 187.2 ± 1.1 | 197.2 ± 0.8 | 197.1 ± 0.7 |
| TypeCast | 348.5 ± 2.7 | 231.9 ± 3.9 | 234.4 ± 1.3 | 226.6 ± 1.0 | 137.5 ± 0.8 🏆 | 138.7 ± 1.7 🏆 | 243.8 ± 1.4 | 591.2 ± 2.4 | 591.3 ± 2.6 | 155.8 ± 1.9 | 155.9 ± 1.6 | 155.9 ± 2.4 | 214.6 ± 1.9 | 213.6 ± 1.1 |

unsigned long

| Algorithm | GCC -O0 | -O1 | -O2 | -O3 | -Os | -Ofast | -Og | clang -O0 | -O1 | -O2 | -O3 | -Ofast | -Os | -Oz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jwodder | 658.6 ± 2.5 | 546.3 ± 0.9 | 549.3 ± 1.8 | 549.1 ± 2.8 | 540.6 ± 3.4 | 548.8 ± 1.3 | 486.1 ± 1.1 | 638.1 ± 1.8 | 613.3 ± 2.1 | 190.0 ± 0.8 🏆 | 182.7 ± 0.5 | 182.4 ± 0.5 | 496.2 ± 1.3 | 554.1 ± 1.0 |
| VoigtModulo | 1,169.0 ± 2.9 | 1,015.9 ± 4.4 | 550.8 ± 2.0 | 504.0 ± 1.4 | 550.3 ± 1.2 | 550.5 ± 1.3 | 1,020.1 ± 2.9 | 1,259.0 ± 9.0 | 1,136.5 ± 4.2 | 187.0 ± 3.4 🏆 | 199.7 ± 6.1 | 197.6 ± 1.0 | 549.4 ± 1.7 | 506.8 ± 4.4 |
| VoigtNoModulo | 768.1 ± 1.7 | 559.1 ± 1.8 | 534.4 ± 1.6 | 533.7 ± 1.5 | 559.5 ± 1.7 | 534.3 ± 1.5 | 571.5 ± 1.3 | 879.5 ± 10.8 | 617.8 ± 2.1 | 223.4 ± 1.3 | 231.3 ± 1.3 | 231.4 ± 1.1 | 594.6 ± 1.9 | 572.2 ± 0.8 |
| TypeCast | 353.3 ± 2.5 🏆 | 267.5 ± 1.7 🏆 | 248.0 ± 1.6 🏆 | 243.8 ± 1.2 🏆 | 154.2 ± 0.8 🏆 | 154.1 ± 1.0 🏆 | 263.8 ± 1.8 🏆 | 365.5 ± 1.6 🏆 | 316.9 ± 1.8 🏆 | 189.7 ± 2.1 🏆 | 156.3 ± 1.8 🏆 | 157.0 ± 2.2 🏆 | 155.1 ± 0.9 🏆 | 176.5 ± 1.2 🏆 |

int

| Algorithm | GCC -O0 | -O1 | -O2 | -O3 | -Os | -Ofast | -Og | clang -O0 | -O1 | -O2 | -O3 | -Ofast | -Os | -Oz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jwodder | 307.9 ± 1.3 🏆 | 175.4 ± 0.9 🏆 | 175.3 ± 0.5 🏆 | 175.4 ± 0.6 🏆 | 175.2 ± 0.5 | 175.1 ± 0.6 | 175.1 ± 0.5 🏆 | 307.4 ± 1.2 🏆 | 219.6 ± 0.6 🏆 | 146.0 ± 0.3 🏆 | 153.5 ± 0.5 | 153.6 ± 0.8 | 175.4 ± 0.7 🏆 | 175.2 ± 0.5 🏆 |
| VoigtModulo | 528.5 ± 1.9 | 351.9 ± 4.6 | 175.3 ± 0.6 🏆 | 175.2 ± 0.4 🏆 | 197.1 ± 0.6 | 175.2 ± 0.8 | 373.5 ± 1.1 | 598.7 ± 5.1 | 460.6 ± 1.3 | 175.4 ± 0.4 | 164.3 ± 0.9 | 164.0 ± 0.4 | 176.3 ± 1.6 🏆 | 460.0 ± 0.8 |
| VoigtNoModulo | 398.0 ± 2.5 | 241.0 ± 0.7 | 199.4 ± 5.1 | 219.2 ± 1.0 | 175.9 ± 1.2 | 197.7 ± 1.2 | 242.9 ± 3.0 | 543.5 ± 2.3 | 350.6 ± 1.0 | 186.6 ± 1.2 | 185.7 ± 0.3 | 186.3 ± 1.1 | 197.1 ± 0.6 | 373.3 ± 1.6 |
| TypeCast | 338.8 ± 4.9 | 228.1 ± 0.9 | 230.3 ± 0.8 | 229.5 ± 9.4 | 153.8 ± 0.4 🏆 | 138.3 ± 1.0 🏆 | 241.1 ± 1.1 | 590.0 ± 2.1 | 589.9 ± 0.8 | 155.2 ± 2.4 | 149.4 ± 1.6 🏆 | 148.4 ± 1.2 🏆 | 214.6 ± 2.2 | 211.7 ± 1.6 |

long

| Algorithm | GCC -O0 | -O1 | -O2 | -O3 | -Os | -Ofast | -Og | clang -O0 | -O1 | -O2 | -O3 | -Ofast | -Os | -Oz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jwodder | 758.1 ± 1.8 | 600.6 ± 0.9 | 601.5 ± 2.2 | 601.5 ± 2.8 | 581.2 ± 1.9 | 600.6 ± 1.8 | 586.3 ± 3.6 | 745.9 ± 3.6 | 685.8 ± 2.2 | 183.1 ± 1.0 | 182.5 ± 0.5 | 182.6 ± 0.6 | 553.2 ± 1.5 | 488.0 ± 0.8 |
| VoigtModulo | 1,360.8 ± 6.1 | 1,202.0 ± 2.1 | 600.0 ± 2.4 | 600.0 ± 3.0 | 607.0 ± 6.8 | 599.0 ± 1.6 | 1,187.2 ± 2.6 | 1,439.6 ± 6.7 | 1,346.5 ± 2.9 | 197.9 ± 0.7 | 208.2 ± 0.6 | 208.0 ± 0.4 | 548.9 ± 1.4 | 1,326.4 ± 3.0 |
| VoigtNoModulo | 844.5 ± 6.9 | 647.3 ± 1.3 | 628.9 ± 1.8 | 627.9 ± 1.6 | 629.1 ± 2.4 | 629.6 ± 4.4 | 668.2 ± 2.7 | 1,019.5 ± 3.2 | 715.1 ± 8.2 | 224.3 ± 4.8 | 219.0 ± 1.0 | 219.0 ± 0.6 | 561.7 ± 2.5 | 769.4 ± 9.3 |
| TypeCast | 366.1 ± 0.8 🏆 | 246.2 ± 1.1 🏆 | 245.3 ± 1.8 🏆 | 244.6 ± 1.1 🏆 | 154.6 ± 1.6 🏆 | 154.3 ± 0.5 🏆 | 257.4 ± 1.5 🏆 | 591.8 ± 4.1 🏆 | 590.4 ± 1.3 🏆 | 154.5 ± 1.3 🏆 | 135.4 ± 8.3 🏆 | 132.9 ± 0.7 🏆 | 132.8 ± 1.2 🏆 | 177.4 ± 2.3 🏆 |
Now I can finally get on with my life :P
Integer division with round-up.
Only 1 division executed per call, no %, no *, and no conversion to/from floatingating point; works for positive and negative int. See note (1).
n (numerator) = OP's myIntNumber;
d (denominator) = OP's myOtherInt;
The following approach is simple: int division rounds toward 0. For negative quotients this is already a round up, so nothing special is needed. For positive quotients, add d-1 to effect a round up, then perform an unsigned division.
Note (1): The usual divide-by-0 blows things up, and MININT/-1 fails as expected on 2's complement machines.
int IntDivRoundUp(int n, int d) {
    // If n and d are the same sign ...
    if ((n < 0) == (d < 0)) {
        // If n (and d) are negative ...
        if (n < 0) {
            n = -n;
            d = -d;
        }
        // Unsigned division rounds down. Adding d-1 to n effects a round up.
        return (((unsigned) n) + ((unsigned) d) - 1) / ((unsigned) d);
    }
    else {
        return n / d;
    }
}
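A few spot checks of the sign handling (a sketch, not from the original answer; links against the definition above):
#include <assert.h>

int IntDivRoundUp(int n, int d); /* definition as above */

int main(void)
{
    assert(IntDivRoundUp( 7,  2) ==  4);  /* ceil( 3.5) */
    assert(IntDivRoundUp(-7,  2) == -3);  /* ceil(-3.5): truncation already rounds up */
    assert(IntDivRoundUp( 7, -2) == -3);  /* ceil(-3.5) */
    assert(IntDivRoundUp(-7, -2) ==  4);  /* ceil( 3.5) */
    return 0;
}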
[Edit: test code removed, see earlier rev as needed]
Just use
int ceil_of_division = ((dividend-1)/divisor)+1;
(valid for dividend >= 1; for dividend == 0 it yields 1 rather than 0). For example:
for (int i=1; i<20; i++)
    std::cout << i << "/8 = " << ((i-1)/8)+1 << std::endl;
A small hack is to do:
int divideUp(int a, int b) {
    return (a - 1) / b + 1;
}
// Proof (assuming a >= 1, b >= 1):
// a = b*N + k  (0 <= k < b)
// if k == 0, then
//     (a-1) == b*N - 1
//     (a-1)/b == N - 1
//     (a-1)/b + 1 == N      ---> Good!
// if k > 0, then
//     (a-1) == b*N + (k-1),  with 0 <= k-1 < b
//     (a-1)/b == N
//     (a-1)/b + 1 == N + 1  ---> Good!
Instead of using the ceil function before casting to int, you can add a constant which is very nearly (but not quite) equal to 1; this way, nearly anything (except a value which is exactly or incredibly close to an actual integer) will be increased by one before it is truncated.
Example:
#define EPSILON (0.9999)
int myValue = (int)(((float)myIntNumber)/myOtherInt + EPSILON);
EDIT: after seeing your response to the other post, I want to clarify that this will round up, not away from zero - negative numbers will become less negative, and positive numbers will become more positive.

Optimization Headache - removing if's from Look Up Table

I'm trying to optimize the following piece of code, which is a bottleneck in my application.
What it does: it takes the double values value1 and value2 and computes the maximum, including a correction factor. If the difference between the two values is greater than 5.0 (the LUT is scaled by a factor of 10), the correction is negligible and I can just take the max of the two. If the difference is smaller than 5.0, I use the correction factor from the LUT.
Does anyone have an idea what could be a better style for this piece of code? I don't know where I'm losing time - is it the large number of ifs or the multiplication by 10?
double value1, value2;
// Lookup Table scaled by 10 for (ln(1+exp(-abs(x)))), which is almost 0 for x > 5 and symmetrical around 0. LUT[0] is x=0.0, LUT[40] is x=4.0.
const logValue LUT[50] = { ... }
if (value1 > value2)
{
    if (value1 - value2 >= 5.0)
    {
        return value1;
    }
    else
    {
        return value1 + LUT[(uint8)((value1 - value2) * 10)];
    }
}
else
{
    if (value2 - value1 >= 5.0)
    {
        return value2;
    }
    else
    {
        return value2 + LUT[(uint8)((value2 - value1) * 10)];
    }
}
A couple of minutes of playing with Excel produces an approximation equation that looks like it might have the accuracy you need, so you can do away with the lookup table altogether. You still need one condition to make sure the parameters to the equation remain within the range that it was optimized for.
double diff = fabs(value1 - value2);
double dmax = (value1 + value2 + diff) * 0.5; // same as (min+max+(max-min))/2
if (diff > 5.0)
    return dmax;
return dmax + 4.473865638/(2.611112371+diff) + 0.088190879*diff - 1.015046114;
P.S. I'm not guaranteeing this is faster, only that it's a different enough approach to be worth benchmarking.
P.P.S. It's possible to change the constraints to come up with slightly different constants; there are a lot of variations. Here's another set I did where the difference between your table and the formula will always be less than 0.008, and each value will also be less than the one preceding.
return dmax + 3.441318133/(2.296924445+diff) + 0.065529678*diff - 0.797081529;
Edit: I tested this code (the second formula) with 100 passes against a million random numbers between 0 and 10, along with the original code from the question, MSalters' currently accepted answer, and a brute force implementation max(value1,value2)+log(1.0+exp(-abs(value1-value2))). I tried it on a dual-core AMD Athlon and an Intel quad-core i7, and the results were roughly consistent. Here's a typical run:
Original: 1.32 seconds.
MSalters: 1.13 seconds.
Mine: 0.67 seconds.
Brute force: 4.50 seconds.
Processors have gotten unbelievably fast over the years, so fast now that they can do a couple of floating point multiplies and divides faster than they can look up a value in memory. Not only is this approach faster on a modern x86, it's also more accurate; the approximation errors in the equation are much less than the step errors caused by truncating the input for the lookup.
Naturally results can still vary based on your processor and compiler; benchmarking is still required for your own particular target.
It probably goes down both paths equally enough that you're causing a lot of pipelining problems for your processor.
Have you tried profiling?
I'd also suggest trying to use the standard library and see if that helps (for example if it's able to use and processor-specific instructions):
double diff = std::fabs(value1 - value2);
double maxv = std::max(value1, value2);
return (diff >= 5.0) ? maxv : maxv + LUT[(uint8)((diff) * 10)];
I'd probably have written the code a bit different to handle the value2<value1 case:
if (value2 < value1) std::swap(value1, value2);
assert(value1 <= value2); // Assertion corrected
int diff = int((value2 - value1) * 10.0);
if (diff >= 50) diff = 49; // Integer comparison instead of floating point
return value2 + LUT[diff];
I am going to assume that when the function is called, you'll most likely hit the part where you have to use the lookup table, rather than the >= 5.0 parts. In this case, it's better to guide the compiler towards this.
double maxval = value1;
double difference_scaled = (value1 - value2) * 10;
if (difference_scaled < 0)
{
    difference_scaled = -difference_scaled;
    maxval = value2;
}
if (difference_scaled < 50)
    return maxval + LUT[(int)difference_scaled];
else
    return maxval;
Try this and let me know if that improves your program's performance.
The only reason this code would be the bottleneck in your application is because you are calling it many many times. Are you sure you need to? Perhaps the algorithm higher up in the code can be changed to use the comparison less?
I've done some very quick tests, but please profile the code yourself to verify the effect.
Changing the LUT[] to a static variable got me a 600% speed up (from 3.5s to 0.6s), which is close to the absolute minimum of the test I used (0.4s). See if that works and re-profile to determine if any further optimization is needed.
For reference, I was simply timing the execution of this loop (100 million iterations of the inner loop) in VC++ 2010:
int Counter = 0;
for (double j = 0; j < 10; j += 0.001)
{
    for (double i = 0; i < 10; i += 0.001)
    {
        ++Counter;
        Value1 += TestFunc1(i, j);
    }
}
You are computing value1 - value2 quite a few times in your function. Just do it once.
That cast to a uint8_t may also be problematic. As far as performance is concerned, the best integral type to use for a conversion from double to integer is an int, and the best integral type to use as an array index is also an int.
max_value = value1;
diff = value1 - value2;
if (diff < 0.0) {
    max_value = value2;
    diff = -diff;
}
if (diff >= 5.0) {
    return max_value;
}
else {
    return max_value + LUT[(int)(diff * 10.0)];
}
Note that the above guarantees that the LUT index will be between 0 (inclusive) and 50 (exclusive). There's no need for a uint8_t here.
Edit
After some playing around with some variations, this is a fairly fast LUT-based approximation to log(exp(value1)+exp(value2)):
#include <stdint.h>

// intptr_t *happens* to be fastest on my machine. YMMV.
typedef intptr_t IndexType;

double log_sum_exp (double value1, double value2, double *LUT) {
    double diff = value1 - value2;
    if (diff < 0.0) {
        value1 = value2;
        diff = -diff;
    }
    IndexType idx = diff * 10.0;
    if (idx < 50) {
        value1 += LUT[idx];
    }
    return value1;
}
The integral type IndexType is one of the keys to speeding things up. I tested with clang and g++, and both indicated that casting to an intptr_t (long on my computer) and using a intptr_t as an index into the LUT is faster than other integral types. It is considerably faster than some types. For example, unsigned long long and uint8_t are incredibly poor choices on my computer.
The type is not just a hint, at least with the compilers I used. Those compilers did exactly what the code told it to do with regard to the conversion from floating point type to integral type, regardless of the optimization level.
Another speed bump results from comparing the integral type to 50 as opposed to comparing the floating point type to 5.0.
One last speed bump: Not all compilers are created equal. On my computer (YMMV), g++ -O3 generates considerably slower code (25% slower with this problem!) than does clang -O3, which in turn generates code that is a bit slower than that generated by clang -O4.
I also played with a rational function approximation approach (similar to Mark Ransom's answer), but the above obviously does not use such an approach.

How can I safely average two unsigned ints in C++?

Using integer math alone, I'd like to "safely" average two unsigned ints in C++.
What I mean by "safely" is avoiding overflows (and anything else that can be thought of).
For instance, averaging 200 and 5000 is easy:
unsigned int a = 200;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2600 as intended
But in the case of 4294967295 and 5000 then:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a + b) / 2; // Equals: 2499 instead of 2147486147
The best I've come up with is:
unsigned int a = 4294967295;
unsigned int b = 5000;
unsigned int average = (a / 2) + (b / 2); // Equals: 2147486147 as expected
Are there better ways?
Your last approach seems promising. You can improve on that by manually considering the lowest bits of a and b:
unsigned int average = (a / 2) + (b / 2) + (a & b & 1);
This gives the correct result when both a and b are odd.
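A quick sanity check of the corrected formula (a sketch, not from the original answer; reuses the 4294967295/5000 example from the question):
#include <cassert>
#include <limits>

int main() {
    unsigned a = std::numeric_limits<unsigned>::max();  // 4294967295 (odd)
    unsigned b = 5000;                                  // even
    assert((a / 2) + (b / 2) + (a & b & 1) == 2147486147u);  // no overflow
    assert((5u / 2) + (7u / 2) + (5u & 7u & 1) == 6u);       // both odd
    return 0;
}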
If you know ahead of time which one is higher, a very efficient way is possible. Otherwise you're better off using one of the other strategies, instead of conditionally swapping to use this.
unsigned int average = low + ((high - low) / 2);
Here's a related article: http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
Your method is not correct if both numbers are odd, e.g. 5 and 7: the average is 6, but method #3 returns 5.
Try this:
average = (a>>1) + (b>>1) + (a & b & 1)
with math operators only:
average = a/2 + b/2 + (a%2) * (b%2)
And the correct answer is...
(A&B)+((A^B)>>1)
This works because A + B == (A^B) + 2*(A&B): the XOR collects the bits that don't carry and the AND collects the bits that do. Halving the carry-free part and adding the carried part directly avoids ever forming the overflowing sum A + B.
If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.
Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm). But here we go anyway:
// works for 64-bit long as well on x86-64, and doesn't depend on calling convention
unsigned average(unsigned x, unsigned y)
{
    unsigned result;
    asm("add %[x], %[res]\n\t"
        "rcr %[res]"
        : [res] "=r" (result)  // output
        : [y] "%0"(y),         // input: in the same reg as result's output. Commutative with next operand
          [x] "rme"(x)         // input: reg, mem, or immediate
        :                      // no clobbers. ("cc" is implicit on x86)
       );
    return result;
}
The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the case I tried, calling the function with y being a constant or pointer-deref (memory operand). Probably using a matching constraint for an output operand defeats that, since you can't use it with read-write operands.
As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version where we change the operands to unsigned long, with the same inline asm. clang3.9 makes a mess of it, though, and decides to use the "m" option for the "rme" constraint, so it stores to memory and uses a memory operand.
RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2 cycle latency. It's great on AMD CPUs, where RCR has single-cycle latency. (Source: Agner Fog's instruction tables, see also the x86 tag wiki for x86 performance links). It's still better than @sellibitze's version, but worse than @Sheldon's order-dependent version. (See code on Godbolt)
But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better in that case.
What you have is fine, with the minor detail that it will claim that the average of 3 and 3 is 2. I'm guessing that you don't want that; fortunately, there's an easy fix:
unsigned int average = a/2 + b/2 + (a & b & 1);
This just bumps the average back up in the case that both divisions were truncated.
In C++20, you can use std::midpoint:
template <class T>
constexpr T midpoint(T a, T b) noexcept;
The paper P0811R3 that introduced std::midpoint recommended this snippet (slightly adapted to work with C++11):
#include <type_traits>

template <typename Integer>
constexpr Integer midpoint(Integer a, Integer b) noexcept {
    using U = typename std::make_unsigned<Integer>::type;
    return a > b ? a - (U(a) - b) / 2 : a + (U(b) - a) / 2;
}
For completeness, here is the C++20 implementation from the paper (unmodified apart from restoring the template head):
template <class Integer>
constexpr Integer midpoint(Integer a, Integer b) noexcept {
    using U = make_unsigned_t<Integer>;
    return a > b ? a - (U(a) - b) / 2 : a + (U(b) - a) / 2;
}
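A minimal usage sketch (assuming a C++20 compiler; std::midpoint lives in <numeric>):
#include <numeric>   // std::midpoint (C++20)
#include <limits>
#include <iostream>

int main() {
    unsigned a = std::numeric_limits<unsigned>::max();
    unsigned b = 5000;
    // Exact average is 2147486147.5; std::midpoint rounds toward the
    // first argument, so this prints 2147486148.
    std::cout << std::midpoint(a, b) << '\n';
    return 0;
}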
If the code is for an embedded micro, and if speed is critical, assembly language may be helpful. On many microcontrollers, the result of the add would naturally go into the carry flag, and instructions exist to shift it back into a register. On an ARM, the average operation (source and dest. in registers) could be done in two instructions; any C-language equivalent would likely yield at least 5, and probably a fair bit more than that.
Incidentally, on machines with shorter word sizes, the differences can be even more substantial. On an 8-bit PIC-18 series, averaging two 32-bit numbers would take twelve instructions. Doing the shifts, add, and correction, would take 5 instructions for each shift, eight for the add, and eight for the correction, so 26 (not quite a 2.5x difference, but probably more significant in absolute terms).
int[] array = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
decimal avg = 0;
for (int i = 0; i < array.Length; i++) {
    avg = (array[i] - avg) / (i + 1) + avg;
}
expects avg == 5.0 for this test
(((a&b) << 1) + (a^b)) >> 1 is also a nice way (note the parentheses around a&b: << binds tighter than &). Unlike (A&B)+((A^B)>>1) above, though, the left shift can still overflow.
Courtesy: http://www.ragestorm.net/blogs/?p=29

round() for float in C++

I need a simple floating point rounding function, thus:
double round(double);
round(0.1) = 0
round(-0.1) = 0
round(-0.9) = -1
I can find ceil() and floor() in the math.h - but not round().
Is it present in the standard C++ library under another name, or is it missing??
Editor's Note: The following answer provides a simplistic solution that contains several implementation flaws (see Shafik Yaghmour's answer for a full explanation). Note that C++11 includes std::round, std::lround, and std::llround as builtins already.
There's no round() in the C++98 standard library. You can write one yourself though. The following is an implementation of round-half-up:
double round(double d)
{
    return floor(d + 0.5);
}
The probable reason there is no round function in the C++98 standard library is that it can in fact be implemented in different ways. The above is one common way but there are others such as round-to-even, which is less biased and generally better if you're going to do a lot of rounding; it's a bit more complex to implement though.
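For illustration, a sketch of round-half-to-even built on floor (not from the original answer; note the halfway test relies on d - floor(d) being exact, which holds for finite IEEE-754 doubles):
#include <cmath>

// Sketch: round-half-to-even (banker's rounding) for doubles.
double round_half_even(double d)
{
    double f = std::floor(d);
    double frac = d - f;
    if (frac > 0.5) return f + 1.0;
    if (frac < 0.5) return f;
    // Exactly halfway: choose the even neighbour.
    return (std::fmod(f, 2.0) == 0.0) ? f : f + 1.0;
}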
The C++03 standard relies on the C90 standard for what the standard calls the Standard C Library which is covered in the draft C++03 standard (closest publicly available draft standard to C++03 is N1804) section 1.2 Normative references:
The library described in clause 7 of ISO/IEC 9899:1990 and clause 7 of ISO/IEC 9899/Amd.1:1995 is hereinafter called the Standard C Library.1)
If we go to the C documentation for round, lround, llround on cppreference we can see that round and related functions are part of C99 and thus won't be available in C++03 or prior.
In C++11 this changes since C++11 relies on the C99 draft standard for C standard library and therefore provides std::round and for integral return types std::lround, std::llround :
#include <iostream>
#include <cmath>

int main()
{
    std::cout << std::round( 0.4 ) << " " << std::lround( 0.4 ) << " " << std::llround( 0.4 ) << std::endl;
    std::cout << std::round( 0.5 ) << " " << std::lround( 0.5 ) << " " << std::llround( 0.5 ) << std::endl;
    std::cout << std::round( 0.6 ) << " " << std::lround( 0.6 ) << " " << std::llround( 0.6 ) << std::endl;
}
Another option also from C99 would be std::trunc which:
Computes nearest integer not greater in magnitude than arg.
#include <iostream>
#include <cmath>

int main()
{
    std::cout << std::trunc( 0.4 ) << std::endl;
    std::cout << std::trunc( 0.9 ) << std::endl;
    std::cout << std::trunc( 1.1 ) << std::endl;
}
If you need to support non C++11 applications your best bet would be to use boost round, iround, lround, llround or boost trunc.
Rolling your own version of round is hard
Rolling your own is probably not worth the effort, as Harder than it looks: rounding float to nearest integer, part 1, Rounding float to nearest integer, part 2 and Rounding float to nearest integer, part 3 explain.
For example, a common roll-your-own implementation using std::floor and adding 0.5 does not work for all inputs:
double myround(double d)
{
    return std::floor(d + 0.5);
}
One input this will fail for is 0.49999999999999994, (see it live).
Another common implementation involves casting a floating point type to an integral type, which can invoke undefined behavior in the case where the integral part can not be represented in the destination type. We can see this from the draft C++ standard section 4.9 Floating-integral conversions which says (emphasis mine):
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [...]
For example:
float myround(float f)
{
    return static_cast<float>( static_cast<unsigned int>( f ) );
}
Given std::numeric_limits<unsigned int>::max() is 4294967295 then the following call:
myround( 4294967296.5f )
will cause overflow, (see it live).
We can see how difficult this really is by looking at this answer to Concise way to implement round() in C? which referencing newlibs version of single precision float round. It is a very long function for something which seems simple. It seems unlikely that anyone without intimate knowledge of floating point implementations could correctly implement this function:
float roundf(float x)
{
    int signbit;
    __uint32_t w;
    /* Most significant word, least significant word. */
    int exponent_less_127;

    GET_FLOAT_WORD(w, x);

    /* Extract sign bit. */
    signbit = w & 0x80000000;

    /* Extract exponent field. */
    exponent_less_127 = (int)((w & 0x7f800000) >> 23) - 127;

    if (exponent_less_127 < 23)
    {
        if (exponent_less_127 < 0)
        {
            w &= 0x80000000;
            if (exponent_less_127 == -1)
                /* Result is +1.0 or -1.0. */
                w |= ((__uint32_t)127 << 23);
        }
        else
        {
            unsigned int exponent_mask = 0x007fffff >> exponent_less_127;
            if ((w & exponent_mask) == 0)
                /* x has an integral value. */
                return x;

            w += 0x00400000 >> exponent_less_127;
            w &= ~exponent_mask;
        }
    }
    else
    {
        if (exponent_less_127 == 128)
            /* x is NaN or infinite. */
            return x + x;
        else
            return x;
    }
    SET_FLOAT_WORD(x, w);
    return x;
}
On the other hand if none of the other solutions are usable newlib could potentially be an option since it is a well tested implementation.
Boost offers a simple set of rounding functions.
#include <boost/math/special_functions/round.hpp>
double a = boost::math::round(1.5); // Yields 2.0
int b = boost::math::iround(1.5); // Yields 2 as an integer
For more information, see the Boost documentation.
Edit: Since C++11, there are std::round, std::lround, and std::llround.
It may be worth noting that if you wanted an integer result from the rounding you don't need to pass it through either ceil or floor. I.e.,
int round_int( double r ) {
    return (r > 0.0) ? (r + 0.5) : (r - 0.5);
}
It's available since C++11 in cmath (according to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf)
#include <cmath>
#include <iostream>

int main(int argc, char** argv) {
    std::cout << "round(0.5):\t" << round(0.5) << std::endl;
    std::cout << "round(-0.5):\t" << round(-0.5) << std::endl;
    std::cout << "round(1.4):\t" << round(1.4) << std::endl;
    std::cout << "round(-1.4):\t" << round(-1.4) << std::endl;
    std::cout << "round(1.6):\t" << round(1.6) << std::endl;
    std::cout << "round(-1.6):\t" << round(-1.6) << std::endl;
    return 0;
}
Output:
round(0.5): 1
round(-0.5): -1
round(1.4): 1
round(-1.4): -1
round(1.6): 2
round(-1.6): -2
It's usually implemented as floor(value + 0.5).
Edit: and it's probably not called round since there are at least three rounding algorithms I know of: round to zero, round to closest integer, and banker's rounding. You are asking for round to closest integer.
There are two problems we are looking at:
rounding conversions
type conversions
Rounding conversion means rounding a float/double to the nearest floor/ceil float/double. Maybe your problem ends here.
But if you are expected to return an int/long, you need to perform a type conversion, and thus the "overflow" problem might hit your solution. So, check for errors in your function:
long round(double x) {
    assert(x >= LONG_MIN - 0.5);
    assert(x <= LONG_MAX + 0.5);
    if (x >= 0)
        return (long) (x + 0.5);
    return (long) (x - 0.5);
}
#define round(x) ((x) < LONG_MIN-0.5 || (x) > LONG_MAX+0.5 ? \
    error() : ((x) >= 0 ? (long)((x)+0.5) : (long)((x)-0.5)))
from : http://www.cs.tut.fi/~jkorpela/round.html
A certain type of rounding is also implemented in Boost:
#include <iostream>
#include <boost/numeric/conversion/converter.hpp>

template<typename T, typename S> T round2(const S& x) {
    typedef boost::numeric::conversion_traits<T, S> Traits;
    typedef boost::numeric::def_overflow_handler OverflowHandler;
    typedef boost::numeric::RoundEven<typename Traits::source_type> Rounder;
    typedef boost::numeric::converter<T, S, Traits, OverflowHandler, Rounder> Converter;
    return Converter::convert(x);
}

int main() {
    std::cout << round2<int, double>(0.1) << ' ' << round2<int, double>(-0.1) << ' ' << round2<int, double>(-0.9) << std::endl;
}
Note that this works only if you do a to-integer conversion.
You could round to n-digit precision with:
double round( double x )
{
    const double sd = 1000; // for accuracy to 3 decimal places
    return int(x*sd + (x < 0 ? -0.5 : 0.5)) / sd;
}
(Note that the intermediate int conversion overflows for |x| greater than about INT_MAX/1000.)
These days it shouldn't be a problem to use a C++11 compiler which includes a C99/C++11 math library. But then the question becomes: which rounding function do you pick?
C99/C++11 round() is often not actually the rounding function you want. It uses a funky rounding mode that rounds away from 0 as a tie-break on half-way cases (+-xxx.5000). If you do specifically want that rounding mode, or you're targeting a C++ implementation where round() is faster than rint(), then use it (or emulate its behaviour with one of the other answers on this question which took it at face value and carefully reproduced that specific rounding behaviour.)
round()'s rounding is different from the IEEE754 default round to nearest mode with even as a tie-break. Nearest-even avoids statistical bias in the average magnitude of numbers, but does bias towards even numbers.
There are two math library rounding functions that use the current default rounding mode: std::nearbyint() and std::rint(), both added in C99/C++11, so they're available any time std::round() is. The only difference is that nearbyint never raises FE_INEXACT.
Prefer rint() for performance reasons: gcc and clang both inline it more easily, but gcc never inlines nearbyint() (even with -ffast-math)
gcc/clang for x86-64 and AArch64
I put some test functions on Matt Godbolt's Compiler Explorer, where you can see source + asm output (for multiple compilers). For more about reading compiler output, see this Q&A, and Matt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”,
In FP code, it's usually a big win to inline small functions. Especially on non-Windows, where the standard calling convention has no call-preserved registers, so the compiler can't keep any FP values in XMM registers across a call. So even if you don't really know asm, you can still easily see whether it's just a tail-call to the library function or whether it inlined to one or two math instructions. Anything that inlines to one or two instructions is better than a function call (for this particular task on x86 or ARM).
On x86, anything that inlines to SSE4.1 roundsd can auto-vectorize with SSE4.1 roundpd (or AVX vroundpd). (FP->integer conversions are also available in packed SIMD form, except for FP->64-bit integer which requires AVX512.)
std::nearbyint():
x86 clang: inlines to a single insn with -msse4.1.
x86 gcc: inlines to a single insn only with -msse4.1 -ffast-math, and only on gcc 5.4 and earlier. Later gcc never inlines it (maybe they didn't realize that one of the immediate bits can suppress the inexact exception? That's what clang uses, but older gcc uses the same immediate as for rint when it does inline it)
AArch64 gcc6.3: inlines to a single insn by default.
std::rint:
x86 clang: inlines to a single insn with -msse4.1
x86 gcc7: inlines to a single insn with -msse4.1. (Without SSE4.1, inlines to several instructions)
x86 gcc6.x and earlier: inlines to a single insn with -ffast-math -msse4.1.
AArch64 gcc: inlines to a single insn by default
std::round:
x86 clang: doesn't inline
x86 gcc: inlines to multiple instructions with -ffast-math -msse4.1, requiring two vector constants.
AArch64 gcc: inlines to a single instruction (HW support for this rounding mode as well as IEEE default and most others.)
std::floor / std::ceil / std::trunc
x86 clang: inlines to a single insn with -msse4.1
x86 gcc7.x: inlines to a single insn with -msse4.1
x86 gcc6.x and earlier: inlines to a single insn with -ffast-math -msse4.1
AArch64 gcc: inlines by default to a single instruction
Rounding to int / long / long long:
You have two options here: use lrint (like rint but returns long, or long long for llrint), or use an FP->FP rounding function and then convert to an integer type the normal way (with truncation). Some compilers optimize one way better than the other.
long l = lrint(x);
int i = (int)rint(x);
Note that int i = lrint(x) converts float or double -> long first, and then truncates the integer to int. This makes a difference for out-of-range integers: Undefined Behaviour in C++, but well-defined for the x86 FP -> int instructions (which the compiler will emit unless it sees the UB at compile time while doing constant propagation, then it's allowed to make code that breaks if it's ever executed).
On x86, an FP->integer conversion that overflows the integer produces INT_MIN or LLONG_MIN (a bit-pattern of 0x80000000 or the 64-bit equivalent, with just the sign bit set). Intel calls this the "integer indefinite" value. (See the cvttsd2si manual entry, the SSE2 instruction that converts (with truncation) scalar double to signed integer. It's available with 32-bit or 64-bit integer destination (in 64-bit mode only). There's also a cvtsd2si (convert with current rounding mode), which is what we'd like the compiler to emit, but unfortunately gcc and clang won't do that without -ffast-math.
Also beware that FP to/from unsigned int / long is less efficient on x86 (without AVX512). Conversion to 32-bit unsigned on a 64-bit machine is pretty cheap; just convert to 64-bit signed and truncate. But otherwise it's significantly slower.
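A sketch of that 32-bit trick (double_to_u32 is a hypothetical helper; it assumes the value fits in [0, 2^32)):
#include <cstdint>
#include <cmath>

uint32_t double_to_u32(double d) {
    // Convert to 64-bit signed (cheap on x86-64), then truncate to 32 bits.
    return (uint32_t)(int64_t)std::rint(d);
}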
x86 clang with/without -ffast-math -msse4.1: (int/long)rint inlines to roundsd / cvttsd2si. (missed optimization to cvtsd2si). lrint doesn't inline at all.
x86 gcc6.x and earlier without -ffast-math: neither way inlines
x86 gcc7 without -ffast-math: (int/long)rint rounds and converts separately (with 2 total instructions if SSE4.1 is enabled, otherwise with a bunch of code inlined for rint without roundsd). lrint doesn't inline.
x86 gcc with -ffast-math: all ways inline to cvtsd2si (optimal), no need for SSE4.1.
AArch64 gcc6.3 without -ffast-math: (int/long)rint inlines to 2 instructions. lrint doesn't inline
AArch64 gcc6.3 with -ffast-math: (int/long)rint compiles to a call to lrint. lrint doesn't inline. This may be a missed optimization unless the two instructions we get without -ffast-math are very slow.
If you ultimately want to convert the double output of your round() function to an int, then the accepted solutions of this question will look something like:
int roundint(double r) {
    return (int)((r > 0.0) ? floor(r + 0.5) : ceil(r - 0.5));
}
This clocks in at around 8.88 ns on my machine when passed in uniformly random values.
The below is functionally equivalent, as far as I can tell, but clocks in at 2.48 ns on my machine, for a significant performance advantage:
int roundint (double r) {
    int tmp = static_cast<int> (r);
    tmp += (r - tmp >= .5) - (r - tmp <= -.5);
    return tmp;
}
Among the reasons for the better performance is the skipped branching.
Beware of floor(x+0.5). Here is what can happen for odd numbers in range [2^52,2^53]:
-bash-3.2$ cat >test-round.c <<END
#include <math.h>
#include <stdio.h>
int main() {
double x=5000000000000001.0;
double y=round(x);
double z=floor(x+0.5);
printf(" x =%f\n",x);
printf("round(x) =%f\n",y);
printf("floor(x+0.5)=%f\n",z);
return 0;
}
END
-bash-3.2$ gcc test-round.c
-bash-3.2$ ./a.out
x =5000000000000001.000000
round(x) =5000000000000001.000000
floor(x+0.5)=5000000000000002.000000
This is http://bugs.squeak.org/view.php?id=7134. Use a solution like the one of @konik.
My own robust version would be something like:
double round(double x)
{
    double truncated, roundedFraction;
    double fraction = modf(x, &truncated);
    modf(2.0 * fraction, &roundedFraction);
    return truncated + roundedFraction;
}
Another reason to avoid floor(x+0.5) is given here.
There is no need to implement anything, so I'm not sure why so many answers involve defines, functions, or methods.
In C99
We have the following, and the header <tgmath.h> for type-generic macros:
#include <math.h>
double round (double x);
float roundf (float x);
long double roundl (long double x);
If you cannot compile this, you have probably left out the math library. A command similar to this works on every C compiler I have (several).
gcc -lm -std=c99 ...
In C++11
We have the following additional overloads in #include <cmath> that rely on IEEE double-precision floating point:
#include <math.h>
double round (double x);
float round (float x);
long double round (long double x);
double round (T x); // for integral T
There are equivalents in the std namespace too.
If you cannot compile this, you may be using C compilation instead of C++. The following basic command produces neither errors nor warnings with g++ 6.3.1, x86_64-w64-mingw32-g++ 6.3.0, clang-x86_64++ 3.8.0, and Visual C++ 2015 Community.
g++ -std=c++11 -Wall
With Ordinal Division
When dividing two ordinal numbers, where T is short, int, long, or another ordinal, the rounding expression is this.
T roundedQuotient = (2 * integerNumerator + 1) / (2 * integerDenominator);
Accuracy
There is no doubt that odd-looking inaccuracies appear in floating point operations, but these arise from how the numbers are represented and have little to do with rounding.
The source is not just the number of significant digits in the mantissa of the IEEE representation of a floating point number; it is related to our decimal thinking as humans.
Ten is the product of five and two, and 5 and 2 are relatively prime. Therefore decimal fractions cannot, in general, be represented exactly in binary IEEE floating point.
This is not an issue with the rounding algorithms. It is mathematical reality that should be considered during the selection of types and the design of computations, data entry, and display of numbers. If an application displays the digits that show these decimal-binary conversion issues, then the application is visually expressing accuracy that does not exist in digital reality and should be changed.
Function double round(double) with the use of the modf function:
double round(double x)
{
    using namespace std;
    if ((numeric_limits<double>::max() - 0.5) <= x)
        return numeric_limits<double>::max();
    if ((-1 * std::numeric_limits<double>::max() + 0.5) > x)
        return (-1 * std::numeric_limits<double>::max());
    double intpart;
    double fractpart = modf(x, &intpart);
    if (fractpart >= 0.5)
        return (intpart + 1);
    else if (fractpart >= -0.5)
        return intpart;
    else
        return (intpart - 1);
}
To compile cleanly, the includes "math.h" and "limits" are necessary. The function works according to the following rounding schema:
round of 5.0 is 5.0
round of 3.8 is 4.0
round of 2.3 is 2.0
round of 1.5 is 2.0
round of 0.501 is 1.0
round of 0.5 is 1.0
round of 0.499 is 0.0
round of 0.01 is 0.0
round of 0.0 is 0.0
round of -0.01 is -0.0
round of -0.499 is -0.0
round of -0.5 is -0.0
round of -0.501 is -1.0
round of -1.5 is -1.0
round of -2.3 is -2.0
round of -3.8 is -4.0
round of -5.0 is -5.0
If you need to be able to compile code in environments that support the C++11 standard, but also need to be able to compile that same code in environments that don't support it, you could use a function macro to choose between std::round() and a custom function for each system. Just pass -DCPP11 or /DCPP11 to the C++11-compliant compiler (or use its built-in version macros), and make a header like this:
// File: rounding.h
#include <cmath>

#ifdef CPP11
    #define ROUND(x) std::round(x)
#else /* CPP11 */
    inline double myRound(double x) {
        return (x >= 0.0 ? std::floor(x + 0.5) : std::ceil(x - 0.5));
    }
    #define ROUND(x) myRound(x)
#endif /* CPP11 */
For a quick example, see http://ideone.com/zal709 .
This approximates std::round() in environments that aren't C++11-compliant, including preservation of the sign bit for -0.0. It may cause a slight performance hit, however, and will likely have issues with rounding certain known "problem" floating-point values such as 0.49999999999999994 or similar values.
Alternatively, if you have access to a C++11-compliant compiler, you could just grab std::round() from its <cmath> header, and use it to make your own header that defines the function if it's not already defined. Note that this may not be an optimal solution, however, especially if you need to compile for multiple platforms.
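As a sketch of that alternative, you could key off the standard __cplusplus macro instead of a hand-passed define; portableRound is a hypothetical name here, and note that MSVC reports an old __cplusplus value unless /Zc:__cplusplus is passed:
// File: rounding_auto.h (hypothetical)
#include <cmath>
#if __cplusplus >= 201103L
inline double portableRound(double x) { return std::round(x); }
#else
inline double portableRound(double x) {
    return (x >= 0.0 ? std::floor(x + 0.5) : std::ceil(x - 0.5));
}
#endif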
Based on Kalaxy's response, the following is a templated solution that rounds any floating point number to the nearest integer type based on natural rounding. It also throws an error in debug mode if the value is out of range of the integer type, thereby serving roughly as a viable library function.
// round a floating point number to the nearest integer
#include <limits>
#include <stdexcept>
template <typename Arg>
int Round(Arg arg)
{
#ifndef NDEBUG
// check that the argument can be rounded given the return type:
if (
((Arg)std::numeric_limits<int>::max() < arg + (Arg) 0.5) ||
((Arg)std::numeric_limits<int>::lowest() > arg - (Arg) 0.5)
)
{
throw std::overflow_error("out of bounds");
}
#endif
return (arg > (Arg) 0.0) ? (int)(arg + (Arg) 0.5) : (int)(arg - (Arg) 0.5);
}
As pointed out in comments and other answers, the ISO C++ standard library did not add round() until ISO C++11, when this function was pulled in by reference to the ISO C99 standard math library.
For positive operands in [½, ub] round(x) == floor (x + 0.5), where ub is 2^23 for float when mapped to IEEE-754 (2008) binary32, and 2^52 for double when it is mapped to IEEE-754 (2008) binary64. The exponents 23 and 52 correspond to the number of stored mantissa bits in these two floating-point formats. For positive operands in [+0, ½) round(x) == 0, and for positive operands in (ub, +∞] round(x) == x. As round() is an odd function, symmetric about the origin, negative arguments x can be handled according to round(-x) == -round(x).
This leads to the compact code below. It compiles into a reasonable number of machine instructions across various platforms. I observed the most compact code on GPUs, where my_roundf() requires about a dozen instructions. Depending on processor architecture and toolchain, this floating-point based approach could be either faster or slower than the integer-based implementation from newlib referenced in a different answer.
I tested my_roundf() exhaustively against the newlib roundf() implementation using Intel compiler version 13, with both /fp:strict and /fp:fast. I also checked that the newlib version matches the roundf() in the mathimf library of the Intel compiler. Exhaustive testing is not possible for double-precision round(), however the code is structurally identical to the single-precision implementation.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float my_roundf (float x)
{
const float half = 0.5f;
const float one = 2 * half;
const float lbound = half;
const float ubound = 1L << 23;
float a, f, r, s, t;
s = (x < 0) ? (-one) : one;
a = x * s;
t = (a < lbound) ? x : s;
f = (a < lbound) ? 0 : floorf (a + half);
r = (a > ubound) ? x : (t * f);
return r;
}
double my_round (double x)
{
const double half = 0.5;
const double one = 2 * half;
const double lbound = half;
const double ubound = 1ULL << 52;
double a, f, r, s, t;
s = (x < 0) ? (-one) : one;
a = x * s;
t = (a < lbound) ? x : s;
f = (a < lbound) ? 0 : floor (a + half);
r = (a > ubound) ? x : (t * f);
return r;
}
uint32_t float_as_uint (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof(r));
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
float newlib_roundf (float x)
{
uint32_t w;
int exponent_less_127;
w = float_as_uint(x);
/* Extract exponent field. */
exponent_less_127 = (int)((w & 0x7f800000) >> 23) - 127;
if (exponent_less_127 < 23) {
if (exponent_less_127 < 0) {
/* Extract sign bit. */
w &= 0x80000000;
if (exponent_less_127 == -1) {
/* Result is +1.0 or -1.0. */
w |= ((uint32_t)127 << 23);
}
} else {
uint32_t exponent_mask = 0x007fffff >> exponent_less_127;
if ((w & exponent_mask) == 0) {
/* x has an integral value. */
return x;
}
w += 0x00400000 >> exponent_less_127;
w &= ~exponent_mask;
}
} else {
if (exponent_less_127 == 128) {
/* x is NaN or infinite so raise FE_INVALID by adding */
return x + x;
} else {
return x;
}
}
x = uint_as_float (w);
return x;
}
int main (void)
{
uint32_t argi, resi, refi;
float arg, res, ref;
argi = 0;
do {
arg = uint_as_float (argi);
ref = newlib_roundf (arg);
res = my_roundf (arg);
resi = float_as_uint (res);
refi = float_as_uint (ref);
if (resi != refi) { // check for identical bit pattern
printf ("!!!! arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
}
argi++;
} while (argi);
return EXIT_SUCCESS;
}
I use the following implementation of round in asm for x86 architecture and MS VS specific C++:
__forceinline int Round(const double v)
{
int r;
__asm
{
FLD v
FISTP r
FWAIT
};
return r;
}
UPD: to return a double value
__forceinline double dround(const double v)
{
double r;
__asm
{
FLD v
FRNDINT
FSTP r
FWAIT
};
return r;
}
Output (note that FISTP and FRNDINT round according to the current FPU rounding mode, which defaults to round-to-nearest-even; that is why 0.5 rounds to 0 below):
dround(0.1): 0.000000000000000
dround(-0.1): -0.000000000000000
dround(0.9): 1.000000000000000
dround(-0.9): -1.000000000000000
dround(1.1): 1.000000000000000
dround(-1.1): -1.000000000000000
dround(0.49999999999999994): 0.000000000000000
dround(-0.49999999999999994): -0.000000000000000
dround(0.5): 0.000000000000000
dround(-0.5): -0.000000000000000
Since C++11, simply:
#include <cmath>
std::round(1.1)
or to get int
static_cast<int>(std::round(1.1))
round_f for ARM with math
static inline float round_f(float value)
{
float rep;
asm volatile ("vrinta.f32 %0,%1" : "=t"(rep) : "t"(value));
return rep;
}
round_f for ARM without math
#include <stdint.h>
union f__raw {
struct {
uint32_t massa :23; // mantissa (fraction) bits
uint32_t order :8;  // biased exponent bits
uint32_t sign  :1;  // sign bit
};                  // note: bit-field layout is implementation-defined; this assumes the usual little-endian ARM layout
int32_t i_raw;
float f_raw;
};
float round_f(float value)
{
union f__raw raw;
int32_t exx;
uint32_t ex_mask;
raw.f_raw = value;
exx = raw.order - 126;
if (exx < 0) {
raw.i_raw &= 0x80000000;
} else if (exx < 24) {
ex_mask = 0x00ffffff >> exx;
raw.i_raw += 0x00800000 >> exx;
if (exx == 0) ex_mask >>= 1;
raw.i_raw &= ~ex_mask;
};
return raw.f_raw;
};
The best way to round off a floating value to "n" decimal places, in O(1) time, is as follows:
We have to round off the value to 3 places, i.e. n=3. So,
float a=47.8732355;
printf("%.3f",a);
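Note that printf only rounds the displayed text; the stored value is unchanged. To round the value itself, a common sketch (assuming the scaled value stays well within double precision) is:
#include <cmath>
#include <cstdio>
int main() {
    double a = 47.8732355;
    double r = std::round(a * 1000.0) / 1000.0; // round to n=3 decimal places
    std::printf("%f\n", r); // 47.873000 (approximately; 47.873 itself is not exactly representable)
    return 0;
}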
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>
using namespace std;
// Convert the float to a string
// We might use stringstream, but it looks like it truncates the float to only
// 5 decimal points (maybe that's what you want anyway =P)
float MyFloat = 5.11133333311111333;
float NewConvertedFloat = 0.0;
string FirstString = " ";
string SecondString = " ";
stringstream ss (stringstream::in | stringstream::out);
ss << MyFloat;
FirstString = ss.str();
// Take out how ever many decimal places you want
// (this is a string it includes the point)
SecondString = FirstString.substr(0,5);
//whatever precision decimal place you want
// Convert it back to a float
stringstream(SecondString) >> NewConvertedFloat;
cout << NewConvertedFloat;
system("pause");
It might be an inefficient, dirty way of conversion, but heck, it works lol. And it's good because it applies to the actual float, not just affecting the output visually.
I did this:
#include <cmath>
using namespace std;
double roundh(double number, int place){
/* place = decimal place. Passing 0 will round to a whole
number. Passing 1 will round to the
tenths digit.
*/
double scale = pow(10.0, place); // the original used 10^place, but ^ is XOR in C++, not exponentiation
number *= scale;
double fractpart = number - floor(number);
if (fractpart < 0.5){
number = floor(number); // floor()/ceil() return the result; they do not modify their argument
} else {
number = ceil(number);
}
return number / scale;
}

What is the most effective way for float and double comparison?

What would be the most efficient way to compare two double or two float values?
Simply doing this is not correct:
bool CompareDoubles1 (double A, double B)
{
return A == B;
}
But something like:
bool CompareDoubles2 (double A, double B)
{
double diff = A - B;
return (diff < EPSILON) && (-diff < EPSILON);
}
Seems to waste processing.
Does anyone know a smarter float comparer?
Be extremely careful using any of the other suggestions. It all depends on context.
I have spent a long time tracing bugs in a system that presumed a==b if |a-b|<epsilon. The underlying problems were:
The implicit presumption in an algorithm that if a==b and b==c then a==c.
Using the same epsilon for lines measured in inches and lines measured in mils (.001 inch). That is a==b but 1000a!=1000b. (This is why AlmostEqual2sComplement asks for the epsilon or max ULPS).
The use of the same epsilon for both the cosine of angles and the length of lines!
Using such a compare function to sort items in a collection. (In this case using the builtin C++ operator == for doubles produced correct results.)
Like I said: it all depends on context and the expected size of a and b.
By the way, std::numeric_limits<double>::epsilon() is the "machine epsilon". It is the difference between 1.0 and the next value representable by a double. I guess that it could be used in the compare function but only if the expected values are less than 1. (This is in response to @cdv's answer...)
Also, if you basically have int arithmetic in doubles (here we use doubles to hold int values in certain cases) your arithmetic will be correct. For example 4.0/2.0 will be the same as 1.0+1.0. This is as long as you do not do things that result in fractions (4.0/3.0) or do not go outside of the size of an int.
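To make the machine-epsilon point concrete, a tiny sketch (printed values assume IEEE-754 binary64):
#include <cstdio>
#include <limits>
int main() {
    double eps = std::numeric_limits<double>::epsilon();
    std::printf("epsilon = %g\n", eps);        // 2.22045e-16
    std::printf("%d\n", 1.0 + eps == 1.0);     // 0: epsilon does change 1.0
    std::printf("%d\n", 1.0 + eps / 2 == 1.0); // 1: half of it is absorbed by rounding
    std::printf("%d\n", 1.0e9 + eps == 1.0e9); // 1: far too small at this magnitude
    return 0;
}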
The comparison with an epsilon value is what most people do (even in game programming).
You should change your implementation a little though:
bool AreSame(double a, double b)
{
return fabs(a - b) < EPSILON;
}
Edit: Christer has added a stack of great info on this topic on a recent blog post. Enjoy.
Comparing floating point numbers depends on the context. Since even changing the order of operations can produce different results, it is important to know how "equal" you want the numbers to be.
Comparing floating point numbers by Bruce Dawson is a good place to start when looking at floating point comparison.
The following definitions are from The art of computer programming by Knuth:
bool approximatelyEqual(float a, float b, float epsilon)
{
return fabs(a - b) <= ( (fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
bool essentiallyEqual(float a, float b, float epsilon)
{
return fabs(a - b) <= ( (fabs(a) > fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
bool definitelyGreaterThan(float a, float b, float epsilon)
{
return (a - b) > ( (fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
bool definitelyLessThan(float a, float b, float epsilon)
{
return (b - a) > ( (fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
Of course, choosing epsilon depends on the context, and determines how equal you want the numbers to be.
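For instance, a small usage sketch of the two predicates above (assuming the approximatelyEqual/essentiallyEqual definitions are in scope), with an illustrative epsilon chosen so that they disagree:
#include <cmath>
#include <cstdio>
int main() {
    float a = 95.0f, b = 100.0f, eps = 0.05f;
    // |a-b| = 5; approximatelyEqual scales eps by max(|a|,|b|) = 100 -> bound ~5.0,
    // while essentiallyEqual scales eps by min(|a|,|b|) = 95 -> bound ~4.75.
    std::printf("%d\n", approximatelyEqual(a, b, eps)); // 1
    std::printf("%d\n", essentiallyEqual(a, b, eps));   // 0
    return 0;
}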
Another method of comparing floating point numbers is to look at the ULP (units in last place) of the numbers. While not dealing specifically with comparisons, the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic is a good resource for understanding how floating point works and what the pitfalls are, including what ULP is.
I found that the Google C++ Testing Framework contains a nice cross-platform template-based implementation of AlmostEqual2sComplement which works on both doubles and floats. Given that it is released under the BSD license, using it in your own code should be no problem, as long as you retain the license. I extracted the below code from https://github.com/google/googletest/blob/master/googletest/include/gtest/internal/gtest-internal.h (originally at http://code.google.com/p/googletest/source/browse/trunk/include/gtest/internal/gtest-internal.h) and added the license on top.
Be sure to #define GTEST_OS_WINDOWS to some value (or to change the code where it's used to something that fits your codebase - it's BSD licensed after all).
Usage example:
double left = // something
double right = // something
const FloatingPoint<double> lhs(left), rhs(right);
if (lhs.AlmostEquals(rhs)) {
//they're equal!
}
Here's the code:
// Copyright 2005, Google Inc.
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
// * Neither the name of Google Inc. nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Authors: wan@google.com (Zhanyong Wan), eefacm@gmail.com (Sean Mcafee)
//
// The Google C++ Testing Framework (Google Test)
// This template class serves as a compile-time function from size to
// type. It maps a size in bytes to a primitive type with that
// size. e.g.
//
// TypeWithSize<4>::UInt
//
// is typedef-ed to be unsigned int (unsigned integer made up of 4
// bytes).
//
// Such functionality should belong to STL, but I cannot find it
// there.
//
// Google Test uses this class in the implementation of floating-point
// comparison.
//
// For now it only handles UInt (unsigned int) as that's all Google Test
// needs. Other types can be easily added in the future if need
// arises.
template <size_t size>
class TypeWithSize {
public:
// This prevents the user from using TypeWithSize<N> with incorrect
// values of N.
typedef void UInt;
};
// The specialization for size 4.
template <>
class TypeWithSize<4> {
public:
// unsigned int has size 4 in both gcc and MSVC.
//
// As base/basictypes.h doesn't compile on Windows, we cannot use
// uint32, uint64, and etc here.
typedef int Int;
typedef unsigned int UInt;
};
// The specialization for size 8.
template <>
class TypeWithSize<8> {
public:
#if GTEST_OS_WINDOWS
typedef __int64 Int;
typedef unsigned __int64 UInt;
#else
typedef long long Int; // NOLINT
typedef unsigned long long UInt; // NOLINT
#endif // GTEST_OS_WINDOWS
};
// This template class represents an IEEE floating-point number
// (either single-precision or double-precision, depending on the
// template parameters).
//
// The purpose of this class is to do more sophisticated number
// comparison. (Due to round-off error, etc, it's very unlikely that
// two floating-points will be equal exactly. Hence a naive
// comparison by the == operation often doesn't work.)
//
// Format of IEEE floating-point:
//
// The most-significant bit being the leftmost, an IEEE
// floating-point looks like
//
// sign_bit exponent_bits fraction_bits
//
// Here, sign_bit is a single bit that designates the sign of the
// number.
//
// For float, there are 8 exponent bits and 23 fraction bits.
//
// For double, there are 11 exponent bits and 52 fraction bits.
//
// More details can be found at
// http://en.wikipedia.org/wiki/IEEE_floating-point_standard.
//
// Template parameter:
//
// RawType: the raw floating-point type (either float or double)
template <typename RawType>
class FloatingPoint {
public:
// Defines the unsigned integer type that has the same size as the
// floating point number.
typedef typename TypeWithSize<sizeof(RawType)>::UInt Bits;
// Constants.
// # of bits in a number.
static const size_t kBitCount = 8*sizeof(RawType);
// # of fraction bits in a number.
static const size_t kFractionBitCount =
std::numeric_limits<RawType>::digits - 1;
// # of exponent bits in a number.
static const size_t kExponentBitCount = kBitCount - 1 - kFractionBitCount;
// The mask for the sign bit.
static const Bits kSignBitMask = static_cast<Bits>(1) << (kBitCount - 1);
// The mask for the fraction bits.
static const Bits kFractionBitMask =
~static_cast<Bits>(0) >> (kExponentBitCount + 1);
// The mask for the exponent bits.
static const Bits kExponentBitMask = ~(kSignBitMask | kFractionBitMask);
// How many ULP's (Units in the Last Place) we want to tolerate when
// comparing two numbers. The larger the value, the more error we
// allow. A 0 value means that two numbers must be exactly the same
// to be considered equal.
//
// The maximum error of a single floating-point operation is 0.5
// units in the last place. On Intel CPU's, all floating-point
// calculations are done with 80-bit precision, while double has 64
// bits. Therefore, 4 should be enough for ordinary use.
//
// See the following article for more details on ULP:
// http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm.
static const size_t kMaxUlps = 4;
// Constructs a FloatingPoint from a raw floating-point number.
//
// On an Intel CPU, passing a non-normalized NAN (Not a Number)
// around may change its bits, although the new value is guaranteed
// to be also a NAN. Therefore, don't expect this constructor to
// preserve the bits in x when x is a NAN.
explicit FloatingPoint(const RawType& x) { u_.value_ = x; }
// Static methods
// Reinterprets a bit pattern as a floating-point number.
//
// This function is needed to test the AlmostEquals() method.
static RawType ReinterpretBits(const Bits bits) {
FloatingPoint fp(0);
fp.u_.bits_ = bits;
return fp.u_.value_;
}
// Returns the floating-point number that represent positive infinity.
static RawType Infinity() {
return ReinterpretBits(kExponentBitMask);
}
// Non-static methods
// Returns the bits that represents this number.
const Bits &bits() const { return u_.bits_; }
// Returns the exponent bits of this number.
Bits exponent_bits() const { return kExponentBitMask & u_.bits_; }
// Returns the fraction bits of this number.
Bits fraction_bits() const { return kFractionBitMask & u_.bits_; }
// Returns the sign bit of this number.
Bits sign_bit() const { return kSignBitMask & u_.bits_; }
// Returns true iff this is NAN (not a number).
bool is_nan() const {
// It's a NAN if the exponent bits are all ones and the fraction
// bits are not entirely zeros.
return (exponent_bits() == kExponentBitMask) && (fraction_bits() != 0);
}
// Returns true iff this number is at most kMaxUlps ULP's away from
// rhs. In particular, this function:
//
// - returns false if either number is (or both are) NAN.
// - treats really large numbers as almost equal to infinity.
// - thinks +0.0 and -0.0 are 0 ULP's apart.
bool AlmostEquals(const FloatingPoint& rhs) const {
// The IEEE standard says that any comparison operation involving
// a NAN must return false.
if (is_nan() || rhs.is_nan()) return false;
return DistanceBetweenSignAndMagnitudeNumbers(u_.bits_, rhs.u_.bits_)
<= kMaxUlps;
}
private:
// The data type used to store the actual floating-point number.
union FloatingPointUnion {
RawType value_; // The raw floating-point number.
Bits bits_; // The bits that represent the number.
};
// Converts an integer from the sign-and-magnitude representation to
// the biased representation. More precisely, let N be 2 to the
// power of (kBitCount - 1), an integer x is represented by the
// unsigned number x + N.
//
// For instance,
//
// -N + 1 (the most negative number representable using
// sign-and-magnitude) is represented by 1;
// 0 is represented by N; and
// N - 1 (the biggest number representable using
// sign-and-magnitude) is represented by 2N - 1.
//
// Read http://en.wikipedia.org/wiki/Signed_number_representations
// for more details on signed number representations.
static Bits SignAndMagnitudeToBiased(const Bits &sam) {
if (kSignBitMask & sam) {
// sam represents a negative number.
return ~sam + 1;
} else {
// sam represents a positive number.
return kSignBitMask | sam;
}
}
// Given two numbers in the sign-and-magnitude representation,
// returns the distance between them as an unsigned number.
static Bits DistanceBetweenSignAndMagnitudeNumbers(const Bits &sam1,
const Bits &sam2) {
const Bits biased1 = SignAndMagnitudeToBiased(sam1);
const Bits biased2 = SignAndMagnitudeToBiased(sam2);
return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
}
FloatingPointUnion u_;
};
EDIT: This post is 4 years old. It's probably still valid, and the code is nice, but some people found improvements. Best go get the latest version of AlmostEquals right from the Google Test source code, and not the one I pasted up here.
For a more in depth approach read Comparing floating point numbers. Here is the code snippet from that link:
// Usable AlmostEqual function
bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
// Make sure maxUlps is non-negative and small enough that the
// default NAN won't compare as equal to anything.
assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
int aInt = *(int*)&A;
// Make aInt lexicographically ordered as a twos-complement int
if (aInt < 0)
aInt = 0x80000000 - aInt;
// Make bInt lexicographically ordered as a twos-complement int
int bInt = *(int*)&B;
if (bInt < 0)
bInt = 0x80000000 - bInt;
int intDiff = abs(aInt - bInt);
if (intDiff <= maxUlps)
return true;
return false;
}
Realizing this is an old thread, but this article is one of the most straightforward ones I have found on comparing floating point numbers, and if you want to explore more it has more detailed references as well; the main site covers a complete range of issues dealing with floating point numbers: The Floating-Point Guide: Comparison.
A somewhat more practical article is Floating-point tolerances revisited, which notes there is an absolute tolerance test that boils down to this in C++:
bool absoluteToleranceCompare(double x, double y)
{
return std::fabs(x - y) <= std::numeric_limits<double>::epsilon() ;
}
and relative tolerance test:
bool relativeToleranceCompare(double x, double y)
{
double maxXY = std::max( std::fabs(x) , std::fabs(y) ) ;
return std::fabs(x - y) <= std::numeric_limits<double>::epsilon()*maxXY ;
}
The article notes that the absolute test fails when x and y are large, and the relative test fails when they are small. Assuming the absolute and relative tolerance are the same, a combined test would look like this:
bool combinedToleranceCompare(double x, double y)
{
double maxXYOne = std::max( { 1.0, std::fabs(x) , std::fabs(y) } ) ;
return std::fabs(x - y) <= std::numeric_limits<double>::epsilon()*maxXYOne ;
}
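As a quick sanity check (illustrative usage, assuming the combinedToleranceCompare function above is in scope):
#include <cstdio>
int main() {
    double x = 0.1 + 0.2, y = 0.3; // differ by exactly one ULP in binary64
    std::printf("%d\n", x == y);                         // 0: exact comparison fails
    std::printf("%d\n", combinedToleranceCompare(x, y)); // 1: within the combined tolerance
    return 0;
}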
I ended up spending quite some time going through material in this great thread. I doubt everyone wants to spend so much time so I would highlight the summary of what I learned and the solution I implemented.
Quick Summary
Is 1e-8 approximately the same as 1e-16? If you are looking at noisy sensor data then probably yes, but if you are doing molecular simulation then maybe not! Bottom line: you always need to think of the tolerance value in the context of the specific function call, and not just make it a generic app-wide hard-coded constant.
For general library functions, it's still nice to have a parameter with a default tolerance. A typical choice is std::numeric_limits<T>::epsilon(), which is the same as FLT_EPSILON in float.h. This is, however, problematic because the epsilon for comparing values like 1.0 is not the same as the epsilon for values like 1E9. FLT_EPSILON is defined for 1.0.
The obvious implementation to check if a number is within tolerance is fabs(a-b) <= epsilon; however, this doesn't work because the default epsilon is defined for 1.0. We need to scale epsilon up or down in terms of a and b.
There are two solutions to this problem: either you set epsilon proportional to max(a,b), or you get the next representable numbers around a and then see if b falls into that range. The former is called the "relative" method and the latter the ULP method.
Both methods actually fail when comparing with 0. In this case, the application must supply the correct tolerance.
Utility Functions Implementation (C++11)
//implements relative method - do not use for comparing with zero
//use this most of the time, tolerance needs to be meaningful in your context
template<typename TReal>
static bool isApproximatelyEqual(TReal a, TReal b, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
TReal diff = std::fabs(a - b);
if (diff <= tolerance)
return true;
if (diff < std::fmax(std::fabs(a), std::fabs(b)) * tolerance)
return true;
return false;
}
//supply tolerance that is meaningful in your context
//for example, default tolerance may not work if you are comparing double with float
template<typename TReal>
static bool isApproximatelyZero(TReal a, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
if (std::fabs(a) <= tolerance)
return true;
return false;
}
//use this when you want to be on safe side
//for example, don't start rover unless signal is above 1
template<typename TReal>
static bool isDefinitelyLessThan(TReal a, TReal b, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
TReal diff = b - a; // positive when a < b; the original computed a - b, which made equal values compare as "definitely less than"
if (diff > tolerance)
return true;
if (diff > std::fmax(std::fabs(a), std::fabs(b)) * tolerance)
return true;
return false;
}
template<typename TReal>
static bool isDefinitelyGreaterThan(TReal a, TReal b, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
TReal diff = a - b;
if (diff > tolerance)
return true;
if (diff > std::fmax(std::fabs(a), std::fabs(b)) * tolerance)
return true;
return false;
}
//implements ULP method
//use this when you are only concerned about floating point precision issue
//for example, if you want to see if a is 1.0 by checking if its within
//10 closest representable floating point numbers around 1.0.
template<typename TReal>
static bool isWithinPrecisionInterval(TReal a, TReal b, unsigned int interval_size = 1)
{
TReal min_a = a - (a - std::nextafter(a, std::numeric_limits<TReal>::lowest())) * interval_size;
TReal max_a = a + (std::nextafter(a, std::numeric_limits<TReal>::max()) - a) * interval_size;
return min_a <= b && max_a >= b;
}
The portable way to get epsilon in C++ is
#include <limits>
std::numeric_limits<double>::epsilon()
Then the comparison function becomes
#include <cmath>
#include <limits>
bool AreSame(double a, double b) {
return std::fabs(a - b) < std::numeric_limits<double>::epsilon();
}
The code you wrote is bugged :
return (diff < EPSILON) && (-diff > EPSILON);
The correct code would be :
return (diff < EPSILON) && (diff > -EPSILON);
(...and yes this is different)
I wonder if fabs wouldn't make you lose lazy evaluation in some case. I would say it depends on the compiler. You might want to try both. If they are equivalent in average, take the implementation with fabs.
If you have some info on which of the two float is more likely to be bigger than then other, you can play on the order of the comparison to take better advantage of the lazy evaluation.
Finally you might get better result by inlining this function. Not likely to improve much though...
Edit: OJ, thanks for correcting your code. I erased my comment accordingly
return fabs(a - b) < EPSILON;
This is fine if:
the order of magnitude of your inputs don't change much
very small numbers of opposite signs can be treated as equal
But otherwise it'll lead you into trouble. Double precision numbers have a resolution of about 16 decimal places. If the two numbers you are comparing are larger in magnitude than EPSILON*1.0E16, then you might as well be saying:
return a==b;
I'll examine a different approach that assumes you need to worry about the first issue and assume the second is fine for your application. A solution would be something like:
#define VERYSMALL (1.0E-150)
#define EPSILON (1.0E-8)
bool AreSame(double a, double b)
{
double absDiff = fabs(a - b);
if (absDiff < VERYSMALL)
{
return true;
}
double maxAbs = max(fabs(a), fabs(b));
return (absDiff/maxAbs) < EPSILON;
}
This is expensive computationally, but it is sometimes what is called for. This is what we have to do at my company because we deal with an engineering library and inputs can vary by a few dozen orders of magnitude.
Anyway, the point is this (and applies to practically every programming problem): Evaluate what your needs are, then come up with a solution to address your needs -- don't assume the easy answer will address your needs. If after your evaluation you find that fabs(a-b) < EPSILON will suffice, perfect -- use it! But be aware of its shortcomings and other possible solutions too.
As others have pointed out, using a fixed-exponent epsilon (such as 0.0000001) will be useless for values away from the epsilon value. For example, if your two values are 10000.000977 and 10000, then there are NO 32-bit floating-point values between these two numbers -- 10000 and 10000.000977 are as close as you can possibly get without being bit-for-bit identical. Here, an epsilon of less than 0.0009 is meaningless; you might as well use the straight equality operator.
Likewise, as the two values approach epsilon in size, the relative error grows to 100%.
Thus, trying to mix a fixed point number such as 0.00001 with floating-point values (where the exponent is arbitrary) is a pointless exercise. This will only ever work if you can be assured that the operand values lie within a narrow domain (that is, close to some specific exponent), and if you properly select an epsilon value for that specific test. If you pull a number out of the air ("Hey! 0.00001 is small, so that must be good!"), you're doomed to numerical errors. I've spent plenty of time debugging bad numerical code where some poor schmuck tosses in random epsilon values to make yet another test case work.
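To see the 10000.000977 claim concretely, a quick sketch using std::nextafter (the printed spacing assumes IEEE-754 binary32):
#include <cmath>
#include <cstdio>
int main() {
    float a = 10000.0f;
    float b = std::nextafter(a, 20000.0f); // the very next representable float above a
    // At this magnitude adjacent floats are 2^-10 = 0.0009765625 apart.
    std::printf("%.6f -> %.6f (gap = %.10f)\n", a, b, b - a);
    return 0;
}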
If you do numerical programming of any kind and believe you need to reach for fixed-point epsilons, READ BRUCE'S ARTICLE ON COMPARING FLOATING-POINT NUMBERS.
Comparing Floating Point Numbers
Here's proof that using std::numeric_limits<double>::epsilon() is not the answer: it fails for values greater than one.
#include <stdio.h>
#include <cmath>
#include <limits>
double ItoD (__int64 x) {
// Return double from 64-bit hexadecimal representation.
return *(reinterpret_cast<double*>(&x));
}
void test (__int64 ai, __int64 bi) {
double a = ItoD(ai), b = ItoD(bi);
bool close = std::fabs(a-b) < std::numeric_limits<double>::epsilon();
printf ("%.16f and %.16f %s close.\n", a, b, close ? "are " : "are not");
}
int main()
{
test (0x3fe0000000000000L,
0x3fe0000000000001L);
test (0x3ff0000000000000L,
0x3ff0000000000001L);
}
Running yields this output:
0.5000000000000000 and 0.5000000000000001 are close.
1.0000000000000000 and 1.0000000000000002 are not close.
Note that in the second case (one and just larger than one), the two input values are as close as they can possibly be, and still compare as not close. Thus, for values greater than 1.0, you might as well just use an equality test. Fixed epsilons will not save you when comparing floating-point values.
Qt implements two functions, maybe you can learn from them:
static inline bool qFuzzyCompare(double p1, double p2)
{
return (qAbs(p1 - p2) <= 0.000000000001 * qMin(qAbs(p1), qAbs(p2)));
}
static inline bool qFuzzyCompare(float p1, float p2)
{
return (qAbs(p1 - p2) <= 0.00001f * qMin(qAbs(p1), qAbs(p2)));
}
And you may need the following functions, since
Note that comparing values where either p1 or p2 is 0.0 will not work,
nor does comparing values where one of the values is NaN or infinity.
If one of the values is always 0.0, use qFuzzyIsNull instead. If one
of the values is likely to be 0.0, one solution is to add 1.0 to both
values.
static inline bool qFuzzyIsNull(double d)
{
return qAbs(d) <= 0.000000000001;
}
static inline bool qFuzzyIsNull(float f)
{
return qAbs(f) <= 0.00001f;
}
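For example, following the documentation's add-1.0 suggestion for values near zero (a sketch assuming Qt's <QtGlobal> header, which provides qAbs, qMin, and qFuzzyCompare):
#include <QtGlobal>
#include <cstdio>
int main() {
    double p1 = 0.0, p2 = 1e-13;
    std::printf("%d\n", qFuzzyCompare(p1, p2));             // 0: fails because one operand is 0.0
    std::printf("%d\n", qFuzzyCompare(1.0 + p1, 1.0 + p2)); // 1: the add-1.0 workaround
    return 0;
}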
Unfortunately, even your "wasteful" code is incorrect. EPSILON is the smallest value that could be added to 1.0 and change its value. The value 1.0 is very important — larger numbers do not change when added to EPSILON. Now, you can scale this value to the numbers you are comparing to tell whether they are different or not. The correct expression for comparing two doubles is:
if (fabs(a - b) <= DBL_EPSILON * fmax(fabs(a), fabs(b)))
{
// ...
}
This is at a minimum. In general, though, you would want to account for noise in your calculations and ignore a few of the least significant bits, so a more realistic comparison would look like:
if (fabs(a - b) <= 16 * DBL_EPSILON * fmax(fabs(a), fabs(b)))
{
// ...
}
If comparison performance is very important to you and you know the range of your values, then you should use fixed-point numbers instead.
General-purpose comparison of floating-point numbers is generally meaningless. How to compare really depends on the problem at hand. In many problems, numbers are sufficiently discretized to allow comparing them within a given tolerance. Unfortunately, there are just as many problems where such a trick doesn't really work. For one example, consider working with a Heaviside (step) function of the number in question (digital stock options come to mind) when your observations are very close to the barrier. Performing tolerance-based comparison wouldn't do much good, as it would effectively shift the issue from the original barrier to two new ones. Again, there is no general-purpose solution for such problems, and the particular solution might require going as far as changing the numerical method in order to achieve stability.
You have to do this processing for floating point comparison, since floats can't be perfectly compared the way integer types can. Here are functions for the various comparison operators.
Floating Point Equal to (==)
I also prefer the subtraction technique rather than relying on fabs() or abs(), but I'd have to speed profile it on various architectures from 64-bit PC to ATMega328 microcontroller (Arduino) to really see if it makes much of a performance difference.
So, let's forget about all this absolute value stuff and just do some subtraction and comparison!
Modified from Microsoft's example here:
/// @brief See if two floating point numbers are approximately equal.
/// @param[in] a number 1
/// @param[in] b number 2
/// @param[in] epsilon A small value such that if the difference between the two numbers is
/// smaller than this they can safely be considered to be equal.
/// @return true if the two numbers are approximately equal, and false otherwise
bool is_float_eq(float a, float b, float epsilon) {
return ((a - b) < epsilon) && ((b - a) < epsilon);
}
bool is_double_eq(double a, double b, double epsilon) {
return ((a - b) < epsilon) && ((b - a) < epsilon);
}
Example usage:
constexpr float EPSILON = 0.0001; // 1e-4
is_float_eq(1.0001, 0.99998, EPSILON);
I'm not entirely sure, but it seems to me some of the criticisms of the epsilon-based approach, as described in the comments below this highly-upvoted answer, can be resolved by using a variable epsilon, scaled according to the floating point values being compared, like this:
float a = 1.0001;
float b = 0.99998;
float epsilon = std::max(std::fabs(a), std::fabs(b)) * 1e-4;
is_float_eq(a, b, epsilon);
This way, the epsilon value scales with the floating point values and is therefore never so small of a value that it becomes insignificant.
For completeness, let's add the rest:
Greater than (>), and less than (<):
/// @brief See if floating point number `a` is > `b`
/// @param[in] a number 1
/// @param[in] b number 2
/// @param[in] epsilon a small value such that if `a` is > `b` by this amount, `a` is considered
/// to be definitively > `b`
/// @return true if `a` is definitively > `b`, and false otherwise
bool is_float_gt(float a, float b, float epsilon) {
return a > b + epsilon;
}
bool is_double_gt(double a, double b, double epsilon) {
return a > b + epsilon;
}
/// @brief See if floating point number `a` is < `b`
/// @param[in] a number 1
/// @param[in] b number 2
/// @param[in] epsilon a small value such that if `a` is < `b` by this amount, `a` is considered
/// to be definitively < `b`
/// @return true if `a` is definitively < `b`, and false otherwise
bool is_float_lt(float a, float b, float epsilon) {
return a < b - epsilon;
}
bool is_double_lt(double a, double b, double epsilon) {
return a < b - epsilon;
}
Greater than or equal to (>=), and less than or equal to (<=)
/// @brief Returns true if `a` is definitively >= `b`, and false otherwise
bool is_float_ge(float a, float b, float epsilon) {
return a > b - epsilon;
}
bool is_double_ge(double a, double b, double epsilon) {
return a > b - epsilon;
}
/// @brief Returns true if `a` is definitively <= `b`, and false otherwise
bool is_float_le(float a, float b, float epsilon) {
return a < b + epsilon;
}
bool is_double_le(double a, double b, double epsilon) {
return a < b + epsilon;
}
Additional improvements:
A good default value for epsilon in C++ is std::numeric_limits<T>::epsilon(), which evaluates to either 0 or FLT_EPSILON, DBL_EPSILON, or LDBL_EPSILON. See here: https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon. You can also see the float.h header for FLT_EPSILON, DBL_EPSILON, and LDBL_EPSILON.
See https://en.cppreference.com/w/cpp/header/cfloat and
https://www.cplusplus.com/reference/cfloat/
You could template the functions instead, to handle all floating point types: float, double, and long double, with type checks for these types via a static_assert() inside the template.
Scaling the epsilon value is a good idea to ensure it works for really large and really small a and b values. This article recommends and explains it: http://realtimecollisiondetection.net/blog/?p=89. So, you should scale epsilon by a scaling value equal to max(1.0, abs(a), abs(b)), as that article explains. Otherwise, as a and/or b increase in magnitude, the epsilon would eventually become so small relative to those values that it becomes lost in the floating point error. So, we scale it to become larger in magnitude like they are. However, using 1.0 as the smallest allowed scaling factor for epsilon also ensures that for really small-magnitude a and b values, epsilon itself doesn't get scaled so small that it also becomes lost in the floating point error. So, we limit the minimum scaling factor to 1.0.
If you want to "encapsulate" the above functions into a class, don't. Instead, wrap them up in a namespace if you like in order to namespace them. Ex: if you put all of the stand-alone functions into a namespace called float_comparison, then you could access the is_eq() function like this, for instance: float_comparison::is_eq(1.0, 1.5);.
It might also be nice to add comparisons against zero, not just comparisons between two values.
So, here is a better type of solution with the above improvements in place:
#include <algorithm>
#include <cmath>
#include <limits>
#include <type_traits>
namespace float_comparison {
/// Scale the epsilon value to become large for large-magnitude a or b,
/// but no smaller than 1.0, per the explanation above, to ensure that
/// epsilon doesn't ever fall out in floating point error as a and/or b
/// increase in magnitude.
template<typename T>
static constexpr T scale_epsilon(T a, T b, T epsilon =
std::numeric_limits<T>::epsilon()) noexcept
{
static_assert(std::is_floating_point_v<T>, "Floating point comparisons "
"require type float, double, or long double.");
T scaling_factor;
// Special case for when a or b is infinity
if (std::isinf(a) || std::isinf(b))
{
scaling_factor = 0;
}
else
{
scaling_factor = std::max({(T)1.0, std::abs(a), std::abs(b)});
}
T epsilon_scaled = scaling_factor * std::abs(epsilon);
return epsilon_scaled;
}
// Compare two values
/// Equal: returns true if a is approximately == b, and false otherwise
template<typename T>
static constexpr bool is_eq(T a, T b, T epsilon =
std::numeric_limits<T>::epsilon()) noexcept
{
static_assert(std::is_floating_point_v<T>, "Floating point comparisons "
"require type float, double, or long double.");
// test `a == b` first to see if both a and b are either infinity
// or -infinity
return a == b || std::abs(a - b) <= scale_epsilon(a, b, epsilon);
}
/*
etc. etc.:
is_eq()
is_ne()
is_lt()
is_le()
is_gt()
is_ge()
*/
// Compare against zero
/// Equal: returns true if a is approximately == 0, and false otherwise
template<typename T>
static constexpr bool is_eq_zero(T a, T epsilon =
std::numeric_limits<T>::epsilon()) noexcept
{
static_assert(std::is_floating_point_v<T>, "Floating point comparisons "
"require type float, double, or long double.");
return is_eq(a, (T)0.0, epsilon);
}
/*
etc. etc.:
is_eq_zero()
is_ne_zero()
is_lt_zero()
is_le_zero()
is_gt_zero()
is_ge_zero()
*/
} // namespace float_comparison
See also:
The macro forms of some of the functions above in my repo here: utilities.h.
UPDATE 29 NOV 2020: it's a work-in-progress, and I'm going to make it a separate answer when ready, but I've produced a better, scaled-epsilon version of all of the functions in C in this file here: utilities.c. Take a look.
ADDITIONAL READING I needed to do (and now have done): Floating-point tolerances revisited, by Christer Ericson. VERY USEFUL ARTICLE! It talks about scaling epsilon in order to ensure it never falls out in floating point error, even for really large-magnitude a and/or b values!
My class based on previously posted answers. Very similar to Google's code but I use a bias which pushes all NaN values above 0xFF000000. That allows a faster check for NaN.
This code is meant to demonstrate the concept, not be a general solution. Google's code already shows how to compute all the platform specific values and I didn't want to duplicate all that. I've done limited testing on this code.
typedef unsigned int U32;
// Float Memory Bias (unsigned)
// ----- ------ ---------------
// NaN 0xFFFFFFFF 0xFF800001
// NaN 0xFF800001 0xFFFFFFFF
// -Infinity 0xFF800000 0x00000000 ---
// -3.40282e+038 0xFF7FFFFF 0x00000001 |
// -1.40130e-045 0x80000001 0x7F7FFFFF |
// -0.0 0x80000000 0x7F800000 |--- Valid <= 0xFF000000.
// 0.0 0x00000000 0x7F800000 | NaN > 0xFF000000
// 1.40130e-045 0x00000001 0x7F800001 |
// 3.40282e+038 0x7F7FFFFF 0xFEFFFFFF |
// Infinity 0x7F800000 0xFF000000 ---
// NaN 0x7F800001 0xFF000001
// NaN 0x7FFFFFFF 0xFF7FFFFF
//
// Either value of NaN returns false.
// -Infinity and +Infinity are not "close".
// -0 and +0 are equal.
//
class CompareFloat{
public:
union{
float m_f32;
U32 m_u32;
};
static bool IsClose( float A, float B, U32 unitsDelta = 4 )
{
U32 a = CompareFloat::GetBiased( A );
U32 b = CompareFloat::GetBiased( B );
if ( (a > 0xFF000000) || (b > 0xFF000000) )
{
return( false );
}
return( ((a > b) ? (a - b) : (b - a)) < unitsDelta ); // unsigned-safe |a - b|
}
protected:
static U32 GetBiased( float f )
{
U32 r = ((CompareFloat*)&f)->m_u32;
if ( r & 0x80000000 )
{
return( ~r - 0x007FFFFF );
}
return( r + 0x7F800000 );
}
};
I'd be very wary of any of these answers that involves floating point subtraction (e.g., fabs(a-b) < epsilon). First, the floating point numbers become more sparse at greater magnitudes and at high enough magnitudes where the spacing is greater than epsilon, you might as well just be doing a == b. Second, subtracting two very close floating point numbers (as these will tend to be, given that you're looking for near equality) is exactly how you get catastrophic cancellation.
While not portable, I think grom's answer does the best job of avoiding these issues.
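A quick sketch of the magnitude problem (values assume IEEE-754 binary64): at 1e16 the spacing between adjacent doubles is already 2.0, so fabs(a-b) < epsilon with a small fixed epsilon degenerates into a == b:
#include <cmath>
#include <cstdio>
int main() {
    double a = 1e16;
    double b = std::nextafter(a, INFINITY); // the adjacent double, one ULP away
    std::printf("gap at 1e16 = %g\n", b - a);     // 2
    std::printf("%d\n", std::fabs(a - b) < 1e-9); // 0: even adjacent values compare "not equal"
    return 0;
}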
There are actually cases in numerical software where you want to check whether two floating point numbers are exactly equal. I posted this on a similar question
https://stackoverflow.com/a/10973098/1447411
So you cannot say that "CompareDoubles1" is wrong in general.
In terms of the scale of quantities:
If epsilon is a small fraction of the magnitude of a quantity (i.e. a relative value) in some physical sense, and the A and B types are comparable in the same sense, then I think the following is quite correct:
#include <limits>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <cassert>
template< typename A, typename B >
inline
bool close_enough(A const & a, B const & b,
typename std::common_type< A, B >::type const & epsilon)
{
using std::isless;
assert(isless(0, epsilon)); // epsilon is a part of the whole quantity
assert(isless(epsilon, 1));
using std::abs;
auto const delta = abs(a - b);
auto const x = abs(a);
auto const y = abs(b);
// comparable generally and |a - b| < eps * (|a| + |b|) / 2
return isless(epsilon * y, x) && isless(epsilon * x, y) && isless((delta + delta) / (x + y), epsilon);
}
int main()
{
std::cout << std::boolalpha << close_enough(0.9, 1.0, 0.1) << std::endl;
std::cout << std::boolalpha << close_enough(1.0, 1.1, 0.1) << std::endl;
std::cout << std::boolalpha << close_enough(1.1, 1.2, 0.01) << std::endl;
std::cout << std::boolalpha << close_enough(1.0001, 1.0002, 0.01) << std::endl;
std::cout << std::boolalpha << close_enough(1.0, 0.01, 0.1) << std::endl;
return EXIT_SUCCESS;
}
I use this code:
bool AlmostEqual(double v1, double v2)
{
return (std::fabs(v1 - v2) < std::fabs(std::min(v1, v2)) * std::numeric_limits<double>::epsilon());
}
Found another interesting implementation on: https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon
#include <cmath>
#include <limits>
#include <iomanip>
#include <iostream>
#include <type_traits>
#include <algorithm>
template<class T>
typename std::enable_if<!std::numeric_limits<T>::is_integer, bool>::type
almost_equal(T x, T y, int ulp)
{
// the machine epsilon has to be scaled to the magnitude of the values used
// and multiplied by the desired precision in ULPs (units in the last place)
return std::fabs(x-y) <= std::numeric_limits<T>::epsilon() * std::fabs(x+y) * ulp
// unless the result is subnormal
|| std::fabs(x-y) < std::numeric_limits<T>::min();
}
int main()
{
double d1 = 0.2;
double d2 = 1 / std::sqrt(5) / std::sqrt(5);
std::cout << std::fixed << std::setprecision(20)
<< "d1=" << d1 << "\nd2=" << d2 << '\n';
if(d1 == d2)
std::cout << "d1 == d2\n";
else
std::cout << "d1 != d2\n";
if(almost_equal(d1, d2, 2))
std::cout << "d1 almost equals d2\n";
else
std::cout << "d1 does not almost equal d2\n";
}
In a more generic way:
template <typename T>
bool compareNumber(const T& a, const T& b) {
return std::abs(a - b) < std::numeric_limits<T>::epsilon();
}
Note:
As pointed out by @SirGuy, this approach is flawed.
I am leaving this answer here as an example not to follow.
I use this code. Unlike the above answers, this allows one to give an abs_relative_error that is explained in the comments of the code.
The first version compares complex numbers, so that the error
can be explained in terms of the angle between two "vectors"
of the same length in the complex plane (which gives a little
insight). Then from there the correct formula for two real
numbers follows.
https://github.com/CarloWood/ai-utils/blob/master/almost_equal.h
The latter then is
template<class T>
typename std::enable_if<std::is_floating_point<T>::value, bool>::type
almost_equal(T x, T y, T const abs_relative_error)
{
return 2 * std::abs(x - y) <= abs_relative_error * std::abs(x + y);
}
where abs_relative_error is basically (twice) the absolute value of what comes closest to being defined in the literature: a relative error. But that is just the choice of the name.
What this really means is seen most clearly in the complex plane, I think. If |x| = 1, and y lies in a circle around x with diameter abs_relative_error, then the two are considered equal.
I use the following function for floating-point numbers comparison:
bool approximatelyEqual(double a, double b)
{
return fabs(a - b) <= ((fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * std::numeric_limits<double>::epsilon());
}
It depends on how precise you want the comparison to be. If you want to compare for exactly the same number, then just go with ==. (You almost never want to do this unless you actually want exactly the same number.) On any decent platform you can also do the following:
diff = a - b; return fabs(diff) < EPSILON;
as fabs tends to be pretty fast. By pretty fast I mean it is basically a bitwise AND, so it better be fast.
And integer tricks for comparing doubles and floats are nice but tend to make it more difficult for the various CPU pipelines to handle effectively. And it's definitely not faster on certain in-order architectures these days due to using the stack as a temporary storage area for values that are being used frequently. (Load-hit-store for those who care.)
/// testing whether two doubles are almost equal. We consider two doubles
/// equal if the difference is within the range [0, epsilon).
///
/// epsilon: a positive number (supposed to be small)
///
/// if either x or y is 0, then we are comparing the absolute difference to
/// epsilon.
/// if both x and y are non-zero, then we are comparing the relative difference
/// to epsilon.
bool almost_equal(double x, double y, double epsilon)
{
double diff = x - y;
if (x != 0 && y != 0){
diff = diff/y;
}
if (diff < epsilon && -1.0*diff < epsilon){
return true;
}
return false;
}
I used this function for my small project and it works, but note the following:
Double precision error can create a surprise for you. Let's say epsilon = 1.0e-6; then 1.0 and 1.000001 should NOT be considered equal according to the above code, but on my machine the function considers them to be equal. This is because 1.000001 cannot be translated precisely to a binary format; it is probably 1.0000009xxx. I tested it with 1.0 and 1.0000011, and this time I got the expected result.
You cannot compare two double with a fixed EPSILON. Depending on the value of double, EPSILON varies.
A better double comparison would be:
bool same(double a, double b)
{
return std::nextafter(a, std::numeric_limits<double>::lowest()) <= b
&& std::nextafter(a, std::numeric_limits<double>::max()) >= b;
}
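For illustration (assuming the same() function above is in scope), this one-ULP test accepts the classic 0.1 + 0.2 vs 0.3 case, where the two doubles are exactly one representable value apart:
#include <cstdio>
int main() {
    std::printf("%d\n", 0.1 + 0.2 == 0.3);     // 0: exact comparison fails
    std::printf("%d\n", same(0.1 + 0.2, 0.3)); // 1: within one ULP
    return 0;
}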
My way may not be correct, but it is useful:
Convert both floats to strings and then do a string compare
bool IsFlaotEqual(float a, float b, int decimal)
{
TCHAR form[50] = _T("");
_stprintf(form, _T("%%.%df"), decimal);
TCHAR a1[30] = _T(""), a2[30] = _T("");
_stprintf(a1, form, a);
_stprintf(a2, form, b);
if( _tcscmp(a1, a2) == 0 )
return true;
return false;
}
Operator overloading can also be done.
I wrote this for Java, but maybe you will find it useful. It uses longs instead of doubles, but takes care of NaNs, subnormals, etc.
public static boolean equal(double a, double b) {
final long fm = 0xFFFFFFFFFFFFFL; // fraction mask
final long sm = 0x8000000000000000L; // sign mask
final long cm = 0x8000000000000L; // most significant decimal bit mask
long c = Double.doubleToLongBits(a), d = Double.doubleToLongBits(b);
int ea = (int) (c >> 52 & 2047), eb = (int) (d >> 52 & 2047);
if (ea == 2047 && (c & fm) != 0 || eb == 2047 && (d & fm) != 0) return false; // NaN
if (c == d) return true; // identical - fast check
if (ea == 0 && eb == 0) return true; // ±0 or subnormals
if ((c & sm) != (d & sm)) return false; // different signs
if (abs(ea - eb) > 1) return false; // b > 2*a or a > 2*b
d <<= 12; c <<= 12;
if (ea < eb) c = c >> 1 | sm;
else if (ea > eb) d = d >> 1 | sm;
c -= d;
return c < 65536 && c > -65536; // don't use abs(), because:
// There is a possibility c=0x8000000000000000 which cannot be converted to positive
}
public static boolean zero(double a) { return (Double.doubleToLongBits(a) >> 52 & 2047) < 3; }
Keep in mind that after a number of floating-point operations, a number can be very different from what we expect. There is no code to fix that.