round() for float in C++

I need a simple floating point rounding function, thus:
double round(double);
round(0.1) = 0
round(-0.1) = 0
round(-0.9) = -1
I can find ceil() and floor() in the math.h - but not round().
Is it present in the standard C++ library under another name, or is it missing?

Editor's Note: The following answer provides a simplistic solution that contains several implementation flaws (see Shafik Yaghmour's answer for a full explanation). Note that C++11 includes std::round, std::lround, and std::llround as builtins already.
There's no round() in the C++98 standard library. You can write one yourself though. The following is an implementation of round-half-up:
double round(double d)
{
    return floor(d + 0.5);
}
The probable reason there is no round function in the C++98 standard library is that it can in fact be implemented in different ways. The above is one common way but there are others such as round-to-even, which is less biased and generally better if you're going to do a lot of rounding; it's a bit more complex to implement though.
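For illustration, here is a minimal round-half-to-even sketch built on the same floor-based idea. It inherits the edge cases of computing d + 0.5 that are discussed further down (e.g. 0.49999999999999994), so treat it as a demonstration of the tie-break rule rather than a production implementation:
#include <cmath>

// Round-half-to-even sketch: round half up, then step back to the even
// neighbour when the input was exactly halfway between two integers.
double round_half_even(double d)
{
    double r = std::floor(d + 0.5);
    if (r - d == 0.5 && std::fmod(r, 2.0) != 0.0)
        r -= 1.0;
    return r;
}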

The C++03 standard relies on the C90 standard for what the standard calls the Standard C Library which is covered in the draft C++03 standard (closest publicly available draft standard to C++03 is N1804) section 1.2 Normative references:
The library described in clause 7 of ISO/IEC 9899:1990 and clause 7 of
ISO/IEC 9899/Amd.1:1995 is hereinafter called the Standard C
Library.1)
If we go to the C documentation for round, lround, llround on cppreference we can see that round and related functions are part of C99 and thus won't be available in C++03 or prior.
In C++11 this changes, since C++11 relies on the C99 draft standard for the C standard library and therefore provides std::round and, for integral return types, std::lround and std::llround:
#include <iostream>
#include <cmath>

int main()
{
    std::cout << std::round(0.4) << " " << std::lround(0.4) << " " << std::llround(0.4) << std::endl;
    std::cout << std::round(0.5) << " " << std::lround(0.5) << " " << std::llround(0.5) << std::endl;
    std::cout << std::round(0.6) << " " << std::lround(0.6) << " " << std::llround(0.6) << std::endl;
}
Another option also from C99 would be std::trunc which:
Computes nearest integer not greater in magnitude than arg.
#include <iostream>
#include <cmath>

int main()
{
    std::cout << std::trunc(0.4) << std::endl;
    std::cout << std::trunc(0.9) << std::endl;
    std::cout << std::trunc(1.1) << std::endl;
}
If you need to support non-C++11 applications, your best bet would be to use boost round, iround, lround, llround or boost trunc.
Rolling your own version of round is hard
Rolling your own is probably not worth the effort, as Harder than it looks: rounding float to nearest integer, part 1, Rounding float to nearest integer, part 2 and Rounding float to nearest integer, part 3 explain:
For example, a common roll-your-own implementation using std::floor and adding 0.5 does not work for all inputs:
double myround(double d)
{
    return std::floor(d + 0.5);
}
One input this will fail for is 0.49999999999999994, (see it live).
Another common implementation involves casting a floating point type to an integral type, which can invoke undefined behavior in the case where the integral part can not be represented in the destination type. We can see this from the draft C++ standard section 4.9 Floating-integral conversions which says (emphasis mine):
A prvalue of a floating point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part
is discarded. The behavior is undefined if the truncated value cannot
be represented in the destination type.[...]
For example:
float myround(float f)
{
    return static_cast<float>( static_cast<unsigned int>( f ) );
}
Given std::numeric_limits<unsigned int>::max() is 4294967295 then the following call:
myround( 4294967296.5f )
will cause overflow, (see it live).
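One way to avoid that undefined behavior is to range-check before the cast; a sketch (checked_round_to_uint is a made-up helper, not a standard function):
#include <cmath>
#include <limits>
#include <stdexcept>

// Round, then cast only if the result actually fits in unsigned int.
unsigned int checked_round_to_uint(double d)
{
    double r = std::floor(d + 0.5);
    if (r < 0.0 || r > static_cast<double>(std::numeric_limits<unsigned int>::max()))
        throw std::out_of_range("rounded value does not fit in unsigned int");
    return static_cast<unsigned int>(r);
}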
We can see how difficult this really is by looking at this answer to Concise way to implement round() in C?, which references newlib's version of single-precision float round. It is a very long function for something that seems simple. It seems unlikely that anyone without intimate knowledge of floating point implementations could correctly implement it:
float roundf(float x)
{
    int signbit;
    __uint32_t w;
    /* Most significant word, least significant word. */
    int exponent_less_127;

    GET_FLOAT_WORD(w, x);

    /* Extract sign bit. */
    signbit = w & 0x80000000;

    /* Extract exponent field. */
    exponent_less_127 = (int)((w & 0x7f800000) >> 23) - 127;

    if (exponent_less_127 < 23)
    {
        if (exponent_less_127 < 0)
        {
            w &= 0x80000000;
            if (exponent_less_127 == -1)
                /* Result is +1.0 or -1.0. */
                w |= ((__uint32_t)127 << 23);
        }
        else
        {
            unsigned int exponent_mask = 0x007fffff >> exponent_less_127;
            if ((w & exponent_mask) == 0)
                /* x has an integral value. */
                return x;

            w += 0x00400000 >> exponent_less_127;
            w &= ~exponent_mask;
        }
    }
    else
    {
        if (exponent_less_127 == 128)
            /* x is NaN or infinite. */
            return x + x;
        else
            return x;
    }

    SET_FLOAT_WORD(x, w);
    return x;
}
On the other hand if none of the other solutions are usable newlib could potentially be an option since it is a well tested implementation.

Boost offers a simple set of rounding functions.
#include <boost/math/special_functions/round.hpp>
double a = boost::math::round(1.5); // Yields 2.0
int b = boost::math::iround(1.5); // Yields 2 as an integer
For more information, see the Boost documentation.
Edit: Since C++11, there are std::round, std::lround, and std::llround.

It may be worth noting that if you wanted an integer result from the rounding you don't need to pass it through either ceil or floor. I.e.,
int round_int( double r ) {
    return (r > 0.0) ? (r + 0.5) : (r - 0.5);
}
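A quick check of its behavior (the implicit conversion in the return statement truncates, so this is round-half-away-from-zero for values that fit in an int):
#include <iostream>

int round_int( double r ) {
    return (r > 0.0) ? (r + 0.5) : (r - 0.5);
}

int main() {
    std::cout << round_int(2.5) << ' ' << round_int(-2.5) << '\n'; // prints: 3 -3
}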

It's available since C++11 in cmath (according to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf)
#include <cmath>
#include <iostream>

int main(int argc, char** argv) {
    std::cout << "round(0.5):\t" << round(0.5) << std::endl;
    std::cout << "round(-0.5):\t" << round(-0.5) << std::endl;
    std::cout << "round(1.4):\t" << round(1.4) << std::endl;
    std::cout << "round(-1.4):\t" << round(-1.4) << std::endl;
    std::cout << "round(1.6):\t" << round(1.6) << std::endl;
    std::cout << "round(-1.6):\t" << round(-1.6) << std::endl;
    return 0;
}
Output:
round(0.5): 1
round(-0.5): -1
round(1.4): 1
round(-1.4): -1
round(1.6): 2
round(-1.6): -2

It's usually implemented as floor(value + 0.5).
Edit: and it's probably not called round since there are at least three rounding algorithms I know of: round to zero, round to closest integer, and banker's rounding. You are asking for round to closest integer.

There are two problems we are looking at:
rounding conversions
type conversion.
Rounding conversion means rounding a ± float/double to the nearest floor/ceil float/double.
Maybe your problem ends here.
But if you are expected to return an int/long, you need to perform a type conversion, and thus the "overflow" problem might hit your solution. So, do a check for errors in your function:
long round(double x) {
    assert(x >= LONG_MIN - 0.5);
    assert(x <= LONG_MAX + 0.5);
    if (x >= 0)
        return (long) (x + 0.5);
    return (long) (x - 0.5);
}
#define round(x) ((x) < LONG_MIN-0.5 || (x) > LONG_MAX+0.5 ?\
    error() : ((x)>=0?(long)((x)+0.5):(long)((x)-0.5)))
From: http://www.cs.tut.fi/~jkorpela/round.html

A certain type of rounding is also implemented in Boost:
#include <iostream>
#include <boost/numeric/conversion/converter.hpp>

template<typename T, typename S> T round2(const S& x) {
    typedef boost::numeric::conversion_traits<T, S> Traits;
    typedef boost::numeric::def_overflow_handler OverflowHandler;
    typedef boost::numeric::RoundEven<typename Traits::source_type> Rounder;
    typedef boost::numeric::converter<T, S, Traits, OverflowHandler, Rounder> Converter;
    return Converter::convert(x);
}

int main() {
    std::cout << round2<int, double>(0.1) << ' ' << round2<int, double>(-0.1) << ' ' << round2<int, double>(-0.9) << std::endl;
}
Note that this works only if you do a to-integer conversion.

You could round to n digits of precision with:
double round( double x )
{
    const double sd = 1000; // for accuracy to 3 decimal places
    // note: the int cast limits the usable range of x to roughly INT_MAX / sd
    return int(x * sd + (x < 0 ? -0.5 : 0.5)) / sd;
}

These days it shouldn't be a problem to use a C++11 compiler which includes a C99/C++11 math library. But then the question becomes: which rounding function do you pick?
C99/C++11 round() is often not actually the rounding function you want. It uses a funky rounding mode that rounds away from 0 as a tie-break on half-way cases (+-xxx.5000). If you do specifically want that rounding mode, or you're targeting a C++ implementation where round() is faster than rint(), then use it (or emulate its behaviour with one of the other answers on this question which took it at face value and carefully reproduced that specific rounding behaviour.)
round()'s rounding is different from the IEEE754 default round to nearest mode with even as a tie-break. Nearest-even avoids statistical bias in the average magnitude of numbers, but does bias towards even numbers.
There are two math library rounding functions that use the current default rounding mode: std::nearbyint() and std::rint(), both added in C99/C++11, so they're available any time std::round() is. The only difference is that nearbyint never raises FE_INEXACT.
Prefer rint() for performance reasons: gcc and clang both inline it more easily, but gcc never inlines nearbyint() (even with -ffast-math)
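A small demo of the tie-break difference, assuming the default round-to-nearest-even FP environment (expected output in the trailing comment):
#include <cmath>
#include <iostream>

// round() breaks ties away from zero; rint()/nearbyint() follow the current
// rounding mode, which is round-to-nearest-even by default.
int main()
{
    for (double d : {0.5, 1.5, 2.5, -2.5}) {
        std::cout << d << ": round=" << std::round(d)
                  << " rint=" << std::rint(d)
                  << " nearbyint=" << std::nearbyint(d) << '\n';
    }
    // 0.5: round=1 rint=0 | 1.5: round=2 rint=2 | 2.5: round=3 rint=2 | -2.5: round=-3 rint=-2
}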
gcc/clang for x86-64 and AArch64
I put some test functions on Matt Godbolt's Compiler Explorer, where you can see source + asm output (for multiple compilers). For more about reading compiler output, see this Q&A, and Matt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.
In FP code, it's usually a big win to inline small functions. Especially on non-Windows, where the standard calling convention has no call-preserved registers, so the compiler can't keep any FP values in XMM registers across a call. So even if you don't really know asm, you can still easily see whether it's just a tail-call to the library function or whether it inlined to one or two math instructions. Anything that inlines to one or two instructions is better than a function call (for this particular task on x86 or ARM).
On x86, anything that inlines to SSE4.1 roundsd can auto-vectorize with SSE4.1 roundpd (or AVX vroundpd). (FP->integer conversions are also available in packed SIMD form, except for FP->64-bit integer which requires AVX512.)
std::nearbyint():
x86 clang: inlines to a single insn with -msse4.1.
x86 gcc: inlines to a single insn only with -msse4.1 -ffast-math, and only on gcc 5.4 and earlier. Later gcc never inlines it (maybe they didn't realize that one of the immediate bits can suppress the inexact exception? That's what clang uses, but older gcc uses the same immediate as for rint when it does inline it)
AArch64 gcc6.3: inlines to a single insn by default.
std::rint:
x86 clang: inlines to a single insn with -msse4.1
x86 gcc7: inlines to a single insn with -msse4.1. (Without SSE4.1, inlines to several instructions)
x86 gcc6.x and earlier: inlines to a single insn with -ffast-math -msse4.1.
AArch64 gcc: inlines to a single insn by default
std::round:
x86 clang: doesn't inline
x86 gcc: inlines to multiple instructions with -ffast-math -msse4.1, requiring two vector constants.
AArch64 gcc: inlines to a single instruction (HW support for this rounding mode as well as IEEE default and most others.)
std::floor / std::ceil / std::trunc
x86 clang: inlines to a single insn with -msse4.1
x86 gcc7.x: inlines to a single insn with -msse4.1
x86 gcc6.x and earlier: inlines to a single insn with -ffast-math -msse4.1
AArch64 gcc: inlines by default to a single instruction
Rounding to int / long / long long:
You have two options here: use lrint (like rint but returns long, or long long for llrint), or use an FP->FP rounding function and then convert to an integer type the normal way (with truncation). Some compilers optimize one way better than the other.
long l = lrint(x);
int i = (int)rint(x);
Note that int i = lrint(x) converts float or double -> long first, and then truncates the integer to int. This makes a difference for out-of-range integers: Undefined Behaviour in C++, but well-defined for the x86 FP -> int instructions (which the compiler will emit unless it sees the UB at compile time while doing constant propagation, then it's allowed to make code that breaks if it's ever executed).
On x86, an FP->integer conversion that overflows the integer produces INT_MIN or LLONG_MIN (a bit-pattern of 0x80000000 or the 64-bit equivalent, with just the sign-bit set). Intel calls this the "integer indefinite" value. (See the cvttsd2si manual entry, the SSE2 instruction that converts (with truncation) scalar double to signed integer. It's available with 32-bit or 64-bit integer destination (in 64-bit mode only).) There's also a cvtsd2si (convert with current rounding mode), which is what we'd like the compiler to emit, but unfortunately gcc and clang won't do that without -ffast-math.
Also beware that FP to/from unsigned int / long is less efficient on x86 (without AVX512). Conversion to 32-bit unsigned on a 64-bit machine is pretty cheap; just convert to 64-bit signed and truncate. But otherwise it's significantly slower.
x86 clang with/without -ffast-math -msse4.1: (int/long)rint inlines to roundsd / cvttsd2si. (missed optimization to cvtsd2si). lrint doesn't inline at all.
x86 gcc6.x and earlier without -ffast-math: neither way inlines
x86 gcc7 without -ffast-math: (int/long)rint rounds and converts separately (with 2 total instructions if SSE4.1 is enabled, otherwise with a bunch of code inlined for rint without roundsd). lrint doesn't inline.
x86 gcc with -ffast-math: all ways inline to cvtsd2si (optimal), no need for SSE4.1.
AArch64 gcc6.3 without -ffast-math: (int/long)rint inlines to 2 instructions. lrint doesn't inline
AArch64 gcc6.3 with -ffast-math: (int/long)rint compiles to a call to lrint. lrint doesn't inline. This may be a missed optimization unless the two instructions we get without -ffast-math are very slow.

If you ultimately want to convert the double output of your round() function to an int, then the accepted solutions of this question will look something like:
int roundint(double r) {
    return (int)((r > 0.0) ? floor(r + 0.5) : ceil(r - 0.5));
}
This clocks in at around 8.88 ns on my machine when passed in uniformly random values.
The below is functionally equivalent, as far as I can tell, but clocks in at 2.48 ns on my machine, for a significant performance advantage:
int roundint (double r) {
    int tmp = static_cast<int> (r);
    tmp += (r - tmp >= .5) - (r - tmp <= -.5);
    return tmp;
}
Among the reasons for the better performance is the skipped branching.

Beware of floor(x+0.5). Here is what can happen for odd numbers in range [2^52,2^53]:
-bash-3.2$ cat >test-round.c <<END
#include <math.h>
#include <stdio.h>
int main() {
    double x = 5000000000000001.0;
    double y = round(x);
    double z = floor(x + 0.5);
    printf(" x          =%f\n", x);
    printf("round(x)    =%f\n", y);
    printf("floor(x+0.5)=%f\n", z);
    return 0;
}
END
-bash-3.2$ gcc test-round.c
-bash-3.2$ ./a.out
x =5000000000000001.000000
round(x) =5000000000000001.000000
floor(x+0.5)=5000000000000002.000000
This is http://bugs.squeak.org/view.php?id=7134. Use a solution like the one from @konik.
My own robust version would be something like:
double round(double x)
{
    double truncated, roundedFraction;
    double fraction = modf(x, &truncated);
    modf(2.0 * fraction, &roundedFraction);
    return truncated + roundedFraction;
}
Another reason to avoid floor(x+0.5) is given here.

There is no need to implement anything, so I'm not sure why so many answers involve defines, functions, or methods.
In C99
We have the following, and the header <tgmath.h> for type-generic macros.
#include <math.h>
double round (double x);
float roundf (float x);
long double roundl (long double x);
If you cannot compile this, you have probably left out the math library. A command similar to this works on every C compiler I have (several).
gcc -lm -std=c99 ...
In C++11
We have the following and additional overloads in #include <cmath> that rely on IEEE double precision floating point.
#include <cmath>
double round (double x);
float round (float x);
long double round (long double x);
double round (T x); // T: an integral type
There are equivalents in the std namespace too.
If you cannot compile this, you may be using C compilation instead of C++. The following basic command produces neither errors nor warnings with g++ 6.3.1, x86_64-w64-mingw32-g++ 6.3.0, clang-x86_64++ 3.8.0, and Visual C++ 2015 Community.
g++ -std=c++11 -Wall
With Ordinal Division
When dividing two ordinal numbers, where T is short, int, long, or another ordinal, the rounding expression is this.
T roundedQuotient = (2 * integerNumerator + integerDenominator)
                    / (2 * integerDenominator);
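For example, a quick check of that expression (it assumes nonnegative operands):
#include <iostream>

int main() {
    int integerNumerator = 7, integerDenominator = 2; // 7/2 = 3.5
    int roundedQuotient = (2 * integerNumerator + integerDenominator)
                          / (2 * integerDenominator);
    std::cout << roundedQuotient << '\n'; // prints 4
}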
Accuracy
There is no doubt that odd-looking inaccuracies appear in floating point operations, but these show up when the numbers are converted for display, and have little to do with rounding.
The source is not just the number of significant digits in the mantissa of the IEEE representation of a floating point number; it is related to our decimal thinking as humans.
Ten is the product of five and two, and 5 and 2 are relatively prime. Therefore decimal fractions cannot, in general, be represented exactly by binary IEEE floating point representations.
This is not an issue with the rounding algorithms. It is mathematical reality that should be considered during the selection of types and the design of computations, data entry, and display of numbers. If an application displays the digits that show these decimal-binary conversion issues, then the application is visually expressing accuracy that does not exist in digital reality and should be changed.

Function double round(double) with the use of the modf function:
double round(double x)
{
    using namespace std;
    if ((numeric_limits<double>::max() - 0.5) <= x)
        return numeric_limits<double>::max();
    if ((-1 * std::numeric_limits<double>::max() + 0.5) > x)
        return (-1 * std::numeric_limits<double>::max());
    double intpart;
    double fractpart = modf(x, &intpart);
    if (fractpart >= 0.5)
        return (intpart + 1);
    else if (fractpart >= -0.5)
        return intpart;
    else
        return (intpart - 1);
}
To compile cleanly, the includes "math.h" and "limits" are necessary. The function works according to the following rounding schema:
round of 5.0 is 5.0
round of 3.8 is 4.0
round of 2.3 is 2.0
round of 1.5 is 2.0
round of 0.501 is 1.0
round of 0.5 is 1.0
round of 0.499 is 0.0
round of 0.01 is 0.0
round of 0.0 is 0.0
round of -0.01 is -0.0
round of -0.499 is -0.0
round of -0.5 is -0.0
round of -0.501 is -1.0
round of -1.5 is -1.0
round of -2.3 is -2.0
round of -3.8 is -4.0
round of -5.0 is -5.0

If you need to be able to compile code in environments that support the C++11 standard, but also need to be able to compile that same code in environments that don't support it, you could use a function macro to choose between std::round() and a custom function for each system. Just pass -DCPP11 or /DCPP11 to the C++11-compliant compiler (or use its built-in version macros), and make a header like this:
// File: rounding.h
#include <cmath>

#ifdef CPP11
#define ROUND(x) std::round(x)
#else /* CPP11 */
inline double myRound(double x) {
    return (x >= 0.0 ? std::floor(x + 0.5) : std::ceil(x - 0.5));
}
#define ROUND(x) myRound(x)
#endif /* CPP11 */
For a quick example, see http://ideone.com/zal709 .
This approximates std::round() in environments that aren't C++11-compliant, including preservation of the sign bit for -0.0. It may cause a slight performance hit, however, and will likely have issues with rounding certain known "problem" floating-point values such as 0.49999999999999994 or similar values.
Alternatively, if you have access to a C++11-compliant compiler, you could just grab std::round() from its <cmath> header, and use it to make your own header that defines the function if it's not already defined. Note that this may not be an optimal solution, however, especially if you need to compile for multiple platforms.

Based on Kalaxy's response, the following is a templated solution that rounds any floating point number to the nearest integer type based on natural rounding. It also throws an error in debug mode if the value is out of range of the integer type, thereby serving roughly as a viable library function.
// round a floating point number to the nearest integer
template <typename Arg>
int Round(Arg arg)
{
#ifndef NDEBUG
    // check that the argument can be rounded given the return type:
    if (
        ((Arg)std::numeric_limits<int>::max() < arg + (Arg) 0.5) ||
        ((Arg)std::numeric_limits<int>::lowest() > arg - (Arg) 0.5)
    )
    {
        throw std::overflow_error("out of bounds");
    }
#endif
    return (arg > (Arg) 0.0) ? (int)(arg + (Arg) 0.5) : (int)(arg - (Arg) 0.5);
}

As pointed out in comments and other answers, the ISO C++ standard library did not add round() until ISO C++11, when this function was pulled in by reference to the ISO C99 standard math library.
For positive operands in [½, ub] round(x) == floor (x + 0.5), where ub is 2^23 for float when mapped to IEEE-754 (2008) binary32, and 2^52 for double when it is mapped to IEEE-754 (2008) binary64. The numbers 23 and 52 correspond to the number of stored mantissa bits in these two floating-point formats. For positive operands in [+0, ½) round(x) == 0, and for positive operands in (ub, +∞] round(x) == x. As the function is odd (symmetric about the origin), negative arguments x can be handled according to round(-x) == -round(x).
This leads to the compact code below. It compiles into a reasonable number of machine instructions across various platforms. I observed the most compact code on GPUs, where my_roundf() requires about a dozen instructions. Depending on processor architecture and toolchain, this floating-point based approach could be either faster or slower than the integer-based implementation from newlib referenced in a different answer.
I tested my_roundf() exhaustively against the newlib roundf() implementation using Intel compiler version 13, with both /fp:strict and /fp:fast. I also checked that the newlib version matches the roundf() in the mathimf library of the Intel compiler. Exhaustive testing is not possible for double-precision round(), however the code is structurally identical to the single-precision implementation.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float my_roundf (float x)
{
    const float half = 0.5f;
    const float one = 2 * half;
    const float lbound = half;
    const float ubound = 1L << 23;
    float a, f, r, s, t;
    s = (x < 0) ? (-one) : one;
    a = x * s;
    t = (a < lbound) ? x : s;
    f = (a < lbound) ? 0 : floorf (a + half);
    r = (a > ubound) ? x : (t * f);
    return r;
}

double my_round (double x)
{
    const double half = 0.5;
    const double one = 2 * half;
    const double lbound = half;
    const double ubound = 1ULL << 52;
    double a, f, r, s, t;
    s = (x < 0) ? (-one) : one;
    a = x * s;
    t = (a < lbound) ? x : s;
    f = (a < lbound) ? 0 : floor (a + half);
    r = (a > ubound) ? x : (t * f);
    return r;
}

uint32_t float_as_uint (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof(r));
    return r;
}

float uint_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof(r));
    return r;
}

float newlib_roundf (float x)
{
    uint32_t w;
    int exponent_less_127;

    w = float_as_uint(x);
    /* Extract exponent field. */
    exponent_less_127 = (int)((w & 0x7f800000) >> 23) - 127;
    if (exponent_less_127 < 23) {
        if (exponent_less_127 < 0) {
            /* Extract sign bit. */
            w &= 0x80000000;
            if (exponent_less_127 == -1) {
                /* Result is +1.0 or -1.0. */
                w |= ((uint32_t)127 << 23);
            }
        } else {
            uint32_t exponent_mask = 0x007fffff >> exponent_less_127;
            if ((w & exponent_mask) == 0) {
                /* x has an integral value. */
                return x;
            }
            w += 0x00400000 >> exponent_less_127;
            w &= ~exponent_mask;
        }
    } else {
        if (exponent_less_127 == 128) {
            /* x is NaN or infinite so raise FE_INVALID by adding */
            return x + x;
        } else {
            return x;
        }
    }
    x = uint_as_float (w);
    return x;
}

int main (void)
{
    uint32_t argi, resi, refi;
    float arg, res, ref;
    argi = 0;
    do {
        arg = uint_as_float (argi);
        ref = newlib_roundf (arg);
        res = my_roundf (arg);
        resi = float_as_uint (res);
        refi = float_as_uint (ref);
        if (resi != refi) { // check for identical bit pattern
            printf ("!!!! arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
            return EXIT_FAILURE;
        }
        argi++;
    } while (argi);
    return EXIT_SUCCESS;
}

I use the following implementation of round in asm for x86 architecture and MS VS specific C++:
__forceinline int Round(const double v)
{
    int r;
    __asm
    {
        FLD   v
        FISTP r
        FWAIT
    };
    return r;
}
UPD: to return double value
__forceinline double dround(const double v)
{
    double r;
    __asm
    {
        FLD v
        FRNDINT
        FSTP r
        FWAIT
    };
    return r;
}
Output:
dround(0.1): 0.000000000000000
dround(-0.1): -0.000000000000000
dround(0.9): 1.000000000000000
dround(-0.9): -1.000000000000000
dround(1.1): 1.000000000000000
dround(-1.1): -1.000000000000000
dround(0.49999999999999994): 0.000000000000000
dround(-0.49999999999999994): -0.000000000000000
dround(0.5): 0.000000000000000
dround(-0.5): -0.000000000000000

Since C++ 11 simply:
#include <cmath>
std::round(1.1)
or to get int
static_cast<int>(std::round(1.1))

round_f for ARM with math
static inline float round_f(float value)
{
    float rep;
    asm volatile ("vrinta.f32 %0,%1" : "=t"(rep) : "t"(value));
    return rep;
}
round_f for ARM without math
union f__raw {
    struct {
        uint32_t massa :23;
        uint32_t order :8;
        uint32_t sign  :1;
    };
    int32_t i_raw;
    float   f_raw;
};

float round_f(float value)
{
    union f__raw raw;
    int32_t  exx;
    uint32_t ex_mask;
    raw.f_raw = value;
    exx = raw.order - 126;
    if (exx < 0) {
        raw.i_raw &= 0x80000000;
    } else if (exx < 24) {
        ex_mask = 0x00ffffff >> exx;
        raw.i_raw += 0x00800000 >> exx;
        if (exx == 0) ex_mask >>= 1;
        raw.i_raw &= ~ex_mask;
    };
    return raw.f_raw;
};

The best way to round off a floating value to "n" decimal places for display, in O(1) time, is with printf precision.
For example, to round off the value to 3 places, i.e. n = 3:
float a=47.8732355;
printf("%.3f",a);

#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main()
{
    // Convert the float to a string.
    // We might use stringstream, but it looks like it truncates the float to
    // only 5 decimal points (maybe that's what you want anyway =P)
    float MyFloat = 5.11133333311111333;
    float NewConvertedFloat = 0.0;
    string FirstString = " ";
    string SecondString = " ";
    stringstream ss (stringstream::in | stringstream::out);
    ss << MyFloat;
    FirstString = ss.str();

    // Take out however many decimal places you want
    // (this is a string; it includes the point)
    SecondString = FirstString.substr(0, 5); // whatever precision you want

    // Convert it back to a float
    stringstream(SecondString) >> NewConvertedFloat;
    cout << NewConvertedFloat;
    system("pause");
}
It might be an inefficient dirty way of conversion but heck, it works lol. And it's good, because it applies to the actual float. Not just affecting the output visually.

I did this:
#include <cmath>
using namespace std;

double roundh(double number, int place){
    /* place = decimal place. Passing 0 rounds to a whole
       number; passing 1 rounds to the tenths digit. */
    double scale = pow(10.0, place); // note: ^ is XOR in C++, not exponentiation
    number *= scale;
    double istack = floor(number);
    double out = number - istack;
    if (out < 0.5)
        number = floor(number);
    else
        number = ceil(number);
    return number / scale;
}

Related

Numerical accuracy of pow(a/b,x) vs pow(b/a,-x)

Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions: see how floating-point behaves. This is not a 100% correct way to analyze such questions, but it gives a general idea.
Let's generate random numbers. Calculate v0 = pow(a/b, n) and v1 = pow(b/a, -n) in float precision. And calculate ref = pow(a/b, n) in double precision, and round it to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref can be considered the best possible value; this is true for IEEE-754 most of the time). Then sum the difference between v0 and ref, and between v1 and ref. The difference should be calculated as "the number of floating point numbers between v and ref".
Note that the results may depend on the range of a, b and n (and on the random generator quality; if it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that the float and double division/pow algorithms are of the same kind and have the same characteristics.
For my computer, the summed differences are: 2604828 2603684, it means that there is no significant precision difference between the two.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}

int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f * rand() / RAND_MAX;
        float b = 1.0f * rand() / RAND_MAX;
        float n = 4.0f * rand() / RAND_MAX - 2.0f;
        if (a == 0 || b == 0) continue;
        float v0 = std::pow(a / b, n);
        float v1 = std::pow(b / a, -n);
        float ref = std::pow((double)a / b, n);
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y • log2(x)).
Roughly: the error in y • log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger the |log2(x)|, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y • log2(x) and the same z and similar errors in z. It is a question of how x, y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5 unit in the last place, and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact, and it is that approach with the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
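A small illustration of the power-of-two point above (7777777/4 only changes the exponent and is computed exactly, while 4/7777777 must round; the exponent here is arbitrary, chosen just for illustration):
#include <cmath>
#include <cstdio>

int main()
{
    double p = 1.51231;
    std::printf("%.20f\n", std::pow(7777777.0 / 4.0, -p)); // exact quotient
    std::printf("%.20f\n", std::pow(4.0 / 7777777.0, p));  // rounded quotient
}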
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b•(1+e0))^x•(1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a•(1+e2))^−x•(1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors, e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x•(1+e0)^x•(1+e1) and (a/b)^x•(1+e2)^−x•(1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^−x. The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^−x. The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^−x, (1−u/2)^−x]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/2^2)^x produces ((1−u^2/2^2)/(1−u/2))^x − ((1−u^2/2^2)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/2^2 < 1, so (1−u^2/2^2)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluation of rounding errors like in your case, it might be useful to use some multi-precision library, such as Boost.Multiprecision. Then, you can compare results for various precisions, e.g, such as with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>

namespace mp = boost::multiprecision;

template <typename FLOAT>
void comp() {
    FLOAT a = 8.72138221;
    FLOAT b = 1.761329479;
    FLOAT c = 1.51231;
    FLOAT e = mp::pow(a / b, -c);
    FLOAT f = mp::pow(b / a, c);
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}

int main() {
    std::cout << "Double: " << std::endl;
    comp<mp::cpp_bin_float_double>();
    std::cout << std::endl;
    std::cout << "Double extended: " << std::endl;
    comp<mp::cpp_bin_float_double_extended>();
    std::cout << std::endl;
    std::cout << "Quad: " << std::endl;
    comp<mp::cpp_bin_float_quad>();
    std::cout << std::endl;
    std::cout << "Dec-100: " << std::endl;
    comp<mp::cpp_dec_float_100>();
    std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate, however, I guess one cannot make any generic conclusions here.
Also, note that your input numbers are not accurately representable with the IEEE 754 double precision floating-point type (none of them are). The question is whether you care about the accuracy of calculations with either those exact numbers or their closest representations.

Getting correct values for trigonometric functions compared with matlab

I am trying to test a Simulink block against its C++ code; the Simulink block contains some algebraic and trigonometric functions and integrators. In my test procedure, a random number generator is used for the Simulink block input, and both input and outputs are recorded into a mat file (using MatIO), which is read by the C++ code and its outputs are compared with the C++-calculated ones. For signals containing only algebraic functions the results are exact and the difference is zero; for paths which contain the trigonometric functions the difference is about 10e-16.
The Matlab community claims they are correct and glibc isn't.
Recently I discovered that the output values of trigonometric functions implemented in glibc aren't equal to the values produced in Matlab; according to old questions 1 2 3 and my experiments, the differences are related to the >1 ulp accuracy of glibc. For most of the blocks this 10e-16 error doesn't matter much, but in the output of the integrator the 10e-16 accumulates more and more, and the final error of the integrator will be about 1e-3, which is a bit high and isn't acceptable for that kind of block.
After a lot of research about the problem, I decided to use approaches for calculating sin/cos functions other than those provided in glibc.
I implemented these approaches:
1- Taylor series with long double variables and -O2 (which forces using the x87 FPU with its 80-bit floating point arithmetic)
2- taylor series with GNU quadmath library (128bit precision)
3- MPFR library (128 bit)
4- CRLibm (correctly rounded libm)
5- Sun's LibMCR ( just like CRLibm )
6- X86 FSIN/FCOS with different rounding modes
7- Java.lang.math through JNI (as I think Matlab uses)
8- fdlibm (according to one of the blog posts I have seen)
9- openlibm
10- calling matlab function through mex/matlab engine
None of the approaches above except the last one could generate values equal to Matlab's. I tested all of those libraries and approaches for a wide range of inputs; some of them, like libmcr and fdlibm, produce NAN values for some of the inputs (it looks like they don't have good range checking), and the rest of them produce values with errors of 10e-16 and higher.
Only the last one produces correct values compared to Matlab, as was expected, but calling a Matlab function isn't efficient and is much slower than the native implementations.
I'm also surprised that MPFR and the Taylor series with long double and quadmath are getting errors.
This is the Taylor series with long double variables (80-bit precision). It should be compiled with -O2, which prevents spilling values from the 80-bit FPU stack into 64-bit memory (a precision loss); also, before doing any calculations, the rounding mode of the x87 is set to nearest:
typedef long double dt_double;

inline void setFPUModes(){
    unsigned int mode = 0b0000111111111111;
    asm(
        "fldcw %0;"
        : : "m"(mode));
}

inline dt_double factorial(int x) // calculates the factorial
{
    dt_double fact = 1;
    for (; x >= 1; x--)
        fact = x * fact;
    return fact;
}

inline dt_double power(dt_double x, dt_double n) // calculates the power of x
{
    dt_double output = 1;
    while (n > 0)
    {
        output = (x * output);
        n--;
    }
    return output;
}

inline double sin(double x) noexcept // value of sine by Taylor series
{
    setFPUModes();
    dt_double result = x;
    for (int y = 1; y != 44; y++)
    {
        int k = (2 * y) + 1;
        dt_double a = (y % 2) ? -1.0 : 1.0;
        dt_double c = factorial(k);
        dt_double b = power(x, k);
        result = result + (a * b) / c;
    }
    return result;
}
The Taylor series approach was tested with all four rounding modes of the x87; the best one has an error of 10e-16.
This is the x87 FPU one:
double sin(double x) noexcept
{
    double d;
    unsigned int mode = 0b0000111111111111;
    asm(
        "finit;"
        "fldcw %2;"
        "fldl %1;"
        "fsin;"
        "fstpl %0" :
        "+m"(d) : "m"(x), "m"(mode)
    );
    return d;
}
The x87 FPU code isn't more accurate than the previous one either.
Here is the code for MPFR:
double sin(double x) noexcept{
    mpfr_set_default_prec(128);
    mpfr_set_default_rounding_mode(MPFR_RNDN);
    mpfr_t t;
    mpfr_init2(t, 128);
    mpfr_set_d(t, x, MPFR_RNDN);
    mpfr_t y;
    mpfr_init2(y, 128);
    mpfr_sin(y, t, MPFR_RNDN);
    double d = mpfr_get_d(y, MPFR_RNDN);
    mpfr_clear(t);
    mpfr_clear(y);
    return d;
}
I can't understand why the MPFR version didn't work as expected.
Also, the code for all the other approaches I have tested is the same, and all of them have errors compared to Matlab.
All the codes were tested for a wide range of numbers, and I found simple cases where they fail. For example:
In Matlab the following code produces 0x3fe1b071cef86fbe, but with these approaches I got 0x3fe1b071cef86fbf (difference in the last bit):
format hex;
sin(0.5857069572718263)
ans = 0x3fe1b071cef86fbe
To be clear about the question: as I described above, this one-bit inaccuracy is important when it is fed into the integrator, and I am looking for a solution to get values exactly the same as Matlab's. Any ideas?
Update 1:
A 1-ulp error doesn't affect the algorithm output at all, but it prevents verification against the Matlab results, especially at the output of the integrators.
As @John Bollinger said, the errors don't accumulate in a direct path with multiple arithmetic blocks, but they do when fed into the discrete integrator.
Update 2:
I counted the number of unequal results for all of the approaches above; clearly openlibm produces fewer unequal values compared to Matlab's, but it isn't zero.
My guess is that Matlab is using code originally based on FDLIBM. I was able to get the same results with Julia (which uses openlibm): you could try using that, or musl, which I believe uses the same code as well.
The closest double/IEEE binary64 to 0.5857069572718263 is
0.5857069572718263117394599248655140399932861328125
(which has bit pattern 0x3fe2be1c8450b590)
The sin of this is
0.55278864311139114312806521962078480744570117018100444956428008387067038680572587...
The two closest double/IEEE binary64 to this are
a) 0.5527886431113910870038807843229733407497406005859375 (0x3fe1b071cef86fbe), which has error of 0.5055 ulps
b) 0.55278864311139119802618324683862738311290740966796875 (0x3fe1b071cef86fbf), which has error of 0.4945 ulps
FDLIBM is only guaranteed to be correct to <1 ulp, so either would be acceptable, and happens to return (a). crlibm is correctly rounded, and glibc provides a tighter guarantee of 0.55 ulps, so both will return (b).

Float rounding off error

#include <climits>
#include <iostream>

long myround(float f)
{
    if (f >= UINT_MAX) return f;
    return f + 0.5f;
}

int main()
{
    float f = 8388609.0f;
    std::cout.precision(16);
    std::cout << myround(f) << std::endl;
}
Output: 8388610.0
I'm trying to make sense of the output. The next floating point number larger than 8388609.0 is 8388610. So why is the rounded value not 8388609?
IEEE-754 defines several possible rounding modes, but in practice, the one almost always used is "round to nearest, ties to even". This is also known as "Banker's rounding", for no reason anybody can discern.
"Ties to even" means that if the to-be-rounded result of a floating point computation is precisely halfway between two representable numbers, the rounding will be in whichever direction makes the LSB of the result zero. In your case, 8388609.5 is halfway between 8388609 and 8388610, but only the latter has a zero in the last bit, so the rounding is upwards. If you had passed in 8388610.0 instead, the result would be rounded downwards; if you had passed in 8388611.0, it would be rounded upwards.
If you change your example to use double then the error disappears. The problem is float is more limited than double in the number of significant digits it can store. Adding 0.5 to your value simply exceeds the precision limits for a float, causing it to perform some rounding. In this case, 8388609.0f + 0.5f == 8388610.0f.
#include <climits>
#include <iostream>

long myround(double f)
{
    if (f >= UINT_MAX) return f;
    return f + 0.5;
}

int main()
{
    double f = 8388609.0;
    std::cout.precision(16);
    std::cout << myround(f) << std::endl;
}
If you continue to add digits to your number, it will also eventually fail for double.
Edit:
You can test this easily with a static_assert. This compiles on my platform: static_assert(8388609.0f + 0.5f == 8388610.0f, "");. It will likely compile on yours.

Best way to "Clip" (saturate) a variable [duplicate]

Is there a more efficient way to clamp real numbers than using if statements or ternary operators?
I want to do this both for doubles and for a 32-bit fixpoint implementation (16.16). I'm not asking for code that can handle both cases; they will be handled in separate functions.
Obviously, I can do something like:
double clampedA;
double a = calculate();
clampedA = a > MY_MAX ? MY_MAX : a;
clampedA = clampedA < MY_MIN ? MY_MIN : clampedA;
or
double a = calculate();
double clampedA = a;
if(clampedA > MY_MAX)
    clampedA = MY_MAX;
else if(clampedA < MY_MIN)
    clampedA = MY_MIN;
The fixpoint version would use functions/macros for comparisons.
This is done in a performance-critical part of the code, so I'm looking for the most efficient way to do it (which I suspect would involve bit manipulation).
EDIT: It has to be standard/portable C, platform-specific functionality is not of any interest here. Also, MY_MIN and MY_MAX are the same type as the value I want clamped (doubles in the examples above).
Both GCC and clang generate beautiful assembly for the following simple, straightforward, portable code:
double clamp(double d, double min, double max) {
    const double t = d < min ? min : d;
    return t > max ? max : t;
}
> gcc -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
GCC-generated assembly:
maxsd %xmm0, %xmm1 # d, min
movapd %xmm2, %xmm0 # max, max
minsd %xmm1, %xmm0 # min, max
ret
> clang -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
Clang-generated assembly:
maxsd %xmm0, %xmm1
minsd %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
Three instructions (not counting the ret), no branches. Excellent.
This was tested with GCC 4.7 and clang 3.2 on Ubuntu 13.04 with a Core i3 M 350.
On a side note, the straightforward C++ code calling std::min and std::max generated the same assembly.
This is for doubles. And for int, both GCC and clang generate assembly with five instructions (not counting the ret) and no branches. Also excellent.
I don't currently use fixed-point, so I will not give an opinion on fixed-point.
Old question, but I was working on this problem today (with doubles/floats).
The best approach is to use SSE MINSS/MAXSS for floats and SSE2 MINSD/MAXSD for doubles. These are branchless and take one clock cycle each, and are easy to use thanks to compiler intrinsics. They confer more than an order of magnitude increase in performance compared with clamping with std::min/max.
You may find that surprising. I certainly did! Unfortunately VC++ 2010 uses simple comparisons for std::min/max even when /arch:SSE2 and /FP:fast are enabled. I can't speak for other compilers.
Here's the necessary code to do this in VC++:
#include <xmmintrin.h> // SSE intrinsics (_mm_min_ss and friends)

float minss ( float a, float b )
{
    // Branchless SSE min.
    _mm_store_ss( &a, _mm_min_ss(_mm_set_ss(a),_mm_set_ss(b)) );
    return a;
}

float maxss ( float a, float b )
{
    // Branchless SSE max.
    _mm_store_ss( &a, _mm_max_ss(_mm_set_ss(a),_mm_set_ss(b)) );
    return a;
}

float clamp ( float val, float minval, float maxval )
{
    // Branchless SSE clamp.
    // return minss( maxss(val,minval), maxval );
    _mm_store_ss( &val, _mm_min_ss( _mm_max_ss(_mm_set_ss(val),_mm_set_ss(minval)), _mm_set_ss(maxval) ) );
    return val;
}
The double precision code is the same except with xxx_sd instead.
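For reference, a sketch of that double-precision variant (the _sd intrinsics live in <emmintrin.h> and require SSE2):
#include <emmintrin.h>

double clampd ( double val, double minval, double maxval )
{
    // Branchless SSE2 clamp, double precision.
    _mm_store_sd( &val, _mm_min_sd( _mm_max_sd(_mm_set_sd(val),_mm_set_sd(minval)), _mm_set_sd(maxval) ) );
    return val;
}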
Edit: Initially I wrote the clamp function as commented. But looking at the assembler output I noticed that the VC++ compiler wasn't smart enough to cull the redundant move. One less instruction. :)
If your processor has a fast instruction for absolute value (as the x86 does), you can do a branchless min and max which will be faster than an if statement or ternary operation.
min(a,b) = (a + b - abs(a-b)) / 2
max(a,b) = (a + b + abs(a-b)) / 2
If one of the terms is zero (as is often the case when you're clamping) the code simplifies a bit further:
max(a,0) = (a + abs(a)) / 2
When you're combining both operations you can replace the two /2 into a single /4 or *0.25 to save a step.
The following code is over 3x faster than ternary on my Athlon II X2, when using the optimization for FMIN=0.
#include <math.h>

double clamp(double value)
{
    double temp = value + FMAX - fabs(value-FMAX); /* fabs, not the integer abs */
#if FMIN == 0
    return (temp + fabs(temp)) * 0.25;
#else
    return (temp + (2.0*FMIN) + fabs(temp-(2.0*FMIN))) * 0.25;
#endif
}
Ternary operator is really the way to go, because most compilers are able to compile them into a native hardware operation that uses a conditional move instead of a branch (and thus avoids the mispredict penalty and pipeline bubbles and so on). Bit-manipulation is likely to cause a load-hit-store.
In particular, PPC and x86 with SSE2 have a hardware op that could be expressed as an intrinsic something like this:
double fsel( double a, double b, double c ) {
    return a >= 0 ? b : c;
}
The advantage is that it does this inside the pipeline, without causing a branch. In fact, if your compiler uses the intrinsic, you can use it to implement your clamp directly:
inline double clamp ( double a, double min, double max )
{
    a = fsel( a - min , a, min );
    return fsel( a - max, max, a );
}
I strongly suggest you avoid bit-manipulation of doubles using integer operations. On most modern CPUs there is no direct means of moving data between double and int registers other than by taking a round trip to the dcache. This will cause a data hazard called a load-hit-store which basically empties out the CPU pipeline until the memory write has completed (usually around 40 cycles or so).
The exception to this is if the double values are already in memory and not in a register: in that case there is no danger of a load-hit-store. However your example indicates you've just calculated the double and returned it from a function which means it's likely to still be in XMM1.
For the 16.16 representation, the simple ternary is unlikely to be bettered speed-wise.
And for doubles, because you need it standard/portable C, bit-fiddling of any kind will end badly.
Even if a bit-fiddle was possible (which I doubt), you'd be relying on the binary representation of doubles. THIS (and their size) IS IMPLEMENTATION-DEPENDENT.
Possibly you could "guess" this using sizeof(double) and then comparing the layout of various double values against their common binary representations, but I think you're on a hiding to nothing.
The best rule is TELL THE COMPILER WHAT YOU WANT (ie ternary), and let it optimise for you.
EDIT: Humble pie time. I just tested quinmars' idea (below), and it works - if you have IEEE-754 floats. This gave a speedup of about 20% on the code below. Obviously non-portable, but I think there may be a standardised way of asking your compiler if it uses IEEE754 float formats with a #if...?
double FMIN = 3.13;
double FMAX = 300.44;
double FVAL[10] = {-100, 0.23, 1.24, 3.00, 3.5, 30.5, 50, 100.22, 200.22, 30000};
uint64 Lfmin = *(uint64 *)&FMIN;
uint64 Lfmax = *(uint64 *)&FMAX;

DWORD start = GetTickCount();
for (int j=0; j<10000000; ++j)
{
    uint64 * pfvalue = (uint64 *)&FVAL[0];
    for (int i=0; i<10; ++i)
        *pfvalue++ = (*pfvalue < Lfmin) ? Lfmin : (*pfvalue > Lfmax) ? Lfmax : *pfvalue;
}
volatile DWORD hacktime = GetTickCount() - start;

for (int j=0; j<10000000; ++j)
{
    double * pfvalue = &FVAL[0];
    for (int i=0; i<10; ++i)
        *pfvalue++ = (*pfvalue < FMIN) ? FMIN : (*pfvalue > FMAX) ? FMAX : *pfvalue;
}
volatile DWORD normaltime = GetTickCount() - (start + hacktime);
The bits of IEEE 754 floating point are ordered in a way that if you compare the bits interpreted as an integer you get the same results as if you would compare them as floats directly. So if you find or know a way to clamp integers you can use it for (IEEE 754) floats as well. Sorry, I don't know a faster way.
If you have the floats stored in arrays, you can consider using some CPU extensions like SSE3, as rkj said. You can take a look at liboil; it does all the dirty work for you. It keeps your program portable and uses faster CPU instructions if possible. (I'm not sure though how OS/compiler-independent liboil is.)
Rather than testing and branching, I normally use this format for clamping:
clampedA = fmin(fmax(a,MY_MIN),MY_MAX);
Although I have never done any performance analysis on the compiled code.
Realistically, no decent compiler will make a difference between an if() statement and a ?: expression. The code is simple enough that they'll be able to spot the possible paths. That said, your two examples are not identical. The equivalent code using ?: would be
a = (a > MAX) ? MAX : ((a < MIN) ? MIN : a);
as that avoids the a < MIN test when a > MAX. Now that could make a difference, as the compiler otherwise would have to spot the relation between the two tests.
If clamping is rare, you can test the need to clamp with a single test:
if (abs(a - (MAX+MIN)/2) > ((MAX-MIN)/2)) ...
E.g. with MIN=6 and MAX=10, this will first shift a down by 8, then check if it lies between -2 and +2. Whether this saves anything depends a lot on the relative cost of branching.
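A sketch of that single-test check wrapped in a function (using fabs for doubles):
#include <math.h>

/* True exactly when a lies outside [mn, mx]: shift the range to be centred
   on zero, then compare the distance from the centre to the half-width. */
int needs_clamp(double a, double mn, double mx)
{
    return fabs(a - (mx + mn) / 2) > (mx - mn) / 2;
}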
Here's a possibly faster implementation similar to #Roddy's answer:
typedef int64_t i_t;
typedef double  f_t;

static inline
i_t i_tmin(i_t x, i_t y) {
    return (y + ((x - y) & -(x < y))); // min(x, y)
}

static inline
i_t i_tmax(i_t x, i_t y) {
    return (x - ((x - y) & -(x < y))); // max(x, y)
}

f_t clip_f_t(f_t f, f_t fmin, f_t fmax)
{
#ifndef TERNARY
    assert(sizeof(i_t) == sizeof(f_t));
    //assert(not (fmin < 0 and (f < 0 or is_negative_zero(f))));
    //XXX assume IEEE-754 compliant system (lexicographically ordered floats)
    //XXX break strict-aliasing rules
    const i_t imin = *(i_t*)&fmin;
    const i_t imax = *(i_t*)&fmax;
    const i_t i = *(i_t*)&f;
    const i_t iclipped = i_tmin(imax, i_tmax(i, imin));
#ifndef INT_TERNARY
    return *(f_t *)&iclipped;
#else  /* INT_TERNARY */
    return i < imin ? fmin : (i > imax ? fmax : f);
#endif /* INT_TERNARY */
#else  /* TERNARY */
    return fmin > f ? fmin : (fmax < f ? fmax : f);
#endif /* TERNARY */
}
See Compute the minimum (min) or maximum (max) of two integers without branching and Comparing floating point numbers
The IEEE float and double formats were designed so that the numbers are "lexicographically ordered", which - in the words of IEEE architect William Kahan - means "if two floating-point numbers in the same format are ordered (say x < y), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers."
A test program:
/** gcc -std=c99 -fno-strict-aliasing -O2 -lm -Wall *.c -o clip_double && clip_double */
#include <assert.h>
#include <iso646.h>  // not, and
#include <math.h>    // isnan()
#include <stdbool.h> // bool
#include <stdint.h>  // int64_t
#include <stdio.h>

static
bool is_negative_zero(f_t x)
{
    return x == 0 and 1/x < 0;
}

static inline
f_t range(f_t low, f_t f, f_t hi)
{
    return fmax(low, fmin(f, hi));
}

static const f_t END = 0./0.;

#define TOSTR(f, fmin, fmax, ff) ((f) == (fmin) ? "min" : \
                                  ((f) == (fmax) ? "max" : \
                                   (is_negative_zero(ff) ? "-0.": \
                                    ((f) == (ff) ? "f" : #f))))

static int test(f_t p[], f_t fmin, f_t fmax, f_t (*fun)(f_t, f_t, f_t))
{
    assert(isnan(END));
    int failed_count = 0;
    for ( ; ; ++p) {
        const f_t clipped = fun(*p, fmin, fmax), expected = range(fmin, *p, fmax);
        if (clipped != expected and not (isnan(clipped) and isnan(expected))) {
            failed_count++;
            fprintf(stderr, "error: got: %s, expected: %s\t(min=%g, max=%g, f=%g)\n",
                    TOSTR(clipped, fmin, fmax, *p),
                    TOSTR(expected, fmin, fmax, *p), fmin, fmax, *p);
        }
        if (isnan(*p))
            break;
    }
    return failed_count;
}

int main(void)
{
    int failed_count = 0;
    f_t arr[] = { -0., -1./0., 0., 1./0., 1., -1., 2,
                  2.1, -2.1, -0.1, END};
    f_t minmax[][2] = { -1, 1, // min, max
                         0, 2, };
    for (int i = 0; i < (sizeof(minmax) / sizeof(*minmax)); ++i)
        failed_count += test(arr, minmax[i][0], minmax[i][1], clip_f_t);
    return failed_count & 0xFF;
}
In console:
$ gcc -std=c99 -fno-strict-aliasing -O2 -lm *.c -o clip_double && ./clip_double
It prints:
error: got: min, expected: -0. (min=-1, max=1, f=0)
error: got: f, expected: min (min=-1, max=1, f=-1.#INF)
error: got: f, expected: min (min=-1, max=1, f=-2.1)
error: got: min, expected: f (min=-1, max=1, f=-0.1)
I tried the SSE approach to this myself, and the assembly output looked quite a bit cleaner, so I was encouraged at first, but after timing it thousands of times, it was actually quite a bit slower. It does indeed look like the VC++ compiler isn't smart enough to know what you're really intending, and it appears to move things back and forth between the XMM registers and memory when it shouldn't. That said, I don't know why the compiler isn't smart enough to use the SSE min/max instructions on the ternary operator when it seems to use SSE instructions for all floating point calculations anyway. On the other hand, if you're compiling for PowerPC, you can use the fsel intrinsic on the FP registers, and it's way faster.
As pointed out above, fmin/fmax functions work well (in gcc, with -ffast-math). Although gfortran has patterns to use IA instructions corresponding to max/min, g++ does not. In icc one must use instead std::min/max, because icc doesn't allow short-cutting the specification of how fmin/fmax work with non-finite operands.
My 2 cents in C++. Probably not any different than using ternary operators, and hopefully no branching code is generated:
template <typename T>
inline T clamp(T val, T lo, T hi) {
    return std::max(lo, std::min(hi, val));
}
If I understand properly, you want to limit a value "a" to a range between MY_MIN and MY_MAX. The type of "a" is a double. You did not specify the type of MY_MIN or MY_MAX.
The simple expression:
clampedA = (a > MY_MAX)? MY_MAX : (a < MY_MIN)? MY_MIN : a;
should do the trick.
I think there may be a small optimization to be made if MY_MAX and MY_MIN happen to be integers:
int b = (int)a;
clampedA = (b > MY_MAX)? (double)MY_MAX : (b < MY_MIN)? (double)MY_MIN : a;
By changing to integer comparisons, it is possible you might get a slight speed advantage.
If you want to use fast absolute value instructions, check out this snipped of code I found in minicomputer, which clamps a float to the range [0,1]
clamped = 0.5*(fabs(x)-fabs(x-1.0f) + 1.0f);
(I simplified the code a bit). We can think about it as taking two values, one reflected to be >0
fabs(x)
and the other reflected about 1.0 to be <1.0
1.0-fabs(x-1.0)
And we take the average of them. If it is in range, then both values will be the same as x, so their average will again be x. If it is out of range, then one of the values will be x, and the other will be x flipped over the "boundary" point, so their average will be precisely the boundary point.
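Putting that together as a function, with a few spot checks (a sketch of the technique described above, clamping to [0,1]):
#include <cmath>
#include <iostream>

float clamp01(float x)
{
    // Average of x reflected to be >= 0 and x reflected about 1.0 to be <= 1.
    return 0.5f * (std::fabs(x) - std::fabs(x - 1.0f) + 1.0f);
}

int main()
{
    for (float x : {-0.5f, 0.25f, 1.5f})
        std::cout << clamp01(x) << ' '; // prints: 0 0.25 1
}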

Rounding to the 100 unit

I don't know whether the following idea is feasible to generalize, but I want to round every calculated value up to the next 100 unit.
example:
double x;
int x_final;
...
if (x<400) x_final=400;
else if (x<500) x_final=500;
else if (x<600) x_final=600;
...
To round up, you can use this:
x_final = ((int)x / 100 + 1) * 100;
The obvious solution is to use remquo:
int
roundTo100( double x )
{
    int results;
    remquo( x, 100.0, &results );
    return 100 * results;
}
You'll probably need a fairly recent compiler for this, however: the function was added to C99, and to C++ with C++11. Depending on the platform, you might not have it at all, or you might only have it in <math.h> (but not in <cmath>). If the compiler claims C++11 support, you should have it in <cmath>. But don't believe it until you've seen it; no compiler actually supports C++11 to any real degree yet. On platforms where there is support for C99 (which would include pretty much all Unix platforms, and CygWin under Windows), it should be present in <math.h>, regardless. (But it is not present in Visual Studios.)
In the absence of this function, something like:
int
roundTo100( double x )
{
    int results = round( x / 100 );
    return 100 * results;
}
might do the trick. Beware that the two functions round slightly differently: remquo rounds the quotient to nearest, ties to even, while round rounds halfway cases away from zero. The second may also potentially introduce inaccuracies due to the division (or not - I've not analysed it sufficiently to be sure one way or the other).
Depending on where your x comes from, the rounding differences or the potential inaccuracies may not be an issue. (If, for example, the x is derived from some physical measurements with only 3 decimal digits accuracy, the fact that it may round "incorrectly" when x is distant from 50 by some 1E-14 or the like is probably irrelevant.)
Try this:
#include <math.h>
...
x_final = ceil(x/100)*100;
Divide it by 100 (ignoring the remainder) and then multiply it by 100.
#include <iostream>
using namespace std;

int main() {
    int val = 456;
    int r = (val / 100) * 100;
    cout << "r = " << r;
    return 0;
}