How to improve precision in multiplication?

How to improve precision in multiplication? - c++

I am using C++ to implement transfinite interpolation algorithm (http://en.wikipedia.org/wiki/Transfinite_interpolation). Everything looks good until when I was trying to test some small numbers, the results look weird and incorrect. I guess it must have something to do with loss of precision. The code is
for (int j = 0; j < Ny+1; ++j)
{
for (int i = 0; i < Nx+1; ++i)
{
int imax = Nx;
int jmax = Ny;
double CA1 = (double)i/imax; // s
double CB1 = (double)j/jmax; // t
double CA2 = 1.0-CA1; // 1-s
double CB2 = 1.0-CB1; // 1-t
point U = BD41[j]*CA2 + BD23[j]*CA1;
point V = BD12[i]*CB2 + BD34[i]*CB1;
point UV =
(BD12[0]*CB2 + BD41[jmax]*CB1)*CA2
+ (BD12[imax]*CB2 + BD23[jmax]*CB1)*CA1;
tfiGrid[k][j][i] = U + V - UV;
}
}
I guess when BD12[i] (or BD34[i] or BD41[j] or BD23[j]) is very small, the rounding error or something would accumulated and become in-negligible. Any ideas how to handle this sort of situation?
PS: Even though similar questions have been asked for millions of times. I still cannot figure out is it related to my multiplication or division or subtraction or what?

In addition to the points that Antoine made (all very good):
it's probably worth remembering that adding two values with very
different orders of magnitude will cause a very large loss of
precision. For example, if CA1 is less than about 1E-16,
1.0 - CA1 is probably still 1.0, and even if it is just
a little larger, you'll loose quite a few digits of precision.
If this is the problem, you should be able to isolate it just by
putting a few print statements in the inner loop, and looking at
the values you are adding (or perhaps even with a debugger).
What to do about it is another question. There may be some
numerically stable algorithms for what you are trying to do;
I don't know. Otherwise, you'll probably have to detect the
problem dynamically, and rewrite the equations to avoid it if it
occurs. (For example, to detect if CA1 is too small for the
addition, you might check whether 1.0 / CA1 is more than
a couple of thousand, or million, or however much precision you
can afford to loose.)

The accuracy of the arithmetics built in C/C++ is limited. Of course errors will accumulate in your case.
Have you considered using a library that provides higher precision? Maybe have a look at https://gmplib.org/
A short example that clearifys the higher accuracy:
double d, e, f;
d = 1000000000000000000000000.0;
e = d + 1.0;
f = e - d;
printf( "the difference of %f and %f is %f\n", e, d, f);
This will not print 1 but 0. With gmplib the code would look like:
#include "gmp.h"
mpz_t m, n, r;
mpz_init( m);
mpz_init( n);
mpz_init( r);
mpz_set_str( m, "1 000 000 000 000 000 000 000 000", 10);
mpz_add_ui( n, m, 1);
mpz_sub( r, n, m);
printf( "the difference of %s and %s is %s (using gmp)\n",
mpz_get_str( NULL, 10, n),
mpz_get_str( NULL, 10, m),
mpz_get_str( NULL, 10, r));
mpz_clear( m);
mpz_clear( n);
mpz_clear( r);
This will return 1.

Your algorithm doesn't appear to be accumulating errors by re-using previous computations at each iteration, so it's difficult to answer your question without looking at your data.
Basically, you have 3 options:
Increase the precision of your numbers: x86 CPUs can handle float (32 bit), double (64-bit) and often long double (80-bit). Beyond that you have to use "soft" floating points, where all operations are implemented in software instead of hardware. There is a good C lib that do just that: MPFR based on GMP, GNU recommends using MPFR. I strongly recommend using easier to use C++ wrappers like boost multiprocesion. Expect your computations to be orders of magnitude slower.
Analyze where your precision loss comes from by using something more informative than a single scalar number for your computations. Have a look at interval arithmetic and the MPFI lib, based on MPFR. CADENA is another, very promising solution, based on randomly changing rounding modes of the hardware, and comes with a low run-time cost.
Perform static analysis, which doesn't even require running your code and work by analyzing your algorithms. Unfortunately, I don't have experience with such tools so I can't recommend anything beyond goggling.
I think the best course is to run static or dynamic analysis while developing your algorithm, identify your precision problems, address them by either changing the algorithm or using higher precision for the most unstable variables - and not others to avoid too much performance impact at run-time.

Numerical (i.e., floating point) computation is hard to do precisely. You have to be particularly vary of substractions, is is mostly there where you lose precision. In this case the 1.0 - CA1 and such are suspect (if CA1 is very small, you'll get 1). Reorganize your expressions, the Wikipedia article (a stub!) probably has them written for understanding (showing symmetries, ...) and aestetics, not numerical robustness.
Search for courses/lecture notes on numerical computation, thy should include an introductory chapter on this. And check out Goldberg's classic What every computer scientist should know about floating point arithmetic.

Related

Map integer range onto another range

In runtime I have 2 ranges defined by their uint32_t borders a..b and c..d. The first range tends to be much greater than the second: 8 < (b - a) / (d - c) < 64.
Exact limits: a >= 0, b <= 2^31 - 1, c >= 0, d <= 2^20 - 1.
I need a routine that performs linear mapping of an integer from the first range onto the second one: f(uint32_t x) -> round_to_uint32_t((float)(x - a) / (b - a) * (d - c) + c).
When b - a >= d - c it is important to mantain the ratio as close to ideal as possible, otherwise in cases when element from [a; b] can be mapped on more than one integer from [c; d] it is okay to return any of these integers.
Sounds like a simple ratio problem and was already answered in many questions like
Convert a number range to another range, maintaining ratio
but here I need a really really fast solution.
This routine is a pivotal part of a specialized sorting algorithm and will be called at least once for every element of a sorted array.
SIMD solution is also acceptable if it doesn't drop overall performance.

Actual runtime division (FP and integer) is very slow so you definitely want to avoid that. The way you wrote that expression probably compiles to include a division because FP math is not associative (without -ffast-math); the compiler can't turn x / foo * bar into x * (bar/foo) for you, even though that's very good with loop-invariant bar/foo. You do need either floating point or 64-bit integers to avoid overflow in a multiply, but only FP lets you reuse a non-integer loop-invariant division result.
_mm256_fmadd_ps looks like the obvious way to go, with a pre-computed loop-invariant value for the multiplier (d - c) / (b - a). If float rounding isn't a problem for doing it strictly in order (multiply then divide), it's probably ok to do this inexact division first, outside the loop. Like
_mm256_set1_ps((d - c) / (double)(b - a)). Using double for this calculation avoids rounding error during conversion to FP of the division operands.
You're reusing the same a,b,c,d for many x, presumably coming from contiguous memory. You're using the result as part of a memory address so you do eventually need the results back from SIMD into integer registers, unfortunately. (Possibly with AVX512 scatter stores you could avoid that.)
Modern x86 CPUs have 2/clock load throughput so probably your best bet for getting 8x uint32_t back into integer registers is a vector store / integer reload, instead of spending 2 uops per element for ALU shuffle stuff. That has some latency so I'd suggest converting into a tmp buffer of maybe 16 or 32 ints (64 or 128 bytes), i.e. 2x or 4x __m256i before looping through that scalar.
Or maybe alternate converting and storing one vector then looping over the 8 elements of another one that you converted earlier. i.e. software pipelining. Out-of-order execution can hide latency but you're already going to be stretching its latency-hiding capability for cache misses for whatever you're doing with memory.
Depending on your CPU (e.g. Haswell or some Skylake), using 256-bit vector instructions might cap your max turbo slightly lower than it would otherwise. You might consider only doing vectors of 4 at once but then you're spending more uops per element.
If not SIMD, then even scalar C++ fma() is still good, for vfmadd213sd, but using intrinsics is a very convenient way to get rounding (instead of truncation) from float -> int (vcvtps2dq rather than vcvttps2dq).
Note that uint32_t <-> float conversion isn't directly available until AVX512. For scalar you can just convert to/from int64_t with truncation / zero-extension for the unsigned low half.
It's very convenient that (as discussed in comments) your inputs are range-limited so if you interpret them as signed integers they have the same value (signed non-negative). Both x and x-a (and b-a) are known to be positive and <= INT32_MAX i.e 0x7FFFFFFF. (Or at least non-negative. Zero is fine.)
Float Rounding
For SIMD, single-precision float is very good for SIMD throughput. Efficient packed-conversion to/from signed int32_t. But not every int32_t can be exactly represented as a float. Larger values get rounded to the nearest even, nearest multiple of 2^2, 2^3, or more the farther above 2^24 the value is.
Using SIMD double is possible but requires some shuffling.
I don't think float is usually a problem for the formula as-written with (float)(x-a). If the b-a input range is large, that means both ranges are large and rounding error isn't going to map all possible x values into the same output. Depending on the multiplier, the input rounding error might be worse than the output rounding error, maybe leaving some representable output floats unused for higher x-a values.
But if we want to factor out the -a * (d - c) / (b - a) part and combine it with the +c at the end, then
We potentially have precision loss from catastrophic cancellation in that value to be added.
We need to do (float)x on the raw input value. If a is huge and b-a is small, i.e. a small range near the top of the possible input range, rounding error can map all possible x values to the same float.
To make best use of FMA, we want to do the +c before converting back to integer, which again risks output rounding error if the d-c is a small output range but c is huge. In your case not a problem; with d <= 2^20 - 1 we know that float can exactly represent every output integer value in that c..d range.
If you didn't have the input range constraint, you could range-shift to/from signed before the scaling by using integer (x-a)+0x80000000U on input and ...+c+0x80000000U on output (after rounding to nearest int32_t). But that would introduce huge float rounding error for small uint32_t inputs (close to 0) which get range-shifted to close to INT_MIN.
We don't need to range-shift for the b-a or d-c because the + or - or XOR with 0x80000000U would cancel out in the subtractions.
Example:
The const vectors should be hoisted out of a loop by the compiler after this inlines,
or you can do that manually.
This requires AVX1 + FMA (e.g. AMD Piledriver or Intel Haswell or later). Untested, sorry I didn't even throw this on Godbolt to see if it compiles.
// fastest but not safe if b-a is small and a > 2^24
static inline
__m256i range_scale_fast_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
double scale_scalar = (d - c) / (double)(b - a);
const __m256 scale = _mm256_set1_ps(scale_scalar);
const __m256 add = _m256_set1_ps(-a*scale_scalar + c);
// (x-a) * scale + c
// = x * scale + (-a*scale + c) but with different rounding error from doing -a*scale + c
__m256 in = _mm256_cvtepi32_ps(data);
__m256 out = _mm256_fmadd_ps(in, scale, add);
return _mm256_cvtps_epi32(out); // convert back with round to nearest-even
// _mm256_cvttps_epi32 truncates, matching C rounding; maybe good for scalar testing
}
Or a safer version, doing the input range-shift with integer: You could easily avoid FMA here if necessary for portability (just AVX1) and use an integer add for the output, too. But we know the output range is small enough that it can always exactly represent any integer
static inline
__m256i range_scale_safe_fma(__m256i data, uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
// avoid rounding errors when computing the scale factor, but convert double->float on the final result
const __m256 scale = _mm256_set1_ps((d - c) / (double)(b - a));
const __m256 cvec = _m256_set1_ps(c);
__m256i in_offset = _mm256_add_epi32(data, _mm256_set1_epi32(-a)); // add can more easily fold a load of a memory operand than sub because it's commutative. Only some compilers will do this for you.
__m256 in_fp = _mm256_cvtepi32_ps(in_offset);
__m256 out = _mm256_fmadd_ps(in_fp, scale, _mm256_set1_ps(c)); // in*scale + c
return _mm256_cvtps_epi32(out);
}
Without FMA you could still use vmulps. You might as well convert back to integer before adding c if you're doing that, although vaddps would be safe.
You might use this in a loop like
void foo(uint32_t *arr, ptrdiff_t len)
{
if (len < 24) special case;
alignas(32) uint32_t tmpbuf[16];
// peel half of first iteration for software pipelining / loop rotation
__m256i arrdata = _mm256_loadu_si256((const __m256i*)&arr[0]);
__m256i outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)tmpbuf, outrange);
// could have used an unsigned loop counter
// since we probably just need an if() special case handler anyway for small len which could give len-23 < 0
for (ptrdiff_t i = 0 ; i < len-(15+8) ; i+=16 ) {
// prep next 8 elements
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+8]);
outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)&tmpbuf[8], outrange);
// use first 8 elements
for (int j=0 ; j<8 ; j++) {
use tmpbuf[j] which corresponds to arr[i+j]
}
// prep 8 more for next iteration
arrdata = _mm256_loadu_si256((const __m256i*)&arr[i+16]);
outrange = range_scale_safe_fma(arrdata);
_mm256_store_si256((__m256i*)&tmpbuf[0], outrange);
// use 2nd 8 elements
for (int j=8 ; j<16 ; j++) {
use tmpbuf[j] which corresponds to arr[i+j]
}
}
// use tmpbuf[0..7]
// then cleanup: one vector at a time until < 8 or < 4 with 128-bit vectors, then scalar
}
These variable-names sound dumb but I couldn't think of anything better.
This software pipelining is an optimization; you can just get it working / try it out with a single vector at a time used right away. (Optimize the reload of the first element from a reload to vmovd using _mm_cvtsi128_si32(_mm256_castsi256_si128(outrange)) if you want.)
Special cases
If there cases where you know (b - a) is a power of 2, you could bitscan with tzcnt or bsf, then multiply. (There are intrinsics for those, like GNU C __builtin_ctz() to count trailing zeros.)
Or can you ensure that (b - a) is always a power of 2?
Or better, if (b - a) / (d - c) is an exact power of 2 the whole thing can just be sub / right shift / add.
If you can't always ensure that you'd still need the general case sometimes, but maybe possible to do that efficiently.

What can std::numeric_limits<double>::epsilon() be used for?

unsigned int updateStandardStopping(unsigned int numInliers, unsigned int totPoints, unsigned int sampleSize)
{
double max_hypotheses_=85000;
double n_inliers = 1.0;
double n_pts = 1.0;
double conf_threshold_=0.95
for (unsigned int i = 0; i < sampleSize; ++i)
{
n_inliers *= numInliers - i;//n_linliers=I(I-1)...(I-m+1)
n_pts *= totPoints - i;//totPoints=N(N-1)(N-2)...(N-m+1)
}
double prob_good_model = n_inliers/n_pts;
if ( prob_good_model < std::numeric_limits<double>::epsilon() )
{
return max_hypotheses_;
}
else if ( 1 - prob_good_model < std::numeric_limits<double>::epsilon() )
{
return 1;
}
else
{
double nusample_s = log(1-conf_threshold_)/log(1-prob_good_model);
return (unsigned int) ceil(nusample_s);
}
}
Here is a selection statement:
if ( prob_good_model < std::numeric_limits<double>::epsilon() )
{...}
To my understanding, the judgement statement is the same as(or an approximation to)
prob_good_model < 0
So whether or not I am right and where std::numeric_limits<double>::epsilon() can be used besides that?

The point of epsilon is to make it (fairly) easy for you to figure out the smallest difference you could see between two numbers.
You don't normally use it exactly as-is though. You need to scale it based on the magnitudes of the numbers you're comparing. If you have two numbers around 1e-100, then you'd use something on the order of: std::numeric_limits<double>::epsilon() * 1.0e-100 as your standard of comparison. Likewise, if your numbers are around 1e+100, your standard would be std::numeric_limits<double>::epsilon() * 1e+100.
If you try to use it without scaling, you can get drastically incorrect (utterly meaningless) results. For example:
if (std::abs(1e-100 - 1e-200) < std::numeric_limits<double>::epsilon())
Yup, that's going to show up as "true" (i.e., saying the two are equal) even though they differ by 100 orders of magnitude. In the other direction, if the numbers are much larger than 1, comparing to an (unscaled) epsilon is equivalent to saying if (x != y)--it leaves no room for rounding errors at all.
At least in my experience, the epsilon specified for the floating point type isn't often of a great deal of use though. With proper scaling, it tells you the smallest difference there could possibly be between two numbers of a given magnitude (for a particular floating point implementation).
In real use, however, that's of relatively little real use. A more realistic number will typically be based on the precision of the inputs, and an estimate of the amount of precision you're likely to have lost due to rounding (and such).
For example, let's assume you started with values measured to a precision of 1 part per million, and you did only a few calculations, so you believe you may have lost as much as 2 digits of precision due to rounding errors. In this case, the "epsilon" you care about is roughly 1e-4, scaled to the magnitude of the numbers you're dealing with. That is to say, under those circumstances, you can expect on the order of 4 digits of precision to be meaningful, so if you see a difference in the first four digits, it probably means the values aren't equal, but if they differ only in the fifth (or later) digits, you should probably treat them as equal.
The fact that the floating point type you're using can represent (for example) 16 digits of precision doesn't mean that every measurement you use is going to be nearly the precise--in fact, it's relatively rare the anything based on physical measurements has any hope of being even close to that precise. It does, however, give you a limit on what you can hope for from a calculation--even if you start with a value that's precise to, say, 30 digits, the most you can hope for after calculation is going to be defined by std::numeric_limits<T>::epsilon.

It can be used on situation where a function is undefined, but you still need a value at that point. You lose a bit of accuracy, especially in extreme cases, but sometimes it's alright.
Like let's say you're using 1/x somewhere but your range of x is [0, n[. you can use 1/(x + std::numeric_limits<double>::epsilon()) instead so that 0 is still defined. That being said, you have to be careful with how the value is used, it might not work for every case.

Avoiding the rounding of double numbers

I'm trying to write a function that calculates the regression line from a data sheet with the least squares methode, but I encountered some serious problem with my code.
My first issue is that I don't know why my "linear regression" function is rounding the result of the iterations, even when I'm trying to use other "bigger" types.
My second issue is that the last part of my code is giving me the wrong results for the y intercept (b) and the slope (a), and I think that it could be a problem of conversion but I'm not really sure. If it is the case what should I do to avoid it?
void RegLin (const vector<double>& valuesX, const vector<double>& valuesY, vector<double>& PenOrd) {
unsigned int N=valuesX.size();
long double SomXi{0};
for (unsigned i=0; i<N; ++i){
SomXi+=valuesX.at(i);
}
long double SomXiXi{0};
for (unsigned i=0; i<N; ++i){ //Here is a problem (number rounded) Expected value: 937352,25 / Given value: 937352
SomXiXi+=(valuesX.at(i))*(valuesX.at(i));
}
long double SomYi{0};
for (unsigned i=0; i<N; ++i){
SomYi+=valuesY.at(i);
}
long double SomXiYi{0};
for (unsigned i=0; i<N; ++i){ //Here is the same problem Excepted value: 334107,41 / Given value: 334107
SomXiYi+=(valuesX.at(i))*(valuesY.at(i));
}
long double a=(SomYi*SomXiXi-SomXi*SomXiYi)/(N*SomXiXi-pow(SomXi,2)); //Bad result
long double b=(N*SomXiYi-SomYi*SomXi)/(N*SomXiXi-pow(SomXi,2)); //Bad result
PenOrd.push_back(a);
PenOrd.push_back(b);
return;
}
Thank you in advance for your support
P.S: I'm using g++ and the 2011 C++ standard.

there are serveral points with your effort. I'm a theoretical physics and numerical maths guy. So let me share some best practices with you.
First, I never encountered the need for using long double. Stick with double, because if that would not suffice, then you should go about considering working log-log plots to further analyse your data.
Second, you are using unsigned int instead of int. You should never do regression work with that much values (i.e., value-pairs) that it isn't sufficient to use int or best std::size_t for your integer counters. Using too many values can degrade accuracy because of accumulating numerical roundoff issues. So do not use use more than say 10000 to 1 Million of values until you have a very good reason to do so.
Third, it quickly becomes necessary to not bluntly add your squares (e.g., for SumXiXi and so forth) but to sort your contributions to the sum before actually summing them up. You correctly sum them up starting with the tiniest of values proceeding with your ever growing contributions to your sums. That is the only way to stay on top of accumulating roundoff issues.
Fourth, control of results. A good sign of reliability of results is achievable, if you to your work twice, one time as you did go about (i.e., using x_iy_i - xy_i - x_iy + xy formuae) and then as a second approach using the still un-multiplicated (x_i - x)(y_i - y) formulae. Good quality calculations would yield very comparable results using either formulae.
So, maybe that was quite a detour about doing numerical regression work, hope it might help a little.
Regards, Micha

The first rule of numerical calculations with floating point says: "Work only with values of the same order".
Floating point mathematics works pretty simple in case of e.g. addition (float) :
1e6 + 1e-6 = 1000000 + 0.000001 = 1000000.000001 = 1000000 = 1e6
^
precision limit
So, as you see, result is "rounded".

It's entirerly possible that you are given error due to 2 possibilites:
1) long double == double on your compiler and you are getting wrong results
2) floating point arithmetic does not represent values with 100% accuracy and therefore '0.10 != 0.10 written as float/double
Depending on what kind of calculations you are doing, I would adivse you to increase values by a few power OR change the data to float and store values in double.

Floating point equality and tolerances

Comparing two floating point number by something like a_float == b_float is looking for trouble since a_float / 3.0 * 3.0 might not be equal to a_float due to round off error.
What one normally does is something like fabs(a_float - b_float) < tol.
How does one calculate tol?
Ideally tolerance should be just larger than the value of one or two of the least significant figures. So if the single precision floating point number is use tol = 10E-6 should be about right. However this does not work well for the general case where a_float might be very small or might be very large.
How does one calculate tol correctly for all general cases? I am interested in C or C++ cases specifically.

This blogpost contains an example, fairly foolproof implementation, and detailed theory behind it
http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
it is also one of a series, so you can always read more.
In short: use ULP for most numbers, use epsilon for numbers near zero, but there are still caveats. If you want to be sure about your floating point math i recommend reading whole series.

As far as I know, one doesn't.
There is no general "right answer", since it can depend on the application's requirement for precision.
For instance, a 2D physics simulation working in screen-pixels might decide that 1/4 of a pixel is good enough, while a 3D CAD system used to design nuclear plant internals might not.
I can't see a way to programmatically decide this from the outside.

The C header file <float.h> gives you the constants FLT_EPSILON and DBL_EPSILON, which is the difference between 1.0 and the smallest number larger than 1.0 that a float/double can represent. You can scale that by the size of your numbers and the rounding error you wish to tolerate:
#include <float.h>
#ifndef DBL_TRUE_MIN
/* DBL_TRUE_MIN is a common non-standard extension for the minimum denorm value
* DBL_MIN is the minimum non-denorm value -- use that if TRUE_MIN is not defined */
#define DBL_TRUE_MIN DBL_MIN
#endif
/* return the difference between |x| and the next larger representable double */
double dbl_epsilon(double x) {
int exp;
if (frexp(x, &exp) == 0.0)
return DBL_TRUE_MIN;
return ldexp(DBL_EPSILON, exp-1);
}

Welcome to the world of traps, snares and loopholes. As mentioned elsewhere, a general purpose solution for floating point equality and tolerances does not exist. Given that, there are tools and axioms that a programmer may use in select cases.
fabs(a_float - b_float) < tol has the shortcoming OP mentioned: "does not work well for the general case where a_float might be very small or might be very large." fabs(a_float - ref_float) <= fabs(ref_float * tol) copes with the variant ranges much better.
OP's "single precision floating point number is use tol = 10E-6" is a bit worrisome for C and C++ so easily promote float arithmetic to double and then it's the "tolerance" of double, not float, that comes into play. Consider float f = 1.0; printf("%.20f\n", f/7.0); So many new programmers do not realize that the 7.0 caused a double precision calculation. Recommend using double though out your code except where large amounts of data need the float smaller size.
C99 provides nextafter() which can be useful in helping to gauge "tolerance". Using it, one can determine the next representable number. This will help with the OP "... the full number of significant digits for the storage type minus one ... to allow for roundoff error." if ((nextafter(x, -INF) <= y && (y <= nextafter(x, +INF))) ...
The kind of tol or "tolerance" used is often the crux of the matter. Most often (IMHO) a relative tolerance is important. e. g. "Are x and y within 0.0001%"? Sometimes an absolute tolerance is needed. e.g. "Are x and y within 0.0001"?
The value of the tolerance is often debatable for the best value is often situation dependent. Comparing within 0.01 may work for a financial application for Dollars but not Yen. (Hint: be sure to use a coding style that allows easy updates.)

Rounding error varies according to values used for operations.
Instead of a fixed tolerance, you can probably use a factor of epsilon like:
bool nearly_equal(double a, double b, int factor /* a factor of epsilon */)
{
double min_a = a - (a - std::nextafter(a, std::numeric_limits<double>::lowest())) * factor;
double max_a = a + (std::nextafter(a, std::numeric_limits<double>::max()) - a) * factor;
return min_a <= b && max_a >= b;
}

Although the value of the tolerance depends on the situation, if you are looking for precision comparasion you could used as tolerance the machine epsilon value, numeric_limits::epsilon() (Library limits). The function returns the difference between 1 and the smallest value greater than 1 that is representable for the data type.
http://msdn.microsoft.com/en-us/library/6x7575x3.aspx
The value of epsilon differs if you are comparing floats or doubles. For instance, in my computer, if comparing floats the value of epsilon is 1.1920929e-007 and if comparing doubles the value of epsilon is 2.2204460492503131e-016.
For a relative comparison between x and y, multiply the epsilon by the maximum absolute value of x and y.
The result above could be multiplied by the ulps (units in the last place) which allows you to play with the precision.
#include <iostream>
#include <cmath>
#include <limits>
template<class T> bool are_almost_equal(T x, T y, int ulp)
{
return std::abs(x-y) <= std::numeric_limits<T>::epsilon() * std::max(std::abs(x), std::abs(y)) * ulp
}

When I need to compare floats, I use code like this
bool same( double a, double b, double error ) {
double x;
if( a == 0 ) {
x = b;
} else if( b == 0 ) {
x = a;
} else {
x = (a-b) / a;
}
return fabs(x) < error;
}

C/C++ rounding up decimals with a certain precision, efficiently

I'm trying to optimize the following. The code bellow does this :
If a = 0.775 and I need precision 2 dp then a => 0.78
Basically, if the last digit is 5, it rounds upwards the next digit, otherwise it doesn't.
My problem was that 0.45 doesnt round to 0.5 with 1 decimalpoint, as the value is saved as 0.44999999343.... and setprecision rounds it to 0.4.
Thats why setprecision is forced to be higher setprecision(p+10) and then if it really ends in a 5, add the small amount in order to round up correctly.
Once done, it compares a with string b and returns the result. The problem is, this function is called a few billion times, making the program craw. Any better ideas on how to rewrite / optimize this and what functions in the code are so heavy on the machine?
bool match(double a,string b,int p) { //p = precision no greater than 7dp
double t[] = {0.2, 0.02, 0.002, 0.0002, 0.00002, 0.000002, 0.0000002, 0.00000002};
stringstream buff;
string temp;
buff << setprecision(p+10) << setiosflags(ios_base::fixed) << a; // 10 decimal precision
buff >> temp;
if(temp[temp.size()-10] == '5') a += t[p]; // help to round upwards
ostringstream test;
test << setprecision(p) << setiosflags(ios_base::fixed) << a;
temp = test.str();
if(b.compare(temp) == 0) return true;
return false;
}

I wrote an integer square root subroutine with nothing more than a couple dozen lines of ASM, with no API calls whatsoever - and it still could only do about 50 million SqRoots/second (this was about five years ago ...).
The point I'm making is that if you're going for billions of calls, even today's technology is going to choke.
But if you really want to make an effort to speed it up, remove as many API usages as humanly possible. This may require you to perform API tasks manually, instead of letting the libraries do it for you. Specifically, remove any type of stream operation. Those are slower than dirt in this context. You may really have to improvise there.
The only thing left to do after that is to replace as many lines of C++ as you can with custom ASM - but you'll have to be a perfectionist about it. Make sure you are taking full advantage of every CPU cycle and register - as well as every byte of CPU cache and stack space.
You may consider using integer values instead of floating-points, as these are far more ASM-friendly and much more efficient. You'd have to multiply the number by 10^7 (or 10^p, depending on how you decide to form your logic) to move the decimal all the way over to the right. Then you could safely convert the floating-point into a basic integer.
You'll have to rely on the computer hardware to do the rest.
<--Microsoft Specific-->
I'll also add that C++ identifiers (including static ones, as Donnie DeBoer mentioned) are directly accessible from ASM blocks nested into your C++ code. This makes inline ASM a breeze.
<--End Microsoft Specific-->

Depending on what you want the numbers for, you might want to use fixed point numbers instead of floating point. A quick search turns up this.

I think you can just add 0.005 for precision to hundredths, 0.0005 for thousands, etc. snprintf the result with something like "%1.2f" (hundredths, 1.3f thousandths, etc.) and compare the strings. You should be able to table-ize or parameterize this logic.

You could save some major cycles in your posted code by just making that double t[] static, so that it's not allocating it over and over.

Try this instead:
#include <cmath>
double setprecision(double x, int prec) {
return
ceil( x * pow(10,(double)prec) - .4999999999999)
/ pow(10,(double)prec);
}
It's probably faster. Maybe try inlining it as well, but that might hurt if it doesn't help.
Example of how it works:
2.345* 100 (10 to the 2nd power) = 234.5
234.5 - .4999999999999 = 234.0000000000001
ceil( 234.0000000000001 ) = 235
235 / 100 (10 to the 2nd power) = 2.35
The .4999999999999 was chosen because of the precision for a c++ double on a 32 bit system. If you're on a 64 bit platform you'll probably need more nines. If you increase the nines further on a 32 bit system it overflows and rounds down instead of up, i. e. 234.00000000000001 gets truncated to 234 in a double in (my) 32 bit environment.

Using floating point (an inexact representation) means you've lost some information about the true number. You can't simply "fix" the value stored in the double by adding a fudge value. That might fix certain cases (like .45), but it will break other cases. You'll end up rounding up numbers that should have been rounded down.
Here's a related article:
http://www.theregister.co.uk/2006/08/12/floating_point_approximation/

I'm taking at guess at what you really mean to do. I suspect you're trying to see if a string contains a decimal representation of a double to some precision. Perhaps it's an arithmetic quiz program and you're trying to see if the user's response is "close enough" to the real answer. If that's the case, then it may be simpler to convert the string to a double and see if the absolute value of the difference between the two doubles is within some tolerance.
double string_to_double(const std::string &s)
{
std::stringstream buffer(s);
double d = 0.0;
buffer >> d;
return d;
}
bool match(const std::string &guess, double answer, int precision)
{
const static double thresh[] = { 0.5, 0.05, 0.005, 0.0005, /* etc. */ };
const double g = string_to_double(guess);
const double delta = g - answer;
return -thresh[precision] < delta && delta <= thresh[precision];
}
Another possibility is to round the answer first (while it's still numeric) BEFORE converting it to a string.
bool match2(const std::string &guess, double answer, int precision)
{
const static double thresh[] = {0.5, 0.05, 0.005, 0.0005, /* etc. */ };
const double rounded = answer + thresh[precision];
std::stringstream buffer;
buffer << std::setprecision(precision) << rounded;
return guess == buffer.str();
}
Both of these solutions should be faster than your sample code, but I'm not sure if they do what you really want.

As far as i see you are checking if a rounded on p points is equal b.
Insted of changing a to string, make other way and change string to double
- (just multiplications and addion or only additoins using small table)
- then substract both numbers and check if substraction is in proper range (if p==1 => abs(p-a) < 0.05)

Old time developers trick from the dark ages of Pounds, Shilling and pence in the old country.
The trick was to store the value as a whole number fo half-pennys. (Or whatever your smallest unit is). Then all your subsequent arithmatic is straightforward integer arithimatic and rounding etc will take care of itself.
So in your case you store your data in units of 200ths of whatever you are counting,
do simple integer calculations on these values and divide by 200 into a float varaible whenever you want to display the result.
I beleive Boost does a "BigDecimal" library these days, but, your requirement for run time speed would probably exclude this otherwise excellent solution.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js