I've been struggling to find a crazy bug in some C++ code and narrowed it down to this small section. I placed it into a simple main.c to debug it and can't figure out why the floating point math is rounding when it shouldn't.
// set up the variables for this simple case
int writep = 672;
float offset = 672.000122;
int bufferSize = 2400;
float bufferSizeF = (float)bufferSize;

float outPointer = (float)writep - offset;     // outPointer: -0.000122070313
if (outPointer < 0.0f) {
    printf("outPointer: %.9f \n", outPointer); // outPointer: -0.000122070313
    outPointer += bufferSizeF;                 // outPointer SHOULD be: 2399.9998779296875
    printf("outpointer: %.9f \n", outPointer); // outPointer: 2400.000000000
}
Someone...please explain. Thanks.
2400.000000000 and 2399.9998779296875 are too close for a standard float to differentiate them. Try this:
#include <iostream>

int main() {
    std::cout << (float)2399.9998779296875 << "\n";
}
It will probably give 2400 as output.
An IEEE 754 single-precision float can only hold about 7 to 8 significant decimal digits. If you need more significant digits, use a double-precision double.
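For example, here is a minimal sketch (variable names borrowed from the question) of the same arithmetic done in double; with roughly 15 to 16 significant digits, the small fractional part survives the addition of 2400:
#include <cstdio>

int main() {
    int writep = 672;
    double offset = 672.000122;   // double instead of float
    double bufferSizeF = 2400.0;

    double outPointer = (double)writep - offset;
    if (outPointer < 0.0) {
        outPointer += bufferSizeF;
        printf("outPointer: %.9f\n", outPointer); // prints roughly 2399.999878000, not 2400.000000000
    }
    return 0;
}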
In the IEEE 754 formats, floating point numbers are not equidistantly distributed along the number line. The density of representable values is much higher around 0 than around 2400: near 2400, consecutive single-precision floats are 0.000244140625 apart (one ulp at that magnitude). 2399.9998779296875 is exactly half an ulp below 2400, so round-to-nearest (ties-to-even) turns it into 2400.
Here is a picture to illustrate it:
https://www.google.fi/search?q=IEEE+754+distribution&biw=1920&bih=895&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj-tKOWkMzPAhUEDywKHRdRAEUQ_AUIBigB#imgrc=rshe5_x1ZXFoKM%3A
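As an illustration (not part of the original answer), std::nextafter can print the gap between consecutive floats at different magnitudes:
#include <cmath>
#include <cstdio>

int main() {
    // Gap between a float and the next representable float above it.
    printf("gap near 0.0001: %g\n", std::nextafterf(0.0001f, 1.0f) - 0.0001f); // about 7.3e-12
    printf("gap near 2400:   %g\n", std::nextafterf(2400.0f, 1e9f) - 2400.0f); // 0.000244140625
    return 0;
}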
I am using Stirling's approximation to calculate this, but I am having trouble storing the huge result it produces. Is there any good way to store such a big number on the device?
The CUDA standard math library supports both the double-precision function tgamma and the single-precision function tgammaf. Since Γ(50) is on the order of 10^62, the result overflows the representable range of the float type, and tgammaf(50.0f) thus returns infinity. However, the computation can proceed without overflow by using the double-precision function tgamma and storing the result in a double variable, e.g.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (int a)
{
    float r = tgammaf ((float)a);
    double rr = tgamma ((double)a);
    printf ("tgammaf(%d) = %23.16e\ntgamma(%d) = %23.16e\n", a, r, a, rr);
}

int main (void)
{
    kernel<<<1,1>>>(50);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
The result of the above program should look like so:
tgammaf(50) = inf
tgamma(50) = 6.0828186403426752e+62
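A side note that is not from the original answer: if even double eventually overflows for larger arguments, a common workaround is to keep the value in log space with lgamma and only exponentiate when the result fits. A minimal host-side sketch:
#include <cmath>
#include <cstdio>

int main() {
    int a = 50;
    // std::lgamma returns log(Gamma(a)), which stays small even when
    // Gamma(a) itself would overflow.
    double log_gamma = std::lgamma(static_cast<double>(a));
    printf("log(Gamma(%d)) = %.16e\n", a, log_gamma);
    printf("Gamma(%d)      = %.16e\n", a, std::exp(log_gamma)); // about 6.08e+62, still fits in a double
    return 0;
}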
I have a cross-platform application. It is an audio application, so it uses sine waves a lot, along with std::sin() and other trigonometric functions.
I noticed that particularly on the iOS platform, the precision of the std::sin() is extremely poor. I wrote the following test:
#include <cmath>
#include <cstdio>

void TestSineZeroCrossings()
{
    const static float kTwoPi = 6.28318530718f;
    const static float epsilon = 1e-5f;

    for (int ii = 0; ii < 10000; ++ii)
    {
        const float difference = std::abs(std::sin(kTwoPi * static_cast<float>(ii)));
        if (difference > epsilon)
            printf("Zero crossing fail, difference: %f\n", difference);
    }
}
On Windows and macOS this passes (i.e. no print-outs), but on iOS it fails on pretty much every iteration. In fact, it only succeeds with an epsilon > 0.004f. That results in clearly audible noise in my application.
Is there a way to tell the compiler to use a better implementation that's not as lossy?
I would assume the implementation is quite accurate.
Your actual problem is that kTwoPi * static_cast<float>(ii) gets rounded to the nearest float. kTwoPi itself is already rounded (the float closest to 2*pi is about 6.2831854820251465), and the product is rounded once more; for ii=10000 it comes out as 62831.85546875.
If you subtract 10000*2*pi in exact math from that, you get approximately +0.0024, and the sine of that value is approximately the same (and not 0). It is "relatively" close to zero but far away from your desired 1e-5 accuracy; a small sketch of this follows below.
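A minimal sketch of the effect, reusing the constant from the question and a double-precision value of 2*pi for comparison:
#include <cmath>
#include <cstdio>

int main() {
    const float  kTwoPi  = 6.28318530718f;
    const double kTwoPiD = 6.283185307179586;  // 2*pi to double precision
    int ii = 10000;

    float  arg_f = kTwoPi * static_cast<float>(ii);  // constant and product both rounded to float
    double arg_d = kTwoPiD * ii;

    printf("float  argument: %.10f  sin: %g\n", arg_f, std::sin(arg_f)); // sin is roughly 0.0024
    printf("double argument: %.10f  sin: %g\n", arg_d, std::sin(arg_d)); // sin is tiny, around 1e-11 or less
    return 0;
}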
If you want to have more accurate values for sin(x*pi), have a look at boost::math::sin_pi:
https://www.boost.org/doc/libs/1_69_0/libs/math/doc/html/math_toolkit/powers/sin_pi.html
If you want more precision, use double or long double rather than float.
For instance,
replace
const static float kTwoPi = 6.28318530718f;
const static float epsilon = 10e-6f;
with
const static double kTwoPi = 6.28318530718;
const static double epsilon = 10e-6;
and
const float difference = std::abs(std::sin(kTwoPi * static_cast<float>(ii)));
with
const double difference = std::abs(std::sin(kTwoPi * ii));
At the risk of repetition, your problem is obviously the use of float rather than double or long double.
You could verify this by doing
cout << kTwoPi << endl ;
and seeing how many digits get printed out and how they compare to your original value.
const static float kTwoPi = 6.28318530718f;
is roughly equivalent to
const static float kTwoPi = 6.283185 ;
on many (most?) systems. Your delta is way too small for a single precision value. Float is useless for most applications because of its usual lack of precision.
I encountered a problem when I tried to calculate the mean of an array in two ways. Below is the code:
float sum1, sum2, tmp, mean1, mean2;
double sum1_double, sum2_double, tmp_double, mean1_double, mean2_double;
int i, j;
int Nt=29040000; //array size
int piecesize=32;
int Npiece=Nt/piecesize;
float* img;
float* d_img;
double* img_double;
img_double = (double*)calloc(Nt, sizeof(double));
cudaHostAlloc((void**)&img, sizeof(float)*Nt, cudaHostAllocDefault);
cudaMalloc((void**)&d_img, sizeof(float)*Nt);
...
//Some calculation is done in GPU and the results are stored in d_img;
...
cudaMemcpy(img, d_img, Nt*sizeof(float), cudaMemcpyDeviceToHost);
for (i=0;i<Nt;i++) img_double[i]=(double)img[i];
//Method 1
sum1=0;
for (i=0;i<Nt;i++)
{ sum1 += img[i]; }
sum1_double=0;
for (i=0;i<Nt;i++)
{ sum1_double += img_double[i]; }
//Method 2
sum2=0;
for (i=0;i<Npiece;i++)
{ tmp=0;
for (j=0;j<piecesize;j++)
{ tmp += img[i*piecesize+j];}
sum2 += tmp;
}
sum2_double=0;
for (i=0;i<Npiece;i++)
{ tmp_double=0;
for (j=0;j<piecesize;j++)
{ tmp_double += img_double[i*piecesize+j];}
sum2_double += tmp_double;
}
mean1=sum1/(float)Nt;
mean2=sum2/(float)Nt;
mean1_double=sum1_double/(double)Nt;
mean2_double=sum2_double/(double)Nt;
cout<<setprecision(15)<<mean1<<endl;
cout<<setprecision(15)<<mean2<<endl;
cout<<setprecision(15)<<mean1_double<<endl;
cout<<setprecision(15)<<mean2_double<<endl;
Output:
132.221862792969
129.565872192383
129.565938340543
129.565938340543
The results obtained from the two methods, mean1=132.2 and mean2=129.6, are significantly different. May I know why?
Thanks a lot in advance!
The reason is that floating point arithmetic is not precise. When you accumulate integers, float becomes imprecise once abs(value) grows larger than 2^24 (I'm assuming IEEE-754 32-bit here). For example, float is incapable of storing 16777217 exactly (it will become 16777216 or 16777218, depending on the rounding mode).
Presumably your second calculation is the more precise one: because of the separate tmp accumulation, each partial sum stays smaller, so less precision is lost.
Change your sum1, sum2, tmp variables to long long int, and hopefully you'll get the same result for both calculations.
Note: I've assumed that your img stores integer data. If it stores floats, then there is no easy way to fix this perfectly. One option is to use double instead of float for sum1, sum2 and tmp; the difference will still be there, but it will be much smaller. There are also techniques to accumulate floats more precisely than simple summation, like Kahan summation (a sketch follows below).
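A minimal sketch of Kahan (compensated) summation, assuming the data sits in a float array as in the question:
#include <cstddef>

// Compensated (Kahan) summation: the running compensation c captures the
// low-order bits that would otherwise be lost when a small term is added
// to a large running sum.
float kahan_sum(const float* data, size_t n) {
    float sum = 0.0f;
    float c = 0.0f;                // running compensation
    for (size_t i = 0; i < n; ++i) {
        float y = data[i] - c;     // apply the compensation to the next term
        float t = sum + y;         // big + small: low-order bits of y are lost here
        c = (t - sum) - y;         // recover what was just lost (algebraically zero)
        sum = t;
    }
    return sum;
}
Note that aggressive floating-point optimizations (e.g. -ffast-math) may rewrite the compensation away, so this needs to be compiled with standard-conforming floating-point settings.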
I am facing a problem using float: in a loop its value gets stuck at 8388608.00.
int count = 0;
long X = 10;
cout.precision(flt::digits10);
cout << "Iterration #" << setw(15) << "Add" << setw(21) << "Mult" << endl;

float Start = 0.0;
float Multiplication = Addition * N;
long i = 1;
for (i; i <= N; i++) {
    float temp = Start + Addition;
    Start = temp;
    count++;
    if (count % X == 0 && count != 0)
    {
        X *= 10;
        cout << i;
        cout << fixed << setw(30) << Start << setw(20) << fixed << i * Addition << endl;
    }
}
What should I do?
Floating point addition doesn't work well when you're adding a (relatively) small number to a (relatively) big one. It's caused by the way float is stored in memory: once the running sum reaches 8388608 (2^23), consecutive float values are 1.0 apart, so adding a term of 0.5 or less simply rounds back to the same value (see the small demonstration below).
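A minimal demonstration of that effect (assuming IEEE-754 single precision):
#include <cstdio>

int main() {
    float f = 8388608.0f;  // 2^23: from here on, consecutive floats are 1.0 apart
    f += 0.25f;            // 8388608.25 is not representable, rounds back to 8388608
    printf("%.2f\n", f);   // prints 8388608.00

    double d = 8388608.0;  // double has a 53-bit significand, so it still resolves the step
    d += 0.25;
    printf("%.2f\n", d);   // prints 8388608.25
    return 0;
}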
You may try replacing single-precision floating point (float) with the double-precision representation (double), but if that doesn't work you'll probably need to implement a hack like this:
// Let's say
double OriginalAddition = 0.123;
int Addition = 1;

// You just use a base math substitution:
// Addition = OriginalAddition
int temp = Start + Addition;  // you transform the floating point accumulation into fixed point,
                              // with step 0.123, so 1 == 0.123
// And when displaying the result (transform back into the original floating point):
printf("%f", (double)result * OriginalAddition);
This needs a lot of thought to find a substitution that doesn't cause data loss, covers the required precision, and won't cause int to overflow. Try googling fixed point arithmetic in C (some results: 1, 2) to get a better idea of what to do.
Say I have a method returning a double, but I want to limit the precision after the decimal point of the value to be returned, and I don't know the value of the double variable in advance.
Example:
double i = 3.365737;
return i;
I want the return value to have a precision of 3 digits after the decimal point, meaning the return value is 3.365.
Another example:
double i = 4644.322345;
return i;
I want the return value to be: 4644.322
What you want is truncation of the decimal digits after a certain position. You can do that easily by scaling, truncating, and scaling back, using functions from <math.h> (or <cmath> if you're using C++):
double TruncateNumber(double In, unsigned int Digits)
{
    double f = pow(10, Digits);
    return ((int)(In * f)) / f;  // note: the (int) cast limits this to values where In*f fits in an int
}
Still, I think that in some cases you may get strange results (the last digit being off by one) due to how floating point works internally.
On the other hand, most of the time you just pass the double around as is and truncate it only when outputting it to a stream, which is done automatically with the right stream flags (see the sketch below).
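For instance, a minimal sketch of formatting at output time with stream flags (note that this rounds rather than truncates):
#include <iostream>
#include <iomanip>

int main() {
    double i = 4644.322345;
    // std::fixed + std::setprecision control the digits printed after the
    // decimal point; the stored double itself is left untouched.
    std::cout << std::fixed << std::setprecision(3) << i << "\n";  // prints 4644.322
    return 0;
}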
You are going to need to take care with the borderline cases. Any implementation based solely on pow and casting, or on fmod, will occasionally give wrong results, particularly an implementation based on pow(-PRECISION).
The safest bet is to implement something that neither C nor C++ provides: a fixed point arithmetic capability. Lacking that, you will need to find the representations of the pertinent borderline cases. This question is similar to the question on how Excel does rounding. Adapting my answer there, How does Excel successfully Rounds Floating numbers even though they are imprecise?, to this problem:
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
    double result = 1.0;
    double base = 10.0;
    while (exponent > 0) {
        if ((exponent & 1) != 0) result *= base;
        exponent >>= 1;
        base *= base;
    }
    return result;
}

// Truncate number to some precision.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double truncate (double x, int nplaces) {
    bool is_neg = false;

    // Things will be easier if we only have to deal with positive numbers.
    if (x < 0.0) {
        is_neg = true;
        x = -x;
    }

    // Construct the supposedly truncated value (round down) and the nearest
    // truncated value above it.
    double round_down, round_up;
    if (nplaces < 0) {
        double scale = pow10 (-nplaces);
        round_down = std::floor (x / scale);
        round_up   = (round_down + 1.0) * scale;
        round_down *= scale;
    }
    else {
        double scale = pow10 (nplaces);
        round_down = std::floor (x * scale);
        round_up   = (round_down + 1.0) / scale;
        round_down /= scale;
    }

    // Usually the round_down value is the desired value.
    // On rare occasions it is the rounded-up value that is.
    // This is one of those cases where you do want to compare doubles by ==.
    if (x != round_up) x = round_down;

    // Correct the sign if needed.
    if (is_neg) x = -x;

    return x;
}
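A hypothetical usage example (not part of the original answer), showing one of the borderline cases the round_up comparison exists for; it assumes pow10() and truncate() from above are in scope:
#include <cstdio>

int main() {
    // 0.29 is not exactly representable, and 0.29 * 100 evaluates to slightly
    // less than 29 in double, so a naive floor-based truncation would yield 0.28.
    printf("%.2f\n", truncate(0.29, 2));         // prints 0.29
    printf("%.3f\n", truncate(4644.322345, 3));  // prints 4644.322
    return 0;
}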
You cannot "remove" precision from a double. You could have: 4644.322000. It's a different number but the precision is the same.
As #David Heffernan said do it when you convert it to a string for display.
If you want to truncate your double to n decimal places, you can use this function:
#include <cmath>

double truncate_to_places(double d, int n) {
    return d - fmod(d, pow(10.0, -n));
}
Instead of multiplying and dividing by powers of 10 like the other answers, you can use the fmod function to find the digits after the precision you want, and then subtract to remove them.
#include <math.h>

#define PRECISION 0.001

double truncate(double x) {
    x -= fmod(x, PRECISION);
    return x;
}
There is no good way to do this with plain doubles, but you can write a class, or simply a struct like
struct lim_prec_float {
    float value;
    int precision;
};
then have your function
lim_prec_float testfn() {
    double i = 3.365737;
    return lim_prec_float{static_cast<float>(i), 4};  // the cast avoids a narrowing error in the braced init
}
(4 = 1 digit before the point + 3 after. This uses a C++11 braced initializer list; it would be better if lim_prec_float were a class with proper constructors.)
When you now want to output the variable, do this with a custom
#include <iomanip>
#include <sstream>

std::ostream &operator<<(std::ostream &tgt, const lim_prec_float &v) {
    std::stringstream s;
    s << std::setprecision(v.precision) << v.value;
    return (tgt << s.str());
}
Now you can, for instance,
int main() {
    std::cout << testfn() << std::endl
              << lim_prec_float{4644.322345, 7} << std::endl;
    return 0;
}
which will output
3.366
4644.322
This is because std::setprecision rounds to the desired number of significant digits, which is likely what you really want. If you actually mean truncation, you can modify the operator<< with one of the truncation functions given in the other answers.
Just as you format a date before displaying it, you should format a double only when displaying it.
However, here are two approaches I have used for rounding.
double roundTo3Places(double d) {
    return round(d * 1000) / 1000.0;
}

double roundTo3Places(double d) {
    return (long long) (d * 1000 + (d > 0 ? 0.5 : -0.5)) / 1000.0;
}
The latter is faster; however, the numbers cannot be larger than about 9e15 (the limit of exactly representable integers in a double, 2^53). A small usage sketch follows below.
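A small usage sketch (with inputs taken from the earlier question) contrasting this rounding with the truncation originally asked about; it duplicates the first helper so it is self-contained:
#include <cmath>
#include <cstdio>

double roundTo3Places(double d) {
    return std::round(d * 1000) / 1000.0;
}

int main() {
    // Note: the question's first example wanted truncation (3.365); rounding gives 3.366.
    printf("%.3f\n", roundTo3Places(3.365737));     // prints 3.366
    printf("%.3f\n", roundTo3Places(4644.322345));  // prints 4644.322
    return 0;
}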