float value issue

float value issue - c++

I am facing a problem using float:
in a loop, its value gets stuck at 8388608.00
int count=0;
long X=10;
cout.precision(flt::digits10);
cout<<"Iterration #"<<setw(15)<<"Add"<<setw(21)<<"Mult"<<endl;
float Start=0.0;
float Multiplication = Addition * N;
long i = 1;
for (i; i <= N; i++){
float temp = Start + Addition;
Start=temp;
count++;
if(count%X==0 && count!=0)
{
X*=10;
cout<<i;
cout<<fixed<<setw(30)<<Start<<setw(20)<<fixed<<i*Addition<<endl;
}
}
What should I do?

Floating-point addition loses the smaller operand when you add a (relatively) small number to a (relatively) big one. This follows from how a float is stored in memory: it has a 24-bit significand, so at 8388608 (2^23) adjacent values are 1.0 apart, and an increment of 0.5 or less is rounded away.
You may try replacing the single-precision type (float) with the double-precision type (double), but if that doesn't work you'll probably need to implement a hack like this:
// Let's say the real step is:
double OriginalAddition = 0.123;
// Use a change of base: one integer unit stands for one step of 0.123
// (a fixed-point representation, so 1 = 0.123).
long long Start = 0;
int Addition = 1;
Start += Addition; // exact integer arithmetic, nothing is rounded away
// When displaying the result, transform back into the original floating point:
printf("%f\n", (double)Start * OriginalAddition);
This needs a lot of thought to find a substitution that doesn't cause data loss, covers the required precision and won't cause the int to overflow. Search for "fixed point int C" to get a better idea of what to do.
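A minimal sketch of the effect and of the integer-count workaround, assuming an increment of 0.5 (the question does not show the value of Addition). The function names accumulate_float and accumulate_by_count are made up for this illustration.

```cpp
#include <cstdint>

// Naive accumulation in float. Once the sum reaches 2^23 = 8388608 the
// spacing between adjacent floats is 1.0, so adding 0.5 no longer changes it.
float accumulate_float(float step, std::int64_t n) {
    float sum = 0.0f;
    for (std::int64_t i = 0; i < n; ++i) {
        sum += step;
    }
    return sum;
}

// Fixed-point style: count the steps exactly in an integer and convert to
// floating point only once, at the end.
double accumulate_by_count(double step, std::int64_t n) {
    std::int64_t count = 0;
    for (std::int64_t i = 0; i < n; ++i) {
        count += 1;  // exact integer arithmetic
    }
    return static_cast<double>(count) * step;
}
```

With a step of 0.5 and 20 million iterations, accumulate_float stops growing at 8388608 while accumulate_by_count returns 10000000.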

Related

calculating the mean of an array c++

I encountered a problem when I tried to calculate the mean of an array in two ways. Below is the code:
float sum1, sum2, tmp, mean1, mean2;
double sum1_double, sum2_double, tmp_double, mean1_double, mean2_double;
int i, j;
int Nt=29040000; //array size
int piecesize=32;
int Npiece=Nt/piecesize;
float* img;
float* d_img;
double* img_double;
img_double = (double*)calloc(Nt, sizeof(double));
cudaHostAlloc((void**)&img, sizeof(float)*Nt, cudaHostAllocDefault);
cudaMalloc((void**)&d_img, sizeof(float)*Nt);
...
//Some calculation is done in GPU and the results are stored in d_img;
...
cudaMemcpy(img, d_img, Nt*sizeof(float), cudaMemcpyDeviceToHost);
for (i=0;i<Nt;i++) img_double[i]=(double)img[i];
//Method 1
sum1=0;
for (i=0;i<Nt;i++)
{ sum1 += img[i]; }
sum1_double=0;
for (i=0;i<Nt;i++)
{ sum1_double += img_double[i]; }
//Method 2
sum2=0;
for (i=0;i<Npiece;i++)
{ tmp=0;
for (j=0;j<piecesize;j++)
{ tmp += img[i*piecesize+j];}
sum2 += tmp;
}
sum2_double=0;
for (i=0;i<Npiece;i++)
{ tmp_double=0;
for (j=0;j<piecesize;j++)
{ tmp_double += img_double[i*piecesize+j];}
sum2_double += tmp_double;
}
mean1=sum1/(float)Nt;
mean2=sum2/(float)Nt;
mean1_double=sum1_double/(double)Nt;
mean2_double=sum2_double/(double)Nt;
cout<<setprecision(15)<<mean1<<endl;
cout<<setprecision(15)<<mean2<<endl;
cout<<setprecision(15)<<mean1_double<<endl;
cout<<setprecision(15)<<mean2_double<<endl;
Output:
132.221862792969
129.565872192383
129.565938340543
129.565938340543
The results obtained from the two methods, mean1=132.2 and mean2=129.6, are significantly different. May I know why?
Thanks a lot in advance!

The reason is that floating point arithmetic is not precise. When you accumulate integers, float becomes imprecise when abs(value) becomes larger than 2^24 (I'm assuming IEEE-754 32-bit here). For example, float cannot store 16777217 precisely (it will become 16777216 or 16777218, depending on the rounding mode).
Presumably your second calculation is the more precise one, because less precision is lost thanks to the separate tmp accumulation.
Change your sum1, sum2, tmp variables to long long int, and hopefully you'll get the same result for both calculations.
Note: I've assumed that your img stores integer data. If it stores floats, then there is no easy way to fix this perfectly. One way is to use double instead of float for sum1, sum2 and tmp. The difference will still be there, but it will be much smaller. There are also techniques that accumulate floats more precisely than simple summing, such as Kahan summation.

C++: Strange floating point math results

I've been struggling to find a crazy bug in some C++ code and narrowed it down to this small section. I placed it into a simple main.c to debug it and can't figure out why the floating point math is rounding when it shouldn't.
// setup the variables for this simple case
int writep = 672;
float offset = 672.000122;
int bufferSize = 2400;
float bufferSizeF = (float)bufferSize;
float outPointer = (float)writep - offset; // outPointer: -0.000122070313
if(outPointer < 0.0f){
printf("outPointer: %.9f \n", outPointer); // outPointer: -0.000122070313
outPointer += bufferSizeF; // outPointer SHOULD be: 2399.9998779296875
printf("outpointer: %.9f \n", outPointer); // outPointer: 2400.000000000
}
Someone...please explain. Thanks.

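2400.000000000 and 2399.9998779296875 are too close for a standard float to differentiate them. Between 2048 and 4096 adjacent floats are 2^-12 (about 0.000244) apart, and 2399.9998779296875 is 2400 - 2^-13, exactly halfway between two representable values, so it is rounded to 2400. Try this: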
#include<iostream>
int main() {
std::cout << (float)2399.9998779296875 << "\n";
}
It will probably give 2400 as output.
An IEEE 754 single precision float can only hold about 7 to 8 significant decimal digits. If you need more significant digits, use a double.

In the IEEE 754 standard, floating point numbers are not equidistantly distributed over the number axis. The density of floating point values is higher around 0 than around 2400, which is why the rounding happens when the value is around 2400.
Here is a picture to illustrate it:
https://www.google.fi/search?q=IEEE+754+distribution&biw=1920&bih=895&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj-tKOWkMzPAhUEDywKHRdRAEUQ_AUIBigB#imgrc=rshe5_x1ZXFoKM%3A
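The uneven spacing can also be measured directly. The sketch below uses std::nextafter from <cmath>; the helper name spacing_above is made up for this illustration.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float spacing_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

On an IEEE 754 platform, spacing_above(2400.0f) is 2^-12 (about 0.000244), while spacing_above(1.0f) is 2^-23 (about 0.00000012), so a change of 0.000122 is representable near 1 but not near 2400.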

Weird behaviour in a for loop changing the results

I've got a weird problem in my code.
Here's the context: in my method I create an object and then fill the (int) buffer of this object with data in two for loops.
The problem is that when I insert a printf in my loop to look at the data going into my buffer, it changes the data in the buffer.
In other words, the result in the buffer differs depending on whether there is a printf inside the loop.
Here's my code, maybe it helps to understand:
bool Mod::Realiser(FFTResult * inputdata,FFTSample_s * & moduleData){
bool done = true;
float module;
unsigned int r,n;
moduleData = new FFTSample_s(NbPointsSample);
unsigned int limit = NbPointsSample >> 1;
int iGain= 0;
for (n = CentrageFFT, r = 0; r < limit; n++, r++)
{
module = inputdata->buffer[n][0] * inputdata->buffer[n][0] + inputdata->buffer[n][1] * inputdata->buffer[n][1];
// printf(" M = %lf\n",module);
moduleData->buffer[r] = (int)(10.0*log10(module)) + iGain;
}
for (n = 0; n < limit; n++, r++)
{
module = inputdata->buffer[n][0] * inputdata->buffer[n][0] + inputdata->buffer[n][1] * inputdata->buffer[n][1];
moduleData->buffer[r] = (int)(10.0*log10(module)) + iGain;
}
/* for (int i=0;i<2048;i++){
printf(" X = %lf \n",inputdata->buffer[i][0]);
printf(" Y = %lf \n",inputdata->buffer[i][1]);
printf(" M = %d\n",moduleData->buffer[i]);
}*/
return done;
}

This is normal behavior. See What Every Computer Scientist Should Know about Floating-Point Arithmetic. By passing a floating point number to printf, you probably force the implementation to convert it into canonical float form from an internal form that happens to have higher precision.
The results can be different. There is not one and only one right answer.
Also:
"The result of a + b is stored in a temporary destination of unspecified precision. Neither the C++ or IEEE standards mandate what precision intermediate calculations are done to and this intermediate precision will affect your results. The temporary result could equally easily be stored in a float or a double and there are significant advantages to both options. " - Floating-Point Determinism
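One way to make such a conversion less sensitive to the intermediate precision is sketched below. It is an assumption about what would help here, not the original poster's code: the helper name module_to_db is hypothetical, the arithmetic is done explicitly in double, and the result is rounded to nearest instead of truncated, so a value such as 19.9999999 versus 20.0000001 no longer flips the integer. Note that rounding changes the results compared with the (int) cast in the question.

```cpp
#include <cmath>

// Hypothetical helper: converts one complex FFT bin to an integer dB value.
// Assumes the squared magnitude is strictly positive (log10(0) is -infinity).
int module_to_db(float re, float im, int gain) {
    // Do the arithmetic explicitly in double rather than leaving the
    // intermediate precision up to the compiler.
    const double module = static_cast<double>(re) * re +
                          static_cast<double>(im) * im;
    // Round to the nearest integer instead of truncating toward zero.
    return static_cast<int>(std::lround(10.0 * std::log10(module))) + gain;
}
```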

How can you convert a std::bitset<64> to a double?

Is there a way to convert a std::bitset<64> to a double without using any external library (Boost, etc.)? I am using a bitset to represent a genome in a genetic algorithm and I need a way to convert a set of bits to a double.

The C++11 road:
union Converter { uint64_t i; double d; };
double convert(std::bitset<64> const& bs) {
Converter c;
c.i = bs.to_ullong();
return c.d;
}
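EDIT: As noted in the comments, reading the inactive member of a union is undefined behavior in C++, so we can copy the bytes through char* aliasing instead, whose result is unspecified rather than undefined.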
#include <bitset>
#include <cstdint>
#include <cstring>
double convert(std::bitset<64> const& bs) {
static_assert(sizeof(uint64_t) == sizeof(double), "Cannot use this!");
uint64_t const u = bs.to_ullong();
double d;
// Aliases to `char*` are explicitly allowed in the Standard (and only them)
char const* cu = reinterpret_cast<char const*>(&u);
char* cd = reinterpret_cast<char*>(&d);
// Copy the bitwise representation from u to d
memcpy(cd, cu, sizeof(u));
return d;
}
C++11 is still required for to_ullong.

Most people are trying to provide answers that let you treat the bit-vector as though it directly contained an encoded int or double.
I would advise you to avoid that approach completely. While it does "work" for some definition of working, it introduces Hamming cliffs all over the place. You usually want your encoding to arrange things so that if two decoded values are near to one another, then their encoded values are near to one another as well. It also forces you to use 64 bits of precision.
I would manage the conversion manually. Say you have three variables to encode, x, y, and z. Your domain expertise can be used to say, for example, that -5 <= x < 5, 0 <= y < 100, and 0 <= z < 1, where you need 8 bits of precision for x, 12 bits for y, and 10 bits for z. This gives you a total search space of only 30 bits. You can have a 30-bit string, treat the first 8 bits as encoding x, the next 12 as y, and the last 10 as z. You are also free to Gray code each one to remove the Hamming cliffs.
I've personally done the following in the past:
inline void binary_encoding::encode(const vector<double>& params)
{
unsigned int start=0;
for(unsigned int param=0; param<params.size(); ++param) {
// m_bpp[i] = number of bits in encoding of parameter i
unsigned int num_bits = m_bpp[param];
// map the double onto the appropriate integer range
// m_range[i] is a pair of (min, max) values for ith parameter
pair<double,double> prange=m_range[param];
double range=prange.second-prange.first;
double max_bit_val=pow(2.0,static_cast<double>(num_bits))-1;
int int_val=static_cast<int>((params[param]-prange.first)*max_bit_val/range+0.5);
// convert the integer to binary
vector<int> result(m_bpp[param]);
for(unsigned int b=0; b<num_bits; ++b) {
result[b]=int_val%2;
int_val/=2;
}
if(m_gray) {
for(unsigned int b=0; b<num_bits-1; ++b) {
result[b]=!(result[b]==result[b+1]);
}
}
// insert the bits into the correct spot in the encoding
copy(result.begin(),result.end(),m_genotype.begin()+start);
start+=num_bits;
}
}
inline void binary_encoding::decode()
{
unsigned int start = 0;
// for each parameter
for(unsigned int param=0; param<m_bpp.size(); param++) {
unsigned int num_bits = m_bpp[param];
unsigned int intval = 0;
if(m_gray) {
// convert from gray to binary
vector<int> binary(num_bits);
binary[num_bits-1] = m_genotype[start+num_bits-1];
intval = binary[num_bits-1];
for(int i=num_bits-2; i>=0; i--) {
binary[i] = !(binary[i+1] == m_genotype[start+i]);
intval += intval + binary[i];
}
}
else {
// convert from binary encoding to integer
for(int i=num_bits-1; i>=0; i--) {
intval += intval + m_genotype[start+i];
}
}
// convert from integer to double in the appropriate range
pair<double,double> prange = m_range[param];
double range = prange.second - prange.first;
double m = range / (pow(2.0,double(num_bits)) - 1.0);
// m_phenotype is a vector<double> containing all the decoded parameters
m_phenotype[param] = m * double(intval) + prange.first;
start += num_bits;
}
}
Note that for reasons that probably don't matter to you, I wasn't using bit vectors, just an ordinary vector<int> to encode things. And of course, there's a bunch of stuff tied into this code that isn't shown here, but you can probably get the basic idea.
One other note, if you're doing GPU calculations or if you have a particular problem such that 64 bits are the appropriate size anyway, it may be worth the extra overhead to stuff everything into native words. Otherwise, I would guess that the overhead you add to the search process will probably overwhelm whatever benefits you get by faster encoding and decoding.

Edit: I've decided that I was being a bit silly with this. While you do end up with a double, it assumes that the bitset holds an integer, which is a big assumption to make. You will end up with a predictable and repeatable value per bitset, but I still don't think that this is what the author intended.
If you iterate over the bit values and do
output_double += pow( 2, 64-(bit_position+1) ) * bit_value;
that would work, as long as the bit order is big-endian.

Returning double with precision

Say I have a method returning a double, but I want to control the precision after the decimal point of the value to be returned. I don't know the value of the double variable in advance.
Example:
double i = 3.365737;
return i;
I want the return value to have a precision of 3 digits after the decimal point.
Meaning: the return value is 3.365.
Another example:
double i = 4644.322345;
return i;
I want the return value to be: 4644.322

What you want is truncation of the decimal digits after a certain position. You can do that with the trunc function from <math.h> (or std::trunc from <cmath> if you're using C++):
#include <cmath>
double TruncateNumber(double In, unsigned int Digits)
{
double f = std::pow(10.0, Digits);
// std::trunc drops the fractional part without the overflow risk of an (int) cast
return std::trunc(In * f) / f;
}
Still, I think that in some cases you may get strange results (the last digit being off by one) due to how floating point works internally.
On the other hand, most of the time you just pass the double around as is and shorten it only when outputting it on a stream, which is done automatically with the right stream flags.
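Formatting at output time can be sketched like this. The helper name format_fixed is made up for the illustration; note that the stream rounds to the requested number of places rather than truncating.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a double with a fixed number of digits after the decimal point.
// The stored value is untouched; only its textual form is shortened.
std::string format_fixed(double value, int places) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(places) << value;
    return out.str();
}
```

With the values from the question, format_fixed(3.365737, 3) yields "3.366" (rounded, not truncated) and format_fixed(4644.322345, 3) yields "4644.322".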

You are going to need to take care with the borderline cases. Any implementation based solely on pow and casting or fmod will occasionally give wrong results, particularly an implementation based on pow(10, -PRECISION).
The safest bet is to implement something that neither C nor C++ provides: a fixed point arithmetic capability. Lacking that, you will need to find the representations of the pertinent borderline cases. This question is similar to the question on how Excel does rounding. Here is my answer there, How does Excel successfully Rounds Floating numbers even though they are imprecise?, adapted to this problem:
#include <cmath>
// Compute 10 to some non-negative integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Truncate number to some precision.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double truncate (double x, int nplaces) {
bool is_neg = false;
// Things will be easier if we only have to deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the supposedly truncated value (round down) and the nearest
// truncated value above it.
double round_down, round_up;
if (nplaces < 0) {
double scale = pow10 (-nplaces);
round_down = std::floor (x / scale);
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
else {
double scale = pow10 (nplaces);
round_down = std::floor (x * scale);
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
// Usually the round_down value is the desired value.
// On rare occasions it is the rounded-up value that is.
// This is one of those cases where you do want to compare doubles by ==.
if (x != round_up) x = round_down;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}

You cannot "remove" precision from a double. You could have: 4644.322000. It's a different number but the precision is the same.
As @David Heffernan said, do it when you convert it to a string for display.

If you want to truncate your double to n decimal places, you can use this function:
#include <cmath>
double truncate_to_places(double d, int n) {
return d - fmod(d, pow(10.0, -n));
}

Instead of multiplying and dividing by powers of 10 like the other answers, you can use the fmod function to find the digits after the precision you want, and then subtract to remove them.
#include <math.h>
#define PRECISION 0.001
double truncate(double x) {
x -= fmod(x,PRECISION);
return x;
}

There is no good way to do this with plain doubles, but you can write a class or simply struct like
struct lim_prec_float {
double value; // double, so that brace-initializing from a double is not a narrowing conversion
int precision;
};
then have your function
lim_prec_float testfn() {
double i = 3.365737;
return lim_prec_float{i, 4};
}
(4 = 1 before point + 3 after. This uses a C++11 initialization list, it would be better if lim_prec_float was a class with proper constructors.)
When you now want to output the variable, do this with a custom
std::ostream &operator<<(std::ostream &tgt, const lim_prec_float &v) {
std::stringstream s;
s << std::setprecision(v.precision) << v.value;
return (tgt << s.str());
}
Now you can, for instance,
int main() {
std::cout << testfn() << std::endl
<< lim_prec_float{4644.322345, 7} << std::endl;
return 0;
}
which will output
3.366
4644.322
This is because std::setprecision rounds to the desired number of places, which is likely what you really want. If you actually mean truncation, you can modify the operator<< to use one of the truncation functions given in the other answers.

Just as you format a date before displaying it, you should format a double before displaying it.
However, here are two approaches I have used for rounding.
double roundTo3Places(double d) {
return round(d * 1000) / 1000.0;
}
double roundTo3Places(double d) {
return (long long) (d * 1000 + (d > 0 ? 0.5 : -0.5)) / 1000.0;
}
The latter is faster; however, the numbers cannot be larger than 9e15.
