I'm trying to convert a double with 4 decimal places into a quint32, but when I iterate over the list, the resulting values are different.
I added a breakpoint at the first iteration to inspect the variables. How can I make "i" be 112778?
EDIT:
This is the code:
QList<double> list;
list << 11.2778;
list << 11.3467;
list << 11.3926;
list << 11.4531;
list << 11.4451;
list << 11.4625;
list << 11.4579;
list << 11.4375;
list << 11.4167;
list << 11.6285;
list << 11.5625;
list << 11.4427;
list << 11.4278;
list << 11.4063;
list << 11.2500;
for(double value : list)
{
double v = value * 10000;
quint32 i = v;
qDebug() << v << i;
}
I was expecting the numbers to be converted to quint32 without a fractional part, but that's not the result.
This is just a question of floating point precision in C++, and there are a lot of existing SO questions on the topic. The problem arises because 11.2778 * 10000 might not be calculated as exactly 112778; the result might be 112777.999999999 or similar. Converting to an integer type doesn't round to the nearest integer, it just truncates everything after the decimal point. That's how you can end up with 112777. To fix this, you can simply force it to round:
for(double value : list)
{
double v = value * 10000;
quint32 i = qRound(v); // Round the double to get the best int
qDebug() << value << v << i;
}
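As a minimal sketch of the same fix without Qt, assuming C++11 or later: std::llround rounds to the nearest integer, where a plain conversion truncates. The helper name toFixed4 is hypothetical, and the sketch assumes non-negative inputs that fit in 32 bits after scaling.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: scale a value with 4 decimal places to an integer.
// std::llround rounds to the nearest integer; a plain cast would truncate.
// Assumes value is non-negative and value * 10000 fits in 32 bits.
std::uint32_t toFixed4(double value)
{
    return static_cast<std::uint32_t>(std::llround(value * 10000.0));
}
```

Calling toFixed4 on each element of the list replaces the truncating assignment quint32 i = v.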
I printed value as well, like this:
qDebug() << value << v << i;
The output is:
11.2778 112778 112777
11.3467 113467 113467
11.3926 113926 113926
11.4531 114531 114530
11.4451 114451 114451
11.4625 114625 114625
11.4579 114579 114579
11.4375 114375 114375
11.4167 114167 114167
11.6285 116285 116285
11.5625 115625 115625
11.4427 114427 114427
11.4278 114278 114278
11.4063 114063 114063
11.25 112500 112500
Do you mean that the last digit is different? If so, that can happen because decimal numbers are not held in memory digit by digit; they are stored in binary, so most decimal fractions are only approximated.
When I input a decimal number like 12.45, it gets reduced by 0.00001 or so, which causes my function to misbehave.
For example:
If x is 12.45 and div is 0.1, watching x shows that it becomes 12.449999999.
But if x is 12.455 and div is 0.01, x is not reduced.
double round(double x, double div){
if (div != 0){
double temp2 = 0;
int temp = x;
double dec = x - temp;
dec = dec/div;
temp = dec;
temp2 = dec-temp;
temp2 = temp2 * div;
cout << x << endl << endl;
if (temp2 >= div/2){
x+=(div-temp2);
}else{
x-=temp2;
}
cout << temp << " " << dec << " " << x << " " << temp2 << " " << div/2;
return x;
}else{
cout << "div can't be equal to zero" << endl;
return x; // every path must return a value; hand x back unchanged
}
}
I was trying to write a function that rounds decimal numbers to a given step. I know it's probably not the best way to do it, but it works except for the problem I described earlier.
To fix it I tried limiting the decimal places of the input, which didn't work. I also tried other methods instead of the double/int combination, without any success.
I expect an output of 12.5 when x is 12.45 and div is 0.1, but I don't get it because about 0.000001 of the input gets lost.
Your program is going to be misinformed about its input and will not work as written.
This is how floating point values are handled in programming languages, as defined in this standard:
https://en.wikipedia.org/wiki/IEEE_754#Basic_formats
The result of an operation often has to be rounded to fit within the finite representation, which makes floating point values difficult to compare.
https://www.boost.org/doc/libs/1_36_0/libs/test/doc/html/utf/testing-tools/floating_point_comparison.html
The issues you are seeing are artifacts of the rounding error.
https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf
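A minimal sketch of comparing with a tolerance instead of exact equality, in the spirit of the Boost page linked above. The helper name nearlyEqual and the default tolerance of 1e-9 are assumptions for illustration, not part of any library.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most a relative tolerance.
// The default of 1e-9 is an arbitrary assumption; tune it to your data.
bool nearlyEqual(double a, double b, double relTol = 1e-9)
{
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * scale;
}
```

With this, 12.45 and the 12.449999999 seen in the debugger compare as equal, whereas == may not.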
OK, this time I have a really weird bug that does not always appear. Here is the function that contains the problem. All it does is sum the elements of a vector. It works in most cases, but in a few cases it becomes highly problematic.
int sumvec(vect v) {
int i = 0;
int sum = 0;
int l = v.length();
std::cout << "Sum of vector " << v.tostring(3) << std::endl;
for (; i < l; i++) {
sum += v[i];
std::cout << v[i] << " " << sum << " ";
}
std::cout << std::endl;
return sum;
}
Here vect is defined using typedef alglib::real_1d_array vect;. OK so what do I get? Huh..
Sum of vector [1.000,1.000,0.000,1.000,1.000,1.000]
1 0 1 0 0 0 1 0 1 0 1 1
What?!!!!!
As your sum variable is an integer you may not get the expected results when summing elements in your vector which are not integers.
If your elements have the value of 0.999999999 rather than 1.00000 then printing them could be rounded to 1.00000 but when you add them to an integer the value will be truncated to 0.
Judging by the provided output, all of your values are less than 1 apart from the last one, which is greater than or equal to 1.
There are 2 possible solutions:
Change the type of sum to be float or double.
Change your calculation to: sum += static_cast<int>(round( v[i] ));
Note that your compiler was probably warning about the truncation of a double to an integer. Pay attention to compiler warnings, they often indicate a bug.
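A minimal sketch of the difference, using std::vector<double> as a stand-in for alglib::real_1d_array (an assumption, so the example compiles without ALGLIB). The function names are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Mirrors the question: adding a double to an int sum truncates toward zero.
int sumTruncating(const std::vector<double>& v)
{
    int sum = 0;
    for (double x : v)
        sum = static_cast<int>(sum + x);
    return sum;
}

// Second suggestion: round each element to the nearest integer before adding.
int sumRounding(const std::vector<double>& v)
{
    int sum = 0;
    for (double x : v)
        sum += static_cast<int>(std::lround(x));
    return sum;
}

// First suggestion: keep the running sum in a double.
double sumDouble(const std::vector<double>& v)
{
    double sum = 0.0;
    for (double x : v)
        sum += x;
    return sum;
}
```

For the values {0.9999999, 0.9999999, 0.0, 0.9999999, 0.9999999, 1.0}, sumTruncating returns 1, matching the final sum in the question's output, while sumRounding returns 5.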
As commented, use a double to store the sum if you are working with floating point numbers. Using an int causes each result to be implicitly converted to int, which just cuts off the fractional part:
0.9999998 -> 0
Depending on the stream's precision (std::cout.precision()), 0.99999 would be printed as 1.0000 (rounded) with std::fixed, or just as 1 without it, which is probably what happens in your example.
double a = 0.999;
std::cout.precision(2);
std::cout << a << std::endl; /* This prints 1 */
std::cout << std::fixed;
std::cout << a << std::endl; /* This prints 1.00 */
I have been writing code to produce a horizontal histogram. This program takes user input of any range of numbers into a vector. Then it asks the user for the lowest value they want the histogram to begin at, and how big they want each bin to be. For example:
if lowestValue = 1 and binSize = 20
and vector is filled with values {1, 2, 3, 20, 30, 40, 50} it would print something like:
(bin)    (bars)  (num) (percent)
[ 1-21)  ####    4     57%
[21-41)  ##      2     28%
[41-61)  #       1     14%
Here is most of the code that does so:
void printHistogram(int lowestValue, int binSize, vector<double> v)
{
int binFloor = lowestValue, binCeiling = 0;
int numBins = amountOfBins(binSize, (int)range(v));
for (int i = 0; i<=numBins; i++)
{
binCeiling = binFloor+binSize;
int amoInBin = amountInBin(v,binFloor, binSize);
double perInBin = percentInBin(v, amoInBin);
if (binFloor < 10)
{
cout << "[ " << binFloor << '-' << binCeiling << ") " << setw(20) << left << formatBars(perInBin) << ' ' << amoInBin << ' '<< setprecision(4) << perInBin << '%' << endl;
binFloor += binSize;
}
else
{
cout << '[' << binFloor << '-' << binCeiling << ") " << setw(20) << left << formatBars(perInBin) << ' ' << amoInBin << ' '<< setprecision(4) << perInBin << '%' << endl;
binFloor += binSize;
}
}
}
and the function that counts how many terms are in each bin:
int amountInBin(vector<double> v, int lowestBinValue, int binSize)
{
int count = 0;
for (size_t i; i<v.size(); i++)
{
if (v[i] >= lowestBinValue && v[i] < (lowestBinValue+binSize))
count += 1;
}
return count;
}
Now my issue:
For some reason, it is not counting values between 20 and 40, at least as far as I can see from my testing.
Any help is appreciated.
I would suggest a different approach. Making two passes, the first to calculate the number of bins and the second to count the values, looks fragile and error-prone. It's not really a surprise to see you chasing a bug of this kind. I think your original approach is too complicated.
As the saying goes, "the more you overthink the plumbing, the easier it is to stop up the drain". Find the simplest way to do something, and it will have the fewest surprises and gotchas to deal with.
I think it's simpler to make a single pass over the values, calculating which bin each value belongs to, and counting the number of values seen per bin. Let's use a std::map, keyed by bin number, with the value being the number of values in each bin.
void printHistogram(int lowestValue, int binSize, const std::vector<double> &v)
{
std::map<int, size_t> histogram;
for (auto value:v)
{
int bin_number= value < lowestValue ? 0:(value-lowestValue)/binSize;
++histogram[bin_number];
}
And ...that's it. histogram is now your histogram. histogram[0] is now the number of values in the first bin, [lowestValue, lowestValue+binSize), which also includes all values less than lowestValue. histogram[1] will be the number of values found for the next bin, and so on.
Now, you just have to iterate over the histogram map, and generate your actual histogram.
Now, the tricky part here is that the histogram map will only include keys for which at least 1 value was found. If no value was dropped into the bin, the map will not include the bin number. So, if there were no values in the first bin, histogram[0] won't even exist, the first value in the map will be the bin for the lowest value in the vector.
This isn't such a difficult problem to solve, by iterating over the map with a little bit of extra intelligence:
int next_bin_number=0;
for (auto b=histogram.begin(); b != histogram.end(); b++)
{
while (next_bin_number < b->first)
{
// next_bin_number had 0 values. Print the histogram row
// for bin #next_bin_number, showing 0 values in it.
++next_bin_number;
}
    int n_values=b->second;
    // Bin #next_bin_number, with n_values, print its histogram row
    ++next_bin_number;
}
} // end of printHistogram
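The counting pass can be sketched as a self-contained function. The name countBins is hypothetical, and the sketch assumes binSize is greater than zero; as in the answer, values below lowestValue are counted in bin 0.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: count values per bin. Bin k covers
// [lowestValue + k*binSize, lowestValue + (k+1)*binSize).
// Values below lowestValue are counted in bin 0. Assumes binSize > 0.
std::map<int, std::size_t> countBins(int lowestValue, int binSize,
                                     const std::vector<double>& v)
{
    std::map<int, std::size_t> histogram;
    for (double value : v)
    {
        int bin = value < lowestValue
                      ? 0
                      : static_cast<int>((value - lowestValue) / binSize);
        ++histogram[bin];
    }
    return histogram;
}
```

With lowestValue = 1, binSize = 20 and the values {1, 2, 3, 20, 30, 40, 50} from the question, the map holds 4, 2 and 1 for bins 0, 1 and 2.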
The loop in amountInBin declares i without initializing it (for (size_t i; ...)), so the results are at best unpredictable. Start it at zero: size_t i = 0.
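A minimal corrected sketch of the question's amountInBin, with the loop counter initialized. Passing the vector by const reference instead of by value is an additional change that only avoids a copy.

```cpp
#include <cstddef>
#include <vector>

// Counts the values in [lowestBinValue, lowestBinValue + binSize).
int amountInBin(const std::vector<double>& v, int lowestBinValue, int binSize)
{
    int count = 0;
    for (std::size_t i = 0; i < v.size(); i++) // i now starts at 0
    {
        if (v[i] >= lowestBinValue && v[i] < lowestBinValue + binSize)
            count += 1;
    }
    return count;
}
```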
const double dBLEPTable_8_BLKHAR[4096] = {
0.00000000000000000000000000000000,
-0.00000000239150987901837200000000,
-0.00000000956897738824125100000000,
-0.00000002153888378764179400000000,
-0.00000003830892270073604800000000,
-0.00000005988800189093979000000000,
-0.00000008628624126316708500000000,
-0.00000011751498329992671000000000,
-0.00000015358678995269770000000000,
-0.00000019451544774895524000000000,
-0.00000024031597312124120000000000,
-0.00000029100459975062165000000000
};
If I change the double above to float, do I incur conversion CPU cycles when I perform operations on the array contents? Or is the "conversion" sorted out at compile time?
Say, something simple like dBLEPTable_8_BLKHAR[1] + dBLEPTable_8_BLKHAR[2]?
On a related note, how many trailing decimal places should a float be able to store?
This is C++.
Any good compiler will convert the initializers during compile time. However, you also asked
am I incurring conversion cpu cycles when I perform operations on the array contents?
and that depends on the code performing the operations. If your expression combines array elements with variables of double type, then the operation will be performed at double precision, and the array elements will be promoted (converted) before the arithmetic takes place.
If you just combine array elements with variables of float type (including other array elements), then the operation is performed on floats and the language doesn't require any promotion. (If your hardware only implements double-precision operations, a conversion might still be done, but such hardware surely makes the conversions very cheap.)
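The promotion rule can be checked at compile time. This sketch only inspects the types of the expressions; what the hardware does at run time is a separate matter. The constant names are hypothetical.

```cpp
#include <type_traits>

// The type of each expression is fixed at compile time by the operand types.
constexpr bool floatPlusFloatIsFloat =
    std::is_same<decltype(1.0f + 2.0f), float>::value;
constexpr bool floatPlusDoubleIsDouble =
    std::is_same<decltype(1.0f + 2.0), double>::value;

static_assert(floatPlusFloatIsFloat, "float + float stays float");
static_assert(floatPlusDoubleIsDouble, "float + double is promoted to double");
```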
Ben Voigt's answer addresses most of your question.
But you also ask:
On a related note, how many trailing decimal places should a float be able to store
It depends on the value of the number you are trying to store. For large numbers there are no decimals at all; in fact, the format can't even give you a precise value for the integer part. For instance:
float x = BIG_NUMBER; // BIG_NUMBER is a placeholder for the value under test
float y = x + 1;
if (x == y)
{
    // The code gets here if BIG_NUMBER is very high!
}
else
{
    // The code gets here if BIG_NUMBER is not so high!
}
If BIG_NUMBER is 2^23 the next greater number would be (2^23 + 1).
If BIG_NUMBER is 2^24 the next greater number would be (2^24 + 2).
The value (2^24 + 1) can not be stored.
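A minimal sketch that checks these two cases, assuming float is an IEEE 754 single-precision type. The helper names absorbsOne and nextFloat are hypothetical; std::nextafter is the standard <cmath> function.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true when x + 1 rounds back to x in float.
bool absorbsOne(float x)
{
    volatile float y = x + 1.0f; // volatile forces the sum to be stored as a float
    return y == x;
}

// Hypothetical helper: the next representable float above x.
float nextFloat(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity());
}
```

absorbsOne(8388608.0f) (2^23) is false, while absorbsOne(16777216.0f) (2^24) is true, and nextFloat(16777216.0f) is 16777218.0f.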
For very small numbers (i.e. close to zero), you will have a lot of decimal places.
Floating point must be used with great care because its precision is limited.
http://en.wikipedia.org/wiki/Single-precision_floating-point_format
For small numbers you can experiment with the program below.
Change the exp variable to set the starting point. The program will show you what the step size is for the range and the first four valid numbers.
#include <cstdint>
#include <cstring>
#include <iostream>
using namespace std;

// Reinterpret a 32-bit pattern as a float. memcpy avoids the undefined
// behaviour of reading a float through an unsigned int pointer.
static float bitsToFloat(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main()
{
    int exp = -27; // <--- !!!!!!!!!!!
    // Change this to set starting point for the range
    // Starting point will be 2 ^ exp
    cout.precision(100);
    // Calculate step size for this range
    uint32_t e = static_cast<uint32_t>((127 - 23) + exp) << 23;
    cout << "Step size = " << fixed << bitsToFloat(e) << endl;
    cout << "First 4 numbers in range:" << endl;
    // Calculate first four valid numbers in this range
    e = static_cast<uint32_t>(127 + exp) << 23;
    for (uint32_t k = 0; k < 4; ++k)
    {
        uint32_t bits = e | k; // set the lowest mantissa bits to 0..3
        cout << hex << "0x" << bits << " = " << fixed << bitsToFloat(bits) << endl;
    }
    return 0;
}
For exp = -27 the output will be:
Step size = 0.0000000000000008881784197001252323389053344726562500000000000000000000000000000000000000000000000000
First 4 numbers in range:
0x32000000 = 0.0000000074505805969238281250000000000000000000000000000000000000000000000000000000000000000000000000
0x32000001 = 0.0000000074505814851022478251252323389053344726562500000000000000000000000000000000000000000000000000
0x32000002 = 0.0000000074505823732806675252504646778106689453125000000000000000000000000000000000000000000000000000
0x32000003 = 0.0000000074505832614590872253756970167160034179687500000000000000000000000000000000000000000000000000
const double dBLEPTable_8_BLKHAR[4096] = {
If you change the double in that line to float, then one of two things will happen:
At compile time, the compiler will convert numbers such as -0.00000000239150987901837200000000 to the floats that best represent them, and will then store that data directly into the array.
At runtime, during the program initialization (before main() is called!) the runtime that the compiler generated will fill that array with data of type float.
Either way, once you get to main() and to code that you've written, all of that data will be stored as float variables.
#include <iostream>
#include <iomanip>
using namespace std;
int a[8], e[8];
void term (int n)
{
a[0]=1;
for (int i=0; i<8; i++)
{
if (i<7)
{
a[i+1]+=(a[i]%n)*100000;
}
/* else
{
a[i+1]+=((a[i]/640)%(n/640))*100000;
}
*/
a[i]=a[i]/(n);
}
}
void sum ()
{
}
int factorial(int x, int result = 1)
{
if (x == 1)
return result;
else return factorial(x - 1, x * result);
}
int main()
{
int n=1;
for (int i=1; i<=30; i++)
{
term(n);
cout << a[0] << " "<< a[1] << " " << a[2] << " "
<< a[3] << " " << a[4] << " " << a[5]<< " "
<< " " << a[6] << " " << a[7] << endl;
n++;
for (int j=1; j<8; j++)
a[j]=0;
}
return 0;
}
What I have above is the code that I have so far.
sum() and the rest are purposely left incomplete because they are still being built.
Now, I need to compute an expansion of Euler's number.
The assignment is supposed to make you use arrays like x[n] to split a result into multiple parts, and use functions to calculate the results.
According to it,
I need to take each term of the Maclaurin expansion and calculate it.
The x in e^x = 1 + x + x^2/2! + ... is always 1,
giving us e = 1 + 1 + 1/2! + 1/3! + ... + 1/n! to calculate.
The program should calculate it term by term, in order of N.
So if N is 1, it calculates only the corresponding term,
meaning that one variable holds the result of that term, x=1.00000000~, and the other holds the running sum so far, e=2.000000~
For N=2
x=1/2!, e=previous e+x
for N=3
x=1/3!, e=previous e+x
The maximum value of N is 29.
Each time a result is calculated, all the digits after the decimal point must be stored in separate variables like x[1] x[2] x[3] until all 30~35 digits of precision are filled.
So when printing, in the case of N=2,
x[0].x[1]x[2]x[3]~
should come out as
0.50000000000000000000
where x[0] holds the integer part and x[1~3] hold the rest, 5 digits each.
Sorry if my explanation is poor, but this is what the assignment asks for.
All the arrays must be of type int; I cannot use other types.
And I can't use a bigint, as it defeats the purpose.
The other problem is that the calculation goes well up to the 7th term.
From the 8th term on, it gives me negative numbers.
For N=8
it should be 00002480158730158730158730.
Instead I get 00002 48015 -19220 -41904 30331 53015 -19220
That is obviously due to int's limit: at that point it computes
1936000000%40320
to get the remainder for a[3], which is 35200, and that is then multiplied by 100000,
giving 3520000000/40320, and 3520000000 exceeds the limit of int. Is there any way to fix this?
I cannot use doubles or bigints for this, so if anyone has a workaround, it would be appreciated.
You cannot use floating point or bigint, but what about other built-in integer types such as long long and unsigned long long? To make the width explicit you could use <stdint.h>'s int64_t and uint64_t (or <cstdint>'s std::int64_t and std::uint64_t; <cstdint> has been part of the standard since C++11).
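A minimal sketch of that idea, assuming the assignment allows a 64-bit intermediate as long as the array itself stays int. The helper name reciprocalFactorial is hypothetical. It does long division of 1 by n! in groups of 5 digits and is only valid for 1 <= n <= 16, because beyond that the remainder times 100000 no longer fits in 64 bits.

```cpp
#include <cstdint>

// Hypothetical helper: writes 1/n! into out[0..7]. out[0] is the integer part,
// out[1..7] are consecutive 5-digit groups of the fraction.
// Assumes 1 <= n <= 16 so that remainder * 100000 fits in std::int64_t.
void reciprocalFactorial(int n, int out[8])
{
    std::int64_t divisor = 1;
    for (int k = 2; k <= n; ++k)
        divisor *= k; // divisor = n!

    std::int64_t remainder = 1; // the numerator of 1/n!
    for (int i = 0; i < 8; ++i)
    {
        out[i] = static_cast<int>(remainder / divisor);
        remainder = (remainder % divisor) * 100000; // shift in the next 5 digits
    }
}
```

For n = 8 the groups come out as 0, 00002, 48015, 87301, 58730, 15873, 01587, 30158, with no negative values.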
I don't know if this is of any use, but you can find the code I wrote to calculate Euler's number here: http://41j.com/blog/2011/10/program-for-calculating-e/
A 32-bit int can only hold factorials up to 12! (13! already overflows),
so you have to store the larger factorials divided by some number, for example:
13!/10000
14!/10000
When that no longer fits, divide by 10000^2 instead, and so on.
When you divide by such a scaled factorial, the result is simply shifted by four decimal places (which I assume was the original intention).
Of course, you do not compute 1/n! directly:
with integers that would be zero, so divide 10000 by it instead.
That limits n! to at most 9999, so if you want more, add zeros to the numerator everywhere, and the result gives you the decimal digits.
I also think there can be some overflow, so you should carry into the higher digits as well.