Largest floating point variable - c++

The program takes one double as a parameter and does computations with long floating-point values, e.g. double myvar = n*0.000005478554 /298477.
The problem is that I'm not sure the real computed value ends up in myvar,
because whenever I change n, cout<<"myvar ="<<myvar; prints the same thing.
What is the biggest type in C++ that can be used for maximum accuracy? Does this code cause a buffer overflow because a double variable can't hold that much information? If yes, what can happen, and how can I detect it for later use?

double will hold values much smaller (and bigger) than 0.000005478554 /298477. Any problem that you have is almost certainly caused by a bug in your code. Show it!
Try to reduce your problem to a few lines. This kind of problem can be reproduced with something as small as
#include <iostream>
int main() {
double myvar = 7 * 0.000005478554 /298477;
std::cout << myvar;
}
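To make the point above concrete, here is a minimal sketch that wraps the expression from the question in a function and checks a result for overflow. The names scaled, overflowed, kLargest and kSmallest are made up for illustration and do not come from the original post.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// The expression from the question, wrapped in a (hypothetical) function.
double scaled(double n) {
    assert(std::isfinite(n));  // precondition chosen for this sketch
    return n * 0.000005478554 / 298477;
}

// True if a result overflowed to infinity or turned into NaN.
// A double never overflows a buffer; it saturates to +/-infinity instead.
bool overflowed(double value) {
    return !std::isfinite(value);
}

// Largest finite double and smallest positive normal double.
const double kLargest = std::numeric_limits<double>::max();
const double kSmallest = std::numeric_limits<double>::min();
```

If scaled(7.0) and scaled(8.0) print the same text, the likely cause is the output formatting or a bug elsewhere; streaming std::setprecision(17) from <iomanip> before the value shows all the stored digits.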

Related

pow() returning wrong result under specific conditions (and an unexpected fix) - Why is all this?

I have been creating a short program implementing the Sieve of Eratosthenes. In the program, I used int startingPoint, which held the value of the current prime number whose multiples were being marked non-prime, starting from its square. I used the pow() function to calculate the square of the starting point in the current loop. While doing so, I came across a weird phenomenon. When startingPoint was equal to 5, then after int i = pow(startingPoint,2) the value of i was 24 instead of 25. This phenomenon occurred only when startingPoint was equal to 5. Afterwards, I ran a few tests that led to the following code snippet:
#include <iostream>
#include <math.h>
int main()
{
int p=5;
int num1 = pow(5,2);
int num2 = pow(p,2);
float num3 = pow(p,2);
int num4 = floor(pow(p,2));
std::cout.precision(10);
std::cout<<num1<<std::endl; //25
std::cout<<num2<<std::endl; //24
std::cout<<std::fixed<<num3<<std::endl; //25.0000000000
std::cout<<num4<<std::endl; //25
}
As can be seen, if I call pow with int literals, the result is the actual square of the number. However, if I call it using int p=5, then what pow returns is one lower than the expected result. If I store the result in a float variable, it also receives the correct result.
Now, I know that pow calculates powers via approximation, and thus errors such as this may occur when converting to integer. I might just let it be like that. But what REALLY made me stop and think is what happens with num4. pow(p,2) cast to int returns 24. But floor(pow(p,2)) cast to int returns 25. So the floor function, which by the standard rounds a number down, somehow makes the returned value convert to a higher integer value.
My question in short: Just how does that happen?
(I was using gcc 5.3.0 through MinGW)
Edit: As I already stated in the question, I can accept the reason behind pow's return value being cast to a lower integer value, but what I really can't comprehend (and I haven't seen this brought up anywhere else either) is how floor fixes that. How can floor make the return value of pow actually cast to a higher integer value? Now THAT is the real question.
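Whatever the cause of the floor behaviour, the truncation itself can be sidestepped. The sketch below shows two options: plain integer multiplication, and rounding the pow result to the nearest integer instead of truncating it. The function names square and rounded_pow are hypothetical, and the range check assumes a 32-bit int.

```cpp
#include <cassert>
#include <cmath>

// Exact square using integer arithmetic only; no floating point is involved.
int square(int p) {
    assert(p >= -46340 && p <= 46340);  // keeps p * p inside a 32-bit int (assumption)
    return p * p;
}

// If pow has to be used, round to the nearest integer rather than truncating,
// so a result such as 24.999999... becomes 25 instead of 24.
long rounded_pow(int base, int exponent) {
    return std::lround(std::pow(base, exponent));
}
```

For the sieve, square(startingPoint) is enough, and it keeps floating point out of the loop entirely.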

simple loop not working for random numbers

I am a programming newbie. I needed a simple function to convert any number with decimal point X.YZ into XYZ. I did it by multiplying it by 10 enough times and using double to int conversion.
int main()
{
std::cout << "Number: " << std::endl;
double a;
// the uninitialized b was pointed out; it's not the issue
long b = 0;
std::cin >> a;
while(b!=a)
{
a*=10;
b=a;
}
std::cout << a << std::endl;
return 0;
}
This works about 90 percent of the time. For some numbers, like 132.54, the program runs forever. It processes 132.547 (which should use more memory than 132.54) the way it should.
So my question is: why does it not work 100 percent of the time for numbers within the range of long int? Why 132.54 and similar numbers?
I am using Codeblocks and GNU GCC compiler.
Many decimal floating point numbers cannot be exactly represented in binary. You only get a close approximation.
If 132.54 is represented as 132.539999999999999, you will never get a match.
Print the values in the loop, and you will see what happens.
The problem is that most decimal values cannot be represented exactly as floating-point values. So having a decimal value that only has a couple of digits doesn't guarantee that multiplying by ten enough times will produce a floating-point value with no fractional part. To see this, display the value of a each time through the loop. There's lots of noise down in the low bits.
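One way to see that noise, sketched under the assumption that a helper returning text is more convenient than printing inside the loop: format the double with enough significant digits to identify the stored value uniquely. The name all_digits is made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Formats a double with max_digits10 significant digits, which is enough
// to tell apart any two distinct double values.
std::string all_digits(double value) {
    assert(std::isfinite(value));  // precondition chosen for this sketch
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
    return out.str();
}
```

Calling all_digits(a) on each pass through the loop shows the stored value rather than the rounded six-digit form that std::cout prints by default. A value such as 0.5, which is exactly representable in binary, comes back unchanged.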
Your problem is that you never initialize b and therefore have undefined behaviour.
You should do this:
long b = 0;
Now you can go compare b with something else and get good behaviour.
Also, a float should be compared with an integral type by checking the difference against an appropriate epsilon value:
while(fabs(an_int - a_float) < eps)
Instead of reading it as a double, read it as a string and parse it. You won't run into floating precision problem that way.
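A minimal sketch of that string-based approach follows. The function name strip_decimal_point is hypothetical, and the sketch assumes the input contains only an optional sign, digits and at most one decimal point.

```cpp
#include <cassert>
#include <string>

// Removes the decimal point from text such as "132.54" and converts the
// remaining digits to a long. No floating-point value is ever created.
long strip_decimal_point(const std::string& text) {
    assert(!text.empty());  // precondition chosen for this sketch
    std::string digits;
    for (char ch : text) {
        if (ch != '.') digits += ch;
    }
    // std::stol throws std::invalid_argument if no digits can be converted
    // and std::out_of_range if the value does not fit in a long.
    return std::stol(digits);
}
```

In the original program this would replace the loop: read the input with std::string s; std::cin >> s; and print strip_decimal_point(s).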
long b;
Here you define b. From this point on, the variable contains a garbage value. The value can be absolutely random; it is just whatever happened to be in memory when the variable was allocated. After that you use this variable in a condition:
while(b!=a)
This will lead to undefined behaviour, which basically means that anything can happen, including the possibility that the app will seem to work (if you are lucky), depending on the garbage value that is in b.
To avoid this, you need to initialize b with some value, for example, long b = 0.

newtons methods implementation

A few hours ago I posted a question about Newton's method. I got answers and want to thank everybody. Now I have tried to implement the code itself:
#include <iostream>
#include <math.h>
using namespace std;
#define h powf(10,-7)
#define PI 180
float funct(float x){
return cos(x)-x;
}
float derivative (float x){
return (( funct(x+h)-funct(x-h))/(2*h));
}
int main(){
float tol=.001;
int N=3;
float p0=PI/4;
float p=0;
int i=1;
while(i<N){
p=p0-(float)funct(p0)/derivative(p0);
if ((p-p0)<tol){
cout<<p<<endl;
break;
}
i=i+1;
p0=p;
if (i>=N){
cout<<"solution not found "<<endl;
break;
}
}
return 0;
}
But it prints "solution not found". In the book, after three iterations (N=3), it finds the solution .7390851332. So my question is: how small should I make h, or how should I change my code, so that I get the correct answer?
Several things:
2 iterations is rarely going to be enough even in the best case.
You need to make sure your starting point is actually convergent.
Be aware of destructive cancellation in your derivative function. You are subtracting two numbers that are very close to each other so the difference will lose a lot of precision.
To expand on the last point, the general method is to decrease h as the value converges. But as I mentioned in your previous question, this "adjusting" h method essentially (algebraically) reduces to the Secant Method.
If you make h too small then your derivative will be inaccurate due to floating point roundoff. Your code would benefit from using double precision rather than single, especially as you are doing differentiation by finite difference. With double precision your value of h would be fine. If you stick to single precision you will need to use a larger value.
Only allowing 2 iterations seems rather restrictive. Make N larger and get your program to print out the number of iterations used.
Also, no need to use pow. Simply write 1e-7.
You're only allowing 2 iterations, which may not be enough to get close enough to the answer. If you only have 1 correct bit to start, you can expect to have at best about 4 good bits after 2 iterations. You're looking for 10 bits of accuracy (0.001 is roughly 1/2^10), so you have to allow at least 2 more iterations.
Moreover, the quadratic convergence property only holds when you're close to the solution. When you're further out, it may take longer to get close to the solution.
The optimal h for computing the numerical derivative using central differences is 0.005 * max(1,|x|) for single-precision (float), where |x| is the absolute value of the argument, x. For double precision, it's about 5e-6 * max(1,|x|).
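Putting the suggestions above together, a minimal sketch might look like the following. It uses double precision, the step size 5e-6 * max(1,|x|), a larger iteration limit, and reports the number of iterations used. Two things are assumptions of this sketch rather than part of the answers: the convergence test uses std::fabs(p - p0), and the starting point is meant to be pi/4 in radians. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

double f(double x) { return std::cos(x) - x; }

// Central difference with a step scaled to |x| (double-precision rule of thumb).
double derivative(double x) {
    const double h = 5e-6 * std::max(1.0, std::fabs(x));
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

// Returns the approximate root and stores the number of steps taken in
// iterations_used. Returns NaN if max_iter steps were not enough.
double newton(double p0, double tol, int max_iter, int& iterations_used) {
    assert(max_iter > 0);
    for (int i = 1; i <= max_iter; ++i) {
        const double p = p0 - f(p0) / derivative(p0);
        iterations_used = i;
        if (std::fabs(p - p0) < tol) return p;  // absolute change below tolerance
        p0 = p;
    }
    return std::nan("");
}
```

A call such as newton(std::atan(1.0), 1e-10, 20, n) starts from pi/4, since atan(1) equals pi/4, and leaves the iteration count in n so it can be printed.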

Converting variable type (or workaround)

The class below is supposed to represent a musical note. I want to be able to store the length of the note (e.g. 1/2 note, 1/4 note, 3/8 note, etc.) using only integers. However, I also want to be able to store the length using a floating point number for the rare case that I deal with notes of irregular lengths.
class note{
string tone;
int length_numerator;
int length_denominator;
public:
set_length(int numerator, int denominator){
length_numerator=numerator;
length_denominator=denominator;
}
set_length(double d){
length_numerator=d; // unfortunately truncates everything past decimal point
length_denominator=1;
}
}
The reason it is important for me to be able to use integers rather than doubles to store the length is that in my past experience with floating point numbers, sometimes the values are unexpectedly inaccurate. For example, a number that is supposed to be 16 occasionally gets mysteriously stored as 16.0000000001 or 15.99999999999 (usually after enduring some operations) with floating point, and this could cause problems when testing for equality (because 16!=15.99999999999).
Is it possible to convert a variable from int to double (the variable, not just its value)? If not, then what else can I do to be able to store the note's length using either an integer or a double, depending on what I need the type to be?
If your only problem is comparing floats for equality, then I'd say to use floats, but read "Comparing floating point numbers" / Bruce Dawson first. It's not long, and it explains how to compare two floating numbers correctly (by checking the absolute and relative difference).
When you have more time, you should also look at "What Every Computer Scientist Should Know About Floating Point Arithmetic" to understand why 16 occasionally gets "mysteriously" stored as 16.0000000001 or 15.99999999999.
Attempts to use integers for rational numbers (or for fixed point arithmetic) are rarely as simple as they look.
I see several possible solutions: the first is just to use double. It's
true that extended computations may result in inaccurate results, but in
this case, your divisors are normally powers of 2, which will give exact
results (at least on all of the machines I've seen); you only risk
running into problems when dividing by some unusual value (which is the
case where you'll have to use double anyway).
You could also scale the results, e.g. representing the notes as
multiples of, say 64th notes. This will mean that most values will be
small integers, which are guaranteed exact in double (again, at least
in the usual representations). A number that is supposed to be 16 does
not get stored as 16.000000001 or 15.99999999 (but a number that is
supposed to be .16 might get stored as .1600000001 or .1599999999).
Before the appearance of long long, decimal arithmetic classes often
used double as a 52 bit integral type, ensuring at each step that the
actual value was exactly an integer. (Only division might cause a problem.)
Or you could use some sort of class representing rational numbers.
(Boost has one, for example, and I'm sure there are others.) This would
allow any strange values (5th notes, anyone?) to remain exact; it could
also be advantageous for human readable output, e.g. you could test the
denominator, and then output something like "3 quarter notes", or the
like. Even something like "a 3/4 note" would be more readable to a
musician than "a .75 note".
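As a rough illustration of the rational-number option, here is a minimal sketch of such a class. It is not Boost's implementation; the names Fraction and gcd_of are hypothetical, and overflow of long is not handled.

```cpp
#include <cassert>

// Greatest common divisor by Euclid's algorithm; the sign of a is ignored.
// Expects b >= 0, which the Fraction constructor guarantees.
long gcd_of(long a, long b) {
    if (a < 0) a = -a;
    while (b != 0) {
        const long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// A note length stored as an exact fraction, always in lowest terms
// and with a positive denominator.
struct Fraction {
    long num;
    long den;

    Fraction(long n, long d) : num(n), den(d) {
        assert(d != 0);
        if (den < 0) { num = -num; den = -den; }
        const long g = gcd_of(num, den);  // at least 1 because den > 0
        num /= g;
        den /= g;
    }
};

// Exact comparison is safe because both sides are normalized.
bool operator==(const Fraction& a, const Fraction& b) {
    return a.num == b.num && a.den == b.den;
}

Fraction operator+(const Fraction& a, const Fraction& b) {
    return Fraction(a.num * b.den + b.num * a.den, a.den * b.den);
}
```

With this, a quarter note tied to an eighth note compares equal to Fraction(3, 8) exactly, and the numerator and denominator are available for output such as "a 3/8 note".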
It is not possible to convert a variable from int to double, but it is possible to convert a value from int to double. I'm not completely certain which you are asking for, but maybe you are looking for a union:
union DoubleOrInt
{
double d;
int i;
};
DoubleOrInt length_numerator;
DoubleOrInt length_denominator;
Then you can write
set_length(int numerator, int denominator){
length_numerator.i=numerator;
length_denominator.i=denominator;
}
set_length(double d){
length_numerator.d=d;
length_denominator.d=1.0;
}
The problem with this approach is that you absolutely must keep track of whether you are currently storing ints or doubles in your unions. Bad things will happen if you store an int and then try to access it as a double. Preferably you would do this inside your class.
This is normal behavior for floating point variables. They are always rounded, and the last digits may change value depending on the operations you do. I suggest reading up on floating point somewhere (e.g. http://floating-point-gui.de/), especially about comparing fp values.
I normally subtract them, take the absolute value and compare this against an epsilon, e.g. if (abs(x-y)
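A minimal sketch of such a comparison, combining an absolute and a relative tolerance as the earlier answer recommends. The function name nearly_equal and the default tolerances are assumptions picked for illustration; suitable values depend on the application.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True if x and y differ by at most abs_eps, or by at most rel_eps
// relative to the larger of the two magnitudes.
bool nearly_equal(double x, double y, double abs_eps = 1e-12, double rel_eps = 1e-9) {
    assert(abs_eps >= 0.0 && rel_eps >= 0.0);
    const double diff = std::fabs(x - y);
    if (diff <= abs_eps) return true;  // covers values very close to zero
    return diff <= rel_eps * std::max(std::fabs(x), std::fabs(y));
}
```

With these defaults, 16.0 and 15.99999999999 compare as equal, while 16.0 and 16.1 do not.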
Given that you have a set_length(double d), my guess is that you actually need doubles. Note that the conversion from a double to a fraction of integers is fragile and complex, and will most probably not solve your equality problems (is 0.24999999 equal to 1/4?). It would be better for you to choose either to always use fractions or to always use doubles. Then just learn how to use them. I must say that for music it makes sense to use fractions, since that is how notes are described.
If it were me, I would just use an enum. To turn something into a note would be pretty simple using this system also. Here's a way you could do it:
#include <iostream>
class Note {
public:
enum Type {
// In this case, 16 represents a whole note, but it could be larger
// if demisemiquavers were used or something.
Semiquaver = 1,
Quaver = 2,
Crotchet = 4,
Minim = 8,
Semibreve = 16
};
static float GetNoteLength(const Type &note)
{ return static_cast<float>(note)/16.0f; }
static float TieNotes(const Type &note1, const Type &note2)
{ return GetNoteLength(note1)+GetNoteLength(note2); }
};
int main()
{
// Make a semiquaver
Note::Type sq = Note::Semiquaver;
// Make a quaver
Note::Type q = Note::Quaver;
// Dot it with the semiquaver from before
float dottedQuaver = Note::TieNotes(sq, q);
std::cout << "Semiquaver is equivalent to: " << Note::GetNoteLength(sq) << " beats\n";
std::cout << "Dotted quaver is equivalent to: " << dottedQuaver << " beats\n";
return 0;
}
Those 'irregular' notes you speak of can be obtained using TieNotes.

Unexpected loss of precision when dividing doubles

I have a function getSlope which takes 4 doubles as parameters and returns another double calculated from these parameters in the following way:
double QSweep::getSlope(double a, double b, double c, double d){
double slope;
slope=(d-b)/(c-a);
return slope;
}
The problem is that when calling this function with arguments for example:
getSlope(2.71156, -1.64161, 2.70413, -1.72219);
the returned result is:
10.8557
and this is not a good result for my computations.
I have calculated the slope using Mathematica and the result for the slope for the same parameters is:
10.8452
or with more digits for precision:
10.845222072678331.
The result returned by my program is not good in my further computations.
Moreover, I do not understand how the program returns 10.8557 starting from 10.845222072678331 (supposing that this is the approximate result of the division).
How can I get the good result for my division?
thank you in advance,
madalina
I print the result using the command line:
std::cout<<slope<<endl;
It may be that my parameters are not good, as I read them from another program (which computes a graph). After reading these parameters from this graph I just displayed them to see their values, but maybe the displayed vectors do not have the same internal precision as the calculated values. I do not know; it is really strange. Some numerical errors appear.
When the graph from which I am reading my parameters is computed, some numerical libraries written in C++ (with templates) are used. No OpenGL is used for this computation.
thank you,
madalina
I've tried with float instead of double and I get 10.845110 as a result. It still looks better than madalina's result.
EDIT:
I think I know why you get these results. If you get the a, b, c and d parameters from somewhere else and print them, the output shows rounded values. Then if you put those into Mathematica (or calc ;) ) it will give you a different result.
I tried changing a little bit one of your parameters. When I did:
double c = 2.7041304;
I get 10.845806. I only added 0.0000004 to c!
So I think your "errors" aren't errors. Print a, b, c and d with better precision and then put them to Mathematica.
The following code:
#include <iostream>
using namespace std;
double getSlope(double a, double b, double c, double d){
double slope;
slope=(d-b)/(c-a);
return slope;
}
int main( ) {
double s = getSlope(2.71156, -1.64161, 2.70413, -1.72219);
cout << s << endl;
}
gives a result of 10.8452 with g++. How are you printing out the result in your code?
Could it be that you use DirectX or OpenGL in your project? If so they can turn off double precision and you will get strange results.
You can check your precision settings with
std::sqrt(x) * std::sqrt(x)
The result has to be pretty close to x.
I met this problem a long time ago and spent a month checking all the formulas. But then I found
D3DCREATE_FPU_PRESERVE
The problem here is that (c-a) is small, so the rounding errors inherent in floating point operations are magnified in this example. A general solution is to rework your equation so that you're not dividing by a small number; I'm not sure how you would do it here, though.
EDIT:
Neil is right in his comment to this question, I computed the answer in VB using Doubles and got the same answer as mathematica.
The results you are getting are consistent with 32bit arithmetic. Without knowing more about your environment, it's not possible to advise what to do.
Assuming the code shown is what's running, ie you're not converting anything to strings or floats, then there isn't a fix within C++. It's outside of the code you've shown, and depends on the environment.
As Patrick McDonald and Treb both brought up the accuracy of your inputs and the error on a-c, I thought I'd take a look at that. One technique for looking at rounding errors is interval arithmetic, which makes explicit the upper and lower bounds that a value represents (they are implicit in floating point numbers, and are fixed to the precision of the representation). By treating each value as an upper and lower bound, and by extending the bounds by the error in the representation (approx x * 2 ^ -53 for a double value x), you get a result which gives the lower and upper bounds on the accuracy of a value, taking into account worst case precision errors.
For example, if you have a value in the range [1.0, 2.0] and subtract from it a value in the range [0.0, 1.0], then the result must lie in the range [below(0.0),above(2.0)] as the minimum result is 1.0-1.0 and the maximum is 2.0-0.0. below and above are equivalent to floor and ceiling, but for the next representable value rather than for integers.
Using intervals which represent worst-case double rounding:
getSlope(
a = [2.7115599999999995262:2.7115600000000004144],
b = [-1.6416099999999997916:-1.6416100000000002357],
c = [2.7041299999999997006:2.7041300000000005888],
d = [-1.7221899999999998876:-1.7221900000000003317])
(d-b) = [-0.080580000000000526206:-0.080579999999999665783]
(c-a) = [-0.0074300000000007129439:-0.0074299999999989383218]
to double precision [10.845222072677243474:10.845222072679954195]
So although c-a is small compared to c or a, it is still large compared to double rounding, so even with the worst imaginable double precision rounding you could trust that value to be precise to 12 figures - 10.8452220727. You've lost a few figures off double precision, but you're still working to more than your input's significance.
But if the inputs were only accurate to the number of significant figures shown, then rather than being the double value 2.71156 +/- eps, the input range would be [2.711555,2.711565], so you get the result:
getSlope(
a = [2.711555:2.711565],
b = [-1.641615:-1.641605],
c = [2.704125:2.704135],
d = [-1.722195:-1.722185])
(d-b) = [-0.08059:-0.08057]
(c-a) = [-0.00744:-0.00742]
to specified accuracy [10.82930108:10.86118598]
which is a much wider range.
But you would have to go out of your way to track the accuracy in the calculations, and the rounding errors inherent in floating point are not significant in this example - it's precise to 12 figures with the worst case double precision rounding.
On the other hand, if your inputs are only known to 6 figures, it doesn't actually matter whether you get 10.8557 or 10.8452. Both are within [10.82930108:10.86118598].
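The second calculation above can be reproduced with a very small interval type. This is only a sketch: the names Interval, around and slope_bounds are made up, and the bounds are not rounded outward, so it models the uncertainty of the inputs but not the rounding of the double operations themselves.

```cpp
#include <algorithm>
#include <cassert>

// A closed interval [lo, hi] of possible values.
struct Interval {
    double lo;
    double hi;
    Interval(double l, double h) : lo(l), hi(h) { assert(l <= h); }
};

// The interval v +/- half_width.
Interval around(double v, double half_width) {
    return Interval(v - half_width, v + half_width);
}

// Smallest result is x.lo - y.hi, largest is x.hi - y.lo.
Interval operator-(const Interval& x, const Interval& y) {
    return Interval(x.lo - y.hi, x.hi - y.lo);
}

// Division for a divisor interval that does not contain zero:
// the extremes are among the four endpoint quotients.
Interval operator/(const Interval& x, const Interval& y) {
    assert(y.lo > 0.0 || y.hi < 0.0);
    const double q[] = {x.lo / y.lo, x.lo / y.hi, x.hi / y.lo, x.hi / y.hi};
    return Interval(*std::min_element(q, q + 4), *std::max_element(q, q + 4));
}

// Bounds on (d-b)/(c-a) when every input is only known to +/- half_width.
Interval slope_bounds(double a, double b, double c, double d, double half_width) {
    return (around(d, half_width) - around(b, half_width)) /
           (around(c, half_width) - around(a, half_width));
}
```

Calling slope_bounds(2.71156, -1.64161, 2.70413, -1.72219, 0.000005) gives bounds of roughly 10.829 and 10.861, so both 10.8452 and 10.8557 fall inside the range.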
Better print out the arguments, too. When you are, as I guess, transferring parameters in decimal notation, you will lose precision for each and every one of them. The problem is that 1/5 is an infinite series in binary, so e.g. 0.2 becomes 0.001100110011.... Also, decimals are chopped when converting a binary float to a textual representation in decimal.
Next to that, sometimes the compiler chooses speed over precision. This should be a documented compiler switch.
Patrick seems to be right about (c-a) being the main cause:
d-b = -1,72219 - (-1,64161) = -0,08058
c-a = 2,70413 - 2,71156 = -0,00743
S = (d-b)/(c-a)= -0,08058 / -0,00743 = 10,845222
You start out with six digits of precision; through the subtraction you get a reduction to three and four digits. My best guess is that you lose additional precision because the number -0,00743 cannot be represented exactly in a double. Try using intermediate variables with a bigger precision, like this:
double QSweep::getSlope(double a, double b, double c, double d)
{
double slope;
long double temp1, temp2;
temp1 = (d-b);
temp2 = (c-a);
slope = temp1/temp2;
return slope;
}
While the academic discussion going on is great for learning about the limitations of programming languages, you may find the simplest solution to the problem is a data structure for arbitrary precision arithmetic.
This will have some overhead, but you should be able to find something with fairly guaranteeable accuracy.