Newton's method implementation - C++

I posted a question about Newton's method a few hours ago; I got answers and want to thank everybody. Now I have tried to implement the code itself:
    #include <iostream>
    #include <math.h>
    using namespace std;

    #define h powf(10,-7)
    #define PI 180

    float funct(float x) {
        return cos(x) - x;
    }

    float derivative(float x) {
        return ((funct(x + h) - funct(x - h)) / (2 * h));
    }

    int main() {
        float tol = .001;
        int N = 3;
        float p0 = PI / 4;
        float p = 0;
        int i = 1;
        while (i < N) {
            p = p0 - (float)funct(p0) / derivative(p0);
            if ((p - p0) < tol) {
                cout << p << endl;
                break;
            }
            i = i + 1;
            p0 = p;
            if (i >= N) {
                cout << "solution not found " << endl;
                break;
            }
        }
        return 0;
    }
But it prints the output "solution not found". In the book, after three iterations with N=3, it finds the solution .7390851332. So my question is: how small should I make h, or how should I change my code, to get the correct answer?

Several things:
2 iterations is rarely going to be enough even in the best case.
You need to make sure your starting point is actually convergent.
Be aware of destructive cancellation in your derivative function. You are subtracting two numbers that are very close to each other so the difference will lose a lot of precision.
To expand on the last point, the general method is to decrease h as the value converges. But as I mentioned in your previous question, this "adjusting" h method essentially (algebraically) reduces to the Secant Method.
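For illustration, here is a minimal sketch of that Secant Method in C++ (the function and tolerance come from the question; the secant helper name and iteration cap are my own):

    #include <cmath>
    #include <cstdio>

    double f(double x) { return std::cos(x) - x; }

    // Each secant step approximates the derivative from the last two
    // iterates, which is what the shrinking-h trick reduces to algebraically.
    double secant(double x0, double x1, double tol, int maxIter) {
        for (int i = 0; i < maxIter; ++i) {
            double f1 = f(x1);
            double x2 = x1 - f1 * (x1 - x0) / (f1 - f(x0));
            if (std::fabs(x2 - x1) < tol)
                return x2;
            x0 = x1;
            x1 = x2;
        }
        std::printf("no convergence\n");
        return x1;
    }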

If you make h too small then your derivative will be inaccurate due to floating-point roundoff. Your code would benefit from using double precision rather than single, especially as you are doing differentiation by finite differences. With double precision your value of h would be fine. If you stick to single precision you will need to use a larger value.
Only allowing 2 iterations seems rather restrictive. Make N larger and get your program to print out the number of iterations used.
Also, no need to use pow. Simply write 1e-7.
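Putting these points together, a minimal double-precision sketch (assuming the book's starting point of pi/4; M_PI is a POSIX extension, not standard C++):

    #include <cmath>
    #include <iostream>

    const double h = 1e-7;  // fine at double precision

    double funct(double x) { return std::cos(x) - x; }

    double derivative(double x) {
        return (funct(x + h) - funct(x - h)) / (2 * h);
    }

    int main() {
        double tol = 0.001;
        double p0 = M_PI / 4;  // POSIX constant for pi
        int N = 20;            // allow plenty of iterations
        for (int i = 1; i <= N; ++i) {
            double p = p0 - funct(p0) / derivative(p0);
            if (std::fabs(p - p0) < tol) {
                std::cout << p << " found after " << i << " iterations\n";
                return 0;
            }
            p0 = p;
        }
        std::cout << "solution not found\n";
        return 0;
    }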

You're only allowing 2 iterations, which may not be enough to get close enough to the answer. If you only have 1 correct bit to start, you can expect to have at best about 4 good bits after 2 iterations. You're looking for 10 bits of accuracy (0.001 is roughly 1/2^10), so you have to allow at least 2 more iterations.
Moreover, the quadratic convergence property only holds when you're close to the solution. When you're further out, it may take longer to get close to the solution.
The optimal h for computing the numerical derivative using central differences is 0.005 * max(1,|x|) for single-precision (float), where |x| is the absolute value of the argument, x. For double precision, it's about 5e-6 * max(1,|x|).
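As a sketch, that rule of thumb might look like this (a generic helper of my own, not code from the answer):

    #include <algorithm>
    #include <cmath>

    // Central difference with the step size suggested above; h scales
    // with |x| so the step stays meaningful for large arguments.
    double derivative(double (*f)(double), double x) {
        double h = 5e-6 * std::max(1.0, std::fabs(x));  // double precision
        return (f(x + h) - f(x - h)) / (2 * h);
    }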

Related

Why does pow() return 999... in C++

While running the following lines of code:
    int i, a;
    for (i = 0; i <= 4; i++)
    {
        a = pow(10, i);
        printf("%d\t", a);
    }
I was surprised to see the output: it comes out to be 1 10 99 1000 9999 instead of 1 10 100 1000 10000.
What could be the possible reason?
Note: if you think it's floating-point inaccuracy, observe that in the above for loop, when i = 2, the value stored in variable a is 99.
But if you write instead

    a = pow(10, 2);

now the value of a comes out to be 100. How is that possible?
You have set a to be an int. pow() generates a floating point number that in SOME cases may be just a hair less than 100 or 10000 (as we see here).
Then you stuff that into the integer, which TRUNCATES to an integer. So you lose that fractional part. Oops. If you really need an integer result, round may be a better way to do that operation.
Be careful even there, as for large enough powers, the error may actually be large enough to still cause a failure, giving you something you don't expect. Remember that floating point numbers only carry so much precision.
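A sketch of that fix applied to the question's loop; I've used std::lround, the variant of round that returns an integer type directly:

    #include <cmath>
    #include <cstdio>

    int main() {
        for (int i = 0; i <= 4; ++i) {
            // round to the nearest integer instead of truncating
            long a = std::lround(std::pow(10, i));
            std::printf("%ld\t", a);
        }
        return 0;
    }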
The function pow() returns a double. You're assigning it to variable a, of type int. Doing that doesn't "round off" the floating point value; it truncates it. So pow() is returning something like 99.99999... for 10^2, and then you're just throwing away the .9999... part. Better to say a = round(pow(10, i)).
This is to do with floating point inaccuracy. Although you are passing in ints they are being implicitly converted to a floating point type since the pow function is only defined for floating point parameters.
Mathematically, the integer power of an integer is an integer.
In a good quality pow() routine this specific calculation should NOT produce any round-off errors. I ran your code on Eclipse/Microsoft C and got the following output:
1 10 100 1000 10000
This test does NOT indicate whether Microsoft is using floats and rounding or whether they are detecting the type of your numbers and choosing the appropriate method.
So, I ran the following code:
    #include <stdio.h>
    #include <math.h>

    int main()
    {
        double i, a;
        for (i = 0.0; i <= 4.0; i++)
        {
            a = pow(10, i);
            printf("%lf\t", a);
        }
        return 0;
    }
And got the following output:
1.000000 10.000000 100.000000 1000.000000 10000.000000
No one has spelt out how to actually do it correctly - instead of the pow function, just keep a variable that tracks the current power:

    int i, a;
    for (i = 0, a = 1; i <= 4; i++, a *= 10) {
        printf("%d\t", a);
    }
This continued multiplication by ten is guaranteed to give you the correct answer, and it is quite OK (and much better than pow, even if pow were giving correct results) for tasks like converting decimal strings into integers.
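If you need the power as a standalone value rather than inside such a loop, the same idea works as a small helper (ipow10 is a hypothetical name):

    // Exact integer powers of ten by repeated multiplication;
    // exact until the result overflows int.
    int ipow10(int exponent) {
        int result = 1;
        for (int i = 0; i < exponent; ++i)
            result *= 10;
        return result;
    }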

Truncate a double to three decimal places in C++

I want to truncate a floating-point number to 3 decimal digits. Example:
input : x = 0.363954;
output: 0.364
I used

    double myCeil(float v, int p)
    {
        return int(v * pow(float(10), p)) / pow(float(10), p);
    }

but the output was 0.3630001.
I tried to use trunc from <cmath> but it doesn't exist.
Floating-point math typically uses a binary representation; as a result, there are decimal values that cannot be exactly represented as floating-point values. Trying to fiddle with internal precisions runs into exactly this problem. But mostly when someone is trying to do this they're really trying to display a value using a particular precision, and that's simple:
double x = 0.363954;
std::cout.precision(3);
std::cout << x << '\n';
The function you're looking for is std::ceil, not std::trunc:
    double myCeil(double v, int p)
    {
        return std::ceil(v * std::pow(10, p)) / std::pow(10, p);
    }
Substitute std::floor or std::round for a myFloor or myRound as desired. (Note that std::round appears in C++11, which you will have to enable if it isn't already.)
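For example, the truncating version the question asked for would be:

    double myFloor(double v, int p)
    {
        // note: for negative v this floors rather than truncating toward zero
        return std::floor(v * std::pow(10, p)) / std::pow(10, p);
    }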
It is just impossible to get 0.364 exactly. There is no way you can store the number 0.364 (364/1000) exactly as a float, in the same way you would need an infinite number of decimals to write 1/3 as 0.3333333333...
You did it correctly, except that you probably want to use std::round(), which rounds to the closest number, instead of int(), which truncates.
Comparing floating point numbers is tricky business. Typically the best you can do is check that the numbers are sufficiently close to each other.
Are you doing your rounding for comparison purposes? In such case, it seems you are happy with 3 decimals (this depends on each problem in question...), in such case why not just
    bool are_equal_to_three_decimals(double a, double b)
    {
        return std::abs(a - b) < 0.001;
    }
Note that the results obtained by comparing the rounded numbers and by the function I suggested are not equivalent! For instance, 0.3634999 and 0.3635001 differ by only 0.0000002, so the function above calls them equal, yet they round to 0.363 and 0.364 respectively.
This is an old post, but what you are asking for is decimal precision with binary mathematics. The conversion between the two is what gives you this apparent discrepancy.
The main point, I think, which you are making is to do with identity, so that you can use equality/inequality comparisons between two numbers.
Because of the fact that there is a discrepancy between what we humans use (decimal) and what computers use (binary), we have four choices.
1. We use a decimal library. This is computationally costly, because we are using maths which are different to how computers work. There are several, and one day they may be adopted into std. See e.g. "ISO/IEC JTC1 SC22 WG21 N2849".
2. We learn to do our maths in binary. This is mentally costly, because it's not how we do our maths normally.
3. We change our algorithm to include an identity test.
4. We change our algorithm to use a difference test.
Option 3 is where we make a decision as to just how close one number needs to be to another number to be considered 'the same number'.
One simple way of doing this (as given by @SirGuy above) is to use ceiling or floor as a test - this is good, because it allows us to choose the number of significant digits we are interested in. It is domain specific, and the solution he gives might be a bit more optimal using a power of 2 rather than of 10.
You definitely would only want to do the calculation when using equality/inequality tests.
So now, our equality test would be (for 10 binary places, nearly 3 decimal places):
// Normal identity test for floats.
// Quick but fails eg 1.0000023 == 1.0000024
return (a == b);
Becomes (with 2^10 = 1024):
// Modified identity test for floats.
// Works with 1.0000023 == 1.0000024
return (std::floor(a * 1024) == std::floor(b * 1024));
But this isn't great.
I would go for option 4.
Say you consider any difference less than 0.001 to be insignificant, such that 1.00012 = 1.00011.
This does an additional subtraction and a sign removal, which is far cheaper (and more reliable) than bit shifts.
// Modified equality test for floats.
// Returns true if the delta is less than 1/10000.
// Works with 1.0000023 == 1.0000024
return abs(a - b) < 0.0001;
This boils down to your comment about calculating circularity: I am suggesting that you calculate the delta (difference) between two circles rather than testing for equivalence. But that isn't exactly what you asked in the question...

C++ calculating more precisely than double or long double

I'm teaching myself C++, and one of the practice questions asks me to write code that can calculate PI to >30 digits. I learned that double / long double are both 16 digits precise on my computer.
I think the lesson of this question is to be able to calculate with precision beyond what is natively available. So how do I do this? Is it possible?
My code for calculating pi right now is:
#include "stdafx.h"
#include <iostream>
#include <math.h>
#include <iomanip>
using namespace std;
int main(){
double pi;
pi = 4*atan(1.0);
cout<<setprecision(30)<<pi;
return 0;
}
Output is to 16 digits and pi to 30 digits is listed below for comparison.
3.1415926535897931
3.141592653589793238462643383279
Any suggestions for increasing precision or is this something that won't matter ever? Alternatively if there is another lesson you think I should be learning here feel free to offer it. Thank you!
You will need to perform the calculation using some other method than floating point. There are libraries for doing "long math" such as GMP.
If that's not what you're looking for, you can also write code to do this yourself. The simplest way is to just use a string and store a digit per character. Do the math just like you would do it by hand on paper. Adding numbers together is relatively easy, and so is subtracting. Doing multiplication and division is a little harder.
For non-integer numbers, you'll need to make sure you line up the decimal point for add/subtract...
It's a good learning experience to write that, but don't expect it to be something you knock up in half an hour without much thought [addition and subtraction, perhaps!].
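As a taste of what's involved, here is a minimal sketch of the easy part: adding two non-negative decimal integers stored as strings, digit by digit with carry, just as on paper (the helper name is my own):

    #include <algorithm>
    #include <string>

    std::string addDecimalStrings(const std::string& a, const std::string& b) {
        std::string result;
        int carry = 0;
        int i = (int)a.size() - 1, j = (int)b.size() - 1;
        while (i >= 0 || j >= 0 || carry) {
            int sum = carry;
            if (i >= 0) sum += a[i--] - '0';  // digit character to value
            if (j >= 0) sum += b[j--] - '0';
            result.push_back('0' + sum % 10);
            carry = sum / 10;
        }
        std::reverse(result.begin(), result.end());
        return result;
    }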
You can use quad math: the builtin type __float128 and the q/Q literal suffixes in GCC/Clang.
    #include <cstdio>
    #include <quadmath.h>

    // build with: g++ file.cpp -lquadmath
    // (the q literal suffix may also need -fext-numeric-literals)
    int main() {
        __float128 x = strtoflt128("1234567891234567891234567891234566", nullptr);
        __float128 y = 1.0q;
        // printf cannot format __float128 directly; use quadmath_snprintf
        char buf[64];
        quadmath_snprintf(buf, sizeof buf, "%.0Qf", x + y);
        printf("%s\n", buf);
        return 0;
    }

Is floating-point == ever OK?

Just today I came across third-party software we're using and in their sample code there was something along these lines:
    // Defined in somewhere.h
    static const double BAR = 3.14;

    // Code elsewhere.cpp
    void foo(double d)
    {
        if (d == BAR)
            ...
    }
I'm aware of the problem with floating-points and their representation, but it made me wonder if there are cases where float == float would be fine? I'm not asking for when it could work, but when it makes sense and works.
Also, what about a call like foo(BAR)? Will this always compare equal as they both use the same static const BAR?
Yes, you are guaranteed that whole numbers, including 0.0, compare with ==.
Of course you have to be a little careful with how you got the whole number in the first place; assignment is safe, but the result of any calculation is suspect.
PS: there is a set of real numbers that do have a perfect reproduction as a float (think of 1/2, 1/4, 1/8, etc.), but you probably don't know in advance that you have one of these.
Just to clarify: it is guaranteed by IEEE 754 that float representations of integers (whole numbers) within range are exact.
    float a = 1.0;
    float b = 1.0;
    a == b  // true

But you have to be careful how you get the whole numbers:

    float a = 1.0 / 3.0;
    a * 3.0 == 1.0  // not true !!
There are two ways to answer this question:
Are there cases where float == float gives the correct result?
Are there cases where float == float is acceptable coding?
The answer to (1) is: Yes, sometimes. But it's going to be fragile, which leads to the answer to (2): No. Don't do that. You're begging for bizarre bugs in the future.
As for a call of the form foo(BAR): In that particular case the comparison will return true, but when you are writing foo you don't know (and shouldn't depend on) how it is called. For example, calling foo(BAR) will be fine but foo(BAR * 2.0 / 2.0) (or even maybe foo(BAR * 1.0) depending on how much the compiler optimises things away) will break. You shouldn't be relying on the caller not performing any arithmetic!
Long story short: even though a == b will work in some cases, you really shouldn't rely on it. Even if you can guarantee the calling semantics today, maybe you won't be able to guarantee them next week, so save yourself some pain and don't use ==.
To my mind, float == float is never* OK because it's pretty much unmaintainable.
*For small values of never.
The other answers explain quite well why using == for floating point numbers is dangerous. I just found one example that illustrates these dangers quite well, I believe.
On the x86 platform, you can get weird floating point results for some calculations, which are not due to rounding problems inherent to the calculations you perform. This simple C program will sometimes print "error":
    #include <stdio.h>

    void test(double x, double y)
    {
        const double y2 = x + 1.0;
        if (y != y2)
            printf("error\n");
    }

    int main()
    {
        const double x = .012;
        const double y = x + 1.0;
        test(x, y);
        return 0;
    }
The program essentially just calculates
x = 0.012 + 1.0;
y = 0.012 + 1.0;
(only spread across two functions and with intermediate variables), but the comparison can still yield false!
The reason is that on the x86 platform, programs usually use the x87 FPU for floating point calculations. The x87 internally calculates with a higher precision than regular double, so double values need to be rounded when they are stored in memory. That means that a roundtrip x87 -> RAM -> x87 loses precision, and thus calculation results differ depending on whether intermediate results passed via RAM or whether they all stayed in FPU registers. This is of course a compiler decision, so the bug only manifests for certain compilers and optimization settings :-(.
For details see the GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
Rather scary...
Additional note:
Bugs of this kind will generally be quite tricky to debug, because the different values become the same once they hit RAM.
So if for example you extend the above program to actually print out the bit patterns of y and y2 right after comparing them, you will get the exact same value. To print the value, it has to be loaded into RAM to be passed to some print function like printf, and that will make the difference disappear...
I'll provide a more-or-less real example of legitimate, meaningful and useful testing for float equality.
    #include <stdio.h>
    #include <math.h>

    /* let's try to numerically solve a simple equation F(x)=0 */
    double F(double x) {
        return 2 * cos(x) - pow(1.2, x);
    }

    /* a well-known, simple & slow but extremely smart method to do this */
    double bisection(double range_start, double range_end) {
        double a = range_start;
        double d = range_end - range_start;
        int counter = 0;
        while (a != a + d)  // <-- WHOA!!
        {
            d /= 2.0;
            if (F(a) * F(a + d) > 0)  /* test for same sign */
                a = a + d;
            ++counter;
        }
        printf("%d iterations done\n", counter);
        return a;
    }

    int main() {
        /* we must be sure that the root can be found in [0.0, 2.0] */
        printf("F(0.0)=%.17f, F(2.0)=%.17f\n", F(0.0), F(2.0));
        double x = bisection(0.0, 2.0);
        printf("the root is near %.17f, F(%.17f)=%.17f\n", x, x, F(x));
        return 0;
    }
I'd rather not explain the bisection method itself, but emphasize the stopping condition. It has exactly the form under discussion: (a == a+d), where both sides are floats: a is our current approximation of the equation's root, and d is our current precision. Given the precondition of the algorithm (that there must be a root between range_start and range_end), we guarantee on every iteration that the root stays between a and a+d, while d is halved every step, shrinking the bounds.
And then, after a number of iterations, d becomes so small that during addition with a it gets rounded to zero! That is, a+d turns out to be closer to a than to any other float, and so the FPU rounds it to the closest representable value: to a itself. A calculation on a hypothetical machine can illustrate this; let it have a 4-digit decimal mantissa and some large exponent range. Then what result should the machine give for 2.131e+02 + 7.000e-03? The exact answer is 213.107, but our machine can't represent such a number; it has to round it. And 213.107 is much closer to 213.1 than to 213.2, so the rounded result becomes 2.131e+02: the little summand vanished, rounded to zero. Exactly the same is guaranteed to happen at some iteration of our algorithm, and at that point we can't continue anymore. We have found the root to the maximum possible precision.
Addendum
No, you can't just use "some small number" in the stopping condition. For any choice of the number, some inputs will deem your choice too large, causing loss of precision, and there will be inputs which will deem your choice too small, causing excess iterations or even entering an infinite loop. Imagine that our F can change, and suddenly the solutions can be both huge (1.0042e+50) and tiny (1.0098e-70). A detailed discussion follows.
Calculus has no notion of a "small number": for any real number, you can find infinitely many even smaller ones. The problem is, among those "even smaller" ones might be a root of our equation. Even worse, some equations will have distinct roots (e.g. 2.51e-8 and 1.38e-8) — both of which will get approximated by the same answer if our stopping condition looks like d < 1e-6. Whichever "small number" you choose, many roots which would've been found correctly to the maximum precision with a == a+d — will get spoiled by the "epsilon" being too large.
It's true however that floats' exponent has finite limited range, so one actually can find the smallest nonzero positive FP number; in IEEE 754 single precision, it's the 1e-45 denorm. But it's useless! while (d >= 1e-45) {…} will loop forever with single-precision (positive nonzero) d.
At the same time, any choice of the "small number" in d < eps stopping condition will be too small for many equations. Where the root has high enough exponent, the result of subtraction of two neighboring mantissas will easily exceed our "epsilon". For example, 7.00023e+8 - 7.00022e+8 = 0.00001e+8 = 1.00000e+3 = 1000 — meaning that the smallest possible difference between numbers with exponent +8 and 6-digit mantissa is... 1000! It will never fit into, say, 1e-4. For numbers with relatively high exponent we simply have not enough precision to ever see a difference of 1e-4. This means eps = 1e-4 will be too small!
My implementation above took this last problem into account; you can see that d is halved each step, instead of being recalculated as the difference of (possibly huge in exponent) a and b. For reals, it doesn't matter; for floats it does! The algorithm will get into infinite loops with (b-a) < eps on equations with large enough roots. The previous paragraph shows why. d < eps won't get stuck, but even then needless iterations will be performed while shrinking d way down below the precision of a, still showing the choice of eps as too small. But a == a+d will stop exactly at the maximum achievable precision.
Thus as shown: any choice of eps in while (d < eps) {…} will be both too large and too small, if we allow F to vary.
... This kind of reasoning may seem overly theoretical and needlessly deep, but it illustrates again the trickiness of floats. One should be aware of their finite precision when writing arithmetic around them.
Perfect for integral values even in floating point formats
But the short answer is: "No, don't use ==."
Ironically, the floating point format works "perfectly", i.e., with exact precision, when operating on integral values within the range of the format. This means that if you stick with double values, you get perfectly good integers of up to 53 bits, giving you about +- 9,000,000,000,000,000, or 9 quadrillion.
In fact, this is how JavaScript works internally, and it's why JavaScript can do things like + and - on really big numbers, but can only << and >> on 32-bit ones.
Strictly speaking, you can exactly compare sums and products of numbers with precise representations. Those would be all the integers, plus fractions composed of 1/2^n terms. So, a loop incrementing by n + 0.25, n + 0.50, or n + 0.75 would be fine, but not by any of the other 96 two-digit decimal fractions.
So the answer is: while exact equality can in theory make sense in narrow cases, it is best avoided.
The only case where I ever use == (or !=) for floats is in the following:
    if (x != x)
    {
        // Here x is guaranteed to be Not a Number
    }
and I must admit I am guilty of using Not A Number as a magic floating point constant (using numeric_limits<double>::quiet_NaN() in C++).
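Since C++11 the same test can be written more explicitly with std::isnan (a trivial wrapper of my own for illustration):

    #include <cmath>

    bool isNotANumber(double x)
    {
        // equivalent to the x != x test above
        return std::isnan(x);
    }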
There is no point in comparing floating point numbers for strict equality. Floating point numbers have been designed with predictable relative accuracy limits. You are responsible for knowing what precision to expect from them and your algorithms.
It's probably OK if you're never going to calculate the value before you compare it. If you are testing whether a floating point number is exactly pi, or -1, or 1, and you know that's the limited set of values being passed in...
I have also used it a few times when rewriting algorithms into multithreaded versions. I used a test that compared the results of the single- and multithreaded versions to be sure that both of them give exactly the same result.
Let's say you have a function that scales an array of floats by a constant factor:
    void scale(float factor, float *vector, int extent) {
        int i;
        for (i = 0; i < extent; ++i) {
            vector[i] *= factor;
        }
    }
I'll assume that your floating point implementation can represent 1.0 and 0.0 exactly, and that 0.0 is represented by all 0 bits.
If factor is exactly 1.0 then this function is a no-op, and you can return without doing any work. If factor is exactly 0.0 then this can be implemented with a call to memset, which will likely be faster than performing the floating point multiplications individually.
The reference implementation of BLAS functions at netlib uses such techniques extensively.
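A sketch of those fast paths added to scale, assuming (as stated above) that 1.0 and 0.0 are exact and 0.0 is all-zero bits; note that the memset path also zeroes NaN or infinity entries, which a real multiply would not:

    #include <cstring>  // memset

    void scale(float factor, float *vector, int extent) {
        if (factor == 1.0f)
            return;  // multiplying by exactly 1.0 is a no-op
        if (factor == 0.0f) {
            // zero fill; relies on 0.0f being represented as all 0 bits
            memset(vector, 0, extent * sizeof(float));
            return;
        }
        for (int i = 0; i < extent; ++i)
            vector[i] *= factor;
    }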
In my opinion, comparing for equality (or some equivalence) is a requirement in most situations: standard C++ containers or algorithms with an implied equality comparison functor, like std::unordered_set for example, require that this comparator be an equivalence relation (see C++ named requirements: UnorderedAssociativeContainer).
Unfortunately, comparing with an epsilon as in abs(a - b) < epsilon does not yield an equivalence relation, since it loses transitivity. This most probably leads to undefined behavior: specifically, two 'almost equal' floating point numbers could yield different hashes, which can put the unordered_set in an invalid state.
Personally, I would use == for floating points most of the time, unless any kind of FPU computation would be involved on any operands. With containers and container algorithms, where only read/writes are involved, == (or any equivalence relation) is the safest.
abs(a - b) < epsilon is more or less a convergence criteria similar to a limit. I find this relation useful if I need to verify that a mathematical identity holds between two computations (for example PV = nRT, or distance = time * speed).
In short: use == if and only if no floating point computation occurs; never use abs(a - b) < e as an equality predicate.
Yes. 1/x will be valid unless x==0. You don't need an imprecise test here. 1/0.00000001 is perfectly fine. I can't think of any other case - you can't even check tan(x) for x==PI/2
The other posts show where it is appropriate. I think using bit-exact compares to avoid needless calculation is also okay.
Example:
    // cached state from the previous call (e.g. file-scope or member data)
    static float lastargument;
    static float cachedValue;

    float someFunction(float argument)
    {
        // I really want bit-exact comparison here!
        if (argument != lastargument)
        {
            lastargument = argument;
            cachedValue = very_expensive_calculation(argument);
        }
        return cachedValue;
    }
I would say that comparing floats for equality would be OK if a false-negative answer is acceptable.
Assume for example, that you have a program that prints out floating points values to the screen and that if the floating point value happens to be exactly equal to M_PI, then you would like it to print out "pi" instead. If the value happens to deviate a tiny bit from the exact double representation of M_PI, it will print out a double value instead, which is equally valid, but a little less readable to the user.
I have a drawing program that fundamentally uses floating point for its coordinate system, since the user is allowed to work at any granularity/zoom. The thing they are drawing contains lines that can be bent at points created by them. When they drag one point on top of another, the points are merged.
In order to do "proper" floating point comparison I'd have to come up with some range within which to consider the points the same. Since the user can zoom in to infinity and work within that range and since I couldn't get anyone to commit to some sort of range, we just use '==' to see if the points are the same. Occasionally there'll be an issue where points that are supposed to be exactly the same are off by .000000000001 or something (especially around 0,0) but usually it works just fine. It's supposed to be hard to merge points without the snap turned on anyway...or at least that's how the original version worked.
It throws off the testing group occasionally, but that's their problem :p
So anyway, there's an example of a possibly reasonable time to use '=='. The thing to note is that the decision is less about technical accuracy than about client wishes (or lack thereof) and convenience. It's not something that needs to be all that accurate anyway. So what if two points won't merge when you expect them to? It's not the end of the world and won't affect 'calculations'.

Unexpected loss of precision when dividing doubles

I have a function getSlope which takes 4 doubles as parameters and returns another double calculated from those parameters in the following way:
    double QSweep::getSlope(double a, double b, double c, double d) {
        double slope;
        slope = (d - b) / (c - a);
        return slope;
    }
The problem is that when calling this function with arguments for example:
getSlope(2.71156, -1.64161, 2.70413, -1.72219);
the returned result is:
10.8557
and this is not a good result for my computations.
I have calculated the slope using Mathematica and the result for the slope for the same parameters is:
10.8452
or with more digits for precision:
10.845222072678331.
The result returned by my program is not good in my further computations.
Moreover, I do not understand how the program arrives at 10.8557 starting from 10.845222072678331 (supposing that this is the approximate result of the division)?
How can I get the correct result for my division?
thank you in advance,
madalina
I print the result using the command line:
std::cout<<slope<<endl;
It may be that my parameters are not good, as I read them from another program (which computes a graph). After I read these parameters from the graph I just displayed them to see their values, but maybe the displayed vectors do not have the same internal precision as the calculated values. I do not know; it is really strange. Some numerical errors appear...
When the graph from which I am reading my parameters is computed, some numerical libraries written in C++ (with templates) are used. No OpenGL is used for this computation.
thank you,
madalina
I've tried with float instead of double and I get 10.845110 as a result. It still looks better than madalina's result.
EDIT:
I think I know why you get these results. If you get the a, b, c and d parameters from somewhere else and print them, it gives you rounded values. Then if you put those into Mathematica (or calc ;)) it will give you a different result.
I tried changing one of your parameters a little bit. When I did:
double c = 2.7041304;
I got 10.845806. I only added 0.0000004 to c!
So I think your "errors" aren't errors. Print a, b, c and d with better precision and then put those values into Mathematica.
The following code:
    #include <iostream>
    using namespace std;

    double getSlope(double a, double b, double c, double d) {
        double slope;
        slope = (d - b) / (c - a);
        return slope;
    }

    int main() {
        double s = getSlope(2.71156, -1.64161, 2.70413, -1.72219);
        cout << s << endl;
    }
gives a result of 10.8452 with g++. How are you printing out the result in your code?
Could it be that you use DirectX or OpenGL in your project? If so, they can turn off double precision and you will get strange results.
You can check your precision settings with
std::sqrt(x) * std::sqrt(x)
The result has to be pretty close to x.
I met this problem a long time ago and spent a month checking all the formulas. But then I found
D3DCREATE_FPU_PRESERVE
The problem here is that (c-a) is small, so the rounding errors inherent in floating point operations are magnified in this example. A general solution is to rework your equation so that you're not dividing by a small number; I'm not sure how you would do it here, though.
EDIT:
Neil is right in his comment on this question: I computed the answer in VB using Doubles and got the same answer as Mathematica.
The results you are getting are consistent with 32bit arithmetic. Without knowing more about your environment, it's not possible to advise what to do.
Assuming the code shown is what's running, ie you're not converting anything to strings or floats, then there isn't a fix within C++. It's outside of the code you've shown, and depends on the environment.
As Patrick McDonald and Treb both brought up the accuracy of your inputs and the error on c-a, I thought I'd take a look at that. One technique for examining rounding errors is interval arithmetic, which makes explicit the upper and lower bounds of the value a number represents (they are implicit in floating point numbers, fixed by the precision of the representation). By treating each value as an upper and lower bound, and by extending the bounds by the error in the representation (approximately x * 2^-53 for a double value x), you get a result which gives the lower and upper bounds on the accuracy of a value, taking into account worst-case precision errors.
For example, if you have a value in the range [1.0, 2.0] and subtract from it a value in the range [0.0, 1.0], then the result must lie in the range [below(0.0),above(2.0)] as the minimum result is 1.0-1.0 and the maximum is 2.0-0.0. below and above are equivalent to floor and ceiling, but for the next representable value rather than for integers.
Using intervals which represent worst-case double rounding:
getSlope(
a = [2.7115599999999995262:2.7115600000000004144],
b = [-1.6416099999999997916:-1.6416100000000002357],
c = [2.7041299999999997006:2.7041300000000005888],
d = [-1.7221899999999998876:-1.7221900000000003317])
(d-b) = [-0.080580000000000526206:-0.080579999999999665783]
(c-a) = [-0.0074300000000007129439:-0.0074299999999989383218]
to double precision [10.845222072677243474:10.845222072679954195]
So although c-a is small compared to c or a, it is still large compared to double rounding, so if you were using the worst imaginable double precision rounding, you could trust that value to be precise to 12 figures - 10.8452220727. You've lost a few figures off double precision, but you're still working to more than your inputs' significance.
But if the inputs were only accurate to the number of significant figures given, then rather than being the double value 2.71156 +/- eps, the input range would be [2.711555, 2.711565], so you get the result:
getSlope(
a = [2.711555:2.711565],
b = [-1.641615:-1.641605],
c = [2.704125:2.704135],
d = [-1.722195:-1.722185])
(d-b) = [-0.08059:-0.08057]
(c-a) = [-0.00744:-0.00742]
to specified accuracy [10.82930108:10.86118598]
which is a much wider range.
But you would have to go out of your way to track the accuracy in the calculations, and the rounding errors inherent in floating point are not significant in this example - it's precise to 12 figures with the worst case double precision rounding.
On the other hand, if your inputs are only known to 6 figures, it doesn't actually matter whether you get 10.8557 or 10.8452. Both are within [10.82930108:10.86118598].
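For the curious, a toy sketch of the worst-case interval subtraction used above (the Interval type and sub helper are my own illustration, not a library API):

    #include <cfloat>
    #include <cmath>

    // Closed interval [lo, hi] of doubles.
    struct Interval {
        double lo, hi;
    };

    // Subtract opposite bounds, then widen each result to the adjacent
    // representable double with nextafter to cover rounding error.
    Interval sub(Interval a, Interval b) {
        return { std::nextafter(a.lo - b.hi, -DBL_MAX),
                 std::nextafter(a.hi - b.lo,  DBL_MAX) };
    }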
Better print out the arguments, too. When you are, as I guess, transferring parameters in decimal notation, you will lose precision for each and every one of them. The problem is that 1/5 is an infinite series in binary, so e.g. 0.2 becomes 0.00110011.... Also, decimals are chopped when converting a binary float to a textual representation in decimal.
Next to that, sometimes the compiler chooses speed over precision. This should be a documented compiler switch.
Patrick seems to be right about (c-a) being the main cause:
d-b = -1.72219 - (-1.64161) = -0.08058
c-a = 2.70413 - 2.71156 = -0.00743
S = (d-b)/(c-a) = -0.08058 / -0.00743 = 10.845222
You start out with six digits of precision; through the subtraction you get a reduction to 3 and 4 digits. My best guess is that you lose additional precision because the number -0.00743 cannot be represented exactly as a double. Try using intermediate variables with bigger precision, like this:
    double QSweep::getSlope(double a, double b, double c, double d)
    {
        double slope;
        long double temp1, temp2;
        temp1 = (d - b);
        temp2 = (c - a);
        slope = temp1 / temp2;
        return slope;
    }
While the academic discussion going on is great for learning about the limitations of programming languages, you may find that the simplest solution to the problem is a data structure for arbitrary-precision arithmetic.
This will have some overhead, but you should be able to find something with fairly reliable accuracy guarantees.