Unsigned long long arithmetic into double - c++

I'm making a function that takes in 3 unsigned long longs, and applies the law of cosines to find out if the triangle is obtuse, acute or a right triangle. Should I just cast the variables to doubles before I use them?
void triar( unsigned long long& r,
unsigned long long x,
unsigned long long y,
unsigned long long z )
{
if(x==0 || y==0 || z==0) die("invalid triangle sides");
double t=(x*x + y*y -z*z)/(2*x*y);
t=acos (t) * (180.0 / 3.14159265);
if(t > 90) {
cout<<"Obtuse Triangle"<<endl;
r=t;
} else if(t < 90){
cout<<"Acute Triangle"<<endl;
r=t;
} else if(t == 90){
cout<<"Right Traingle"<<endl;
r=t;
}
}

There is generally no reason why you could not cast if you need floating point arithmetics. However, there is also an implicit conversion from unsigned long to double, so you can also often do completely without casting.
In many cases, including yours, you can cast only one of the arguments to force double arithmetics on a particular operation only. For example,
double t = (double)(x*x + y*y - z*z) / (2*x*y)
This way, all operations except for the division are computed in integer arithmetics and are therefore slighly faster. The cast is still necessary to avoid truncation during division.
Your code contains a comparison of floating point arguments. Floating point arithmetics however almost inevitably reduces accuracy. Avoid limited accuracy, or analyze and control accuracy.
Prefer an integer only solution as described in an excellent sister answer if you have a wide enough integral type at your disposal
Always avoid conversion from radians to degrees except for presentation to humans
Take the value of π from your mathematical library header files (unfortunately, this is platform dependent - try _USE_MATH_DEFINES + M_PI or, if already using boost libraries, boost::math::constants::pi<double>()), or express it analytically. For example, std::atan(1)*2 is the right angle.
If you choose double precision, and the ultimate difference value is less than, say, std::numeric_limits<double>::min() * 8, you can probably not tell anything about the triangle and the classification you return is basically bogus. (I made up the value of 8, you will possibly lose way more bits than three.)

You have a problem with obtuse triangles, x*x + y*y - z*z would mathematically give a negative result, that is then reduced modulo 2^WIDTH (where WIDTH is the number of value bits in unsigned long long, at least 64 and probably exactly that) yielding a - probably large - positive value (or in rare cases 0). Then the computed result of t = (x*x + y*y - z*z)/(2*x*y) can be larger than 1, and acos(t) would return a NaN.
The correct way to find out whether the triangle is obtuse/acute/right-angled with the given argument type is to check whether x*x + y*y < /* > / == */ z*z - if you can be sure the mathematical results don't exceed the unsigned long long range.
If you can't be sure of that, you can either convert the variables to double before the computation,
double xd = x, yd = y, zd = z;
double t = (xd*xd + yd*yd - zd*zd)/(2*xd*yd);
with possible loss of precision and incorrect results for nearly right-angled triangles (e.g. for the slightly obtuse triangle x = 2^29, y = 2^56-1, z = 2^56+2, both y and z would be converted to 2^56 with standard 64-bit doubles, xd*xd + yd*yd = 2^58 + 2^112 would be evaluated to 2^112, subtracting zd*zd then results in 0).
Or you can compare x*x + y*y to z*z - or x*x to z*z - y*y - using only integer arithmetic. If x*x is representable as an unsigned long long (I assume that 0 < x <= y <= z), it's relatively easy, first check whether (z - y)*(z + y) would exceed ULLONG_MAX, if yes, the triangle is obtuse, otherwise calculate and compare. If x*x is not representable, it becomes complicated, I think the easiest way (except for using a big integer library, of course) would be to compute the high and if necessary low 64 (or whatever width unsigned long long has) bits separately by splitting the numbers at half the width and compare those.
Further note: Your value for π, 3.14159265 is too inaccurate, right-angled triangles will be reported as obtuse.

Related

Simpson's Composite Rule giving too large values for when n is very large

Using Simpson's Composite Rule to calculate the integral from 2 to 1,000 of 1/ln(x), however when using a large n (usually around 500,000), I start to get results that vary from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. Wondering why this is, since as n gets larger, the accuracy is supposed to increase and not decrease.
#include <iostream>
#include <cmath>
// the function f(x)
float f(float x)
{
return (float) 1 / std::log(x);
}
float my_simpson(float a, float b, long int n)
{
if (n % 2 == 1) n += 1; // since n has to be even
float area, h = (b-a)/n;
float x, y, z;
for (int i = 1; i <= n/2; i++)
{
x = a + (2*i - 2)*h;
y = a + (2*i - 1)*h;
z = a + 2*i*h;
area += f(x) + 4*f(y) + f(z);
}
return area*h/3;
}
int main()
{
std::cout.precision(20);
int upperBound = 1'000;
int subsplits = 1'000'000;
float approx = my_simpson(2, upperBound, subsplits);
std::cout << "Output: " << approx << std::endl;
return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real (in mathematical sense) number, a float has a limited precision.
A typical IEEE 754 32-bit (single precision) floating-point number binary representation dedicates only 24 bits (one of which is implicit) to the mantissa and that translates in roughly less than 8 decimal significant digits (please take this as a gross semplification).
A double on the other end, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations, these days.
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweat spot, but after that the accumulation of rounding errors prevales and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, due to the fact that area becomes much greater than f(x) + 4*f(y) + f(z) (e.g 224678.937 vs. 0.3606823). The bigger n is, the sooner this gets relevant, making the result diverging from the real one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).

How to increase accuracy of floating point second derivative calculation?

I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
float h = 0.001;
float dfdx;
dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
return dfdx;
}
// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
float h = 0.001;
float d2fdx2;
d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
return d2fdx2;
}
Main function:
int main() {
pc.baud(9600);
float x = 2.0;
pc.printf("**** Function Pointers ****\r\n");
pc.printf("Value of f(%f): %f\r\n", x, f1(x));
pc.printf("First derivative: %f\r\n", first_dx(f1, x));
pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?
The accuracy can be increased by choosing a type which has more precision. float is currently defined as an IEEE-754 32-bit number, giving you a precision of ~7.225 decimal places.
What you want is the 64-bit counterpart: double with ~15.955 decimal places accuracy.
That should be sufficient for your calculation, however worth mentioning is boosts implementation which offers a quadruple-precision floating point number (128-bit).
Finally The GNU Multiple Precision Arithmetic Library offers types with an arbitrary number of decimal places for precision.
Go analytical. ;-) probably not an option given "with the current
method".
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.

Is there a way to optimize this function?

For an application I'm working on, I need to take two integers and add them together using a particular mathematical formula. This ends up looking like this:
int16_t add_special(int16_t a, int16_t b) {
float limit = std::numeric_limits<int16_t>::max();//32767 as a floating point value
float a_fl = a, b_fl = b;
float numerator = a_fl + b_fl;
float denominator = 1 + a_fl * b_fl / std::pow(limit, 2);
float final_value = numerator / denominator;
return static_cast<int16_t>(std::round(final_value));
}
Any readers with a passing familiarity with physics will recognize that this formula is the same as what is used to calculate the sum of near-speed-of-light velocities, and the calculation here intentionally mirrors that computation.
The code as-written gives the results I need: for low numbers, they nearly add together normally, but for high numbers, they converge to the maximum value of 32767, i.e.
add_special(10, 15) == 25
add_special(100, 200) == 300
add_special(1000, 3000) == 3989
add_special(10000, 25000) == 28390
add_special(30000, 30000) == 32640
Which all appears to be correct.
The problem, however, is that the function as-written involves first transforming the numbers into floating point values before transforming them back into integers. This seems like a needless detour for numbers that I know, as a principle of its domain, will never not be integers.
Is there a faster, more optimized way to perform this computation? Or is this the most optimized version of this function I can create?
I'm building for x86-64, using MSVC 14.X, although methods that also work for GCC would be beneficial. Also, I'm not interested in SSE/SIMD optimizations at this stage; I'm mostly just looking at the elementary operations being performed on the data.
You might avoid floating number and does all computation in integral type:
constexpr int16_t add_special(int16_t a, int16_t b) {
std::int64_t limit = std::numeric_limits<int16_t>::max();
std::int64_t a_fl = a;
std::int64_t b_fl = b;
return static_cast<int16_t>(((limit * limit) * (a_fl + b_fl)
+ ((limit * limit + a_fl * b_fl) / 2)) /* Handle round */
/ (limit * limit + a_fl * b_fl));
}
Demo
but according to Benchmark, it is not faster for those values.
As noted by Johannes Overmann, a big performance boost is gained by avoiding std::round, at the cost of some (little) discrepancies in the results, though.
I tried some other little changes HERE, where it seems that the following is a faster approach (at least for that architecture)
constexpr int32_t i_max = std::numeric_limits<int16_t>::max();
constexpr int64_t i_max_2 = static_cast<int64_t>(i_max) * i_max;
int16_t my_add_special(int16_t a, int16_t b)
{
// integer multipication instead of floating point division
double numerator = (a + b) * i_max_2;
double denominator = i_max_2 + a * b;
// Approximated rounding instead of std::round
return 0.5 + numerator / denominator;
}
Suggestions:
Use 32767.0*32767.0 (which is a constant) instead of std::pow(limit, 2).
Use integer values as much as possible, potentially with fixed points. Just the two divisions are a problem. Use floats just form them, if necessary (depends on the input data ranges).
Make it inline if the function is small and if it is appropriate.
Something like:
int16_t add_special(int16_t a, int16_t b) {
float numerator = int32_t(a) + int32_t(b); // Cannot overflow.
float denominator = 1 + (int32_t(a) * int32_t(b)) / (32767.0 * 32767.0); // Cannot overflow either.
return (numerator / denominator) + 0.5; // Relying on implementation defined rounding. Not good but potentially faster than std::round().
}
The only risk with the above is the omission of the explicit rounding, so you will get some implicit rounding.

How can I check whether a double has a fractional part?

Basically I have two variables:
double halfWidth = Width / 2;
double halfHeight = Height / 2;
As they are being divided by 2, they will either be a whole number or a decimal. How can I check whether they are a whole number or a .5?
You can use modf, this should be sufficient:
double intpart;
if( modf( halfWidth, &intpart) == 0 )
{
// your code here
}
First, you need to make sure that you're using double-precision floating-point math:
double halfWidth = Width / 2.0;
double halfHeight = Height / 2.0;
Because one of the operands is a double (namely, 2.0), this will force the compiler to convert Width and Height to doubles before doing the math (assuming they're not already doubles). Once converted, the division will be done in double-precision floating-point. So it will have a decimal, where appropriate.
The next step is to simply check it with modf.
double temp;
if(modf(halfWidth, &temp) != 0)
{
//Has fractional part.
}
else
{
//No fractional part.
}
You may discard a fractional part and compare the result with the original value using floor():
if (floor(halfWidth) == halfWidth) {
// halfWidth is a whole number
} else {
// halfWidth has a non-zero fractional part
}
As rightly pointed out by #Dávid Laczkó, it's a better solution than modf() because there's no need for an additional variable.
And according to my benchmarks (Linux, gcc 8.3.0, optimizations -O0...-O3), the floor() call consumes less CPU time than modf() on the modern notebook and server processors. The difference even growing with compiler optimizations enabled. Probably it's because the modf() has two arguments when the floor() has only one argument.

What is the optimum epsilon/dx value to use within the finite difference method?

double MyClass::dx = ?????;
double MyClass::f(double x)
{
return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}
double MyClass::fp(double x) // derivative of f(x), that is f'(x)
{
return (f(x + dx) - f(x)) / dx;
}
When using finite difference method for derivation, it is critical to choose an optimum dx value. Mathematically, dx must be as small as possible. However, I'm not sure if it is a correct choice to choose it the smallest positive double precision number (i.e.; 2.2250738585072014 x 10−308).
Is there an optimal numeric interval or exact value to choose a dx in to make the calculation error as small as possible?
(I'm using 64-bit compiler. I will run my program on a Intel i5 processor.)
Choosing the smallest possible value is almost certainly wrong: if dx were that smallest number, then f(x + dx) would be exactly equal to f(x) due to rounding.
So you have a tradeoff: Choose dx too small, and you lose precision to rounding errors. Choose it too large, and your result will be imprecise due to changes in the derivative as x changes.
To judge the numeric errors, consider (f(x + dx) - f(x))/f(x)1 mathematically. The numerator denotes the difference you want to compute, but the denominator denotes the magnitude of numbers you're dealing with. If that fraction is about 2‒k, then you can expect approximately k bits of precision in your result.
If you know your function, you can compute what error you'd get from choosing dx too large. You can then balence things, so that the error incurred from this is about the same as the error incurred from rounding. But if you know the function, you might be better off by providing a function that directly computes the derivative, like in your example with the polygonal f.
The Wikipedia section that pogorskiy pointed out suggests a value of sqrt(ε)x, or approximately 1.5e-8 * x. Without any more detailed knowledge about the function, such a rule of thumb will provide a reasonable default. Also note that that same section suggests not dividing by dx, but instead by (x + dx) - x, as this takes rounding errors incurred by computing x + dx into account. But I guess that whole article is full of suggestions you might use.
1 This formula really should divide by f(x), not by dx, even though a past editor thought differently. I'm attempting to compare the amount of significant bits remaining after the division, not the slope of the tangent.
Why not just use the Power Rule to derive the derivative, you'll get an exact answer:
f(x) = 3x^3 - 2x^2 + x - 5
f'(x) = 9x^2 - 4x + 1
Therefore:
f(x) = 3.0 * x * x * x - 2.0 * x * x + x - 5.0
fp(x) = 9.0 * x * x - 4.0 * x + 1.0