Fortran round-off errors - fortran

I have simple code that flags the nodes within the region enclosed by a cylinder. When I run it, the cylinder shows a mild tilt in the 90-degree case.
The actual issue:
The algorithm is implemented in Fortran. The code checks whether each point of a Cartesian grid lies inside the cylinder. The test case is as follows:
The cylinder makes an angle of 90 degrees in the yz-plane with respect to the y-axis. Therefore, the orientation vector $\vec{o}$ is (0, 1, 0).
Case 1:
The orientation vector is assigned directly as $\vec{o}=(0.0,1.0,0.0)$. This results in a perfect cylinder with $\theta=90$.
Case 2:
The orientation vector is computed with the double-precision intrinsic Fortran functions dsin and dcos as $\vec{o}=(0.0, \sin(\pi/2.0), \cos(\pi/2.0))$, with $\pi$ assigned to more than 20 significant decimal digits. The resulting cylinder is mildly tilted.
The highlighted region indicates the extra material due to the tilt of the cylinder with respect to the Cartesian axes. I also tried an architecture-specific maximum-precision "pi" value. This results in the same problem.
It looks as if the actual angle made by the cylinder is not 90 degrees. Can anyone suggest a valid solution for this problem? I need to use the built-in trigonometric functions for arbitrary angles and am looking for an accurate cell-flagging method.
Note: All operations are performed with double precision accuracy.
The actual function is below. rk is a parameter defined with value 8.
pure logical function in_particle(p,px,x)
  type(md_particle_type),intent(in) :: p
  real(kind=rk),intent(in) :: px(3),x(3)
  real(kind=rk) :: r(3),rho(3),rop(2),ro2,rdiff,u

  rop = particle_radii(p) ! (/R_orth,R_para/)
  ro2 = rop(1)**2
  rdiff = rop(2) - rop(1)
  r = x-px
  ! Case 1:
  ! u = dot_product((/0.0_rk,-1.0_rk,0.0_rk/),r)
  ! rho = r-u*(/0.0_rk,-1.0_rk,0.0_rk/)
  ! Case 2:
  u = dot_product((/0.0_rk,-dsin(pi/2.0_rk),dcos(pi/2.0_rk)/),r)
  rho = r-u*(/0.0_rk,-dsin(pi/2.0_rk),dcos(pi/2.0_rk)/)
  if((u.le.rdiff).and.(u.ge.-rdiff)) then
    in_particle = dot_product(rho,rho) < ro2
  else
    in_particle = .false.
  end if
end function in_particle
Note: The trigonometric operations are done inside the code to explain the problem better. However, the original code reads the orientation in vector form from the user, then converts it to quaternions for particle-particle collision operations. On converting the quaternions back to an orientation vector, this error is amplified even more. Even before the first collision, the orientation of the cylinder is off by 2 lattice cells.

cos(pi/2) is not necessarily going to give you exactly 0, no matter how exact you make the cos calculation, and no matter how many digits of pi you have, because:
pi, as an irrational number, will contain up to 1/2 ulp of error when represented as an FP number; and
sin and cos are not guaranteed by the IEEE-754 standard to be correctly rounded (or even implemented).
Now, sin(pi/2) is extremely likely to come out as 1 regardless of precision and FP architecture, simply because sin has such a low derivative near pi/2, where its value is 1; with single-precision floats, it should come out to 1 if you're anywhere within about 3e-4 of the exact value of pi/2. The problematic call is the cos, which has lots of precision to play with around 0 and a derivative of about -1 in the neighborhood.
Still, we're talking about extremely small values here. I think what's really potentiating the problem here is the in/out test you're doing, combined with ordinary FP rounding rules. I would guess, in fact, that if you were to bias your test points by, say, a quarter of the grid quantum, you'd see all straight verticals in your voxelization (though it might not be symmetrical around the minor axes).
Another option would be to actually discard some precision from your sin/cos calculation before doing the dot product, effectively quantizing your axes.
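A minimal C++ sketch of that quantizing idea (the question's code is Fortran; the helper names `snapAxis` and `axisFromAngle` and the tolerance `1e-12` are assumptions made for illustration, not part of the original code). Components smaller than the tolerance are set to exactly zero before the axis is used in the dot product:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical helper: zero out axis components whose magnitude is below
// `tol`, then renormalise so the axis stays a unit vector.
std::array<double, 3> snapAxis(std::array<double, 3> axis, double tol = 1e-12) {
    assert(tol >= 0.0);
    double norm2 = 0.0;
    for (double& c : axis) {
        if (std::fabs(c) < tol) c = 0.0;   // discard sub-tolerance noise
        norm2 += c * c;
    }
    if (norm2 > 0.0) {                     // leave an all-zero input untouched
        const double inv = 1.0 / std::sqrt(norm2);
        for (double& c : axis) c *= inv;
    }
    return axis;
}

// Orientation as in the question's Case 2: (0, sin(theta), cos(theta)).
std::array<double, 3> axisFromAngle(double theta) {
    return snapAxis({0.0, std::sin(theta), std::cos(theta)});
}
```

With theta = pi/2 the z component, which would otherwise be a tiny non-zero value, becomes exactly 0. The tolerance has to be chosen well below any tilt you actually want to resolve.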

Short answer: Create a table of sin and cos of common angles (0, pi/6, pi/4, pi/3, pi/2, pi and their multiples) and compute only for uncommon angles. The reason being that errors with uncommon angles will be tolerated by most people while errors with common angles will likely not be tolerated.
Explanation:
Because floating point computation is not exact (that is its nature), you sometimes need a little bit of compromise between the accuracy and the readability of the code.
One way of doing that is to avoid computing something that is known exactly. To do that, you can check the value of the angle and do the actual computation only if it is not an obvious angle. For example, the angles 0, 90, 180 and 270 degrees have obvious values of sin and cos. More generally, the cos and sin of common angles (0, pi/6, pi/4, pi/3, pi/2, pi and their multiples) are known exactly (even if they are irrational numbers).
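As a rough illustration of the table idea, here is a C++ sketch that covers only multiples of 90 degrees; the function name `sinCosDegrees` and the choice of whole degrees as input are assumptions. Other common angles (30, 45, 60 degrees) could be added to the switch in the same way:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical helper: returns {sin, cos} for an angle in whole degrees.
// Multiples of 90 degrees come from an exact table; the rest is computed.
std::pair<double, double> sinCosDegrees(int degrees) {
    const double kPi = 3.14159265358979323846;
    int d = degrees % 360;
    if (d < 0) d += 360;                  // normalise to [0, 360)
    assert(d >= 0 && d < 360);
    switch (d) {
        case 0:   return {0.0, 1.0};
        case 90:  return {1.0, 0.0};
        case 180: return {0.0, -1.0};
        case 270: return {-1.0, 0.0};
        default: {
            const double rad = d * kPi / 180.0;
            return {std::sin(rad), std::cos(rad)};
        }
    }
}
```

Taking the angle in degrees makes the comparison with the table entries exact, which would not be the case when comparing a floating-point radian value against pi/2.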

Related

How can I effectively calculate the phase angle of a complex number that is (essentially) equal to zero?

I'm writing a C++ program that takes the FFT of a real input signal containing double values and returns a vector X containing std::complex<double> values. Once I have the vector of results I then attempt to calculate the magnitude and phase of the result.
I am running into an issue with calculating the phase angle when one of the outputs is "zero". Zero is in quotes because when a calculation that results in 0 returns a double, the returned value will be very near zero, but not quite exactly zero.
For example, at index 3 my output array has the calculated "zero" value:
X[3] = 3.0531133177191805e-16 - i*5.5511151231257827e-17
I am trying to use the standard library std::arg function that is supposed to return the phase angle of a complex number. std::arg(X[3])
While X[3] is essentially 0, it is not EXACTLY 0, and the way phase is calculated this causes a problem, because the calculation uses the ratio of the imaginary part to the real part, which is far from 0!
Doing the actual calculation results in a far from desirable result.
How can I make C++ realize that the result is really 0 so I can get the correct phase angle?
I'm looking for a more elegant solution than using an arbitrary hard-coded "epsilon" value to compare the double to, but so far searching online I haven't had any luck coming up with something better.
If you are computing the floating-point FFT of an input signal, then that signal will include noise, thus have a signal-to-noise ratio, including sensor noise, thermal noise, quantization noise, timing jitter noise, etc.
Thus the threshold for discarding FFT results as below your noise floor most likely isn't a matter of computational mathematics, but part of your physical or electronic data acquisition analysis. You will have to plug that number in, and set the phase to 0.0 or NaN or whatever your default flagging value is for a non-useful (at or below the noise floor) FFT result.
It was brought to my attention that my original answer will not work when the input to the FFT has been scaled. I believe I have an actual valid solution now... The original answer is kept below so that the comments still make sense.
From the comments on this answer and others, I've gathered that calculating the exact rounding error in the language may technically be possible but it is definitely not practical. The best practical solution seems to be to allow the user to provide their own noise threshold (in dB) and ignore any data points whose power level falls below that threshold. It would be impossible to come up with a generic threshold for all situations, but the user can provide a reasonable threshold based on the signal-to-noise ratio of the signal being analyzed and pass that in.
A generic phase calculation function is shown below that calculates the phase angles for a vector of complex data points.
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

std::vector<double> Phase(const std::vector<std::complex<double>>& X, double threshold, double amplitude)
{
    std::vector<double> X_phase(X.size());
    std::transform(X.begin(), X.end(), X_phase.begin(), [threshold, amplitude](const std::complex<double>& value) {
        // Power level in dB relative to the maximum amplitude.
        double level = 10.0 * std::log10(std::norm(value) / std::pow(amplitude, 2.0));
        // Points at or below the threshold get a phase of 0.
        return level > threshold ? std::arg(value) : 0.0;
    });
    return X_phase;
}
This function takes 3 arguments:
The vector of complex signal data you want to calculate the phase of.
A sensible threshold -- Can be calculated from the signal-to-noise ratio of whatever measurement device was used to capture the signal. If your signal contains no noise other than the rounding errors of the language itself you can set this to some arbitrary really low value, like -120dB.
The maximum possible amplitude of your input signal. If your signal is calculated, this should simply be set to the amplitude of your signal. If your signal is measured, this should be set to the maximum amplitude the measuring device is capable of measuring (If your signal comes from reading an audio file, often its data will be normalized between -1.0 and 1.0. In this case you would just set the amplitude value to 1.0).
This new implementation still provides me with the correct results, but is much more robust. By leaving the threshold calculation to the user they can set the most sensible value themselves based on the characteristics of the measurement device used to measure their input signal.
Please let me know if you notice any errors or any ways I can further improve the design!
Original Answer
I found a solution that seems generic enough.
In the #include <limits> header, there is a constant value for std::numeric_limits<double>::digits10.
According to the documentation:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many significant decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow.
Using this I can filter out any output values that have a magnitude lower than this limit:
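Calculate the phase of X[3]:
const std::size_t N = X.size();
// Treat X[3] as exactly zero when its scaled magnitude is below 10^-digits10.
const std::complex<double> tmp = std::abs(X[3]) / N > std::pow(10.0, -std::numeric_limits<double>::digits10)
    ? X[3]
    : std::complex<double>(0.0, 0.0);
double phase = std::arg(tmp);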
This effectively filters out any values that are not precisely zero due to rounding errors within the C++ language itself. It will NOT however filter out garbage data caused by noise in the input signal.
After adding this to my phase calculation I get the expected results.
The map from complex numbers to magnitude and phase is discontinuous at 0.
This is a discontinuity caused by the choice of coordinates you are using.
The solution will depend on why you chose those coordinates in a situation where values near the discontinuity are possible.
It isn't "really" zero. If you factored in error bars properly, your answer would really be a small magnitude (hopefully) and an unconstrained angle.

Creating a lookup table for atan

I have an application that requires very low precision (within 2 degrees) and very high speed to determine the angle of a line given rise/run. Specifically, the precision is really only needed closer to the x axis (below 45 or above 135 degrees), which I think is easier to accomplish because as the angle nears 90 it approaches an undefined value. Currently, I use atan2 from the math.h library, but I would like something faster.
I have seen this example and think a lookup table for atan would suffice. However, it's much trickier to make one for arctan than for tan, as I have to think in terms of the slope and how it can be represented as an integer so that it can be used as an index into the table.
Has anyone done this before? I'm thinking I need to have some sort of scale factor, so when I take rise/run and get my slope as a decimal I may have to multiply it by a constant value, otherwise everything below 45 will be 0 degrees. However, in this case I sacrifice a lot of accuracy above 45 degrees. Really, I do not need to distinguish between anything between 75-105 degrees. But in the 30/160 degree range it would be good to be able to have accuracy at 2-3 degrees.
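One possible way to sidestep the scale-factor problem, sketched in C++ under stated assumptions: always divide the smaller of |rise| and |run| by the larger, so the ratio stays in [0, 1], and index a small table with it. The names `fastAtan2Deg` and `kAtanTable` and the table size of 64 are illustrative choices, not taken from the question. With 64 steps the quantization error stays below about half a degree everywhere.

```cpp
#include <array>
#include <cassert>
#include <cmath>

namespace {
constexpr int kSteps = 64;   // table resolution (assumed)
constexpr double kRadToDeg = 180.0 / 3.14159265358979323846;

// atan(i / kSteps) in degrees for i = 0..kSteps, built once with std::atan.
std::array<float, kSteps + 1> buildTable() {
    std::array<float, kSteps + 1> t{};
    for (int i = 0; i <= kSteps; ++i)
        t[i] = static_cast<float>(std::atan(static_cast<double>(i) / kSteps) * kRadToDeg);
    return t;
}
const std::array<float, kSteps + 1> kAtanTable = buildTable();
}  // namespace

// Approximate atan2(rise, run) in degrees, in the range [-180, 180].
float fastAtan2Deg(float rise, float run) {
    assert(std::isfinite(rise) && std::isfinite(run));
    const float ay = std::fabs(rise);
    const float ax = std::fabs(run);
    if (ax == 0.0f && ay == 0.0f) return 0.0f;   // direction undefined
    float angle;
    if (ax >= ay) {   // at most 45 degrees from the x axis: ratio in [0, 1]
        angle = kAtanTable[static_cast<int>(ay / ax * kSteps + 0.5f)];
    } else {          // steeper than 45 degrees: use the complementary angle
        angle = 90.0f - kAtanTable[static_cast<int>(ax / ay * kSteps + 0.5f)];
    }
    if (run < 0.0f) angle = 180.0f - angle;      // left half-plane
    return rise < 0.0f ? -angle : angle;         // lower half-plane
}
```

The division is still a floating-point operation; whether this beats the platform's atan2 would have to be measured on the target hardware.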

Restoring the exact angle from std::cos(angle) using std::acos

Is it guaranteed by the C++ standard that angle == std::acos(std::cos(angle)) if angle is in the range [0, Pi], or in other words is it possible to restore the exact original value of angle from the result of std::cos using std::acos given the mentioned range limit?
The marginal cases when angle is infinity or NaN are omitted.
Answer by StoryTeller:
The standard cannot make that guarantee, simply because the result of std::cos may not be representable exactly by a double, so you get a truncation error, which will affect the result of std::acos.
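A small C++ sketch of how the round trip can lose the angle (the helper name `roundTrip` is an assumption). For a tiny angle such as 1e-9, the exact cosine is about 1 - 5e-19, which is much closer to 1.0 than to the next double below 1.0, so common implementations return exactly 1.0 and acos then yields 0:

```cpp
#include <cassert>
#include <cmath>

// Returns acos(cos(angle)). For angle in [0, pi] this is mathematically
// the identity, but the intermediate cosine is rounded to a double.
double roundTrip(double angle) {
    assert(angle >= 0.0 && angle <= 3.14159265358979323846);
    return std::acos(std::cos(angle));
}
```

Comparing `roundTrip(1e-9)` with `1e-9` shows the mismatch: the original angle cannot be recovered once its cosine has been rounded to 1.0.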
From cppreference.com:
"If no errors occur, [acos returns] the arc cosine of arg (arccos(arg)) in the range [0 ; π]"
In degrees, that's 0 to 180, inclusive, corresponding to cosine values 1 down through -1, inclusive.
Outside that range you can't even get an approximate correspondence. Computing the cosine discards information about which angle you had outside of that range. There's no way to get that information back.
How information is discarded:
First, in degrees, cos(x) = cos(K*360 + x), for arbitrary integer K. Secondly, cos(x) = cos(-x). This adds up to an awful lot of angle values that produce the same cosine value.
Also, even though all readers likely know this, for completeness: since sines and cosines are generally irrational numbers, not simple fractions, you can't expect exact results except maybe for cosine 1, which corresponds to 0 degrees.
According to the standard:
This International Standard imposes no requirements on the accuracy
of floating-point operations; see also 18.3.2. — end note ]
http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/n4606.pdf
Even mathematically this is impossible. For example, cos(2*PI) is 1, but so is cos(4*PI).

How can you transform a set of numbers into mostly whole ones?

Small amount of background: I am working on a converter that bridges between a map maker (Tiled) that outputs XML, and an engine (Angel2D) that takes Lua tables as input. Most of this is straightforward.
However, Tiled outputs in pixel offsets (integers of absolute values), while Angel2D inputs OpenGL units (floats of relative values); a conversion factor between these two is needed (for example, 32px = 1gu). Since OpenGL units are abstract, and the camera can zoom in or out if the objects are too small or big, the actual conversion factor isn't important; I could use a random number, and the user would merely have to zoom in or out.
But it would be best if the conversion factor was selected such that most numbers outputted were small and whole (or fractions of small whole numbers), because that makes it easier to work with (and the whole point of the OpenGL units is that they are easy to work with).
How would I find such a conversion factor reliably?
My first attempt was to use the smallest number given; this resulted in no fractions below 1, but often led to lots of decimal places where the factors didn't line up.
Then I tried the mode of the sequence, which led to the largest number of 1's possible, but often led to very long floats for background images.
My current approach gets the GCD of the whole sequence, which, when it works, works great, but can easily be thrown off course by a single bad apple.
Note that while I could easily just pass the numbers I am given along, or pick some fixed factor, or use one of the conversions I specified above, I am looking for a method to reliably scale this list of integers to small, whole numbers or simple fractions, because this would most likely be unsurprising to the end user; this is not a one off conversion.
The end users tend to use 1.0 as their "base" for manipulations (because it's simple and obvious), so it would make more sense for the sizes of entities to cluster around this.
How about the 'largest number which is a factor of some % of the values'.
So the GCD is the 'largest number which is a factor of 100%' of the values.
You could pick the largest number which is a factor of, say 60%, of the values. I don't know if it's a technical term but it's sort of a 'rough GCD if not a precise GCD'.
You might have to use trial and error to find it (possibly a binary search). But you could also consider sampling, i.e. if you have a million data points, just pick 100 or 1000 at random to find a number which divides evenly into your goal percentage of the sample set, and that might be good enough.
Some rough C++-style code:
#include <algorithm>
#include <vector>

/** Return the fraction (0 to 1) of values in sampleset for which x is a factor. */
double percentIsFactorOf(int x, const std::vector<int>& sampleset) {
    int factorCount = 0;
    for (int sample : sampleset)
        if (sample % x == 0) factorCount++;
    return static_cast<double>(factorCount) / sampleset.size();
}

/** Find the largest value which is a factor of at least goalPercentage (0 to 1) of sampleset. */
int findGoodEnoughCommonFactor(const std::vector<int>& sampleset, double goalPercentage) {
    // Slow brute-force search: add binary search, sampling, or something smarter to improve if you like.
    int candidate = *std::max_element(sampleset.begin(), sampleset.end());
    while (candidate > 1 && percentIsFactorOf(candidate, sampleset) < goalPercentage)
        candidate--;
    return candidate;
}
Your input is in N^2 (two-dimensional space over the natural numbers, i.e. non-negative integers), and you need to output to R^2 (two-dimensional space over the real numbers, which in this case will be approximated with floats).
Forget about scaling for a minute and let the output be of the same scale as the input. The first step is to realize that the input coordinate <0, 0> does not represent <0, 0> in the output; it represents <0.5f, 0.5f>, the center of the pixel. Similarly the input <2, 3> becomes <2.5, 3.5>. In general the conversion can be performed like this:
float x_prime = (float)x + 0.5f;
float y_prime = (float)y + 0.5f;
Next, you probably want to pick a scaling factor, as you have mentioned. I've always found it useful to pick some real-world unit, usually meters. This way you can reason about other physical aspects of what you're trying to model, because they have units; i.e. speeds, accelerations, can now be in meters per second, or meters per second squared. How many meters tall or wide is the thing you are making? How many meters is a pixel? Pick something that makes sense, and then your formula becomes this:
float x_prime = ((float)x + 0.5f) * (float)units_per_pixel;
float y_prime = ((float)y + 0.5f) * (float)units_per_pixel;
You may not want all of your output coordinates to be in the positive quadrant; that is, you may want the origin to be in the center of the object. If you do, you probably want your starting coordinate system to include negative integers, or provide some offset to the true center. Let's say you provide a pixel offset to the true center. Your conversion then becomes this:
float x_prime = ((float)x + 0.5f - (float)x_offset) * (float)units_per_pixel;
float y_prime = ((float)y + 0.5f - (float)y_offset) * (float)units_per_pixel;
Discarding your background information, I understand that the underlying problem you are trying to solve is the following:
Given a finite number of (positive) integers {x_1, ... x_N} find some (rational) number f such that all x_i / f are "nice".
If you insist on "nice" meaning integer and as small as possible, then f = GCD is the (mathematically) exact answer to this question. There just is nothing "better", if the GCD is 1, tough luck.
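For reference, a short C++ sketch of that exact answer, using Euclid's algorithm folded over the list (the names `gcdTwo` and `gcdOfAll` are made up for this example). It also shows how a single value that does not fit drags the result down to 1:

```cpp
#include <cassert>
#include <vector>

// Euclid's algorithm for two non-negative integers.
int gcdTwo(int a, int b) {
    while (b != 0) {
        const int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// GCD of a whole list of positive integers.
int gcdOfAll(const std::vector<int>& values) {
    assert(!values.empty());
    int g = 0;   // gcdTwo(0, x) == x, so 0 is a neutral starting value
    for (int v : values) g = gcdTwo(g, v);
    return g;
}
```

For the sizes {32, 64, 96, 128} this gives 32, while replacing one entry with 33 gives 1.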
If "nice" is supposed to mean rational with small numerator and denominator, the question gets more interesting and depending on what "small" means, find your trade off between small absolute value (f = max) and small denominator (f = GCD). Notice, however, that small numerator/denominator does not mean small floating point representation, e.g. 1/3 = 0.333333... in base 10.
If you want short floating points, make sure that f is a power of your base, i.e. 10 or 2, depending on whether the numbers should look short to the user or actually have a reasonable machine representation. This is what is used for scientific representation of floating points, which might be the best answer to the question of how to make decimal numbers look nice in the first place.
I have no idea what you are talking about with "GL units".
At the most abstract level, GL has no unit. Vertex coordinates are in object-space initially, and go through half a dozen user-defined transformations before they eventually produce coordinates (window-space) with familiar units (pixels).
You are absolutely correct that even in window-space, coordinates are still not whole numbers. You would not want this in fact, or triangles would jump all over the place and generally would not resemble triangles if their vertex positions were snapped to integer pixel coordinates.
Instead, GL throws sub-pixel precision into the mix. Coordinates still ultimately wind up quantized to integer values, but each integer may cover 1/256th of a pixel given 8-bit sub-pixel precision. Pixel coverage testing is done at the sub-pixel level.
[Figure: sub-pixel coverage testing; source: microsoft.com]
GL never attempts to find any conversion factor like you are discussing, it just splits the number space for pixel coordinates up into a fixed division between integral and fractional... fixed-point in other words. You might consider doing the same thing.
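A minimal C++ sketch of such a fixed-point split, assuming 8 fractional bits as in the example above; `toFixed`, `fromFixed` and the constants are hypothetical names, not part of any GL API:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kSubPixelBits = 8;                    // 1/256 resolution (assumed)
constexpr int kSubPixelScale = 1 << kSubPixelBits;  // 256

// Quantise a coordinate to a fixed-point integer with 8 fractional bits.
std::int32_t toFixed(double value) {
    assert(std::fabs(value) < 8.0e6);   // keeps the scaled value inside int32_t
    return static_cast<std::int32_t>(std::lround(value * kSubPixelScale));
}

// Convert back; the result is always a multiple of 1/256.
double fromFixed(std::int32_t fixed) {
    return static_cast<double>(fixed) / kSubPixelScale;
}
```

Values that are multiples of 1/256 survive the round trip exactly; anything else is rounded to the nearest such multiple, so the error is at most 1/512.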
You can recycle the code you probably currently use for vector normalisation, normalise the values to fit within a max. value of 1; for example:
the formula for 3d normalisation of a vector works fine here
Get the length first:
|a| = sqrt((ax * ax) + (ay * ay) + (az * az))
Then you will need to divide the values of each component by the length:
x = ax/|a|
y = ay/|a|
z = az/|a|
Now all the x, y, z values will fall within the range -1 to 1, the same as the OpenGL base coordinate system.
I know this does not generate the whole numbers system you would like, however it does give a smaller more unified feel to the range.
Say you want to limit the range to whole numbers only, simply use a function like the following, which will take the normalised value and convert it to an int-only range value:
#include <algorithm> // this allows the use of std::min
const int maxVal = 256;
unsigned char convertToSpread(float floatValueToConvert) {
    return (unsigned char) (std::min((maxVal-1), (int) (floatValueToConvert * maxVal)));
}
The above will spread your values between 0 and 255, simply increase the value of maxVal to what you need and change the unsigned char to a datatype which suits your needs.
So if you want 1024 values, simply change maxVal to 1024 and unsigned char to unsigned int.
Hope this helps, however, let me know if you need more information as well, and I can elaborate:)

Fourier transform floating point issues

I am implementing a conventional (that means not fast), separated Fourier transform for images. I know that in floating point a sum over one period of sin or cos in equally spaced samples is not perfectly zero, and that this is more a problem with the conventional transform than with the fast.
The algorithm works with 2D double arrays and is correct. The inverse is done inside (via a double sign flag and a conditional check when using the asymmetric formula), not outside with conjugations. Results are nearly 100% as expected, so it's a question about details:
When I perform a forward transform, save logarithmed magnitude and angle to images, reload them, and do an inverse transform, I experience different types of rounding errors with different types of implemented formulas:
F(u,v) = Sum(x=0->M-1) Sum(y=0->N-1) f(x,y) * e^(-i*2*pi*u*x/M) * e^(-i*2*pi*v*y/N)
f(x,y) = 1/(M*N) * (like above)
F(u,v) = 1/sqrt(M*N) * (like above)
f(x,y) = 1/sqrt(M*N) * (like above)
So the first one is the asymmetric transform pair, the second one the symmetric. With the asymmetric pair, the rounding errors are more in the bright spots of the image (some pixels are rounded slightly outside the value range, e.g. to 256). With the symmetric pair, the errors are more in the constant mid-range area of the image (no exceeding of the value range!). In total, it seems that the symmetric pair produces a bit more rounding errors.
It also depends on the input: when the image is stored in [0,255], the rounding errors differ from those when it is stored in [0,1].
So my question: how should an optimal, most accurate algorithm be implemented (theoretically, no code)? Asymmetric or symmetric pair? Input value range [0,255] or [0,1]? And how should the result be linearly upscaled before its logarithm is saved to file?
Edit:
My algorithm simply computes the separated asymmetric or symmetric DFT formula. Factors are decomposed into real and imaginary parts using Euler's identity, then expanded and summed up separately as real and imaginary parts:
sum_re += f_re * cos(-mode*pi*((2.0*v*y)/N)) - // mode = 1 for forward, -1
f_im * sin(-mode*pi*((2.0*v*y)/N)); // for inverse transform
// sum_im permutated in the known way and + instead of -
Grouping the values this way inside cos and sin should, in my view, give the lowest rounding error (compared to e.g. cos(-mode*2*pi*v*y/N)), because the inexactly rounded transcendental pi is multiplied/divided only once rather than several times. Is that right?
The scale factor 1/(M*N) or 1/sqrt(M*N) is applied separately after each separation, outside of the innermost sum. Would it be better inside? Or combined completely at the end of both separations?
For some deeper analysis, I have abandoned the input->transform->save-to-file->read-from-file->transform^-1->output workflow and chosen to compare directly in double precision: input->transform->transform^-1->output.
Here are the results for a real-life 704x528 8-bit image (delta = max absolute difference between real part of input and output):
with input inside [0,1] and asymmetric formula: delta = 2.6609e-13 (corresponds to 6.785295e-11 for [0,255] range).
with input inside [0,1] and symmetric formula: delta = 2.65232e-13 (corresponds to 6.763416e-11 for [0,255] range).
with input inside [0,255] and asymmetric formula: delta = 6.74731e-11.
with input inside [0,255] and symmetric formula: delta = 6.7871e-11.
These are no real significant differences, however, the full ranged input with the asymmetric transform performs best. I think the values may get worse with 16-bit input.
But in general I see, that my experienced issues are more because of scaling-before-saving-to-file (or inverse) rounding errors, than real transformation rounding errors.
However, I am curious: what is the most used implementation of the Fourier transform, the symmetric or the asymmetric? Which value range is generally used for the input: [0,1] or [0,255]? And for spectra usually shown in log scale: is e.g. [0,M*N] after the asymmetric transform of [0,1] input log-scaled directly to [0,255], or first linearly scaled to [0,255*M*N]?
The errors you report are tiny, normal, and generally can be ignored. Simply scale your results and clamp any results outside the target interval to the endpoints.
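A minimal C++ sketch of the clamping step, assuming an 8-bit target interval [0, 255] as in the question (the function name `toByte` is an assumption):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Clamp a reconstructed pixel value to [0, 255] and round to the nearest integer.
std::uint8_t toByte(double value) {
    assert(!std::isnan(value));
    const double clamped = std::min(255.0, std::max(0.0, value));
    return static_cast<std::uint8_t>(std::lround(clamped));
}
```

A value such as 255.0000000000678 or 256.0 then maps to 255 instead of overflowing the 8-bit range.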
In library implementations of FFTs (that is, FFT routines written to be used generally by diverse applications, not custom designed for a single application), little regard is given to scaling; the routine often simply returns data that has been naturally scaled by the arithmetic, with no additional multiplication operations used to adjust the scale. This is because the scale is often either irrelevant for the application (e.g., finding the frequencies with the largest energies works no matter what the scale is) or that the scale may be distributed through multiply operations and performed just once (e.g., instead of scaling in a forward transform and in an inverse transform, the application can get the same effect by explicitly scaling just once). So, since scaling is often not needed, there is no point in including it in a library routine.
The target interval that data are scaled to depends on the application.
Regarding the question on what transform to use (logarithmic or linear) for showing spectra, I cannot advise; I do not work with visualizing spectra.
Scaling causes roundoff errors. Hence, solution 1 (which scales once) is better than solution 2 (which does it twice). Similarly, scaling once after summation is better than scaling everything before summation.
Do you run y from 0 to 2*N or from -N to +N ? Mathematically it's the same, but you have an extra bit of precision in the latter case.
BTW, what's mode doing in cos(-mode * stuff) ?