Increasing accuracy of “sum()” command in Fortran - fortran

I am trying to perform summation of an array.
There are 1024 elements in that array and on applying the command "sum(a(:))", where a(:) is the array, I get the answer 1981.9072.
If I do the same summation of 1024 elements in Excel sheet the answer is 1981.93530 which is the right answer. So, a difference of 0.0281 is observed between the above two values. As I increase the number of elements in a(:) the difference in value obtained from "sum()" and Excel sheet increases.
I think the "sum()" value is different due to rounding off error. How do I get the true value (Excel value) using "sum()" without any rounding off error ?

REAL(4) is a 32-bit IEEE-754 floating-point number which is imprecise.
You absolutely need to read and understand this important QA on the matter: Why are floating point numbers inaccurate?
Here's a page on the same topic from the GCC Fortran compiler docs: https://gcc.gnu.org/onlinedocs/gcc-3.3.6/g77/Floating_002dpoint-Errors.html
You need to change the type of a to REAL(8) - but note that even if you do, the sum still won't be completely accurate (it will just be less inaccurate).
This can be reproduced consistently in any language that implements IEEE-754 32-bit floating-point numbers, like C# (I'm using C# as I don't have access to a Fortran90 compiler right now):
static void Main()
{
Single t = 0f;
for( Int32 i = 0; i < 1024; i++ )
{
t += 1.93548369f; // This is a 32-bit float literal
}
Console.WriteLine( t ); // "1981.9072"
}
If you sum the numbers using a double-precision (64-bit) floating number type you get the same result as Excel:
static void Main()
{
Double d = 0d;
for( Int32 i = 0; i < 1024; i++ )
{
d += 1.93548369d; // This is a 64-bit float literal
}
Console.WriteLine( d ); // "1981.935302734375" --> "1981.9350"
}
If you want an accurate and precise answer then you need to use a data-type that doesn't use IEEE-754 - I'm not a Fortran user so I don't know what/if Fortran's equivalent of C#'s Decimal is, but when I do the calculation with Decimal I get 1981.93529856 which is different to Excel's answer (this is because Excel uses 64-bit IEEE-754 instead of a real Decimal type):
static void Main()
{
Decimal dec = 0M;
for( Int32 i = 0; i < 1024; i++ )
{
dec += 1.93548369M; // This is a decimal literal, note the `M`.
}
Console.WriteLine( dec ); // "1981.93529856" --> "1981.9352"
}

Related

Casting float to int in C++

int uniquePaths(int m, int n) {
int num = m+n-2;
int den=1;
double ans = 1;
while(den<=m-1) {
ans = ans*(num--)/(den++);
}
cout<<ans;
return (int)ans;
}
The expected answer for m=53, n=4 as input to the above piece of code is 26235 but the code returns 26234. However, the stdout shows 26235.
Could you please help me understand this behavior?
Due to floating-point rounding, your code computes ans to be 26,234.999999999985448084771633148193359375. When it is printed with cout<<ans, the default formatting does not show the full value and rounds it to “26235”. However, when the actual value is converted to int, the result is 26,234.
After setting num to m+n-2, your code is computing num! / ((m-1)!(num-m+1)!), which of course equals num! / ((num-m+1)!(m-1)!). Thus, you can use either m-1 or num-m+1 as the limit. So you can change the while line to these two lines:
int limit = m-1 < num-m+1 ? m-1 : num-m+1;
while(den<=limit) {
and then your code will run to the lower limit, which will avoid dividing ans by factors that are not yet in it. All results will be exact integer results, with no rounding errors, unless you try to calculate a result that exceeds the range of your double format where it is able to represent all integers (up to 253 in the ubiquitous IEEE-754 binary64 format used for double).

float number to string converting implementation in STD

I faced with a curious issue. Look at this simple code:
int main(int argc, char **argv) {
char buf[1000];
snprintf_l(buf, sizeof(buf), _LIBCPP_GET_C_LOCALE, "%.17f", 0.123e30f);
std::cout << "WTF?: " << buf << std::endl;
}
The output looks quire wired:
123000004117574256822262431744.00000000000000000
My question is how it's implemented? Can someone show me the original code? I did not find it. Or maybe it's too complicated for me.
I've tried to reimplement the same transformation double to string with Java code but was failed. Even when I tried to get exponent and fraction parts separately and summarize fractions in cycle I always get zeros instead of these numbers "...822262431744". When I tried to continue summarizing fractions after the 23 bits (for float number) I faced with other issue - how many fractions I need to collect? Why the original code stops on left part and does not continue until the scale is end?
So, I really do not understand the basic logic, how it implemented. I've tried to define really big numbers (e.g. 0.123e127f). And it generates huge number in decimal format. The number has much higher precision than float can be. Looks like this is an issue, because the string representation contains something which float number cannot.
Please read documentation:
printf, fprintf, sprintf, snprintf, printf_s, fprintf_s, sprintf_s, snprintf_s - cppreference.com
The format string consists of ordinary multibyte characters (except %), which are copied unchanged into the output stream, and conversion specifications. Each conversion specification has the following format:
introductory % character
...
(optional) . followed by integer number or *, or neither that specifies precision of the conversion. In the case when * is used, the precision is specified by an additional argument of type int, which appears before the argument to be converted, but after the argument supplying minimum field width if one is supplied. If the value of this argument is negative, it is ignored. If neither a number nor * is used, the precision is taken as zero. See the table below for exact effects of precision.
....
Conversion Specifier
Explanation
Expected Argument Type
f F
converts floating-point number to the decimal notation in the style [-]ddd.ddd. Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6. In the alternative implementation decimal point character is written even if no digits follow it. For infinity and not-a-number conversion style see notes.
double
So with f you forced form ddd.ddd (no exponent) and with .17 you have forced to show 17 digits after decimal separator. With such big value printed outcome looks that odd.
Finally I've found out what the difference between Java float -> decimal -> string convertation and c++ float -> string (decimal) convertation. I did not find the original source code, but I replicated the same code in Java to make it clear. I think the code explains everything:
// the context size might be calculated properly by getting maximum
// float number (including exponent value) - its 40 + scale, 17 for me
MathContext context = new MathContext(57, RoundingMode.HALF_UP);
BigDecimal divisor = BigDecimal.valueOf(2);
int tmp = Float.floatToRawIntBits(1.23e30f)
boolean sign = tmp < 0;
tmp <<= 1;
// there might be NaN value, this code does not support it
int exponent = (tmp >>> 24) - 127;
tmp <<= 8;
int mask = 1 << 23;
int fraction = mask | (tmp >>> 9);
// at this line we have all parts of the float: sign, exponent and fractions. Let's build mantissa
BigDecimal mantissa = BigDecimal.ZERO;
for (int i = 0; i < 24; i ++) {
if ((fraction & mask) == mask) {
// i'm not sure about speed, maybe division at each iteration might be faster than pow
mantissa = mantissa.add(divisor.pow(-i, context));
}
mask >>>= 1;
}
// it was the core line where I was losing accuracy, because of context
BigDecimal decimal = mantissa.multiply(divisor.pow(exponent, context), context);
String str = decimal.setScale(17, RoundingMode.HALF_UP).toPlainString();
// add minus manually, because java lost it if after the scale value become 0, C++ version of code doesn't do it
if (sign) {
str = "-" + str;
}
return str;
Maybe topic is useless. Who really need to have the same implementation like C++ has? But at least this code keeps all precision for float number comparing to the most popular way converting float to decimal string:
return BigDecimal.valueOf(1.23e30f).setScale(17, RoundingMode.HALF_UP).toPlainString();
The C++ implementation you are using uses the IEEE-754 binary32 format for float. In this format, the closet representable value to 0.123•1030 is 123,000,004,117,574,256,822,262,431,744, which is represented in the binary32 format as +13,023,132•273. So 0.123e30f in the source code yields the number 123,000,004,117,574,256,822,262,431,744. (Because the number is represented as +13,023,132•273, we know its value is that exactly, which is 123,000,004,117,574,256,822,262,431,744, even though the digits “123000004117574256822262431744” are not stored directly.)
Then, when you format it with %.17f, your C++ implementation prints the exact value faithfully, yielding “123000004117574256822262431744.00000000000000000”. This accuracy is not required by the C++ standard, and some C++ implementations will not do the conversion exactly.
The Java specification also does not require formatting of floating-point values to be exact, at least in some formatting operations. (I am going from memory and some supposition here; I do not have a citation at hand.) It allows, perhaps even requires, that only a certain number of correct digits be produced, after which zeros are used if needed for positioning relative to the decimal point or for the requested format.
The number has much higher precision than float can be.
For any value represented in the float format, that value has infinite precision. The number +13,023,132•273 is exactly +13,023,132•273, which is exactly 123,000,004,117,574,256,822,262,431,744, to infinite precision. The precision the format has for representing numbers affects only which numbers it can represent, not how precisely it represents the numbers that it does represent.

Iterate though all possible floating-point values, starting from lowest

I am writing a unit test for a math function and I would like to be able to "walk" all possible floats/doubles.
Due to IEEE shenanigans, floating types cannot be incremented (++) at their extremities. See this question for more details. That answer states :
one can only add multiples of 2^(n-N)
But never mentions what little n is.
A solution to iterate all possible values from +0.0 to +infinity is given in this great blog post. The technique involves using a union with an int to walk the different values of a float. This works due to the following properties explained in the post, though they are only valid for positive numbers.
Adjacent floats have adjacent integer representations
Incrementing the integer representation of a float moves to the next representable float, moving away from zero
His solution for +0.0 to +infinity (0.f to std::numeric_limits<float>::max()) :
union Float_t {
int32_t RawExponent() const { return (i >> 23) & 0xFF; }
int32_t i;
float f;
};
Float_t allFloats;
allFloats.f = 0.0f;
while (allFloats.RawExponent() < 255) {
allFloats.i += 1;
}
Is there a solution for -infinity to +0.0 (std::numeric_limits<float>::lowest() to 0.f)?
I've tested std::nextafter and std::nexttoward and couldn't get them to work. Maybe this is an MSVC issue?
I would be ok with any sort of hack since this is a unit test. Thanks!
You can walk all 32-bit bit representations by using all values of a 32-bit unsigned int. Then you will walk really all representations, positive and negative, including both nulls (there are two) and also all the not a number representations (NaN). You may or may not want to filter out the NaN representations, or just filter out the signaling ones and leave the non signaling ones in. This depends on your use case.
Example:
for (uint32_t i = 0;;)
{
float f;
// Type punning: Force the bit representation of i into f.
// Type punning is hard because mostly undefined in C/C++.
// Using memcpy() usually avoids any type punning warning.
memcpy(&f, &i, sizeof(f));
// Use f here.
// Warning: Using signaling NaNs may throw exceptions or raise signals.
i++;
if (i == 0)
break;
}
Instead you can also walk a 32-bit int from -2**31 to +(2**31-1). This makes no difference.
Pascal Cuoq correctly points out std::nextafter is the right solution. I had a problem elsewhere in my code. Sorry for the unnecessary question.
#include <cassert>
#include <cmath>
#include <limits>
float i = std::numeric_limits<float>::lowest();
float hi = std::numeric_limits<float>::max();
float new_i = std::nextafterf(i, hi);
assert(i != new_i);
double d = std::numeric_limits<double>::lowest();
double hi_d = std::numeric_limits<double>::max();
double new_d = std::nextafter(d, hi_d);
assert(d != new_d);
long double ld = std::numeric_limits<long double>::lowest();
long double hi_ld = std::numeric_limits<long double>::max();
long double new_ld = std::nextafterl(ld, hi_ld);
assert(ld != new_ld);
for (float d = std::numeric_limits<float>::lowest();
d < std::numeric_limits<float>::max();
d = std::nextafterf(
d, std::numeric_limits<float>::max())) {
// Wait a lifetime?
}
Iterating through all the float values can be done with simple understanding of the floating-point representation:
The distance between consecutive subnormal values is the minimum normal times the “epsilon”. Simply iterate through all the subnormals using this distance as an increment.
The distance between the normal values at the lowest exponent is the same. Step through them with the same increment.
For each exponent, the distance increases according to the floating-point radix. Simply multiply the increment by the radix and step through all the values for the next exponent.
Repeat until infinity is reached.
Observe that the inner loop in the code below is simply:
for (; x < Limit; x += Increment)
Test(x);
This has the advantage that only normal floating-point arithmetic is used. The inner loop contains only one addition and one comparison (plus any tests you want to perform with each number). No library functions are called in the loop, no representations are dissected or copied to general registers or otherwise manipulated. There is nothing to impede performance.
This code steps through only the non-negative numbers. The negative numbers can be tested separately in the same way or can share this code by inserting a call Test(-x).
#include <limits>
static void Test(float x)
{
// Insert unit test for value x here.
}
int main(void)
{
typedef float T;
static const int Radix = std::numeric_limits<T>::radix;
static const T Infinity = std::numeric_limits<T>::infinity();
/* Increment is the current distance between floating-point numbers. We
start it at distance between subnormal numbers.
*/
T Increment =
std::numeric_limits<T>::min() * std::numeric_limits<T>::epsilon();
/* Limit is the next boundary where the distance between floating-point
numbers changes. We will increment up to that limit and then adjust
the limit and increment. We start it at the top of the first set of
normals, which allows the first loop to increment first through the
subnormals and then through the normals with the lowest exponent.
(These two sets have the same step size between adjacent values.)
*/
T Limit = std::numeric_limits<T>::min() * Radix;
/* Start with zero and continue until we reach infinity.
We execute an inner loop that iterates through all the significands of
one floating-point exponent. Each time it completes, we step up the
limit and increment.
*/
for (T x = 0; x < Infinity; Limit *= Radix, Increment *= Radix)
// Increment x through all the significands with the current exponent.
for (; x < Limit; x += Increment)
// Test with the current value of x.
Test(x);
// Also test infinity.
Test(Infinity);
}
(This code assumes the floating-point type has subnormals, and that they are not flushed to zero. The code can be readily adjusted to support these alternatives as well.)

pow() function gives an error [duplicate]

Recently i write a block of code:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = pow(sections, 5- t -1);
cout << i << endl;
}
And the result is wrong:
9999
1000
99
10
1
If i using just this code:
for(int t = 0; t < 5; t++){
cout << pow(sections,5-t-1) << endl;
}
The problem doesn't occur anymore:
10000
1000
100
10
1
Does anyone give me an explaination? thanks you very much!
Due to the representation of floating point values pow(10.0, 5) could be 9999.9999999 or something like this. When you assign that to an integer that got truncated.
EDIT: In case of cout << pow(10.0, 5); it looks like the output is rounded, but I don't have any supporting document right now confirming that.
EDIT 2: The comment made by BoBTFish and this question confirms that when pow(10.0, 5) is used directly in cout that is getting rounded.
When used with fractional exponents, pow(x,y) is commonly evaluated as exp(log(x)*y); such a formula would mathematically correct if evaluated with infinite precision, but may in practice result in rounding errors. As others have noted, a value of 9999.999999999 when cast to an integer will yield 9999. Some languages and libraries use such a formulation all the time when using an exponentiation operator with a floating-point exponent; others try to identify when the exponent is an integer and use iterated multiplication when appropriate. Looking up documentation for the pow function, it appears that it's supposed to work when x is negative and y has no fractional part (when x is negative and `y is even, the result should be pow(-x,y); when y is odd, the result should be -pow(-x,y). It would seem logical that when y has no fractional part a library which is going to go through the trouble of dealing with a negative x value should use iterated multiplication, but I don't know of any spec dictating that it must.
In any case, if you are trying to raise an integer to a power, it is almost certainly best to use integer maths for the computation or, if the integer to be raised is a constant or will always be small, simply use a lookup table (raising numbers from 0 to 15 by any power that would fit in a 64-bit integer would require only a 4,096-item table).
From Here
Looking at the pow() function: double pow (double base, double exponent); we know the parameters and return value are all double type. But the variable num, i and res are all int type in code above, when tranforming int to double or double to int, it may cause precision loss. For example (maybe not rigorous), the floating point unit (FPU) calculate pow(10, 4)=9999.99999999, then int(9999.9999999)=9999 by type transform in C++.
How to solve it?
Solution1
Change the code:
const int num = 10;
for(int i = 0; i < 5; ++i){
double res = pow(num, i);
cout << res << endl;
}
Solution2
Replace floating point unit (FPU) having higher calculation precision in double type. For example, we use SSE in Windows CPU. In Code::Block 13.12, we can do this steps to reach the goal: Setting -> Compiler setting -> GNU GCC Compile -> Other options, add
-mfpmath=sse -msse3
The picture is as follows:
(source: qiniudn.com)
Whats happens is the pow function returns a double so
when you do this
int i = pow(sections, 5- t -1);
the decimal .99999 cuts of and you get 9999.
while printing directly or comparing it with 10000 is not a problem because it is runded of in a sense.
If the code in your first example is the exact code you're running, then you have a buggy library. Regardless of whether you're picking up std::pow or C's pow which takes doubles, even if the double version is chosen, 10 is exactly representable as a double. As such the exponentiation is exactly representable as a double. No rounding or truncation or anything like that should occur.
With g++ 4.5 I couldn't reproduce your (strange) behavior even using -ffast-math and -O3.
Now what I suspect is happening is that sections is not being assigned the literal 10 directly but instead is being read or computed internally such that its value is something like 9.9999999999999, which when raised to the fourth power generates a number like 9999.9999999. This is then truncated to the integer 9999 which is displayed.
Depending on your needs you may want to round either the source number or the final number prior to assignment into an int. For example: int i = pow(sections, 5- t -1) + 0.5; // Add 0.5 and truncate to round to nearest.
There must be some broken pow function in the global namespace. Then std::pow is "automatically" used instead in your second example because of ADL.
Either that or t is actually a floating-point quantity in your first example, and you're running into rounding errors.
You're assigning the result to an int. That coerces it, truncating the number.
This should work fine:
for(int t= 0; t < 5; t++){
double i = pow(sections, 5- t -1);
cout << i << endl;
}
What happens is that your answer is actually 99.9999 and not exactly 100. This is because pow is double. So, you can fix this by using i = ceil(pow()).
Your code should be:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = ceil(pow(sections, 5- t -1));
cout << i << endl;
}

Why pow(10,5) = 9,999 in C++

Recently i write a block of code:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = pow(sections, 5- t -1);
cout << i << endl;
}
And the result is wrong:
9999
1000
99
10
1
If i using just this code:
for(int t = 0; t < 5; t++){
cout << pow(sections,5-t-1) << endl;
}
The problem doesn't occur anymore:
10000
1000
100
10
1
Does anyone give me an explaination? thanks you very much!
Due to the representation of floating point values pow(10.0, 5) could be 9999.9999999 or something like this. When you assign that to an integer that got truncated.
EDIT: In case of cout << pow(10.0, 5); it looks like the output is rounded, but I don't have any supporting document right now confirming that.
EDIT 2: The comment made by BoBTFish and this question confirms that when pow(10.0, 5) is used directly in cout that is getting rounded.
When used with fractional exponents, pow(x,y) is commonly evaluated as exp(log(x)*y); such a formula would mathematically correct if evaluated with infinite precision, but may in practice result in rounding errors. As others have noted, a value of 9999.999999999 when cast to an integer will yield 9999. Some languages and libraries use such a formulation all the time when using an exponentiation operator with a floating-point exponent; others try to identify when the exponent is an integer and use iterated multiplication when appropriate. Looking up documentation for the pow function, it appears that it's supposed to work when x is negative and y has no fractional part (when x is negative and `y is even, the result should be pow(-x,y); when y is odd, the result should be -pow(-x,y). It would seem logical that when y has no fractional part a library which is going to go through the trouble of dealing with a negative x value should use iterated multiplication, but I don't know of any spec dictating that it must.
In any case, if you are trying to raise an integer to a power, it is almost certainly best to use integer maths for the computation or, if the integer to be raised is a constant or will always be small, simply use a lookup table (raising numbers from 0 to 15 by any power that would fit in a 64-bit integer would require only a 4,096-item table).
From Here
Looking at the pow() function: double pow (double base, double exponent); we know the parameters and return value are all double type. But the variable num, i and res are all int type in code above, when tranforming int to double or double to int, it may cause precision loss. For example (maybe not rigorous), the floating point unit (FPU) calculate pow(10, 4)=9999.99999999, then int(9999.9999999)=9999 by type transform in C++.
How to solve it?
Solution1
Change the code:
const int num = 10;
for(int i = 0; i < 5; ++i){
double res = pow(num, i);
cout << res << endl;
}
Solution2
Replace floating point unit (FPU) having higher calculation precision in double type. For example, we use SSE in Windows CPU. In Code::Block 13.12, we can do this steps to reach the goal: Setting -> Compiler setting -> GNU GCC Compile -> Other options, add
-mfpmath=sse -msse3
The picture is as follows:
(source: qiniudn.com)
Whats happens is the pow function returns a double so
when you do this
int i = pow(sections, 5- t -1);
the decimal .99999 cuts of and you get 9999.
while printing directly or comparing it with 10000 is not a problem because it is runded of in a sense.
If the code in your first example is the exact code you're running, then you have a buggy library. Regardless of whether you're picking up std::pow or C's pow which takes doubles, even if the double version is chosen, 10 is exactly representable as a double. As such the exponentiation is exactly representable as a double. No rounding or truncation or anything like that should occur.
With g++ 4.5 I couldn't reproduce your (strange) behavior even using -ffast-math and -O3.
Now what I suspect is happening is that sections is not being assigned the literal 10 directly but instead is being read or computed internally such that its value is something like 9.9999999999999, which when raised to the fourth power generates a number like 9999.9999999. This is then truncated to the integer 9999 which is displayed.
Depending on your needs you may want to round either the source number or the final number prior to assignment into an int. For example: int i = pow(sections, 5- t -1) + 0.5; // Add 0.5 and truncate to round to nearest.
There must be some broken pow function in the global namespace. Then std::pow is "automatically" used instead in your second example because of ADL.
Either that or t is actually a floating-point quantity in your first example, and you're running into rounding errors.
You're assigning the result to an int. That coerces it, truncating the number.
This should work fine:
for(int t= 0; t < 5; t++){
double i = pow(sections, 5- t -1);
cout << i << endl;
}
What happens is that your answer is actually 99.9999 and not exactly 100. This is because pow is double. So, you can fix this by using i = ceil(pow()).
Your code should be:
const int sections = 10;
for(int t= 0; t < 5; t++){
int i = ceil(pow(sections, 5- t -1));
cout << i << endl;
}