How to efficiently calculate double to two decimal precision in C/C++?

Given two doubles, I need to calculate the percentage and express it to up to 2 decimal places. What is the most efficient way to do so?
For example
x = 10.2476
y = 100
I would return 10.25.
Efficient as in runtime speed. It needs to express x/y*100 in 2 decimal places.

If you really need a fixed-point representation, use an integer and scale your numbers by 100. So x = 10.2476 becomes xi = 1025:
double x = 10.2476;
int xi = ( x + 0.005 ) * 100; // round half up to hundredths; valid for non-negative x
In many cases, floating-point representations are not needed, even when numbers smaller than 1 are used.
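A minimal sketch of that fixed-point idea applied to the percentage from the question. It is written in Python, whose float is the same binary64 type as a C double on common platforms. The function names are hypothetical, and the rounding assumes non-negative inputs:

```python
def percent_hundredths(x: float, y: float) -> int:
    """Return x / y * 100 as an integer count of hundredths of a percent.

    Assumes x >= 0 and y > 0; adding 0.5 before truncating rounds half up.
    """
    return int(x / y * 100 * 100 + 0.5)


def format_hundredths(h: int) -> str:
    """Render an integer number of hundredths with exactly two decimals."""
    return f"{h // 100}.{h % 100:02d}"


h = percent_hundredths(10.2476, 100)
print(h, format_hundredths(h))  # 1025 10.25
```

Keeping the value as an integer until the moment of output means later comparisons and sums are exact.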

"express them to upto 2 decimal places" means you have only 2 mantissa digits in the output.
10.2476 -> 10
102.476 -> 1.0E+2, NOT 100!
0.00102476 -> 1.0E-3
So, when working with significant (mantissa) digits, it is impossible to show the result without using floats or doubles.
In printf, the precision given to the g and G conversions sets the number of significant digits: %.2g prints at most 2 significant digits, whereas %.2f prints exactly 2 digits after the decimal point.
As for the calculation, division is not a problematic operation (unlike + or -). So simply divide the numbers and multiply by 100 to get the percentage; no problems.
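To see the difference between significant digits and decimal places, here is a small sketch using Python's % operator, which follows the same rules as C printf for the g and f conversions. The sample values are the ones used in this answer:

```python
values = [10.2476, 102.476, 0.00102476]

for v in values:
    two_significant = "%.2g" % v   # precision counts significant digits
    two_decimals = "%.2f" % v      # precision counts digits after the point
    print(f"{v!r:>12}  %.2g -> {two_significant:<8} %.2f -> {two_decimals}")
```

For the question as asked (10.2476 should become 10.25), the f conversion is the one that matches.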

Related

In C/C++ how many decimal places is the product / quotient of 2 doubles accurate to

If I have 2 doubles x and y and do z = x / y or z = x * y, is the result accurate to 15 decimal places?
Edit: sorry, x and y are between 0 and 1.
Assuming the 2 input numbers are considered to be without error ....
If double is binary64 and division/multiplication are competently implemented, the result should be within 1/2 unit in the last place (ULP) of the exact mathematical result, given the typical default round-to-nearest mode.
The relative error is therefore at most 1/2 ULP: between 1 part in 2^54 (5.55e-17) and 1 part in 2^53 (1.11e-16), depending on the significand. But printing a double in decimal has its issues.
scanf() and printf() can correctly convert every decimal string of N significant digits into a double and back into the same string as long as N is at most DBL_DIG. This is 15 for binary64 double.
The relevant equation is in C11 §5.2.4.2.2 12. "floor((p-1) * log10 b)" --> floor((53-1)*log10(2)) --> floor(15.65) --> 15.
Since our product/quotient is only accurate to 52.5 binary digits: floor((52.5-1)*log10(2)) --> floor(15.50) --> 15.
IMO: The answer is correct to 15 significant decimal places.
Note: That is often written as a leading digit, a decimal point and 14 more digits, times some power of 10.
[Edit]
Should be "only accurate to 53.5 binary digits": floor((53.5-1)*log10(2)) --> floor(15.80) --> 15. Same end result.
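The DBL_DIG arithmetic can be checked with a short sketch. It assumes a platform where the float type is binary64 (true for CPython on common hardware); round_trips is a hypothetical helper, not a library function:

```python
import math
import sys

p = sys.float_info.mant_dig            # significand bits, 53 for binary64
dbl_dig = math.floor((p - 1) * math.log10(2))
print(p, dbl_dig, sys.float_info.dig)  # 53 15 15


def round_trips(text: str, digits: int = 15) -> bool:
    """True if text -> double -> text reproduces the same significant digits."""
    return ("%.*g" % (digits, float(text))) == text


print(round_trips("0.123456789012345"))  # True
```

The helper only makes sense for strings written the way %g prints them (no trailing zeros, no exponent for moderate magnitudes).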

C++ floating-point console output issue

float x = 384.951257;
std::cout << std::fixed << std::setprecision(6) << x << std::endl;
The output is 384.951263. Why? I'm using gcc.
float is usually only 32-bit. With about 3 bits per decimal digit (2^10 roughly equals 10^3) that means it can't possibly represent more than about 11 decimal digits, and accounting for other information it also needs to represent, such as magnitude, let's say 6-7 decimal digits. Hey, that's what you got!
Check e.g. Wikipedia for details.
Use double or long double for better precision. double is the default in C++. E.g., the literal 3.14 is of type double.
Floats have a limited resolution. So the value gets rounded when you assign it to x.
All answers here talk as though the issue is due to floating-point numbers and their capacity, but those are just implementation details; the issue is deeper than that. It arises whenever decimal numbers are represented in the binary number system. Even something as simple as 0.1₁₀ is not precisely representable in binary, since binary can only represent numbers that are a finite fraction whose denominator is a power of 2. Unfortunately, this excludes most of the numbers that can be represented as a finite fraction in base 10, like 0.1.
The single-precision float datatype is usually mapped to binary32, as the IEEE 754 standard calls it, which has 32 bits partitioned into 1 sign bit, 8 exponent bits and 23 significand bits (excluding the hidden/implicit bit). Thus we have to calculate up to 24 bits when converting to binary32.
Other answers here evade the actual calculations involved; I'll try to do them. This method is explained in greater detail here. So let's convert the real number into a binary number:
Integer part: 384₁₀ = 110000000₂ (using the usual method of successive division by 2)
Fractional part: 0.951257₁₀ can be converted by successive multiplication by 2, taking the integer part at each step:
0.951257 * 2 = 1.902514
0.902514 * 2 = 1.805028
0.805028 * 2 = 1.610056
0.610056 * 2 = 1.220112
0.220112 * 2 = 0.440224
0.440224 * 2 = 0.880448
0.880448 * 2 = 1.760896
0.760896 * 2 = 1.521792
0.521792 * 2 = 1.043584
0.043584 * 2 = 0.087168
0.087168 * 2 = 0.174336
0.174336 * 2 = 0.348672
0.348672 * 2 = 0.697344
0.697344 * 2 = 1.394688
0.394688 * 2 = 0.789376
Gathering the obtained integer parts gives the fractional part in binary: 0.111100111000010₂. The overall number in binary is 110000000.111100111000010₂; this has 24 bits as required.
Converting this back to decimal gives 384 + (15585 / 16384) = 384.951232₁₀. That is the truncated value. The next step, 0.789376 * 2 = 1.578752, yields a 1 bit, so round-to-nearest rounds the last bit up, giving 384 + (31171 / 32768), which is what you see: 384.951263₁₀.
This can be verified here.
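The hand calculation above can be reproduced with a short sketch. It uses exact rationals for the repeated doubling and the struct module to round a binary64 value to binary32; both helper names are hypothetical:

```python
import struct
from fractions import Fraction


def fraction_bits(text: str, count: int) -> str:
    """First `count` binary digits of a decimal fraction, by repeated doubling."""
    frac = Fraction(text)          # exact decimal value, no binary rounding yet
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)            # the integer part is the next binary digit
        bits.append(str(bit))
        frac -= bit
    return "".join(bits)


def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(fraction_bits("0.951257", 16))      # 15 kept bits plus the rounding bit
print(f"{to_float32(384.951257):.6f}")    # 384.951263
```

The 16th bit is the one that decides the rounding direction of the 24-bit significand.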

Fortran format statement with highest precision in the system

Someone wanting less precision would write
999 format ('The answer is x = ', F8.3)
Others wanting higher output precision may write
999 format ('The answer is x = ', F18.12)
Thus it totally depends on what the user desires. What is the format
statement that exactly matches the precision used in the calculation?
(Note, this may vary from system to system)
It is a difficult question because you request "the precision of the calculation", which depends on so many factors. For example: if I solve f(x)=0 via Newton's method to a tolerance of 1E-6, would you want a format with seven digits?
On the other hand, if you mean the "highest precision attainable by the type" (e.g., double or single precision) then you can simply find the corresponding epsilon (machine eps, or precision) and use that as the format flag. If epsilon is 1E-15, then you can use a format flag that does not have more than 16 digits.
In Fortran you can use the EPSILON(X) function to get this number (the answer will depend on the type of X); then you can take the floor of the absolute value of the logarithm (base 10) of epsilon, and make that the number of decimals in your float representation.
For example, if epsilon is 1E-12, the log is -12, the abs is 12, and the floor is 12, so you want a format like F15.12 (12 decimals + 1 point + the zero + the sign = 15 places).
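The same recipe, sketched in Python rather than Fortran, with sys.float_info.epsilon standing in for EPSILON(X) of a double and 2**-23 for a single. The function names are hypothetical:

```python
import math
import sys


def decimals_from_epsilon(eps: float) -> int:
    """Number of decimals suggested by floor(|log10(eps)|)."""
    return math.floor(abs(math.log10(eps)))


def f_descriptor(eps: float) -> str:
    """Build an Fw.d descriptor: d decimals + point + leading zero + sign."""
    d = decimals_from_epsilon(eps)
    return f"F{d + 3}.{d}"


print(f_descriptor(2.0 ** -23))              # single precision: F9.6
print(f_descriptor(sys.float_info.epsilon))  # double precision: F18.15
```

Note that this width only leaves room for values below 10 in magnitude; larger values need a wider field.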
The problem with floating point numbers is that there is no precision as such: only significant digits.
For instance, if you are calculating longitudes in real*4, near the UK you'd be accurate to 6 decimal places, but if you were in Colorado Springs, it would only be accurate to 4 decimal places. It would not make any sense to print the number in F format: it is just rubbish after the 4th decimal place.
If you wish to print to maximum precision, print in E format. Since it is always n.nn..nEnn, you get all the significant digits.
Edit - user4050's query
Try the following example
program main
  implicit none
  real :: intpart, multiplier
  integer :: ii
  multiplier = 1
  do ii = 1, 6
    intpart = 9.87654321
    intpart = intpart * multiplier
    ! print the same value with F, E and G edit descriptors
    print '(F15.7, E15.7, G15.8)', intpart, intpart, intpart
    multiplier = multiplier * 10
  end do
  stop
end program
What you will get is something like
9.8765430 0.9876543E+01 9.8765430
98.7654266 0.9876543E+02 98.765427
987.6542969 0.9876543E+03 987.65430
9876.5429688 0.9876543E+04 9876.5430
98765.4296875 0.9876543E+05 98765.430
987654.3125000 0.9876543E+06 987654.31
Notice that the precision changes as the number gets bigger because a float only has 7 significant figures.
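For readers without a Fortran compiler, the experiment can be approximated in Python. This is a sketch under the assumption that rounding through struct's "f" format reproduces single-precision arithmetic for these products; to_float32 is a hypothetical helper:

```python
import struct


def to_float32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 (single-precision) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


base = to_float32(9.87654321)
rows = []
multiplier = 1.0
for _ in range(6):
    value = to_float32(base * multiplier)   # single-precision product
    rows.append((f"{value:.7f}", f"{value:.6e}"))
    multiplier *= 10.0

for fixed, scientific in rows:
    print(f"{fixed:>15} {scientific:>15}")
```

The scientific column keeps the same seven significant digits on every row, while the fixed column shows more and more meaningless trailing digits as the value grows.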

Is floating-point addition and multiplication associative?

I had a problem when I was adding three floating point values and comparing them to 1.
cout << ((0.7 + 0.2 + 0.1)==1)<<endl; //output is 0
cout << ((0.7 + 0.1 + 0.2)==1)<<endl; //output is 1
Why would these values come out different?
Floating point addition is not necessarily associative. If you change the order in which you add things up, this can change the result.
The standard paper on the subject is What Every Computer Scientist Should Know about Floating Point Arithmetic. It gives the following example:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
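The quoted example is easy to reproduce. A minimal sketch in Python, whose float is a binary64 double:

```python
x, y, z = 1e30, -1e30, 1.0

left = (x + y) + z    # x and y cancel exactly first, then z is added
right = x + (y + z)   # z is absorbed when added to -1e30, then the sum cancels

print(left, right)    # 1.0 0.0
```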
What is likely, with currently popular machines and software, is:
The compiler encoded .7 as 0x1.6666666666666p-1 (this is the hexadecimal numeral 1.6666666666666 multiplied by 2 to the power of -1), .2 as 0x1.999999999999ap-3, and .1 as 0x1.999999999999ap-4. Each of these is the number representable in floating-point that is closest to the decimal numeral you wrote.
Observe that each of these hexadecimal floating-point constants has exactly 53 bits in its significand (the "fraction" part, often inaccurately called the mantissa). The hexadecimal numeral for the significand has a "1" and thirteen more hexadecimal digits (four bits each, 52 total, 53 including the "1"), which is what the IEEE-754 standard provides for, for 64-bit binary floating-point numbers.
Let's add the numbers for .7 and .2: 0x1.6666666666666p-1 and 0x1.999999999999ap-3. First, scale the exponent of the second number to match the first. To do this, we add 2 to the exponent (changing "p-3" to "p-1"), which multiplies the scale by 4, and multiply the significand by 1/4 to compensate, giving 0x0.66666666666668p-1. Then add 0x1.6666666666666p-1 and 0x0.66666666666668p-1, giving 0x1.ccccccccccccc8p-1. Note that this number has more than 53 bits in the significand: the "8" is the 14th digit after the period. Floating-point cannot return a result with this many bits, so it has to be rounded to the nearest representable number. In this case, there are two numbers that are equally near, 0x1.cccccccccccccp-1 and 0x1.ccccccccccccdp-1. When there is a tie, the number with a zero in the lowest bit of the significand is used. "c" is even and "d" is odd, so "c" is used. The final result of the addition is 0x1.cccccccccccccp-1.
Next, add the number for .1 (0x1.999999999999ap-4) to that. Again, we scale to make the exponents match, so 0x1.999999999999ap-4 becomes 0x.33333333333334p-1. Then add that to 0x1.cccccccccccccp-1, giving 0x1.fffffffffffff4p-1. Rounding that to 53 bits gives 0x1.fffffffffffffp-1, and that is the final result of .7+.2+.1.
Now consider .7+.1+.2. For .7+.1, add 0x1.6666666666666p-1 and 0x1.999999999999ap-4. Recall the latter is scaled to 0x.33333333333334p-1. Then the exact sum is 0x1.99999999999994p-1. Rounding that to 53 bits gives 0x1.9999999999999p-1.
Then add the number for .2 (0x1.999999999999ap-3), which is scaled to 0x0.66666666666668p-1. The exact sum is 0x1.fffffffffffff8p-1. This is again a tie, between 0x1.fffffffffffffp-1 and 0x2.0000000000000p-1. "f" is odd, so the tie goes to 0x2.0000000000000p-1. Floating-point significands are always scaled to start with 1 (except for special cases: zero, infinity, and very small numbers at the bottom of the representable range), so we adjust this to 0x1.0000000000000p0.
Thus, because of errors that occur when rounding, .7+.2+.1 returns 0x1.fffffffffffffp-1 (very slightly less than 1), and .7+.1+.2 returns 0x1.0000000000000p0 (exactly 1).
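The hexadecimal values in this walkthrough can be checked mechanically. A sketch in Python, assuming binary64 floats; note that float.hex() writes the exponent with an explicit sign ("p+0" rather than "p0"):

```python
from fractions import Fraction

a, b, c = 0.7, 0.2, 0.1
print(a.hex(), b.hex(), c.hex())

order1 = (a + b) + c
order2 = (a + c) + b
print(order1.hex())   # 0x1.fffffffffffffp-1
print(order2.hex())   # 0x1.0000000000000p+0

# Fraction(float) is exact, so this is the unrounded sum of the stored values.
exact_first = Fraction(a) + Fraction(b)

# The first partial sum is an exact tie between two neighbouring doubles.
lower = float.fromhex("0x1.cccccccccccccp-1")
upper = float.fromhex("0x1.ccccccccccccdp-1")
print(exact_first - Fraction(lower) == Fraction(upper) - exact_first)  # True
```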
Floating point multiplication is not associative in C or C++.
Proof:
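#include <cstdio>
#include <cstdlib>
#include <ctime>
using namespace std;
int main() {
    int counter = 0;
    srand(time(NULL));
    while (counter++ < 10) {
        // rand() / 100000 is an integer division; the quotient is then converted to float
        float a = rand() / 100000;
        float b = rand() / 100000;
        float c = rand() / 100000;
        // compare the two groupings of the same three-factor product
        if (a * (b * c) != (a * b) * c) {
            printf("Not equal\n");
        }
    }
    printf("DONE");
    return 0;
}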
In this program, about 30% of the time, (a*b)*c is not equal to a*(b*c).
Neither addition nor multiplication is associative with IEEE 754 double precision (64-bit) numbers. Here are examples for each (evaluated with Python 3.9.7):
>>> (.1 + .2) + .3
0.6000000000000001
>>> .1 + (.2 + .3)
0.6
>>> (.1 * .2) * .3
0.006000000000000001
>>> .1 * (.2 * .3)
0.006
Similar answer to Eric's, but for addition, and with Python.
import random
random.seed(0)
n = 1000
a = [random.random() for i in range(n)]
b = [random.random() for i in range(n)]
c = [random.random() for i in range(n)]
sum(1 if (a[i] + b[i]) + c[i] != a[i] + (b[i] + c[i]) else 0 for i in range(n))

Converting a decimal number in scientific notation to IEEE 754

I've read a few texts and threads showing how to convert from a decimal to IEEE 754, but I am still confused as to how I can convert the number without expanding the decimal (which is represented in scientific notation).
The number I am particularly working with is 9.07 * 10^23, but any number would do; I will figure out how to do it for my particular example.
I'm assuming you want the result to be the floating-point number closest to the decimal number, and that you are using double-precision floating-point numbers.
For most numbers, there is a way to do it relatively quickly. Here's how it works in a nutshell.
You need to split the number into either a product or a fraction of numbers that have an exact representation as a floating-point number. The largest power of 10 that is exactly representable is 10^22. So, to get 9.07e+23 in floating-point form, we can write:
9.07e+23 = 907 * 10^21
According to the IEEE-754 standard, a single floating-point operation is guaranteed to be correctly rounded, so the above product, computed as a product of 2 double precision floating-point numbers, will give the correctly rounded result.
If you were to use this in a conversion function, you would probably store the powers of 10 in an array.
Note that you can't use this method for 9.07e-23. This number equals 907 / 10^25, so the denominator would be too large to be exactly representable. In this situation, and other dealings with very large or very small numbers, you have to use some form of high-precision arithmetic.
See Fast Path Decimal to Floating-Point Conversion for further details and examples.
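A minimal sketch of this fast path, assuming binary64 doubles. The table and the fast_path helper are hypothetical names, and the sketch simply refuses inputs outside the range where one multiply or divide is enough:

```python
# Powers of ten up to 10^22 are exactly representable as binary64 doubles.
EXACT_POWERS_OF_TEN = [float(10 ** k) for k in range(23)]


def fast_path(digits: int, exponent: int) -> float:
    """digits * 10**exponent via a single correctly rounded multiply or divide.

    Hypothetical helper: only valid when digits < 2**53 and |exponent| <= 22.
    """
    if digits >= 2 ** 53 or abs(exponent) > 22:
        raise ValueError("outside the fast path; needs high-precision arithmetic")
    if exponent >= 0:
        return digits * EXACT_POWERS_OF_TEN[exponent]
    return digits / EXACT_POWERS_OF_TEN[-exponent]


print(fast_path(907, 21) == float("9.07e23"))  # True
```

Calling fast_path(907, -25), the 9.07e-23 case, raises ValueError because 10^25 is not exactly representable.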
Converting a number from a decimal string to binary IEEE is fairly straightforward if you know how to do IEEE floating-point addition and multiplication (or if you're using any basic programming language like C/C++).
There are many different approaches to this, but the easiest is to evaluate 9.07 * 10^23 directly.
First, start with 9.07:
9.07 = 9 + 0 * 10^-1 + 7 * 10^-2
Now evaluate 10^23. This can be done by starting with 10 and using any powering algorithm.
Then multiply the results together.
Here's a simple implementation in C/C++:
// build 9.07 digit by digit
double mantissa = 9;
mantissa += 0 / 10.;
mantissa += 7 / 100.;
// build 10^23 by repeated multiplication
double exp = 1;
for (int i = 0; i < 23; i++){
    exp *= 10;
}
double result = mantissa * exp;
Now, going backwards (IEEE -> decimal) is a lot harder.
Again, there are many different approaches. Here's the easiest one I can think of.
I'll use 1.0011101b * 2^40 as the example. (the mantissa is in binary)
First, convert the mantissa to decimal: (this should be easy, since there's no exponent)
1.0011101b * 2^40 = 1.22656 * 2^40
Now, "scale" the number such that the binary exponent vanishes. This is done by multiplying by an appropriate power of 10 to "get rid" of the binary exponent.
1.22656 * 2^40 = 1.22656 * (2^40 * 10^-12) * 10^12
= 1.22656 * (1.09951) * 10^12
= 1.34861 * 10^12
So the answer is:
1.0011101b * 2^40 = 1.34861 * 10^12
In this example, 10^12 was needed to "scale away" the 2^40. The power of 10 that is needed is simply:
power of 10 = (power of 2) * log(2)/log(10), rounded down to an integer
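The scaling steps can be sketched with exact rational arithmetic, which avoids the truncation of intermediate values in the hand calculation. decimal_scientific is a hypothetical helper, and the significand is passed as a binary string purely for illustration:

```python
import math
from fractions import Fraction


def decimal_scientific(significand_bits: str, power_of_two: int):
    """Convert a binary significand such as '1.0011101' times 2**power_of_two
    into (decimal significand, power of ten), using exact rationals."""
    whole, _, frac = significand_bits.partition(".")
    significand = Fraction(int(whole + frac, 2), 2 ** len(frac))
    power_of_ten = math.floor(power_of_two * math.log10(2))
    scaled = significand * Fraction(2) ** power_of_two / Fraction(10) ** power_of_ten
    return scaled, power_of_ten


scaled, p10 = decimal_scientific("1.0011101", 40)
print(f"{float(scaled):.4f} * 10^{p10}")  # 1.3486 * 10^12
```

For other inputs the scaled significand can come out as 10 or more, in which case one further shift by a power of ten is needed; the sketch leaves that step out.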