Why is the IEEE 754 formula for determining the precision in bits of an extended binary interchange format so complex? - ieee-754

For determining the precision in bits, p, for an extended binary interchange format with width k, the formula k – round(4×log2(k)) + 13 is used.
What is important about the logarithm and the round() function that they are used in this formula? Can't it simply be defined as k - 20 for example?

The original standard had only 32- and 64-bit values. But the committee realized that someday 128 might be implemented. So there should be a "formula" for that. At that point, why stop with nice, neat, powers of two? What about a 24-bit float? Or 96? Or 55?
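To see why a fixed offset such as k - 20 would not work, it helps to evaluate the formula for a few widths. The sketch below is only an illustration; the function names precision_bits and exponent_bits are made up for this example, and the formula is only meant for widths of 128 bits and up (the standard lists the smaller formats in a table).

```cpp
#include <cmath>

// Precision p (in bits) of a binary interchange format of width k bits,
// using the formula quoted in the question: k - round(4*log2(k)) + 13.
// Only meaningful for k >= 128; smaller formats are defined by table.
int precision_bits(int k) {
    return k - static_cast<int>(std::lround(4.0 * std::log2(k))) + 13;
}

// Exponent field width: what is left after the sign bit and the
// p - 1 explicitly stored significand bits, i.e. k - p.
int exponent_bits(int k) {
    return k - precision_bits(k);
}
```

For k = 128 this gives p = 113 with a 15-bit exponent, and for k = 256 it gives p = 237 with a 19-bit exponent. The exponent field grows roughly with the logarithm of the width, which a constant offset cannot express.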

Related

How can I test for how many significant figures a float has in C++?

How can I test how many significant figures a given float has in C++? Say I write:
sigfigs(x);
with x being the value of the float.
It would set an integer value y to the number of sigfigs.
How can I write a function that works this way?
This has been bugging me for some time; any answers appreciated.
btw Mysticial, this is asking for code to find the number of sig figs in a float, not how many there are, like the one you linked to as a duplicate -.-
This is a bit tricky, because as you should already know, floating-point numbers are often not exact, but rather some approximation of a number. For example, 10.1 ends up as 10.09999.... A double has about 15 digits of precision, so 15 is the largest value your sigfigs() function could reasonably return. And it will need to be overloaded for double and float, because of course float has only half as many digits of precision:
int sigfigs(double x); // returns 1 to 15
int sigfigs(float x); // returns 1 to 7
Now there may be more clever mathematical ways to do this, but one idea is:
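#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstdio>

int sigfigs(double x) {
    // Order of magnitude of x; the conversion to int truncates toward zero.
    int mag = static_cast<int>(std::log10(std::fabs(x)));
    double div = std::pow(10.0, mag);  // scale factor that brings x close to 1
    char str[20];
    int wrote = std::snprintf(str, sizeof(str), "%.15g", x / div);
    // Count only the digits, skipping the sign and the decimal point.
    return std::count_if(str, str + wrote, [](unsigned char c) { return std::isdigit(c) != 0; });
}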
This is definitely missing some cases, but I think captures the idea: we first normalize large/small numbers so that we end up with something close to 1, then we print it with a suitable format string to allow the maximum usable precision to be displayed, then we count how many digits there are.
There are notable boundary-condition bugs at 0, 10, etc., which are left as an exercise to correct. Serendipitously, NAN produces 0 which is good; +/- infinity also produce 0.
One final note; this does not strictly conform to the usual definition of significant figures. In particular, trailing zeros after a decimal place are not accounted for, because there is no way to do so given only a double or float. For example, textual inputs of 10 and 10.00000 produce bitwise-identical results from atof(). Therefore, if you need a complete solution conforming to the academic definition of sigfigs, you will need to implement a user-defined type containing e.g. a double and an int for the sigfigs. You can then implement all the arithmetic operators to carry the sigfigs throughout your calculations, following the usual rules.
Are you trying to determine the number of bits of precision in a floating point number or the number of significant figures in a variable? C and C++ do not generally specify the format to be used for float and double, but if you know the floating point format in which the number is stored and processed, you can determine the number of bits of precision. Most hardware these days uses IEEE 754 format. Looking through the definition would be a good place to start.
Number of significant figures is an entirely different question. Definition of significant figures includes a notion of how many figures are actually meaningful, as opposed to the number of figures available due to the floating point representation. For example, if you sample a voltage with a 12-bit A/D converter (and good enough analog design that all the bits are significant) then the data that you read will have 12 significant bits, and storing it in a format with higher precision does not increase the number of significant figures. For example, you store it in a 16-bit integer or a 32-bit IEEE 754 floating point number, depending on what you plan to do with the data. In either case you still only have 12 significant bits, even though a 32-bit float has a 24-bit mantissa.
Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic pretty thoroughly covers the issue of significant figures and floating-point arithmetic.

Should I use bit manipulation on floating-point numbers?

I'm writing an algorithm to round a floating-point number. The input will be a 64-bit IEEE 754 double, very close to X.5, where X is an integer less than 32. The first solution that came to mind is to use a bit mask to mask off the least significant bits, as they represent very small fractions of 2^-n (given that the exponent is not large).
But the problem is: should I do that? Are there other ways to accomplish the same thing? I feel that using bit operations on floating-point numbers is very controversial. Thanks!
The language I'm using is C++, by the way.
Edit:
Thanks, guys, for your comments. I appreciate it! Let's say I have a floating-point number that can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to its nearest X.5 form, since that form can be stored in an IEEE 754 floating-point number.
If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract, a large constant so as to send the value into the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest half-unit for IEEE 754 double-precision is 2251799813685248.0 (that is, 2^51).
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
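As a minimal sketch of that one-liner wrapped in a function (the name round_to_half is made up here): it assumes IEEE 754 binary64 doubles, the default round-to-nearest mode, and a compiler that is not allowed to simplify the expression away (for example, no -ffast-math). The volatile temporary is only a precaution to force the intermediate sum to be stored as a real double.

```cpp
// Rounds x (assumed 0 <= x < 32) to the nearest multiple of 0.5.
// Assumes IEEE 754 binary64 doubles in round-to-nearest mode.
double round_to_half(double x) {
    const double magic = 2251799813685248.0;  // 2^51: one ULP here is 0.5
    volatile double t = x + magic;             // the sum is rounded to a multiple of 0.5
    return t - magic;                          // this subtraction is exact
}
```

Note that this rounds to the nearest multiple of 0.5, so 3.2 becomes 3.0 while 1.4999999 becomes 1.5.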
You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
As to trying to achieve rounding by bit manipulations, I would stay away from that: a) it will be much slower than using the functions above (they generally use hardware facilities where possible), b) it is reinventing the wheel with a lot of potential for bugs, and c) the newer C standards don't like you doing bit manipulations on floating point types: they use the so-called strict aliasing rules, which forbid simply casting a double* to a uint64_t*. You would either need to do your bit manipulation by casting to an unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into a uint64_t and back again. A lot of hassle for something already available in the form of standardized functions and hardware support.
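For completeness, the memcpy() route mentioned above can be sketched as follows. The helper names double_to_bits and bits_to_double are made up for this example, and the code assumes double is a 64-bit type.

```cpp
#include <cstdint>
#include <cstring>

// Copies the object representation of a double into a 64-bit integer.
// memcpy is well-defined here, unlike dereferencing a cast pointer.
std::uint64_t double_to_bits(double d) {
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// The inverse: reinterprets 64 bits as a double.
double bits_to_double(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

Compilers typically turn these copies into a single register move, so the indirection usually costs nothing at run time.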
You want to round x to the nearest value of the form d.5. For a general number you write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
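The difference between the two expressions shows up for inputs that are not already close to d.5. A small sketch, with function names invented for this example:

```cpp
#include <cmath>

// Nearest value of the form d.5 (..., 0.5, 1.5, 2.5, ...).
double nearest_d_point_5(double x) {
    return std::round(x + 0.5) - 0.5;
}

// Nearest multiple of 0.5 (..., 1.0, 1.5, 2.0, ...). This equals the
// nearest d.5 only when x is less than 0.25 away from such a value.
double nearest_half_multiple(double x) {
    return std::round(2.0 * x) * 0.5;
}
```

For x = 2.1 the first function returns 2.5 and the second returns 2.0, while for 21.50000012 both return 21.5.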
If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):
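#include <cstdint>
#include <cstring>

void RoundNearestHalf(double &d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // well-defined, unlike casting the pointer
    // Unbiased exponent; for positive inputs in [1, 32) this is 0..4.
    unsigned const maskshift = static_cast<unsigned>((bits >> 52) & 0x7FF) - 1023;
    std::uint64_t const setmask = UINT64_C(0x0008000000000000) >> maskshift;
    // Keep the sign, the exponent, and the mantissa bits down to the 2^-1 place.
    std::uint64_t const clearmask = ~(UINT64_C(0x0007FFFFFFFFFFFF) >> maskshift);
    bits = (bits | setmask) & clearmask;
    std::memcpy(&d, &bits, sizeof d);
}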
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but no more than 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.
Note that it would be worth profiling this against other implementations, perhaps ones using the standard library, to determine whether or not it's actually faster.
I can't speak about C++ for sure, but in C99 conformance to the IEEE 754 standard for floating point is optional (not required). In C99, if the __STDC_IEC_559__ macro is defined, the implementation declares that IEC 559 (now IEC 60559, which is more or less IEEE 754) is used for floating point.
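On the C++ side, std::numeric_limits offers a comparable query. A minimal sketch (the function name double_is_binary64 is made up here):

```cpp
#include <limits>

// True when the implementation reports that double follows IEC 60559
// (IEEE 754) and has the radix-2, 53-bit significand of binary64.
constexpr bool double_is_binary64() {
    return std::numeric_limits<double>::is_iec559 &&
           std::numeric_limits<double>::radix == 2 &&
           std::numeric_limits<double>::digits == 53;
}
```

Code that relies on the bit layout could guard itself with static_assert(double_is_binary64(), "...") so that it fails to compile on an unusual platform instead of silently misbehaving.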
I think it should be pointed out that there are functions to handle many types of rounding for you.

More Precise Floating point Data Types than double?

In my project I have to compute division, multiplication, subtraction, addition on a matrix of double elements.
The problem is that when the size of matrix increases the accuracy of my output is drastically getting affected.
Currently I am using double for each element which I believe uses 8 bytes of memory & has accuracy of 16 digits irrespective of decimal position.
Even for large size of matrix the memory occupied by all the elements is in the range of few kilobytes. So I can afford to use datatypes which require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books & I could find long double.
But I don't know what its precision is.
And what if I want more precision than that?
According to Wikipedia, 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has 64 bits mantissa, with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
You might want to consider the sequence of operations, i.e. do the additions in an ordered sequence starting with the smallest values first. This will increase overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number will make them disappear. So the latter approach reduces the numerical error.
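A scaled-down version of that experiment can be sketched as below. The function names are made up, and the sketch assumes IEEE 754 doubles with arithmetic actually carried out in double precision (no x87 excess precision, no -ffast-math).

```cpp
// Adds `small` to `large` n times, starting from the large value.
double sum_large_first(double large, double small, long n) {
    double sum = large;
    for (long i = 0; i < n; ++i) sum += small;  // each addition may round away
    return sum;
}

// Accumulates the small values first, then adds the large one once.
double sum_small_first(double large, double small, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; ++i) sum += small;
    return sum + large;
}
```

With large = 1.0, small = 1e-16 and n = 1000000, the first function returns exactly 1.0, because 1e-16 is less than half an ULP of 1.0, while the second returns a value slightly above 1.0.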
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures long double is typically 80 bits wide (the x87 extended format, which has a 64-bit significand).
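Since the answer depends on compiler and platform, it may be easiest to ask the implementation directly. A small sketch using std::numeric_limits (the wrapper names are invented for this example):

```cpp
#include <limits>

// Significand width in bits, including any implicit leading bit.
template <typename T>
constexpr int significand_bits() {
    return std::numeric_limits<T>::digits;
}

// Number of decimal digits that survive a text -> T -> text round trip.
template <typename T>
constexpr int decimal_digits() {
    return std::numeric_limits<T>::digits10;
}
```

On IEEE 754 platforms significand_bits<double>() is 53; significand_bits<long double>() would be 64 for the x87 format and 113 for quad precision, and it equals 53 where long double is just an alias for double's format.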
What kind of values do you want to represent? Maybe you are better off using fixed precision.

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (i.e. 1/7 or sqrt(2)), you will also want ways of detecting:
If the number is rational, whether it has repeating decimals, or cyclic decimals.
Or, What happens when you have an irrational number?
Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent "exactly" (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (decimal point). Since the bit before the dot is always one, this equates to a 24-bit fraction. Dividing by log2(10) = 3.3, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
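The conversion from significand bits to decimal digits used above is just a multiplication by log10(2), which is the same as dividing by log2(10). A tiny sketch, with a made-up function name:

```cpp
#include <cmath>

// Approximate number of decimal digits carried by a binary significand
// of the given width (including the implicit leading bit, if any).
double decimal_digits_for_bits(int significand_bits) {
    return significand_bits * std::log10(2.0);
}
```

Calling it with 24, 53 and 64 gives roughly 7.22, 15.95 and 19.27 digits respectively.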
The bits besides the mantissa are used for the sign and the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~ 10^±38, double goes to ~ 10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store it in a float and a double variable and then compare the two? This should implicitly convert the float back to a double - if there is no difference, the float is sufficient?
float f = value;
double d = value;
if ((double)f == d)
{
// float is sufficient
}
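A slightly more defensive version of that round-trip check might look like the sketch below. The name fits_in_float is made up; the range check is there because converting a finite double that lies outside float's range is undefined behavior in C++, and the treatment of NaN and infinity is simply a choice made for this example.

```cpp
#include <cmath>
#include <limits>

// True if `value` converts to float and back without changing.
bool fits_in_float(double value) {
    // NaN and infinities exist in both types; treat them as representable.
    if (std::isnan(value) || std::isinf(value)) return true;
    // Finite values beyond float's range cannot be stored in a float.
    if (std::fabs(value) > std::numeric_limits<float>::max()) return false;
    float f = static_cast<float>(value);
    return static_cast<double>(f) == value;
}
```

With this, 0.5 and 16777216.0 fit, while 0.1, 16777217.0 and 1e300 do not. Note that this only answers whether the stored value is exact, not whether float is precise enough for later arithmetic.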
You cannot represent every real number with float or double variables, only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.

Conversion Big Integer <-> double in C++

I am writing my own long arithmetic library in C++ for fun and it is already pretty much finished. I even implemented several cryptographic algorithms with that library, but one important thing is still missing: I want to convert doubles (and floats/long doubles) into my number and vice versa. My numbers are represented as a variable-sized array of unsigned long ints plus a sign bit.
I tried to find the answer with google, but the problem is that people rarely ever implement such things themselves, so I only find things about how to use Java BigInteger etc.
Conceptually, it is rather easy: I take the mantissa, shift it by the number of bits dictated by the exponent and set the sign. In the other direction I truncate it so that it fits into the mantissa and set the exponent depending on my log2 function.
But I am having a hard time figuring out the details. I could either play around with some bit patterns and cast them to a double, but I didn't find an elegant way to achieve that, or I could "calculate" it by starting with 2, exponentiating, multiplying, etc., but that doesn't seem very efficient.
I would appreciate a solution that doesn't use any library calls because I am trying to avoid libraries for my project, otherwise I could just have used gmp, furthermore, I often have two solutions on several other occasions, one using inline assembler which is efficient and one that is more platform independent, so either answer is useful for me.
edit: I use uint64_t for my parts, but I would like to be able to change it depending on the machine, but I am willing to do some different implementations with some #ifdefs to achieve that.
I'm going to make a non-portable assumption here: namely, that unsigned long long has more accurate digits than double. (This is true on all modern desktop systems that I know of.)
First, convert the most significant integer(s) into an unsigned long long. Then convert that to a double S. Let M be the number of integers less significant than those used in that first step. Multiply S by (1ull << (sizeof(unsigned)*CHAR_BIT*M)). (If shifting by more than 63 bits, you will have to split that into separate shifts and do some arithmetic.) Finally, if the original number was negative, multiply the result by -1.
This rounds a lot, but even with this rounding, due to the above assumption, no digits are lost that wouldn't be lost anyway with the conversion to a double. I think this is a similar process to what Mark Ransom said, but I'm not certain.
For converting from a double to a biginteger, first separate the mantissa into a double M and the exponent into an int E, using frexp. Multiply M by UINT_MAX, and store that result in an unsigned R. If std::numeric_limits<double>::radix is 2 (I don't know if it is or not for x86/x64), you can easily shift R left by E-(sizeof(unsigned)*CHAR_BIT) bits and you're done. Otherwise the result will instead be R*(E**(sizeof(unsigned)*CHAR_BIT)) (where ** means to the power of).
If performance is a concern, you can add an overload to your bignum class for multiplying by std::integral_constant<unsigned, 10>, which simply returns (LHS<<3)+(LHS<<1). You can similarly optimize other constants if you wish.
This blog post might help you: Clarifying and optimizing Integer>>asFloat
Otherwise, you can get an idea of the algorithm from this SO question: Converting from unsigned long long to float with round to nearest even
You don't say explicitly, but I assume your library is integer only and the unsigned longs are 32 bit and binary (not decimal). The conversion to double is simple, so I'll tackle that first.
Start with a multiplier for the current piece; if the number is positive it will be 1.0, if negative it will be -1.0. For each of the unsigned long ints in your bignum, multiply by the current multiplier and add it to the result, then multiply your multiplier by pow(2.0, 32) (4294967296.0) for 32 bits or pow(2.0, 64) (18446744073709551616.0) for 64 bits.
You can optimize this process by working with only the 2 most significant values. You need to use 2 even if the number of bits in your integer type is larger than the precision of a double, since the number of used bits in the most significant value might only be 1. You can generate the multiplier by taking a power of 2 to the number of skipped bits, e.g. pow(2.0, most_significant_count*sizeof(bit_array[0])*8). You can't use a bit shift as given in another answer because it will overflow after the first value.
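The unoptimized loop described above can be sketched like this. The representation (32-bit limbs, least significant first, plus a sign flag) and the name limbs_to_double are assumptions made for the example, not the asker's actual class.

```cpp
#include <cstdint>
#include <vector>

// Converts a magnitude stored as 32-bit limbs (least significant first)
// plus a sign flag into a double. Each addition may round, so the
// result can be off by a few ULPs; very large inputs overflow to infinity.
double limbs_to_double(const std::vector<std::uint32_t>& limbs, bool negative) {
    double result = 0.0;
    double multiplier = negative ? -1.0 : 1.0;
    for (std::uint32_t limb : limbs) {
        result += multiplier * static_cast<double>(limb);
        multiplier *= 4294967296.0;  // 2^32, exact in a double
    }
    return result;
}
```

For example, the limbs {0, 1} represent 2^32 and convert to 4294967296.0.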
To convert from double, you can get the exponent and mantissa separated from each other with the frexp function. The mantissa will come as a floating point value between 0.5 and 1.0 so you'll want to multiply it by pow(2.0, 32) or pow(2.0, 64) to convert it to an integer, then adjust the exponent by -32 or -64 to compensate.
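A sketch of the frexp approach, under the same assumed layout of 32-bit limbs stored least significant first. The name double_to_limbs is made up; the function returns only the magnitude, truncates any fractional part, and returns an empty vector (meaning zero) for NaN, infinities and values below 1. The sign would be taken separately, for example with std::signbit.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<std::uint32_t> double_to_limbs(double d) {
    double mag = std::fabs(d);
    if (!std::isfinite(mag) || mag < 1.0) return {};
    int exp = 0;
    double m = std::frexp(mag, &exp);  // mag == m * 2^exp, 0.5 <= m < 1
    // m * 2^64 is an exact integer below 2^64 (m has at most 53 bits).
    std::uint64_t sig = static_cast<std::uint64_t>(std::ldexp(m, 64));
    int shift = exp - 64;              // mag == sig * 2^shift
    if (shift < 0) {                   // drop bits below the binary point
        sig >>= -shift;
        shift = 0;
    }
    int idx = shift / 32;              // first limb touched by sig
    int off = shift % 32;              // bit offset inside that limb
    std::vector<std::uint32_t> limbs(idx + 3, 0);
    // Spread (sig << off), up to 95 bits wide, over three limbs.
    limbs[idx] = static_cast<std::uint32_t>(sig << off);
    limbs[idx + 1] = static_cast<std::uint32_t>(sig >> (32 - off));
    if (off != 0) limbs[idx + 2] = static_cast<std::uint32_t>(sig >> (64 - off));
    while (!limbs.empty() && limbs.back() == 0) limbs.pop_back();
    return limbs;
}
```

For 64-bit limbs the same idea applies with the limb width changed; only the index and offset arithmetic differs.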
To go from a big integer to a double, just do it the same way you parse numbers. For example, you parse the number "531" as "1 + (3 * 10) + (5 * 100)". Compute each portion using doubles, starting with the least significant portion.
To go from a double to a big integer, do it the same way but in reverse starting with the most significant portion. So, to convert 531, you first see that it's more than 100 but less than 1000. You find the first digit by dividing by 100. Then you subtract to get the remainder of 31. Then find the next digit by dividing by 10. And so on.
Of course, you won't be using tens (unless you store your big integers as digits). Exactly how you break it apart depends on how your big integer class is constructed. For example, if it uses 64-bit units, then you'll use powers of 2^64 instead of powers of 10.