Real numbers - how to determine whether float or double is required? - c++

Given a real value, can we check whether a float data type is enough to store the number, or whether a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?

For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
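To make that analysis concrete, here is a minimal sketch of such an experiment (the workload of summing 0.1 a million times is just a hypothetical stand-in for a real calculation): run the same computation in float and in double and compare the results.

#include <iomanip>
#include <iostream>

int main() {
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        fsum += 0.1f;   // rounding error accumulates quickly in 32 bits
        dsum += 0.1;    // the 64-bit result stays much closer to 100000
    }
    std::cout << std::setprecision(10)
              << "float:  " << fsum << '\n'
              << "double: " << dsum << '\n';
}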

I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose you get this real number by specifying it in code or through user input. One way to check whether a float or a double would be enough to store it without precision loss is to count the number of significant bits it needs and check that against the precision and range of float and double, as sketched below.
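Here is a rough sketch of that bit-counting idea, assuming IEEE 754 binary formats and finite, normal inputs (subnormals and the exponent-range check are omitted for brevity); the helper names are mine, not from any standard library.

#include <cfloat>
#include <cmath>

// Count how many significand bits a value actually needs.
int significant_bits(double x) {
    if (x == 0.0) return 0;
    int exp;
    double m = std::frexp(std::fabs(x), &exp);   // x = m * 2^exp with 0.5 <= m < 1
    int bits = 0;
    while (m != 0.0 && bits < DBL_MANT_DIG) {    // peel off one bit per iteration
        m *= 2.0;
        m -= std::floor(m);
        ++bits;
    }
    return bits;
}

// float has enough precision when no more than FLT_MANT_DIG (24) bits are needed.
// (A complete check would also verify the exponent fits in float's range.)
bool float_precision_suffices(double x) {
    return significant_bits(x) <= FLT_MANT_DIG;
}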
If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:
whether the number is rational, and if so whether its decimal expansion repeats (is cyclic);
what happens when the number is irrational.
Moreover, there are numbers, such as 0.9, that float / double cannot represent exactly (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.

Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix (binary) point. Since the bit before the point is always one, this equates to a 24-bit significand. Dividing by log2(10) ≈ 3.32, a float gets you about 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~ 10^±38, double goes to ~ 10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
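The figures above can also be queried directly from the implementation via std::numeric_limits rather than computed by hand; on a typical IEEE 754 platform this prints 24 bits / 6 digits for float and 53 bits / 15 digits for double.

#include <iostream>
#include <limits>

int main() {
    std::cout << "float:  " << std::numeric_limits<float>::digits   << " significand bits, "
              << std::numeric_limits<float>::digits10  << " decimal digits, max ~1e"
              << std::numeric_limits<float>::max_exponent10 << '\n';
    std::cout << "double: " << std::numeric_limits<double>::digits  << " significand bits, "
              << std::numeric_limits<double>::digits10 << " decimal digits, max ~1e"
              << std::numeric_limits<double>::max_exponent10 << '\n';
}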

A very detailed post that may or may not answer your question.
An entire series on floating point complexities!

Couldn't you simply store it in a float and a double variable and then compare the two? This should implicitly convert the float back to a double - if there is no difference, the float is sufficient.
float f = value;
double d = value;
if ((double)f == d)
{
    // float is sufficient
}

You cannot represent real numbers with float or double variables, but only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.

Related

How can I test for how many significant figures a float has in C++?

How can I test how many significant figures a specified float has in C++? Say I write:
sigfigs(x);
with x being the value of the float;
it would set an integer value y to the number of sig figs.
How can I write a void function this way?
This has been bugging me for some time; any answers appreciated.
By the way, Mysticial: this is asking for code to find the number of sig figs a given float value holds, not how many a float can have, like the one you linked to as a duplicate.
This is a bit tricky, because as you should already know, floating-point numbers are often not exact, but rather some approximation of a number. For example, 10.1 ends up as 10.09999.... A double has about 15 digits of precision, so 15 is the largest value your sigfigs() function could reasonably return. And it will need to be overloaded for double and float, because of course float has only half as many digits of precision:
int sigfigs(double x); // returns 1 to 15
int sigfigs(float x); // returns 1 to 7
Now there may be more clever mathematical ways to do this, but one idea is:
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstdio>

int sigfigs(double x) {
    int mag = static_cast<int>(std::log10(std::fabs(x)));   // order of magnitude of x
    double div = std::pow(10, mag);                          // scale x down to roughly [1, 10)
    char str[32];
    int wrote = std::snprintf(str, sizeof(str), "%.15g", x / div);
    return static_cast<int>(std::count_if(str, str + wrote,
        [](unsigned char c) { return std::isdigit(c) != 0; }));
}
This is definitely missing some cases, but I think captures the idea: we first normalize large/small numbers so that we end up with something close to 1, then we print it with a suitable format string to allow the maximum usable precision to be displayed, then we count how many digits there are.
There are notable boundary-condition bugs at 0, 10, etc., which are left as an exercise to correct. Serendipitously, NAN produces 0 which is good; +/- infinity also produce 0.
One final note; this does not strictly conform to the usual definition of significant figures. In particular, trailing zeros after a decimal place are not accounted for, because there is no way to do so given only a double or float. For example, textual inputs of 10 and 10.00000 produce bitwise-identical results from atof(). Therefore, if you need a complete solution conforming to the academic definition of sigfigs, you will need to implement a user-defined type containing e.g. a double and an int for the sigfigs. You can then implement all the arithmetic operators to carry the sigfigs throughout your calculations, following the usual rules.
Are you trying to determine the number of bits of precision in a floating point number or the number of significant figures in a variable? C and C++ do not generally specify the format to be used for float and double, but if you know the floating point format in which the number is stored and processed, you can determine the number of bits of precision. Most hardware these days uses IEEE 754 format. Looking through the definition would be a good place to start.
Number of significant figures is an entirely different question. Definition of significant figures includes a notion of how many figures are actually meaningful, as opposed to the number of figures available due to the floating point representation. For example, if you sample a voltage with a 12-bit A/D converter (and good enough analog design that all the bits are significant) then the data that you read will have 12 significant bits, and storing it in a format with higher precision does not increase the number of significant figures. For example, you store it in a 16-bit integer or a 32-bit IEEE 754 floating point number, depending on what you plan to do with the data. In either case you still only have 12 significant bits, even though a 32-bit float has a 24-bit mantissa.
Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic pretty thoroughly covers the issue of significant figures and floating-point arithmetic.

More Precise Floating point Data Types than double?

In my project I have to compute division, multiplication, subtraction, addition on a matrix of double elements.
The problem is that as the size of the matrix increases, the accuracy of my output is drastically affected.
Currently I am using double for each element, which I believe uses 8 bytes of memory and has an accuracy of 16 digits irrespective of decimal position.
Even for large size of matrix the memory occupied by all the elements is in the range of few kilobytes. So I can afford to use datatypes which require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books and I could find long double.
But I don't know what its precision is.
And what if I want more precision than that?
According to Wikipedia, 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
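For illustration, a minimal sketch using GCC's __float128 (assumptions: an x86-64 GCC with libquadmath; the Q literal suffix and quadmath_snprintf are GCC extensions, and you link with -lquadmath):

#include <cstdio>
#include <quadmath.h>

int main() {
    __float128 x = 1.0Q / 3.0Q;                        // computed in quad precision
    char buf[128];
    quadmath_snprintf(buf, sizeof(buf), "%.33Qg", x);  // ~34 significant decimal digits available
    std::printf("%s\n", buf);
}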
You might want to consider the sequence of operations, i.e. do the additions in an ordered sequence starting with the smallest values first. This will increase overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number will make them disappear. So the latter approach reduces the numerical error.
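A smaller-scale illustration of the same effect (assuming ten million additions of 1e-16 as a stand-in for the extreme example above):

#include <iomanip>
#include <iostream>

int main() {
    const int n = 10000000;

    double big_first = 1.0;                  // large value first: each 1e-16 is rounded away
    for (int i = 0; i < n; ++i) big_first += 1e-16;

    double small_first = 0.0;                // small values first: they accumulate to ~1e-9
    for (int i = 0; i < n; ++i) small_first += 1e-16;
    small_first += 1.0;                      // then add the large value once

    std::cout << std::setprecision(17)
              << "large value first:  " << big_first   << '\n'   // exactly 1
              << "small values first: " << small_first << '\n';  // ~1.000000001
}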
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures the precision of long double is 80 bits.
What kind of values do you want to represent? Maybe you are better off using fixed precision.

How can I tell if a double precision floating point number can be safely stored as a single precision one? [duplicate]

Possible Duplicate:
Real numbers - how to determine whether float or double is required?
I'm trying to check if a conversion from double to float will result in loss of precision. Obviously, I can do the conversion and convert the float back into double and compare it to the original value. I'm curious as to whether there's a more direct way.
Converting to float and back is generally the most efficient solution; on most common architectures it will require only a couple instructions, with a latency of a couple cycles each. This also has the virtue of being both simple and correct.
On platforms that do not have hardware support for floating-point, you can do the check more efficiently by taking apart the number, and checking whether the exponent and significand fit into single-precision, but that is a relatively uncommon corner-case, and this is much more error-prone and not portable to platforms that use different FP formats.
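For completeness, here is a rough sketch of that decomposition approach, assuming IEEE 754 float/double and ignoring NaN, infinity and subnormal results (the cast-and-compare test above remains the simpler and safer choice):

#include <cfloat>
#include <cmath>

// Returns true if a finite double would survive a round trip through float
// without rounding, checked by taking the number apart rather than converting.
bool fits_in_float(double d) {
    if (d == 0.0) return true;
    if (!std::isfinite(d)) return false;               // NaN/inf not handled here
    int exp;
    double m = std::frexp(d, &exp);                    // d = m * 2^exp, 0.5 <= |m| < 1
    if (exp < FLT_MIN_EXP || exp > FLT_MAX_EXP)
        return false;                                  // outside float's normal exponent range
    double scaled = std::ldexp(m, FLT_MANT_DIG);       // shift the significand up by 24 bits
    return scaled == std::floor(scaled);               // no fractional bits left => fits in 24 bits
}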
A floating point number has two parts, the mantissa and the exponent. A double has more bits assigned to both parts. Assigning a double to a float will drop mantissa bits, which gives you fewer digits of precision; that is to be expected. However, if the double's exponent doesn't fit in the float's exponent range, the float will overflow to infinity or underflow to zero rather than holding a nearby value.

Inconsistencies with double data type in C++

This may be something really simple that I'm just missing, however I am having trouble using the double data type. It states here that a double in C++ is accurate to ~15 digits. Yet in the code:
double n = pow(2,1000);
cout << fixed << setprecision(0) << n << endl;
n stores the exact value of 2^1000, something that is 302 decimal digits long according to WolframAlpha. Yet, when I attempt to compute 100!, with the function:
double factorial(int number)
{
    double product = 1.0;
    for (int i = 1; i <= number; i++) { product *= i; }
    return product;
}
Inaccuracies begin to appear at the 16th digit. Can someone please explain this behaviour, as well as provide me with some kind of solution?
Thanks
You will need eventually to read the basics in What Every Computer Scientist Should Know About Floating-Point Arithmetic. The most common standard for floating-point hardware is generally IEEE 754; the current version is from 2008, but most CPUs do not implement the new decimal arithmetic featured in the 2008 edition.
However, a floating point number (double or float) stores an approximation to the value, using a fixed-size mantissa (fractional part) and an exponent (power of 2). The dynamic range of the exponents is such that the values that can be represented range from about 10^-300 to 10^+300 in decimal, with about 16 significant (decimal) digits of accuracy. People persistently try to print more digits than can be stored, and get interesting and machine-dependent (and library-dependent) results when they do.
So, what you are seeing is intrinsic to the nature of floating-point arithmetic on computers using fixed-precision values. If you need greater precision, you need to use one of the arbitrary-precision libraries - there are many of them available.
The double actually stores an exponent to be applied to 2 in its internal representation. So of course it can store 2^1000 accurately. But try adding 1 to that.
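A minimal demonstration of that point (assuming an IEEE 754 double): 2^1000 is stored exactly, but the gap between it and the next representable double is 2^948, so adding 1 changes nothing.

#include <cmath>
#include <iostream>

int main() {
    double n = std::pow(2, 1000);          // a power of two: exactly representable
    std::cout << std::boolalpha
              << (n + 1.0 == n) << '\n';   // true: the +1 is lost to rounding
}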
IEEE 754 gives an algorithm for storing floating point data. Computers have a finite number of bits with which to store an infinite set of numbers, and this introduces error when storing values.
In this form of representation, the spacing between adjacent representable numbers grows as the magnitude (absolute distance from zero) grows. That is the loss of precision you are seeing, and as you get even larger numbers, the loss of precision gets even larger.

Why are c/c++ floating point types so oddly named?

C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float - which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later when the IEEE standard made allowances for the Intel 80287 floating-point chip, which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
Questioner is incorrect about guarantees; today almost all languages guarantee to implement IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day, it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.
You probably knew this, but you can make literal floats/long doubles
float f = 4.0f;
long double ld = 4.0l;
Double is the default because that's what most people use. Long doubles may be overkill, and floats have very bad precision. Double works for almost every application.
Why the naming? One day all we had was 32 bit floating point numbers (well, really all we had was fixed point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language du jour then, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the CPUs/FP units of the time, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed a name for something twice the size of a float, hence we got a double. Then someone needed even more precision; we thought they were crazy. We added it anyway. The name quadruple(?) was overkill. Long double was good enough, and nobody made a lot of noise.
Part of the confusion is that good-ole "int" seems to change with the times. It used to be that "int" meant a 16 bit integer. Float, however, is bound to the IEEE std as the 32-bit IEEE floating point number. For that reason, C kept float defined as 32 bit and made double and long double refer to the longer standards.
Literals
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select the smallest one big enough to hold the value, with some added complexity for signed/unsigned), with floats it is not this easy. Many values (including simple ones like 0.1) cannot be exactly represented by float numbers, and therefore the choice of type affects not only performance, but also the result value. It seems C language designers preferred robustness over performance in this case, and they therefore decided the default representation should be the more exact one.
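A small illustration of why the choice of literal type affects the value, not just performance (the printed digits are approximate and may vary slightly by platform):

#include <iomanip>
#include <iostream>

int main() {
    std::cout << std::setprecision(20)
              << 0.1f << '\n'   // nearest float  to 0.1: ~0.100000001490116...
              << 0.1  << '\n';  // nearest double to 0.1: ~0.100000000000000006
}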
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.
First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.
The two most common floating point formats use 32 bits and 64 bits; the longer one is "double" the size of the first one, so it was called a "double".
A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is 32-bit, then your double will be 64-bit.
The name double precision is a bit of a misnomer, since a double-precision float has a 52-bit mantissa, whereas a single-precision float has a 23-bit mantissa (and doubling 23 gives 46, not 52). More on floating point here: Floating Point - Wikipedia, including
links at the bottom to articles on single and double precision floats.
The name long double is likely just down to the same tradition as long integer vs. short integer for integral types, except in this case they reversed it, since 'int' is equivalent to 'long int'.
In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast to this to floating-point representations where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation." This was abbreviated to "float."
In K&R C, float referred to floating-point representations with 32-bit binary representations and double referred to floating-point representations with 64-bit binary representations, or double the size and whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating point numbers. Double-precision numbers were aptly named, as they were represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.
They're called single-precision and double-precision because they're related to the natural size (not sure of the term) of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that - 64 bits long. They just decided to call the single-precision type "float" in C.
double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.
Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks --- which were called double words. 4-byte blocks were words and, sure enough, 2-byte blocks were halfwords.
The implementation of floating point numbers were such that they fit into single words, or double words. To remain consistent, the doubleword floating point number was then called a "double".
It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to be more precise.
hence the %f for a float type, and the %lf for a "long float", which is the same as double.