Are all 32 bit floats exactly represented as f64?

Are all 32 bit floats exactly represented as f64? - casting

Can all IEEE 754 32 bit floating point numbers be represented exactly by a 64 bit floating point number? Stated another way, is a cast from f32 to f64 ever rounded?

Can all IEEE 754 32 bit floating point numbers be represented exactly by a 64 bit floating point number?
Yes. All numeric values of binary32 are in binary64.
Stated another way, is a cast from f32 to f64 ever rounded?
Not usually. Various language like C allow intermediate 32-bit FP calculations to employ wider math and so a cast may narrow (round) results. Yet if the value was truly f32, no rounding error would occur going to f64.
Aside:
The Not-a-number payload of a binary32 is 23 bits and that is fully encodable as a binary64, yet the detailed meaning of those is implementation dependent.

Related

Casting Double to Long is giving wrong value [duplicate]

I'm currently learning inter-type data convertion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks

Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 253. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the floating point unit of x86 architecture actually uses a 80-bit extended floating point format, which is designed to be able to represent precisely all of 64 bit integers and can be accessed using the long double type.

This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

converting really large int to double, loss of precision on some computer

I'm currently learning inter-type data convertion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks

This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

Is hardcode float precise if it can be represented by binary format in IEEE 754?

for example, 0 , 0.5, 0.15625 , 1 , 2 , 3... are values converted from IEEE 754. Are their hardcode version precise?
for example:
is
float a=0;
if(a==0){
return true;
}
always return true? other example:
float a=0.5;
float b=0.25;
float c=0.125;
is a * b always equal to 0.125 and a * b==c always true? And one more example:
int a=123;
float b=0.5;
is a * b always be 61.5? or in general, is integer multiply by IEEE 754 binary float precise?
Or a more general question: if the value is hardcode and both the value and result can be represented by binary format in IEEE 754 (e.g.:0.5 - 0.125), is the value precise?

There is no inherent fuzzyness in floating-point numbers. It's just that some, but not all, real numbers can't be exactly represented.
Compare with a fixed-width decimal representation, let's say with three digits. The integer 1 can be represented, using 1.00, and 1/10 can be represented, using 0.10, but 1/3 can only be approximated, using 0.33.
If we instead use binary digits, the integer 1 would be represented as 1.00 (binary digits), 1/2 as 0.10, 1/4 as 0.01, but 1/3 can (again) only be approximated.
There are some things to remember, though:
It's not the same numbers as with decimal digits. 1/10 can be
written exactly as 0.1 using decimal digits, but not using binary
digits, no matter how many you use (short of infinity).
In practice, it is difficult to keep track of which numbers can be
represented and which can't. 0.5 can, but 0.4 can't. So when you need
exact numbers, such as (often) when working with money, you shouldn't
use floating-point numbers.
According to some sources, some processors do strange things
internally when performing floating-point calculations on numbers
that can't be exactly represented, causing results to vary in a way
that is, in practice, unpredictable.
(My opinion is that it's actually a reasonable first approximation to say that yes, floating-point numbers are inherently fuzzy, so unless you are sure your particular application can handle that, stay away from them.)
For more details than you probably need or want, read the famous What Every Computer Scientist Should Know About Floating-Point Arithmetic. Also, this somewhat more accessible website: The Floating-Point Guide.

No, but as Thomas Padron-McCarthy says, some numbers can be exactly represented using binary but not all of them can.
This is the way I explain it to non-developers who I work with (like Mahmut Ali I also work on an very old financial package): Imagine having a very large cake that is cut into 256 slices. Now you can give 1 person the whole cake, 2 people half of the slices but soon as you decide to split it between 3 you can't - it's either 85 or 86 - you can't split the cake any further. The same is with floating point. You can only get exact numbers on some representations - some numbers can only be closely approximated.

C++ does not require binary floating point representation. Built-in integers are required to have a binary representation, commonly two's complement, but one's complement and sign and magnitude are also supported. But floating point can be e.g. decimal.
This leaves open the question of whether C++ floating point can have a radix that does not have 2 as a prime factor, like 2 and 10. Are other radixes permitted? I don't know, and last time I tried to check that, I failed.
However, assuming that the radix must be 2 or 10, then all your examples involve values that are powers of 2 and therefore can be exactly represented.
This means that the single answer to most of your questions is “yes”. The exception is the question “is integer multiply by IEEE 754 binary float [exact]”. If the result exceeds the precision available, then it can't be exact, but otherwise it is.
See the classic “What Every Computer Scientist Should Know About Floating-Point Arithmetic” for background info about floating point representation & properties in general.
If a value can be exactly represented in 32-bit or 64-bit IEEE 754, then that doesn't mean that it can be exactly represented with some other floating point representation. That's because different 32-bit representations and different 64-bit representations use different number of bits to hold the mantissa and have different exponent ranges. So a number that can be exactly represented in one way, can be beyond the precision or range of some other representation.
You can use std::numeric_limits<T>::is_iec559 (where e.g. T is double) to check whether your implementation claims to be IEEE 754 compatible. However, when floating point optimizations are turned on, at least the g++ compiler (1)erroneously claims to be IEEE 754, while not treating e.g. NaN values correctly according to that standard. In effect, the is_iec559 only tells you whether the number representation is IEEE 754, and not whether the semantics conform.
(1) Essentially, instead of providing different types for different semantics, gcc and g++ try to accommodate different semantics via compiler options. And with separate compilation of parts of a program, that can't conform to the C++ standard.

In principle, this should be possible. If you restrict yourself to exactly this class of numbers with a finite 2-power representation.
But it is dangerous: what if someone takes your code and changes your 0.5 to 0.4 or your .0625 to .065 due to whatever reasons? Then your code is broken. And no, even excessive comments won't help about that - someone will always ignore them.

Range of representable values of 32-bit, 64-bit and 80-bit float IEEE-754?

In the C++ standard it says of floating literals:
If the scaled value is not in the range of representable values for its type, the program is ill-formed.
The scaled value is the significant part multiplied by 10 ^ exponent part.
Under x86-64:
float is a single-precision IEEE-754
double is a double-precision IEEE-754
long double is an 80-bit extended precision IEEE-754
In this context, what is the range of repsentable values for each of these three types? Where is this documented? or how is it calculated?

If you know the number of exponent bits and mantissa bits, then based on the IEEE-754 format, one can establish that the maximum absolute representable value is:
2^(2^(E-1)-1)) * (1 + (2^M-1)/2^M)
The minimum absolute value (not including zero or denormals) is:
2^(2-2^(E-1))
For single-precision, E is 8, M is 23.
For double-precision, E is 11, M is 52.
For extended-precision, I'm not sure. If you're referring to the 80-bit precision of the x87 FPU, then so far as I can tell, it's not's IEEE-754 compliant...

The answer (if you're on a machine with IEEE floating point) is
in float.h. FLT_MAX, DBL_MAX and LDBL_MAX. On a system
with full IEEE support, something around 3.4e+38, 1.8E+308 and
1.2E4932. (The exact values may vary, and may be expressed
differently, depending on how the compiler does its input and
rounding. g++, for example, defines them to be compiler
built-ins.)
EDIT:
WRT your question (since neither I nor the other responders
actually answered it): the range of representable values is
[-type_MAX...type], where
type is one of FLT, DBL, or LDBL.

I was looking for largest representable number by 64 bits and ending up making my own 500 digit floating point calculator. This is what I come up with if all 64 bits are turned on
18,446,744,073,709,551,615
18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand 615

Why are c/c++ floating point types so oddly named?

C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.

The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later when the IEEE standard made allowances for the Intel 80287 floating-point chip, which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
Questioner is incorrect about guarantees; today almost all languages guarantee to implement IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day, it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.

You probably knew this, but you can make literal floats/long doubles
float f = 4.0f;
long double f = 4.0l;
Double is the default because thats what most people use. Long doubles may be overkill or and floats have very bad precision. Double works for almost every application.
Why the naming? One day all we had was 32 bit floating point numbers (well really all we had was fixed point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language dujour then, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the cpu's/fp cpus of the time, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed something a name for something twice the size of a float, hence we got a double. Then someone needed even more precision, we thought he was crazy. We added it anyway. The name quadtuple(?) was overkill. Long double was good enough, and nobody made a lot of noise.
Part of the confusion is that good-ole "int" seems to change with the time. It used to be that "int" meant 16 bit integer. Float, however, is bound to the IEEE std as the 32-bit IEEE floating point number. For that reason, C kept float defined as 32 bit and made double and long double to refer to the longer standards.

Literals
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select smallest enough to hold the value, with some added complexity for signed/unsigned), with floats it is not this easy. Many values (including simple ones like 0.1) cannot be exactly represented by float numbers and therefore choice of type affects not only performance, but also result value. It seems C language designers preferred robustness against performance in this case and they therefore decided the default representation should be the more exact one.
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.

First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.

The two most common floating point formats use 32-bits and 64-bits, the longer one is "double" the size of the first one so it was called a "double".

A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is a 32-bit, then your double will be a 64-bit.
The name double precision is a bit of a misnomer, since a double precision float has a precision of the mantissa of 52-bits, where a single precision float has a mantissa precision of 23-bits (double that is 56). More on floating point here: Floating Point - Wikipedia, including
links at the bottom to articles on single and double precision floats.
The name long double is likely just down the same tradition as the long integer vs. short integer for integral types, except in this case they reversed it since 'int' is equivalent to 'long int'.

In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast to this to floating-point representations where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation." This was abbreviated to "float."
In K&R C, float referred to floating-point representations with 32-bit binary representations and double referred to floating-point representations with 64-bit binary representations, or double the size and whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating point numbers. Double-precision numbers were aptly name as they were represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.

They're called single-precision and double-precision because they're related to the natural size (not sure of the term) of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that - 64 bits long. They just decided to call the single-precision type "float" in C.

double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.

Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks --- which were called double words. 4 byte blocks were words and sure enugh, 2-byte blocks were halfwords.
The implementation of floating point numbers were such that they fit into single words, or double words. To remain consistent, the doubleword floating point number was then called a "double".

It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to be more precise.

hence the %f for a float type, and a %lf for a long float which is the same as double.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js