80-bit floating point and subnormal numbers - c++

I am trying to convert an 80-bit extended precision floating point number (in a buffer) to double.
The buffer basically contains the content of an x87 register.
This question helped me get started as I wasn't all that familiar with the IEEE standard.
Anyway, I am struggling to find useful info on subnormal (or denormalized) numbers in the 80-bit format.
What I know is that, unlike float32 or float64, it doesn't have a hidden bit in the mantissa (no implied leading 1), so one way to tell whether a number is normalized is to check whether the highest bit of the mantissa is set. That leaves me with the following questions:
From what wikipedia tells me, float32 and float64 indicate a subnormal number with a (biased) exponent of 0 and a non-zero mantissa.
What does that tell me in an 80-bit float?
Can 80-bit floats with a mantissa < 1.0 even have a non-zero exponent?
Alternatively, can 80-bit floats with an exponent of 0 even have a mantissa >= 1.0?
EDIT: I guess the question boils down to:
Can I expect the FPU to sanitize exponent and highest mantissa bit in x87 registers?
If not, what kind of number should the conversion result in? Should I ignore the exponent altogether in that case? Or is it qNaN?
EDIT:
I read the FPU section in the Intel manual (Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture) which was less scary than I had feared. As it turns out the following values are not defined:
exponent == 0 + mantissa with the highest bit set
exponent != 0 + mantissa without the highest bit set
It doesn't mention if these values can appear in the wild, nor if they are internally converted.
So I actually dusted off Ollydbg and manually set bits in the x87 registers.
I crafted ST(0) to contain all bits set in the exponent and a mantissa of 0. Then I made it execute
FSTP QWORD [ESP]
FLD QWORD [ESP]
The value stored at [ESP] was converted to a signaling NaN.
After the FLD, ST(0) contained a quiet NaN.
I guess that answers my question. I accepted J-16 SDiZ's solution because it's the most straightforward one (although it doesn't explicitly explain some of the finer details).
Anyway, case solved. Thanks, everybody.

Try the SoftFloat library; it has floatx80_to_float32, floatx80_to_float64 and floatx80_to_float128. Detect the native format and act accordingly.
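If the target's long double already is the x87 80-bit extended format (true for GCC/Clang on x86 and x86-64, but not for MSVC, so treat this as an assumption to verify), you can also let the compiler/FPU do the conversion. A minimal sketch, with a made-up function name:

#include <cstring>
#include <limits>

// Assumes buf points at the 10-byte little-endian x87 register image and that
// long double is the 80-bit extended format on this target. Unsupported
// encodings (see the question's edit) are left to the FPU to handle.
static_assert(std::numeric_limits<long double>::digits == 64,
              "long double is not the x87 80-bit format here");

double x87_buffer_to_double(const unsigned char *buf) {
    long double tmp = 0.0L;          // typically padded to 12 or 16 bytes
    std::memcpy(&tmp, buf, 10);      // the value lives in the low 10 bytes
    return static_cast<double>(tmp); // the conversion rounds and handles specials
}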

The difficulty in finding information on subnormal 80-bit numbers might be because the 8087 does not make use of any special denormalization for them. I found this on MSDN's page on Type float (C):
The values listed in this table apply only to normalized
floating-point numbers; denormalized floating-point numbers have a
smaller minimum value. Note that numbers retained in 80x87 registers
are always represented in 80-bit normalized form; numbers can only be
represented in denormalized form when stored in 32-bit or 64-bit
floating-point variables (variables of type float and type long).
Edit
The above might be true for how Microsoft makes use of the FPU's registers. I found another source that indicates this:
FPU Data types:
The 80x87 FPU generally stores values in a normalized format. When a
floating point number is normalized, the H.O. bit is always one. In
the 32 and 64 bit floating point formats, the 80x87 does not actually
store this bit, the 80x87 always assumes that it is one. Therefore, 32
and 64 bit floating point numbers are always normalized. In the
extended precision 80 bit floating point format, the 80x87 does not
assume that the H.O. bit of the mantissa is one, the H.O. bit of the
number appears as part of the string of bits.
Normalized values provide the greatest precision for a given number of
bits. However, there are a large number of non-normalized values which
we can represent with the 80 bit format. These values are very close
to zero and represent the set of values whose mantissa H.O. bit is
zero. The 80x87 FPUs support a special form of 80 bit floating point
known as denormalized values.

Related

How does the CPU "cast" a floating point x87 (i think) value?

I just wanted to know how the CPU "casts" a floating point number.
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong? (I couldn't find the answer.) So, if this is the case and the floating point numbers are not emulated, how does the compiler cast it?
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong?
On modern Intel processors, the compiler is likely to use the SSE/AVX registers. The FPU is often not in regular use.
I just wanted to know how the CPU "casts" a floating point number.
Converting an integer to a floating-point number is a computation that is basically (glossing over some details):
Start with the binary (for unsigned types) or two’s complement (for signed types) representation of the integer.
If the number is zero, return all bits zero.
If it is negative, remember that and negate the number to make it positive.
Locate the highest bit set in the integer.
Locate the lowest bit that will fit in the significand of the destination format. (For example, for the IEEE-754 binary32 format commonly used for float, 24 bits fit in the significand, so the 25th bit after the highest bit set does not fit.)
Round the number at that position where the significand will end.
Calculate the exponent, which is a function of where the highest bit set is. Add a “bias” used in encoding the exponent (127 for binary32, 1023 for binary64).
Assemble a sign bit, bits for the exponent, and bits for the significand (omitting the high bit, because it is always one). Return those bits.
That computation prepares the bits that represent a floating-point number. (It omits details involving special cases like NaNs, infinities, and subnormal numbers because these do not occur when converting typical integer formats to typical floating-point formats.)
That computation may be performed “in software” (that is, with general instructions for shifting bits, testing values, and so on) or “in hardware” (that is, with special instructions for doing the conversion). All desktop computers have instructions for this. Small processors for special-purpose embedded use might not have such instructions.
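For concreteness, here is a small software sketch of the steps above, converting a uint32_t to the bit pattern of an IEEE-754 binary32 with round-to-nearest-even (the function name and layout are purely illustrative):

#include <cstdint>

// Convert an unsigned 32-bit integer to binary32 bits "in software".
uint32_t u32_to_f32_bits(uint32_t n) {
    if (n == 0) return 0;                                    // zero -> all bits zero
    int high = 31;
    while (((n >> high) & 1) == 0) --high;                   // locate the highest bit set
    uint32_t exponent = static_cast<uint32_t>(high) + 127;   // biased exponent
    uint64_t sig = static_cast<uint64_t>(n) << (63 - high);  // significand, MSB at bit 63
    uint64_t kept = sig >> 40;                               // keep 24 significand bits
    uint64_t rest = sig & ((uint64_t(1) << 40) - 1);         // bits below the rounding point
    uint64_t half = uint64_t(1) << 39;
    if (rest > half || (rest == half && (kept & 1))) {       // round to nearest, ties to even
        if (++kept == (uint64_t(1) << 24)) {                 // rounding carried out of 24 bits
            kept >>= 1;
            ++exponent;
        }
    }
    return (exponent << 23) | static_cast<uint32_t>(kept & 0x7FFFFF); // drop the hidden bit
}

Copying the returned bits into a float with memcpy and comparing against a plain (float)n cast is an easy way to sanity-check the sketch.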
It is not clear what you mean by "casting" a floating point number.
If the target architecture has an FPU, the compiler will issue FPU instructions to manipulate floating point variables; no mystery there.
To assign a float variable to an int variable, the float must be truncated or rounded (up or down). Special instructions usually exist for this purpose.
If the target architecture is "FPU-less", the compiler (toolchain) might provide a software implementation of floating point operations using the CPU instructions available. For example, an expression like a = x * y; becomes the equivalent of a = fmul(x, y);, where fmul() is a compiler-provided helper function (intrinsic) that does floating point operations without an FPU. Of course this is typically MUCH slower than using a hardware FPU. Floating point arithmetic is usually avoided on such platforms if performance matters; fixed-point arithmetic (https://en.wikipedia.org/wiki/Fixed-point_arithmetic) can be used instead.
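As an aside, a minimal Q16.16 fixed-point multiply might look like this (the format and names are just an illustration, not something a toolchain provides):

#include <cstdint>

// Q16.16 fixed point: values are int32_t scaled by 2^16, so 1.0 is 65536.
using q16 = int32_t;

q16 q16_from_int(int v) { return static_cast<q16>(v * 65536); }
int q16_to_int(q16 v)   { return v / 65536; }

// The product of two Q16.16 values is a Q32.32 value, so a wider
// intermediate is used before shifting back down to Q16.16.
q16 q16_mul(q16 a, q16 b) {
    return static_cast<q16>((static_cast<int64_t>(a) * b) >> 16);
}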

32-bit multiplication without using a 64-bit intermediate number

Is there any way to multiply two 32-bit floating point numbers without using a 64-bit intermediate value?
Background:
In an IEEE floating point number, 1 bit is devoted to the sign, 8 bits to the exponent, and 23 bits to the mantissa. When multiplying two numbers, the mantissas have to be multiplied separately. When doing this, you end up with a 48-bit product (since the implied most significant 1 makes each significand 24 bits). That 48-bit value then has to be cut down by 25 bits so that only the 23 most significant bits are retained in the result.
My question is: to do this multiplication as described, you need a 64-bit number to store the intermediate result. But I'm assuming there is a way to do it without a 64-bit number, since 32-bit architectures didn't have the luxury of 64-bit numbers and they were still able to do 32-bit floating point multiplication. So how can you do this without using a 64-bit intermediate number?
From https://isocpp.org/wiki/faq/newbie#floating-point-arith2 :
floating point calculations and comparisons are often performed by
special hardware that often contain special registers, and those
registers often have more bits than a double.
So even on a 32-bit architecture you probably have registers wider than 32 bits for floating point operations.
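If no wider register is available, one standard fallback (not from the quoted FAQ; the function name is just illustrative) is to split each operand into 16-bit halves and accumulate the partial products, which never needs more than 32 bits at a time:

#include <cstdint>

// Full 64-bit product of two 32-bit values using only 32-bit arithmetic.
void mul32x32(uint32_t a, uint32_t b, uint32_t& hi, uint32_t& lo) {
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   // contributes to bits  0..31
    uint32_t p1 = a_lo * b_hi;   // contributes to bits 16..47
    uint32_t p2 = a_hi * b_lo;   // contributes to bits 16..47
    uint32_t p3 = a_hi * b_hi;   // contributes to bits 32..63

    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFF) + (p2 & 0xFFFF); // middle column plus carries
    lo = (p0 & 0xFFFF) | (mid << 16);
    hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}

The 48-bit significand product described in the question then sits in hi:lo, and the truncation/rounding can be done on those two 32-bit words.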

Algorithm to convert double to long double

If double is a 64 bit IEEE-754 type and long double is either an 80 or 128 bit IEEE-754 type, what is the algorithm that is used by the hardware (or the compiler?) in order to perform the conversion:
double d = 3.14159;
long double ld = (long double) d;
Also, it would be amazing if someone could list a source for the algorithm, as I've had no luck finding one thus far.
For normal numbers like 3.14159, the procedure is as follows:
separate the number into sign, biased exponent, and significand
add the difference in the exponent biases for long double and double
(0x3fff - 0x3ff) to the exponent.
assemble the sign, new exponent, and significand (remembering to make the
leading bit explicit in the Intel 80-bit format).
In practice, on common hardware with the Intel 80-bit format, the “conversion” is just a load instruction to the x87 stack (FLD). One rarely needs to muck around with the actual representation details, unless targeting a platform without hardware support.
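For the bit-level route, here is a sketch of those steps for normal doubles only (zeros, subnormals, infinities and NaNs are deliberately left out; the function and output layout are my own):

#include <cstdint>
#include <cstring>

// Produce the x87 80-bit encoding of a normal double as a 16-bit
// sign+exponent field plus a 64-bit significand with an explicit integer bit.
void double_to_x80(double d, uint16_t& sign_exp, uint64_t& significand) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);

    uint64_t sign   = bits >> 63;
    uint64_t exp11  = (bits >> 52) & 0x7FF;           // biased with 0x3ff
    uint64_t frac52 = bits & 0xFFFFFFFFFFFFFull;      // 52 stored fraction bits

    uint64_t exp15 = exp11 - 0x3FF + 0x3FFF;          // re-bias for the 80-bit format
    sign_exp    = static_cast<uint16_t>((sign << 15) | exp15);
    significand = (1ull << 63) | (frac52 << 11);      // make the leading 1 explicit
}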
It's defined in the C Standard - google for N1570 to find a copy of the latest free draft. Since all "double" values can be represented in "long double", the result is a long double with the same value. I don't think you will find a precise description of the algorithm that the hardware uses, but it's quite straightforward and obvious if you look at the data formats:
Examine the exponent and mantissa bits to find out whether the number is an Infinity, a NaN, a normalized number, a denormalized number or a zero.
Produce a long double Infinity or NaN when needed.
Adjust the exponent of normalized numbers and shift the mantissa bits into the right place, adding an implicit highest mantissa bit.
Convert denormalized numbers to normalized numbers, and zeroes to long double zeroes.

Where's the 24th fraction bit on a single precision float? IEEE 754

I found myself today doing some bit manipulation and I decided to refresh my floating-point knowledge a little!
Things were going great until I saw this:
... 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits
I read it again and again but I still can't figure out where the 24th bit is, I noticed something about a binary point so I assumed that it's a point in the middle between the mantissa and the exponent.
I'm not really sure, but I believe the author was talking about this bit:
Binary point?
|
s------e-----|-------------m----------
0 - 01111100 - 01000000000000000000000
^ this
The 24th bit is implicit due to normalization.
The significand is shifted left (and one subtracted from the exponent for each bit shift) until the leading bit of the significand is a 1.
Then, since the leading bit is a 1, only the other 23 bits are actually stored.
There is also the possibility of a denormal number. The exponent is stored in a "biased" format, meaning it's an unsigned number from which a fixed offset is subtracted to get the real exponent. So, with 8 bits, it's stored as a number from 0..255 with a bias of 127: a stored 127 means an exponent of 0, a stored 1 means -126, and the values 0 and 255 are reserved (for subnormals/zero and for infinities/NaNs respectively).
If, in the process of normalization, the stored exponent field would reach 0 before the leading bit becomes 1, normalization stops and the significand is stored as-is (the effective exponent stays at -126). In this case, the implicit bit from normalization is taken to be a 0 instead of a 1.
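A tiny sketch of how decoding recovers that implicit bit (binary32 only; infinities and NaNs, exponent field 255, are ignored):

#include <cstdint>
#include <cstring>

// Recover the full 24-bit significand of a binary32 value.
uint32_t significand24(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint32_t exp  = (bits >> 23) & 0xFF;  // biased exponent field
    uint32_t frac = bits & 0x7FFFFF;      // the 23 stored fraction bits
    return (exp != 0) ? (frac | 0x800000) // normal: prepend the implicit 1
                      : frac;             // subnormal (or zero): implicit bit is 0
}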
Most floating point hardware is designed to basically assume numbers will be normalized, so they assume that implicit bit is a 1. During the computation, they check for the possibility of a denormal number, and in that case they do roughly the equivalent of throwing an exception, and re-start the calculation with that taken into account. This is why computation with denormals often gets drastically slower than otherwise.
In case you wonder why it uses this strange format: IEEE floating point (like many others) is designed so that if you treat the bit pattern of a non-negative value as an integer of the same size, the integers sort into the same order as the floating point values (negative values are stored sign-magnitude rather than 2's complement, so they need an extra step to compare this way). Since the sign of the number is in the most significant bit (where it is for a 2's complement integer) that's treated as the sign bit. The bits of the exponent are stored as the next most significant bits, but if we used 2's complement for them, an exponent less than 0 would set the second most significant bit of the number, which would look like a big number as an integer. By using the bias format, a smaller exponent leaves that bit clear and a larger exponent sets it, so the order as an integer reflects the order as a floating point number.
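A quick sketch of that ordering property for non-negative floats (illustrative only):

#include <cstdint>
#include <cstring>

// For non-negative, non-NaN floats, comparing the raw bit patterns as
// unsigned integers gives the same ordering as comparing the floats.
bool less_by_bits(float a, float b) {
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua < ub;   // matches (a < b) for a, b >= 0
}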
Normally (pardon the pun), the leading bit of a floating point number is always 1; thus, it doesn't need to be stored anywhere. The reason is that, if it weren't 1, that would mean you had chosen the wrong exponent to represent it; you could get more precision by shifting the mantissa bits left and using a smaller exponent.
The one exception is denormal/subnormal numbers, which are represented by all zero bits in the exponent field (the lowest possible exponent). In this case, there is no implicit leading 1 in the mantissa, and you have diminishing precision as the value approaches zero.
For normal floating point numbers, the number stored in the floating point variable is (ignoring sign) 1.mantissa * 2^(exponent - offset). The leading 1 is not stored in the variable.

Some floating point precision and numeric limits question

I know that there are tons of questions like this one, but I couldn't find my answers. Please read before voting to close (:
According to PC ASM:
The numeric coprocessor has eight floating point registers.
Each register holds 80 bits of data.
Floating point numbers are always stored as 80-bit
extended precision numbers in these registers.
How is that possible, when sizeof shows something different? For example, on the x64 architecture, sizeof(double) is 8, which is far away from 80 bits.
Why does std::numeric_limits< long double >::max() give me 1.18973e+4932?! This is a huuuuuuuuuuge number. If this is not the way to get the max of floating point numbers, then why does this compile at all, and even more - why does it return a value?
what does this mean:
Double precision magnitudes can range from approximately 10^−308 to 10^308
These are huge numbers; how can you store them in 8 B, or even 16 B (which is extended precision and is only 128 bits)?
Obviously, I'm missing something. Actually, obviously, a lot of things.
1) sizeof is the size in memory, not in a register. sizeof is in bytes, so 8 bytes = 64 bits. When doubles are loaded into the x87 registers for calculation (on this architecture), they get an extra 16 bits for more precise intermediate results. When the value is stored back to memory, the extra 16 bits are lost.
2) Why do you think long double doesn't go up to 1.18973e+4932?
3) Why can't you store 10^308 in 8 bytes? I only need 13 bits: 4 to store the 10, and 9 to store the 308.
A double is not an Intel coprocessor 80-bit floating point value, it is an IEEE 754 64-bit floating point value. With sizeof(double) you get the size of the latter.
This is the correct way to get the maximum value for long double, so your question is pointless.
You are probably missing that floating point numbers are not exact numbers. A value around 10^308 doesn't store 308 digits, only about 16 significant digits (for a 64-bit double).
The size of space that the FPU uses and the amount of space used in memory to represent double are two different things. IEEE 754 (which probably most architectures use) specifies 32-bit single precision and 64-bit double precision numbers, which is why sizeof(double) gives you 8 bytes. Intel x86 does floating point math internally using 80 bits.
std::numeric_limits< long double >::max() is giving you the correct maximum value for long double, which is typically the 80-bit type. If you want the max value for a 64-bit double, use double as the template parameter.
For the question about ranges, why do you think you can't store them in 8 bytes? They do in fact fit, and what you're missing is that at the extremes of the range there are numbers that can't be represented (for example, with the exponent nearing 308, there are many, many integers that can't be represented at all).
See also http://floating-point-gui.de/ for info about floating point.
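A quick way to see the sizes and limits discussed above on your own machine (output is platform-dependent; the values in the comments assume x86-64 with an 80-bit long double):

#include <iostream>
#include <limits>

int main() {
    std::cout << sizeof(double) << ' '                             // typically 8
              << sizeof(long double) << '\n';                      // typically 16 (80 bits + padding)
    std::cout << std::numeric_limits<double>::max() << '\n'        // ~1.79769e+308
              << std::numeric_limits<long double>::max() << '\n';  // ~1.18973e+4932
}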
Floating point numbers on computers are represented according to IEEE 754-2008.
It defines several formats, amongst which
binary32 = Single precision,
binary64 = Double precision and
binary128 = Quadruple precision are the most common.
http://en.wikipedia.org/wiki/IEEE_754-2008#Basic_formats
Double precision numbers have 52 explicit bits for the significand, which gives the precision, and 11 bits for the exponent, which gives the range of the number.
So doubles are 1.xxx (52 binary digits) * 2^exponent, where the 11-bit biased exponent allows unbiased exponents up to 1023.
And 2^1024 is about 1.79 * 10^308,
which is why this is (just above) the largest value you can store in a double.
Quadruple precision numbers have 112 bits of precision and 15 bits for the exponent, so the largest exponent is 16383; the 80-bit x87 extended format uses the same 15-bit exponent.
As 2^16384 is about 1.18 * 10^4932, you can see that your C++ test was perfectly correct, and that on x64 your long double is stored in the 80-bit extended precision format rather than in a plain 64-bit double.