32-bit multiplication without using a 64-bit intermediate number - c++

Is there any way to multiply two 32-bit floating point numbers without using a 64-bit intermediate value?
Background:
In an IEEE floating point number, 1 bit is devoted to the sign, 8 bits are devoted to the exponent, and 23 bits are devoted to the mantissa. When multiplying two such numbers, the mantissas have to be multiplied separately. When doing this, you end up with a 48-bit number (since the implied most significant bit of 1 is included). After obtaining the 48-bit product, it should be truncated by 25 bits so that only the 23 most significant bits are retained in the result.
My question is: to do this multiplication as described, you would need a 64-bit number to store the intermediate result. But I'm assuming there is a way to do this without a 64-bit type, since 32-bit architectures didn't have the luxury of 64-bit numbers and they were still able to multiply 32-bit floating point numbers. So how can you do this without a 64-bit intermediate number?

From https://isocpp.org/wiki/faq/newbie#floating-point-arith2 :
floating point calculations and comparisons are often performed by
special hardware that often contain special registers, and those
registers often have more bits than a double.
So even on a 32-bit architecture you probably have floating point registers that are wider than 32 bits.
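
That said, even without wider registers the 48-bit significand product the question asks about can be formed with 32-bit arithmetic alone, by splitting each 24-bit significand into halves. A minimal sketch (the function names are invented for this illustration, and the rounding/normalization steps are left out):

#include <cstdint>
#include <cstdio>

// Illustrative only: build the 48-bit product of two 24-bit significands
// using nothing wider than 32 bits, by splitting each significand into
// 12-bit halves.
static void mul24x24(uint32_t m1, uint32_t m2, uint32_t& hi24, uint32_t& lo24)
{
    uint32_t a = m1 >> 12, b = m1 & 0xFFF;   // m1 = a*2^12 + b
    uint32_t c = m2 >> 12, d = m2 & 0xFFF;   // m2 = c*2^12 + d

    uint32_t hh = a * c;          // lands at bit 24, at most 24 bits
    uint32_t mm = a * d + b * c;  // lands at bit 12, at most 25 bits
    uint32_t ll = b * d;          // lands at bit 0,  at most 24 bits

    uint32_t low = ll + ((mm & 0xFFF) << 12);   // bits 0..23 plus a possible carry
    hi24 = hh + (mm >> 12) + (low >> 24);       // bits 24..47 of the product
    lo24 = low & 0xFFFFFF;                      // bits 0..23 of the product
}

int main()
{
    uint32_t hi, lo;
    mul24x24(0xFFFFFF, 0xFFFFFF, hi, lo);       // (2^24 - 1)^2 = 0xFFFFFE000001
    std::printf("%06X %06X\n", (unsigned)hi, (unsigned)lo);   // prints FFFFFE 000001
}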

Related

How does the CPU "cast" a floating point x87 (I think) value?

I just wanted to know how the CPU "casts" a floating point number.
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong? (I couldn't find the answer.) So, if this is the case and the floating point numbers are not emulated, how does the compiler cast it?
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong?
On modern Intel processors, the compiler is likely to use the SSE/AVX registers. The FPU is often not in regular use.
I just wanted to know how the CPU "casts" a floating point number.
Converting an integer to a floating-point number is a computation that is basically (glossing over some details):
1. Start with the binary (for unsigned types) or two’s complement (for signed types) representation of the integer.
2. If the number is zero, return all bits zero.
3. If it is negative, remember that and negate the number to make it positive.
4. Locate the highest bit set in the integer.
5. Locate the lowest bit that will fit in the significand of the destination format. (For example, for the IEEE-754 binary32 format commonly used for float, 24 bits fit in the significand, so the 25th bit after the highest bit set does not fit.)
6. Round the number at that position where the significand will end.
7. Calculate the exponent, which is a function of where the highest bit set is. Add a “bias” used in encoding the exponent (127 for binary32, 1023 for binary64).
8. Assemble a sign bit, bits for the exponent, and bits for the significand (omitting the high bit, because it is always one). Return those bits.
That computation prepares the bits that represent a floating-point number. (It omits details involving special cases like NaNs, infinities, and subnormal numbers because these do not occur when converting typical integer formats to typical floating-point formats.)
That computation may be performed “in software” (that is, with general instructions for shifting bits, testing values, and so on) or “in hardware” (that is, with special instructions for doing the conversion). All desktop computers have instructions for this. Small processors for special-purpose embedded use might not have such instructions.
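
As a rough software illustration of the steps above, here is a hypothetical int32-to-binary32 conversion (names invented for this sketch). Only the cases described are handled: no NaNs, infinities or subnormals arise, and rounding is round-to-nearest, ties-to-even.

#include <cstdint>
#include <cstdio>
#include <cstring>

static float int32_to_float(int32_t v)
{
    if (v == 0) return 0.0f;                            // zero -> all bits zero

    uint32_t sign = 0, mag = static_cast<uint32_t>(v);
    if (v < 0) { sign = 0x80000000u; mag = 0u - mag; }  // remember the sign, make positive

    int highest = 31;
    while (!(mag & (1u << highest))) --highest;         // locate the highest set bit

    uint32_t mantissa;
    if (highest <= 23) {
        mantissa = mag << (23 - highest);               // everything fits, no rounding needed
    } else {
        int shift = highest - 23;                       // low bits that do not fit
        uint32_t rest = mag & ((1u << shift) - 1);
        uint32_t half = 1u << (shift - 1);
        mantissa = mag >> shift;
        if (rest > half || (rest == half && (mantissa & 1)))
            ++mantissa;                                 // round to nearest, ties to even
        if (mantissa >> 24) { mantissa >>= 1; ++highest; }  // rounding overflowed the significand
    }

    uint32_t exponent = static_cast<uint32_t>(highest) + 127;        // add the bias
    uint32_t bits = sign | (exponent << 23) | (mantissa & 0x7FFFFF); // drop the hidden bit

    float result;
    std::memcpy(&result, &bits, sizeof result);         // reinterpret the assembled bits
    return result;
}

int main()
{
    std::printf("%g %g %g\n",
                static_cast<double>(int32_to_float(1)),
                static_cast<double>(int32_to_float(-40)),
                static_cast<double>(int32_to_float(INT32_MAX)));  // 1 -40 2.14748e+09
}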
It is not clear what you mean by
"casting" a floating point number.
If the target architecture has an FPU, then the compiler will issue FPU instructions in order to manipulate floating point variables; no mystery there.
In order to assign a float variable to an int variable, the float must be truncated or rounded (up or down). Special instructions usually exist to serve this purpose.
If the target architecture is "FPU-less", then the compiler (toolchain) might provide a software implementation of floating point operations using the CPU instructions available. For example, an expression like a = x * y; will be equivalent to a = fmul(x, y);, where fmul() is a compiler-provided special function (intrinsic) that does floating point operations without an FPU. Of course this is typically MUCH slower than using a hardware FPU. Floating point arithmetic is not used on such platforms if performance matters; fixed-point arithmetic (https://en.wikipedia.org/wiki/Fixed-point_arithmetic) can be used instead.
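
As a small illustration of the fixed-point alternative mentioned above, here is a toy Q16.16 multiply. The format and names are purely illustrative, and the 64-bit intermediate shown would itself be split into 32-bit operations on an FPU-less 32-bit target.

#include <cstdint>
#include <cstdio>

using fix16 = int32_t;                        // 16 integer bits . 16 fraction bits

inline fix16 fix16_mul(fix16 a, fix16 b)
{
    // Keep the full product, then rescale back to Q16.16.
    return static_cast<fix16>((static_cast<int64_t>(a) * b) >> 16);
}

int main()
{
    fix16 x = 3 << 16;                        // 3.0 in Q16.16
    fix16 y = (1 << 16) / 2;                  // 0.5 in Q16.16
    fix16 z = fix16_mul(x, y);                // 1.5 -> 0x00018000
    std::printf("%08X\n", (unsigned)z);
}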

Find float type's memory format

How to find the memory format of float and double types?
I mean the numbers of bits for sign, exponent, and fraction.
I mean the numbers of bits for sign, exponent, and fraction.
You can use std::numeric_limits<T>::digits to get the number of mantissa bits (note that this count includes the implicit leading bit, which is not stored), and std::numeric_limits<T>::is_signed to tell whether there is a sign bit.
You can subtract their sum from sizeof(T)*CHAR_BIT to guess the number of exponent bits, but this may not be correct if the type has padding, which is typical for long double, for example.
Find float type's memory format
Of course, not only the number of bits matters, but also their order. These functions do not help you with that. They also assume that std::numeric_limits<T>::radix == 2.
For the exact format, you will have to consult the manual of the CPU architecture.
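
A quick sketch of querying those values; the exponent width is only the deduction described above (total bits minus sign and stored mantissa bits) and can be wrong when the type has padding, as is common for long double.

#include <climits>
#include <cstdio>
#include <limits>

template <typename T>
void report(const char* name)
{
    const int total    = static_cast<int>(sizeof(T) * CHAR_BIT);
    const int mantissa = std::numeric_limits<T>::digits;       // includes the implicit bit
    const int sign     = std::numeric_limits<T>::is_signed ? 1 : 0;
    const int exponent = total - sign - (mantissa - 1);        // implicit bit is not stored
    std::printf("%-12s total %3d  mantissa %2d  sign %d  exponent(guess) %2d  radix %d\n",
                name, total, mantissa, sign, exponent,
                std::numeric_limits<T>::radix);
}

int main()
{
    report<float>("float");            // typically 32 / 24 / 1 / 8 / 2
    report<double>("double");          // typically 64 / 53 / 1 / 11 / 2
    report<long double>("long double");
}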
I want to save floating point numbers to hard disk and wonder if there is a good way to make the file cross platform.
The most typical solution is to convert the floating point value to a textual representation when saving to disk, e.g. 0x1.8p+0. It's not the most efficient approach, but it is portable (although you do have to decide what character encoding the file is going to use, and if that is not native to the system, there needs to be a conversion).
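
For example, a float can be round-tripped through that hexadecimal text form with the standard %a format and strtod (a minimal sketch, assuming C99-style hex floats are available):

#include <cstdio>
#include <cstdlib>

int main()
{
    float x = 1.5f;
    char buf[64];
    std::snprintf(buf, sizeof buf, "%a", static_cast<double>(x));  // e.g. "0x1.8p+0"
    float y = static_cast<float>(std::strtod(buf, nullptr));
    std::printf("%s -> %g (round-trip %s)\n", buf, static_cast<double>(y),
                x == y ? "ok" : "failed");
}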
To save floating point numbers to disk in a cross-platform manner, have a look at my GitHub project:
https://github.com/MalcolmMcLean/ieee754
I've also put in functions for reading binary integers portably, as it is slightly more involved than it appears at first sight.
If you want to test the memory format, you need to create floats with certain strategic values, then query the bit patterns. Zero should be all bits zero and is a special case, unless you're on weird and wonderful hardware. 1 and -1 should differ by one bit, which is your sign bit. 2 and 1 should differ in the exponent, and testing powers of two should tell you which bits belong to the exponent and when they run out. Test powers of powers of two for speed.
You can then get the mantissa bits by storing values one off a power of 2, e.g. 3, 7, and so on.
By probing it you can build up a memory pattern, but it's almost always IEEE 754.
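
A minimal sketch of such probing, assuming a 32-bit float and using memcpy to read the bit pattern. On an IEEE-754 machine this prints 3F800000, BF800000, 40000000 and 40400000 for 1, -1, 2 and 3.

#include <cstdint>
#include <cstdio>
#include <cstring>

static uint32_t bits_of(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);   // copy the raw bit pattern
    return u;
}

int main()
{
    const float probes[] = { 1.0f, -1.0f, 2.0f, 3.0f };
    for (float p : probes)
        std::printf("%+g -> %08X\n", static_cast<double>(p), (unsigned)bits_of(p));
}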

How does a 32-bit machine compute a double precision number

If I only have a 32-bit machine, how does the CPU compute a double precision number? This number is 64 bits wide. How does an FPU handle it?
The more general question would be: how do you compute something which is wider than my ALU? However, I fully understand the integer way: you can simply split them up. Yet with floating point numbers, you have the exponent and the mantissa, which have to be handled differently.
Not everything in a "32-bit machine" has to be 32 bits. The x87-style FPU has never been "32-bit", and it was created a very long time before AMD64. It has always been capable of doing math on 80-bit extended doubles, and it used to be a separate chip, so there was no chance of it using the main ALU at all.
It's wider than the ALU, yes, but it doesn't go through the ALU; the floating point unit(s) use their own circuits, which are as wide as they need to be. These circuits are also much more complicated than the integer circuits, and they don't really overlap with integer ALUs in their components.
There are several different concepts in a computer architecture that can be measured in bits, but none of them prevents handling 64-bit floating point numbers. Although these concepts may be correlated, it is worth considering them separately for this question.
Often, "32 bit" means that addresses are 32 bits. That limits each process's virtual memory to 2^32 addresses. It is the measure that makes the most direct difference to programs, because it affects the size of a pointer and the maximum size of in-memory data. It is completely irrelevant to the handling of floating point numbers.
Another possible meaning is the width of the paths that transfer data between memory and the CPU. That is not a hard limit on the sizes of data structures - one data item may take multiple transfers. For example, the Java Language Specification does not require atomic loads and stores of double or long. See 17.7. Non-Atomic Treatment of double and long. A double can be moved between memory and the processor using two separate 32 bit transfers.
A third meaning is the general register size. Many architectures use separate registers for floating point. Even if the general registers are only 32 bits the floating point registers can be wider, or it may be possible to pair two 32 bit floating point registers to represent one 64-bit number.
A typical relationship between these concepts is that a computer with 64 bit memory addresses will usually have 64 bit general registers, so that a pointer can fit in one general register.
Even 8 bit computers provided extended precision (80 bit) floating point arithmetic, by writing code to do the calculations.
Modern 32 bit computers (x86, ARM, older PowerPC etc.) have 32 bit integer and 64 or 80 bit floating-point hardware.
Let's look at integer arithmetic first, since it is simpler. Inside your 32-bit ALU there are 32 individual logic units with carry bits that spill up the chain. 1 + 1 -> 10, with the carry bit carried over to the second logic unit. The entire ALU also has a carry bit output, and you can use this to do arbitrary-length math. The only real limitation on the bit width is how many bits you can work with in one cycle. To do 64-bit math you need two or more cycles and you need to do the carry logic yourself.
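
A sketch of doing that carry logic yourself: 64-bit addition from two 32-bit halves, using only 32-bit operations (the struct and names are illustrative).

#include <cstdint>
#include <cstdio>

struct u64parts { uint32_t lo, hi; };

static u64parts add64(u64parts a, u64parts b)
{
    u64parts r;
    r.lo = a.lo + b.lo;
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;   // unsigned wrap-around means a carry out
    r.hi = a.hi + b.hi + carry;
    return r;
}

int main()
{
    u64parts x = { 0xFFFFFFFFu, 0x00000000u };  // 2^32 - 1
    u64parts y = { 0x00000001u, 0x00000000u };  // 1
    u64parts s = add64(x, y);                   // 2^32 -> hi = 1, lo = 0
    std::printf("hi=%08X lo=%08X\n", (unsigned)s.hi, (unsigned)s.lo);
}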
It seems that the question is just "how does an FPU work?", regardless of bit widths.
An FPU does addition, multiplication, division, etc. Each of these has a different algorithm.
Addition
(also subtraction)
Given two numbers with exponent and mantissa:
x1 = m1 * 2 ^ e1
x2 = m2 * 2 ^ e2
, the first step is normalization:
x1 = m1 * 2 ^ e1
x2 = (m2 * 2 ^ (e2 - e1)) * 2 ^ e1 (assuming e2 > e1)
Then one can add the mantissas:
x1 + x2 = (whatever) * 2 ^ e1
Then, one should convert the result to a valid mantissa/exponent form (e.g., the (whatever) part might be required to be between 2^23 and 2^24). This is called "renormalization" if I am not mistaken. Here one should also check for overflow and underflow.
Multiplication
Just multiply the mantissas and add the exponents. Then renormalize the multiplied mantissa (a C++ sketch of this recipe follows at the end of this answer).
Division
Do a "long division" algorithm on the mantissas, then subtract the exponents. Renormalization might not be necessary (depending on how you implement the long division).
Sine/Cosine
Convert the input to a range [0...π/2], then run the CORDIC algorithm on it.
Etc.
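
To make the multiplication recipe concrete, here is a bare-bones binary32 multiply in C++. It ignores NaNs, infinities, zeros and subnormals, and it truncates instead of rounding, so it is only a sketch of the flow, not a faithful FPU model.

#include <cstdint>
#include <cstdio>
#include <cstring>

static float softmul(float a, float b)
{
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);

    uint32_t sign = (ua ^ ub) & 0x80000000u;
    int32_t  ea   = static_cast<int32_t>((ua >> 23) & 0xFF) - 127;
    int32_t  eb   = static_cast<int32_t>((ub >> 23) & 0xFF) - 127;
    uint32_t ma   = (ua & 0x7FFFFF) | 0x800000;       // restore the implicit 1
    uint32_t mb   = (ub & 0x7FFFFF) | 0x800000;

    uint64_t prod = static_cast<uint64_t>(ma) * mb;   // 48-bit mantissa product
    int32_t  e    = ea + eb;
    if (prod & (1ull << 47)) { prod >>= 1; ++e; }     // renormalize into [2^46, 2^47)

    uint32_t mantissa = static_cast<uint32_t>(prod >> 23) & 0x7FFFFF;  // drop hidden bit
    uint32_t bits = sign | (static_cast<uint32_t>(e + 127) << 23) | mantissa;

    float r;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}

int main()
{
    std::printf("%g %g\n", softmul(1.5f, 2.0f), softmul(-3.0f, 0.25f));   // 3 -0.75
}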

Handling endianness of floating point values when there is no fixed size floating point type available

I'm writing a binary file reader/writer and have decided that to handle the issue of endianness I will convert all data to "network" (big) endianness on writing and to host endianness on reading. I'm avoiding hton* because I don't want to link with winsock for just those functions.
My main point of confusion comes from how to handle floating point values. For all integral values I have the sized types in <cstdint> (uint32_t, etc.), but from my research no such equivalent exists for floating point types. I'd like to convert all floating point values to a 32 bit representation on writing and convert back to whatever precision is used on the host (32 bit is enough for my application). This way I will know precisely how many bytes to write and read for floating point values; as opposed to if I used sizeof(float) and sizeof(float) was different on the machine loading the file than the machine that wrote it.
I was just made aware of the possibility of using frexp to get the mantissa and exponent in integer terms, writing those integers out (with some fixed size), then reading the integers in and reconstructing the floating point value using ldexp. This looks promising, but I am wondering if there is any generally accepted or recommended method for handling float endianness without htonf/ntohf.
I know with almost certainly any platform I'll be targeting anytime soon will have float represented with 32-bits, but I'd like to make the code I write now as compatible as I can for use in future projects.
If you want to be completely cross-platform and standards-compliant, then the frexp/ldexp solution is the best way to go. (Although you might need to consider the highly theoretical case where either the source or the target hardware uses decimal floating point.)
Suppose that one or the other machine did not have a 32-bit floating point representation. Then there is no datatype on that machine bit-compatible with a 32-bit floating point number, regardless of endianness. So there is then no standard way of converting the non-32-bit float to a transmittable 32-bit representation, or of converting the transmitted 32-bit representation to a native non-32-bit floating point number.
You could restrict your scope to machines which have a 32-bit floating point representation, but then you will need to assume that both machines have the same number and order of bits dedicated to sign, exponent and mantissa. That's likely to be the case, since IEEE-754 format is almost universal these days, but C++ does not insist on it and it is at least conceivable that there is a machine which implements 1/8/23-bit floating point numbers with the sign bit at the low-order end instead of the high-order end.
In short, endianness is only one of the possible incompatibilities between binary floating point formats. Reducing every floating point number to two integers, however, avoids having to deal with other incompatibilities (other than radix).
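
A minimal sketch of that frexp/ldexp reduction to two integers: the value becomes a 32-bit integer mantissa plus a 32-bit exponent, which can then be written with the usual fixed-size-integer and byte-order machinery. NaNs and infinities would need separate handling, and the struct name is invented here.

#include <cmath>
#include <cstdint>
#include <cstdio>

struct portable_float { int32_t mantissa; int32_t exponent; };

static portable_float encode(float value)
{
    int exp = 0;
    float frac = std::frexp(value, &exp);                       // value = frac * 2^exp, |frac| in [0.5, 1)
    portable_float p;
    p.mantissa = static_cast<int32_t>(std::ldexp(frac, 24));    // scale the fraction to an exact integer
    p.exponent = exp - 24;
    return p;
}

static float decode(portable_float p)
{
    return static_cast<float>(std::ldexp(static_cast<double>(p.mantissa), p.exponent));
}

int main()
{
    float x = -123.456f;
    portable_float p = encode(x);
    std::printf("mantissa=%d exponent=%d round-trip ok: %d\n",
                static_cast<int>(p.mantissa), static_cast<int>(p.exponent),
                decode(p) == x ? 1 : 0);
}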

80-bit floating point and subnormal numbers

I am trying to convert an 80-bit extended precision floating point number (in a buffer) to double.
The buffer basically contains the content of an x87 register.
This question helped me get started as I wasn't all that familiar with the IEEE standard.
Anyway, I am struggling to find useful info on subnormal (or denormalized) numbers in the 80-bit format.
What I know is that unlike float32 or float64 it doesn't have a hidden bit in the mantissa (no implied addition of 1.0), so one way to know if a number is normalized is to check if the highest bit in the mantissa is set. That leaves me with the following question:
From what wikipedia tells me, float32 and float64 indicate a subnormal number with a (biased) exponent of 0 and a non-zero mantissa.
What does that tell me in an 80-bit float?
Can 80-bit floats with a mantissa < 1.0 even have a non-zero exponent?
Alternatively, can 80-bit floats with an exponent of 0 even have a mantissa >= 1.0?
EDIT: I guess the question boils down to:
Can I expect the FPU to sanitize exponent and highest mantissa bit in x87 registers?
If not, what kind of number should the conversion result in? Should I ignore the exponent altogether in that case? Or is it qNaN?
EDIT:
I read the FPU section in the Intel manual (Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture) which was less scary than I had feared. As it turns out the following values are not defined:
exponent == 0 + mantissa with the highest bit set
exponent != 0 + mantissa without the highest bit set
It doesn't mention if these values can appear in the wild, nor if they are internally converted.
So I actually dusted off Ollydbg and manually set bits in the x87 registers.
I crafted ST(0) to contain all bits set in the exponent and a mantissa of 0. Then I made it execute
FSTP QWORD [ESP]
FLD QWORD [ESP]
The value stored at [ESP] was converted to a signaling NaN.
After the FLD, ST(0) contained a quiet NaN.
I guess that answers my question. I accepted J-16 SDiZ's solution because it's the most straightforward one (although it doesn't explicitly explain some of the finer details).
Anyway, case solved. Thanks, everybody.
Try the SoftFloat library; it has floatx80_to_float32, floatx80_to_float64 and floatx80_to_float128. Detect the native format, and act accordingly.
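
If portability beyond x86 is not required, another option is a sketch like the following, which assumes a compiler where long double is the x87 80-bit extended format (e.g. GCC/Clang on x86, but not MSVC) and simply lets the FPU do the conversion; how the FPU then treats the undefined encodings mentioned in the question's edit is up to the hardware. buffer holds the 10 bytes exactly as FSTP TBYTE stores them (little-endian: 8 mantissa bytes, then the sign/exponent word).

#include <cstdio>
#include <cstring>

static double x87_buffer_to_double(const unsigned char buffer[10])
{
    static_assert(sizeof(long double) >= 10, "long double is not the x87 80-bit type here");
    long double value;
    std::memset(&value, 0, sizeof value);    // sizeof(long double) is often 12 or 16
    std::memcpy(&value, buffer, 10);         // copy only the 80 significant bits
    return static_cast<double>(value);       // the FPU/compiler applies the conversion rules
}

int main()
{
    // 1.5 in extended precision: sign 0, exponent 0x3FFF, mantissa 0xC000000000000000
    const unsigned char one_point_five[10] =
        { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xC0, 0xFF, 0x3F };
    std::printf("%g\n", x87_buffer_to_double(one_point_five));   // prints 1.5
}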
The problem with finding information on subnormal 80-bit numbers might be that the 8087 does not make use of any special denormalization for them. I found this on MSDN's page on Type float (C):
The values listed in this table apply only to normalized
floating-point numbers; denormalized floating-point numbers have a
smaller minimum value. Note that numbers retained in 80x87 registers
are always represented in 80-bit normalized form; numbers can only be
represented in denormalized form when stored in 32-bit or 64-bit
floating-point variables (variables of type float and type long).
Edit
The above might be true for how Microsoft makes use of the FPU's registers. I found another source that indicates this:
FPU Data types:
The 80x87 FPU generally stores values in a normalized format. When a
floating point number is normalized, the H.O. bit is always one. In
the 32 and 64 bit floating point formats, the 80x87 does not actually
store this bit, the 80x87 always assumes that it is one. Therefore, 32
and 64 bit floating point numbers are always normalized. In the
extended precision 80 bit floating point format, the 80x87 does not
assume that the H.O. bit of the mantissa is one, the H.O. bit of the
number appears as part of the string of bits.
Normalized values provide the greatest precision for a given number of
bits. However, there are a large number of non-normalized values which
we can represent with the 80 bit format. These values are very close
to zero and represent the set of values whose mantissa H.O. bit is
zero. The 80x87 FPUs support a special form of 80 bit known as
denormalized values.