How does a 32-bit machine compute a double precision number - c++

If I only have a 32-bit machine, how does the CPU compute a double-precision number? This number is 64 bits wide. How does the FPU handle it?
The more general question would be: how do you compute something that is wider than your ALU? I fully understand the integer case: you can simply split the operands up. Yet with floating-point numbers you have the exponent and the mantissa, which need to be handled differently.

Not everything in a "32-bit machine" has to be 32-bit. The x87-style FPU was never "32-bit", and it predates AMD64 by a very long time. It has always been capable of doing math on 80-bit extended doubles, and it used to be a separate chip, so there was no chance of it using the main ALU at all.
It's wider than the ALU, yes, but it doesn't go through the ALU; the floating-point unit(s) use their own circuits, which are as wide as they need to be. These circuits are also much more complicated than the integer circuits, and they don't really share components with the integer ALUs.

There are several different concepts in computer architecture that can be measured in bits, but none of them prevents handling 64-bit floating-point numbers. Although these concepts may be correlated, it is worth considering them separately for this question.
Often, "32 bit" means that addresses are 32 bits. That limits each process's virtual memory to 2^32 addresses. It is the measure that makes the most direct difference to programs, because it affects the size of a pointer and the maximum size of in-memory data. It is completely irrelevant to the handling of floating point numbers.
Another possible meaning is the width of the paths that transfer data between memory and the CPU. That is not a hard limit on the sizes of data structures - one data item may take multiple transfers. For example, the Java Language Specification does not require atomic loads and stores of double or long. See 17.7. Non-Atomic Treatment of double and long. A double can be moved between memory and the processor using two separate 32 bit transfers.
A third meaning is the general register size. Many architectures use separate registers for floating point. Even if the general registers are only 32 bits the floating point registers can be wider, or it may be possible to pair two 32 bit floating point registers to represent one 64-bit number.
A typical relationship between these concepts is that a computer with 64 bit memory addresses will usually have 64 bit general registers, so that a pointer can fit in one general register.

Even 8 bit computers provided extended precision (80 bit) floating point arithmetic, by writing code to do the calculations.
Modern 32 bit computers (x86, ARM, older PowerPC etc.) have 32 bit integer and 64 or 80 bit floating-point hardware.

Let's look at integer arithmetic first, since it is simpler. Inside your 32-bit ALU there are 32 individual logic units with carry bits that ripple up the chain. 1 + 1 -> 10: the carry bit is carried over to the next logic unit. The entire ALU also has a carry-bit output, and you can use it to do arbitrary-length math. The only real limitation of the bit width is how many bits you can work with in one cycle. To do 64-bit math you need two or more cycles and have to handle the carry logic yourself.
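The split the answer describes can be sketched in C++: a 64-bit addition performed as two 32-bit additions with the carry handled by hand, the way a compiler lowers it for a 32-bit ALU (the function and parameter names here are illustrative, not from any real toolchain):

```cpp
#include <cstdint>

// Sketch: 64-bit addition on a 32-bit ALU as two 32-bit adds with a
// manual carry. Names are illustrative.
void add64(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi,
           uint32_t* r_lo, uint32_t* r_hi)
{
    uint32_t lo = a_lo + b_lo;     // low 32 bits (may wrap around)
    uint32_t carry = (lo < a_lo);  // wrap-around means there was a carry-out
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;   // high 32 bits absorb the carry
}
```

The same idea extends to any width: chain as many word-sized additions as needed, feeding each carry-out into the next stage.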

It seems that the question is just "how does FPU work?", regardless of bit widths.
FPU does addition, multiplication, division, etc. Each of them has a different algorithm.
Addition
(also subtraction)
Given two numbers with exponent and mantissa:
x1 = m1 * 2 ^ e1
x2 = m2 * 2 ^ e2
, the first step is normalization (aligning the exponents):
x1 = m1 * 2 ^ e1
x2 = (m2 * 2 ^ (e2 - e1)) * 2 ^ e1 (assuming e1 >= e2; in hardware this means shifting the mantissa of the smaller number to the right)
Then one can add the mantissas:
x1 + x2 = (whatever) * 2 ^ e1
Then, one should convert the result to a valid mantissa/exponent form (e.g., the (whatever) part might be required to be between 2^23 and 2^24). This is called "renormalization" if I am not mistaken. Here one should also check for overflow and underflow.
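The addition steps above can be sketched in C++ for two positive, normal IEEE-754 binary32 values. This is a hedged illustration, not a complete FPU: rounding modes, signs, NaNs, infinities, and subnormals are deliberately ignored, and the name soft_add is made up:

```cpp
#include <cstdint>
#include <cstring>

// Simplified sketch of float addition for two positive, normal binary32
// values: align exponents, add mantissas, renormalize. Special cases and
// rounding are intentionally omitted.
float soft_add(float x, float y)
{
    uint32_t a, b;
    std::memcpy(&a, &x, 4);
    std::memcpy(&b, &y, 4);
    if (a < b) { uint32_t t = a; a = b; b = t; }  // make a the larger value

    int32_t  ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    uint32_t ma = (a & 0x7FFFFF) | 0x800000;      // restore the implicit 1
    uint32_t mb = (b & 0x7FFFFF) | 0x800000;

    uint32_t d = ea - eb;                // align: shift the smaller
    mb = (d < 24) ? (mb >> d) : 0;       // mantissa right (or lose it)

    uint32_t m = ma + mb;                          // add the mantissas
    if (m & 0x1000000) { m >>= 1; ea++; }          // renormalize to [2^23, 2^24)

    uint32_t r = ((uint32_t)ea << 23) | (m & 0x7FFFFF);
    float out;
    std::memcpy(&out, &r, 4);
    return out;
}
```

Note that for positive floats the raw bit patterns order the same way as the values, which is why the initial comparison works on the integers directly.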
Multiplication
Just multiply the mantissas and add the exponents. Then renormalize the multiplied mantissas.
Division
Do a "long division" algorithm on the mantissas, then subtract the exponents. Renormalization might not be necessary (depending on how you implement the long division).
Sine/Cosine
Convert the input to a range [0...π/2], then run the CORDIC algorithm on it.
Etc.

Related

How does the CPU "cast" a floating point x87 (I think) value?

I just wanted to know how the CPU "casts" a floating-point number.
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong? (I couldn't find the answer.) So, if this is the case and the floating-point numbers are not emulated, how does the compiler cast them?
I mean, I suppose that when we use a "float" or "double" in C/C++ the compiler is using the x87 unit, or am I wrong?
On modern Intel processors, the compiler is likely to use the SSE/AVX registers. The FPU is often not in regular use.
I just wanted to know how the CPU "casts" a floating-point number.
Converting an integer to a floating-point number is a computation that is basically (glossing over some details):
Start with the binary (for unsigned types) or two’s complement (for signed types) representation of the integer.
If the number is zero, return all bits zero.
If it is negative, remember that and negate the number to make it positive.
Locate the highest bit set in the integer.
Locate the lowest bit that will fit in the significand of the destination format. (For example, for the IEEE-754 binary32 format commonly used for float, 24 bits fit in the significand, so the 25th bit after the highest bit set does not fit.)
Round the number at that position where the significand will end.
Calculate the exponent, which is a function of where the highest bit set is. Add a “bias” used in encoding the exponent (127 for binary32, 1023 for binary64).
Assemble a sign bit, bits for the exponent, and bits for the significand (omitting the high bit, because it is always one). Return those bits.
That computation prepares the bits that represent a floating-point number. (It omits details involving special cases like NaNs, infinities, and subnormal numbers because these do not occur when converting typical integer formats to typical floating-point formats.)
That computation may be performed “in software” (that is, with general instructions for shifting bits, testing values, and so on) or “in hardware” (that is, with special instructions for doing the conversion). All desktop computers have instructions for this. Small processors for special-purpose embedded use might not have such instructions.
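The listed steps can be sketched in software for an unsigned 32-bit integer. This is an illustration under assumptions stated in the comments (the function name is made up; the only special case handled is zero, matching the list above):

```cpp
#include <cstdint>

// Sketch of the steps above: build the binary32 bit pattern for a
// uint32_t, rounding to nearest-even at the 24-bit significand boundary.
uint32_t u32_to_f32_bits(uint32_t n)
{
    if (n == 0) return 0;                  // step: zero returns all bits zero

    int hi = 31;                           // step: locate the highest set bit
    while (!(n & (1u << hi))) hi--;

    int32_t exp = hi + 127;                // step: biased exponent (bias 127)
    uint32_t m = n << (31 - hi);           // normalize: top set bit at bit 31

    uint32_t keep = m >> 8;                // top 24 bits fit the significand
    uint32_t rest = m & 0xFF;              // bits that don't fit
    if (rest > 0x80 || (rest == 0x80 && (keep & 1)))
        keep++;                            // step: round to nearest-even
    if (keep & 0x1000000) { keep >>= 1; exp++; }  // rounding carried out

    // step: assemble exponent and significand (implicit high bit dropped)
    return ((uint32_t)exp << 23) | (keep & 0x7FFFFF);
}
```

Hardware conversion instructions do the same work in one opcode; this sketch just makes the individual steps visible.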
It is not clear what you mean by
"cast" a floating-point number.
If the target architecture has an FPU, the compiler will issue FPU instructions to manipulate floating-point variables; no mystery there...
In order to assign a float variable to an int variable, the float must be truncated or rounded (up or down). Special instructions usually exist for this purpose.
If the target architecture is "FPU-less", the compiler (toolchain) might provide a software implementation of floating-point operations using the CPU instructions available. For example, an expression like a = x * y; will be compiled as if it were a = fmul(x, y);, where fmul() is a compiler-provided special function (intrinsic) that does floating-point operations without an FPU. Of course this is typically MUCH slower than using a hardware FPU. Floating-point arithmetic tends to be avoided on such platforms if performance matters; fixed-point arithmetic (https://en.wikipedia.org/wiki/Fixed-point_arithmetic) can be used instead.

Is it faster to multiply low numbers in C/C++ (as opposed to high numbers)?

Example of question:
Is calculating 123 * 456 faster than calculating 123456 * 7890? Or is it the same speed?
I'm wondering about 32 bit unsigned integers, but I won't ignore answers about other types (64 bit, signed, float, etc.). If it is different, what is the difference due to? Whether or not the bits are 0/1?
Edit: If it makes a difference, I should clarify that I'm referring to any number (two random numbers lower than 100 vs two random numbers higher than 1000)
For builtin types up to at least the architecture's word size (e.g. 64 bit on a modern PC, 32 or 16 bit on most low-cost general-purpose CPUs from the last couple of decades), for every compiler/implementation/version and CPU I've ever heard of, the CPU opcode for multiplication of a particular integral size takes a certain number of clock cycles irrespective of the quantities involved. Multiplications of data with different sizes perform differently on some CPUs (e.g. the AMD K7 has 3 cycles latency for 16-bit IMUL vs 4 for 32-bit).
It is possible that on some architecture and compiler/flags combination, a type like long long int has more bits than the CPU opcodes can operate on in one instruction, so the compiler may emit code to do the multiplication in stages and that will be slower than multiplication of CPU-supported types. But again, a small value stored at run-time in a wider type is unlikely to be treated - or perform - any differently than a larger value.
All that said, if one or both values are compile-time constants, the compiler is able to avoid the CPU multiplication operator and optimise to addition or bit shifting operators for certain values (e.g. 1 is obviously a no-op, either side 0 ==> 0 result, * 4 can sometimes be implemented as << 2). There's nothing in particular stopping techniques like bit shifting being used for larger numbers, but a smaller percentage of such numbers can be optimised to the same degree (e.g. there're more powers of two - for which multiplication can be performed using bit shifting left - between 0 and 1000 than between 1000 and 2000).
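The rewrites described above can be written out by hand to see what the compiler targets (a sketch; real compilers apply this to the generated code, not to your source):

```cpp
// Sketch of strength reduction a compiler may apply when one operand is a
// compile-time constant: replace the multiply with shifts and adds.
unsigned mul_by_4(unsigned x)  { return x << 2; }              // x * 4
unsigned mul_by_10(unsigned x) { return (x << 3) + (x << 1); } // x*8 + x*2
```

Powers of two become a single shift; other constants decompose into a few shifts and adds, which may or may not beat the hardware multiplier on a given CPU.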
This is highly dependent on the processor architecture and model.
In the old days (ca 1980-1990), the number of ones in the two numbers would be a factor - the more ones, the longer it took to multiply [after sign adjustment, so multiplying by -1 wasn't slower than multiplying by 1, but multiplying by 32767 (15 ones) was notably slower than multiplying by 17 (2 ones)]. That's because a multiply is essentially:
unsigned int multiply(unsigned int a, unsigned int b)
{
    unsigned int res = 0;
    for (int i = 0; i < 32; i++)   /* number of bits */
    {
        if (b & 1)
        {
            res += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return res;
}
In modern processors, multiply is quite fast either way, but a 64-bit multiply can be a clock cycle or two slower than a 32-bit one. Simply put, modern processors can "afford" to put down the whole logic for doing this in a single cycle, both when it comes to the speed of the transistors themselves and the area that those transistors take up.
Further, in the old days there were often instructions to do 16 x 16 -> 32 bit results, but if you wanted 32 x 32 -> 32 (or 64), the compiler would have to call a library function [or inline such a function]. Today, I'm not aware of any modern high-end processor [x86, ARM, PowerPC] that can't do at least 64 x 64 -> 64, and some do 64 x 64 -> 128, all in a single instruction (not always a single cycle tho').
Note that I'm completely ignoring the fact that "if the data is in cache is an important factor". Yes, that is a factor - and it's a bit like ignoring wind resistance when traveling at 200 km/h - it's not at all something you ignore in the real world. However, it is quite unimportant for THIS discussion. Just like people making sports cars care about aerodynamics, to get complex [or simple] software to run fast involves a certain amount of caring about the cache-content.
For all intents and purposes, the same speed (even if there were differences in computation speed, they would be immeasurable). Here is a reference benchmarking different CPU operations if you're curious: http://www.agner.org/optimize/instruction_tables.pdf.

32-bit multiplication without using a 64-bit intermediate number

Is there any way to multiply two 32-bit floating point numbers without using a 64-bit intermediate value?
Background:
In an IEEE floating-point number, 1 bit is devoted to the sign, 8 bits are devoted to the exponent, and 23 bits are devoted to the mantissa. When multiplying two numbers, the mantissas have to be multiplied separately. When doing this, you will end up with a 48-bit number (since the most significant bit of 1 is implied). The 48-bit product then has to be truncated so that only the 24 most significant bits (the implied 1 plus the 23 stored bits) are retained in the result.
My question is: to do this multiplication as-is, you would need a 64-bit number to store the intermediate result. But I'm assuming there is a way to do this without a 64-bit number, since 32-bit architectures didn't have the luxury of 64-bit numbers and they were still able to do 32-bit floating-point multiplication. So how can you do this without a 64-bit intermediate?
From https://isocpp.org/wiki/faq/newbie#floating-point-arith2 :
floating point calculations and comparisons are often performed by
special hardware that often contain special registers, and those
registers often have more bits than a double.
So even on a 32-bit architecture you probably have wider-than-32-bit registers for floating-point operations.
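Even without wide registers, the question has a direct answer: split each 24-bit mantissa into 12-bit halves and do schoolbook multiplication, so every partial product fits in 32 bits. A sketch (the function name and the hi/lo output convention are made up for illustration):

```cpp
#include <cstdint>

// Multiply two 24-bit mantissas using only 32-bit arithmetic by splitting
// each into 12-bit halves; the 48-bit product is returned as two 24-bit
// words. Every partial product is at most 24 bits, so nothing overflows.
void mul24x24(uint32_t a, uint32_t b, uint32_t* hi, uint32_t* lo)
{
    uint32_t a_hi = a >> 12, a_lo = a & 0xFFF;
    uint32_t b_hi = b >> 12, b_lo = b & 0xFFF;

    uint32_t p0 = a_lo * b_lo;               // contributes to bits 0..23
    uint32_t p1 = a_lo * b_hi + a_hi * b_lo; // contributes from bit 12 up
    uint32_t p2 = a_hi * b_hi;               // contributes from bit 24 up

    uint32_t low = p0 + ((p1 & 0xFFF) << 12);  // assemble the low 24 bits
    *lo = low & 0xFFFFFF;
    *hi = p2 + (p1 >> 12) + (low >> 24);       // fold in the carries
}
```

This is exactly how 32-bit machines compose wider products in general; hardware multipliers do the same decomposition in silicon.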

More Precise Floating point Data Types than double?

In my project I have to compute division, multiplication, subtraction, addition on a matrix of double elements.
The problem is that as the size of the matrix increases, the accuracy of my output is drastically affected.
Currently I am using double for each element, which I believe uses 8 bytes of memory and has an accuracy of about 16 significant digits, irrespective of the decimal position.
Even for large size of matrix the memory occupied by all the elements is in the range of few kilobytes. So I can afford to use datatypes which require more memory.
So I wanted to know which data type is more precise than double.
I tried searching in some books and found long double,
but I don't know what its precision is.
And what if I want more precision than that?
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits typically padded to 12 or 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you about 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
You might want to consider the sequence of operations, i.e. do the additions in an ordered sequence starting with the smallest values first. This will increase overall accuracy of the results using the same precision in the mantissa:
1e00 + 1e-16 + ... + 1e-16 (1e16 times) = 1e00
1e-16 + ... + 1e-16 (1e16 times) + 1e00 = 2e00
The point is that adding small numbers to a large number makes them disappear, so the latter approach reduces the numerical error.
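The ordering effect described above is easy to reproduce. At 1.0e8f the spacing between adjacent floats (the ULP) is 8.0, so a 1.0f increment is lost every time; summing the small terms first keeps them (function names are illustrative):

```cpp
// Sketch of the summation-order effect: at 1e8f the float spacing is 8.0,
// so adding 1.0f one term at a time rounds back down every time.
float large_first()
{
    float sum = 1.0e8f;
    for (int i = 0; i < 10000; i++) sum += 1.0f;  // each add is lost
    return sum;                                   // still 1.0e8f
}

float small_first()
{
    float sum = 0.0f;
    for (int i = 0; i < 10000; i++) sum += 1.0f;  // exact: 10000.0f
    return sum + 1.0e8f;                          // 100010000.0f, exact
}
```

Pairwise or Kahan summation generalizes this idea when the inputs cannot be sorted in advance.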
Floating point data types with greater precision than double are going to depend on your compiler and architecture.
In order to get more than double precision, you may need to rely on some math library that supports arbitrary precision calculations. These probably won't be fast though.
On Intel architectures, long double is the 80-bit extended format (with a 64-bit mantissa).
What kind of values do you want to represent? Maybe you are better off using fixed precision.

Floating point versus fixed point: what are the pros/cons?

Floating point type represents a number by storing its significant digits and its exponent separately on separate binary words so it fits in 16, 32, 64 or 128 bits.
Fixed point type stores numbers with 2 words, one representing the integer part, another representing the part past the radix, in negative exponents, 2^-1, 2^-2, 2^-3, etc.
Floats are better because they have a wider range in an exponent sense, but not if one wants to store a number with more precision over a certain range, for example only using integers from -16 to 16, thus using more bits to hold digits past the radix.
In terms of performance, which one performs better, or are there cases where one is faster than the other?
In video game programming, does everybody use floating point because the FPU makes it faster, or because the performance drop is just negligible, or do they make their own fixed-point type?
Why isn't there any fixed-point type in C/C++?
That definition covers a very limited subset of fixed point implementations.
It would be more correct to say that in fixed point only the mantissa is stored and the exponent is a constant determined a priori. There is no requirement for the binary point to fall inside the mantissa, and definitely no requirement that it fall on a word boundary. For example, all of the following are "fixed point":
64-bit mantissa, scaled by 2^-32 (this fits the definition listed in the question)
64-bit mantissa, scaled by 2^-33 (now the integer and fractional parts cannot be separated by an octet boundary)
32-bit mantissa, scaled by 2^4 (now there is no fractional part)
32-bit mantissa, scaled by 2^-40 (now there is no integer part)
GPUs tend to use fixed point with no integer part (typically a 32-bit mantissa scaled by 2^-32). Therefore APIs such as OpenGL and Direct3D often use floating-point types which are capable of holding these values. However, manipulating the integer mantissa is often more efficient, so these APIs allow specifying coordinates (in texture space, color space, etc.) this way as well.
As for your claim that C++ doesn't have a fixed point type, I disagree. All integer types in C++ are fixed point types. The exponent is often assumed to be zero, but this isn't required and I have quite a bit of fixed-point DSP code implemented in C++ this way.
At the code level, fixed-point arithmetic is simply integer arithmetic with an implied denominator.
For many simple arithmetic operations, fixed-point and integer operations are essentially the same. However, there are some operations for which the intermediate values must be represented with a higher number of bits and then rounded off. For example, to multiply two 16-bit fixed-point numbers, the result must be temporarily stored in 32 bits before renormalizing (or saturating) back to 16-bit fixed point.
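That widening step looks like this for a Q8.8 format (8 integer bits, 8 fractional bits); the Q-notation and function name are just conventions for this sketch:

```cpp
#include <cstdint>

// Sketch: Q8.8 fixed-point multiply. The 32-bit intermediate holds the
// full product (a Q16.16 value) before shifting back down to Q8.8.
int16_t q88_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;  // widen first to avoid overflow
    return (int16_t)(wide >> 8);             // renormalize to Q8.8
}
```

For example, 1.5 is stored as 384 (1.5 * 256) and 2.0 as 512; their product renormalizes to 768, which is 3.0.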
When the software does not take advantage of vectorization (such as CPU-based SIMD or GPGPU), integer and fixed-point arithmetic is faster than the FPU. When vectorization is used, the efficiency of the vectorization matters a lot more, such that the performance difference between fixed point and floating point is moot.
Some architectures provide hardware implementations for certain math functions, such as sin, cos, atan, sqrt, for floating-point types only. Some architectures do not provide any hardware implementation at all. In both cases, specialized math software libraries may provide those functions using only integer or fixed-point arithmetic. Often, such libraries will provide multiple levels of precision, for example answers which are only accurate up to N bits of precision, which is less than the full precision of the representation. The limited-precision versions may be faster than the highest-precision version.
Fixed point is widely used in DSP and embedded-systems where often the target processor has no FPU, and fixed point can be implemented reasonably efficiently using an integer ALU.
In terms of performance, that is likely to vary depending on the target architecture and application. Obviously if there is no FPU, then fixed point will be considerably faster. When you have an FPU it will depend on the application too. For example, performing some functions such as sqrt() or log() will be much faster when directly supported in the instruction set rather than implemented algorithmically.
There is no built-in fixed-point type in C or C++, I imagine, because they (or at least C) were envisaged as systems-level languages and the need for fixed point is somewhat domain-specific, and also perhaps because on a general-purpose processor there is typically no direct hardware support for fixed point.
In C++, defining a fixed-point data type class with suitable operator overloads and associated math functions can easily overcome this shortcoming. However there are good and bad solutions to this problem. A good example can be found here: http://www.drdobbs.com/cpp/207000448. The link to the code in that article is broken, but I tracked it down to ftp://66.77.27.238/sourcecode/ddj/2008/0804.zip
You need to be careful when discussing "precision" in this context.
For the same number of bits in representation the maximum fixed point value has more significant bits than any floating point value (because the floating point format has to give some bits away to the exponent), but the minimum fixed point value has fewer than any non-denormalized floating point value (because the fixed point value wastes most of its mantissa in leading zeros).
Also depending on the way you divide the fixed point number up, the floating point value may be able to represent smaller numbers meaning that it has a more precise representation of "tiny but non-zero".
And so on.
The difference between floating-point and integer math depends on the CPU you have in mind. On Intel chips the difference is not big in clock ticks. Int math is still faster because there are multiple integer ALUs that can work in parallel. Compilers are also smart enough to use special address-calculation instructions to optimize an add/multiply into a single instruction. Conversion counts as an operation too, so just choose your type and stick with it.
In C++ you can build your own type for fixed-point math. You just define a struct with one int, overload the appropriate operators, and make them do what they normally do plus a shift to put the radix point back in the right position.
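A minimal sketch of such a type, here using a Q16.16 layout (the class name Fixed and its interface are made up for illustration):

```cpp
#include <cstdint>

// Minimal Q16.16 fixed-point type built the way the answer describes: a
// struct around one integer, with overloads that shift the radix point
// back into place after each operation.
struct Fixed {
    int32_t raw;  // stored value is the real value times 2^16

    static Fixed from_int(int v) { return Fixed{v * 65536}; }
    int to_int() const           { return raw >> 16; }  // arithmetic shift assumed

    Fixed operator+(Fixed o) const { return Fixed{raw + o.raw}; }
    Fixed operator*(Fixed o) const {
        // widen, multiply, then shift the radix point back into place
        return Fixed{(int32_t)(((int64_t)raw * o.raw) >> 16)};
    }
};
```

A production version would also handle rounding, saturation, division, and comparisons, but the pattern is the same throughout: plain integer math plus a corrective shift.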
You don't use float in games because it is faster or slower; you use it because it is easier to implement the algorithms in floating point than in fixed point. You are assuming the reason has to do with computing speed, and that is not the reason; it has to do with ease of programming.
For example, you may define the width of the screen/viewport as going from 0.0 to 1.0, the height of the screen from 0.0 to 1.0, the depth of the world from 0.0 to 1.0, and so on. Matrix math, etc. makes things really easy to implement. Do all of the math that way up to the point where you need to compute real pixels on a real screen size, say 800x400: project the ray from the eye to the point on the object in the world, compute where it pierces the screen using 0-to-1 math, then multiply x by 800 and y by 400 and place that pixel.
Floating point does not store the exponent and mantissa in separate words, and the mantissa is a goofy size: whatever is left over after the exponent and sign, like 23 bits, not 16 or 32 or 64 bits.
Floating-point math at its core uses fixed-point logic with extra logic and extra steps required. By definition, comparing apples to apples, fixed-point math is cheaper because you don't have to manipulate the data on the way into the ALU and don't have to manipulate the data on the way out (normalize). When you add in IEEE and all of its baggage, that adds even more logic and more clock cycles (properly signed infinity, quiet and signaling NaNs, different results for the same operation if there is an exception handler enabled). As someone pointed out in a comment, in a real system where you can do fixed and float in parallel, you can take advantage of some or all of the processors and recover some clocks that way. Both float and fixed clock rates can be increased by using vast quantities of chip real estate; fixed will remain cheaper, but float can approach fixed speeds using these kinds of tricks as well as parallel operation.
One issue not covered in the answers is power consumption. Though it highly depends on the specific hardware architecture, the FPU usually consumes much more energy than the ALU, so if you target mobile applications where power consumption is important, it's worth considering a fixed-point implementation of the algorithm.
It depends on what you're working on. If you're using fixed point then you lose precision; you have to select the number of places after the radix point (which may not always be good enough). In floating point you don't need to worry about this, as the precision offered is nearly always good enough for the task in hand; it uses a standard-form (scientific-notation) representation of the number.
The pros and cons come down to speed and resources. On modern 32-bit and 64-bit platforms there is really no need to use fixed point. Most systems come with built-in FPUs that are hardwired and optimised for floating-point operations. Furthermore, most modern CPUs come with SIMD instruction sets which help optimise vector-based methods via vectorisation and unrolling. So fixed point only comes with a downside.
On embedded systems and small microcontrollers (8-bit and 16-bit) you may not have an FPU or extended instruction sets, in which case you may be forced to use fixed-point methods or limited floating-point instruction sets that are not very fast. In these circumstances fixed point will be a better, or even your only, choice.