Algorithm to convert double to long double


If double is a 64 bit IEEE-754 type and long double is either an 80 or 128 bit IEEE-754 type, what is the algorithm that is used by the hardware (or the compiler?) in order to perform the conversion:
double d = 3.14159;
long double ld = (long double) d;
Also, it would be amazing if someone could list a source for the algorithm, as I've had no luck finding one thus far.

For normal numbers like 3.14159, the procedure is as follows:
1) Separate the number into sign, biased exponent, and significand.
2) Add the difference in the exponent biases for long double and double (0x3fff - 0x3ff) to the exponent.
3) Assemble the sign, new exponent, and significand (remembering to make the leading bit explicit in the Intel 80-bit format).
In practice, on common hardware with the Intel 80-bit format, the “conversion” is just a load instruction to the x87 stack (FLD). One rarely needs to muck around with the actual representation details, unless targeting a platform without hardware support.
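The three steps above can be sketched in portable C++ for the normal-number case. This is only an illustration under stated assumptions: double is IEEE-754 binary64, the target is the Intel 80-bit layout, and `Extended80` and `widen_normal` are hypothetical names invented for this example. Zeros, subnormals, infinities and NaNs are deliberately not handled.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == 8, "sketch assumes a 64-bit IEEE-754 double");

// Hypothetical container for the two fields of the Intel 80-bit format.
struct Extended80 {
    std::uint16_t sign_exponent; // 1 sign bit followed by a 15-bit biased exponent
    std::uint64_t significand;   // 64 bits, with the leading integer bit explicit
};

// Widens a *normal* binary64 value only.
inline Extended80 widen_normal(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits); // read the representation without aliasing issues

    // Step 1: separate sign, biased exponent, and the 52 stored fraction bits.
    const std::uint64_t sign     = bits >> 63;
    const std::uint64_t exponent = (bits >> 52) & 0x7FF;
    const std::uint64_t fraction = bits & ((1ULL << 52) - 1);

    Extended80 out;
    // Step 2: rebias the exponent from 0x3ff to 0x3fff.
    out.sign_exponent = static_cast<std::uint16_t>(
        (sign << 15) | (exponent + (0x3FFF - 0x3FF)));
    // Step 3: make the implicit leading 1 explicit and left-align the fraction.
    out.significand = (1ULL << 63) | (fraction << 11);
    return out;
}
```

For example, `widen_normal(1.0)` yields exponent field 0x3FFF and significand 0x8000000000000000. Real code would simply write `long double ld = d;` and let the compiler emit the load.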

It's defined in the C Standard - search for N1570 to find a freely available draft (of C11). Since all "double" values can be represented in "long double", the result is a long double with the same value. I don't think you will find a precise description of the algorithm that the hardware uses, but it's quite straightforward if you look at the data formats:
Examine the exponent and mantissa bits to find out whether the number is an Infinity, a NaN, a normalized number, a denormalised number or a zero. Produce a long double Infinity or NaN when needed. For normalized numbers, adjust the exponent and shift the mantissa bits into the right place, making the implicit highest mantissa bit explicit where the target format stores it. Convert denormalised numbers to normalised numbers, and zeroes to long double zeroes.
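A quick way to see the "same value" guarantee for every class of input is to widen, narrow back, and compare. The sketch below uses only standard `<cmath>` facilities; the helper names `widening_preserves` and `widened_class` are made up for this example.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true when widening d to long double and narrowing
// back gives the same value (NaN is checked by class, since NaN != NaN).
inline bool widening_preserves(double d) {
    const long double ld = d; // implicit, value-preserving conversion
    if (std::isnan(d)) return std::isnan(ld);
    return static_cast<double>(ld) == d &&
           std::signbit(ld) == std::signbit(d); // also distinguishes -0.0 from +0.0
}

// Hypothetical helper: floating-point class of the widened value
// (FP_ZERO, FP_NORMAL, FP_SUBNORMAL, FP_INFINITE or FP_NAN).
inline int widened_class(double d) {
    return std::fpclassify(static_cast<long double>(d));
}
```

On a platform where long double has a wider exponent range than double, `widened_class(std::numeric_limits<double>::denorm_min())` should report FP_NORMAL, which matches the "convert denormalised numbers to normalised numbers" step. Where long double is the same format as double, it stays FP_SUBNORMAL.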

Related

converting really large int to double, loss of precision on some computer

I'm currently learning inter-type data conversion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks

Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 2^53. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the x87 floating point unit of the x86 architecture uses an 80-bit extended floating point format, whose 64-bit mantissa can represent all 64 bit integers precisely. With many compilers on x86 it can be accessed using the long double type.
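A small round-trip check makes the 2^53 boundary concrete. This is a sketch assuming a 64-bit IEEE-754 double; `survives_double` is a name invented for the example.

```cpp
#include <cstdint>

// Hypothetical helper: true when n converts to double and back unchanged.
inline bool survives_double(std::int64_t n) {
    const double d = static_cast<double>(n); // rounds if n needs more than 53 bits
    // Values near INT64_MAX round up to 2^63, which no longer fits in
    // std::int64_t; converting that back would be undefined behaviour.
    if (d >= 9223372036854775808.0) return false;
    return static_cast<std::int64_t>(d) == n;
}
```

With these assumptions, `survives_double(1LL << 53)` holds, while `survives_double((1LL << 53) + 1)` does not, because the odd neighbour of 2^53 has no double representation. Any 32 bit value passes.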

This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of a mantissa (which represents the digits) and an exponent. As the mantissa of the double in such a case has fewer bits than the int (53 versus 64 for an IEEE-754 double), the double can represent fewer digits and a loss of precision happens.

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?

For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.

I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (i.e. 1/7 or sqrt(2)), you will also want ways of detecting:
If the number is rational, whether it has repeating decimals, or cyclic decimals.
Or, What happens when you have an irrational number?
Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent "exactly" (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.

Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (decimal point). Since the bit before the dot is always one, this equates to a 24-bit fraction. Dividing by log2(10) = 3.3, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the sign and the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~ 10^±38, double goes to ~ 10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
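The digit counts quoted above can be reproduced from `std::numeric_limits`, whose `digits` member gives the number of significand bits including the implicit one. A minimal sketch, with `approx_decimal_digits` as a made-up helper name:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: significand bits divided by log2(10), i.e. the
// approximate number of decimal digits the type can carry.
template <typename T>
double approx_decimal_digits() {
    return std::numeric_limits<T>::digits / std::log2(10.0);
}
```

On an IEEE-754 platform `approx_decimal_digits<float>()` is about 7.2 and `approx_decimal_digits<double>()` about 15.95. The result for long double depends on the platform (about 19.3 for the Intel 80-bit format). The standard also exposes conservative integer versions of these figures as `digits10` and `max_digits10`.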

A very detailed post that may or may not answer your question.
An entire series in floating point complexities!

Couldn't you simply store it to a float and a double variable and then compare these two? This should implicitly convert the float back to a double - if there is no difference, the float is sufficient?
float f = value;
double d = value;
if ((double)f == d)
{
// float is sufficient
}
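Wrapped into a function, the comparison idea might look like the sketch below. It assumes the value is already available as a double, so it only answers "does this double survive a trip through float", not whether the original real number was representable. `float_is_sufficient` is a hypothetical name.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true when value is exactly representable as a float.
inline bool float_is_sufficient(double value) {
    // Finite values beyond the float range cannot be stored; check first so
    // the narrowing conversion below is never applied to an out-of-range value.
    if (std::isfinite(value) &&
        std::fabs(value) > std::numeric_limits<float>::max()) {
        return false;
    }
    const float f = static_cast<float>(value); // rounds to the nearest float
    return static_cast<double>(f) == value;    // NaN compares unequal, so it yields false
}
```

Values such as 0.5 or 16777216.0 (2^24) pass, while 0.1 and 16777217.0 do not. As the other answers point out, passing this test says nothing about rounding error in later arithmetic.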

You cannot represent real numbers with float or double variables, but only a subset of the rational numbers.
When you do floating point computation, the floating point unit of your CPU rounds each result to a nearby representable value for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.

Some floating point precision and numeric limits question

I know that there are tons of questions like this one, but I couldn't find my answers. Please read before voting to close (:
According to PC ASM:
The numeric coprocessor has eight floating point registers.
Each register holds 80 bits of data.
Floating point numbers are always stored as 80-bit
extended precision numbers in these registers.
1) How is that possible, when sizeof shows different things? For example, on x64 architecture, sizeof(double) is 8, and this is far away from 80 bits.
2) Why does std::numeric_limits< long double >::max() give me 1.18973e+4932? This is a huge number. If this is not the way to get the max of floating point numbers, then why does this compile at all, and even more, why does it return a value?
3) What does this mean:
Double precision magnitudes can range from approximately 10^−308 to 10^308
These are huge numbers; how can you store them in 8 bytes or even 16 bytes (which is extended precision and only 128 bits)?
Obviously, I'm missing something. Actually, obviously, a lot of things.

1) sizeof is the size in memory, not in a register. sizeof is in bytes, so 8 bytes = 64 bits. When doubles are loaded into the x87 registers (on this architecture), they are widened to 80 bits, which gives 16 extra bits for more precise intermediate calculations. When the value is stored back to memory as a double, the extra 16 bits are lost.
2) Why do you think long double doesn't go up to 1.18973e+4932?
3) Why can't you store 10^308 in 8 bytes? I only need 13 bits: 4 to store the 10, and 9 to store the 308.

A double is not an Intel coprocessor 80 bit floating point, it is an IEEE 754 64 bit floating point. With sizeof(double) you will get the size of the latter.
This is the correct way to get the maximum value for long double, so there is nothing wrong with the result.
You are probably missing that floating point numbers are not exact numbers. A double near 10^308 doesn't store 308 digits, only about 16 significant decimal digits (about 19 for an 80-bit long double).

The size of space that the FPU uses and the amount of space used in memory to represent double are two different things. IEEE 754 (which probably most architectures use) specifies 32-bit single precision and 64-bit double precision numbers, which is why sizeof(double) gives you 8 bytes. Intel x86 does floating point math internally using 80 bits.
std::numeric_limits< long double >::max() is giving you the correct maximum value for long double, which is typically an 80-bit type on x86. If you want the maximum for the 64 bit double you should use double as the template parameter.
For the question about ranges, why do you think you can't store them in 8 bytes? They do in fact fit, and what you're missing is that at the extremes of the range there are numbers that can't be represented (for example, with the exponent nearing 308, there are many, many integers that can't be represented at all).
See also http://floating-point-gui.de/ for info about floating point.
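Both points, the range reported by `std::numeric_limits` and the growing gaps between representable values, can be inspected directly. The sketch below is illustrative only; `RangeInfo`, `range_info` and `gap_above` are names invented for this example.

```cpp
#include <cmath>
#include <limits>

// Hypothetical summary of a floating point type's storage size and range.
template <typename T>
struct RangeInfo {
    int bytes;          // sizeof(T): size in memory, not in a register
    int max_exponent;   // finite values are below 2^max_exponent
    int max_exponent10; // largest power of ten that is representable
    T   max;            // largest finite value
};

template <typename T>
RangeInfo<T> range_info() {
    using L = std::numeric_limits<T>;
    return { static_cast<int>(sizeof(T)), L::max_exponent, L::max_exponent10, L::max() };
}

// Distance from x to the next representable double above it.
inline double gap_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

For an IEEE-754 double, `range_info<double>()` reports 8 bytes, a binary exponent limit of 1024 and a decimal one of 308. `range_info<long double>()` is platform-dependent; with the Intel 80-bit format the decimal limit is 4932, which is where 1.18973e+4932 comes from. `gap_above(1e308)` is on the order of 10^292, so almost no integers in that neighbourhood are representable.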

Floating point numbers on computers are usually represented according to IEEE 754-2008.
It defines several formats, amongst which
binary32 = Single precision,
binary64 = Double precision and
binary128 = Quadruple precision are the most common.
http://en.wikipedia.org/wiki/IEEE_754-2008#Basic_formats
Double precision numbers have 52 stored bits for the digits, which give the precision, and 11 bits for the exponent, which give the range of the number.
So doubles are 1.xxx(52 binary digits) * 2 ^ exponent, where the 11-bit exponent field allows exponents up to 1023, so finite values stay just below 2^1024.
And 2^1024 = 1.79 * 10^308
Which is why this is the largest value you can store in a double.
The x87 80-bit extended format and the quadruple precision format both use 15 bits for the exponent (with 64 and 113 bits of precision respectively), so their finite values stay just below 2^16384.
As 2^16384 gives 1.18 * 10^4932, you see that your C++ test was perfectly correct: that value is the maximum of long double, which on x86 is typically the 80-bit extended format, while double itself remains a 64-bit value.

Why are c/c++ floating point types so oddly named?

C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.

The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later when the IEEE standard made allowances for the Intel x87 family of floating-point chips (8087, 80287), which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
The questioner is right about what the C and C++ standards themselves guarantee, but in practice almost all current implementations provide IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day, it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.

You probably knew this, but you can make literal floats/long doubles
float f = 4.0f;
long double ld = 4.0L;
Double is the default because that's what most people use. Long doubles may be overkill, and floats have very limited precision. Double works for almost every application.
Why the naming? One day all we had was 32 bit floating point numbers (well really all we had was fixed point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language du jour then, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the CPUs/FP units of the time, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed a name for something twice the size of a float, hence we got a double. Then someone needed even more precision, we thought he was crazy. We added it anyway. The name quadruple was overkill. Long double was good enough, and nobody made a lot of noise.
Part of the confusion is that good-ole "int" seems to change with the time. It used to be that "int" meant 16 bit integer. Float, however, is in practice tied to the 32-bit IEEE floating point number. For that reason, C kept float at 32 bits and made double and long double refer to the longer formats.

Literals
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select the smallest type large enough to hold the value, with some added complexity for signed/unsigned), with floats it is not this easy. Many values (including simple ones like 0.1) cannot be exactly represented by floating point numbers, and therefore the choice of type affects not only performance, but also the resulting value. It seems the C language designers preferred robustness over performance in this case and therefore decided the default representation should be the more exact one.
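The effect of the literal's type on its value is easy to demonstrate. The sketch below checks the types of the three literal forms at compile time and compares a float literal with the double literal of the same spelling; `literals_agree` is a made-up helper name.

```cpp
#include <type_traits>

// The suffix selects the type of a floating literal; no suffix means double.
static_assert(std::is_same<decltype(4.0f), float>::value, "f suffix gives float");
static_assert(std::is_same<decltype(4.0), double>::value, "unsuffixed literal is double");
static_assert(std::is_same<decltype(4.0L), long double>::value, "L suffix gives long double");

// Hypothetical helper: true when the float and the double hold the same value.
inline bool literals_agree(float f, double d) {
    return static_cast<double>(f) == d;
}
```

`literals_agree(4.0f, 4.0)` holds because 4 is exactly representable in both types, whereas `literals_agree(0.1f, 0.1)` does not: each literal is rounded to the nearest value of its own type, and those two roundings of 0.1 differ.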
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.

First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.

The two most common floating point formats use 32-bits and 64-bits, the longer one is "double" the size of the first one so it was called a "double".

A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is a 32-bit, then your double will be a 64-bit.
The name double precision is a bit of a misnomer, since a double precision float has a mantissa precision of 52 bits, where a single precision float has a mantissa precision of 23 bits (double that would be 46). More on floating point here: Floating Point - Wikipedia, including links at the bottom to articles on single and double precision floats.
The name long double is likely just down the same tradition as the long integer vs. short integer for integral types, except in this case they reversed it since 'int' is equivalent to 'long int'.

In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast this with floating-point representations, where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation." This was abbreviated to "float."
In K&R C, float referred to floating-point representations with 32-bit binary representations and double referred to floating-point representations with 64-bit binary representations, or double the size and whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating point numbers. Double-precision numbers were aptly named, as they were represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.

They're called single-precision and double-precision because they're related to the natural size (not sure of the term) of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that - 64 bits long. They just decided to call the single-precision type "float" in C.

double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.

Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks --- which were called double words. 4 byte blocks were words and sure enough, 2-byte blocks were halfwords.
The implementation of floating point numbers was such that they fit into single words, or double words. To remain consistent, the doubleword floating point number was then called a "double".

It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to offer at least the same range and at least the same precision.

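Hence the %f for a float type, and %lf for a "long float", which is the same as double. This distinction matters for scanf; in printf a float argument is promoted to double, so %f handles both, and since C99 %lf is accepted there with the same meaning.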
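The scanf side of this is where the distinction actually shows up. A minimal sketch, with `parse_both` and `format_fixed` as hypothetical helper names:

```cpp
#include <cstdio>
#include <string>

// Reads the same text once as float (%f) and once as double (%lf).
// For scanf the length modifier must match the pointer type.
inline bool parse_both(const char* text, float& f, double& d) {
    return std::sscanf(text, "%f", &f) == 1 &&
           std::sscanf(text, "%lf", &d) == 1;
}

// Formats a double with %f; a float argument would be promoted to double
// before the call, so the same conversion specifier serves both.
inline std::string format_fixed(double value) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%f", value);
    return std::string(buffer);
}
```

Passing a `double*` to `%f` or a `float*` to `%lf` in scanf is undefined behaviour, which is why the modifier matters on input but not on output.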
