Why are c/c++ floating point types so oddly named? - c++

C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.

The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later when the IEEE standard made allowances for the Intel 80287 floating-point chip, which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
Questioner is incorrect about guarantees; today almost all languages guarantee to implement IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day, it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.

You probably knew this, but you can make literal floats/long doubles
float f = 4.0f;
long double f = 4.0l;
Double is the default because thats what most people use. Long doubles may be overkill or and floats have very bad precision. Double works for almost every application.
Why the naming? One day all we had was 32 bit floating point numbers (well really all we had was fixed point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language dujour then, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the cpu's/fp cpus of the time, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed something a name for something twice the size of a float, hence we got a double. Then someone needed even more precision, we thought he was crazy. We added it anyway. The name quadtuple(?) was overkill. Long double was good enough, and nobody made a lot of noise.
Part of the confusion is that good-ole "int" seems to change with the time. It used to be that "int" meant 16 bit integer. Float, however, is bound to the IEEE std as the 32-bit IEEE floating point number. For that reason, C kept float defined as 32 bit and made double and long double to refer to the longer standards.

Literals
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select smallest enough to hold the value, with some added complexity for signed/unsigned), with floats it is not this easy. Many values (including simple ones like 0.1) cannot be exactly represented by float numbers and therefore choice of type affects not only performance, but also result value. It seems C language designers preferred robustness against performance in this case and they therefore decided the default representation should be the more exact one.
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.

First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.

The two most common floating point formats use 32-bits and 64-bits, the longer one is "double" the size of the first one so it was called a "double".

A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is a 32-bit, then your double will be a 64-bit.
The name double precision is a bit of a misnomer, since a double precision float has a precision of the mantissa of 52-bits, where a single precision float has a mantissa precision of 23-bits (double that is 56). More on floating point here: Floating Point - Wikipedia, including
links at the bottom to articles on single and double precision floats.
The name long double is likely just down the same tradition as the long integer vs. short integer for integral types, except in this case they reversed it since 'int' is equivalent to 'long int'.

In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast to this to floating-point representations where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation." This was abbreviated to "float."
In K&R C, float referred to floating-point representations with 32-bit binary representations and double referred to floating-point representations with 64-bit binary representations, or double the size and whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating point numbers. Double-precision numbers were aptly name as they were represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.

They're called single-precision and double-precision because they're related to the natural size (not sure of the term) of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that - 64 bits long. They just decided to call the single-precision type "float" in C.

double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.

Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks --- which were called double words. 4 byte blocks were words and sure enugh, 2-byte blocks were halfwords.
The implementation of floating point numbers were such that they fit into single words, or double words. To remain consistent, the doubleword floating point number was then called a "double".

It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to be more precise.

hence the %f for a float type, and a %lf for a long float which is the same as double.

Related

Casting Double to Long is giving wrong value [duplicate]

I'm currently learning inter-type data convertion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 253. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the floating point unit of x86 architecture actually uses a 80-bit extended floating point format, which is designed to be able to represent precisely all of 64 bit integers and can be accessed using the long double type.
This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

converting really large int to double, loss of precision on some computer

I'm currently learning inter-type data convertion in cpp. I have been taught that
For a really large int, we can (for some computers) suffer a loss of
precision when converting to double.
But no reason was provided for the statement.
Could someone please provide an explanation and an example? Thanks
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.
Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.
Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 253. This property is directly associated with the length of the mantissa.
Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.
More technically, the floating point unit of x86 architecture actually uses a 80-bit extended floating point format, which is designed to be able to represent precisely all of 64 bit integers and can be accessed using the long double type.
This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

Z3 - Floating point arithmetic API function Z3_mk_fpa_to_ubv

I am playing around with Z3-4.6.0 C++ for the first time. Sorry for the noob questions.
My question has 2 parts.
If I have a floating point number, and I use Z3_mk_fpa_to_ubv(...) function to create an unsigned bit-vector.
How much precision is lost?
If the precision is not lost, can I use this new unsigned bit-vector as a regular bit-vector and apply all operations defined for it for e.g., Z3_mk_bvadd(....)?
I know I can use Z3_mk_fpa_to_ieee_bv(....) for graceful, and IEEE-754 compliant conversion. Afterwards I can add,sub etc the bit-vectors.
Just being curious.
Thank you very much.
I'm afraid you're misinterpreting the role of these functions. A good reference to keep open while working with SMTLib floats is: http://smtlib.cs.uiowa.edu/papers/BTRW15.pdf
mk_fpa_to_ubv
This function corresponds to the FPToUInt function in the cited paper. It's defined as follows:
(The NaN choice above is misleading: It should be read as "undefined.")
Note that the precision loss can be huge here, depending on what the FP value is and the bit-width of the vector. Imagine converting a double-precision floating point value to an 8-bit word: You're smashing values in the range ±2.23×10^−308 to ±1.80×10^308 to a mere 256 different values. This means a large number of conversions simply will go through massive rounding cancelations.
You should think of this as "casting" in C like languages:
unsigned char c;
double f;
c = (char) f;
This is the essence of conversion from double-precision to unsigned byte, which will suffer major precision loss. In the other direction, if you convert to a really large bit-vector (say one that has a thousand bits), then your conversion will still be losing precision per the rounding mode, though you'll be able to cover all the integer values precisely in the range. So, it really depends on what BV-type you convert to and the rounding mode you choose.
mk_fpa_to_ieee_bv
This function has nothing to do with "preserving" the value. So asking "precision loss" here is irrelevant. What it does is that it gives you the underlying bit-vector representation of the floating-point value, per the IEEE-754 spec. The wikipedia article has a good discussion on this representation: https://en.wikipedia.org/wiki/Double-precision_floating-point_format#IEEE_754_double-precision_binary_floating-point_format:_binary64
In particular, if you interpret the output of this function as a two's complement integer value, you'll get a completely irrelevant value that has nothing to do with the value of the floating-point number itself. (Also, this conversion is not unique since NaN has multiple corresponding bit-vector patterns.)
Summary
Long story short, conversions from floats to bit-vectors will suffer from precision loss not only due to losing the "fractional" part due to rounding, but also due to the limited range, unless you pick a very-large bit-vector size. The IEEE-754 representation conversion does not preserve value, and thus doing arithmetic on values converted via this function is more or less meaningless.

int64_t to double to int64_t again, loss of precision

I need to parse a given type (eg: long long integer) which is represented with scientific notation. Examples:
123456789012345678.3e-3
123456789012345678.3
I know the type of the given string but I can't use strtoll since number is given in scientific notation. What I do is that I convert it using strtod, do error checks with respect to int64_t and cast it back to int64_t. ErrCheckInt and ErrCheckDouble does error checks (overflow, underflow, etc) for integral and floating types and casts the number into whatever type it was. .
double res = strtod(processedStr, &end);
return (std::is_floating_point<OUT_T>::value) ? ErrCheckFloat<double, OUT_T>(res, out) : ErrCheckInt<double, OUT_T>(res, out);
Problem is when I parse int64_t with double, I get a floating point number with correct scientific notation, 1 significand. When I cast the number to int64_t again, I loss precision. The example number:
input: 123456789012345678.3
double_converted: 1.23456789012346E+17
cast_to_int64_t: 123456789012345680
expected: 123456789012345678
I know that number is long enough to be represented correctly with double precision. I can use long double but that won't solve the problem.
I can evaluate the string and remove / add digits with respect to e notation in the end but processing should be very, very fast since code will run in embedded rtos. I already do a lot of checks and strtod will take its time as well.
I know the type of the given string but I can't use strtoll since number is given in scientific notation.
You only need to call it once, use the resulting pointer to detect whether the number is in xxxeyyy form, and call strtoll again to parse the exponent. Much simpler than going through floating-point in my opinion.
I know that number is long enough to be represented correctly with double precision.
No, you don't know that since your example input is “123456789012345678”, which is not representable in IEEE 754 double-precision.
I can use long double but that won't solve the problem.
Actually, if your compiler maps long double to “80-bit extended precision with 64 bit significand”, it will solve the problem: all 64-bit integers are representable in that format. GCC and Clang make the historical 80-bit floating-point format available through long double on Linux, but it is so inconvenient as to be practically considered not available on Windows (you would need to change the FPU control word, and restore it everytime you have library functions to call, and write your own math functions to operate on 80-bit floating-point values. Starting with strtold.

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (i.e. 1/7 or sqrt(2)), you will also want ways of detecting:
If the number is rational, whether it has repeating decimals, or cyclic decimals.
Or, What happens when you have an irrational number?
More over, there are numbers, such as 0.9, that float / double cannot in theory represent "exactly" )at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (decimal point). Since the bit before the dot is always one, this equates to a 24-bit fraction. Dividing by log2(10) = 3.3, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~ 10±38, double goes to ~ 10±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store it to a float and a double variable and than compare these two? This should implicitely convert the float back to a double - if there is no difference, the float is sufficient?
float f = value;
double d = value;
if ((double)f == d)
{
// float is sufficient
}
You cannot represent real number with float or double variables, but only a subset of rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong but I thought that float (4 bytes) and double (8 bytes) floating point representation were actually specified independently of comp architectures.