Double versus float - C++

I have a constant for pi in my code:
const float PI = acos(-1);
Would it be better to declare it as a double? An answer to another question on this site said floating point operations aren't exactly precise, and I'd like the constant to be accurate.

"precise" is not a boolean concept. float provides a certain amount of precision. Whether or not that amount is sufficient for your application depends on, well, your application.
Most applications don't need more precision than float provides, though many prefer to use double to (try to) gloss over problems with unstable algorithms, or "just because", due to misconceptions like "floating point operations aren't exactly precise".
In most cases when a float is "not precise enough", the problem is not float, it's the code that uses it.
Edit: That being said, most modern CPUs only do calculations in double precision or greater anyway, so you might as well use double unless you're working with large arrays and memory usage is an issue.
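The size difference that matters for large arrays is easy to check; a minimal sketch (4 and 8 bytes are typical for IEEE 754 platforms, not guaranteed by the standard):

#include <cstdio>

int main() {
    // Halving per-element storage is the usual reason to keep float:
    std::printf("float: %zu bytes, double: %zu bytes\n",
                sizeof(float), sizeof(double)); // typically 4 and 8
}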

From standard:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
Of the three (notice that this goes hand in hand with the three overloads of acos), you should choose long double if precision is what you are aiming for (but you should also know that beyond some point, further precision may be redundant for your application).
So you should use this to get the most precise result from acos
long double result = acos(-1.0L);
(Note: There might be some platform specific types or some user defined types which provide more precision)

I'd like the constant to be accurate.
There is no such thing as a perfectly accurate floating point value. Most values cannot be stored with perfect precision, because of their binary representation in memory; exact storage is only guaranteed for integers (within range). double gives you double the precision a float offers (who would have guessed). double should fit your needs in almost every case.
I would recommend using M_PI from <cmath>, which should be available in all POSIX compliant implementations of the standard.
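For instance, a minimal usage sketch (M_PI comes from POSIX rather than the C++ standard; on MSVC it requires _USE_MATH_DEFINES before the include):

#define _USE_MATH_DEFINES // MSVC: expose M_PI and friends in <cmath>
#include <cmath>

const double PI = M_PI; // pi at double precision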

It depends exactly how precise you need to be. I've never had to use doubles because floats weren't precise enough.
The most accurate representation of pi is M_PI from math.h

The question boils down to: how much accuracy do you need?
Let's quote wikipedia:
For example, the decimal representation of π truncated to 11 decimal places is good enough to estimate the circumference of any circle that fits inside the Earth with an error of less than one millimetre, and the decimal representation of π truncated to 39 decimal places is sufficient to estimate the circumference of any circle that fits in the observable universe with precision comparable to the radius of a hydrogen atom.
I've written a small Java program; here's its output:
As string: 3.14159265358979323846264338327950288419716939937510
As double: 3.141592653589793
As float: 3.1415927
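A C++ equivalent of that program, as a minimal sketch (the format widths are chosen to match the shortest round-trip digit counts above):

#include <cstdio>

int main() {
    const double d = 3.14159265358979323846264338327950288;
    const float  f = 3.14159265358979323846264338327950288f;
    std::printf("As double: %.16g\n", d); // 3.141592653589793
    std::printf("As float:  %.8g\n", f);  // 3.1415927
}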
Remember that if you want the full precision of a double, all the numbers in your calculation also need to be doubles. (That is not entirely true, but it is close enough.)

For most applications, float would do just fine for PI. double definitely has more precision, but it doesn't guarantee exactness any more than float can. By that I mean that a number like 0.1 has no finite binary representation; if you try to represent it, you'll only succeed to the nth digit, where n is determined by how many bytes you use for the number.
Unfortunately, to hold many digits of PI, you'd probably need to keep it in a string. Though now we're talking about the kind of impressive number crunching you might see in molecular simulations, and you're probably not going to need that level of precision.

As this site says, there are three overloaded versions of the acos function.
Therefore the call acos(-1) is ambiguous.
Having said that, you should declare PI as long double to avoid any loss of precision, by using
const long double PI = acos(-1.0L);

Related

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
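As a sketch of the kind of experiment meant here (the exact float error varies by platform and compiler, but the gap is dramatic on any IEEE 754 system):

#include <cstdio>

int main() {
    // Accumulate 0.1 a million times; rounding error grows much
    // faster in the float sum than in the double sum.
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    std::printf("float:  %f\n", fsum); // noticeably far from 100000
    std::printf("double: %f\n", dsum); // ~100000.000001
}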
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:
If the number is rational, whether its decimal expansion terminates or repeats.
And what happens when you have an irrational number?
Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent exactly (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (decimal point). Since the bit before the dot is always one, this equates to a 24-bit fraction. Dividing by log2(10) = 3.3, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
The bits besides the mantissa are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~10^±38, double goes to ~10^±308.
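These figures can be checked portably with std::numeric_limits; a quick sketch:

#include <cstdio>
#include <limits>

int main() {
    // digits10 = guaranteed decimal digits; max_exponent10 = range
    std::printf("float:  %d digits, up to ~10^%d\n",
                std::numeric_limits<float>::digits10,
                std::numeric_limits<float>::max_exponent10);  // 6, 38
    std::printf("double: %d digits, up to ~10^%d\n",
                std::numeric_limits<double>::digits10,
                std::numeric_limits<double>::max_exponent10); // 15, 308
}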
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store it in a float and a double variable and then compare the two? Converting the float back to a double should be implicit - if there is no difference, the float is sufficient:
// Returns true if 'value' survives the round trip through float,
// i.e. a float can hold it without precision loss.
bool FloatIsSufficient(double value)
{
    float f = static_cast<float>(value);    // narrow to float
    return static_cast<double>(f) == value; // widen back and compare
}
// e.g. FloatIsSufficient(0.5) is true, FloatIsSufficient(0.1) is false
You cannot represent real numbers with float or double variables, only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.

cpp division - how to get most accurate outcome?

I want to divide two ull variables and get the most accurate outcome.
What is the best way to do that?
e.g. 5000034 / 5000000 = 1.0000068
If you want the "most accurate precision", you should avoid floating point arithmetic.
You might want to use some big decimal library [which usually implements fixed point arithmetic], which will allow you to define the precision you are seeking.
You should avoid floating point arithmetic because it is not exact [you have a finite number of bits to represent an infinite number of numbers in every range, so some slicing must occur...]. Fixed point arithmetic [as usually implemented in big decimal libraries] allows you to allocate more bits "on the fly" to represent the number to the desired accuracy.
More info on the floating point issue can be found in this [a bit advanced] article: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Instead of (double)(N) / D, do 1 + ( (double)(N - D) / D)
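A quick sketch of that rearrangement (assuming N >= D, since the operands here are unsigned and would otherwise wrap):

#include <cstdio>

int main() {
    unsigned long long n = 5000034ULL, d = 5000000ULL;

    // Naive: convert both operands to double, then divide.
    double naive = static_cast<double>(n) / static_cast<double>(d);

    // Rearranged: the integer difference n - d is exact, so the small
    // correction term keeps its full relative precision before 1.0
    // is added back.
    double rearranged =
        1.0 + static_cast<double>(n - d) / static_cast<double>(d);

    std::printf("%.10f\n%.10f\n", naive, rearranged);
}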
I'm afraid that "the most accurate outcome" doesn't mean much. No finite representation can represent all real numbers exactly; how precise the representation can be depends on the size of the type and its internal representation. On most implementations, double will give about 17 decimal digits of precision, which is usually several orders of magnitude more precise than the input; for a single multiplication or division, double is usually fine. (Problems occur with addition and subtraction when the difference between the two values is extreme.) There exist packages which offer larger precision (BigDecimal, BigFloat and the like), but they are never exact: in the end, the precision is limited by the amount of memory you're willing to let them use. They're also much slower than double, and generally (slightly) more difficult to use correctly (since they have more options, e.g. just how much precision do you want). The only real answer to your question is another question: how much precision do you need? And for what sequence of operations? Rounding errors accumulate, so while double may be largely sufficient for a single division, it may cause problems if used naïvely for iterative procedures. Although in such cases, the solution isn't usually to increase the precision, but to change the algorithm in a way that avoids the problems. If double gives you the precision you need, use it in preference to any extended type. If it doesn't, and you don't have a choice, then choose one of the existing arbitrary precision libraries, such as GMP.
(You might also have an issue with the way rounding is handled. For bookkeeping purposes, for example, most jurisdictions have very strict laws concerning how to round monetary values, and their rules are based on decimal arithmetic. In such cases, you'll need a numeric type which does decimal arithmetic in order for the rounding to conform in all cases.)
Floating point numbers are probably most accurate for multiplication and division, while integers and fixed point numbers are the better choice for addition and subtraction. This follows from the fact that multiplication and division change the order of magnitude, which floating point numbers handle better, while addition and subtraction are stepwise, which integers and fixed point numbers handle better.
If you want the best accuracy when dividing integers, implement a RationalNumber class containing the numerator and denominator. This way your result will always be exact if you avoid arithmetic overflow. This requires that you accept output in fractional form.
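A minimal sketch of that idea (the class name and the reduce-by-gcd step are just one possible design):

#include <numeric> // std::gcd, C++17

struct RationalNumber {
    long long num, den;

    RationalNumber(long long n, long long d) : num(n), den(d) {
        // Reducing to lowest terms postpones overflow as long as
        // possible; the stored value itself stays exact.
        const long long g = std::gcd(num, den);
        if (g != 0) { num /= g; den /= g; }
    }
};

// 5000034 / 5000000 is stored exactly as 2500017 / 2500000.
const RationalNumber r(5000034, 5000000);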

About precision on C++ calculus

I have gotten a prototype function that does some calculations (integrals of the gamma function) in C++ and I need to convert it to the C language. The author used float variables with the suffix f in every calculation. Like these statements...
float a1=.083333333f;
float vv=dif*i/1.414214f;
The program makes use of truncated series on many lines, multiplying some of those variables.
My question is... Don't I get more precision if I use double precision variables? Why would the suffix be necessary in that case?
Thanks in advance!
You would get more precision with double precision, and you don't need any special suffix in C/C++. So, your code could look like
double a1=.083333333;
double vv=dif*i/1.414214;
Also, you are free to use more accurate floating-point literals if you want... so add more "3"s and expand "1.414214" to your heart's content. Bear in mind, however, that not even doubles are perfectly accurate.
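As a sketch: the truncated literals above look like 1/12 and sqrt(2), and assuming that reading is correct, they can be computed to full double accuracy instead of being spelled out:

#include <cmath>

const double a1 = 1.0 / 12.0;        // instead of .083333333f
const double root2 = std::sqrt(2.0); // instead of 1.414214f
// double vv = dif * i / root2;      // dif and i as in the original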
float, in modern systems, is a 32 bit floating point type.
double, in modern systems, is a 64 bit floating point type.
The accuracy of a float is much less than that of a 64 bit floating point type, but it is still very useful for speed, and because it occupies fewer bytes of RAM.
Now, with 64-bit systems, the difference between float and double is a lot less noticeable, but in the past it could be really significant.
You always have to find a trade-off between performance and precision: the choice between double and float is exactly this: do you want high precision, or lower precision but better performance?
In 3D games usually we use float, in calculus applications we usually use double.
See if the precision of double suits you; if not, I would suggest you look for a C++ library for numbers with arbitrary precision, though, to continue our discussion about performance, the performance of these objects is really bad compared to native double or float.

Why are floating point values such as 3.14 considered double by default in MSVC?

Why do I need to put 3.14f instead of 3.14 to disable all those warnings?
Is there a coherent reason for this?
That's what the C++ (and C) standard decided. Floating point literals are of type double, and if you need them to be floats, you suffix them with an f. There doesn't appear to be any specifically stated reason why, but I'd guess it's a) for compatibility with C, and b) a trade-off between precision and storage.
2.13.3 Floating literals: The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill-formed.
C and C++ prefer double to float in a couple of ways. As you noticed, fractional literals are double unless explicitly made floats. Also, floats can't be passed in varargs, they're always promoted to double (in the same way char and short are promoted to int in varargs).
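A quick sketch of that promotion rule:

#include <cstdio>

int main() {
    float f = 3.14f;
    // In the variadic call, f is promoted to double, which is why a
    // single %f conversion prints both float and double arguments.
    std::printf("%f\n", f);
}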
It's probably better to think of float as being a contracted double, rather than double being an extended float. That is, double is the preferred floating point type, and float is used whenever a smaller version of double is required for some particular case. That's the closest I know to a coherent reason, and then the rule makes sense, even if you happen to be in the case where you need a smaller version.
This is not peculiar to MSVC, it is required by the language standard.
I would suggest that it made sense not to reduce precision unless explicitly requested, so the default is double.
The 6 significant digits of precision that a single-precision float provides are seldom sufficient for general use. On a modern desktop processor, float would typically be used as a hand-coded optimisation, where the writer has determined that it is both sufficient and necessary; so it makes sense that an explicit, visible marker is required to specify a single-precision literal.
This is probably a standard inherited from the C world. Double is preferred since it's more precise and you probably won't see any performance difference. Read this post.
Because double can approximate 3.14 much better than float, maybe? Here are the exact values:
3.140000000000000124344978758017532527446746826171875 (double)
3.1400001049041748046875 (float)
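These values can be reproduced by printing with enough digits to exhaust the fractional bits; a quick sketch:

#include <cstdio>

int main() {
    std::printf("%.51f\n", 3.14);  // the exact double value above
    std::printf("%.22f\n", 3.14f); // the exact float value (promoted)
}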

Why are C/C++ floating point types so oddly named?

C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float - which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later when the IEEE standard made allowances for the Intel 80287 floating-point chip, which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
Questioner is incorrect about guarantees; today almost all languages guarantee to implement IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day, it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.
You probably knew this, but you can make float / long double literals:
float f = 4.0f;
long double ld = 4.0l;
Double is the default because that's what most people use. Long doubles may be overkill, and floats have very bad precision. Double works for almost every application.
Why the naming? One day all we had was 32 bit floating point numbers (well, really all we had was fixed point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language du jour, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the CPUs/FPUs of the time, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed a name for something twice the size of a float, hence we got a double. Then someone needed even more precision; we thought they were crazy. We added it anyway. The name "quadruple" was overkill. "long double" was good enough, and nobody made a lot of noise.
Part of the confusion is that good ole "int" seems to change with time. It used to be that "int" meant a 16 bit integer. float, however, is bound to the IEEE standard as the 32-bit IEEE floating point number. For that reason, C kept float defined as 32 bits and made double and long double refer to the longer standards.
Literals
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select the smallest one that can hold the value, with some added complexity for signed/unsigned), with floats it is not this easy. Many values (including simple ones like 0.1) cannot be exactly represented by floating point numbers, and therefore the choice of type affects not only performance, but also the result value. It seems the C language designers preferred robustness over performance in this case, and therefore decided the default representation should be the more exact one.
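A quick sketch of why the literal's type affects the value, not just performance:

#include <cstdio>

int main() {
    // 0.1 has no exact binary representation; the float and double
    // literals denote different nearest values.
    std::printf("%.20f\n", 0.1f); // 0.10000000149011611938...
    std::printf("%.20f\n", 0.1);  // 0.10000000000000000555...
}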
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.
First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.
The two most common floating point formats use 32 bits and 64 bits; the longer one is "double" the size of the first, so it was called a "double".
A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is a 32-bit, then your double will be a 64-bit.
The name double precision is a bit of a misnomer, since a double precision float has a mantissa precision of 52 bits, where a single precision float has a mantissa precision of 23 bits (double that would be 46). More on floating point here: Floating Point - Wikipedia, including links at the bottom to articles on single and double precision floats.
The name long double is likely just down to the same tradition as long integer vs. short integer for integral types, except in this case they reversed it, since 'int' is equivalent to 'long int'.
In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast this to floating-point representations, where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation", which was abbreviated to "float".
In K&R C, float referred to floating-point representations with 32-bit binary representations and double referred to floating-point representations with 64-bit binary representations, or double the size and whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating point numbers. Double-precision numbers were aptly named, as they are represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.
They're called single-precision and double-precision because they're related to the natural word size of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that - 64 bits long. They just decided to call the single-precision type "float" in C.
double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.
Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks --- which were called double words. 4-byte blocks were words, and sure enough, 2-byte blocks were halfwords.
The implementation of floating point numbers were such that they fit into single words, or double words. To remain consistent, the doubleword floating point number was then called a "double".
It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to be more precise.
hence %f for a float type, and %lf for a "long float", which is the same as double.