This question already has answers here:
C++ calculating more precise than double or long double
(2 answers)
Closed 6 years ago.
Is there any floating-point type in C++ that stores more digits after the decimal point than double (or any alternative that lets me keep more digits than double does)?
I've read that long double may be more accurate.
In my program we can zoom into the Mandelbrot set, but after some zoom the picture gets pixelated. I think it is because the distance between the complex numbers associated with two neighboring pixels is less than the difference between two consecutive values of double. In the program I used long double.
If it's important, then the processor of my computer is Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4, the computer is 64 bit, the operating system is Ubuntu and the compiler is gcc.
You should take a look at third-party libraries like Boost.Multiprecision or even GMP.
You can also do it "by hand", but that would be a lot of work: you would have to keep the numbers in their string representation and implement the arithmetic operations yourself.
This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 3 years ago.
Can somebody give me an in-depth explanation of what's going on?
The system uses an approximation, right? (Correct me if I'm wrong.)
I would like to know how the computer behaves in these kinds of situations. Thank you.
Normal numbers in computers are stored with only so many bits of precision. A float in C++ is typically 4 bytes. 32 bits can't store that many 9s of precision, so the compiler does rounding to the precision it can handle.
Basically, a float gives you approximately 7 significant decimal digits of precision in total, and you have a lot more 9s than that.
I'm -kind of- a beginner to C.
I made several searches but I haven't seen this question asked.
When I try to calculate very big numbers (let's say... Adding 45235432412321312 to 5495034095872309238) my calculator gives answers which are not true. (The answer of my calculator was -2 for the numbers I've given in the previous sentence).
But both Linux's and Windows's own calculators calculate these numbers precisely.
What causes my calculator written in C/C++ to give these wrong answers with big numbers? What can I do to calculate these?
Digital data is represented as binary information. So, for an 8-bit integer you would have: 0=00000000, 1=00000001, 2=00000010, 3=00000011, etc. As you can imagine, the larger the numbers grow, the more storage is required to represent this information in binary form. What happens with your calculator is called overflow, where the resulting number is simply too large to represent in binary form (there are not enough bits to hold the information).
Now, as to why you get accurate results from the calculators on your computer, it depends on how their software is implemented. Possible explanations are that they use higher-precision arithmetic (they dedicate more bits), that they use multiple-precision arithmetic, or that they perform floating-point calculations internally. My money would be on multiple-precision arithmetic though.
Simply put, the built-in numeric data types that you're using within C and C++, such as float, int, etc., are limited because they are represented with a finite and fixed number of bits, such as 32 or 64. You can't "stuff" more information into 32 bits than 32 bits can hold; that's information theory (read up). Now, when you add two "very big" numbers, due to the machine representation, a so-called "overflow" occurs (read up), which means that the operation produces a bit sequence that represents a "meaningless" number; and if the data type is signed, a negative number is likely to appear (again, due to the internal representation).
Now your calculators use so-called "big numbers arithmetic", or "long numbers arithmetic", implemented in the corresponding libraries. With this approach, the number is represented as an array of numbers, and is thus virtually unlimited (of course, there are limits to the length of an array too, but the range that you can represent this way is a lot wider than that of the built-in types.)
To sum up, read on:
theory of information
binary number system and conversions decimal <-> binary
binary arithmetics with signed numbers
big number arithmetics
Short answer (and I'm not sure why you didn't find it, because it's been asked many, many times): you want a multiple-precision arithmetic library, such as GMP.
In most of the code I see around, double is favoured over float, even when high precision is not needed.
Since there are performance penalties when using double types (CPU/GPU/memory/bus/cache/...), what is the reason for this overuse of double?
Example: in computational fluid dynamics all the software I worked with uses doubles. In this case a high precision is useless (because of the errors due to the approximations in the mathematical model), and there is a huge amount of data to be moved around, which could be cut in half using floats.
The fact that today's computers are powerful is meaningless, because they are used to solve more and more complex problems.
Among others:
The savings are hardly ever worth it (number-crunching is not typical).
Rounding errors accumulate, so better go to higher precision than needed from the start (experts may know it is precise enough anyway, and there are calculations which can be done exactly).
Common floating-point operations that use the FPU often work internally in double or higher precision anyway.
C and C++ convert float to double implicitly and without loss; the other direction is a narrowing conversion that can lose precision, and compilers can warn about it.
Variadic and no-prototype functions always get double, not float. (second one is only in ancient C and actively discouraged)
You may commonly do an operation with more than needed precision, but seldom with less, so libraries generally favor higher precision too.
But in the end, YMMV: Measure, test, and decide for yourself and your specific situation.
BTW: There's even more for performance fanatics: Use the IEEE half precision type. Little hardware or compiler support for it exists, but it cuts your bandwidth requirements in half yet again.
In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
Habit
Culture
To match library function signatures
To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true that a double may be as fast as a float for a single computation, because most FPUs have a wider internal representation than either the 32-bit float or the 64-bit double.
However, that's only a small piece of the picture. Nowadays operational optimizations don't mean anything if you're bottlenecked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
They fit in half the memory. Which is like having all your caches be twice as large. (big win!!!)
If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.
In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.
double is, in some ways, the "natural" floating point type in the C language, which also influences C++. Consider that:
an unadorned, ordinary floating-point constant like 13.9 has type double. To make it float, we have to add an extra suffix f or F.
default argument promotion in C converts float function arguments* to double: this takes place when no declaration exists for an argument, such as when a function is declared as variadic (e.g. printf) or no declaration exists (old style C, not permitted in C++).
The %f conversion specifier of printf takes a double argument, not float. There is no dedicated way to print floats; a float argument default-promotes to double and so matches %f.
On modern hardware, float and double are usually mapped, respectively, to 32 bit and 64 bit IEEE 754 types. The hardware works with the 64 bit values "natively": the floating-point registers are 64 bits wide, and the operations are built around the more precise type (or internally may be even more precise than that). Since double is mapped to that type, it is the "natural" floating-point type.
The precision of float is poor for any serious numerical work, and the reduced range could also be a problem. The IEEE 32 bit type stores only 23 bits of mantissa (8 bits are consumed by the exponent field and one bit by the sign), which gives 24 bits of precision once the implicit leading bit is counted. The float type is useful for saving storage in large arrays of floating-point values, provided that the loss of precision and range isn't a problem in the given application. For example, 32 bit floating-point values are sometimes used in audio for representing samples.
It is true that the use of a 64 bit type over a 32 bit type doubles the raw memory bandwidth required. However, that only affects programs which work with large arrays of data that are accessed in a pattern showing poor locality. The superior precision of the 64 bit floating-point type trumps issues of optimization. Quality of numerical results is more important than shaving cycles off the running time, in accordance with the principle of "get it right first, then make it fast".
* Note, however, that there is no general automatic promotion from float expressions to double; the only promotion of that kind is integral promotion: char, short and bitfields going to int.
This is mostly hardware dependent, but consider that the most common CPUs (x86/x87 based) have an internal FPU that operates with 80-bit floating-point precision (which exceeds both float and double).
If you have to store some intermediate calculations in memory, double is a good compromise between internal precision and external space. Performance is more or less the same on single values. It may be affected by the memory bandwidth on large numeric pipelines (since they will be twice as long).
Consider that floats have a precision of approximately 6 decimal digits. On an N-cubed complexity problem (like a matrix inversion or transformation), you lose two or three more in multiplications and divisions, leaving just 3 meaningful digits. On a 1920 pixel wide display they are simply not enough (you need at least 5 to match a pixel properly).
This makes double generally preferable.
It is often relatively easy to determine that double is sufficient, even in cases where it would take significant numerical analysis effort to show that float is sufficient. That saves development cost, and the risk of incorrect results if the analysis is not done correctly.
Also, any performance gain from using float instead of double is usually slight, because most of the popular processors do all floating-point arithmetic in one format that is even wider than double.
I think higher precision is the only reason. Actually, most people don't think a lot about it; they just use double.
I think if float precision is good enough for a particular task, there is no reason to use double.
This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 8 years ago.
I heard that C/C++ has a problem with the handling of floating-point numbers.
I've implemented a simple program to try it. It is a change machine: the user enters the amount to charge and the amount paid, and the program calculates the number of coins of each type to give as change.
Here is the code: Link to my google drive folder with the code
The thing is, when you enter a non-integer value, the program enters a loop and never ends.
I've printed the contents of the variables to find out what's going on, and, somehow, a value with 2 decimals, let's say 0.10, gets changed by the program to 0.0999998.
Then the remaining change to be processed never reaches 0, and the program enters an infinite loop.
I've heard that this is due to the machine representation of floating-point numbers. I've seen the same behavior on both Windows and Linux, and also when programming it in Java, but I don't remember having had the same issue in Pascal.
Well, now the question is: what is the best workaround for this?
I've thought that one possible solution is using a fixed-point representation via external libraries such as http://www.trenki.net/content/view/17/1/ or http://www.codef00.com/code/Fixed.h . Another may be to use a multiple-precision arithmetic library such as GMP.
Neither C nor C++ has a problem with floating point values. You as the programmer are trusted to use floating point appropriately in any language supporting it.
While integer variables cannot store fractions or out-of-bounds values, floating point can only store a specific subset of fractions. A high-quality floating-point implementation also gives tight guarantees for the accuracy of calculations.
Floating-point numbers cannot represent every rational number exactly: a decimal fraction such as 0.1 has an infinitely repeating binary expansion, so it would need infinite space to store exactly and is rounded instead.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Should I use double or float ?
When would I rather use double and when should I use float?
It's all about precision.
If you need to store very precise numbers then use a double.
If you need to store less precise numbers and are worried about the size of memory you're using then use a float.
The only time you need to use float is when you are storing large arrays of numbers. There is generally little difference in speed between the two and natively most things are double anyway.
Use double when you require the range it supports. Refer to Range of floating-point numbers. You should also typically use the native type, so if you're doing graphics or GPU programming, probably better stick to floats.
But whatever you do, please, do not use either to represent currency or money.