IEEE floating points implementation, precision and accumulation of approximations [closed] - c++

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
If I understand IEEE floating points correctly, they are unable to accurately represent some values. They are accurate in very limited cases and pretty much every floating point operation increases the accumulated approximations. Also, another downside - the "minimum step" grows with the exponent.
Wouldn't it be better to offer some more concrete representation?
For example, use 20 bits for the "decimal" part, but not all 2^20 values, instead only 1000000, giving a smallest possible resolution of exactly one millionth, and use the other 44 bits for the integer part, giving quite the range. This way "floating point" numbers can be calculated using integer arithmetic, which may even end up faster. And in the case of multiplication, addition and subtraction there is no accumulation of approximations, the only possible loss is during division.
This concept rests on the fact that 2^n values are not optimal for representing decimal numbers, e.g. 1 does not divide that well into 1024 parts, but it divides pretty well into 1000. Technically, this is omitting to make use of the full precision, but I can think of plenty of cases where LESS can be MORE.
Naturally, this approach will lose both range and precision in a way, but in all the cases where extremities are not required, such a representation sounds like a good idea.

What you are proposing is fixed point arithmetic. Now, it's not necessarily about better or worse; each representation has advantages and disadvantages that often make one more suitable than the other for some specific purpose. For example:
Fixed point arithmetic does not introduce rounding errors for operations like addition and subtraction, which makes it suitable for financial calculations. You certainly don't want to store money as floating point values.
Speculation: arguably, fixed point arithmetic is simpler in terms of implementation, which probably leads to smaller, more efficient circuits.
Floating-point representation covers an extremely large range: it can be used to store really big numbers (~10^38 for a 32-bit float, ~10^308 for a 64-bit one) and really small positive ones (~10^-324 for a 64-bit double, counting denormals) at the expense of precision, while the range of a fixed-point representation is limited linearly by its size.
Floating-point precision is not distributed uniformly across the representable range. Instead, most of the values (in terms of the number of representable numbers) lie in the unit ball around 0. That makes it very accurate in the range we operate in most often.
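To make the fixed-point trade-off above concrete, here is a minimal sketch of the "millionths" representation from the question. The names Micro, make_micro, add, sub and mul are hypothetical, chosen only for illustration. It shows that addition is exact integer arithmetic, while multiplication has to rescale the product and can therefore still lose digits (and overflow is not handled at all here).

```cpp
#include <cstdint>

// Hypothetical fixed-point type: the value is stored as a count of millionths.
struct Micro {
    std::int64_t units;  // real value multiplied by 1000000
};

constexpr std::int64_t kScale = 1000000;

// Build a value from an integer part and a number of millionths,
// e.g. make_micro(0, 100000) stands for 0.1.
constexpr Micro make_micro(std::int64_t whole, std::int64_t millionths) {
    return Micro{whole * kScale + millionths};
}

// Addition and subtraction are plain integer operations: exact unless they overflow.
constexpr Micro add(Micro a, Micro b) { return Micro{a.units + b.units}; }
constexpr Micro sub(Micro a, Micro b) { return Micro{a.units - b.units}; }

// Multiplication must divide the raw product by the scale. The division
// truncates, so precision can be lost, and the raw product can overflow.
constexpr Micro mul(Micro a, Micro b) { return Micro{a.units * b.units / kScale}; }

// 0.1 + 0.2 compared with 0.3 in binary floating point.
inline bool double_sum_is_exact() { return 0.1 + 0.2 == 0.3; }

// The same comparison in the fixed-point sketch.
inline bool fixed_sum_is_exact() {
    return add(make_micro(0, 100000), make_micro(0, 200000)).units ==
           make_micro(0, 300000).units;
}
```

With IEEE doubles, double_sum_is_exact() is expected to return false while fixed_sum_is_exact() returns true. On the other hand, mul(make_micro(0, 1), make_micro(0, 1)) truncates to zero, which is the kind of loss a fixed resolution cannot avoid.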
You said it yourself:
Technically, this is omitting to make use of the full precision, but I
can think of plenty of cases where LESS can be MORE
Exactly, that's the whole point. Now, depending on the problem at hand, a choice must be made. There is no one-size-fits-all representation, it's always a tradeoff.

Related

Why aren’t posit arithmetic representations commonly used?

I recently found this library that seems to provide its own types and operations on real numbers that are 2 to 3 orders of magnitude faster than normal floating point arithmetic.
The library is based on using a different representation for real numbers. One that is described to be both more efficient and mathematically accurate than floating point - posit.
If this representation is so efficient why isn’t it widely used in all sorts of applications and implemented in hardware, or maybe it is? As far as I know most typical hardware uses some kind of IEEE floating point representation for real numbers.
Is it somehow maybe only applicable to some very specific AI research, as they seem to list mostly that as an example?
If this representation is not only hundreds to thousands of times faster than floating point, but also much more deterministic and designed for use in concurrent systems, why isn’t it implemented in GPUs, which are basically massively concurrent calculators working on real numbers? Wouldn’t it bring huge advances in rendering performance and GPU computation capabilities?
Update: People behind the linked Universal library have released a paper about their design and implementation.
The most objective and convincing reason I know of is that posits were introduced less than 4 years ago. That's not enough time to make inroads in the marketplace (people need time to develop implementations), much less take it over (which, among other things, requires overcoming incompatibilities with existing software).
Whether or not the industry wants to make such a change is a separate issue that tends towards subjectivity.
The reason the IEEE standard seems to be slower is that it gives higher priority to some other topics. For example:
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) defines:
arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs)
interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form
rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions
operations: arithmetic and other operations (such as trigonometric functions) on arithmetic formats
exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)
The above is copied from Wikipedia: https://en.wikipedia.org/wiki/IEEE_754
Your linked library, which implements what is called the posit number system, advertises the following strengths.
Economical - No bit patterns are redundant. There is one representation for infinity denoted as ± inf and zero. All other bit patterns are valid distinct non-zero real numbers. ± inf serves as a replacement for NaN.
Mathematical Elegant - There is only one representation for zero, and the encoding is symmetric around 1.0. Associative and distributive laws are supported through deferred rounding via the quire, enabling reproducible linear algebra algorithms in any concurrency environment.
Tapered Accuracy - Tapered accuracy is when values with small exponent have more digits of accuracy and values with large exponents have fewer digits of accuracy. This concept was first introduced by Morris (1971) in his paper ”Tapered Floating Point: A New Floating-Point Representation”.
Parameterized precision and dynamic range -- posits are defined by a size, nbits, and the number of exponent bits, es. This enables system designers the freedom to pick the right precision and dynamic range required for the application. For example, for AI applications we may pick 5 or 6 bit posits without any exponent bits to improve performance. For embedded DSP applications, such as 5G base stations, we may select a 16 bit posit with 1 exponent bit to improve performance per Watt.
Simpler Circuitry - There are only two special cases, Not a Real and Zero. No denormalized numbers, overflow, or underflow.
The above is copied from GitHub: https://github.com/stillwater-sc/universal
So, in my opinion, the posit number system prefers performance, while the IEEE Standard for Floating-Point Arithmetic (IEEE 754) prefers technical compatibility and interchangeability.
I strongly challenge the claim of that library being faster than IEEE floating point:
Modern hardware includes circuitry specifically designed to handle IEEE floating point arithmetic. Depending on your CPU model, it can perform roughly 0.5 to 4 floating point operations per clock cycle. Yes, this circuitry does complex things, but because it's built in hardware and aggressively optimized for many years, it achieves this kind of speed.
Any software library that provides a different floating point format must perform the arithmetic in software. It cannot just say "please multiply these two numbers using double precision arithmetic" and see the result appear in the corresponding register two clock cycles later; it must contain code that takes the four different parts of the posit format, handles them separately, and fuses together a result. And that code takes time to execute. Much more time than just two clock cycles.
The "universal" library may have corner cases where its posit number format shines. But speed is not where it can hope to compete.

Is it safe to use double for scientific constants in C++?

I want to do some calculations in C++ using several scientific constants like,
effective mass of electron(m) 9.109e-31 kg
charge of electron 1.602e-19 C
Boltzmann constant (k) 1.38e-23 J/K
Time 8.92e-13
And I have calculations like, sqrt((2kT)/m)
Is it safe to use double for these constants and for results?
Floating point arithmetic and accuracy is a very tricky subject. Be sure to read the floating-point-gui.de site.
Errors of many floating point operations can accumulate to the point of giving meaningless results. Several catastrophic events (loss of life, crashes costing billions of dollars) happened because of this. More will happen in the future.
There are some static source analyzers dedicated to detect them, for example Fluctuat (by my CEA colleagues, several now at Ecole Polytechnique, Palaiseau, France) and others. But Rice's theorem applies so that static analysis problem is unsolvable in general.
(but static analysis of floating point accuracy could sometimes practically work on some small programs of a few thousand lines, and do not scale well to large programs)
There are also some programs instrumenting calculations, for example CADNA from LIP6 in Paris, France.
(but instrumentation may give a huge over-approximation of the error)
You could design your numerical algorithms to be less sensitive to floating point errors. This is very difficult (and you'll need years of work to acquire the relevant skills and expertise).
(you need both numerical, mathematical, and computer science skills, PhD-level)
You could also use arbitrary-precision arithmetic, or extended precision one (e.g. 128 bit floats or quad-precision). This slows down the computations.
An important consideration is how much effort (time and money) you can allocate to hunting floating point errors, and how much they matter to your particular problem. But there is No Silver Bullet, and the question of floating point accuracy remains a very difficult issue (you could work your entire life on it).
PS. I am not a floating point expert. I just happen to know some.
With the particular example you gave (constants and calculations) : YES
You didn't define 'safe' in your problem. I will assume that you want to keep the same number of correct significant digits.
doubles are correct to 15 significant digits
you have constants that have 4 significant digits
the operations involved are multiplication, division, and one square root
it doesn't seem that your results are going to the 'edge' cases of doubles (for very small or large exponent value, where mantissa loses precision)
In this particular case, the result would be correct to 4 significant digits.
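As a rough check of the reasoning above, the following sketch evaluates sqrt((2kT)/m) in both float and double using the constants from the question. The temperature of 300 K is an assumption made purely for illustration (the question does not give one), and the function names are hypothetical.

```cpp
#include <cmath>

// Evaluate sqrt(2kT/m) in the requested floating-point type.
// k and m are the 4-digit constants from the question.
template <typename Real>
Real thermal_speed(Real temperature_kelvin) {
    const Real k = static_cast<Real>(1.38e-23);   // Boltzmann constant, J/K
    const Real m = static_cast<Real>(9.109e-31);  // electron mass, kg
    return std::sqrt((Real(2) * k * temperature_kelvin) / m);
}

// Relative difference between the float and the double evaluation.
inline double relative_gap(double temperature_kelvin) {
    const double d = thermal_speed<double>(temperature_kelvin);
    const double f = thermal_speed<float>(static_cast<float>(temperature_kelvin));
    return std::fabs(d - f) / d;
}
```

For T = 300 K both versions give roughly 9.534e4 m/s, and the gap between them is far below the 4 significant digits of the inputs. The picture can change if an intermediate product such as k * m is formed first, since that underflows the float range.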
In the general case, however, you have to be careful (the answer is probably no, though this depends on your definition of 'safe', of course).
This is a large and complicated subject. In particular, your result might not be correct to the same number of significant digits if you have :
a lot more operations,
subtractions of numbers close to each other
other problematic operations
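The subtraction case in the list above is usually called catastrophic cancellation. A small sketch, assuming IEEE doubles and a reasonably accurate math library: computing 1 - cos(x) directly for a tiny x loses every significant digit, while the algebraically identical 2*sin^2(x/2) does not. The function names are hypothetical.

```cpp
#include <cmath>

// Naive form: cos(x) is so close to 1 for tiny x that the subtraction
// cancels all significant digits.
inline double one_minus_cos_naive(double x) { return 1.0 - std::cos(x); }

// Same quantity rewritten as 2*sin^2(x/2), which involves no subtraction
// of nearly equal numbers.
inline double one_minus_cos_stable(double x) {
    const double s = std::sin(0.5 * x);
    return 2.0 * s * s;
}
```

For x = 1e-9 the exact answer is about 5e-19. The naive version is expected to return exactly 0, while the rewritten version stays close to the true value.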
Obligatory reading : What Every Computer Scientist Should Know About Floating-Point Arithmetic
See the good answer by @Basile Starynkevitch for other references.
Also, for complex calculations, it is relevant to have some notion of the Condition number of a problem.
If you need a yes or no answer, No.

28x slowdown when multiplying small floating point numbers [duplicate]

So I'm trying to learn more about denormalized numbers as defined in the IEEE 754 standard for floating point numbers. I've already read several articles thanks to Google search results, and I've gone through several Stack Overflow posts. However, I still have some unanswered questions.
First off, just to review my understanding of what a Denormalized float is:
Numbers which have fewer bits of precision, and are smaller (in
magnitude) than normalized numbers
Essentially, a denormalized float has the ability to represent the SMALLEST (in magnitude) number that is possible to be represented with any floating point value.
Does that sound correct? Anything more to it than that?
I've read that:
using denormalized numbers comes with a performance cost on many
platforms
Any comments on this?
I've also read in one of the articles that
one should "avoid overlap between normalized and denormalized numbers"
Any comments on this?
In some presentations of the IEEE standard, when floating point ranges are presented the denormalized values are excluded and the tables are labeled as an "effective range", almost as if the presenter is thinking "We know that denormalized numbers CAN represent the smallest possible floating point values, but because of certain disadvantages of denormalized numbers, we choose to exclude them from ranges that will better fit common use scenarios" -- As if denormalized numbers are not commonly used.
I guess I just keep getting the impression that using denormalized numbers turns out to not be a good thing in most cases?
If I had to answer that question on my own I would want to think that:
Using denormalized numbers is good because you can represent the smallest (in magnitude) numbers possible -- As long as precision is not important, and you do not mix them up with normalized numbers, AND the resulting performance of the application fits within requirements.
Using denormalized numbers is a bad thing because most applications do not require representations so small -- The precision loss is detrimental, and you can shoot yourself in the foot too easily by mixing them up with normalized numbers, AND the performance is not worth the cost in most cases.
Any comments on these two answers? What else might I be missing or not understand about denormalized numbers?
Essentially, a denormalized float has the ability to represent the
SMALLEST (in magnitude) number that is possible to be represented with
any floating point value.
That is correct.
using denormalized numbers comes with a performance cost on many platforms
The penalty is different on different processors, but it can be up to 2 orders of magnitude. The reason? The same as for this advice:
one should "avoid overlap between normalized and denormalized numbers"
Here's the key: denormals are a fixed-point "micro-format" within the IEEE-754 floating-point format. In normal numbers, the exponent indicates the position of the binary point. Denormal numbers contain the last 52 bits in fixed-point notation with an exponent of 2^-1074 for doubles.
So, denormals are slow because they require special handling. In practice, they occur very rarely, and chip makers don't like to spend too many valuable resources on rare cases.
Mixing denormals with normals is slow because then you're mixing formats and you have the additional step of converting between the two.
I guess I just keep getting the impression that using denormalized
numbers turns out to not be a good thing in most cases?
Denormals were created for one primary purpose: gradual underflow. It's a way to keep the relative difference between tiny numbers small. If you go straight from the smallest normal number to zero (abrupt underflow), the relative change is infinite. If you go to denormals on underflow, the relative change is still not fully accurate, but at least more reasonable. And that difference shows up in calculations.
To put it a different way: floating-point numbers are not distributed uniformly. There is always the same number of values between successive powers of two: 2^52 (for double precision). So without denormals, you always end up with a gap between 0 and the smallest floating-point number that is 2^52 times the size of the difference between the smallest two numbers. Denormals fill this gap uniformly.
As an example about the effects of abrupt vs. gradual underflow, look at the mathematically equivalent x == y and x - y == 0. If x and y are tiny but different and you use abrupt underflow, then if their difference is less than the minimum cutoff value, their difference will be zero, and so the equivalence is violated.
With gradual underflow, the difference between two tiny but different normal numbers gets to be a denormal, which is still not zero. The equivalence is preserved.
So, using denormals on purpose is not advised, because they were designed only as a backup mechanism in exceptional cases.
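The gradual-underflow argument above can be checked with a few lines. This is a sketch that assumes IEEE doubles with denormals enabled (that is, no flush-to-zero mode and no -ffast-math style options); the helper names are hypothetical.

```cpp
#include <cmath>
#include <limits>

// y is the smallest positive normal double, x is a slightly larger normal number.
inline double tiny_y() { return std::numeric_limits<double>::min(); }
inline double tiny_x() { return 1.5 * std::numeric_limits<double>::min(); }

// With gradual underflow, x - y is a nonzero denormal instead of 0.
inline bool difference_is_subnormal() {
    const double diff = tiny_x() - tiny_y();
    return std::fpclassify(diff) == FP_SUBNORMAL;
}

// The smallest positive denormal double is 2^-1074.
inline bool smallest_denormal_is_2_pow_minus_1074() {
    return std::numeric_limits<double>::denorm_min() == std::ldexp(1.0, -1074);
}
```

Here tiny_x() != tiny_y() and tiny_x() - tiny_y() != 0 both hold, so the equivalence between x == y and x - y == 0 survives. Under abrupt underflow the difference would be flushed to zero.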

Calculating With Very Big Numbers [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I'm -kind of- a beginner to C.
I made several searches but I haven't seen this question asked.
When I try to calculate very big numbers (let's say... Adding 45235432412321312 to 5495034095872309238) my calculator gives answers which are not true. (The answer of my calculator was -2 for the numbers I've given in the previous sentence).
But both Linux's and Windows's own calculators calculate these numbers precisely.
What causes my calculator written in C/C++ to give these wrong answers with big numbers? What can I do to calculate these?
Digital data is represented as binary information. So, for an 8-bit integer you would have: 0=00000000, 1=00000001, 2=00000010, 3=00000011, etc. As you can imagine, the larger the numbers grow, the more storage is required to represent this information in binary form. What happens with your calculator is called overflow, where the resulting number is simply too large to represent in binary form (there are not enough bits to hold the information).
Now as to why you get accurate results from the calculators on your computer, it depends on how that software is implemented. Possible explanations are that they use higher-precision arithmetic (they dedicate more bits), use multiple-precision arithmetic, or perform floating point calculations internally. My money would be on multiple-precision arithmetic though.
Simply put, the built-in numeric data types that you're using within C and C++, such as float, int, etc., are limited due to them being represented with a finite and fixed amount of bits, such as 32, 64, etc. bits. You can't "stuff" more information into 32 bits than you can, that's the theory of information (read up). Now, when you add two "very big" numbers, due to the machine representation, a so-called "overflow" occurs (read up), which means that a bit sequence is being created as a result of the operation that represents a "meaningless" number; and if the data type is signed, a negative number is likely to appear (again, due to the internal representation).
Now your calculators use so-called "big numbers arithmetic", or "long numbers arithmetic", implemented in the corresponding libraries. With this approach, the number is represented as an array of numbers, and is thus virtually unlimited (of course, there are limits to the length of an array too, but the range that you can represent this way is a lot wider than that of the built-in types.)
To sum up, read on:
theory of information
binary number system and conversions decimal <-> binary
binary arithmetics with signed numbers
big number arithmetics
Short answer (and I'm not sure why you didn't find it, because it's been asked many, many times): you want a multiple-precision arithmetic library, such as GMP.
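Before reaching for a big-number library, it can help to detect the overflow described above instead of silently producing a wrong result. The sketch below uses a hypothetical helper, checked_add, that tests the operands against the limits of the type before adding. Note that the two numbers from the question actually fit in a 64-bit signed integer. One plausible explanation for the -2 (an assumption, since the question shows no code) is that both inputs were clamped to the largest 32-bit int and the sum then wrapped around.

```cpp
#include <cstdint>
#include <limits>

// Adds a and b into out. Returns false and leaves out untouched if the
// mathematical sum does not fit in T (intended for signed integer types).
template <typename T>
bool checked_add(T a, T b, T& out) {
    const T max = std::numeric_limits<T>::max();
    const T min = std::numeric_limits<T>::min();
    if ((b > 0 && a > max - b) || (b < 0 && a < min - b)) {
        return false;  // the addition would overflow
    }
    out = static_cast<T>(a + b);
    return true;
}
```

With std::int64_t the sum of 45235432412321312 and 5495034095872309238 is computed correctly as 5540269528284630550, whereas checked_add<std::int32_t>(2147483647, 1, result) reports failure. Values beyond the 64-bit range still need a multiple-precision library such as GMP.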

Why are double preferred over float? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
In most of the code I see around, double is favoured over float, even when high precision is not needed.
Since there are performance penalties when using double types (CPU/GPU/memory/bus/cache/...), what is the reason of this double overuse?
Example: in computational fluid dynamics all the software I worked with uses doubles. In this case a high precision is useless (because of the errors due to the approximations in the mathematical model), and there is a huge amount of data to be moved around, which could be cut in half using floats.
The fact that today's computers are powerful is meaningless, because they are used to solve more and more complex problems.
Among others:
The savings are hardly ever worth it (number-crunching is not typical).
Rounding errors accumulate, so better go to higher precision than needed from the start (experts may know it is precise enough anyway, and there are calculations which can be done exactly).
Common floating operations using the fpu internally often work on double or higher precision anyway.
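C and C++ convert from float to double implicitly and without loss; the other way is also implicit but loses precision, so compilers often warn about it, and C++11 brace initialization rejects it for non-constant values.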
Variadic and no-prototype functions always get double, not float. (second one is only in ancient C and actively discouraged)
You may commonly do an operation with more than needed precision, but seldom with less, so libraries generally favor higher precision too.
But in the end, YMMV: Measure, test, and decide for yourself and your specific situation.
BTW: There's even more for performance fanatics: Use the IEEE half precision type. Little hardware or compiler support for it exists, but it cuts your bandwidth requirements in half yet again.
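The point above about rounding errors accumulating is easy to observe. The following sketch (the function name repeated_sum is hypothetical) adds the same increment many times in a chosen accumulator type. It assumes the compiler evaluates float arithmetic in float, as is typical with SSE on x86-64.

```cpp
// Sum the same increment `count` times using an accumulator of type Real.
template <typename Real>
Real repeated_sum(Real increment, long count) {
    Real total = 0;
    for (long i = 0; i < count; ++i) {
        total += increment;  // each addition rounds to the precision of Real
    }
    return total;
}
```

Adding 0.1 ten million times should give 1000000. With a double accumulator the result is off only in the later decimal places, while a float accumulator typically drifts by far more than one unit, because each rounding error grows as the running total gets larger.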
In my opinion the answers so far don't really get the right point across, so here's my crack at it.
The short answer is C++ developers use doubles over floats:
To avoid premature optimization when they don't understand the performance trade-offs well ("they have higher precision, why not?" is the thought process)
Habit
Culture
To match library function signatures
To match simple-to-write floating point literals (you can write 0.0 instead of 0.0f)
It's true double may be as fast as a float for a single computation because most FPUs have a wider internal representation than either the 32-bit float or 64-bit double represent.
However, that's only a small piece of the picture. Nowadays, optimizing individual operations doesn't mean anything if you're bottlenecked on cache/memory bandwidth.
Here is why some developers seeking to optimize their code should look into using 32-bit floats over 64-bit doubles:
They fit in half the memory. Which is like having all your caches be twice as large. (big win!!!)
If you really care about performance you'll use SSE instructions. SSE instructions that operate on floating point values have different instructions for 32-bit and 64-bit floating point representations. The 32-bit versions can fit 4 values in the 128-bit register operands, but the 64-bit versions can only fit 2 values. In this scenario you can likely double your FLOPS by using floats over double because each instruction operates on twice as much data.
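The two points above come down to simple arithmetic on element sizes. A minimal sketch, assuming the usual 4-byte float and 8-byte double; the 64-byte cache line is also an assumption (common on x86, not universal), and the names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kSseRegisterBytes = 16;  // one 128-bit SSE register
constexpr std::size_t kCacheLineBytes = 64;    // assumed cache line size

// How many values of type T one packed SSE instruction can process.
template <typename T>
constexpr std::size_t lanes_per_sse_register() {
    return kSseRegisterBytes / sizeof(T);
}

// How many values of type T fit in one (assumed) cache line.
template <typename T>
constexpr std::size_t values_per_cache_line() {
    return kCacheLineBytes / sizeof(T);
}
```

Under these assumptions a packed instruction handles 4 floats but only 2 doubles, and a cache line holds 16 floats but only 8 doubles, which is where the potential factor of two comes from.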
In general, there is a real lack of knowledge of how floating point numbers really work in the majority of developers I've encountered. So I'm not really surprised most developers blindly use double.
double is, in some ways, the "natural" floating point type in the C language, which also influences C++. Consider that:
an unadorned, ordinary floating-point constant like 13.9 has type double. To make it float, we have to add an extra suffix f or F.
default argument promotion in C converts float function arguments* to double: this takes place when no declaration exists for an argument, such as when a function is declared as variadic (e.g. printf) or no declaration exists (old style C, not permitted in C++).
The %f conversion specifier of printf takes a double argument, not float. There is no dedicated way to print float-s; a float argument default-promotes to double and so matches %f.
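The three observations above can be verified directly in C++. In this sketch the static_asserts check the literal types at compile time, and the hypothetical helper format_float passes a float through the variadic std::snprintf, where it is promoted to double and matches %f.

```cpp
#include <cstddef>
#include <cstdio>
#include <type_traits>

// An unsuffixed floating-point literal has type double; the f suffix gives float.
static_assert(std::is_same<decltype(13.9), double>::value,
              "unsuffixed literal is double");
static_assert(std::is_same<decltype(13.9f), float>::value,
              "f-suffixed literal is float");

// The float argument is default-promoted to double when passed through "...",
// which is exactly what the %f conversion expects.
inline int format_float(char* buffer, std::size_t size, float value) {
    return std::snprintf(buffer, size, "%.2f", value);
}
```

Calling format_float(buf, sizeof buf, 1.5f) writes "1.50" into buf and returns 4, with no cast needed at the call site.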
On modern hardware, float and double are usually mapped, respectively, to 32 bit and 64 bit IEEE 754 types. The hardware works with the 64 bit values "natively": the floating-point registers are 64 bits wide, and the operations are built around the more precise type (or internally may be even more precise than that). Since double is mapped to that type, it is the "natural" floating-point type.
The precision of float is poor for any serious numerical work, and the reduced range could be a problem also. The IEEE 32 bit type has only 23 bits of mantissa (8 bits are consumed by the exponent field and one bit for the sign). The float type is useful for saving storage in large arrays of floating-point values provided that the loss of precision and range isn't a problem in the given application. For example, 32 bit floating-point values are sometimes used in audio for representing samples.
It is true that using a 64 bit type instead of a 32 bit type doubles the raw memory bandwidth required. However, that only affects programs which work with large arrays of data that are accessed in a pattern showing poor locality. The superior precision of the 64 bit floating-point type trumps issues of optimization. Quality of numerical results is more important than shaving cycles off the running time, in accordance with the principle of "get it right first, then make it fast".
* Note, however, that there is no general automatic promotion from float expressions to double; the only promotion of that kind is integral promotion: char, short and bitfields going to int.
This is mostly hardware dependent, but consider that the most common CPUs (x86/x87 based) have an internal FPU that operates on 80-bit floating point precision (which exceeds both floats and doubles).
If you have to store some intermediate calculations in memory, double is a good compromise between internal precision and external space. Performance is more or less the same on single values. It may be affected by the memory bandwidth on large numeric pipelines (since the data will be twice as long).
Consider that floats have a precision of approximately 6 decimal digits. On an N-cubed complexity problem (like a matrix inversion or transformation), you lose two or three more in mul and div, remaining with just 3 meaningful digits. On a 1920 pixel wide display they are simply not enough (you need at least 5 to match a pixel properly).
This roughly makes double preferable.
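The digit counts quoted above can be read from std::numeric_limits. The sketch below also shows the first integer that a float can no longer hold exactly; it assumes the usual IEEE 754 single and double formats, and the names are hypothetical.

```cpp
#include <limits>

// Number of decimal digits guaranteed to survive a round trip through the type.
constexpr int kFloatDigits = std::numeric_limits<float>::digits10;
constexpr int kDoubleDigits = std::numeric_limits<double>::digits10;

// 2^24 + 1 = 16777217 is the first positive integer that a 24-bit
// significand cannot represent; converting it to float rounds to 16777216.
inline bool float_drops_last_unit() {
    return static_cast<float>(16777217) == 16777216.0f;
}
```

On IEEE hardware kFloatDigits is 6 and kDoubleDigits is 15, and float_drops_last_unit() returns true, whereas a double stores 16777217 exactly.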
It is often relatively easy to determine that double is sufficient, even in cases where it would take significant numerical analysis effort to show that float is sufficient. That saves development cost, and the risk of incorrect results if the analysis is not done correctly.
Also, any performance gain from using float instead of double is usually relatively slight, because most of the popular processors do all floating point arithmetic in one format that is even wider than double.
I think higher precision is the only reason. Actually most people don't think a lot about it, they just use double.
I think if float precision is good enough for particular task there is no reason to use double.