Should I use bit manipulation on floating-point numbers - C++

I'm writing an algorithm to round a floating-point number. The input will be a 64-bit IEEE 754 double very close to X.5, where X is an integer less than 32. The first solution that came to mind is to use a bit mask to clear the least significant bits, since they represent very small fractions of 2^-n (given that the exponent is not large).
But should I do that? Are there other ways to accomplish the same thing? I feel that using bit operations on floating-point values is very controversial. Thanks!
The language I'm using is C++, by the way.
Edit:
Thanks for your comments, I appreciate them! Let's say I have a floating-point number that can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to the nearest value of the form X.5, since that can be stored exactly in an IEEE 754 floating-point number.

If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract, a large constant so as to send the value into the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest half-unit for IEEE 754 double-precision is 2251799813685248.0 (2^51).
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
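The one-liner above can be wrapped in a small function. This is only a sketch, assuming IEEE 754 binary64 doubles, the default round-to-nearest mode, inputs in [0, 32), and a build without value-changing floating-point optimizations such as -ffast-math. The name round_to_half_unit is made up for illustration.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: rounds x to the nearest multiple of 0.5.
// Assumes IEEE 754 binary64, round-to-nearest mode, and 0 <= x < 32.
inline double round_to_half_unit(double x) {
    static_assert(std::numeric_limits<double>::is_iec559,
                  "the trick relies on IEEE 754 doubles");
    assert(x >= 0.0 && x < 32.0);

    // 2^51: in the binade [2^51, 2^52) adjacent doubles are 0.5 apart.
    const double magic = 2251799813685248.0;

    // volatile forces the sum to be stored as a real double, so the
    // addition itself performs the rounding and cannot be folded away.
    volatile double shifted = x + magic;
    return shifted - magic;
}
```

With the question's inputs, round_to_half_unit(1.4999999) gives 1.5 and round_to_half_unit(21.50000012) gives 21.5. Note that this rounds to the nearest multiple of 0.5, so 3.2 becomes 3.0 rather than 3.5.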

You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
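To see how these functions differ, it helps to run one halfway value through all of them. A minimal sketch follows; the struct and function names are hypothetical, and the nearbyint result assumes the default rounding mode has not been changed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: results of the standard rounding functions for one
// input, collected so they can be compared side by side.
struct Roundings {
    double round_, floor_, ceil_, trunc_, nearbyint_;
};

inline Roundings all_roundings(double x) {
    assert(!std::isnan(x));
    return Roundings{
        std::round(x),      // nearest, halfway cases away from zero
        std::floor(x),      // toward negative infinity
        std::ceil(x),       // toward positive infinity
        std::trunc(x),      // toward zero
        std::nearbyint(x)   // current rounding mode (nearest-even by default)
    };
}
```

For 2.5 this yields 3, 2, 3, 2 and 2 respectively; for -2.5 it yields -3, -3, -2, -2 and -2.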
As to trying to achieve rounding by bit manipulation, I would stay away from that: a) it will be much slower than using the functions above (they generally use hardware facilities where possible), b) it is reinventing the wheel with a lot of potential for bugs, and c) the newer C standards don't like you doing bit manipulation on floating-point types: their so-called strict aliasing rules do not allow you to simply cast a double* to a uint64_t*. You would either need to do your bit manipulation by casting to an unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into a uint64_t and back again. That is a lot of hassle for something already available in the form of standardized functions and hardware support.

You want to round x to the nearest value of the form d.5. For a general number you write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
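The two expressions are not interchangeable, which a small sketch makes visible. The function names below are invented for illustration; the code only assumes a finite input.

```cpp
#include <cassert>
#include <cmath>

// Nearest value of the form d.5 for any finite x. Exact integers are ties
// and resolve away from zero, because that is how std::round breaks ties.
inline double nearest_dot_five(double x) {
    assert(std::isfinite(x));
    return std::round(x + 0.5) - 0.5;
}

// Nearest multiple of 0.5. This equals the nearest d.5 only when x is
// already less than 0.25 away from some d.5.
inline double nearest_half_multiple(double x) {
    assert(std::isfinite(x));
    return std::round(2.0 * x) * 0.5;
}
```

For 1.4999999 both return 1.5, but for 1.1 the first returns 1.5 while the second returns 1.0.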

If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):
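#include <cstdint>
#include <cstring>

// Assumes IEEE 754 binary64 and 1.0 <= d < 2^52 (sign bit clear).
void RoundNearestHalf(double &d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // well-defined, unlike casting the pointer
    unsigned const maskshift = static_cast<unsigned>(bits >> 52) - 1023;
    std::uint64_t const setmask = 0x0008000000000000ULL >> maskshift;
    std::uint64_t const clearmask = ~(0x0007FFFFFFFFFFFFULL >> maskshift);
    bits = (bits | setmask) & clearmask;
    std::memcpy(&d, &bits, sizeof d);
}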
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but no more than 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.
Note that it would be worth profiling this against other implementations, perhaps using the standard library, to determine whether or not it's actually faster.

I can't speak for C++ for sure, but in C99 conformance to the IEEE 754 standard for floating point is optional (not required). If the __STDC_IEC_559__ macro is defined, the implementation declares that it follows IEC 559 (which is more or less IEEE 754) for floating point.
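On the C++ side, the corresponding query is std::numeric_limits<T>::is_iec559. A minimal sketch of a guard built on it is below; the constant and function names are hypothetical, and the extra size checks are an assumption about what "binary64" should mean for the bit tricks discussed in this thread.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Compile-time facts about double on this implementation.
constexpr bool kDoubleIsIec559 = std::numeric_limits<double>::is_iec559;
constexpr bool kDoubleIsBinary64 =
    kDoubleIsIec559 &&
    sizeof(double) * CHAR_BIT == 64 &&
    std::numeric_limits<double>::digits == 53;  // 52 stored bits + 1 implicit

// Hypothetical guard to call before relying on bit-level tricks.
inline void require_binary64() {
    assert(kDoubleIsBinary64 && "bit tricks assume IEEE 754 binary64");
}
```

Turning the check into a static_assert makes the build fail early on a platform where the assumption does not hold.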
I think it should be pointed out that there are functions to handle many types of rounding for you.

Related

Is there a function that can convert every double to a unique uint64_t, maintaining precision and ORDER? (Why can't I find one?)

My understanding is that
Doubles in C++ are (at least conceptually) encoded as double-precision IEEE 754-encoded floating point numbers.
IEEE 754 says that such numbers can be represented with 64 bits.
So I should expect there exists a function f that can map every double to a unique uint64_t, and that the order should be maintained -- namely, for all double lhs, rhs, (lhs < rhs) == (f(lhs) < f(rhs)), except when lhs or rhs is NaN.
I haven't been able to find such a function in a library or StackOverflow answer, even though such a function is probably useful to avoid instantiating an extra template for doubles in sort algorithms where double is rare as a sort-key.
I know that simply dividing by EPSILON would not work because the precision actually decreases as the numbers get larger (and improves as numbers get very close to zero); I haven't quite worked out the exact details of that scaling, though.
Surely there exists such a function in principle.
Have I not found it because it cannot be written in standard C++? That it would be too slow? That it's not as useful to people as I think?
If the representations of IEEE-754 64-bit floats are treated as 64-bit two's-complement values, those values have the same order as the corresponding floating-point values for non-negative numbers. Negative numbers are stored as sign and magnitude, so their integer order is reversed and needs one extra adjustment (for example, flipping the non-sign bits when the sign bit is set). Beyond that, the only adjustment involved is the mental one to see the pattern of bits as representing either a floating-point value or an integer value. In the CPU that's easy: you have 64 bits of data stored in memory, and if you apply a floating-point operation to those bits you're doing floating-point operations, and if you apply integer operations to those bits you're doing integer operations.
In C++ the type of the data determines the type of operations you can do. To apply floating-point operations to a 64-bit data object that object has to be a floating-point type. To apply integral operations it has to be an integral type.
To convert bit patterns from floating-point to integral:
#include <cstdint>
#include <cstring>

std::int64_t to_int(double d) {
    std::int64_t res;
    std::memcpy(&res, &d, sizeof(std::int64_t));
    return res;
}
Converting in the other direction is left as an exercise for the reader.
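For reference, here is a sketch of the reverse conversion together with one common way to turn the raw bits into a key whose unsigned order matches the order of the doubles. It assumes IEEE 754 binary64; order_key is a hypothetical name, not a library function. Because the raw pattern is sign-magnitude, negative values have all their bits flipped, and non-negative values get the top bit set so that they sort above every negative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes 64-bit double");

// Inverse of to_int: reinterpret 64 bits as a double.
inline double to_double(std::int64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical sort key: comparing keys as unsigned integers matches < on
// doubles. NaN is not supported; -0.0 sorts just below +0.0 even though
// the two compare equal as doubles.
inline std::uint64_t order_key(double d) {
    assert(d == d);  // rejects NaN
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    const std::uint64_t sign = std::uint64_t{1} << 63;
    // Negative: flip every bit so larger magnitudes sort lower.
    // Non-negative: set the top bit so they sort above all negatives.
    return (u & sign) ? ~u : (u | sign);
}
```

A radix or integer sort can then operate on order_key(x) instead of x, at the cost of treating the two zeros as distinct.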

Z3 - Floating point arithmetic API function Z3_mk_fpa_to_ubv

I am playing around with Z3-4.6.0 C++ for the first time. Sorry for the noob questions.
My question has 2 parts.
If I have a floating point number, and I use Z3_mk_fpa_to_ubv(...) function to create an unsigned bit-vector.
How much precision is lost?
If the precision is not lost, can I use this new unsigned bit-vector as a regular bit-vector and apply all operations defined for it for e.g., Z3_mk_bvadd(....)?
I know I can use Z3_mk_fpa_to_ieee_bv(....) for graceful, and IEEE-754 compliant conversion. Afterwards I can add,sub etc the bit-vectors.
Just being curious.
Thank you very much.
I'm afraid you're misinterpreting the role of these functions. A good reference to keep open while working with SMTLib floats is: http://smtlib.cs.uiowa.edu/papers/BTRW15.pdf
mk_fpa_to_ubv
This function corresponds to the FPToUInt function in the cited paper. It's defined as follows:
(The NaN choice above is misleading: It should be read as "undefined.")
Note that the precision loss can be huge here, depending on what the FP value is and the bit-width of the vector. Imagine converting a double-precision floating point value to an 8-bit word: You're smashing values in the range ±2.23×10^−308 to ±1.80×10^308 to a mere 256 different values. This means a large number of conversions simply will go through massive rounding cancelations.
You should think of this as "casting" in C like languages:
unsigned char c;
double f;
c = (unsigned char) f;
This is the essence of conversion from double-precision to unsigned byte, which will suffer major precision loss. In the other direction, if you convert to a really large bit-vector (say one that has a thousand bits), then your conversion will still be losing precision per the rounding mode, though you'll be able to cover all the integer values precisely in the range. So, it really depends on what BV-type you convert to and the rounding mode you choose.
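The C-style analogy can be made concrete in plain C++ (this is not the Z3 API). The helper name is hypothetical. One difference from the SMT-LIB conversion is worth keeping in mind: the C++ cast always truncates toward zero, and it is undefined for values whose truncated result does not fit, so the sketch rejects those inputs.

```cpp
#include <cassert>

// Hypothetical helper mirroring a double-to-unsigned-8-bit conversion.
// The fractional part is discarded (truncation toward zero). Inputs whose
// truncated value falls outside 0..255 are rejected, because the plain
// cast has undefined behavior for them.
inline unsigned char to_byte(double f) {
    assert(f > -1.0 && f < 256.0);
    return static_cast<unsigned char>(f);
}
```

Here 200.75 becomes 200 and 0.999 becomes 0, which is the kind of precision loss described above.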
mk_fpa_to_ieee_bv
This function has nothing to do with "preserving" the value. So asking "precision loss" here is irrelevant. What it does is that it gives you the underlying bit-vector representation of the floating-point value, per the IEEE-754 spec. The wikipedia article has a good discussion on this representation: https://en.wikipedia.org/wiki/Double-precision_floating-point_format#IEEE_754_double-precision_binary_floating-point_format:_binary64
In particular, if you interpret the output of this function as a two's complement integer value, you'll get a completely irrelevant value that has nothing to do with the value of the floating-point number itself. (Also, this conversion is not unique since NaN has multiple corresponding bit-vector patterns.)
Summary
Long story short, conversions from floats to bit-vectors will suffer from precision loss not only due to losing the "fractional" part due to rounding, but also due to the limited range, unless you pick a very-large bit-vector size. The IEEE-754 representation conversion does not preserve value, and thus doing arithmetic on values converted via this function is more or less meaningless.

Real numbers - how to determine whether float or double is required?

Given a real value, can we check if a float data type is enough to store the number, or a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that against the data range for float and double.
If the number is given as an expression (i.e. 1/7 or sqrt(2)), you will also want ways of detecting:
If the number is rational, whether it has repeating decimals, or cyclic decimals.
Or, What happens when you have an irrational number?
Moreover, there are numbers, such as 0.9, that float / double cannot represent exactly, even in theory (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
Lastly, see additional discussion on float vs. double.
Precision is not very platform-dependent. Although platforms are allowed to be different, float is almost universally IEEE standard single precision and double is double precision.
Single precision assigns 23 bits of "mantissa," or binary digits after the radix point (the binary analogue of the decimal point). Since the bit before the point is always one, this equates to a 24-bit significand. Dividing by log2(10) = 3.3, a float gets you 7.2 decimal digits of precision.
Following the same process for double yields 15.9 digits and long double yields 19.2 (for systems using the Intel 80-bit format).
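The arithmetic can be reproduced from std::numeric_limits, whose digits member already counts the implicit leading bit (24 for float, 53 for double on IEEE 754 systems). A small sketch follows; decimal_digits is a made-up name.

```cpp
#include <cmath>
#include <limits>

// Approximate decimal digits carried by T: significand bits / log2(10).
template <typename T>
double decimal_digits() {
    static_assert(std::numeric_limits<T>::radix == 2, "assumes a binary format");
    return std::numeric_limits<T>::digits / std::log2(10.0);
}
```

On an IEEE 754 platform decimal_digits<float>() is about 7.22 and decimal_digits<double>() about 15.95. The more conservative std::numeric_limits<T>::digits10 (6 and 15) is the number of decimal digits guaranteed to survive a round trip through the type.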
The bits besides the mantissa are used for the sign and the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~10^±38, double goes to ~10^±308.
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Couldn't you simply store it in a float and a double variable and then compare the two? This should implicitly convert the float back to a double; if there is no difference, the float is sufficient.
float f = value;
double d = value;
if ((double)f == d)
{
// float is sufficient
}
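The comparison above can be packaged as a function that also handles the awkward inputs. This is a sketch under the assumption that "sufficient" means "stored without any change in value"; fits_in_float is a hypothetical name. The range check comes first because converting an out-of-range double to float is undefined behavior in C++.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true if d survives a round trip through float unchanged.
inline bool fits_in_float(double d) {
    if (std::isnan(d)) return false;   // NaN never compares equal to itself
    if (std::isinf(d)) return true;    // infinities exist in both types
    if (std::fabs(d) > std::numeric_limits<float>::max())
        return false;                  // outside float's range
    const float f = static_cast<float>(d);
    return static_cast<double>(f) == d;
}
```

For example, 0.5 and 16777216.0 (2^24) fit, while 0.1, 16777217.0 and 1e300 do not. As other answers point out, this says nothing about whether later arithmetic on the value needs the extra precision.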
You cannot represent real numbers with float or double variables, but only a subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating-point representations were actually specified independently of computer architectures.

Is there a solution for Floating point Arithmetic problems in C++?

I am doing some floating-point arithmetic and having precision problems. The resulting value is different on two machines for the same input. I read the post "Why can't I multiply a float?" and other material on the web, and understood that it has to do with the binary representation of floating point and with machine epsilon. However, I wanted to check whether there is a way to solve this problem, or some workaround for floating-point arithmetic in C++. I am converting a float to unsigned short for storage and converting back when necessary. However, when I convert it back, the precision (to 6 decimal places) remains correct on one machine but fails on the other.
//convert FLOAT to short
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);
//Convert SHORT to FLOAT
float floatValue = ((float)shortValue / sConst);
A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits.
Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something less than five digits. There is no solution to this, other than using an integral type that probably takes as much storage as a float.
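The size of the loss is easy to quantify. The sketch below assumes the stored values lie in [0, 1], which the question's scaling by 0xFFFF suggests but does not state. The function names are hypothetical, and it rounds to nearest instead of truncating as the question's code does.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers: store a value from [0, 1] in 16 bits and read it back.
inline std::uint16_t quantize(float v) {
    assert(v >= 0.0f && v <= 1.0f);
    // Round to nearest instead of truncating, which halves the worst-case error.
    return static_cast<std::uint16_t>(std::lround(v * 65535.0f));
}

inline float dequantize(std::uint16_t q) {
    return static_cast<float>(q) / 65535.0f;
}

// Worst-case round-trip error with rounding: half a step, about 7.6e-6.
constexpr double kMaxQuantError = 0.5 / 65535.0;
```

A round trip of 0.123456f comes back as roughly 0.123461, so the sixth decimal place is already lost even in the best case.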
I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one use an older floating-point system, and one that coincidentally gave the results you were expecting on the samples you tried? Was it possibly using a larger short than the other?
If you want to use native floating point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.
The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.
To get an idea about how to compute the relative error between two floating point values robustly, see this answer and the floating point guide linked therein:
Floating point comparison functions for C#
Are you looking for a standard like this:
Programming Languages C++ - Technical Report of Type 2 on Extensions for the programming language C++ to support decimal floating point arithmetic draft
Instead of using 0xFFFF, use 32768 (0x8000), roughly half of it, for the conversion. 0x8000 has the binary representation 1000000000000000, whereas 0xFFFF has the binary representation 1111111111111111. Because 0x8000 is a power of two, the multiplication and division during conversion (to short, or back to float) only change the exponent and introduce no rounding of their own. For a one-way conversion, however, 0xFFFF is preferable, as it gives a more accurate result.
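The power-of-two claim can be checked exhaustively over all 16-bit values. This is a sketch with a hypothetical function name; it tests the direction short to float and back to short, using the same truncating cast as the question.

```cpp
#include <cstdint>

// Hypothetical check: does every 16-bit value survive s / scale * scale?
// For a power-of-two scale only the exponent changes, so both the division
// and the multiplication are exact in IEEE 754 single precision.
inline bool round_trips_exactly(float scale) {
    for (std::uint32_t s = 0; s <= 0xFFFFu; ++s) {
        const float as_float = static_cast<float>(s) / scale;
        const std::uint16_t back = static_cast<std::uint16_t>(as_float * scale);
        if (back != s) return false;
    }
    return true;
}
```

round_trips_exactly(32768.0f) returns true. The trade-off is resolution: with inputs in [0, 1], a scale of 32768 uses only 32769 of the 65536 available codes.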

Why are c/c++ floating point types so oddly named?

C++ offers three floating point types: float, double, and long double. I infrequently use floating-point in my code, but when I do, I'm always caught out by warnings on innocuous lines like
float PiForSquares = 4.0;
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
For integer types, we have short int, int and long int, which is pretty straightforward. Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
EDIT: It seems the relationship between floating types is similar to that of integers. double must be at least as big as float, and long double is at least as big as double. No other guarantees of precision/range are made.
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented. On early 1970s machines, single precision was significantly more efficient and as today, used half as much memory as double precision. Hence it was a reasonable default for floating-point numbers.
long double was added much later when the IEEE standard made allowances for the Intel 80287 floating-point chip, which used 80-bit floating-point numbers instead of the classic 64-bit double precision.
Questioner is incorrect about guarantees; today almost all languages guarantee to implement IEEE 754 binary floating-point numbers at single precision (32 bits) and double precision (64 bits). Some also offer extended precision (80 bits), which shows up in C as long double. The IEEE floating-point standard, spearheaded by William Kahan, was a triumph of good engineering over expediency: on the machines of the day, it looked prohibitively expensive, but on today's machines it is dirt cheap, and the portability and predictability of IEEE floating-point numbers must save gazillions of dollars every year.
You probably knew this, but you can make literal floats/long doubles
float f = 4.0f;
long double ld = 4.0L;
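That the suffix alone decides the literal's type can be verified at compile time. A small sketch, with kPiForSquares as an illustrative name borrowed from the question:

```cpp
#include <type_traits>

// The suffix, not the variable being initialized, decides a literal's type.
static_assert(std::is_same<decltype(4.0), double>::value,
              "an unsuffixed literal is a double");
static_assert(std::is_same<decltype(4.0f), float>::value,
              "f or F gives a float");
static_assert(std::is_same<decltype(4.0L), long double>::value,
              "l or L gives a long double");

// No conversion from double here: the initializer is already a float.
constexpr float kPiForSquares = 4.0f;
```

An uppercase L is easier to read than a lowercase l, which is easily mistaken for the digit 1.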
Double is the default because that's what most people use. Long doubles may be overkill, and floats have very bad precision. Double works for almost every application.
Why the naming? One day all we had was 32-bit floating-point numbers (well, really all we had was fixed-point numbers, but I digress). Anyway, when floating point became a popular feature in modern architectures, C was probably the language du jour, and the name "float" was given. Seemed to make sense.
At the time, double may have been thought of, but not really implemented in the CPUs/FP CPUs of the time, which were 16 or 32 bits. Once the double became used in more architectures, C probably got around to adding it. C needed a name for something twice the size of a float, hence we got a double. Then someone needed even more precision, and we thought he was crazy. We added it anyway. The name quadruple(?) was overkill. Long double was good enough, and nobody made a lot of noise.
Part of the confusion is that good-ole "int" seems to change with the time. It used to be that "int" meant 16 bit integer. Float, however, is bound to the IEEE std as the 32-bit IEEE floating point number. For that reason, C kept float defined as 32 bit and made double and long double to refer to the longer standards.
Literals
The problem is that the literal 4.0 is a double, not a float - Which is irritating.
With constants there is one important difference between integers and floats. While it is relatively easy to decide which integer type to use (you select smallest enough to hold the value, with some added complexity for signed/unsigned), with floats it is not this easy. Many values (including simple ones like 0.1) cannot be exactly represented by float numbers and therefore choice of type affects not only performance, but also result value. It seems C language designers preferred robustness against performance in this case and they therefore decided the default representation should be the more exact one.
History
Why doesn't C just have short float, float and long float? And where on earth did "double" come from?
The terms "single precision" and "double precision" originated in FORTRAN and were already in wide use when C was invented.
First, these names are not specific to C++, but are pretty much common practice for any floating-point datatype that implements IEEE 754.
The name 'double' refers to 'double precision', while float is often said to be 'single precision'.
The two most common floating point formats use 32-bits and 64-bits, the longer one is "double" the size of the first one so it was called a "double".
A double is named such because it is double the "precision" of a float. Really, what this means is that it uses twice the space of a floating point value -- if your float is a 32-bit, then your double will be a 64-bit.
The name double precision is a bit of a misnomer, since a double-precision float has a mantissa of 52 bits, whereas a single-precision float has a mantissa of 23 bits (double that is 46). More on floating point here: Floating Point - Wikipedia, including links at the bottom to articles on single and double precision floats.
The name long double likely just follows the same tradition as long integer vs. short integer for integral types, except in this case they reversed it since 'int' is equivalent to 'long int'.
In fixed-point representation, there are a fixed number of digits after the radix point (a generalization of the decimal point in decimal representations). Contrast this with floating-point representations, where the radix point can move, or float, within the digits of the number being represented. Thus the name "floating-point representation." This was abbreviated to "float."
In K&R C, float referred to floating-point representations with 32-bit binary representations and double referred to floating-point representations with 64-bit binary representations, or double the size and whence the name. However, the original K&R specification required that all floating-point computations be done in double precision.
In the initial IEEE 754 standard (IEEE 754-1985), the gold standard for floating-point representations and arithmetic, definitions were provided for binary representations of single-precision and double-precision floating-point numbers. Double-precision numbers were aptly named, as they were represented by twice as many bits as single-precision numbers.
For detailed information on floating-point representations, read David Goldberg's article, What Every Computer Scientist Should Know About Floating-Point Arithmetic.
They're called single-precision and double-precision because they're related to the natural size (not sure of the term) of the processor. So a 32-bit processor's single-precision would be 32 bits long, and its double-precision would be double that - 64 bits long. They just decided to call the single-precision type "float" in C.
double is short for "double precision".
long double, I guess, comes from not wanting to add another keyword when a floating-point type with even higher precision started to appear on processors.
Okay, historically here is the way it used to be:
The original machines used for C had 16 bit words broken into 2 bytes, and a char was one byte. Addresses were 16 bits, so sizeof(foo*) was 2, sizeof(char) was 1. An int was 16 bits, so sizeof(int) was also 2. Then the VAX (extended addressing) machines came along, and an address was 32 bits. A char was still 1 byte, but sizeof(foo*) was now 4.
There was some confusion, which settled down in the Berkeley compilers so that a short was now 2 bytes and an int was 4 bytes, as those were well-suited to efficient code. A long became 8 bytes, because there was an efficient addressing method for 8-byte blocks --- which were called double words. 4-byte blocks were words and, sure enough, 2-byte blocks were halfwords.
The implementation of floating-point numbers was such that they fit into single words or double words. To remain consistent, the doubleword floating-point number was then called a "double".
It should be noted that double does NOT have to be able to hold values greater in magnitude than those of float; it only has to be at least as precise.
Hence the scanf conversions: %f for a float, and %lf for a "long float", which is the same as double.
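The distinction matters most when reading input, where the conversion specifier has to match the pointer type exactly. A minimal sketch using std::sscanf; parse_both is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: parses the same text once as a float (%f) and once
// as a double (%lf). Returns true when both conversions succeed.
inline bool parse_both(const char* text, float& f, double& d) {
    assert(text != nullptr);
    return std::sscanf(text, "%f", &f) == 1 &&
           std::sscanf(text, "%lf", &d) == 1;
}
```

With printf the distinction disappears, because a float argument is promoted to double before the call, so %f prints both.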