I am playing around with Z3-4.6.0 C++ for the first time. Sorry for the noob questions.
My question has 2 parts.
If I have a floating-point number and I use the Z3_mk_fpa_to_ubv(...) function to create an unsigned bit-vector, how much precision is lost?
If no precision is lost, can I use this new unsigned bit-vector as a regular bit-vector and apply all the operations defined for it, e.g., Z3_mk_bvadd(...)?
I know I can use Z3_mk_fpa_to_ieee_bv(...) for a graceful, IEEE-754-compliant conversion. Afterwards I can add, subtract, etc. on the bit-vectors.
Just being curious.
Thank you very much.
I'm afraid you're misinterpreting the role of these functions. A good reference to keep open while working with SMTLib floats is: http://smtlib.cs.uiowa.edu/papers/BTRW15.pdf
mk_fpa_to_ubv
This function corresponds to the FPToUInt function in the cited paper, which defines its semantics precisely. (The NaN choice in that definition is misleading: it should be read as "undefined.")
Note that the precision loss can be huge here, depending on what the FP value is and the bit-width of the vector. Imagine converting a double-precision floating-point value to an 8-bit word: you're smashing values in the range ±2.23×10^−308 to ±1.80×10^308 down to a mere 256 different values. This means a large number of distinct inputs will simply collapse to the same result through massive rounding.
You should think of this as "casting" in C-like languages:
unsigned char c;
double f;
c = (unsigned char) f;
This is the essence of conversion from double-precision to unsigned byte, which will suffer major precision loss. In the other direction, if you convert to a really large bit-vector (say one that has a thousand bits), then your conversion will still be losing precision per the rounding mode, though you'll be able to cover all the integer values precisely in the range. So, it really depends on what BV-type you convert to and the rounding mode you choose.
mk_fpa_to_ieee_bv
This function has nothing to do with "preserving" the value, so asking about "precision loss" here is irrelevant. What it does is give you the underlying bit-vector representation of the floating-point value, per the IEEE-754 spec. The Wikipedia article has a good discussion of this representation: https://en.wikipedia.org/wiki/Double-precision_floating-point_format#IEEE_754_double-precision_binary_floating-point_format:_binary64
In particular, if you interpret the output of this function as a two's complement integer value, you'll get a completely irrelevant value that has nothing to do with the value of the floating-point number itself. (Also, this conversion is not unique since NaN has multiple corresponding bit-vector patterns.)
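To make the contrast concrete, here is a minimal sketch using the C API (which the C++ layer wraps); it is not part of the original question or answer, and the value 3.75, the RTZ rounding mode, and the 32-bit target width are illustrative assumptions:
#include <stdio.h>
#include <z3.h>

int main(void) {
    Z3_config cfg = Z3_mk_config();
    Z3_context ctx = Z3_mk_context(cfg);
    Z3_del_config(cfg);

    // A double-precision FP constant; 3.75 is an arbitrary example value.
    Z3_sort fp64 = Z3_mk_fpa_sort_double(ctx);
    Z3_ast  x    = Z3_mk_fpa_numeral_double(ctx, 3.75, fp64);

    // Value conversion: needs a rounding mode and a target width, and loses
    // precision accordingly. The result is an ordinary bit-vector term, so
    // Z3_mk_bvadd and friends apply to it.
    Z3_ast rm     = Z3_mk_fpa_rtz(ctx);                  // round toward zero
    Z3_ast as_ubv = Z3_mk_fpa_to_ubv(ctx, rm, x, 32);

    // Representation conversion: the raw IEEE-754 bit pattern. Bit-vector
    // arithmetic on this term does not correspond to FP arithmetic on x.
    Z3_ast as_ieee = Z3_mk_fpa_to_ieee_bv(ctx, x);

    printf("to_ubv : %s\n", Z3_ast_to_string(ctx, Z3_simplify(ctx, as_ubv)));
    printf("ieee_bv: %s\n", Z3_ast_to_string(ctx, Z3_simplify(ctx, as_ieee)));

    Z3_del_context(ctx);
    return 0;
}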
Summary
Long story short, conversions from floats to bit-vectors will suffer precision loss not only because the "fractional" part is lost to rounding, but also because of the limited range, unless you pick a very large bit-vector size. The IEEE-754 representation conversion does not preserve value, and thus doing arithmetic on values converted via this function is more or less meaningless.
Related
My understanding is that
Doubles in C++ are (at least conceptually) encoded as double-precision IEEE 754-encoded floating point numbers.
IEEE 754 says that such numbers can be represented with 64 bits.
So I should expect there exists a function f that can map every double to a unique uint64_t, and that the order should be maintained -- namely, for all double lhs, rhs, lhs < rhs == f(lhs) < f(rhs), except when (lhs or rhs is NaN).
I haven't been able to find such a function in a library or StackOverflow answer, even though such a function is probably useful to avoid instantiating an extra template for doubles in sort algorithms where double is rare as a sort-key.
I know that simply dividing by EPSILON would not work because the precision actually decreases as the numbers get larger (and improves as numbers get very close to zero); I haven't quite worked out the exact details of that scaling, though.
Surely there exists such a function in principle.
Have I not found it because it cannot be written in standard C++? That it would be too slow? That it's not as useful to people as I think?
If the representations of non-negative IEEE-754 64-bit floats are treated as 64-bit two's-complement values, those values have the same order as the corresponding floating-point values; negative floats are stored in sign-magnitude form, so their bit patterns order in reverse and need one small extra adjustment (see the sketch after the code below). Beyond that, the only adjustment involved is the mental one to see the pattern of bits as representing either a floating-point value or an integer value. In the CPU that's easy: you have 64 bits of data stored in memory; if you apply a floating-point operation to those bits you're doing floating-point operations, and if you apply integer operations to those bits you're doing integer operations.
In C++ the type of the data determines the type of operations you can do. To apply floating-point operations to a 64-bit data object that object has to be a floating-point type. To apply integral operations it has to be an integral type.
To convert bit patterns from floating-point to integral:
std::int64_t to_int(double d) {
    std::int64_t res;
    std::memcpy(&res, &d, sizeof(std::int64_t));   // needs <cstdint> and <cstring>
    return res;
}
Converting in the other direction is left as an exercise for the reader.
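For completeness, the usual way to get a key that also orders correctly for negative values is the "flip" adjustment sketched below. This is a common technique added here for illustration, not part of the answer above; NaNs are still excluded, and -0.0 and +0.0 get distinct keys.
#include <cstdint>
#include <cstring>

std::uint64_t order_preserving_key(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    if (bits & 0x8000000000000000ULL)      // negative: invert all bits so that
        return ~bits;                      // more-negative values get smaller keys
    return bits | 0x8000000000000000ULL;   // non-negative: set the top bit so they
                                           // sort above every negative value
}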
I'm writing an algorithm to round a floating-point number. The input will be a 64-bit IEEE 754 double very close to X.5, where X is an integer less than 32. The first solution that came to my mind is to use a bit mask to mask off the least significant bits, since they represent very small fractions 2^-n (given that the exponent is not large).
But should I do that? Is there any other way to accomplish the same thing? I feel that using bit operations on floating point is controversial. Thanks!
The language I'm using is C++, by the way.
Edit:
Thanks, guys, for your comments. I appreciate it! Let's say I have a floating-point number, which can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to its nearest X.5 form, since that can be stored exactly in an IEEE 754 floating-point number.
If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract, a large constant so as to send the value into the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest half-unit for IEEE 754 double precision is 2251799813685248.0 (that is, 2^51).
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
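A quick sketch of the trick in action (it assumes the default round-to-nearest mode and no fast-math style optimizations that would fold the add/subtract away; the test values are made up):
#include <cstdio>

int main() {
    volatile double big = 2251799813685248.0;   // 2^51: at this magnitude the ULP is 0.5
    double inputs[] = { 1.4999999, 21.50000012, 31.49999988 };
    for (double x : inputs) {
        double rounded = x + big - big;         // rounds to the nearest multiple of 0.5
        std::printf("%.8f -> %.1f\n", x, rounded);
    }
    return 0;
}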
You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
As for trying to achieve rounding by bit manipulation, I would stay away from that: a) it will be much slower than using the functions above (they generally use hardware facilities where possible), b) it is reinventing the wheel with a lot of potential for bugs, and c) the newer C standards don't like you doing bit manipulation on floating-point types: the so-called strict aliasing rules disallow simply casting a double* to a uint64_t*. You would either need to do your bit manipulation by casting to an unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into a uint64_t and back again. A lot of hassle for something already available in the form of standardized functions and hardware support.
You want to round x to the nearest value of the form d.5. For a general number you can write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):
#include <cstdint>
#include <cstring>

void RoundNearestHalf(double &d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);          // avoids the strict-aliasing pitfall
    unsigned const maskshift = unsigned(bits >> 52) - 1023;                 // unbiased exponent (positive input assumed)
    std::uint64_t const setmask = 0x0008000000000000ULL >> maskshift;       // sets the 2^-1 (one-half) mantissa bit
    std::uint64_t const clearmask = ~(0x0007FFFFFFFFFFFFULL >> maskshift);  // clears every mantissa bit below 2^-1
    bits = (bits | setmask) & clearmask;
    std::memcpy(&d, &bits, sizeof d);
}
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but only up to 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and a clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d snapped to X.5, which for inputs already close to X.5 is the nearest half.
Note that it would be worth profiling this against other implementations, perhaps using the standard library, to determine whether or not it's actually faster.
I can't speak about C++ for sure, but in C99 the use of the IEEE 754 standard for floating point is optional, not required. In C99, if the __STDC_IEC_559__ macro is defined, the implementation declares that IEC 559 (which is more or less IEEE 754) is used for floating point.
I think it should be pointed out that there are functions to handle many types of rounding for you.
Firstly, the problem I'm trying to solve is coming up with a better representation for values that will always remain uniformly distributed in the range:
0.0 <= x < 1.0
The motivation for this is to attempt to reduce the number of bytes used to store this data (the application is heavily memory- and I/O-bandwidth-bound). Currently a 32-bit floating-point representation is used; 16-bit floating point is proving insufficiently accurate.
My initial thoughts are to try and store the data in a 16-bit integer and to simply use the scheme:
x/(2^16 - 1) [x is an unsigned short]
To keep the algorithms largely the same and to retain use of the same floating-point hardware operations (at least at first), I would ideally like to keep converting this fractional representation into floating-point representation, performing the operation(s), then converting back into fractional representation for storage.
Clearly, there will be a loss of precision going back and forth between these two quite different, imprecise representations, but for our application, I suspect this might be an acceptable tradeoff.
I've done some research looking at what is currently out there that might give us a good starting point. The seminal "What Every Computer Scientist Should Know About Floating-Point Arithmetic" article (http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) led me to look at a few others, "Beyond Floating Point" (home.ccil.org/~cowan/temp/p319-clenshaw.pdf) being one such example.
Can anyone point me to other examples of representations that people have used elsewhere that might satisfy these requirements?
I'm concerned that any potential gain in exactness of representation (we're wasting much of the floating-point format currently by using this specific range) will be completely outweighed by the requirement to round twice, going from fractional representation to floating-point and back again. In that case, it may be necessary to do arithmetic on the fractional representation directly to get any benefit out of this approach. Any advice on this point would be helpful.
Don't use 2^16-1. Use 2^16. Yes, you will have very slightly less precision and waste your 0xFFFF, but you will guarantee that there is no loss in precision when converting to floating point. (In contrast, when converting away from floating point, you will lose 8 bits of mantissa precision.)
Round-trip conversions between precisions can cause problems with certain operations, in particular progressively summing numbers. If at all possible, treat your fixed-point values as "dirty", and don't use them for further floating-point computations; prefer recalculating from inputs to using intermediate results which are in fixed-point form.
Alternatively, use 24 bits. With this representation, you will lose no precision in either direction as long as your values don't underflow (that is, as long as they're above 2^-24).
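For illustration, a minimal sketch of the 16-bit scheme with a 2^16 divisor discussed above; the function names are placeholders, not from the question:
#include <cstdint>

float fixed_to_float(std::uint16_t q) {
    return q * (1.0f / 65536.0f);   // exact: q and 2^-16 are both exactly representable
}

std::uint16_t float_to_fixed(float x) {
    // Assumes 0.0f <= x < 1.0f; the truncation here is where the 8 excess
    // mantissa bits are dropped.
    return static_cast<std::uint16_t>(x * 65536.0f);
}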
Wouldn't 1/x be badly distributed in your range? 1/2, 1/3, 1/4, ... Do you not want to represent numbers above 1/2?
This kind of thing is done in NetCDF quite a lot to encode data to save space.
const double scale = 1.0/65536;
unsigned short x;
Any number stored in x really represents x*scale.
See example in NetCDF for a more general approach using scale and offset: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/tutorial/NetcdfDataset.html
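A sketch of that convention in plain code (the names scale_factor and add_offset follow the NetCDF attribute conventions; the actual library API is not shown):
const double scale_factor = 1.0 / 65536;
const double add_offset   = 0.0;

double unpack(unsigned short packed) {
    return packed * scale_factor + add_offset;
}

unsigned short pack(double value) {
    return (unsigned short)((value - add_offset) / scale_factor);
}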
Have a look at the "Packed Data Values" section of this page:
https://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#Packed%20Data%20Values
I am doing some floating-point arithmetic and having precision problems. The resulting value is different on two machines for the same input. I read the post "Why can't I multiply a float?" and also read other material on the web, and understood that it has to do with the binary representation of floating point and with machine epsilon. However, I wanted to check whether there is a way to solve this problem, or some workaround for floating-point arithmetic in C++. I am converting a float to an unsigned short for storage and converting it back when necessary. However, when I convert it back to an unsigned short, the precision (to 6 decimal points) remains correct on one machine but fails on the other.
//convert FLOAT to short
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);
//Convert SHORT to FLOAT
float floatValue = ((float)shortValue / sConst);
A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits.
Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something less than five digits. There is no solution to this, other than using an integral type that probably takes as much storage as a float.
I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one use an older floating-point system, and one that coincidentally gave the results you were expecting on the samples you tried? Was it possibly using a larger short than the other?
If you want to use native floating point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.
The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.
To get an idea about how to compute the relative error between two floating point values robustly, see this answer and the floating point guide linked therein:
Floating point comparison functions for C#
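As a concrete illustration, a tolerance-based comparison might look like the sketch below (written in the spirit of the linked guide, not taken from it; the tolerance values are placeholders to be chosen per application):
#include <algorithm>
#include <cmath>

bool nearly_equal(double a, double b,
                  double rel_tol = 1e-9, double abs_tol = 1e-12) {
    double diff = std::fabs(a - b);
    if (diff <= abs_tol)          // absolute test handles values near zero
        return true;
    return diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}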
Are you looking for a standard like this:
Programming Languages C++ - Technical Report Type 2 on Extensions for the programming language C++ to support decimal floating-point arithmetic (draft)
Instead of using 0xFFFF, use half of it, i.e. 32768 (0x8000), for the conversion. 32768 has a binary representation of 1000000000000000, whereas 0xFFFF has a binary representation of 1111111111111111. Because 0x8000 is a power of two, the multiplication and division performed during the conversion (to short, or while converting back to float) introduce no rounding of their own. For a one-way conversion, however, 0xFFFF is preferable, as it uses the full range and leads to a more accurate result.
Do the underlying bits just get "reinterpreted" as a floating point value? Or is there a run-time conversion to produce the nearest floating point value?
Is endianness a factor on any platforms (i.e., endianness of floats differs from ints)?
How do different width types behave (e.g., int to float vs. int to double)?
What does the language standard guarantee about the safety of such casts/conversions? By cast, I mean a static_cast or C-style cast.
What about the inverse float to int conversion (or double to int)? If a float holds a small magnitude value (e.g., 2), does the bit pattern have the same meaning when interpreted as an int?
Do the underlying bits just get "reinterpreted" as a floating point value?
No, the value is converted according to the rules in the standard.
is there a run-time conversion to produce the nearest floating point value?
Yes there's a run-time conversion.
For floating point -> integer, the value is truncated, provided that the truncated value is in range of the integer type. If it is not, behaviour is undefined. Note that it is the truncated value, not the source value, that matters (see the standard text quoted below). So the boundary case for a target type of char is CHAR_MAX + 1.0: converting CHAR_MAX + 0.5 truncates to CHAR_MAX and is fine, while converting anything at or above CHAR_MAX + 1.0 is undefined.
For integer -> floating point, the result is the exact same value if possible, or else one of the two floating-point values on either side of the integer value. Not necessarily the nearer of the two.
Is endianness a factor on any platforms (i.e., endianness of floats differs from ints)?
No, never. The conversions are defined in terms of values, not storage representations.
How do different width types behave (e.g., int to float vs. int to double)?
All that matters is the ranges and precisions of the types. Assuming 32 bit ints and IEEE 32 bit floats, it's possible for an int->float conversion to be imprecise. Assuming also 64 bit IEEE doubles, it is not possible for an int->double conversion to be imprecise, because all int values can be exactly represented as a double.
What does the language standard guarantee about the safety of such casts/conversions? By cast, I mean a static_cast or C-style cast.
As indicated above, it's safe except in the case where a floating point value is converted to an integer type, and the value is outside the range of the destination type.
If a float holds a small magnitude value (e.g., 2), does the bit pattern have the same meaning when interpreted as an int?
No, it does not. The IEEE 32 bit representation of 2 is 0x40000000.
For reference, this is what ISO/IEC 14882:2003 says:
4.9 Floating-integral conversions
An rvalue of a floating point type can be converted to an rvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [Note: If the destination type is bool, see 4.12.]
An rvalue of an integer type or of an enumeration type can be converted to an rvalue of a floating point type. The result is exact if possible. Otherwise, it is an implementation-defined choice of either the next lower or higher representable value. [Note: loss of precision occurs if the integral value cannot be represented exactly as a value of the floating type.] If the source type is bool, the value false is converted to zero and the value true is converted to one.
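A short sketch contrasting the two notions discussed above (value conversion versus looking at the object representation), assuming IEEE-754 single precision for the bit-pattern part:
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 2.0f;

    int by_value = static_cast<int>(f);   // value conversion per 4.9: yields 2

    std::uint32_t bits;                   // object representation: 0x40000000
    std::memcpy(&bits, &f, sizeof bits);  // for IEEE-754 single precision

    std::printf("by value: %d, bit pattern: 0x%08X\n",
                by_value, static_cast<unsigned>(bits));
    return 0;
}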
Reference: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Other highly valuable references on the subject of fast float to int conversions:
Know your FPU
Let's Go To The (Floating) Point
Know your FPU: Fixing Floating Fast
Have a good read!
There are normally run-time conversions, as the bit representations are not generally compatible (with the exception that binary 0 is normally both 0 and 0.0). The C and C++ standards deal only with value, not representation, and specify generally reasonable behavior. Remember that a large int value will not normally be exactly representable in a float, and a large float value cannot be represented by an int.
Therefore:
All conversions are by value, not bit patterns. Don't worry about the bit patterns.
Don't worry about endianness, either, since that's a matter of bitwise representation.
Converting int to float can lose precision if the integer value is large in absolute value; it is less likely to with double, since double is more precise, and can represent many more exact numbers. (The details depend on what representations the system is actually using.)
The language definitions say nothing about bit patterns.
Converting from float to int is also a matter of values, not bit patterns. An exact floating-point 2.0 will convert to an integral 2 because that's how the implementation is set up, not because of bit patterns.
When you convert an integer to a float, you are not liable to lose any precision unless you are dealing with extremely large integers.
When you convert a float to an int you are essentially performing a truncation toward zero, so it just drops the bits after the decimal point.
For more information on floating point read: http://www.lahey.com/float.htm
The IEEE single-precision format has 24 bits of mantissa, 8 bits of exponent, and a sign bit. The internal floating-point registers in Intel microprocessors such as the Pentium have 64 bits of mantissa, 15 bits of exponent and a sign bit. This allows intermediate calculations to be performed with much less loss of precision than many other implementations. The down side of this is that, depending upon how intermediate values are kept in registers, calculations that look the same can give different results.
So if your integer value needs more than 24 significant bits (23 stored bits plus the hidden leading bit), then you are likely to lose some precision in the conversion.
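For example, a small sketch showing the first integer that single precision cannot hold exactly (2^24 + 1 needs 25 significant bits):
#include <cstdio>

int main() {
    int exact   = 16777216;   // 2^24: exactly representable as a float
    int inexact = 16777217;   // 2^24 + 1: rounds to 16777216.0f on conversion
    std::printf("%.1f\n", (double)(float)exact);     // 16777216.0
    std::printf("%.1f\n", (double)(float)inexact);   // 16777216.0 as well
    return 0;
}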
Reinterpreted? The term "reinterpretation" usually refers to raw memory reinterpretation. It is, of course, impossible to meaningfully reinterpret an integer value as a floating-point value (and vice versa) since their physical representations are generally completely different.
When you cast between the types, a run-time conversion is performed (as opposed to a reinterpretation). The conversion is normally not just conceptual; it requires actual run-time effort, because of the difference in physical representation. There is no language-defined relationship between the bit patterns of source and target values. Endianness plays no role in it either.
When you convert an integer value to a floating-point type, the original value is converted exactly if it can be represented exactly by the target type. Otherwise, the value will be changed by the conversion process.
When you convert a floating-point value to integer type, the fractional part is simply discarded (i.e. not the nearest value is taken, but the number is rounded towards zero). If the result does not fit into the target integer type, the behavior is undefined.
Note also, that floating-point to integer conversions (and the reverse) are standard conversions and formally require no explicit cast whatsoever. People might sometimes use an explicit cast to suppress compiler warnings.
If you cast the value itself, it will get converted (so in a float -> int conversion 3.14 becomes 3).
But if you cast the pointer, then you will actually 'reinterpret' the underlying bits. So if you do something like this:
double d = 3.14;
// Formally undefined behaviour (strict aliasing), and it only reads the first
// sizeof(int) bytes of the 8-byte double, but it illustrates the point:
int x = *reinterpret_cast<int *>(&d);
x will have a 'random' value that is based on the representation of floating point.
Converting FP to integral type is nontrivial and not even completely defined.
Typically your FPU implements a hardware instruction to convert from IEEE format to int. That instruction might take parameters (implemented in hardware) controlling rounding; your ABI's default arithmetic rounding is probably round-to-nearest-even, but the language requires the float-to-int conversion itself to truncate (see the standard text quoted above). If you're on x86 with SSE it is probably not too slow, since there is a dedicated conversion instruction.
As with anything FP, there are corner cases. It would be nice if infinity were mapped to the type's _MAX value, but that is alas not typically the case: the result of int x = INFINITY; is undefined.
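If you need to guard against those corner cases yourself, a checked conversion along the following lines is one option; this is a sketch (requiring C++17 for std::optional), not something the answer above prescribes:
#include <climits>
#include <cmath>
#include <optional>

std::optional<int> to_int_checked(double d) {
    if (std::isnan(d))
        return std::nullopt;
    // The truncated value must be representable in int, otherwise the
    // conversion is undefined; this range test also rejects the infinities.
    if (d <= static_cast<double>(INT_MIN) - 1.0 ||
        d >= static_cast<double>(INT_MAX) + 1.0)
        return std::nullopt;
    return static_cast<int>(d);   // truncates toward zero
}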